ITS836 University of The Cumberlands Data Mining Algorithms Assignment


Computer Science

Description

R is a popular programming language used by a growing number of data analysts in both industry and academia. In this assignment, students will learn how to apply data mining algorithms in the R programming environment.

Part I

Explain each of the following data mining techniques in terms of how the algorithm works, along with its strengths and weaknesses:

    1. Classification
    2. Prediction
    3. Clustering
    4. Association
    5. Correlation analysis

Give an example of each data mining functionality, using a real-life database or data set.

Part II

Using the ruspini data set provided with the cluster package in R, perform a k-means analysis. Document the findings and justify the choice of k. Hint: load the cluster package with library(cluster), then use data(ruspini) to load the data set into the R workspace (note the lowercase name). One possible approach is sketched below.
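One possible way to approach Part II, shown only as a minimal illustrative sketch rather than the required solution, is to run k-means for a range of k values and use the total within-cluster sum of squares (the "elbow" method) to justify the final choice of k. The random seed and variable names below are my own choices.

library(cluster)   # provides the ruspini data set (two numeric columns, x and y)
data(ruspini)

# Total within-cluster sum of squares for k = 1..10 ("elbow" method)
set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(ruspini, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# The curve typically flattens sharply at k = 4 for this data set,
# which is the usual justification for choosing k = 4.
km <- kmeans(ruspini, centers = 4, nstart = 25)
plot(ruspini, col = km$cluster, pch = 19,
     main = "k-means clustering of the ruspini data (k = 4)")
points(km$centers, pch = 8, cex = 2)   # mark the cluster centers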

While APA style is not required for the body of this assignment, solid academic writing is expected, and documentation of sources should be presented using APA formatting guidelines, which can be found in the APA Style Guide, located in the Student Success Center.

Unformatted Attachment Preview

Review of Basic Data Analytic Methods Using R

Module 3, Lesson: Analyzing and Exploring the Data. During this lesson the following topics are covered:
• Why visualize?
• Examining a single variable
• Examining pairs of variables
• Indications of dirty data
• Data exploration vs. presentation

Why Visualize? Summary statistics give us some sense of the data: mean vs. median, standard deviation, quartiles, min/max, and correlations between variables. For example:

summary(data)
       x                   y
 Min.   :-3.05439   Min.   :-3.50179
 1st Qu.:-0.61055   1st Qu.:-0.75968
 Median : 0.04666   Median : 0.07340
 Mean   :-0.01105   Mean   : 0.09383
 3rd Qu.: 0.56067   3rd Qu.: 0.88114
 Max.   : 2.60614   Max.   : 4.28693

Visualization gives us a more holistic sense. In the previous lesson, we saw how to examine data in R, including how to generate descriptive statistics: averages, data ranges, and quartiles (which are included in the summary() report). We also saw how to compute correlations between pairs of variables of interest. These statistics give us a sense of the data: an idea of its magnitude and range, and of some obvious dirty data (missing values, or values with an obviously wrong magnitude or sign). Visualization, however, gives us a succinct, more holistic view of the data that we may not be able to get from the numbers and summaries alone. It is an important facet of the initial data exploration. Visualization helps you assess data cleanliness, and also gives you an idea of potentially important relationships in the data before going on to build your models.

Anscombe's Quartet: four data sets, characterized by the following properties. Are they the same, or are they different?

  Property                                   Value
  Mean of x in each case                     9
  Exact variance of x in each case           11
  Mean of y in each case                     7.50 (to 2 d.p.)
  Variance of y in each case                 4.13 (to 2 d.p.)
  Correlation between x and y in each case   0.816 (to 3 d.p.)
  Linear regression line in each case        y = 3.00 + 0.500x

  Set i           Set ii          Set iii         Set iv
  x      y        x      y        x      y        x      y
  10.00  8.04     10.00  9.14     10.00  7.46      8.00  6.58
   8.00  6.95      8.00  8.14      8.00  6.77      8.00  5.76
  13.00  7.58     13.00  8.74     13.00 12.74      8.00  7.71
   9.00  8.81      9.00  8.77      9.00  7.11      8.00  8.84
  11.00  8.33     11.00  9.26     11.00  7.81      8.00  8.47
  14.00  9.96     14.00  8.10     14.00  8.84      8.00  7.04
   6.00  7.24      6.00  6.13      6.00  6.08      8.00  5.25
   4.00  4.26      4.00  3.10      4.00  5.39     19.00 12.50
  12.00 10.84     12.00  9.13     12.00  8.15      8.00  5.56
   7.00  4.82      7.00  7.26      7.00  6.42      8.00  7.91
   5.00  5.68      5.00  4.74      5.00  5.73      8.00  6.89

Anscombe's Quartet is a synthesized example by the statistician F. J. Anscombe. Look at the properties and values of these four data sets. Based on standard statistical measures of mean, variance, and correlation (our descriptive statistics), these data sets are identical. Or are they?
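To confirm that the four sets really do share these summary statistics, they can be computed directly from the anscombe data frame that ships with base R (columns x1-x4 and y1-y4). This is only an illustrative sketch added for reference; it is not part of the original slide material.

# Summary statistics for each of Anscombe's four data sets
data(anscombe)
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- coef(lm(y ~ x))
  c(mean_x = mean(x), var_x = var(x),
    mean_y = round(mean(y), 2), var_y = round(var(y), 2),
    cor_xy = round(cor(x, y), 3),
    intercept = round(unname(fit[1]), 2), slope = round(unname(fit[2]), 3))
})

All four columns of the result are essentially identical, even though scatterplots of the four sets look completely different.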
Module 3: Basic Data Analytic Methods Using R 4 Moral: Visualize Before Analyzing! Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 5 However, if we visualize each data set using a scatterplot and a regression line superimposed over each plot, the datasets appear quite different. Dataset 1 is the best candidate for a regression line, although there is a lot of variation. Dataset 2 is definitely non-linear. Dataset 3 is a close match, but over predicts at higher value of x and has an extreme outlier. And Dataset 4 isn’t captured at all by a simple regression line. Assuming we have datasets represented by data frames s1, s2, s3, and s4, we can generate these plots in R by using the following code: R-Code plot(s1) plot(lm(s1$y ~ s1$x)) … (Yes, a loop is possible but requires more advanced data manipulation: for information, consult the R “eval” function if interested). We also must take care to overwrite the preceding graph in each instance. Code to produce these graphs is included in the script AnscombePlot.R. Note that the dataset for these plots are included in the standard R distribution. Type data() for a list of dataset included in the base distribution. data(name) will make that dataset available in your workspace. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 5 Visualizing Your Data • Examining the distribution of a single variable • Analyzing the relationship between two variables • Establishing multiple pair wise relationships between variables • Analyzing a single variable over time • Data exploration versus data presentation Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 6 In a previous lesson, we’ve looked at how you can characterize your data by using traditional statistics. But we also showed how datasets could appear identical when using descriptive statistics, and yet look completely different when visualizing the data via a plot. Using visual representations of data is the hallmark of exploratory data analysis: letting the data speak to us rather than necessarily imposing an interpretation on the data a priori. In the rest of this lesson, we are going to examine ways of displaying data so that we can better understand the underlying distributions of a single variable or the relationships between two or more variables. Although data visualization is a powerful tool, the results we obtain may not be suitable when it comes time for us to “tell a story” about the data. Our last slide will discuss what kind of presentations are most effective. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 6 Examining the Distribution of a Single Variable Graphing a single variable • plot(sort(.)) – for low volume data • hist(.) – a histogram • plot(density(.)) – densityplot  A "continuous histogram“ • Example  Frequency table of household income Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 7 R has multiple functions available to examine a single variable. Some of them are listed above. See the R documentation for each of these. Some other useful functions are barplot() and dotplot(). The example included is a frequency table of household income. We can certainly see a concentration of households in the leftmost portion of the graph. Copyright © 2014 EMC Corporation. All rights reserved. 
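As a quick, hedged illustration of the single-variable plots listed above: the household income data itself is not included here, so this sketch uses simulated, lognormally distributed incomes, and the variable names and parameters are illustrative only.

set.seed(7)
income <- rlnorm(1000, meanlog = log(40000), sdlog = 0.8)  # simulated incomes

plot(sort(income))             # sorted values: useful for low-volume data
hist(income, breaks = 50)      # histogram: heavily right-skewed
plot(density(log10(income)))   # density of log10(income) shows the shape more clearly
rug(log10(income))             # 1-D "rug" along the axis emphasizes where the data sits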
Module 3: Basic Data Analytic Methods Using R 7 Examining the Distribution of a Single Variable Graphing a single variable • plot(sort(.)) – for low volume data • hist(.) – a histogram • plot(density(.)) – densityplot  A "continuous histogram“ • Example  Frequency table of household income  rug() plot emphasizes distribution Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 8 R has multiple functions available to examine a single variable. Some of them are listed above. See the R documentation for each of these. Some other useful functions are barplot(), dotplot() and stem(). The example included is a frequency table of log10 of household income. We can certainly see a concentration of households in the rightmost portion of the graph. The rug() function creates a 1-dimensional density plot as well: notice how it emphasizes the area under the curve. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 8 What are we looking for? A sense of the data range • If it's very wide, or very skewed, try computing the log Outliers, anomalies • Possibly evidence of dirty data Shape of the Distribution • Unimodal? Bimodal? • Skewed to left or right? • Approximately normal? Approximately lognormal? Example - Distribution of purchase size ($) • Range from 0 to > $10K, right skewed • Typical of monetary data • Plotting log of data gives better sense of distribution • Two purchasing distributions   ~ $55 ~ $2900 Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 9 When viewing the variables during the data exploration phase, you are looking for a sense of the data range, and whether the values are strongly concentrated in a certain range. If the data is very skewed, viewing the log of the data (if it's all positive) can help you detect structure that you might otherwise miss in a regularly scaled graph. This is your chance to look for obvious signs of dirty data (outliers or unlikely looking values). See if the data is unimodel or multimodal: that gives you an idea of how many distinct populations (with distinct behavior patterns) might be mixed into your overall population. Knowing if the data is approximately normal (or can be transformed to approximately normal – for example, by taking the log) is important, since many modeling techniquest assume that the data is approximately normal in distribution. For our example, we can look at the densityplot of purchase sizes (in $ US) of customers at our online retail site. The range here is extremely wide – from around $1 US to over $10,000 US. Extreme ranges like this are typical of monetary data, like income, customer value, tax liabilities, bank account sizes, etc. (In fact, all of this kind of data is often assumed to be distributed lognormally – that is, its log is a normal distribution). The data range makes it really hard for us to see much detail, so we take the log of it, and then density plot it. Now we can see that there are (at least) two distinct population in our customer base: One population that makes small to medium size purchases (median purchase size about $55 US) and one that makes larger purchases (median purchase size about $2900 US). Can you see those two populations in the top graph? The plots shown were made using the lattice package. 
If the data is in the vector purchase_size, then the lattice plot is: library(lattice) densityplot(purchase_size) # top plot # bottom plot as log10 is actually # easier to read, but this plot is in natural log densityplot(log(purchase_size) (The commands were actually more complicated than that, but these commands give the basic equivalent) Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 9 Evidence of Dirty Data Missing values? Mis-entered data? Inherited accounts? Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 10 Here's an example of how dirty data might manifest itself in your visualizations. We are looking at the age distribution of account holders at our bank. Mean age is about 40, approximately normally distributed with a standard deviation of about 15 years or so, which makes sense. We see a few accounts with accountholder age < 10; unusual, but plausible. These could be custodial accounts, or college savings accounts set up by the parents of young children. We probably want to keep them for our analysis. There is a huge spike of customers who are zero years old – evidence of missing data. We may need to eliminate these accounts from analysis (depending on how important we think age will be), or track down how to get the appropriate age data. The customers with negative age are probably either missing data, or mis-entered data. The customers who are older than 100 are possibly also mis-entered data, or these are accounts that have been passed down to the heirs of the original accountholders (and not updated).We may want to exclude them as well, or at least threshold the age that we will consider in the analysis. If this data is in a vector called age, then the plot is made by: hist(age, breaks=100, main="Accountholder age distribution", xlab="age", col="gray") Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 10 "Saturated" Data Do we really have no mortgages older than 10 years? Or does the year 2004 in the origination field mean "2004 or prior"? Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 11 Here's another example of dirty (or at least, "incompletely documented" data). We are looking at the age of mortgages in our bank's home loan portfolio. The age is calculated by subtracting the origination date of the loan from "today" (2013). The first thing we notice is that we don't seem to have loans older than 10 years old – and we also notice that we have a disproportionate number of ten year old loans, relative to the age distribution of the other loans. One possible reason for this is that the date field for loan origination may have been "overloaded" so that "2004" is actually a beacon value that means "2004 or prior" rather than literally 2004. (This sometimes happens when data is ported from one system to another, or because someone, somewhere, decided that origination dates prior to 2004 are not relevant). What would we do about this? If we are analyzing probability of default, it is probably safe to eliminate the data (or keep the assumption that the loans are 10 years old), since 10 year old mortgages default quite rarely (most defaults occur before about the 4 th year). For different analyses, we may need to search for a source of valid origination dates (if that is possible). 
If the data is in the vector mortgage, the plot is made by: hist(mortgage, breaks=10, main="Portfolio Distribution, Years since origination", xlab="Mortgage Age", col="grey") Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 11 Analyzing the Relationship Between Two Variables How? • Two Continuous Variables (or two discrete variables)     Scatterplots LOESS (fit smoothed line to the data) Linear models: graph the correlation Binplots, hexbin plots  More legible color-based plots for high volume data • Continuous vs. Discrete Variable  Jitter, Box and whisker plots, Dotplot or barchart Example: • Household income by region (ZIP1) • Scatterplot with jitter, with box-and-whisker overlaid • New England (0) and West Coast (9) have highest mean household income Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 12 Scatterplots are a good first visualization for the relationship between two variables, especially two continuous variables. Since you are looking for the relationship between the two variables, it can often be helpful to fit a smoothing curve through the data, for example loess or a linear regression. We'll see an example of that a little later on. For very high volume data, scatterplots are problematic; with too much data on the page, the details can get lost. Sometime the jitter() function can create enough (uniform) variation to see the associations more clearly. Hexbin plots are a good alternative: you can think of hexbin plots as two dimensional histograms that use color or grayscale to encode bin heights. There are other alternatives for plotting continuous vs. discrete variables. Dotplots and barcharts plot the continuous value as a function of the discrete value when the relationship is one-to-one. Box and whisker plots show the distribution of the continuous variable for each value of the discrete variable. The example here is of logged household incomes as a function of region (first digit of the zip). (Logged in this case means data that uses the logarithm of the value instead of the value itself.) In this example, we have also plotted the scatterplot beneath the box-and-whisker, with some jittering so each line of points widens into a strip. The "box" of the box and whisker shows the range that contains the central 50% of the data; the line inside the box is the location of the median. The "whiskers" give you an idea of the entire range of the data. Usually, box and whiskers also show "outliers" that lie beyond the whiskers, but they are turned off in this graph. This graphs shows how household income varies by region. The highest median incomes are in New England (region 0) and on the West Coast (region 9). New England is slightly higher, but the boxes for the two regions overlap enough that the difference between the two regions probably is not significant. The lowest household incomes tend to be in region 7 (TX, OK, Ark, LA). Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 12 Two Variables: What are we looking for? • Is there a relationship between the two variables?  Linear? Quadratic?  Exponential?  Try semi-log or log-log plots  Is it a cloud?  Round? Concentrated? Multiple Clusters? • How?  Scatterplots • Example  Red line: linear fit  Blue line: LOESS  Fairly linear relationship, but with wide variance Module 3: Basic Data Analytic Methods Using R 14 Copyright © 2014 EMC Corporation. 
All Rights Reserved. We are looking for a relationship between the two variables. If the functional relationship between the variables is somewhat pronounced, the data lies roughly along a curve: a straight line, a parabola, or an exponential curve. If y is related exponentially to x, then the plot of (x, log(y)) will be approximately linear. If the data is more like a cloud, the relationship is weaker. In the example here, the relationship seems approximately linear; we've plotted the regression line in red. There are times when a standard regression line just doesn't capture the relationship. In that case, the loess() function in R (also lowess()) will fit a non-linear curve to the data. Here we've drawn the lowess curve in blue.

R-Code: Assume a dataset named ds with variables cesd and mcs. The R code to generate the above plot is as follows (note that line color is set with the col argument):

with(ds, {
  plot(mcs ~ cesd)                         # scatterplot
  abline(lm(mcs ~ cesd), col = "red")      # linear fit
  lines(lowess(mcs ~ cesd), col = "blue")  # smoothed (lowess) fit
})

Two Variables: High Volume Data - Plotting. Scatterplot: overplotting makes it difficult to see structure. Hexbinplot: now we see where the data is concentrated. When we have too much data, the structure becomes difficult to see in a scatterplot. Here, we are plotting logged household income against years of education. The "blob" that we get on the scatterplot on the left suggests a somewhat linear relationship (this suggests, by the way, that an extra year of education multiplies your expected income by 10^M, where M is the slope of the regression line). However, we can't really see the structure of how the data is distributed. On the right we have plotted the same data using a hexbinplot. Hexbinplots are a bit like 2-D histograms, where shading tells us how populated each bin is. Now we can see that the data is more densely clustered in a streak that runs through the center of the data cloud, roughly along the regression line. The biggest concentration is around 12 years of education, extending to about 15 years. Notice also the outlier data at MeanEducation = 0. Missing data perhaps?
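The slide describes hexbin plots but does not show the call. A minimal sketch, assuming the hexbin package is available and using simulated data in place of the income/education survey data (which is not provided here); the variable names are illustrative.

library(hexbin)
set.seed(1)
education  <- sample(8:18, 10000, replace = TRUE)
log_income <- 3.5 + 0.1 * education + rnorm(10000, sd = 0.4)

plot(education, log_income, pch = 19)           # ordinary scatterplot: heavy overplotting
hexbinplot(log_income ~ education, xbins = 30)  # shading shows how populated each bin is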
Values for petal length are encoded across the bottom; values for sepal length are encoded on the right hand side of the graphic. We can observe that the green and blue species are well matched, although the blue species has longer petals in the main. The petal length for the red species, however, remain markedly the same, and vary only in the lower half of sepal length values. As an exercise, imagine fitting a regression line to each of these individual graphs. What would you make of the relationship between sepal length and sepal width? The R code for generating the plot is: pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)] ) and uses the iris dataset included with the R standard distribution. Here colors include the species, as well as proving the spirit of APL is alive and well. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 17 Analyzing a Single Variable over Time What? • Looking for …  Data range  Trends  Seasonality How? • Use time series plot Example •International air travel (1949-1960) • Upward trend: growth appears superlinear • Seasonality  Peak air travel around Nov. with smaller peaks near Mar. and June Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 18 Visualizing a variable over time is the same as visualizing any pair of variables, but in this case we are looking for some specific patterns. Data range, of course, tells us how much our y variable has increased or decreased over the period of time we are considering. We want to get a feeling for the growth rate, and whether or not we see and changes in that growth rate. We are also looking for seasonality: a regular pattern in the fluctuations over a fixed period of time. We can think of those patterns as marking "seasons“. In the air travel data example that we show, we can see that air travel peaks regularly around Nov/Dec (the holiday season), with a smaller peak around the middle of the year (summer travel) and an even smaller one near the beginning of the year (spring break?). We can also see that the number of air passengers increased steadily from 1949 to 1960, and that the growth appears to be faster than linear, at least during peak travel season. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 18 Data Exploration vs. Presentation Data Exploration: This tells you what you need to know. Presentation: This tells the stakeholders what they need to know. Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 19 Finally, we want to touch on the difference between using visualization for data exploration, and for presenting results to stakeholders. The plots and tips that we've discussed try to make the details of the data as clear as possible for the data scientist to see structure and relationships. These technical graphs don't always effectively convey the information that needs to be conveyed to non-technical stakeholders. For them, we want crisp graphics that focus on the message we want to convey. We will touch more on this topic in Module 6, but for right now we'll share a small example. The top graph shows the density plot of logged account values for our bank. This graph gives us, as data scientists, information that can be relevant to downstream analysis. 
The account values are distributed approximately lognormally, in the range from 100 to 10M dollars. The median account value is in the area of $30,000 (10^4.5), with the bulk of the accounts between $1000 US and $1M US dollars. It would be hard to explain this graph to stakeholders. For one thing, densityplots are fairly technical, and for another, it is awkward to explain why you are logging the data before showing it. You can convey essentially the same information by partitioning the data into "log-like" bins, and presenting the histogram of those bins, as we do in the bottom plot. Here, we can see that the bulk of the accounts are in the 1000-1M range, with the peak concentration in the 10-50K range, extending out to about 500K. This gives the stakeholders a better sense of the customer base than the top graphic would. [Note – the reason that the lower graph isn't symmetric like the upper graph is because the bins are only "log-like". They aren't truly log10 scaled. Log10 scaled bins would be closer to: 1-3K, 3K-10K, 10K30K..... As an exercise, we could try splitting the bins that way, and we would see that the resulting bar chart would be symmetric. The bins we chose, however, might seem more "natural" to the stakeholders.] Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 19 Check Your Knowledge • Do you think the regression line sufficiently captures the relationship between the two variables? What might you do differently? • In the Iris slide example, how would you characterize the relationship between sepal width and sepal length? • Did you notice the use of color in the Iris slide? Was it effective? Why or why not? Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 21 Please take a moment to answer these questions. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 21 Review of Basic Data Analytic Methods Using R: Analysis Summary During this lesson the following topics were covered: • Justifying why we visualize data • Using plots and graphs to determine: • Shape of a single variable • “dirty” data or “saturated” data • Relationship between two or more variables • Relationship between multiple variables • A single variable over time • Data exploration versus Presentation Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 22 This slide captures the key topics from this lesson. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 22 Review of Basic Data Analytic Methods Using R Copyright © 2014 EMC Corporation. All Rights Reserved. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 1 Module 3: Basic Data Analytic Methods Using R 1 Review of Basic Data Analytic Methods Using R Statistics for Model Building and Evaluation During this lesson the following topics are covered: • Statistics in the Analytic Lifecycle • Hypothesis Testing • Difference of means • Significance, Power, Effect Size • ANOVA • Confidence Intervals Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 2 In this lesson, we’ll be concentrating on model building and evaluation, using the topics described. Copyright © 2014 EMC Corporation. All rights reserved. 
Module 3: Basic Data Analytic Methods Using R 2 Statistics in the Analytic Lifecycle • Model Building and Planning  Can I predict the outcome with the inputs that I have?  Which inputs? • Model Evaluation  Is the model accurate?  Does it perform better than "the obvious guess"  Does it perform better than another candidate model? • Model Deployment  Do my predictions make a difference?  Are we preventing customer churn?  Have we raised profits? Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 3 As Data Scientists. we use statistical techniques not only within our modeling algorithms but also during the early model building stages, when we evaluate our final models, and when we assess how our models improve the situation when deployed in the field. In this section we'll discuss techniques that help us answer questions such as those listed above? Visualization will help with the first question, at least as a first pass. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 3 Hypothesis Testing • Fundamental question: "Is there a difference between the populations based on samples?“  Examples : Mean, Variance • Null hypothesis : There is no difference • Alternate hypothesis : There is a difference Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 4 When conducting statistical tests, such as a model or benchmarking the difference between two populations of data, a common technique to assess the difference or significance is Hypothesis Testing. The basic concept is to come up with ideas that can be proved or disproved with data. When performing these tests, the operating assumption is that there is no difference between two samples or populations. Statisticians refer to this as "the null hypothesis". The “alternate hypothesis” is that there is a difference between two models, samples, or populations. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 4 Null and Alternative Hypotheses: Examples Null Hypothesis Alternative Hypothesis The best estimate of the outcome is the average observed value: • The mean is the "Null Model” The model predicts better than the null model: • The average prediction error from the model is smaller than that of the null model This variable does not affect the outcome: • The coefficient value is zero The variable does affect outcome: The model predictions do not improve revenue: • Revenue is the same with or without intervention Interventions based on model predictions improve revenue: • A/B Testing, ANOVA Copyright © 2014 EMC Corporation. All Rights Reserved. • Coefficient value is non-zero Module 3: Basic Data Analytic Methods Using R 5 Here are some examples of null and alternative hypotheses that we would be answering during the analytic lifecycle. 1. Once we have fit a model – does it predict better than always predicting the mean value of the training data? If we call the mean value of the training data "the null model", then the null hypothesis is that the average squared prediction error from the model is the same as the average squared prediction error from the null model. The alternative is that the model's squared prediction error is less than that of the null model. A variation of that is to determine whether your "new" model predicts better than some "old" model. 
In that case, your null model is the "old" model, and the null and alternative hypotheses are the same as described above.

2. When we are evaluating a model, we sometimes want to know whether or not a given input is actually contributing to the prediction. If we are doing a regression, for example, this is the same as asking if the regression coefficient for a variable is zero. The null hypothesis is that the coefficient is zero; the alternative is that the coefficient is non-zero.

3. Once we have settled on and deployed a model, we are now making decisions based on its predictions. For example, the model may help us make decisions that are supposed to improve revenue. We can test if the model is improving revenue by doing what are referred to as "A/B tests". Suppose the model tells us whether or not to make a customer a special offer. Over the next few days, every customer who comes to us is randomly put into the "A" group or the "B" group. Customers in the A group get special offers (or not) depending on the output of the model. Customers in the B group get special offers "the old way": either they don't get them at all, or they get them by whatever algorithm we used before. If the model and the intervention are successful, then group A should generate higher revenue than group B. If group A does not generate higher revenue than group B (if we accept the null hypothesis that A and B generate the same revenue), then we have to determine whether the problem is that the model makes incorrect predictions, or that our intervention is ineffective. If we are testing more than one intervention at the same time (A, B, and C), then we can do an ANOVA analysis to see if there is a difference in revenue between the groups. We will talk about ANOVA in a bit.

Intuition: Difference of Means. If m1 ≈ m2, the overlap between the two sampling distributions is large. For examples 1 and 3 on the previous slide, we can think of verifying the null hypothesis as verifying whether the mean values of two different groups are the same. If they are not the same, then the alternative hypothesis is true: the introduction of the new behavior did have an effect. Suppose both group1 and group2 are normally distributed with the same standard deviation, sigma. We have n1 samples from group1 and n2 samples from group2. It happens to be true that the empirical estimates of the population means m1 and m2 are also normally distributed, with standard deviations sigma/√n1 and sigma/√n2. In other words, the more samples we have, the better our estimate of the mean. If the means are really the same, then the distributions of m1 and m2 will overlap substantially.

Welch's t-test. The t-statistic for the Welch t-test is t = (m1 - m2) / √(s1²/n1 + s2²/n2), where s1 and s2 are the sample standard deviations and n1 and n2 are the sample sizes.

> x = rnorm(10)     # distribution centered at 0
> y = rnorm(10, 2)  # distribution centered at 2
> t.test(x, y)

        Welch Two Sample t-test

data:  x and y
t = -7.2643, df = 15.05, p-value = 2.713e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.364243 -1.291811
sample estimates:
mean of x mean of y
0.5449713 2.3729984
-t 0 t p-value: area under the tails of the appropriate student's distribution if p-value is small (say < 0.05), then reject the null hypothesis and assume that m1 m2 m1 and m2 are "significantly different" Module 3: Basic Data Analytic Methods Using R 7 In practice, we don’t calculate the area directly. Instead we calculate the t-statistic, which is the difference in the observed means, divided by a quantity that is a function of the observed standard deviations, and the number of observations. If the null hypothesis is true (m1 = m2) then t should be "about zero". Specifically, t is distributed in a bell shaped curve around 0 called the Student's t distribution – the specific shape of the distribution is a function of the number of observations. For a very large number of observations, the Student's t distribution converges to the normal distribution. How do we tell if the t-statistic that we observed is "about zero"? We calculate the probability of observing a t of that magnitude or larger under the null hypothesis – this probability is the area under the tails of the appropriate student distribution. If the alternative hypothesis is that m1 m2, then we look at the area under both tails. If the alternative hypothesis is that m1 > m2 (or m2 > m1), then we look at the area under one tail. This area is called the "p-value". If p is small, then the probability of seeing our observed t under the null hypothesis is small, and we can go ahead and accept the alternative hypothesis. [Note – Welch's t-test does not assume equal variance, and is a more robust variation of Student's t-test] Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 7 Wilcoxon Rank Sum Test • t-test assumes that the populations are normally distributed  Sometimes this is close to true, sometimes not • Wilcoxon Rank Sum test  Makes no assumption about the distributions of the populations  More robust test for difference of means  if p-value is small: reject the null hypothesis (equal means) > mean(x) [1] 0.5449713 > mean(y) [1] 2.372998 > wilcox.test(x, y) wilcoxon rank sum test data: x and y W = 2, p-value = 4.33e-05 alternative hypothesis: true location shift is not equal to 0 Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 8 A t-test represents a parametric test. Student's t-test assumes that both populations are normally distributed with the same variance. Welch's t-test (the t.test() function in R is Welch's t-test by default) does not assume equal variance, but it does assume normality. Sometimes, this is approximately true (true enough to use a t-test), and sometimes, it isn't. If we can't make the normality assumption, then we should use a nonparametric test. The Wilcoxon Rank Sum test will test for difference of means without making the normality assumption. Without getting into the details, Wilcoxon's test uses the fact that if two populations are centered in the same place, then if we merge the observations from each population, sort them, and rank them, then the observations of each population should "mix together". Specifically, if we sum the resulting ranks for each population, the sum should be "about the same". Since Wilcoxon's test doesn't assume anything about the population distribution, it is strictly weaker than t-test when it is applied to normally distributed data. Here, we show the results of wilcox.test() on the same (normally distributed) data from the previous slide. 
wilcox.test() does reject the null hypothesis, but the p-value is an order of magnitude larger than it is with the t-test. So if you know that you can assume the data is nearly normally distributed, then you should use the t-test.

Hypothesis Testing: Summary
• Calculate the test statistic (different hypothesis tests are appropriate in different situations).
• Calculate the p-value of the test statistic.
• If the p-value is "small", reject the null hypothesis. "Small" is often p < 0.05 by convention (95% confidence); many data scientists prefer a smaller threshold.

Every hypothesis test calculates a test statistic that is assumed to be distributed a certain way if the null hypothesis is true, usually around 0 for differences, or around 1 for ratios. Different hypothesis tests are appropriate in different situations: check the assumptions of the test, and whether they are valid (enough) for your situation. The p-value is the probability of observing a value of the test statistic like the one you saw if the null hypothesis is true; it depends on how the test statistic is assumed to be distributed. If the p-value is "small", reject the null hypothesis; "small" is often p < 0.05 by convention (95% confidence), and many data scientists prefer a smaller threshold, often 0.01 or 0.001. Of course, most statistical packages have functions that will do steps 1 and 2 automatically for you. Sometimes you have to find the appropriate distribution and do it by hand.

Generating a Hypothesis: Type I and Type II Error

                                      H0 is true               H0 is false
  Reject H0                           Type I error             Correct outcome
  (we claim something happened)       (false positive)         (true positive)
  Fail to reject H0                   Correct outcome          Type II error
  (we claim nothing happened)         (true negative)          (false negative)

Example: Ham or Spam? H0: it's ham. HA: it's spam. Goal: identify spam.

                       It's really ham              It's really spam
  We say it's spam     Type I – false positive      OK – true positive
  We say it's ham      OK – true negative           Type II – false negative

Which error is worse? So, we have developed our null hypothesis and its alternate. Once we collect the data and begin our analysis, what kind of errors might we make? There are two kinds: Type I errors and (oddly enough) Type II errors. A Type I error is rejecting the null hypothesis when it is actually true; this is a "false positive", finding significance where none exists. A Type II error is failing to reject the null hypothesis when it is actually false, thereby creating a "false negative"; this means that we have failed to find significance when it does exist. Let's use the example of spam filtering (spam refers to "unsolicited commercial email"). Here our H0 is that the email is legitimate (also known as "ham", that is, not spam); our alternate hypothesis is that it is not legitimate (it's "spam"). A false positive means that we treat legitimate email as spam; a false negative implies that we treat spam messages as legitimate.
We could frame the following question: using this Email filter, how often will we identify a valid email message (ham) as spam? We consider this to be a more serious error than labeling a spam email as valid , since spam messages can be filtered from the user’s mailbox, whereas a message incorrectly labeled as spam may contain information critical to the recipient. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 10 Significance, Power and Effect Size • Significance: the probability of a false positive (α)  p-value is your significance • Power: probability of a true positive (1 - β) • Effect size: the size of the observed difference  The actual difference in means, for example Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 12 The significance of a result is the probability of a false positive – rejecting the null hypothesis when it should be accepted. This is exactly the p-value of the result. The threshold of p-values that you will accept depends on how much you are willing to tolerate a false positive. So a p-value threshold of 0.05 means that you are willing to have a false positive 5% of the time. The power of a result is the probability of a true positive – correctly accepting the alternative hypothesis. The desired power is usually used to decide how big a sample to use. Effect size is the actual magnitude of the result: the actual difference between the means, for example. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 12 Always Keep Effect Size in Mind! moderate sample size Both power and significance increase with larger sample sizes. larger So you can observe an effect size that is statistically significant, but practically insignificant! sample size Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 13 For a fixed effect size (delta in the above diagrams), both power and significance increase with larger sample sizes. This is because, for a difference in means (assuming normal distributions), the estimate of the mean gets tighter as the sample size increases. So even if the difference between the means stays the same, the normal distributions around each mean overlap less, and the t-statistic gets larger, which pushes it further out on the tail of the t-distribution. Since there is no limit on how tight the normal distribution can get, you can make any effect size appear statistically significant, even if, for all practical purposes, the difference is "insignificant" (in English terms). So always take into consideration whether or not the effect size you observe truly means "a difference" in your domain. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 13 Hypothesis Testing: ANOVA ANOVA is a generalization of the difference of means • One-way ANOVA  k populations ("treatment groups")  ni samples each – total N subjects  Null hypothesis: ALL the population means are equal Population ni: # offers made mi: avg purchase size Offer 1 100 $55 Offer 2 102 $50 No intervention 99 $25 Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 14 ANOVA (Analysis of Variance) is a generalization of the difference of means. Here we have multiple populations, and we want to see if any of the population means are different from the others. 
That means that the null hypothesis is that ALL the population means are equal. An example: suppose everyone who visits our retail website either gets one of two promotional offers, or no promotion at all. We want to see if making the promotional offers makes a difference. (The null hypothesis is that neither promotion makes a difference. If we want to check if offer 1 is better than offer 2, that's a different question). We can do multi-way ANOVA (MANOVA) as well. For instance if we want to analyze offers and day of week simultaneously, that would be a two-way ANOVA. Multi-way AVNOVA is usually done by doing a linear regression on the outcome, using each of the (categorical) treatments as an input variable. Here, we will only talk about 1-way ANOVA. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 14 ANOVA: Understanding the F statistic Test statistic: Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 15 The first thing to calculate is the test statistic. Here we sketch the intuition behind the test statistic for ANOVA. Essentially, we want to test whether or not the clusters formed by each population are more tightly grouped than the spread across all of the populations. The between-groups mean sum of squares, sB2, is an estimate of the between-groups variance. It is a measure of how the population means vary with respect to the grand mean – the "spread across all of the populations". The within-group mean sum of squares, sW2 , is an estimate of the within-group variance: It is a measure of the “average population variance” – the average "spread" of each cluster. If the null hypothesis is true, then sB2 should be about equal to sW2 – that is, the populations are about as wide as they are far apart – they overlap. Their ratio, the test statistic F, will then be distributed as the F distribution with k-1, N-k degrees of freedom, which is right skewed and has its mode near 1. In the equations above, k is the number of populations, ni is the number of samples in the ith population, and N is the total number of samples. If we observe that F < 1, then the populations clusters are wider than the between group spread, so we can just accept the null hypothesis (no differences). Otherwise, we only need to consider the area under the right tail of the F distribution. Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 15 R Example: ANOVA 3 different offers, and their outcomes Use lm() to do the ANOVA >offers = sample(c("noffer", "offer1", "offer2"), size=500, replace=T) >purchasesize = ifelse(offers=="noffer", rlnorm(500, meanlog=log(25)), ifelse(offers=="offer1", rlnorm(500, meanlog=log(50)), rlnorm(500, meanlog=log(55)))) >offertest = data.frame(offer=as.factor(offers), purchase_amt=purchasesize) > model = lm(log10(purchase_amt) ~ as.factor(offers), data=offertest) >summary(model) Residuals: Min 1Q Median 3Q Max -1.1940 -0.2837 0.0135 0.2863 1.3374 Coefficients: offer1-nooffer offer2-nooffer F-statistic: reject the null hypothesis Tukey's test: all pair-wise tests for difference of means 95% confidence intervals for difference between means Estimate Std. Error t value (Intercept) 1.49092 0.03240 46.011 as.factor(offers)offer1 0.20424 0.04706 4.340 as.factor(offers)offer2 0.22371 0.04596 4.867 --Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Pr(>|t|) < 2e-16 *** 1.73e-05 *** 1.52e-06 *** Residual standard error: 0.4262 on 497 degrees of freedom Multiple R-squared: 0.05479, Adjusted R-squared: 0.05098 F-statistic: 14.4 on 2 and 497 DF, p-value: 8.304e-07 > TukeyHSD(aov(model)) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = model) $offers .No appreciable difference between offer1 and offer2 Copyright © 2014 EMC Corporation. All Rights Reserved. diff lwr upr p adj offer1-noffer 0.20424099 0.09361976 0.3148622 0.0000512 offer2-noffer 0.22370761 0.11566775 0.3317475 0.0000045 offer2-offer1 0.01946663 -0.09146092 0.1303942 0.9104871 Module 3: Basic Data Analytic Methods Using R 16 Here is an example of how to do one-way ANOVA in R. We have a data frame with the outcomes under the three different offer scenarios you saw previously. We can use the linear regression function lm() to do the ANOVA calculations for us. The F-statistic on the linear regression model tells us that we can reject the null hypothesis – at least one of the populations is different from the others. Since we used lm() to do the ANOVA, we have additional information: The intercept of the model is the mean outcome for nooffer. The coefficients for offer1 and offer2 are the difference of means of offer1 and offer2 respectively, from nooffer. The lm() function does a Wald test on each of the coefficients for the null hypothesis that the coefficient value is really zero. We can see from the p-values that the null hypothesis was rejected for both coefficients, with highly significant p-values. So, we can assume that both offer1 and offer2 are significantly different from nooffer. However – we don't know whether or not offer1 is different from offer2. That requires additional tests. Tukey's test does all pair-wise tests for difference of means. We can see the 95% confidence interval for the difference of each pair of means, and the p-value for the test on the difference. A p-value of 0.9104871 for offer1 and offer2 suggests that we really can’t tell the difference between them. A small p-value (p = 0.049) demonstrates statistical vs. practical significance – with more data, the difference gets more statistically significant, but the effect size is still fairly small. Is the effect practically significant? ___________ More references (2-way anova, etc): Practical Regression and ANOVA using R, Julian Faraway (you can get a .pdf file of an old edition of the book online from ) Copyright © 2014 EMC Corporation. All rights reserved. Module 3: Basic Data Analytic Methods Using R 16 Confidence Intervals Example: • Normal data N(μ, σ) • x is the estimate of μ • based on n samples μ falls in the interval x ± 2σ/√n with approx. 95% probability ("95% confidence") If x is your estimate of some unknown value μ, the P% confidence interval is the interval around x that μ will fall in, with probability P. Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 17 The confidence interval for an estimate x of an unknown value mu is the interval that should contain the true value mu, to a desired probability. For example, if you are estimating the mean value (mu) of a normal distribution with std. dev sigma, and your estimate after n samples is X, then mu falls within +/- 2* sigma/sqrt(n) with about 95% probability. Of course, you probably don't know sigma, but you do know the empirical standard deviation of your n samples, s. 
So you would estimate the 95% confidence interval as x ± 2·s/√n; that is, in practice most people estimate the 95% confidence interval as the mean plus or minus twice the standard error (the sample standard deviation divided by the square root of the sample size). This is really only accurate if the data is approximately normally distributed, but it is a helpful rule of thumb.

Example: the defect rate of a disk drive manufacturing process is within 0.9%–1.7%, with 98% confidence. We inspect a sample of 1000 drives from one of our plants. We observe 13 defects in our sample: should we inspect the plant for problems? What if we observe 25 defects in the sample? Suppose we know that a properly functioning disk drive manufacturing process will produce between 9 and 17 defective disk drives per 1000 disk drives manufactured, 98% of the time. On one of our regularly scheduled inspections of a plant, we inspect 1000 randomly selected drives. If we find 13 defective drives, we can't reject the assumption that the plant is functioning properly, because 13 defects is "in bounds" for our process. What if we find 25 defects? We know that this would happen less than 2% of the time in a properly functioning plant, so we should accept the alternate hypothesis that the plant is not functioning properly, and inspect it for problems. (A short R sketch after the lesson summary below illustrates one way to carry out this check.)

Check Your Knowledge
• Refer back to the ANOVA example on an earlier slide. What do you think? Does the difference between offer1 and offer2 make a practical difference? Should we go ahead and implement one of them?
• If yes, and the costs were US $25 for each offer1 and US $10 for offer2, would you still make the same decision?
• In our manufacturing plant example, assuming you would check the plant for problems in the manufacturing process, how might you justify this decision financially?

Review of Basic Data Analytic Methods Using R: Summary. During this lesson the following topics were covered:
• The role of statistics in the analytic lifecycle
• Developing a model and generating the null and the alternative hypothesis
• Difference between means
• The difference between significance, power and effect size, and how they relate to Type I and Type II errors
• Applying ANOVA and determining whether the results are significant
• Defining confidence intervals and applying them
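As referenced above, here is a minimal R sketch of one way to formalize the plant-inspection check. It assumes an in-control defect probability of about 1.3% (the midpoint of the quoted 0.9%–1.7% band) and models the defect count as binomial; both are illustrative assumptions and will not reproduce the slide's exact 9–17 acceptance band.

p_in_control <- 0.013   # assumed in-control defect rate (hypothetical)

# 13 observed defects: large p-value, no evidence the plant is out of control
binom.test(13, 1000, p = p_in_control)

# 25 observed defects: small p-value, inspect the plant for problems
binom.test(25, 1000, p = p_in_control)

# A 98% acceptance band for the defect count under this binomial assumption
qbinom(c(0.01, 0.99), size = 1000, prob = p_in_control)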

Explanation & Answer

Find the attached


Data Mining Techniques
Name
Institution

Data Mining Techniques

Classification
Classification is used to analyze a data set and to generate a set of rules that can be used to assign class labels to future data. It is also helpful for describing and distinguishing data concepts and classes. For instance, before initiating a project, a classifier can be used to assess its feasibility by predicting class labels such as ‘Risky’ or ‘Safe’. (A small illustrative example in R appears after the list of weaknesses below.)
Strengths
➢ Outputs from many classifiers can be given a probabilistic interpretation.
➢ Tree-ensemble classifiers perform well in practice.
➢ Classification also works well on text, audio and image data.
Weaknesses
➢ It can underperform when decision boundaries are highly non-linear.
➢ It is prone to overfitting.
➢ Training typically requires a large amount of labeled data.
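As a small, hedged illustration of classification in R (using the built-in iris data set and the rpart package; this is an added sketch, not part of the original answer):

library(rpart)
set.seed(123)

# Train a decision-tree classifier to predict the species of an iris flower
train_idx <- sample(nrow(iris), 100)
tree <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")

# Predict class labels for the held-out flowers and check accuracy
pred <- predict(tree, iris[-train_idx, ], type = "class")
mean(pred == iris$Species[-train_idx])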
Prediction
Prediction is used to estimate unknown or missing data values based on other, related data values. Despite the name, it is not limited to forecasting future events. As such, ...

