Nilamber Pitamber University Data Mining Discussion

Description

Discuss the major issues in classification model overfitting. Give some examples to illustrate your points.

Unformatted Attachment Preview

Anomaly Detection. Lecture Notes for Chapter 9, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar (4/12/2020).

Anomaly/Outlier Detection
- What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data.
- The natural implication is that anomalies are relatively rare; one in a thousand occurs often if you have lots of data.
- Context is important, e.g., freezing temperatures in July.
- Anomalies can be important or a nuisance: a 10-foot-tall 2-year-old; unusually high blood pressure.

Importance of Anomaly Detection: Ozone Depletion History
- In 1985, three researchers (Farman, Gardiner, and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
- Sources: http://exploringdata.cqu.edu.au/ozone.html and http://www.epa.gov/ozone/science/hole/size.html

Causes of Anomalies
- Data from different classes: measuring the weights of oranges, but a few grapefruit are mixed in.
- Natural variation: unusually tall people.
- Data errors: a 200-pound 2-year-old.

Distinction Between Noise and Anomalies
- Noise is erroneous, perhaps random, values or contaminating objects: a weight recorded incorrectly, or grapefruit mixed in with the oranges.
- Noise doesn't necessarily produce unusual values or objects, and noise is not interesting.
- Anomalies may be interesting if they are not a result of noise.
- Noise and anomalies are related but distinct concepts.

General Issues: Number of Attributes
- Many anomalies are defined in terms of a single attribute: height, shape, color.
- It can be hard to find an anomaly using all attributes: some attributes may be noisy or irrelevant, and an object may be anomalous only with respect to some attributes.
- However, an object may not be anomalous in any one attribute.

General Issues: Anomaly Scoring
- Many anomaly detection techniques provide only a binary categorization: an object is an anomaly or it isn't. This is especially true of classification-based approaches.
- Other approaches assign a score to all points; the score measures the degree to which an object is an anomaly and allows objects to be ranked.
- In the end, you often need a binary decision (should this credit card transaction be flagged?), but it is still useful to have a score.
- How many anomalies are there?

Other Issues for Anomaly Detection
- Find all anomalies at once or one at a time (swamping; masking).
- Evaluation: how do you measure performance? Supervised vs. unsupervised situations.
- Efficiency.
- Context, e.g., a professional basketball team.

Variants of Anomaly Detection Problems
- Given a data set D, find all data points x in D with anomaly scores greater than some threshold t.
- Given a data set D, find all data points x in D having the top-n largest anomaly scores.
- Given a data set D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.

Model-Based Anomaly Detection
- Build a model for the data and see which points don't fit.
- Unsupervised: anomalies are points that don't fit the model well, or points that distort the model. Example models: statistical distributions, clusters, regression, geometric, graph.
- Supervised: anomalies are regarded as a rare class; requires training data.

Additional Anomaly Detection Techniques
- Proximity-based: anomalies are points far away from other points; this can sometimes be detected graphically.
- Density-based: low-density points are outliers.
- Pattern matching: create profiles or templates of atypical but important events or objects; the algorithms to detect these patterns are usually simple and efficient.

Visual Approaches
- Boxplots or scatter plots.
- Limitations: not automatic, and subjective.

Statistical Approaches
- Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to a probability distribution model of the data.
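The probabilistic definition of an outlier just given can be sketched with a simple Gaussian-style model: fit a mean and standard deviation, then flag points that are improbably far from the mean. The data values and the z-score threshold below are hypothetical, and note that the outlier itself inflates the fitted standard deviation, an instance of the parameter-distortion problem mentioned later in these notes.

```python
from statistics import mean, stdev

# Hypothetical 1-D data with one obvious outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 25.0]

mu, sigma = mean(data), stdev(data)

# Flag points that are improbable under the fitted normal model;
# here "improbable" means more than 2 standard deviations from the mean.
outliers = [x for x in data if abs(x - mu) / sigma > 2.0]
print(outliers)  # only the distant point is flagged
```

Because the extreme point drags both the mean and the standard deviation toward itself, a stricter threshold (say 3 sigma) would fail to flag it here, which is why robust estimates or iterative procedures such as Grubbs' test are often preferred.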
- Usually assume a parametric model describing the distribution of the data (e.g., a normal distribution).
- Apply a statistical test that depends on the data distribution, the parameters of the distribution (e.g., mean, variance), and the number of expected outliers (confidence limit).
- Issues: identifying the distribution of a data set (heavy-tailed distributions), the number of attributes, and whether the data is a mixture of distributions.

Normal Distributions
[Figure: a one-dimensional Gaussian probability density, and a two-dimensional Gaussian plotted over x and y.]

Grubbs' Test
- Detects outliers in univariate data; assumes the data come from a normal distribution.
- Detects one outlier at a time: remove the outlier and repeat.
- H0: there is no outlier in the data. HA: there is at least one outlier.
- Grubbs' test statistic: G = max |X_i − X̄| / s, where X̄ is the sample mean and s the sample standard deviation.
- Reject H0 if: G > ((N − 1) / √N) · √( t²(α/(2N), N−2) / (N − 2 + t²(α/(2N), N−2)) ), where t(α/(2N), N−2) denotes the critical value of the t-distribution with N − 2 degrees of freedom.

Statistical-Based: Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions: M (the majority distribution) and A (the anomalous distribution).
- General approach:
  - Initially, assume all the data points belong to M.
  - Let L_t(D) be the log likelihood of D at time t.
  - For each point x_t that belongs to M, move it to A, and let L_{t+1}(D) be the new log likelihood.
  - Compute the difference Δ = L_t(D) − L_{t+1}(D).
  - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A.
- Data distribution: D = (1 − λ) M + λ A.
- M is a probability distribution estimated from the data; it can be based on any modeling method (naïve Bayes, maximum entropy, etc.). A is initially assumed to be a uniform distribution.
- Likelihood at time t:
  L_t(D) = ∏_i P_D(x_i) = ( (1 − λ)^|M_t| ∏_{x_i ∈ M_t} P_{M_t}(x_i) ) · ( λ^|A_t| ∏_{x_i ∈ A_t} P_{A_t}(x_i) )
  LL_t(D) = |M_t| log(1 − λ) + Σ_{x_i ∈ M_t} log P_{M_t}(x_i) + |A_t| log λ + Σ_{x_i ∈ A_t} log P_{A_t}(x_i)

Strengths/Weaknesses of Statistical Approaches
- Firm mathematical foundation; can be very efficient; good results if the distribution is known.
- In many cases the data distribution may not be known; for high-dimensional data it may be difficult to estimate the true distribution; anomalies can distort the parameters of the distribution.

Distance-Based Approaches
- Several different techniques exist.
- An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998); some statistical definitions are special cases of this.
- The outlier score of an object is the distance to its kth nearest neighbor.

[Figures: outlier scores from the one-nearest-neighbor approach for a data set with one outlier and with two outliers, and from the five-nearest-neighbor approach for a small cluster and for clusters of differing density.]
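The distance-based outlier score described above (distance to the kth nearest neighbor) is straightforward to sketch. The 2-D points and choice of k below are hypothetical:

```python
import math

def knn_outlier_score(points, k):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        scores.append(dists[k - 1])
    return scores

# A tight cluster near the origin plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_score(points, k=2)
print(scores)  # the isolated last point scores far higher than the rest
```

This is O(n²) in the number of points, matching the cost noted in the strengths/weaknesses slide; spatial indexes are the usual remedy at scale.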
Strengths/Weaknesses of Distance-Based Approaches
- Simple.
- Expensive: O(n²).
- Sensitive to parameters and to variations in density.
- Distance becomes less meaningful in high-dimensional space.

Density-Based Approaches
- Density-based outlier: the outlier score of an object is the inverse of the density around the object.
- Density can be defined in terms of the k nearest neighbors. One definition: the inverse of the distance to the kth neighbor. Another: the inverse of the average distance to the k neighbors. There is also a DBSCAN definition.
- If there are regions of different density, this approach can have problems.

Relative Density
- Consider the density of a point relative to that of its k nearest neighbors.
[Figure: relative-density outlier scores for example points A (1.33), D (1.40), and C (6.85).]

Density-Based: the LOF Approach
- For each point, compute the density of its local neighborhood.
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of p and the densities of its nearest neighbors.
- Outliers are points with the largest LOF values.
- In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.

Strengths/Weaknesses of Density-Based Approaches
- Simple, but expensive: O(n²).
- Sensitive to parameters.
- Density becomes less meaningful in high-dimensional space.

Clustering-Based Approaches
- Clustering-based outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster.
- For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center; for density-based clusters, if its density is too low; for graph-based clusters, if it is not well connected.
- Other issues include the impact of outliers on the clusters and the number of clusters.
[Figures: distance of points from the closest centroid, and relative distance of points from the closest centroid, used as outlier scores.]

Strengths/Weaknesses of Clustering-Based Approaches
- Simple; many clustering techniques can be used.
- It can be difficult to decide on a clustering technique and on the number of clusters.
- Outliers can distort the clusters.

Data Mining: Avoiding False Discoveries. Lecture Notes for Chapter 10, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar (02/14/2018).

Outline
- Statistical Background
- Significance Testing
- Hypothesis Testing
- Multiple Hypothesis Testing

Motivation
- An algorithm applied to a set of data will usually produce some result(s); there have been claims that the results reported in more than 50% of published papers are false
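The relative-density idea above can be sketched in a few lines: estimate each point's density as the inverse of its average distance to its k nearest neighbors, then compare it with the densities of those neighbors. The point set and k below are hypothetical:

```python
import math

def density(p, points, k):
    """Density of p = inverse of the average distance to its k nearest neighbors."""
    dists = sorted(math.dist(p, q) for q in points if q is not p)
    return 1.0 / (sum(dists[:k]) / k)

def relative_density(p, points, k):
    """Density of p relative to the average density of its k nearest neighbors."""
    neighbors = sorted((q for q in points if q is not p),
                       key=lambda q: math.dist(p, q))[:k]
    avg_neighbor_density = sum(density(q, points, k) for q in neighbors) / k
    return density(p, points, k) / avg_neighbor_density

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
# A low relative density marks an outlier; inverting it gives an outlier score,
# in the spirit of (though not identical to) LOF.
scores = [1.0 / relative_density(p, points, k=2) for p in points]
print(scores)  # the isolated point gets the largest score
```

Points well inside a cluster get a score near 1 (their density matches their neighbors'), which is what lets this approach handle regions of differing density better than a plain distance threshold.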
  (Ioannidis).
- Results may be a product of random variation: any particular data set is a finite sample from a larger population; there is often significant variation among instances in a data set, or heterogeneity in the population; and unusual events or coincidences do happen, especially when looking at lots of events. For this and other reasons, results may not replicate, i.e., generalize to other samples of data.
- Results may not have domain significance: finding a difference that makes no difference.
- Data scientists need to help ensure that the results of data analysis are not false discoveries, i.e., not meaningless or irreproducible.

Statistical Testing
- Statistical approaches are used to help avoid many of these problems; statistics has well-developed procedures for evaluating the results of data analysis: significance testing and hypothesis testing.
- Domain knowledge, careful data collection and preprocessing, and proper methodology are also important: watch for bias and poor-quality data, fishing for good results, and report how the analysis was done.
- Ultimate verification lies in the real world.

Probability and Distributions
- Variables are characterized by a set of possible values, called the domain of the variable. Examples: True or False for binary variables; a subset of the integers for variables that are counts, such as the number of students in a class; a range of real numbers for variables such as weight or height.
- A probability distribution function describes the relative frequency with which the values are observed. A variable with a distribution is called a random variable.
- For a discrete variable, we define a probability distribution by the relative frequency with which each value occurs. Let X be a variable that records the outcome of flipping a fair coin: heads (1) or tails (0). Then P(X = 1) = P(X = 0) = 0.5 (P stands for "probability"); if f is the distribution of X, f(1) = f(0) = 0.5.
- A probability distribution function has minimum value 0 and maximum value 1, and sums to 1: Σ over all values of X of f(X) = 1.

Binomial Distribution
- The number of heads R in a sequence of n coin flips has a binomial distribution:
  P(R = k) = C(n, k) · P(X = 1)^k · P(X = 0)^(n−k)
- What is P(R = k) given n = 10 and P(X = 1) = 0.5?

  k:       0     1     2     3     4     5     6     7     8     9     10
  P(R=k):  0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001

Probability and Distributions (continued)
- For a continuous variable, we define a probability distribution using a density function. The probability of any specific value is 0; only intervals of values have non-zero probability. Examples: P(X > 3), P(X < −3), P(−1 < X < 1). If f is the density of X, P(X > 3) = ∫ from 3 to ∞ of f(x) dx.
- A probability density has minimum value 0 and integrates to 1: ∫ from −∞ to ∞ of f(x) dx = 1.

Gaussian Distribution
- The Gaussian (normal) distribution is the most commonly used:
  f(x) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))
  where μ and σ are the mean and standard deviation of the distribution:
  μ = ∫ x f(x) dx and σ² = ∫ (x − μ)² f(x) dx, both taken over (−∞, ∞).
- The standard normal has μ = 0 and σ = 1, i.e., N(0, 1).
- See http://www.itl.nist.gov/div898/handbook/eda/section3/eda3661.htm and http://www.itl.nist.gov/div898/handbook/index.htm

Statistical Testing (continued)
- Make inferences (decisions) about the validity of a result.
- For statistical inference (testing), we need two things: a statement that we want to
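The binomial table above can be reproduced directly from the formula:

```python
from math import comb

n, p = 10, 0.5
# P(R = k) = C(n, k) * p^k * (1 - p)^(n - k) for each k in 0..n.
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
for k, prob in enumerate(pmf):
    print(k, round(prob, 3))
# k = 5 is the most likely count, with probability about 0.246
```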
disprove, called the null hypothesis (H0).
  - The null hypothesis is typically a statement that the result is merely due to random variation, and is typically the opposite of what we would like to show.
  - We also need a random variable R, called a test statistic, for which we know or can determine a distribution if H0 is true. The distribution of R under H0 is called the null distribution; the value of R is obtained from the result and is typically numeric.

Examples of Null Hypotheses
- A coin or a die is fair.
- The difference between the means of two samples is 0.
- The purchase of a particular item in a store is unrelated to the purchase of a second item; e.g., the purchases of bread and milk are unconnected.
- The accuracy of a classifier is no better than random.

Significance Testing
- Significance testing was devised by the statistician Fisher, who was interested only in whether the null hypothesis is true.
- It was intended only for exploratory analyses of the null hypothesis in the preliminary stages of a study, for example to refine the null hypothesis or modify future experiments.
- For many years, significance testing has been a key approach for justifying the validity of scientific results.
- Fisher introduced the concept of the p-value, which is widely used and misused.

How Significance Testing Works
- Analyze the data to obtain a result; for example, the data could come from flipping a coin 10 times to test its fairness.
- The result is expressed as a value of the test statistic R; for example, let R be the number of heads in 10 flips.
- Compute the probability of seeing the current value of R or something more extreme; this probability is known as the p-value of the test statistic.
- If the p-value is sufficiently small, we reject the null hypothesis H0 and say that the result is statistically significant. A threshold on the p-value is called the significance level α; often the significance level is 0.01 or 0.05.
- If the p-value is not sufficiently small, we say that we fail to reject the null hypothesis. Sometimes we say that we accept the null hypothesis, but a high p-value does not necessarily imply that the null hypothesis is true.

Example: Testing a Coin for Fairness
- H0: P(X = 1) = P(X = 0) = 0.5.
- Define the test statistic R to be the number of heads in 10 flips, and set the significance level α to 0.05.
- The number of heads R has a binomial distribution (see the table above). For which values of R would you reject H0?

One-Sided and Two-Sided Tests
- "More extreme" can be interpreted in different ways. An observed value R_obs of the test statistic can be considered extreme if it is greater than or equal to a certain value R_H, smaller than or equal to a certain value R_L, or outside a specified interval [R_L, R_H].
- The first two cases are one-sided tests (right-tailed and left-tailed, respectively); the last case is a two-sided test.
[Figure: examples of one-tailed and two-tailed tests for a normally distributed test statistic R at a roughly 5% significance level.]
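The coin-fairness test above can be carried out end to end. The observed count of 9 heads in 10 flips is a hypothetical result; the two-sided p-value counts all outcomes at least as far from the expected count as the observation:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p0 = 10, 0.5
observed = 9  # hypothetical result: 9 heads in 10 flips

# Two-sided p-value: probability of a result at least as extreme as observed,
# i.e., at least as far from the expected count n*p0 in either direction.
dev = abs(observed - n * p0)
p_value = sum(binom_pmf(k, n, p0) for k in range(n + 1)
              if abs(k - n * p0) >= dev)
print(round(p_value, 4))  # 0.0215

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")  # reject H0
```

Here the extreme outcomes are k in {0, 1, 9, 10}, whose probabilities sum to 22/1024, so at α = 0.05 we reject H0 and conclude the coin is unlikely to be fair.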
Neyman-Pearson Hypothesis Testing
- Devised by the statisticians Neyman and Pearson in response to perceived shortcomings in significance testing.
- Explicitly specifies an alternative hypothesis, H1; significance testing cannot quantify how an observed result supports H1.
- Defines an alternative distribution, which is the distribution of the test statistic if H1 is true.
- Defines a critical region for the test statistic R: if the value of R falls in the critical region, we reject H0 (we may or may not accept H1 when H0 is rejected).
- The significance level α is the probability of the critical region under H0.

Hypothesis Testing (continued)
- Type I error (α): the error of incorrectly rejecting the null hypothesis for a result. It equals the probability of the critical region under H0, i.e., the significance level α. Formally, α = P(R ∈ critical region | H0).
- Type II error (β): the error of falsely calling a result not significant when the alternative hypothesis is true. It equals the probability of observing test statistic values outside the critical region under H1. Formally, β = P(R ∉ critical region | H1).
- Power: the probability of the critical region under H1, i.e., 1 − β. Power indicates how effective a test is at correctly rejecting the null hypothesis. Low power means that many results that actually show the desired pattern or phenomenon will not be considered significant and will be missed; thus, if the power of a test is low, it may not be appropriate to ignore results that fall outside the critical region.

Example: Classifying Medical Results
- The value of a blood test is used as the test statistic R to identify whether a patient has a particular disease or not.
- H0: for patients not having the disease, R has distribution N(40, 5). H1: for patients having the disease, R has distribution N(60, 5).
- With the critical region R > 50:
  α = ∫ from 50 to ∞ of (1/√(2πσ²)) e^(−(R − μ)²/(2σ²)) dR = 0.023, with μ = 40, σ = 5
  β = ∫ from −∞ to 50 of (1/√(2πσ²)) e^(−(R − μ)²/(2σ²)) dR = 0.023, with μ = 60, σ = 5
  Power = 1 − β = 0.977.
[Figures: densities of the test statistic under the null hypothesis (left curve) and the alternative hypothesis (right curve); shaded regions show α, β, and the power.]

Hypothesis Testing: Effect Size
- Many times we can find a result that is statistically significant but not significant from a domain point of view, e.g., a drug that lowers blood pressure by one percent.
- Effect size measures the magnitude of the effect or characteristic being evaluated, and is often the magnitude of the test statistic; it brings in domain considerations.
- The desired effect size impacts the choice of the critical region, and thus the significance level and power of the test.

Effect Size: Example Problem
- Consider several new treatments for a rare disease, each with a particular probability of success. If we only have a sample size of 10 patients, what effect size is needed to clearly distinguish a new treatment from the baseline, which is 60% effective?
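The α and β values in the medical-testing example above can be checked with the normal CDF, built here from `math.erf`:

```python
from math import erf, sqrt

def norm_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

threshold = 50  # critical region: R > 50

alpha = 1 - norm_cdf(threshold, mu=40, sigma=5)  # P(R > 50 | H0)
beta = norm_cdf(threshold, mu=60, sigma=5)       # P(R <= 50 | H1)
power = 1 - beta

print(round(alpha, 3), round(beta, 3), round(power, 3))  # 0.023 0.023 0.977
```

The symmetry (α = β) comes from the threshold sitting exactly 2 standard deviations from each mean; moving the threshold trades one error for the other.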
  P(R = k) for a sample of n = 10 patients and success probability p:

  k    p=0.60   p=0.70   p=0.80   p=0.90
  0    0.0001   0.0000   0.0000   0.0000
  1    0.0016   0.0001   0.0000   0.0000
  2    0.0106   0.0014   0.0001   0.0000
  3    0.0425   0.0090   0.0008   0.0000
  4    0.1115   0.0368   0.0055   0.0001
  5    0.2007   0.1029   0.0264   0.0015
  6    0.2508   0.2001   0.0881   0.0112
  7    0.2150   0.2668   0.2013   0.0574
  8    0.1209   0.2335   0.3020   0.1937
  9    0.0403   0.1211   0.2684   0.3874
  10   0.0060   0.0282   0.1074   0.3487

Multiple Hypothesis Testing
- Arises when multiple results are produced and multiple statistical tests are performed; the tests studied so far assess the evidence for the null (and perhaps alternative) hypothesis for a single result.
- A regular statistical test does not suffice. For example, getting 10 heads in a row from a fair coin is unlikely in one experiment (probability = (1/2)^10 ≈ 0.001), but across 10,000 such experiments we would expect about 10 such occurrences.

Summarizing the Results of Multiple Tests
- We assume the results fall into two classes, + and −, which follow the alternative and null hypotheses, respectively.
- The focus is typically on the number of false positives (FP), i.e., results that belong to the null distribution (− class) but are declared significant (+ class).

Confusion table for summarizing multiple hypothesis testing results:

                      Declared significant         Declared not significant      Total
                      (+ prediction)               (− prediction)
  H1 true (actual +)  True positive (TP)           False negative (FN),          Positives (m1)
                                                   type II error
  H0 true (actual −)  False positive (FP),         True negative (TN)            Negatives (m0)
                      type I error
  Total               Positive predictions (Ppred) Negative predictions (Npred)  m

Family-Wise Error Rate
- By "family", we mean a collection of related tests. The family-wise error rate (FWER) is the probability of observing even a single false positive (type I error) in an entire set of m results: FWER = P(FP > 0).
- Suppose your significance level is 0.05 for a single test. The probability of no error for one test is 1 − 0.05 = 0.95; for m independent tests it is 0.95^m, so FWER = P(FP > 0) = 1 − 0.95^m. If m = 10, FWER ≈ 0.40.

Bonferroni Procedure
- The goal is to ensure that FWER < α, where α is often 0.05.
- Bonferroni procedure: when m results are to be tested and we require FWER < α, set the significance level α* for every test to α* = α/m.
- If m = 10 and α = 0.05, then α* = 0.05/10 = 0.005.

Example: Bonferroni versus the Naïve Approach
- The naïve approach is to evaluate statistical significance for each result without adjusting the significance level.
[Figure: family-wise error rate (FWER) curves for the naïve approach and the Bonferroni procedure as a function of the number of results m, with α = 0.05.]
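The FWER arithmetic above is easy to verify for m = 10 independent tests:

```python
alpha, m = 0.05, 10

# Probability of at least one false positive across m independent tests,
# each run at the unadjusted significance level.
fwer_naive = 1 - (1 - alpha) ** m
print(round(fwer_naive, 3))  # about 0.401

# Bonferroni: test each result at alpha/m to keep the FWER below alpha.
alpha_star = alpha / m
fwer_bonferroni = 1 - (1 - alpha_star) ** m
print(alpha_star, round(fwer_bonferroni, 3))  # 0.005, about 0.049
```

The Bonferroni-corrected FWER stays just under the target α = 0.05 regardless of m, at the cost of a much stricter per-test threshold.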
False Discovery Rate
- FWER-controlling procedures seek a low probability of obtaining any false positives; they are not the appropriate tool when the goal is to allow some false positives in order to get more true positives.
- The false discovery rate (FDR) measures the rate of false positives, which are also called false discoveries:
  Q = FP / Ppred = FP / (TP + FP) if Ppred > 0, and Q = 0 if Ppred = 0,
  where Ppred is the number of predicted positives.
- If we knew FP, the number of actual false positives, we could compute Q directly, but typically we don't know FP in a testing situation. Thus the FDR is defined as the expected value of Q: FDR = E(Q).

Benjamini-Hochberg Procedure
- An algorithm to control the false discovery rate (FDR).
- Benjamini-Hochberg (BH) FDR algorithm:
  1. Compute p-values for the m results.
  2. Order the p-values from smallest to largest (p1 to pm).
  3. Compute the significance level for p_i as α_i = i × α/m.
  4. Let k be the largest index such that p_k ≤ α_k.
  5. Reject H0 for all results corresponding to the first k p-values, p_i, 1 ≤ i ≤ k.
- In short, the procedure orders the p-values from smallest to largest, then uses a separate significance level α_i = i × α/m for each test.

FDR Example: Picking a Stockbroker
- Suppose we have a test for determining whether a stockbroker makes profitable stock picks. Applied to an individual stockbroker, this test has significance level α = 0.05. We use the same value for our desired false discovery rate, although normally the desired FDR is set higher, e.g., 10% or 20%.
- In the experiments below, 1/3 of the sample came from the alternative distribution.
[Figure: false discovery rate as a function of the number of stockbrokers m, comparing the naïve approach, Bonferroni, and the BH procedure.]
[Figure: expected power (true positive rate) as a function of m, comparing the naïve approach, Bonferroni, and the BH procedure.]

Comparison of FWER and FDR
- FWER is appropriate when it is important to avoid any error. But an FWER procedure such as Bonferroni makes many type II errors and thus has poor power; an FWER approach has a very low false discovery rate.
- FDR is appropriate when it is important to identify positive results, i.e., those belonging to the alternative distribution. By construction, the false discovery rate is good for an FDR procedure such as the BH approach, and an FDR approach also has good power.
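The BH procedure above is short to implement. The p-values below are hypothetical, chosen to show the step-up behavior: p-values well above the Bonferroni threshold α/m can still be rejected when enough smaller p-values precede them.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of results whose null hypotheses the BH procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    # Reject H0 for the k smallest p-values.
    return sorted(order[:k])

# Hypothetical p-values from m = 6 tests.
pvals = [0.001, 0.012, 0.014, 0.019, 0.03, 0.6]
print(benjamini_hochberg(pvals, alpha=0.05))  # [0, 1, 2, 3, 4]
```

Here Bonferroni at α* = 0.05/6 ≈ 0.0083 would reject only the first result, while BH rejects the first five, illustrating the power advantage discussed above.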

Explanation & Answer

Attached.

Running head: DATA MINING DISCUSSION

DATA MINING DISCUSSION
Name:
Course:
Instructor:
Date:

DATA MINING DISCUSSION

Major issues in classification model overfitting with illustrations
Model overfitting is a common phenomenon in data mining. It occurs when a model learns the
training data so well that this becomes a problem, especially once a new set of data is
introduced. Overfitting is a modeling error that arises whenever a function is fit too
closely to a limited collection of data points. Generally, overfitting takes the form of
building an excessively complex model that is capable of explaining every peculiarity
present in the data under analysis. Kenton (2019) posits that the data under investigation
contain some degree of noise; therefore, any attempt to fit the model too closely to
inaccurate and unreliable data will only make the model contain errors tha...
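A minimal illustration of the point above: a 1-nearest-neighbor classifier memorizes its training data, including any noisy labels, so it scores perfectly on the training set while doing worse on held-out data. The tiny 1-D data set below is hypothetical; the true rule is "label 1 when x ≥ 5", and one training point is deliberately mislabeled.

```python
def predict_1nn(train, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

# Hypothetical data: true label is 1 when x >= 5; (4, 1) is a noisy label.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1)]
test = [(1.5, 0), (3.5, 0), (4.5, 0), (6.5, 1), (7.5, 1)]

train_acc = sum(predict_1nn(train, x) == y for x, y in train) / len(train)
test_acc = sum(predict_1nn(train, x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # 1.0 on training data, 0.8 on test data
```

The memorized noisy point (4, 1) is exactly what causes the test error at x = 4.5: perfect training accuracy here is a symptom of fitting the noise, not a sign of a good model.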

