Anomaly Detection
Lecture Notes for Chapter 9
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
4/12/2020
Anomaly/Outlier Detection
What are anomalies/outliers?
– The set of data points that are
considerably different than the
remainder of the data
Natural implication is that anomalies are
relatively rare
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July
Can be important or a nuisance
– 10 foot tall 2 year old
– Unusually high blood pressure
Importance of Anomaly Detection
Ozone Depletion History
In 1985 three researchers (Farman,
Gardiner and Shanklin) were
puzzled by data gathered by the
British Antarctic Survey showing that
ozone levels for Antarctica had
dropped 10% below normal levels
Why did the Nimbus 7 satellite,
which had instruments aboard for
recording ozone levels, not record
similarly low ozone concentrations?
The ozone concentrations recorded
by the satellite were so low they
were being treated as outliers by a
computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies
Data from different classes
– Measuring the weights of oranges, but a few grapefruit
are mixed in
Natural variation
– Unusually tall people
Data errors
– 200 pound 2 year old
Distinction Between Noise and Anomalies
Noise is erroneous, perhaps random, values or
contaminating objects
– Weight recorded incorrectly
– Grapefruit mixed in with the oranges
Noise doesn’t necessarily produce unusual values or
objects
Noise is not interesting
Anomalies may be interesting if they are not a result of
noise
Noise and anomalies are related but distinct concepts
General Issues: Number of Attributes
Many anomalies are defined in terms of a single attribute
– Height
– Shape
– Color
Can be hard to find an anomaly using all attributes
– Noisy or irrelevant attributes
– Object is only anomalous with respect to some attributes
However, an object may not be anomalous in any one
attribute
General Issues: Anomaly Scoring
Many anomaly detection techniques provide only a binary
categorization
– An object is an anomaly or it isn’t
– This is especially true of classification-based approaches
Other approaches assign a score to all points
– This score measures the degree to which an object is an anomaly
– This allows objects to be ranked
In the end, you often need a binary decision
– Should this credit card transaction be flagged?
– Still useful to have a score
How many anomalies are there?
Other Issues for Anomaly Detection
Find all anomalies at once or one at a time
– Swamping
– Masking
Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations
Efficiency
Context
– Professional basketball team
Variants of Anomaly Detection Problems
Given a data set D, find all data points x ∈ D with
anomaly scores greater than some threshold t
Given a data set D, find all data points x ∈ D
having the top-n largest anomaly scores
Given a data set D, containing mostly normal (but
unlabeled) data points, and a test point x,
compute the anomaly score of x with respect to D
Model-Based Anomaly Detection
Build a model for the data and see which points do not fit it well
– Unsupervised
  ◆ Anomalies are those points that don't fit well
  ◆ Anomalies are those points that distort the model
  ◆ Examples:
    – Statistical distribution
    – Clusters
    – Regression
    – Geometric
    – Graph
– Supervised
  ◆ Anomalies are regarded as a rare class
  ◆ Need to have training data
Additional Anomaly Detection Techniques
Proximity-based
– Anomalies are points far away from other points
– Can detect this graphically in some cases
Density-based
– Low density points are outliers
Pattern matching
– Create profiles or templates of atypical but important
events or objects
– Algorithms to detect these patterns are usually simple
and efficient
Visual Approaches
Boxplots or scatter plots
Limitations
– Not automatic
– Subjective
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
Issues
– Identifying the distribution of a data set
◆ Heavy-tailed distributions
– Number of attributes
– Is the data a mixture of distributions?
Normal Distributions
One-dimensional Gaussian
Two-dimensional Gaussian
[Figure: probability density of a one-dimensional Gaussian, and a two-dimensional Gaussian plotted over x and y ranging roughly from −5 to 5; density values range from about 0.01 to 0.1.]
Grubbs’ Test
Detect outliers in univariate data
Assume data comes from normal distribution
Detects one outlier at a time, remove the outlier,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
Grubbs' test statistic:
  G = max |X − X̄| / s
Reject H0 if:
  G > ((N − 1) / √N) · √( t²(α/(2N), N−2) / (N − 2 + t²(α/(2N), N−2)) )
where t(α/(2N), N−2) is the critical value of the t distribution with N − 2 degrees of freedom at significance level α/(2N)
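As a sketch, Grubbs' statistic and rejection threshold can be computed as follows. The t critical value t(α/(2N), N−2) must be supplied by the caller (e.g., from a t-table or a statistics library), since the Python standard library does not provide it; the sample data and the t value in the usage note are illustrative.

```python
import math

def grubbs_statistic(data):
    """Grubbs' test statistic: G = max |X_i - mean| / s (s = sample std. dev.)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return max(abs(x - mean) for x in data) / s

def grubbs_threshold(n, t_crit):
    """Rejection threshold; t_crit = t(alpha/(2N), N-2) must come from a
    t-table or a statistics library (treated here as caller-supplied)."""
    return ((n - 1) / math.sqrt(n)) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))
```

For data = [5, 6, 5, 7, 6, 50], G ≈ 2.04; with N = 6 and α = 0.05, t(α/12, 4) is roughly 4.85 (approximate, from a t-table), giving a threshold of about 1.89, so the value 50 is rejected as an outlier.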
Statistical-based – Likelihood Approach
Assume the data set D contains samples from a
mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to A
  ◆ Let Lt+1(D) be the new log likelihood
  ◆ Compute the difference, Δ = Lt(D) − Lt+1(D)
  ◆ If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A
Statistical-based – Likelihood Approach
Data distribution, D = (1 − λ) M + λ A
M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
A is initially assumed to be uniform distribution
Likelihood and log likelihood at time t:
  Lt(D) = Π(i=1..N) P_D(x_i)
        = [ (1 − λ)^|Mt| · Π(xi ∈ Mt) P_Mt(x_i) ] · [ λ^|At| · Π(xi ∈ At) P_At(x_i) ]
  LLt(D) = |Mt| log(1 − λ) + Σ(xi ∈ Mt) log P_Mt(x_i) + |At| log λ + Σ(xi ∈ At) log P_At(x_i)
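A minimal sketch of this procedure, assuming M is a one-dimensional Gaussian refit from the points currently in M and A is uniform over the data range; λ and c are illustrative. Sign conventions vary; here a point is declared anomalous when moving it into A increases the log likelihood by more than c.

```python
import math

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(M, A, lam, unif_density):
    """LL(D) = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    mu = sum(M) / len(M)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in M) / len(M)) or 1e-9
    ll = len(M) * math.log(1 - lam) + sum(normal_logpdf(v, mu, sigma) for v in M)
    if A:  # A is modeled as uniform over the data range
        ll += len(A) * (math.log(lam) + math.log(unif_density))
    return ll

def likelihood_anomalies(data, lam=0.05, c=5.0):
    unif = 1.0 / (max(data) - min(data))
    M, A, anomalies = list(data), [], []
    for x in list(data):
        M.remove(x)
        A.append(x)
        # Gain in log likelihood from moving x into A (large for anomalies).
        gain = log_likelihood(M, A, lam, unif) - log_likelihood(M + [x], A[:-1], lam, unif)
        if gain > c:
            anomalies.append(x)  # keep x in A permanently
        else:
            A.pop()
            M.append(x)          # move x back to M
    return anomalies
```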
Strengths/Weaknesses of Statistical Approaches
Firm mathematical foundation
Can be very efficient
Good results if distribution is known
In many cases, data distribution may not be known
For high dimensional data, it may be difficult to estimate
the true distribution
Anomalies can distort the parameters of the distribution
Distance-Based Approaches
Several different techniques
An object is an outlier if a specified fraction of the
objects is more than a specified distance away
(Knorr, Ng 1998)
– Some statistical definitions are special cases of this
The outlier score of an object is the distance to
its kth nearest neighbor
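The kth-nearest-neighbor outlier score above can be sketched with a brute-force distance computation (O(n²)); the points below are illustrative.

```python
import math

def knn_outlier_scores(points, k=1):
    """Outlier score of each point = distance to its k-th nearest neighbor
    (brute force over all pairs)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores
```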
One Nearest Neighbor - One Outlier
[Figure: 2D points with one isolated point D; each point is shaded by its outlier score, the distance to its nearest neighbor, on a scale from about 0.4 to 2. Point D has the highest score.]
One Nearest Neighbor - Two Outliers
[Figure: 2D points including two outliers near point D; outlier score (distance to the nearest neighbor) on a scale from about 0.05 to 0.55.]
Five Nearest Neighbors - Small Cluster
[Figure: 2D points with a small cluster near point D; outlier score (distance to the 5th nearest neighbor) on a scale from about 0.4 to 2.]
Five Nearest Neighbors - Differing Density
[Figure: 2D points in clusters of differing density, with point D marked; outlier score (distance to the 5th nearest neighbor) on a scale from about 0.2 to 1.8.]
Strengths/Weaknesses of Distance-Based Approaches
Simple
Expensive – O(n2)
Sensitive to parameters
Sensitive to variations in density
Distance becomes less meaningful in high-dimensional space
Density-Based Approaches
Density-based Outlier: The outlier score of an
object is the inverse of the density around the
object.
– Can be defined in terms of the k nearest neighbors
– One definition: Inverse of distance to kth neighbor
– Another definition: Inverse of the average distance to k
neighbors
– DBSCAN definition
If there are regions of different density, this
approach can have problems
Relative Density
Consider the density of a point relative to that of
its k nearest neighbors
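A sketch of one possible relative-density definition: take density as the inverse of the average distance to the k nearest neighbors, and score each point by the average density of its neighbors divided by its own density (a simplified LOF-style ratio; scores near 1 are normal, large scores are outliers). The data are illustrative.

```python
import math

def avg_knn_dist(points, i, k):
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return sum(dists[:k]) / k

def relative_density_scores(points, k=2):
    """Outlier score = (avg density of a point's k nearest neighbors) / (its own density),
    with density = inverse of the average distance to the k nearest neighbors."""
    n = len(points)
    dens = [1.0 / avg_knn_dist(points, i, k) for i in range(n)]
    scores = []
    for i in range(n):
        # Indices of the k nearest neighbors of point i (self excluded via +inf).
        nbrs = sorted(range(n),
                      key=lambda j: math.inf if j == i else math.dist(points[i], points[j]))[:k]
        scores.append((sum(dens[j] for j in nbrs) / k) / dens[i])
    return scores
```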
Relative Density Outlier Scores
[Figure: relative density outlier scores for a 2D data set; point C has score 6.85, point D has score 1.40, and point A has score 1.33; the score scale runs from about 1 to 6.]
Density-based: LOF approach
For each point, compute the density of its local neighborhood
Compute local outlier factor (LOF) of a sample p as the average of
the ratios of the density of sample p and the density of its nearest
neighbors
Outliers are points with largest LOF value
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 to be outliers

[Figure: points p1 and p2 near clusters of differing density.]
Strengths/Weaknesses of Density-Based Approaches
Simple
Expensive – O(n2)
Sensitive to parameters
Density becomes less meaningful in high-dimensional space
Clustering-Based Approaches
Clustering-based Outlier: An
object is a cluster-based outlier if
it does not strongly belong to any
cluster
– For prototype-based clusters, an
object is an outlier if it is not close
enough to a cluster center
– For density-based clusters, an object
is an outlier if its density is too low
– For graph-based clusters, an object
is an outlier if it is not well connected
Other issues include the impact of
outliers on the clusters and the
number of clusters
Distance of Points from Closest Centroids
[Figure: distance of each point to its closest centroid, used as the outlier score; point C has score 4.6, point A has score 1.2, and point D has score 0.17; the score scale runs from about 0.5 to 4.5.]
Relative Distance of Points from Closest Centroid
[Figure: relative distance of each point to its closest centroid, used as the outlier score; the score scale runs from about 0.5 to 4.]
Strengths/Weaknesses of Clustering-Based Approaches
Simple
Many clustering techniques can be used
Can be difficult to decide on a clustering
technique
Can be difficult to decide on number of clusters
Outliers can distort the clusters
Data Mining: Avoiding False Discoveries
Lecture Notes for Chapter 10
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
02/14/2018
Outline
Statistical Background
Significance Testing
Hypothesis Testing
Multiple Hypothesis Testing
Motivation
An algorithm applied to a set of data will usually produce some
result(s)
– There have been claims that the results reported in more than 50%
of published papers are false. (Ioannidis)
Results may arise from random variation
– Any particular data set is a finite sample from a larger population
– Often significant variation among instances in a data set or
heterogeneity in the population
– Unusual events or coincidences do happen, especially when looking
at lots of events
– For this and other reasons, results may not replicate, i.e., generalize
to other samples of data
Results may not have domain significance
– Finding a difference that makes no difference
Data scientists need to help ensure that the results of data
analysis are not false discoveries, i.e., results that are not meaningful or reproducible
Statistical Testing
Statistical approaches are used to help avoid many
of these problems
Statistics has well-developed procedures for
evaluating the results of data analysis
– Significance testing
– Hypothesis testing
Domain knowledge, careful data collection and
preprocessing, and proper methodology are also
important
– Bias and poor quality data
– Fishing for good results
– Reporting how analysis was done
Ultimate verification lies in the real world
Probability and Distributions
Variables are characterized by a set of possible
values
– Called the domain of the variable
– Examples:
  ◆ True or False for binary variables
  ◆ Subset of integers for variables that are counts, such as the number of students in a class
  ◆ Range of real numbers for variables such as weight or height
A probability distribution function describes
the relative frequency with which the values are
observed
Call a variable with a distribution a random
variable
Probability and Distributions ..
For a discrete variable we define a probability
distribution by the relative frequency with which
each value occurs
– Let X be a variable that records the outcome of flipping
a fair coin: heads (1) or tails (0)
– P(X = 1) = P(X = 0) = 0.5 (P stands for "probability")
– If f is the distribution of X, f(1) = f(0) = 0.5
Probability distribution function has the following
properties
– Minimum value 0, maximum value 1
– Sums to 1, i.e., Σ over all values of X of f(X) = 1
Binomial Distribution
Number of heads in a sequence of n coin flips
– Let R be the number of heads
– R has a binomial distribution
– P(R = k) = C(n, k) · P(X = 1)^k · P(X = 0)^(n−k)
– What is P(R = k) given n = 10 and P(X = 1) = 0.5?

  k    P(R = k)
  0    0.001
  1    0.01
  2    0.044
  3    0.117
  4    0.205
  5    0.246
  6    0.205
  7    0.117
  8    0.044
  9    0.01
  10   0.001
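The binomial probabilities in the table can be reproduced directly with the standard library:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(R = k): probability of k heads in n flips with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Reproduce the table above for n = 10, p = 0.5
table = {k: round(binom_pmf(k, 10, 0.5), 3) for k in range(11)}
```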
Probability and Distributions ..
For a continuous variable we define a probability
distribution by using a density function
– Probability of any specific value is 0
– Only intervals of values have non-zero probability
  ◆ Examples: P(X > 3), P(X < −3), P(−1 < X < 1)
  ◆ If f is the density of X, P(X > 3) = ∫(3..∞) f(x) dx
Probability density has the following properties
– Minimum value 0
– Integrates to 1, i.e., ∫(−∞..∞) f(x) dx = 1
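For the normal density, interval probabilities of this kind can be computed with the error function from the standard library; for example P(X > 3) ≈ 0.0013 for the standard normal:

```python
import math

def normal_sf(x, mu=0.0, sigma=1.0):
    """Survival function P(X > x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 - math.erf((x - mu) / (sigma * math.sqrt(2))))
```

By symmetry P(X < −3) equals P(X > 3), and P(−1 < X < 1) = 1 − 2·P(X > 1) ≈ 0.683.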
Gaussian Distribution
The Gaussian (normal) distribution is the most commonly used
– f(x) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))
– where μ and σ are the mean and standard deviation of the
distribution
– μ = ∫(−∞..∞) x f(x) dx and σ² = ∫(−∞..∞) (x − μ)² f(x) dx
For example, μ = 0 and σ = 1 gives the standard normal, 𝒩(0, 1)
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3661.htm
http://www.itl.nist.gov/div898/handbook/index.htm
Statistical Testing …
Make inferences (decisions) about the validity of a
result
For statistical inference (testing), we need two things:
– A statement that we want to disprove
  ◆ Called the null hypothesis (H0)
  ◆ The null hypothesis is typically a statement that the result is merely due to random variation
  ◆ It is typically the opposite of what we would like to show
– A random variable, R, called a test statistic, for which we
know or can determine a distribution if H0 is true.
  ◆ The distribution of R under H0 is called the null distribution
  ◆ The value of R is obtained from the result and is typically numeric
Examples of Null Hypotheses
A coin is fair, or a die is fair.
The difference between the means of two
samples is 0
The purchase of a particular item in a store is
unrelated to the purchase of a second item, e.g.,
the purchase of bread and milk are unconnected
The accuracy of a classifier is no better than
random
Significance Testing
– Significance testing was devised by the statistician Fisher
– Only interested in whether the null hypothesis is true
– Significance testing was intended only for exploratory analyses of
the null hypothesis in the preliminary stages of a study
  ◆ For example, to refine the null hypothesis or modify future experiments
– For many years, significance testing has been a key
approach for justifying the validity of scientific results
– Introduced the concept of p-value, which is widely used
and misused
How Significance Testing Works
Analyze the data to obtain a result
– For example, data could be from flipping a coin 10 times
to test its fairness
The result is expressed as a value of the test
statistic, 𝑅
– For example, let 𝑅 be the number of heads in 10 flips
Compute the probability of seeing the current
value of 𝑅 or something more extreme
– This probability is known as the p-value of the test
statistic
How Significance Testing Works …
If the p-value is sufficiently small, we reject the
null hypothesis, H0, and say that the result is
statistically significant
– A threshold on the p-value is called the significance
level, 𝜶
◆
Often the significance level is 0.01 or 0.05
If the p-value is not sufficiently small, we say that
we fail to reject the null hypothesis
– Sometimes we say that we accept the null
hypothesis but a high p-value does not necessarily
imply the null hypothesis is true
Example: Testing a coin for fairness
H0: P(X =1) = P(X =0) = 0.5
Define the test statistic 𝑅 to be the
number of heads in 10 flips
Set the significance level 𝛼 to be
0.05
The number of heads 𝑅 has a
binomial distribution
For which values of 𝑅 would you
reject H0?
  k    P(R = k)
  0    0.001
  1    0.01
  2    0.044
  3    0.117
  4    0.205
  5    0.246
  6    0.205
  7    0.117
  8    0.044
  9    0.01
  10   0.001
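A sketch of this test: compute the two-sided p-value for an observed number of heads and compare it to α = 0.05. Folding the computation onto the upper tail relies on the symmetry of the fair-coin distribution.

```python
from math import comb

def binom_pmf(k, n=10, p=0.5):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(r_obs, n=10):
    """Two-sided p-value for r_obs heads out of n flips of a fair coin.
    Valid only for p = 0.5, where the distribution is symmetric."""
    r = max(r_obs, n - r_obs)  # fold onto the upper tail
    return min(1.0, 2 * sum(binom_pmf(k, n) for k in range(r, n + 1)))
```

At α = 0.05 this rejects H0 for R ∈ {0, 1, 9, 10}: for example, R = 9 gives p ≈ 0.021, while R = 8 gives p ≈ 0.109.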
One-sided and Two-sided Tests
More extreme can be interpreted in different ways
For example, an observed value of the test
statistic, 𝑅𝑜𝑏𝑠 , can be considered extreme if
– it is greater than or equal to a certain value, RH,
– smaller than or equal to a certain value, RL, or
– outside a specified interval, [RL, RH].
The first two cases are "one-sided tests" (right-tailed and left-tailed, respectively),
The last case results in a "two-sided test."
One-sided and Two-sided Tests …
Example of one-tailed and two-tailed tests for a
test statistic R that is normally distributed, for a
roughly 5% significance level.

[Figure: normal density curves with shaded rejection regions in one tail and in both tails.]
Neyman-Pearson Hypothesis Testing
Devised by statisticians Neyman and Pearson in
response to perceived shortcomings in
significance testing
– Explicitly specifies an alternative hypothesis, H1
– Significance testing cannot quantify how an observed
result supports H1
– Define an alternative distribution, which is the
distribution of the test statistic if H1 is true
– We define a critical region for the test statistic R
  ◆ If the value of R falls in the critical region, we reject H0
  ◆ We may or may not accept H1 if H0 is rejected
– The significance level, α, is the probability of the
critical region under H0
Hypothesis Testing …
Type I Error (𝜶): Error of incorrectly rejecting the
null hypothesis for a result.
– It is equal to the probability of the critical region under
H0, i.e., is the same as the significance level, 𝜶.
– Formally, α = P(𝑅 ∈ Critical Region | H0)
Type II Error (β): Error of falsely calling a result
as not significant when the alternative hypothesis
is true.
– It is equal to the probability of observing test statistic
values outside the critical region under H1
– Formally, β = P(𝑅 ∉ Critical Region | H1).
Hypothesis Testing …
Power: the probability of the critical
region under H1, i.e., 1 − β.
– Power indicates how effective a test will be at
correctly rejecting the null hypothesis.
– Low power means that many results that actually
show the desired pattern or phenomenon will not be
considered significant and thus will be missed.
– Thus, if the power of a test is low, then it may not be
appropriate to ignore results that fall outside the
critical region.
Example: Classifying Medical Results
The value of a blood test is used as the test statistic, 𝑅, to
identify whether a patient has a particular disease or not.
– H0: For patients 𝐧𝐨𝐭 having the disease, 𝑅 has
distribution 𝒩(40, 5)
– H1: For patients having the disease, 𝑅 has
distribution 𝒩(60, 5)
Taking 50 as the decision threshold:

  α = ∫(50..∞) (1/√(2πσ²)) e^(−(R−μ)²/(2σ²)) dR = 0.023, with μ = 40, σ = 5
  β = ∫(−∞..50) (1/√(2πσ²)) e^(−(R−μ)²/(2σ²)) dR = 0.023, with μ = 60, σ = 5

Power = 1 − β = 0.977
– See figures on the next page
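The values of α, β, and power can be checked numerically with the normal CDF, taking 50 as the decision threshold (the threshold implied by the integration limits above):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

threshold = 50.0
alpha = 1 - normal_cdf(threshold, mu=40, sigma=5)  # P(R >= 50) for patients without the disease
beta = normal_cdf(threshold, mu=60, sigma=5)       # P(R < 50) for patients with the disease
power = 1 - beta
```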
α, β and Power for Medical Testing Example

[Figure: four subfigures plotting the probability density of the test statistic R over the range 20 to 80.]
Distribution of test statistic for the alternative hypothesis (rightmost density curve) and null hypothesis (leftmost density curve). Shaded region in right subfigure is α.
Shaded region in left subfigure is β and shaded region in right subfigure is power.
Hypothesis Testing: Effect Size
Many times we can find a result that is
statistically significant but not significant from a
domain point of view
– A drug that lowers blood pressure by one percent
Effect size measures the magnitude of the effect
or characteristic being evaluated, and is often the
magnitude of the test statistic.
– Brings in domain considerations
The desired effect size impacts the choice of the
critical region, and thus the significance level and
power of the test
Effect Size: Example Problem
Consider several new treatments for a rare disease that have a
particular probability of success. If we only have a sample size
of 10 patients, what effect size will be needed to clearly
distinguish a new treatment from the baseline, which is 60%
effective?

  R \ p(X=1)   0.60     0.70     0.80     0.90
  0            0.0001   0.0000   0.0000   0.0000
  1            0.0016   0.0001   0.0000   0.0000
  2            0.0106   0.0014   0.0001   0.0000
  3            0.0425   0.0090   0.0008   0.0000
  4            0.1115   0.0368   0.0055   0.0001
  5            0.2007   0.1029   0.0264   0.0015
  6            0.2508   0.2001   0.0881   0.0112
  7            0.2150   0.2668   0.2013   0.0574
  8            0.1209   0.2335   0.3020   0.1937
  9            0.0403   0.1211   0.2684   0.3874
  10           0.0060   0.0282   0.1074   0.3487
Multiple Hypothesis Testing
Arises when multiple results are produced and
multiple statistical tests are performed
The tests studied so far are for assessing the
evidence for the null (and perhaps alternative)
hypothesis for a single result
A regular statistical test does not suffice
– For example, getting 10 heads in a row for a fair coin
is unlikely for one such experiment
  ◆ probability = (1/2)^10 ≈ 0.001
– But, for 10,000 such experiments we would expect 10
such occurrences
Summarizing the Results of Multiple Tests
The following confusion table defines how results of multiple tests are
summarized
– We assume the results fall into two classes, + and –, which follow the
alternative and null hypotheses, respectively.
– The focus is typically on the number of false positives (FP), i.e., the results
that belong to the null distribution (– class) but are declared significant
(+ class).

Confusion table for summarizing multiple hypothesis testing results.

                      | Declared significant  | Declared not significant | Total
                      | (+ prediction)        | (– prediction)           |
  H1 True (actual +)  | True Positive (TP)    | False Negative (FN),     | Positives (m1)
                      |                       | type II error            |
  H0 True (actual –)  | False Positive (FP),  | True Negative (TN)       | Negatives (m0)
                      | type I error          |                          |
  Total               | Positive Predictions  | Negative Predictions     | m
                      | (Ppred)               | (Npred)                  |
Family-wise Error Rate
By family, we mean a collection of related tests
The family-wise error rate (FWER) is the probability
of observing even a single false positive (type I
error) in an entire set of m results.
– FWER = P(FP > 0)
Suppose your significance level is 0.05 for a
single test
– Probability of no error for one test is 1 − 0.05 = 0.95
– Probability of no error for m tests is 0.95^m
– FWER = P(FP > 0) = 1 − 0.95^m
– If m = 10, FWER ≈ 0.40
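The FWER calculation above, assuming the m tests are independent; note that 1 − 0.95^10 ≈ 0.40.

```python
def fwer(alpha, m):
    """Probability of at least one false positive across m independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** m
```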
Bonferroni Procedure
The goal is to ensure that FWER < α,
where α is often 0.05
Bonferroni Procedure:
– m results are to be tested
– Require FWER < α
– Set the significance level, α*, for every test to be
α* = α/m
If m = 10 and α = 0.05, then α* = 0.05/10 = 0.005
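The Bonferroni correction in code, checking that the corrected per-test level keeps the FWER under α (again assuming independent tests):

```python
def bonferroni_alpha(alpha, m):
    """Per-test significance level alpha* = alpha / m."""
    return alpha / m

def fwer(alpha_star, m):
    """FWER for m independent tests each run at level alpha_star."""
    return 1 - (1 - alpha_star) ** m
```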
Example: Bonferroni versus Naïve approach
Naïve approach is to evaluate statistical significance for each
result without adjusting the significance level.
[Figure: family-wise error rate (FWER) versus the number of results m, from 0 to 100, for the naïve approach and the Bonferroni procedure; the FWER axis runs from 0 to 1 with 0.05 marked.]
The family wise error rate (FWER) curves for the naïve approach and the
Bonferroni procedure as a function of the number of results, m. α = 0.05.
False Discovery Rate
FWER controlling procedures seek a low probability for obtaining any
false positives
– Not the appropriate tool when the goal is to allow some false positives in
order to get more true positives
The false discovery rate (FDR) measures the rate of false positives,
which are also called false discoveries

  Q = FP / Ppred = FP / (TP + FP)   if Ppred > 0
  Q = 0                             if Ppred = 0,

where Ppred is the number of predicted positives
If we know FP, the number of actual false positives, then FDR = Q.
– Typically we don't know FP in a testing situation
Thus, FDR = E(Q), the expected value of Q.
Benjamini-Hochberg Procedure
An algorithm to control the false discovery rate
(FDR)
Benjamini-Hochberg (BH) FDR algorithm:
1: Compute p-values for the m results.
2: Order the p-values from smallest to largest (p1 to pm).
3: Compute the significance level for pi as αi = i × α/m.
4: Let k be the largest index such that pk ≤ αk.
5: Reject H0 for all results corresponding to the first k p-values, pi, 1 ≤ i ≤ k.

This procedure first orders the p-values from
smallest to largest
Then it uses a separate significance level for
each test
– αi = i × α/m
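A direct implementation of the BH procedure above; the p-values in the usage example are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices (into p_values) of the results whose H0 the BH procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:            # threshold alpha_i = i * alpha / m
            k = rank                                     # largest rank passing its threshold
    return sorted(order[:k])
```

For example, `benjamini_hochberg([0.02, 0.7, 0.001, 0.04, 0.008, 0.3])` rejects the null hypothesis for the results at indices 0, 2, and 4 (the three smallest p-values, each below its per-rank threshold).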
FDR Example: Picking a stockbroker
Suppose we have a test for determining whether a stockbroker makes
profitable stock picks. This test, applied to an individual stockbroker, has a
significance level, 𝛼 = 0.05. We use the same value for our desired false
discovery rate.
–
Normally, we set the desired FDR rate higher, e.g., 10% or 20%
The following figure compares the naïve approach, Bonferroni, and the BH
FDR procedure with respect to the false discovery rate for various numbers
of tests, m. 1/3 of the sample were from the alternative distribution.
[Figure: false discovery rate versus the number of stockbrokers m, from 0 to 100, for the naïve approach, Bonferroni, and the BH procedure; the FDR axis runs from 0 to about 0.12 with 0.05 marked.]
False Discovery Rate as a function of m.
FDR Example: Picking a stockbroker …
The following figure compares the naïve approach, Bonferroni, and
the BH FDR procedure with respect to the power for various numbers
of tests, m. 1/3 of the sample were from the alternative distribution.
[Figure: expected power (true positive rate) versus the number of stockbrokers m, from 0 to 100, for the naïve approach, Bonferroni, and the BH procedure.]
Expected Power as a function of m.
Comparison of FWER and FDR
FWER is appropriate when it is important to avoid
any error.
– But an FWER procedure such as Bonferroni makes
many Type II errors and thus has poor power.
– An FWER approach has a very low false discovery
rate
FDR is appropriate when it is important to identify
positive results, i.e., those belonging to the
alternative distribution.
– By construction, the false discovery rate is good for
an FDR procedure such as the BH approach
– An FDR approach also has good power