Class 12: Long-Run Defense and Stopping Rules
February 17, 2021
Plan
Weak Law of Large Numbers
The long-run defense
Criticisms of the long-run defense
Frequentists are sensitive to the stopping rule
Does sensitivity to the stopping rule matter?
Which is true?
1. If you flip a fair coin 1000 times, then the proportion of heads
will be equal to 1/2.
2. If you flip a fair coin 1000 times, then the proportion of heads
will be close to 1/2.
3. If you flip a fair coin often enough, then the proportion of
heads will be close to 1/2.
4. If you flip a fair coin often enough, then the proportion of
heads will likely be equal to 1/2.
Weak Law of Large Numbers
WLLN for coins
If you flip a coin often enough, then the proportion of heads will
likely be close to the bias.
WLLN for dice
If you roll a die often enough, then the average of the outcomes
will likely be close to the expected value.
WLLN for arbitrary distributions
If you perform an experiment often enough, then the average of
the outcomes will likely be close to the expected value.
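The coin version can be checked by direct calculation. The sketch below computes, for a coin with a given bias, the exact probability that the proportion of heads in n flips lands strictly within a tolerance of the bias. The function name `prob_close` and the tolerance of 1/20 are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from math import comb


def prob_close(n, bias, tol):
    """Exact probability that the proportion of heads in n flips
    is strictly within tol of the coin's bias."""
    total = Fraction(0)
    for heads in range(n + 1):
        if abs(Fraction(heads, n) - bias) < tol:
            # Binomial probability of exactly `heads` heads in n flips.
            total += comb(n, heads) * bias**heads * (1 - bias) ** (n - heads)
    return float(total)


if __name__ == "__main__":
    fair = Fraction(1, 2)  # assumed bias for the illustration
    for n in (10, 100, 1000):
        print(n, round(prob_close(n, fair, Fraction(1, 20)), 3))
```

For a fair coin and this tolerance, the probability rises with n and exceeds 0.99 by n = 1000, which is the sense of "likely be close" in the statements above.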
The long-run defense
The defense
Million-Dollar Question
Why use significance testing or Neyman-Pearson testing?
Million-Dollar Answer?
Because they do well in the long run.
Example: significance testing
Microchips 1
You’re in charge of quality control on a microchip production line.
Your task is to decide whether the chips are sound or defective. So
you perform a significance test on each chip. The probability of
mistaken rejection is .05.
How does it do in the long run?
Suppose you perform the test once on each of many chips.
Focusing on the sound chips, what does the Weak Law tell us?
If you test enough sound chips, the proportion of incorrect
decisions among them is likely close to .05.
Example
Microchips 2
You’re in charge of quality control on a microchip production line.
Your task is to decide whether the chips are sound or defective. So
you perform a Neyman-Pearson test on each chip. Your rejection
region has size .05 and power .80.
How does it do in the long run?
Imagine you test many chips.
Focusing on the sound chips, what does the Weak Law tell us?
If you test enough sound chips, the proportion of incorrect
decisions among them is likely close to .05.
Focusing on the defective chips, what does the Weak Law tell us?
If you test enough defective chips, the proportion of correct
decisions among them is likely close to .80.
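A small simulation can make the two long-run claims concrete. This is only a sketch under stated assumptions: the null hypothesis is "the chip is sound", each test is independent, a sound chip is rejected with probability equal to the size, and a defective chip is rejected with probability equal to the power. The function name `simulate_chip_tests` and the seed are invented for the example.

```python
import random


def simulate_chip_tests(n_sound, n_defective, size=0.05, power=0.80, seed=0):
    """Simulate one test per chip. Returns the proportion of sound chips
    mistakenly rejected and the proportion of defective chips correctly
    rejected."""
    rng = random.Random(seed)
    # A sound chip is rejected (incorrect decision) with probability `size`.
    wrong_on_sound = sum(rng.random() < size for _ in range(n_sound))
    # A defective chip is rejected (correct decision) with probability `power`.
    right_on_defective = sum(rng.random() < power for _ in range(n_defective))
    return wrong_on_sound / n_sound, right_on_defective / n_defective


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, simulate_chip_tests(n, n))
```

With small batches the two proportions can wander noticeably; with large batches they should settle near .05 and .80, as the Weak Law says is likely.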
Criticisms of the long-run defense
Mistaken rejection v. correct rejection
We showed that:
If you test enough sound chips, the proportion of incorrect
decisions among them is likely close to the probability of
mistaken rejection.
The probability of mistaken rejection is at most the significance
level.
So we can make the probability of mistaken rejection as low as we
like by reducing the significance level!
But what happens to the probability of correct rejection?
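The question on this slide can be explored numerically. The sketch below is a hypothetical illustration, not the microchip test itself: it assumes a one-sided test of a fair coin (null bias 0.5) against a made-up alternative bias of 0.7, using 20 flips and rejecting when the number of heads reaches a cutoff. The names `upper_tail` and `size_and_power` are invented for this example.

```python
from math import comb


def upper_tail(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def size_and_power(n, alpha, p_null=0.5, p_alt=0.7):
    """Smallest cutoff c whose rejection region {X >= c} has size <= alpha,
    together with that size and the power against the assumed alternative."""
    # c = n + 1 gives an empty rejection region, so a cutoff always exists.
    c = next(c for c in range(n + 2) if upper_tail(n, p_null, c) <= alpha)
    return c, upper_tail(n, p_null, c), upper_tail(n, p_alt, c)


if __name__ == "__main__":
    for alpha in (0.05, 0.01, 0.001):
        cutoff, size, power = size_and_power(20, alpha)
        print(alpha, cutoff, round(size, 4), round(power, 4))
```

In this toy setup, tightening the significance level from 0.05 to 0.01 to 0.001 pushes the cutoff from 15 to 16 to 18 heads, and the power against the assumed alternative falls each time.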
Just more probabilities
To say the size is .05 and the power is .80 is to say:
(1) The probability of making an incorrect decision for a sound
chip is .05.
(2) The probability of making a correct decision for a defective
chip is .80.
What you can then prove is:
(3) If you test enough sound chips, the proportion of incorrect
decisions among them is likely close to .05.
(4) If you test enough defective chips, the proportion of correct
decisions among them is likely close to .80.
If (1) and (2) don’t reassure you, then why are (3) and (4) any
more reassuring?
A fallacy
From reliability to confidence
Imagine a machine for evaluating statements: you feed in the
statement, the machine churns away for a while, and then it prints
out ‘True’ or ‘False’. The machine is 90% reliable: i.e. with
probability 90%, it prints out the correct answer. Should you be
90%-confident, of any particular answer printed by the machine,
that the answer is correct?
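One possible way to probe the question is a Bayesian calculation, which the slide itself does not commit to. The sketch assumes the machine is 90% reliable on true and on false statements alike, and that you have some prior probability that the statement fed in is true. The function name `posterior_true` is invented for the example.

```python
def posterior_true(prior, reliability=0.9):
    """Probability that a statement is true, given the machine printed 'True'.

    Assumes the machine is equally reliable on true and false statements.
    """
    says_true_and_is_true = reliability * prior
    says_true_and_is_false = (1 - reliability) * (1 - prior)
    return says_true_and_is_true / (says_true_and_is_true + says_true_and_is_false)


if __name__ == "__main__":
    for prior in (0.01, 0.5, 0.99):
        print(prior, round(posterior_true(prior), 3))
```

With a prior of 0.5 the posterior is 0.9, matching the reliability. With a prior of 0.01 it is 1/12, about 0.083, so on these assumptions reliability alone does not fix your confidence in a particular printed answer.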
Frequentists are sensitive to the stopping rule
Question
Coin
To assess whether a coin is fair, I flip it 20 times and count the
number of heads. The outcome is: hhhthhhththhhhhtthht.
What is the p-value? It depends!
What is a stopping rule?
A stopping rule is part of the experimental design: it describes
what you do in the experiment.
Examples:
- stop when you’ve flipped the coin 20 times
- stop when you’ve flipped the coin 100 times
- stop when you’ve got a total of 6 tails
- after each flip, draw a card from a deck, and stop when you
  draw the Ace of Spades
- (?) stop when you’ve got three times as many heads as tails
- (?) stop when your p-value is less than 5%
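Some of these rules can be written as a function that looks at the flips so far and says whether to stop. The sketch below does that and checks which rules are consistent with the 20-flip outcome from the Coin slide, meaning the rule says "keep going" at every earlier point and "stop" at the end. All function names are invented for the example. The card-drawing rule is left out because it depends on draws that are not part of the flip record, and the p-value rule is left out because it needs a choice of test.

```python
def fixed_flips(n):
    """Stop once n flips have been made."""
    return lambda flips: len(flips) >= n


def fixed_tails(k):
    """Stop once k tails have been seen."""
    return lambda flips: flips.count("t") >= k


def heads_triple_tails(flips):
    """Stop once there are exactly three times as many heads as tails."""
    return flips.count("t") > 0 and flips.count("h") == 3 * flips.count("t")


def consistent(rule, outcome):
    """True if the rule never says stop before the end of the outcome
    and does say stop once the whole outcome has been observed."""
    kept_going = all(not rule(outcome[:i]) for i in range(len(outcome)))
    return kept_going and rule(outcome)


if __name__ == "__main__":
    outcome = "hhhthhhththhhhhtthht"
    print("20 flips:", consistent(fixed_flips(20), outcome))
    print("100 flips:", consistent(fixed_flips(100), outcome))
    print("6 tails:", consistent(fixed_tails(6), outcome))
    print("heads = 3 x tails:", consistent(heads_triple_tails, outcome))
```

On this outcome the 20-flip and 6-tails rules are both consistent. The 100-flip rule is not, and neither is the heads-triple-tails rule, which would already have stopped after the fourth flip (hhht).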
Coin, again
Coin
To assess whether a coin is fair, I flip it 20 times and count the
number of heads. The outcome is: hhhthhhththhhhhtthht.
The description of the experiment is consistent with multiple
stopping rules.
In particular:
- Fixed Flips: stop when you’ve tossed the coin 20 times.
- Fixed Tails: stop when you’ve got a total of 6 tails.
The p-value depends on the stopping rule!
Analyzing with Fixed Flips
Here’s the distribution of the number of heads given Fixed Flips:
The actual number of heads is 14.
The p-value is .115.
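The figure of .115 can be reproduced with a short calculation. The slides do not say which definition of a two-sided p-value is in play, so this sketch assumes one common choice: add up the probabilities, under a fair coin, of every head count that is no more probable than the observed one. The function name `p_value_fixed_flips` is invented for the example.

```python
from math import comb


def p_value_fixed_flips(heads, n=20):
    """Two-sided p-value for a fair coin flipped n times: total probability
    of every head count that is no more probable than the one observed."""
    prob = [comb(n, k) / 2**n for k in range(n + 1)]
    return sum(p for p in prob if p <= prob[heads])


if __name__ == "__main__":
    print(round(p_value_fixed_flips(14), 3))
```

For 14 heads in 20 flips, the counted outcomes are 0 to 6 heads and 14 to 20 heads.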
Analyzing with Fixed Tails
Here’s the distribution of the number of heads given Fixed Tails:
The actual number of heads is 14.
The p-value is .032.
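The figure of .032 can be reproduced in the same spirit. Under Fixed Tails the number of heads seen before the 6th tail follows a negative binomial distribution, and this sketch assumes the p-value is the probability, under a fair coin, of at least as many heads as observed. Every lower head count is more probable than 14 here, so counting only outcomes no more probable than the observed one gives the same number. The function name `p_value_fixed_tails` is invented for the example.

```python
from math import comb


def p_value_fixed_tails(heads, tails=6):
    """p-value for a fair coin flipped until `tails` tails appear:
    probability of at least `heads` heads before stopping."""
    # P(exactly k heads before the final tail)
    #   = C(k + tails - 1, tails - 1) / 2**(k + tails)
    below = sum(
        comb(k + tails - 1, tails - 1) / 2 ** (k + tails) for k in range(heads)
    )
    return 1 - below


if __name__ == "__main__":
    print(round(p_value_fixed_tails(14), 3))
```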
Does sensitivity to stopping rules matter?
Howson and Urbach (2006: 158–9):
The fact that significance tests and, indeed, all classical
inference models [are sensitive to the stopping rule] is a
decisive objection to the whole approach.
Royall (1997: 23–4):
Any concept or technique for evaluating observations as
evidence that denies this equivalence, attaching a different
measure of ‘significance’ to [an outcome, according to the
underlying stopping rule], is invalid. [...] The ‘irrelevance of
the sample space’ is a critically important concept, for it
implies a structural flaw that is not limited to significance
tests, but pervades all of today’s dominant statistical
methodology.
Getting clear about the argument
(P1) The evidential import of the result doesn’t depend on the
stopping rule.
(P2) The p-value does depend on the stopping rule.
(C) Therefore the p-value doesn’t track the evidential import of
the result.
Does sensitivity to the stopping rule matter?
Underdetermination
Collaboration
Suppose two scientists are doing an experiment together. As it
happens, they have in mind different stopping rules, but they don’t
realize this. By chance, the outcome of the experiment is
consistent with both stopping rules. What is the p-value of the
outcome?
Exploiting the phenomenon?
Can you exploit sensitivity to the stopping rule to boost your
chance of getting the desired outcome?
Suppose I offer a prize if you manage to bias a coin: I’ll award you
the prize just if you reject the hypothesis that the coin is fair at the
5%-level.
Two options:
- Fixed Flips: flip it 20 times.
- Fixed Difference: flip the coin until you have 10 more heads
  than tails, or you’ve flipped it a total of 100 times, whichever
  comes first.
Does the Fixed Difference test boost your chances of mistakenly
rejecting the null?
Exploiting the phenomenon
In the Fixed Flips test, we reject at the 5%-level just if we get 0–5
or 15–20 heads. The probability, assuming the null, of getting one
of these outcomes is about .041.
In the Fixed Difference test, here’s the distribution of the number
of heads:
It turns out we reject at the 5%-level just if we get 0–17 heads.
The probability, assuming the null, of getting one of these
outcomes is about .043.
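Both rejection probabilities can be computed exactly for a fair coin, taking the rejection regions stated on the slide as given. The sketch below sums binomial probabilities for Fixed Flips and, for Fixed Difference, tracks the heads-minus-tails lead flip by flip to build the distribution of the final number of heads. The function names are invented for the example.

```python
from math import comb


def fixed_flips_rejection_prob(n=20, low=5, high=15):
    """P(heads <= low or heads >= high) in n flips of a fair coin."""
    return sum(comb(n, k) for k in range(n + 1) if k <= low or k >= high) / 2**n


def fixed_difference_heads_dist(lead=10, max_flips=100):
    """Distribution of the final head count for a fair coin flipped until
    heads lead tails by `lead` or `max_flips` flips have been made."""
    alive = {0: 1.0}  # heads-minus-tails -> probability of still flipping
    dist = {}         # final number of heads -> probability
    for n in range(1, max_flips + 1):
        nxt = {}
        for diff, p in alive.items():
            for new_diff in (diff + 1, diff - 1):
                nxt[new_diff] = nxt.get(new_diff, 0.0) + p / 2
        stopped = nxt.pop(lead, 0.0)
        if stopped:
            # Stopped at flip n with heads - tails = lead.
            heads = (n + lead) // 2
            dist[heads] = dist.get(heads, 0.0) + stopped
        alive = nxt
    for diff, p in alive.items():
        # Ran out of flips: heads + tails = max_flips, heads - tails = diff.
        heads = (max_flips + diff) // 2
        dist[heads] = dist.get(heads, 0.0) + p
    return dist


if __name__ == "__main__":
    dist = fixed_difference_heads_dist()
    print(round(fixed_flips_rejection_prob(), 4))
    print(round(sum(p for h, p in dist.items() if h <= 17), 4))
```

The two values come out at roughly 0.041 and 0.043. On these numbers, switching to Fixed Difference raises the chance of a mistaken rejection only slightly, and it stays below 5%.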
Is the sensitivity a good thing?
Suppose you have two experiments to test whether a coin is fair: in
the first, the experimenter flips the coin until he gets more heads
than tails; in the second, the experimenter flips the coin 101 times.
Both get the same outcome—the very same sequence of h’s and
t’s. The outcome contains 51 heads and 50 tails and the number
of heads exceeds the number of tails only on the last flip.
Surely the outcome points in different directions in the two cases.
In the first case, you should be more confident than not that the
coin is tails-biased. After all, it took a long time—101 flips!—to
get more heads than tails.
In the second case, you should be more confident than not that the
coin is heads-biased. After all, there are more heads than tails.
Does this show that being sensitive to the stopping rule is a virtue,
not a vice?
Next up
Haley Schilling’s guest lecture: she’ll describe a new and
interesting spin on significance tests.
Philosophy of Statistics: Homework 7
due on Gradescope by 11am on Friday February 26
Guidelines. Some questions ask you to justify your answers. For these questions, credit will be
based on how well you justify your answers, not on whether your answers are correct. (There’s often
no consensus on the correct answers, even among statisticians.) However, that doesn’t mean that
anything goes: some answers will be hard to justify well. I give suggested word counts but these are
just ballpark numbers. Don’t sweat them too much. Collaboration is encouraged, but make sure to
write up your answers by yourself and list your collaborators.
Problem 1 (15 points). Imagine a machine for evaluating statements: you feed in the statement,
the machine churns away for a while, and then it prints out ‘True’ or ‘False’. The machine is very
reliable: if you feed in a true statement, then with high probability it prints ‘True’; if you feed in a
false statement, then with high probability it prints ‘False’. But it does not follow that you should
be very confident, of any particular answer printed by the machine, that the answer is correct! Give
an example where you would be certain the answer was correct, an example where you would be
very confident the answer was correct, an example where you would be very confident the answer
was incorrect, and an example where you would be certain the answer was incorrect.
Problem 2 (20 points). A stopping rule tells you what to do in an experiment. For example,
in an experiment to test whether a coin is fair one stopping rule is “flip the coin 20 times” and
another stopping rule is “flip the coin until you get a total of 6 tails”. There are many other possible
stopping rules.
Now, imagine a basketball coach wishes to test whether a player’s free throw percentage is 80%,
as the player claims. So the coach has the player take some free throws. The results are: make,
miss, miss, make, make, miss, make, make, make, miss, make, miss, make, make, make. Give four
examples of stopping rules which the coach might be using. (Don’t worry about whether your
stopping rules are weird or not. Any four stopping rules consistent with the results will do.)
Problem 3 (15 points). In class, we discussed how significance tests are sensitive to the stopping
rule. Explain what this means and illustrate with an example. (You can use the example we
discussed in class if you like, or come up with your own.) Be as precise as you can.
Problem 4 (25 points). Howson and Urbach (2006: 158–9) say: “The fact that significance tests
and, indeed, all classical inference models [are sensitive to the stopping rule] is a decisive objection
to the whole approach.” Royall (1997: 24) agrees, saying that sensitivity to the stopping rule is a
“structural flaw” and means that frequentist methods are “invalid”.
Let h be some hypothesis and x be the outcome of an experiment in a significance test of h. We
might try to formulate the objection as an argument, like this:
(P1) The strength of evidence of x against h doesn’t depend on the stopping rule.
(P2) The p-value of x does depend on the stopping rule.
(C) So, the p-value doesn’t measure the strength of evidence of x against h.
Evaluate this argument. Be as specific as you can. If a premise is false, which premise and why? If
the argument is invalid, how does it go wrong and how might it be improved? If the conclusion is
true, can anything be salvaged from significance testing? (300 words.)
Problem 5 (25 points). A Bayesian says that an outcome x is evidence for a hypothesis h just
if the posterior is greater than the prior. In symbols: P(h | x) > P(h). OK, but how should we
measure the strength of the evidence? In class we measured it like this:
Difference measure: the strength of evidence of x for h is measured by the difference between
posterior and prior, i.e. P(h | x) − P(h).
But other measures are possible. For example:
Ratio measure: the strength of evidence of x for h is measured by the ratio of posterior and
prior, i.e. P(h | x)/P(h).
These two measures are quite different. For example, there are experiments where each of two of
the possible outcomes, x1 and x2, are evidence for each of two hypotheses, h and i, but according
to one of the measures, the strength of evidence of x1 for h is greater than the strength of evidence
of x2 for i, and according to the other measure, the strength of evidence of x1 for h is less than the
strength of evidence of x2 for i. Give an example of such an experiment and show your calculation
of the strengths of evidence.