UCLA Philosophy of Statistics Problems


Mathematics

University Of California Los Angeles

Description


Class 12: Long-Run Defense and Stopping Rules
February 17, 2021

Plan
- Weak Law of Large Numbers
- The long-run defense
- Criticisms of the long-run defense
- Frequentists are sensitive to the stopping rule
- Does sensitivity to the stopping rule matter?

Which is true?
1. If you flip a fair coin 1000 times, then the proportion of heads will be equal to 1/2.
2. If you flip a fair coin 1000 times, then the proportion of heads will be close to 1/2.
3. If you flip a fair coin often enough, then the proportion of heads will be close to 1/2.
4. If you flip a fair coin often enough, then the proportion of heads will likely be equal to 1/2.

Weak Law of Large Numbers
WLLN for coins: If you flip a coin often enough, then the proportion of heads will likely be close to the bias.
WLLN for dice: If you roll a die often enough, then the average of the outcomes will likely be close to the expected value.
WLLN for arbitrary distributions: If you perform an experiment often enough, then the average of the outcomes will likely be close to the expected value.

The long-run defense
Million-Dollar Question: Why use significance testing or Neyman-Pearson testing?
Million-Dollar Answer? Because they do well in the long run.

Example: significance testing
Microchips 1: You’re in charge of quality control on a microchip production line. Your task is to decide whether the chips are sound or defective. So you perform a significance test on each chip. The probability of mistaken rejection is .05.

How does it do in the long run? Suppose you perform the test once on each of many chips. Focusing on the sound chips, what does the Weak Law tell us? If you test enough sound chips, the proportion of incorrect decisions among them is likely close to .05.

Example: Neyman-Pearson testing
Microchips 2: You’re in charge of quality control on a microchip production line. Your task is to decide whether the chips are sound or defective. So you perform a Neyman-Pearson test on each chip. Your rejection region has size .05 and power .80.

How does it do in the long run? Imagine you test many chips. Focusing on the sound chips, what does the Weak Law tell us? If you test enough sound chips, the proportion of incorrect decisions among them is likely close to .05. Focusing on the defective chips, what does the Weak Law tell us? If you test enough defective chips, the proportion of correct decisions among them is likely close to .80.

Criticisms of the long-run defense

Mistaken rejection v. correct rejection
We showed that: if you test enough sound chips, the proportion of incorrect decisions among them is likely close to the probability of mistaken rejection. The probability of mistaken rejection is at most the significance level. So we can make the probability of mistaken rejection as low as we like by reducing the significance level! But what happens to the probability of correct rejection?

Just more probabilities
To say the size is .05 and the power is .80 is to say:
(1) The probability of making an incorrect decision for a sound chip is .05.
(2) The probability of making a correct decision for a defective chip is .80.
What you can then prove is:
(3) If you test enough sound chips, the proportion of incorrect decisions among them is likely close to .05.
(4) If you test enough defective chips, the proportion of correct decisions among them is likely close to .80.
If (1) and (2) don’t reassure you, then why are (3) and (4) any more reassuring?

A fallacy: from reliability to confidence
Imagine a machine for evaluating statements: you feed in the statement, the machine churns away for a while, and then it prints out ‘True’ or ‘False’. The machine is 90% reliable: i.e. with probability 90%, it prints out the correct answer. Should you be 90%-confident, of any particular answer printed by the machine, that the answer is correct?

Frequentists are sensitive to the stopping rule

Question
Coin: To assess whether a coin is fair, I flip it 20 times and count the number of heads. The outcome is: hhhthhhththhhhhtthht. What is the p-value? It depends!

What is a stopping rule?
A stopping rule is part of the experimental design: it describes what you do in the experiment. Examples:
- stop when you’ve flipped the coin 20 times
- stop when you’ve flipped the coin 100 times
- stop when you’ve got a total of 6 tails
- after each flip, draw a card from a deck and stop when you draw the Ace of Spades
- stop when you’ve got three times as many heads as tails (?)
- stop when your p-value is less than 5% (?)

Coin, again
To assess whether a coin is fair, I flip it 20 times and count the number of heads. The outcome is: hhhthhhththhhhhtthht. The description of the experiment is consistent with multiple stopping rules. In particular:
- Fixed Flips: stop when you’ve tossed the coin 20 times.
- Fixed Tails: stop when you’ve got a total of 6 tails.
The p-value depends on the stopping rule!

Analyzing with Fixed Flips
Here’s the distribution of the number of heads given Fixed Flips (figure omitted). The actual number of heads is 14. The p-value is .115.

Analyzing with Fixed Tails
Here’s the distribution of the number of heads given Fixed Tails (figure omitted). The actual number of heads is 14. The p-value is .032.

Does sensitivity to stopping rules matter?
Howson and Urbach (2006: 158–9): “The fact that significance tests and, indeed, all classical inference models [are sensitive to the stopping rule] is a decisive objection to the whole approach.”
Royall (1997: 23–4): “Any concept or technique for evaluating observations as evidence that denies this equivalence, attaching a different measure of ‘significance’ to [an outcome, according to the underlying stopping rule], is invalid. [...] The ‘irrelevance of the sample space’ is a critically important concept, for it implies a structural flaw that is not limited to significance tests, but pervades all of today’s dominant statistical methodology.”

Getting clear about the argument
(P1) The evidential import of the result doesn’t depend on the stopping rule.
(P2) The p-value does depend on the stopping rule.
(C) Therefore the p-value doesn’t track the evidential import of the result.

Does sensitivity to the stopping rule matter?

Underdetermination
Collaboration: Suppose two scientists are doing an experiment together. As it happens, they have in mind different stopping rules, but they don’t realize this. By chance, the outcome of the experiment is consistent with both stopping rules. What is the p-value of the outcome?

Exploiting the phenomenon?
Can you exploit sensitivity to the stopping rule to boost your chance of getting the desired outcome? Suppose I offer a prize if you manage to bias a coin: I’ll award you the prize just if you reject the hypothesis that the coin is fair at the 5%-level. Two options:
- Fixed Flips: flip the coin 20 times.
- Fixed Difference: flip the coin until you have 10 more heads than tails, or you’ve flipped it a total of 100 times, whichever comes first.
Does the Fixed Difference test boost your chances of mistakenly rejecting the null?

Exploiting the phenomenon
In the Fixed Flips test, we reject at the 5%-level just if we get 0–5 or 15–20 heads. The probability, assuming the null, of getting one of these outcomes is about .041. In the Fixed Difference test, here’s the distribution of the number of heads (figure omitted): it turns out we reject at the 5%-level just if we get 0–17 heads. The probability, assuming the null, of getting one of these outcomes is about .043.

Is the sensitivity a good thing?
Suppose you have two experiments to test whether a coin is fair: in the first, the experimenter flips the coin until he gets more heads than tails; in the second, the experimenter flips the coin 101 times. Both get the same outcome—the very same sequence of h’s and t’s. The outcome contains 51 heads and 50 tails, and the number of heads exceeds the number of tails only on the last flip. Surely the outcome points in different directions in the two cases. In the first case, you should be more confident than not that the coin is tails-biased. After all, it took a long time—101 flips!—to get more heads than tails. In the second case, you should be more confident than not that the coin is heads-biased. After all, there are more heads than tails. Does this show that being sensitive to the stopping rule is a virtue, not a vice?

Next up
Haley Schilling’s guest lecture: she’ll describe a new and interesting spin on significance tests.

Philosophy of Statistics: Homework 7
Due on Gradescope by 11am on Friday, February 26.

Guidelines. Some questions ask you to justify your answers. For these questions, credit will be based on how well you justify your answers, not on whether your answers are correct. (There’s often no consensus on the correct answers, even among statisticians.) However, that doesn’t mean that anything goes: some answers will be hard to justify well. I give suggested word counts, but these are just ballpark numbers. Don’t sweat them too much. Collaboration is encouraged, but make sure to write up your answers by yourself and list your collaborators.

Problem 1 (15 points). Imagine a machine for evaluating statements: you feed in the statement, the machine churns away for a while, and then it prints out ‘True’ or ‘False’. The machine is very reliable: if you feed in a true statement, then with high probability it prints ‘True’; if you feed in a false statement, then with high probability it prints ‘False’. But it does not follow that you should be very confident, of any particular answer printed by the machine, that the answer is correct!
Give an example where you would be certain the answer was correct, an example where you would be very confident the answer was correct, an example where you would be very confident the answer was incorrect, and an example where you would be certain the answer was incorrect. Problem 2 (20 points). A stopping rule tells you what to do in an experiment. For example, in an experiment to test whether a coin is fair one stopping rule is “flip the coin 20 times” and another stopping rule is “flip the coin until you get a total of 6 tails”. There are many other possible stopping rules. Now, imagine a basketball coach wishes to test whether a player’s free throw percentage is 80%, as the player claims. So the coach has the player take some free throws. The results are: make, miss, miss, make, make, miss, make, make, make, miss, make, miss, make, make, make. Give four examples of stopping rules which the coach might be using. (Don’t worry about whether your stopping rules are weird or not. Any four stopping rules consistent with the results will do.) Problem 3 (15 points). In class, we discussed how significance tests are sensitive to the stopping rule. Explain what this means and illustrate with an example. (You can use the example we discussed in class if you like, or come up with your own.) Be as precise as you can. Problem 4 (25 points). Howson and Urbach (2006: 158–9) say: “The fact that significance tests and, indeed, all classical inference models [are sensitive to the stopping rule] is a decisive objection to the whole approach.” Royall (1997: 24) agrees, saying that sensitivity to the stopping rule is a “structural flaw” and means that frequentist methods are “invalid”. Let h be some hypothesis and x be the outcome of an experiment in a significance test of h. We might try to formulate the objection as an argument, like this: (P1) The strength of evidence of x against h doesn’t depend on the stopping rule. (P2) The p-value of x does depend on the stopping rule. 
(C) So, the p-value doesn’t measure the strength of evidence of x against h.
Evaluate this argument. Be as specific as you can. If a premise is false, which premise and why? If the argument is invalid, how does it go wrong and how might it be improved? If the conclusion is true, can anything be salvaged from significance testing? (300 words.)

Problem 5 (25 points). A Bayesian says that an outcome x is evidence for a hypothesis h just if the posterior is greater than the prior. In symbols: P(h | x) > P(h). OK, but how should we measure the strength of the evidence? In class we measured it like this:
Difference measure: the strength of evidence of x for h is measured by the difference between posterior and prior, i.e. P(h | x) − P(h).
But other measures are possible. For example:
Ratio measure: the strength of evidence of x for h is measured by the ratio of posterior and prior, i.e. P(h | x)/P(h).
These two measures are quite different. For example, there are experiments where each of two of the possible outcomes, x1 and x2, is evidence for each of two hypotheses, h and i, but according to one of the measures, the strength of evidence of x1 for h is greater than the strength of evidence of x2 for i, and according to the other measure, the strength of evidence of x1 for h is less than the strength of evidence of x2 for i. Give an example of such an experiment and show your calculation of the strengths of evidence.
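As an aside (not part of the assignment), the two p-values from the class coin example (14 heads in 20 flips, under the Fixed Flips and Fixed Tails stopping rules) can be checked with a short computation, assuming the null hypothesis that the coin is fair:

```python
from math import comb

# Observed outcome from class: 14 heads, 6 tails in 20 flips. Null: fair coin.

# Fixed Flips: stop after 20 flips, so the number of heads is Binomial(20, 1/2).
# Two-sided p-value: outcomes at least as far from 10 as 14 is (<= 6 or >= 14 heads).
p_fixed_flips = sum(comb(20, k) for k in range(21) if k <= 6 or k >= 14) / 2**20

# Fixed Tails: stop at the 6th tail. Seeing 14 heads means the 6th tail arrived
# on flip 20 or later, i.e. at most 5 tails appeared in the first 19 flips.
p_fixed_tails = sum(comb(19, k) for k in range(6)) / 2**19

print(round(p_fixed_flips, 3))  # 0.115
print(round(p_fixed_tails, 3))  # 0.032
```

Here the Fixed Tails p-value is computed one-sidedly (the probability that the sixth tail takes 20 or more flips), which matches the .032 reported in the slides; the Fixed Flips p-value is the usual two-sided one, matching .115.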

Explanation & Answer

Here is the assignment :)

Problem 1 (15 points). Imagine a machine for evaluating statements: you feed in the
statement, the machine churns away for a while, and then it prints out ‘True’ or ‘False’. The
machine is very reliable: if you feed in a true statement, then with high probability it prints
‘True’; if you feed in a false statement, then with high probability it prints ‘False’. But it does
not follow that you should be very confident, of any particular answer printed by the machine,
that the answer is correct! Give an example where you would be certain the answer was
correct, an example where you would be very confident the answer was correct, an example
where you would be very confident the answer was incorrect, and an example where you
would be certain the answer was incorrect.
Example where you would be certain the answer was correct
Let’s say we ask the machine a specific question A, and the machine answers: True.
Even though we know that there is a high probability this answer is correct, we do not know for sure that it is correct.
So, we ask the machine the same question A again, and again the machine answers: True.
Still, we cannot be certain this is the correct answer.
So, we ask the machine the same question A a third time, and the machine answers: True.
We continue asking the machine the same question A 1,000 times. Every time, the machine answers: True.
In this case, we can be certain the answer is correct: assuming each run of the machine is independent, the probability that it answers the same question A incorrectly 1,000 times in a row is astronomically small, not merely low.
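To put a number on how small that probability is, here is a quick sketch, on the strong assumption (not given in the problem) that repeated runs of the machine are independent and each correct with probability 0.9:

```python
# Assumption: each run of the machine is an independent trial,
# wrong with probability 0.1.
p_wrong_each_run = 0.1

# Probability the machine is wrong on all 1,000 independent runs.
p_wrong_every_time = p_wrong_each_run ** 1000

# 10^-1000 is so small it underflows double-precision floating point to zero.
print(p_wrong_every_time)  # 0.0
```

Note that the independence assumption does all the work here: if the machine is deterministic and always gives the same answer to the same statement, repeating the query adds no information.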
Example where you would be very confident the answer was correct
Let’s say we ask the machine a specific question B, and the machine answers: True.
Again, we cannot be sure this is correct, even though the probability that it is correct is high.
So, we ask the machine the same question B again, and the machine answers: True.
We repeat this 1,000 times.
990 of those times, the machine answers: True.
10 times, the machine answers: False.
In this case we can be very confident that the original answer (True) is correct. Suppose the machine has a probability of, say, 0.90 of answering a question correctly and 0.10 of answering it incorrectly. If asked 1,000 times, we cou...

