New York University − Courant Institute
MATH-UA 235 Probability and Statistics
Maximilian Nitzschner
04/06/2021

Disclaimer: These are lecture notes for the course Probability and Statistics (MATH-UA 235), given at New York University in Spring 2021. The primary textbook reference for this course is [1]. For some advanced topics and further reading, especially concerning more mathematical details, the book [2] may also be helpful. These notes are preliminary and may contain typos. If you see any mistakes or think that the presentation is unclear and could be improved, please send an email to: maximilian.nitzschner@cims.nyu.edu. All comments and suggestions are appreciated.

Contents

0. Motivation
   0.1. What is probability theory?
   0.2. What is statistics?
1. Outcomes, events and probability
   1.1. Sample spaces
   1.2. Events, σ-algebras
   1.3. Probability
   1.4. Elementary combinatorics
2. Conditional probability and stochastic independence
   2.1. Conditional probability
   2.2. The law of total probability and Bayes' theorem
   2.3. Stochastic independence
3. Discrete distributions
4. Introduction to statistical tests and Neyman-Pearson lemma
   4.1. Basic notions of statistical tests
   4.2. The Neyman-Pearson lemma
5. Continuous distributions
6. Random variables
   6.1. Definition of random variables
   6.2. Law and cumulative distribution of a real random variable
   6.3. Transformation of random variables
7. Expectation, variance and higher moments of random variables
   7.1. Expectation
   7.2. Variance
8. Joint distributions and independence of random variables
   8.1. Joint distributions of random variables
   8.2. Independence of random variables
9. Covariance and correlation
10. Operations with random variables
   10.1. Extremes
   10.2. Sums of independent random variables
11. Poisson processes
12. Stochastic convergence and the weak law of large numbers
   12.1. The law of large numbers
   12.2. Moment estimators
A. Appendix
   A.1. Multiple integrals

0. Motivation

The purpose of probability theory and statistics is to study systems that involve randomness.
0.1. What is probability theory?

Consider as a very simple example throwing a single die, which is a process with a random outcome. A first objective will be to develop the mathematical description of characteristics of such a random experiment: this is the specification of a stochastic model. Loosely speaking:

Probability theory is concerned with the description of random phenomena using stochastic models. (0.1)

Here are some examples of phenomena calling for a probabilistic description:
- throwing a (fair) die or coin multiple times;
- describing the random movement of a particle in Z^d, d ≥ 1 (random walk): at every time step, the particle moves randomly to one of its neighboring sites, with "equal probability".

[Figure 0.1: Left panel: possible jumps for a particle at the origin; right panel: position of the particle after 24 steps.]

Some typical questions we could ask in this context are, for instance:
- What is the average of the numbers shown by the die after a large number of throws?
- Where will the random particle be after a large number of steps?
- What is the approximate probability that the sum of the numbers coming up when throwing the die 1000 times exceeds 5000?
- Will the random particle ever come back to the origin?

In this course we develop techniques to answer some of these questions. Notably, we will see the law of large numbers and the central limit theorem, which address the first three questions above.

0.2. What is statistics?

Let us assume we have some outcomes of a random experiment following a distribution with an unknown parameter. Is it possible to guess or reconstruct this parameter from observations? Again, loosely speaking:

Statistics is concerned with methods to draw conclusions from given random observations. (0.2)

Here are some questions that statistics is useful for:
- Assume we have 100 measurements of a certain physical quantity following a known distribution (for instance the lifetime of a light bulb). How do we effectively estimate the parameter underlying this distribution?
- If we throw a coin 100 times and obtain "heads" 75 times, is it reasonable to claim that the coin is not a fair coin? What would we base this decision on?

The first question is an example of an estimation problem, whereas the second question motivates the study of statistical tests. We will develop some of the theory of estimators, tests and other statistical methods in this course.

1. Outcomes, events and probability

(Reference: [1, Chapter 2], or [2, Sections 1.1-1.2, 2.1-2.3])

Our primary objective is to construct a mathematical model for a random experiment. Conceptually, this involves the specification of three quantities:
- a set of outcomes or sample space Ω ≠ ∅; an element ω ∈ Ω should be interpreted as a possible realization / measurement of the random experiment;
- a class of events F ⊆ P(Ω), called a σ-algebra; an event A ∈ F is a subset of Ω, and we aim at specifying its probability;
- a probability measure P, which is a map from F to [0, 1] that assigns a probability P[A] to any given event A ∈ F.

The triple (Ω, F, P) is called a probability space. In the following sections, we give precise definitions of these objects and present examples.

1.1. Sample spaces

Definition 1.1. A non-empty set Ω consisting of the possible realizations of a random experiment is called a set of outcomes or sample space. An element ω ∈ Ω is called an outcome.

Example 1.2.

(i) Tossing a coin: The possible outcomes are heads and tails, which we denote by H and T respectively. In this case, we have

Ω_1 = {H, T}. (1.1)

(ii) Rolling a die: The outcomes are the integer numbers from 1 to 6, so

Ω_2 = {1, 2, 3, 4, 5, 6}. (1.2)

(iii) Tossing a coin and rolling a die: We define the sample space as the Cartesian product of Ω_1 and Ω_2, namely

Ω_3 = Ω_1 × Ω_2 = {(ω_1, ω_2) ; ω_1 ∈ Ω_1, ω_2 ∈ Ω_2} = {(H,1), (H,2), ..., (H,6), (T,1), (T,2), ..., (T,6)}. (1.3)

(iv) n-fold coin toss (where n ∈ N = {1, 2, ...}): Here, we need to record the outcome as an n-tuple

Ω_4 = {(ω_1, ..., ω_n) ; ω_i ∈ {H, T} for 1 ≤ i ≤ n} = Ω_1 × ... × Ω_1 (n times) = Ω_1^n. (1.4)

(v) Tossing a coin infinitely many times: The natural choice for outcomes is similar to the previous example, but with sequences of infinite length rather than n-tuples. More precisely,

Ω_5 = Ω_1^N = {(ω_1, ω_2, ...) ; ω_i ∈ {H, T} for i ∈ N}. (1.5)

(vi) The number of customers in a shop during a given day:

Ω_6 = N_0 = {0, 1, 2, ...}. (1.6)

(vii) The lifetime of a light bulb:

Ω_7 = R_0^+ = [0, ∞). (1.7)

Let us point out that the sample spaces Ω_1, Ω_2, Ω_3, Ω_4 and Ω_6 are countable, whereas Ω_5 and Ω_7 are uncountable. (Footnote: A set S is countable if there exists a surjective (onto) map ρ : N → S. This includes the case of finite S.)
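The product sample spaces in Example 1.2 (iii) and (iv) are easy to enumerate by machine. The following is a minimal Python sketch (not part of the original notes; standard library only), encoding heads and tails as the strings "H" and "T":

```python
# Product sample spaces from Example 1.2 (iii) and (iv).
from itertools import product

omega1 = ["H", "T"]                       # coin toss, Example 1.2 (i)
omega2 = [1, 2, 3, 4, 5, 6]               # die roll, Example 1.2 (ii)

omega3 = list(product(omega1, omega2))    # coin and die, Example 1.2 (iii)
assert len(omega3) == 12                  # |Omega_3| = |Omega_1| * |Omega_2|

n = 4
omega4 = list(product(omega1, repeat=n))  # n-fold coin toss, Example 1.2 (iv)
assert len(omega4) == 2 ** n              # |Omega_4| = 2^n

print(omega3[:3])                         # [('H', 1), ('H', 2), ('H', 3)]
```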
1.2. Events, σ-algebras

Suppose that we have fixed a sample space Ω. In general we are interested in the occurrence of events that consist of a certain selection of outcomes. For instance, consider rolling a die once (recall from Example 1.2, (ii) that Ω_2 = {1, 2, 3, 4, 5, 6} is a reasonable choice of sample space for this random experiment). The event

A = "the upper face of the die shows an even number" (1.8)

can then be expressed as

A = {2, 4, 6} ⊆ Ω_2. (1.9)

Naive definition: An event is a subset A ⊆ Ω of the sample space. This works in the case where Ω is countable (in particular, if Ω is finite), but leads to an important complication when Ω is uncountable (see Example 1.2, (v) and (vii)). It turns out that if we allow every subset A ⊆ Ω for an uncountable Ω, we cannot define a probability for A without running into problems. Fortunately, we can restrict our attention to smaller classes of subsets.

Definition 1.3. Let Ω ≠ ∅. The power set P(Ω) is the set of all subsets of Ω, i.e.

P(Ω) = {A ; A ⊆ Ω}. (1.10)

A σ-algebra on Ω is a subset F ⊆ P(Ω) that fulfills the following properties:

(S1) Ω ∈ F.
(S2) If A ∈ F, then A^c = Ω \ A ∈ F.
(S3) If A_j ∈ F for every j ∈ N, then ⋃_{j=1}^∞ A_j = A_1 ∪ A_2 ∪ A_3 ∪ ... ∈ F.

A set A ∈ F is called an event. If ω ∈ A, we say that the event A occurs (for the outcome ω). If ω ∉ A, we say that A does not occur (for the outcome ω).

Remark 1.4.

(i) The power set P(Ω) itself is a σ-algebra.

(ii) The event Ω always occurs in a random experiment, since ω ∈ Ω is always true. On the other hand, the event ∅ = Ω^c never occurs, since ω ∈ ∅ can never be true.

(iii) In the previous definition, (S2) should be understood as follows: if A ∈ F is an event, then A^c, which has the interpretation that A does not occur, should also be an event. Similarly, (S3) means: if A_1, A_2, A_3, ... are events, then ⋃_{j=1}^∞ A_j, which has the interpretation that one of the A_j occurs, should also be an event.

End of Lecture 1
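For a finite Ω, Remark 1.4 (i) can be checked mechanically. A small Python sketch (an illustration, not from the notes) verifying (S1)-(S3) for F = P(Ω); since F is finite, (S3) reduces to closure under pairwise unions:

```python
# Check the sigma-algebra axioms for the power set of a finite Omega.
from itertools import chain, combinations

omega = frozenset({1, 2, 3})
F = {frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))}

assert omega in F                              # (S1)
assert all(omega - A in F for A in F)          # (S2): closed under complements
assert all(A | B in F for A in F for B in F)   # (S3), finite version: unions

print(f"P(Omega) has {len(F)} events")         # 2^3 = 8
```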
We draw some simple conclusions from Definition 1.3.

Proposition 1.5. Let Ω ≠ ∅ and F ⊆ P(Ω) a σ-algebra.

(i) ∅ ∈ F.
(ii) If A_j ∈ F for every j ∈ N, then ⋂_{j=1}^∞ A_j ∈ F.
(iii) If A, B ∈ F, then A ∪ B ∈ F, A ∩ B ∈ F and A \ B ∈ F.

Proof. We first prove (i): Since Ω ∈ F by (S1) and ∅ = Ω^c = Ω \ Ω, we have that ∅ ∈ F by (S2).

We turn to (ii): By de Morgan's rules (Footnote: The de Morgan rules state that for any collection {U_i}_{i∈I} of subsets U_i ⊆ U, one has (⋃_{i∈I} U_i)^c = ⋂_{i∈I} U_i^c and (⋂_{i∈I} U_i)^c = ⋃_{i∈I} U_i^c. (1.11)) we have that

(⋂_{j=1}^∞ A_j)^c = ⋃_{j=1}^∞ A_j^c ∈ F,

since each A_j^c ∈ F by (S2), and the countable union is in F by (S3). Therefore, we have again by (S2) that

⋂_{j=1}^∞ A_j = ((⋂_{j=1}^∞ A_j)^c)^c ∈ F. (1.12)

We now prove (iii): Set A_1 = A = Ã_1, A_2 = B = Ã_2 and A_j = ∅, Ã_j = Ω for j ≥ 3 (which are all in F, using the assumption, (i) and (S1)). We then see that

A ∪ B = A ∪ B ∪ ∅ ∪ ∅ ∪ ... = ⋃_{j=1}^∞ A_j ∈ F, (1.13)
A ∩ B = A ∩ B ∩ Ω ∩ Ω ∩ ... = ⋂_{j=1}^∞ Ã_j ∈ F, (1.14)

where we used (S3) and (ii), respectively. Finally, we have that

A \ B = A ∩ B^c ∈ F, (1.15)

since B^c ∈ F by (S2). □

We illustrate the set operations using again the example of rolling a single die.

Example 1.6. We use (Ω, F) = ({1, 2, 3, 4, 5, 6}, P({1, 2, 3, 4, 5, 6})) and consider the events

A = "the upper face of the die shows an even number" = {2, 4, 6},
B = "the upper face of the die shows a prime number" = {2, 3, 5},
C = "the upper face of the die shows an odd number" = {1, 3, 5}.

From this, we obtain

B^c = {1, 4, 6}, A ∪ B = {2, 3, 4, 5, 6}, A ∩ B = {2}, A ∩ C = ∅.

We see that the set B^c describes the event that B does not occur, the set A ∪ B describes the event that A or B occurs (Footnote: As always in mathematics, the word "or" has a non-exclusive meaning: it includes the case where A and B both occur.), and A ∩ B describes the event that A and B both occur. The fact that A ∩ C is the empty set corresponds to the fact that the events A and C are mutually exclusive.

[Figure 1.1: Graphical representation of intersection, union and complement of sets (first three panels) and an example of two disjoint sets.]

1.3. Probability

Definition 1.7. Let Ω ≠ ∅ and F ⊆ P(Ω) a σ-algebra on Ω. A function P : F → [0, 1] is called a probability measure (or simply a probability) if the following properties are fulfilled:

(P1) P[Ω] = 1 (normalization).
(P2) If (A_j)_{j∈N} is a sequence of events A_j ∈ F that are pairwise disjoint, namely A_j ∩ A_k = ∅ for every j, k ∈ N with j ≠ k, then

P[⋃_{j=1}^∞ A_j] = ∑_{j=1}^∞ P[A_j] (σ-additivity). (1.16)

The triple (Ω, F, P) is called a probability space.

Example 1.8. A very natural class of examples is given by considering

F = P(Ω), ∅ ≠ Ω finite, (1.17)

and choosing the probability measure as follows:

P : P(Ω) → [0, 1], P[A] = |A| / |Ω|, (1.18)

where |·| denotes the cardinality (i.e. the number of elements) of a set. The probability measure P is the (discrete) uniform distribution on Ω. The resulting probability space (Ω, P(Ω), P) is sometimes called a Laplace probability space. It is characterized by the fact that

P[{ω}] = 1 / |Ω|, for every ω ∈ Ω, (1.19)

meaning that every outcome has the same probability.

Concrete example: We roll a die twice and are interested in the probability that the number 6 shows up at least once. Assuming that the die is fair, we consider the probability space (Ω, F, P) given by

Ω = {1, 2, 3, 4, 5, 6}^2 = {(1,1), (1,2), ..., (1,6), (2,1), ..., (2,6), ..., (6,6)},
F = P(Ω),
P[A] = |A| / |Ω| = |A| / 36, (1.20)

for all A ∈ P(Ω), and the event in question is given by

B = "At least one 6 shows up" = {(1,6), (2,6), (3,6), (4,6), (5,6), (6,6), (6,5), (6,4), (6,3), (6,2), (6,1)}. (1.21)

We clearly have that

P["At least one 6 shows up"] = P[B] = 11/36. (1.22)
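A short Python sketch (standard library only, not part of the notes) that reproduces (1.22) both by exact counting on the Laplace space and by Monte Carlo simulation:

```python
# Exact and simulated probability of "at least one 6" in two fair die rolls.
import random
from itertools import product

omega = list(product(range(1, 7), repeat=2))
B = [w for w in omega if 6 in w]
print(len(B) / len(omega))               # exact: 11/36 ~ 0.3056

n = 10 ** 5
hits = sum(6 in (random.randint(1, 6), random.randint(1, 6)) for _ in range(n))
print(hits / n)                          # close to 11/36 for large n
```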
Let us now give some elementary but important properties of probabilities.

Proposition 1.9. Let (Ω, F, P) be a probability space and A, B, A_j ∈ F for j ∈ N. Then the following properties hold:

(i) P[∅] = 0,
(ii) P[A^c] = 1 − P[A],
(iii) If A ⊆ B, then P[A] ≤ P[B],
(iv) P[A ∪ B] = P[A] + P[B] − P[A ∩ B],
(v) P[⋃_{j=1}^∞ A_j] ≤ ∑_{j=1}^∞ P[A_j].

Proof. We start with the proof of (i): Since ∅ = ∅ ∪ ∅ ∪ ∅ ∪ ... (and clearly ∅ ∩ ∅ = ∅), we see by (P2) that

P[∅] = ∑_{j=1}^∞ P[∅], (1.23)

which can only be true if P[∅] = 0.

For (ii), note that A and A^c are disjoint and fulfill A ∪ A^c = Ω. We set A_1 = A, A_2 = A^c and A_j = ∅ for j ≥ 3, so that by (P1) and (P2)

1 = P[Ω] = P[A ∪ A^c ∪ ∅ ∪ ∅ ∪ ...] = P[A] + P[A^c] + P[∅] + P[∅] + ... = P[A] + P[A^c], (1.24)

where P[∅] = 0 by (i).

For the proof of (iii), consider B̃ = B \ A (= {ω ∈ B ; ω ∉ A}), so that A ∪ B̃ = A ∪ B = B and A ∩ B̃ = ∅. We find by the same argument as for (ii):

P[B] = P[A ∪ B̃ ∪ ∅ ∪ ∅ ∪ ...] = P[A] + P[B̃] ≥ P[A], (1.25)

since P[B̃] ≥ 0. Note that this calculation shows the stronger statement

A ⊆ B ⇒ P[B \ A] = P[B] − P[A]. (1.26)

For (iv), we define D = B \ (A ∩ B) and note that A ∩ B ⊆ B, A ∪ B = A ∪ D (?) and A ∩ D = ∅. Argument for (?):

ω ∈ A ∪ B ⇔ ω ∈ A or ω ∈ B ⇔ ω ∈ A or ω ∈ B \ (A ∩ B) ⇔ ω ∈ A ∪ D.

Thus, we see by (P2) and (1.26) that

P[A ∪ B] = P[A ∪ D ∪ ∅ ∪ ∅ ∪ ...] = P[A] + P[D] = P[A] + P[B] − P[A ∩ B]. (1.27)

Finally, we prove (v). We define the sets

B_1 = A_1, B_n = A_n \ ⋃_{j=1}^{n−1} A_j, n ≥ 2. (1.28)

The sets B_j, j ∈ N, are pairwise disjoint and fulfill ⋃_{j=1}^∞ A_j = ⋃_{j=1}^∞ B_j as well as B_j ⊆ A_j for every j ∈ N. Therefore we have by (P2) that

P[⋃_{j=1}^∞ A_j] = P[⋃_{j=1}^∞ B_j] = ∑_{j=1}^∞ P[B_j] ≤ ∑_{j=1}^∞ P[A_j]. (1.29)

□

End of Lecture 2

1.4. Elementary Combinatorics

As seen with the example of rolling dice in the previous section, in many elementary cases the assumption that all (finitely many) outcomes are equally likely is justified. In this situation, the probability space is uniquely characterized by the number |Ω| ∈ N. We want to develop effective methods to count the number of outcomes of Ω and events A ⊆ Ω.

Remark 1.10. If N ∈ N random experiments with finite sample spaces Ω_1, Ω_2, ..., Ω_N are performed successively, an appropriate choice for the sample space of the combined experiment is given by the Cartesian product

Ω = ∏_{j=1}^N Ω_j := Ω_1 × Ω_2 × ... × Ω_N = {(ω_1, ..., ω_N) ; ω_j ∈ Ω_j, 1 ≤ j ≤ N}. (1.30)

The cardinality of Ω is given by

|Ω| = ∏_{j=1}^N |Ω_j| = |Ω_1| · |Ω_2| · ... · |Ω_N|. (1.31)

We already saw this in Example 1.2, (iii). Let us motivate some less elementary results with the following example.

Example 1.11. How likely is it that two persons in this room / in this Zoom call have their birthdays on the same date? To make this problem mathematically tractable, we make some simplifying assumptions:

- We assume that every birthday is equally likely, and we ignore leap years.
- We also suppose that the birthdays are identically distributed and independent from each other.

Probabilistic model:

Ω = {1, 2, ..., 365}^r (r is the number of persons),
F = P(Ω),
P[A] = |A| / |Ω|, A ⊆ Ω. (1.32)

The event we are interested in, and its complement, are

A = {(ω_1, ..., ω_r) ∈ Ω ; there exist i ≠ j with ω_i = ω_j},
A^c = {(ω_1, ..., ω_r) ∈ Ω ; ω_i ≠ ω_j for all i ≠ j}. (1.33)

We estimate P[A] (using exp(x) ≈ 1 + x for |x| ≪ 1):

P[A] = 1 − P[A^c] = 1 − |A^c| / |Ω|
     = 1 − (365 · 364 · ... · (365 − r + 1)) / 365^r
     = 1 − 1 · (1 − 1/365) · ... · (1 − (r−1)/365)
     ≈ 1 − exp(−∑_{k=1}^{r−1} k/365) = 1 − exp(−r(r−1)/730) ≈ 1 − exp(−r²/730), (1.34)

since exp(r/730) ≈ 1. For r = 30, r = 40, r = 50 we find P[A] ≈ 0.71, P[A] ≈ 0.89 and P[A] ≈ 0.97, respectively.
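A quick numerical check of (1.34), as a Python sketch (not from the notes) under the same assumptions of 365 equally likely, independent birthdays:

```python
# Exact birthday probability versus the exponential approximation from (1.34).
import math

def p_exact(r):
    # P[A] = 1 - (365 * 364 * ... * (365 - r + 1)) / 365^r
    return 1.0 - math.prod((365 - k) / 365 for k in range(r))

def p_approx(r):
    return 1.0 - math.exp(-r * (r - 1) / 730)

for r in (30, 40, 50):
    print(r, round(p_exact(r), 2), round(p_approx(r), 2))
# exact: 0.71, 0.89, 0.97; the approximation is within about 0.01 of these
```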
In the calculation of |Ω| and |A^c|, we see instances of "selecting an ordered sample from a set with and without repetitions" (the ordered set in question being {1, 2, 3, ..., 365}, from which r elements are drawn). In many situations we are also interested in selecting an unordered sample from a set. All situations are recorded in the following proposition.

Proposition 1.12. The number of choices of a sample of size r ∈ N out of {1, 2, ..., n} is given as follows:

             with repetitions       without repetitions
ordered      n^r                    n!/(n−r)!
unordered    \binom{n+r−1}{r}       \binom{n}{r}

For the case without repetitions, we additionally require r ≤ n. In the table above we used the notations k! = k · (k−1) · ... · 1 for k ∈ N (and 0! = 1), as well as \binom{n}{k} = n! / (k!(n−k)!) for 0 ≤ k ≤ n.

Proof.

- Ordered samples, with repetitions: This is a special case of Remark 1.10. More precisely, we use

Ω_1 = {(ω_1, ..., ω_r) ; ω_j ∈ {1, 2, ..., n}} = {1, 2, ..., n}^r, (1.35)

with |Ω_1| = n^r.

- Ordered samples, without repetitions: Here we use

Ω_2 = {(ω_1, ..., ω_r) ; ω_j ∈ {1, 2, ..., n}, ω_i ≠ ω_j for i ≠ j}, (1.36)

with |Ω_2| = n · (n−1) · ... · (n−r+1).

- Unordered samples, without repetitions: Here we use

Ω_3 = {{ω_1, ..., ω_r} ; ω_j ∈ {1, 2, ..., n}, ω_i ≠ ω_j for i ≠ j}. (1.37)

Here r! |Ω_3| = |Ω_2| = n!/(n−r)! holds: this is because for r ∈ {1, ..., n} different elements ω_1, ..., ω_r, there are exactly r! possibilities of reordering.

- Unordered samples, with repetitions: The sample space can be written as

Ω_4 = {(ω_1, ..., ω_r) ; ω_j ∈ {1, 2, ..., n}, 1 ≤ ω_1 ≤ ω_2 ≤ ... ≤ ω_r ≤ n}. (1.38)

We visualize an element of Ω_4 as follows: we separate the n numbers 1, ..., n by n − 1 lines (|), and for each instance of one of these numbers within the sequence (ω_1, ..., ω_r), we put a dot (•) in the respective bin.

Example: Let n = 6, r = 5. The element (1, 1, 3, 4, 6) ∈ Ω_4 corresponds to the string ••||•|•||•, and the element (2, 2, 2, 5, 5) ∈ Ω_4 corresponds to the string |•••|||••|. The number of different strings corresponds therefore to the number of choices of a set of r elements (the dots) out of a set with n + r − 1 elements (the strings consisting of dots and lines), which is exactly \binom{n+r−1}{r} by the previous step. □
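The four counts of Proposition 1.12 are available directly in Python's standard library (math.comb and math.perm, Python 3.8+); a sketch for concrete n and r:

```python
# The four sampling counts from Proposition 1.12 for n = 6, r = 3.
import math

n, r = 6, 3
print(n ** r)                    # ordered, with repetitions: 216
print(math.perm(n, r))           # ordered, without repetitions: n!/(n-r)! = 120
print(math.comb(n, r))           # unordered, without repetitions: C(n, r) = 20
print(math.comb(n + r - 1, r))   # unordered, with repetitions: C(n+r-1, r) = 56
```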
Example 1.13. A committee of 12 persons consists of 3 representatives of group A, 4 of group B and 5 of group C. We want to choose a subcommittee of 5 persons uniformly at random. What is the probability that this subcommittee consists of

- one member of group A,
- two members of group B,
- two members of group C?

We denote this event by E. Note that we do not specify the order within the groups and, obviously, there are no repetitions in the choice of the members. Thus we have

- \binom{3}{1} choices for the member from group A,
- \binom{4}{2} choices for the members from group B,
- \binom{5}{2} choices for the members from group C,

and of course \binom{12}{5} for the entire subcommittee. Therefore the probability we look for is

P[E] = \binom{3}{1} · \binom{4}{2} · \binom{5}{2} / \binom{12}{5} = 5/22 ≈ 0.23. (1.39)

2. Conditional probability and stochastic independence

(Reference: [1, Chapter 3], or [2, Sections 3.1, 3.3])

In this chapter we introduce the notion of conditional probability. Intuitively, the idea is that the existence of "partial knowledge" should influence how we determine the likelihood of a given outcome.

2.1. Conditional probability

Let us start with a very easy example.

Example 2.1. We throw two dice and ask for the probability that the sum of the numbers of both dice is smaller than or equal to 7. We call this event A. Assuming that the dice are fair, this experiment is modelled by

(Ω, F, P) = ({1, 2, 3, 4, 5, 6}^2, P(Ω), P), P[·] = |·| / 36. (2.1)

Of course A is given by

A = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (5,1), (5,2), (6,1)}. (2.2)

So P[A] = 21/36 = 7/12. Now imagine we are given the information that one of the dice shows the number 6. We call this event B, i.e.

B = {(1,6), (2,6), (3,6), (4,6), (5,6), (6,6), (6,1), (6,2), (6,3), (6,4), (6,5)}. (2.3)

If we already know that B happens, how likely is A? Clearly, the only outcomes of A that can still have occurred are

A ∩ B = {(1,6), (6,1)}. (2.4)

Thus, knowing that B occurred, we should now estimate the probability that A occurs by restricting the sample space Ω to B, so:

(|A ∩ B| / |Ω|) / (|B| / |Ω|) = |A ∩ B| / |B| = 2/11. (2.5)

We elevate the term on the left-hand side to a definition.

Definition 2.2. Let (Ω, F, P) be a probability space. Assume that the event B ∈ F has a positive probability P[B] > 0. We define the conditional probability of A ∈ F given B by

P[A|B] = P[A ∩ B] / P[B]. (2.6)

Remark 2.3.

(i) If the events A and B are mutually exclusive (A ∩ B = ∅), then we always have P[A|B] = 0, whenever the latter is defined.

(ii) One can rewrite the equation (2.6) as

P[A ∩ B] = P[B] · P[A|B]. (2.7)

This is sometimes called the multiplication theorem.

Proposition 2.4. Let (Ω, F, P) be a probability space. Assume that the event B ∈ F has a positive probability P[B] > 0. Then P[·|B] defines a probability distribution on (Ω, F) as well.

Proof. The proof is given as an exercise.

End of Lecture 3
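Conditional probabilities on a Laplace space reduce to counting, as in (2.5). A minimal Python sketch (not from the notes; exact arithmetic via fractions) for Example 2.1:

```python
# P[A|B] for Example 2.1 by direct counting on {1,...,6}^2.
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))
A = {w for w in omega if sum(w) <= 7}        # sum of the two dice is <= 7
B = {w for w in omega if 6 in w}             # at least one die shows a 6

P = lambda E: Fraction(len(E), len(omega))   # uniform distribution on omega
print(P(A))                                  # 21/36 = 7/12
print(P(A & B) / P(B))                       # P[A|B] = (2/36)/(11/36) = 2/11
```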
2.2. The law of total probability and Bayes' theorem

In this section, we fix a probability space (Ω, F, P). The following result is known as the law of total probability.

Theorem 2.5. Let B_1, ..., B_n ∈ F with P[B_j] > 0 for all 1 ≤ j ≤ n, B_j ∩ B_k = ∅ for every j ≠ k, and ⋃_{j=1}^n B_j = Ω. Then we have for all A ∈ F that

P[A] = ∑_{j=1}^n P[B_j] P[A|B_j]. (2.8)

Proof. Note that

P[A] = P[A ∩ Ω] = P[A ∩ ⋃_{j=1}^n B_j] = P[⋃_{j=1}^n (A ∩ B_j)] = ∑_{j=1}^n P[A ∩ B_j] = ∑_{j=1}^n P[B_j] P[A|B_j], (2.9)

using Proposition 1.9, (iv) (additivity over the disjoint sets A ∩ B_j) and (2.7). □

We can right away combine the previous theorem with the definition of the conditional probability to obtain the following result, called Bayes' theorem:

Theorem 2.6. Under the same assumptions as Theorem 2.5, we have for every 1 ≤ k ≤ n that

P[B_k|A] = P[B_k] P[A|B_k] / ∑_{j=1}^n P[B_j] P[A|B_j]. (2.10)

Proof. By (2.6), (2.7) and (2.8),

P[B_k|A] = P[B_k ∩ A] / P[A] = P[B_k] P[A|B_k] / ∑_{j=1}^n P[B_j] P[A|B_j]. (2.11)

□

Example 2.7. Biochemical tests for a certain marker / antigen / disease / ... within a population are never absolutely reliable. We consider a test with the following properties: Let T denote the event "the test is positive". Let M be the event "a given individual has the marker". We assume that the test in question satisfies

P[T|M] = 0.99 (sensitivity),
P[T^c|M^c] = 0.99 (specificity), (2.12)

and the marker we are looking for is such that for the population which is considered,

P[M] = 0.01 (prevalence). (2.13)

What is the probability that someone who tests positive actually has the marker / antigen / disease? Observe that P[T|M^c] = 1 − 0.99 = 0.01. We use Bayes' theorem 2.6:

P[M|T] = P[T|M] · P[M] / (P[T|M] · P[M] + P[T|M^c] · P[M^c])
       = (0.99 · 0.01) / (0.99 · 0.01 + 0.01 · 0.99)
       = 1/2. (2.14)

We see that because the trait under consideration is so rare, half of those who test positive do not actually have this trait, even though the test is fairly reliable!

Remark 2.8. Both the law of total probability and Bayes' theorem are valid if we have countably many pairwise disjoint B_1, B_2, ... ∈ F with P[B_j] > 0 for all j ∈ N and ⋃_{j=1}^∞ B_j = Ω. In this case, (2.8) and (2.10) become

P[A] = ∑_{j=1}^∞ P[B_j] P[A|B_j], and (2.15)
P[B_k|A] = P[B_k] P[A|B_k] / ∑_{j=1}^∞ P[B_j] P[A|B_j], (2.16)

respectively.

2.3. Stochastic independence

We now introduce the notion of stochastic independence of events, which is one of the central concepts in probability theory and statistics. Again, we fix a probability space (Ω, F, P) in this section.

Heuristics: The events A, B ∈ F should be independent if the occurrence of A has no influence on the occurrence of B, and vice versa. Specifically, if A happens, it should be neither more nor less likely that B occurs, and vice versa, so

P[A] = P[A|B] = P[A ∩ B] / P[B],
P[B] = P[B|A] = P[A ∩ B] / P[A], (2.17)

where we implicitly assumed that P[A], P[B] > 0. We turn this reasoning into a definition.

Definition 2.9.

(i) The events A, B ∈ F are called (stochastically) independent if

P[A ∩ B] = P[A] · P[B]. (2.18)

(ii) Let n ∈ N, n ≥ 2. The events A_1, A_2, ..., A_n ∈ F are called jointly (stochastically) independent if for every {i_1, ..., i_m} ⊆ {1, ..., n} with i_1, ..., i_m pairwise distinct,

P[A_{i_1} ∩ ... ∩ A_{i_m}] = P[A_{i_1}] · ... · P[A_{i_m}]. (2.19)

Remark 2.10.

(i) Note that both definitions include the case that a given event has probability zero. If we assume that P[A] > 0 and P[B] > 0, then (2.17) and (2.18) are equivalent.

(ii) The events ∅ and Ω are independent from any other given event. Intuitively, they contain "no additional information" on the probability.

(iii) Equation (2.19) means that the occurrence of any subset of the events A_1, ..., A_n does not give additional information on the occurrence of the others. For instance,

P[A_1 | A_2 ∩ ... ∩ A_n] = P[A_1 ∩ A_2 ∩ ... ∩ A_n] / P[A_2 ∩ ... ∩ A_n] = ∏_{j=1}^n P[A_j] / ∏_{j=2}^n P[A_j] = P[A_1], (2.20)

provided that P[A_2 ∩ ... ∩ A_n] > 0.

(iv) We stress that stochastic independence of two events A and B has nothing to do with them being disjoint as sets! In fact, if A and B are disjoint and independent, then

0 = P[∅] = P[A ∩ B] = P[A] · P[B], (2.21)

so unless P[A] = 0 or P[B] = 0, disjoint events A and B are not independent.

We illustrate the concept of independence with a number of examples.
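The computation (2.14) is a one-liner once sensitivity, specificity and prevalence are fixed; a short Python sketch (not from the notes), transcribing Example 2.7 directly:

```python
# Bayes' theorem for the biochemical test of Example 2.7.
sens = 0.99          # P[T | M]
spec = 0.99          # P[T^c | M^c]
prev = 0.01          # P[M]

p_T = sens * prev + (1 - spec) * (1 - prev)   # law of total probability, (2.8)
p_M_given_T = sens * prev / p_T               # Bayes' theorem, (2.10)
print(p_M_given_T)                            # 0.5
```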
Example 2.11.

(i) We draw a card randomly from a standard card deck (Footnote: With 52 French-suited playing cards.), with

Ω = {(i, j) ; i ∈ {♣, ♠, ♦, ♥}, j ∈ {1, 2, ..., 13}}, (2.22)

equipped with the discrete uniform distribution. Consider the events

A = {(♥, j) ; j ∈ {1, ..., 13}} = drawing a ♥-card,
B = {(i, 1) ; i ∈ {♣, ♠, ♦, ♥}} = drawing an ace. (2.23)

Clearly, we have that

A ∩ B = {(♥, 1)}. (2.24)

With this, we see that

P[A] = |A| / |Ω| = 13/52 = 1/4, P[B] = |B| / |Ω| = 4/52 = 1/13,
P[A ∩ B] = |A ∩ B| / |Ω| = 1/52 = P[A] · P[B]. (2.25)

So A and B are independent.

(ii) Consider tossing a fair coin twice: Ω = {(H,H), (H,T), (T,H), (T,T)}, with

A = "heads" comes up in the first round = {(H,H), (H,T)}, P[A] = 1/2,
B = "heads" comes up in the second round = {(H,H), (T,H)}, P[B] = 1/2,
C = "heads" comes up exactly once = {(H,T), (T,H)}, P[C] = 1/2.

We have P[A ∩ B] = 1/4 and P[A ∩ C] = 1/4 = P[B ∩ C]. However:

P[A ∩ B ∩ C] = P[∅] = 0 ≠ P[A] · P[B] · P[C].

This shows that the events A, B and C are pairwise independent (this means every two events out of {A, B, C} are independent), but not jointly independent.

We finish this section with the following result:

Theorem 2.12. Let A_1, A_2, ..., A_n ∈ F be jointly independent. Then also B_1, B_2, ..., B_n with B_i ∈ {A_i, A_i^c}, for 1 ≤ i ≤ n, are jointly independent.

Proof. We only show the case n = 2 (the general case follows by induction). Since (A_1 ∩ A_2) ∪ (A_1 ∩ A_2^c) = A_1 is a disjoint union, we have

P[A_1 ∩ A_2] + P[A_1 ∩ A_2^c] = P[A_1], with P[A_1 ∩ A_2] = P[A_1] · P[A_2],
⇒ P[A_1 ∩ A_2^c] = P[A_1] · (1 − P[A_2]) = P[A_1] · P[A_2^c]. (2.26)

By changing the roles of A_1 and A_2, we also have

P[A_1^c ∩ A_2] = P[A_1^c] · P[A_2]. (2.27)

We can finally use the same argument as in (2.26) (which implied the independence of A_1 and A_2^c from the independence of A_1 and A_2) to infer the independence of A_1^c and A_2^c from the independence of A_1^c and A_2. □

End of Lecture 4
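The distinction between pairwise and joint independence in Example 2.11 (ii) can be checked by exhaustive counting; a Python sketch (not from the notes, exact arithmetic):

```python
# Pairwise versus joint independence for two fair coin tosses.
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}           # heads in round 1
B = {w for w in omega if w[1] == "H"}           # heads in round 2
C = {w for w in omega if w.count("H") == 1}     # heads exactly once

assert P(A & B) == P(A) * P(B)                  # pairwise independent
assert P(A & C) == P(A) * P(C) and P(B & C) == P(B) * P(C)
print(P(A & B & C), P(A) * P(B) * P(C))         # 0 vs 1/8: not jointly independent
```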
j=1 1 We recall again that finite sets are countable by our convention. 23 j=1 ω∈Aj (3.5) 3. Discrete distributions In the second equality, we used again the fact that the (p(ω))ω∈Ω are non-negative. We will now present some of the most important discrete distributions. The discrete uniform distribution U(Ω) Ω finite , p(ω) = 1 . |Ω| (3.6) This is just giving a name for the distribution considered already multiple times, see Example 1.8. The Bernoulli distribution Ber(p) Ω = {0, 1}, p(1) = p ∈ [0, 1], p(0) = 1 − p. (3.7) The Bernoulli distribution models random experiments in which a “success” occurs with probability p, and a “failure” occurs with probability 1 − p (for instance, tossing a biased coin). Such experiments are also called Bernoulli experiments. The Binomial distribution Bin(n, p) Ω = {0, 1, ..., n}, Note that n X k=0   n k p(k) = p (1 − p)n−k , k p ∈ [0, 1], n ∈ N. n   X n k p(k) = p (1 − p)n−k = (p + 1 − p)n = 1. k (3.8) (3.9) k=0 The binomial distribution is extending the Bernoulli distribution in the following way: It models how many attempts out of n independent experiments with the same success parameter p ∈ [0, 1] are successful. To explain it, consider the auxiliary probability space ({0, 1}n , P({0, 1}n ), Q), Pn Q[{(ω1 , ..., ωn )}] = p j=1 ωj (1 − p)n− | {z } | {z p# of successes Pn j=1 ωj (1−p)# of failures . (3.10) } Here, the string (ω1 , ..., ωn ) ∈ {0, 1}n stands for the successes and failures of the experiment in the order observed, i.e. ( 1, if the jth experiment is a success, ωj = (3.11) 0, if the jth experiment is a failure. For instance, the string (1, 0, 0, 1) means that the first and last of four experiments are successes, whereas the second and third experiments are failures. By the product structure in (3.10), the experiments are independent. Now consider the event (for 0 ≤ k ≤ n) n X  Ek = (ω1 , ..., ωn ) ∈ {0, 1}n ; ωj = k = “exactly k sucesses”. j=1 24 (3.12) 3. Discrete distributions We have Q[Ek ] = |Ek |pk (1 − p)n−k =   n k p (1 − p)n−k . k (3.13) Example 3.3. We throw a die 4 times and we are interested in the number of times that the number six shows up. This is modelled by the binomial distribution Bin(4, 16 ). In the description (3.10) “0” stands for the occurence of a number other than six (failure), whereas “1” stands for the occurence of a six (success). In this example, we have Outcomes in {0, 1}4 Probability 4 p(0) = 56  3 p(1) = 41 · 61 · 56 ,  2 2 p(2) = 42 · 61 · 56  3 p(3) = 43 · 61 · 56 4 p(4) = 16 (0, 0, 0, 0) (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1) (1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1) (0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (0, 0, 0, 1) (1, 1, 1, 1) We can also order the outcomes in the form of a “tree diagram” as follows: 1−p 0 1−p 1−p 0 0 p 1−p p p 1 1−p 0 p 1 1−p 0 p 1 0 (0, 0, 1, 0) (0, 0, 1, 1) (0, 1, 0, 1) (0, 1, 1, 0) (0, 1, 1, 1) (1, 0, 0, 0) (1, 0, 0, 1) (1, 0, 1, 0) 1−p 0 p (0, 0, 0, 1) (0, 1, 0, 0) 1 1 (0, 0, 0, 0) 1 (1, 0, 1, 1) (1, 1, 0, 0) (1, 1, 0, 1) (1, 1, 1, 0) p 1 (1, 1, 1, 1) Figure 3.1.: Tree diagram of 4 successive independent Bernoulli experiments. Remark 3.4. The above example shows that the same question “How likely is it that the number six comes up exactly twice when rolling a die four times?” is treated much more efficiently on the probability space (Ω = {0, 1, 2, 3, 4}, P(Ω), P), P = Bin(4, 61 ), where we simply have    2  2 4 1 5 , P[“2 sixes”] = P[{2}] = · 2 6 6 25 3. 
The Geometric distribution Geo(p)

Ω = N, p(k) = (1−p)^{k−1} p, p ∈ (0, 1). (3.14)

Note that

∑_{k=1}^∞ p(k) = ∑_{k=1}^∞ (1−p)^{k−1} p = p / (1 − (1−p)) = 1. (3.15)

The interpretation of the geometric distribution is the number of repetitions of a Bernoulli experiment (with success parameter p ∈ (0, 1)) until the first success.

The Hypergeometric distribution H(N, M, n)

Ω = {0, ..., n}, p(k) = \binom{M}{k} · \binom{N−M}{n−k} / \binom{N}{n}, N, M, n ∈ N, 0 ≤ n, M ≤ N. (3.16)

The hypergeometric distribution should be understood as follows: out of a set of N elements, M elements have a certain favorable property. We choose uniformly at random an unordered sample of 0 ≤ n ≤ N elements out of the large set without repetitions. Then p(k) denotes the probability that exactly 0 ≤ k ≤ n have the favorable property (this probability is always zero if M < k, which can happen if n > M).

Example 3.5. In an urn there are 10 balls, three green and seven red. We draw (at once) four balls from the urn. Here

N = 10 (# of balls in urn), M = 3 (# of green balls), n = 4 (# of balls drawn). (3.17)

The probability that exactly two of the balls drawn are green is

p(2) = \binom{3}{2} · \binom{7}{2} / \binom{10}{4}.

The Poisson distribution Pois(λ)

Ω = N_0, p(k) = (λ^k / k!) e^{−λ}, λ > 0. (3.18)

Note that

∑_{k=0}^∞ p(k) = ∑_{k=0}^∞ (λ^k / k!) e^{−λ} = e^{−λ} · e^λ = 1. (3.19)

The Poisson distribution is a natural distribution for modelling events that in principle can occur infinitely often (for instance the number of goals in a football game, or the number of raindrops falling in a given area during a given time, ...). An important application is the following approximation of the binomial distribution by the Poisson distribution:

Proposition 3.6. Let λ > 0 be fixed and p_n = λ/n for n ∈ N. Then, for every k ∈ N_0, it holds that

lim_{n→∞} \binom{n}{k} p_n^k (1−p_n)^{n−k} = (λ^k / k!) e^{−λ}. (3.20)

Proof.

\binom{n}{k} p_n^k (1−p_n)^{n−k} = (n! / (k!(n−k)!)) · (λ^k / n^k) · (1 − λ/n)^n · (1 − λ/n)^{−k}
= (λ^k / k!) · (n/n) · ((n−1)/n) · ... · ((n−k+1)/n) · (1 − λ/n)^n · (1 − λ/n)^{−k}
→ (λ^k / k!) · 1 · e^{−λ} · 1, as n → ∞. (3.21)

□

End of Lecture 5
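A numerical illustration of Proposition 3.6, as a Python sketch (not from the notes): the Bin(n, λ/n) probabilities approach the Pois(λ) mass function as n grows.

```python
# Poisson approximation of the binomial distribution (Proposition 3.6).
import math

lam, k = 2.0, 3
poisson = lam ** k / math.factorial(k) * math.exp(-lam)

for n in (10, 100, 1000, 10000):
    p = lam / n
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    print(n, round(binom, 6))
print("limit:", round(poisson, 6))   # lambda^k e^{-lambda} / k! ~ 0.180447
```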
4. Introduction to statistical tests and Neyman-Pearson lemma

(Reference: [1, Chapters 25-26], or [2, Sections 10.1-10.2])

In this section, we introduce the notion of statistical tests and prove the Neyman-Pearson lemma for discrete distributions. We motivate this using an example.

4.1. Basic notions of statistical tests

Example 4.1. We consider a certain drug A that has a known efficacy of 60%. We want to evaluate the

Claim: A new (more expensive) drug B has an efficacy of 70%.

To see whether the claim is valid, the drug B is tested with 100 persons. We choose the model

Ω = {0, 1, ..., n}, n = 100, (4.1)

and the outcome ω ∈ Ω is the number of persons out of the 100 with a positive reaction to the administered drug B.

Question: Do we have

P = Bin(n, p), p = 0.6 (null hypothesis H_0), or
Q = Bin(n, q), q = 0.7 (alternative hypothesis H_1)? (4.2)

The challenge is to determine a statistical test that decides between H_0 and H_1, based on the test result:

φ : Ω → {0, 1}, (4.3)

which should be "optimal" in a suitable sense. Here, φ(ω) = 0 models keeping the null hypothesis with the observed outcome ω, and φ(ω) = 1 models rejecting the null hypothesis with the observed outcome ω.

Definition 4.2. Let Ω be countable and φ : Ω → {0, 1} a statistical test. A type I error occurs if we falsely reject the null hypothesis H_0, i.e. φ(ω) = 1 when H_0 is correct. A type II error occurs if we falsely do not reject the null hypothesis H_0, i.e. φ(ω) = 0 when H_1 is correct.

The null hypothesis H_0 is chosen such that falsely rejecting it (a type I error) should be unlikely. (4.4)

This means: when we set up statistical tests, a false rejection of H_0, i.e. a type I error, is considered the more problematic error than a false non-rejection, i.e. a type II error. We illustrate this with two examples:

- Going mushroom hunting: Suppose we are unsure whether certain collected wild mushrooms are edible or poisonous. Then

H_0: mushrooms are poisonous,
H_1: mushrooms are edible.

We have chosen H_0 in such a way that falsely rejecting it is the more severe error!

- Let us consider the previous Example 4.1. We want to avoid a situation in which a new and more expensive drug is authorized while being no more effective than the standard one used. According to this, we should define

H_0: drug B has the same efficacy 0.6 as drug A,
H_1: drug B has a higher efficacy of 0.7 than drug A.

This is of course nothing else than (4.2).

How likely are the errors of type I and II? We consider the case of simple hypotheses where H_0 corresponds to the distribution P and H_1 to the distribution Q. This is in line with our Example 4.1. The probability of a type I error is given by

P[φ = 1] = P[{ω ∈ Ω ; φ(ω) = 1}]. (4.5)

The probability of a type II error on the other hand is

Q[φ = 0] = Q[{ω ∈ Ω ; φ(ω) = 0}]. (4.6)

If we consider the extreme case

φ_extreme(ω) = 0 for all ω ∈ Ω, (4.7)

where we never reject H_0, we see that the type I error has probability

P[φ_extreme = 1] = P[∅] = 0. (4.8)

On the other hand, the type II error of course has probability

Q[φ_extreme = 0] = Q[Ω] = 1. (4.9)

For a meaningful test, we need to do the following:

- Specify an upper bound α ∈ (0, 1) for the type I error, that is, require

P[φ = 1] ≤ α. (4.10)

The value α is called the significance level for the test. Typical levels in practice are α = 0.05 or α = 0.01.

- Minimize the value of Q[φ = 0] under the constraint P[φ = 1] ≤ α. If a solution φ* : Ω → {0, 1} exists, it is called the best / most powerful test at level α.

We return to Example 4.1: Heuristically, given α ∈ (0, 1), we should try to look for a test φ : Ω = {0, ..., 100} → {0, 1} of the form

φ(ω) = 1, if ω ∈ C_α = {k_α, k_α + 1, ..., 100}, i.e. ω ≥ k_α,
φ(ω) = 0, if ω ∈ C_α^c = {0, 1, ..., k_α − 1}, i.e. ω < k_α, (4.11)

for some k_α ∈ {0, ..., 100} to be determined. The range

C_α = {k_α, k_α + 1, ..., 100} (4.12)

is called the critical region for the test φ.

End of Lecture 6
4.2. The Neyman-Pearson lemma

In this section, we construct optimal tests for two simple hypotheses against each other. This is done in the following result, known as the Neyman-Pearson lemma.

Lemma 4.3. Let Ω be countable and P and Q two discrete probability distributions on (Ω, P(Ω)), with probability mass functions p and q, respectively. We let

L(ω) = q(ω) / p(ω) ∈ [0, ∞], ω ∈ Ω, (4.13)

(where x/0 := ∞ for x ≥ 0) be the likelihood quotient of Q with respect to P. Assume that there exists a test φ* : Ω → {0, 1} with

φ*(ω) = 1 if L(ω) ≥ c*_α, and φ*(ω) = 0 if L(ω) < c*_α, (4.14)

and

P[φ* = 1] = P[L ≥ c*_α] = α. (4.15)

Then every other test φ : Ω → {0, 1} with P[φ = 1] ≤ α fulfills

Q[φ = 0] ≥ Q[φ* = 0]. (4.16)

Proof. Consider the two critical regions

C*_α = {ω ∈ Ω ; φ*(ω) = 1}, C_α = {ω ∈ Ω ; φ(ω) = 1}. (4.17)

Now note that ω ∈ C*_α ⇔ q(ω) − c*_α p(ω) ≥ 0, and therefore

∑_{ω∈C*_α} [q(ω) − c*_α p(ω)] ≥ ∑_{ω∈C_α} [q(ω) − c*_α p(ω)]
⇒ Q[φ* = 1] − c*_α P[φ* = 1] ≥ Q[φ = 1] − c*_α P[φ = 1]
⇒ Q[φ* = 1] − Q[φ = 1] ≥ c*_α (P[φ* = 1] − P[φ = 1]) ≥ 0,

using P[φ* = 1] = α and P[φ = 1] ≤ α in the last step. Hence

Q[φ = 0] = 1 − Q[φ = 1] ≥ 1 − Q[φ* = 1] = Q[φ* = 0]. □

Example 4.4. We continue Example 4.1. Here we have for k ∈ {0, 1, 2, ..., n}:

H_0: p(k) = \binom{n}{k} p^k (1−p)^{n−k}, p = 0.6,
H_1: q(k) = \binom{n}{k} q^k (1−q)^{n−k}, q = 0.7,

⇒ L(k) = q(k) / p(k) = (q/p)^k ((1−q)/(1−p))^{n−k} = ((1−q)/(1−p))^n · [(q/(1−q)) / (p/(1−p))]^k ≥ c*_α. (4.18)

Since q > p, we have (q/(1−q)) / (p/(1−p)) > 1, and so

L(k) ≥ c*_α ⇔ k ≥ k*_α. (4.19)

Then choose k*_α such that

P[φ* = 1] = P[L(k) ≥ c*_α] = P[k ≥ k*_α] = α. (4.20)

We consider n = 100, p = 0.6 and α ≲ 0.01. (Footnote: Not every α ∈ (0, 1) can be chosen, see also Remark 4.5, (i) below.) We require

P[k ≥ k*_α] = P[{k*_α, ..., n}] = ∑_{k=k*_α}^n \binom{n}{k} p^k (1−p)^{n−k} = α. (4.21)

We compute that for k*_α = 72, we have α = 0.0084. The optimal test at this level is given by

φ* : {0, 1, ..., 100} → {0, 1}, φ*(k) = 1 if k ≥ 72 (H_0 is rejected), φ*(k) = 0 if k < 72 (H_0 is not rejected). (4.22)

For the type I error, we have the probability

P[φ* = 1] = P[k ≥ k*_α] = 0.0084. (4.23)

The type II error is given by

Q[φ* = 0] = Q[k < k*_α] = ∑_{k=0}^{k*_α − 1} \binom{100}{k} q^k (1−q)^{100−k} ≈ 0.62. (4.24)

Note that the type II error is very large, but cannot be decreased, since the test is optimal by the Neyman-Pearson Lemma 4.3. To decrease the error, we need to increase n.

Remark 4.5.

(i) Let us stress again that in the discrete set-up, not all significance levels α ∈ (0, 1) can be chosen. The reason is that the probability α = P[φ = 1] can only attain countably many values. Again in the above example where P = Bin(100, 0.6), we have

...
P[{70, 71, ..., 100}] = 0.0248,
P[{71, 72, ..., 100}] = 0.0148,
P[{72, 73, ..., 100}] = 0.0084, (4.25)
P[{73, 74, ..., 100}] = 0.0046,
...

The values on the right-hand side in (4.25) are some of the possible choices for α. In practice, if we are looking for a test at significance level α = 0.01, we look for the largest α ≤ 0.01 for which a test exists. From (4.25), we see that in the case of Examples 4.1 and 4.4, this means we take α = 0.0084 and k*_α = 72.

(ii) We used q = 0.7 to establish the equivalence (4.19), but in fact the only information we used was that q > p. In other words, we could actually strengthen the test by not testing H_0: P = Bin(100, 0.6) against H_1: Q = Bin(100, 0.7), but in fact test

P = Bin(n, p), p = 0.6 (null hypothesis H_0), against
Q = Bin(n, q), q > 0.6 (alternative hypothesis H̃_1). (4.26)

The decisive factor here is that the likelihood ratio L(k) = q(k)/p(k) is monotone in k, which is why (4.19) is valid. We say that the test φ* is a uniformly most powerful test for H_0: P = Bin(n, p) with p = 0.6 against H_1: Q = Bin(n, q) with q > 0.6.
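The numbers in Example 4.4 can be reproduced directly from binomial tail sums; a Python sketch (not from the notes, standard library only):

```python
# Neyman-Pearson threshold for H0 = Bin(100, 0.6) vs H1 = Bin(100, 0.7).
import math

def binom_tail(k_min, n, p):
    # P[X >= k_min] for X ~ Bin(n, p)
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

n, p, q, alpha = 100, 0.6, 0.7, 0.01
k_star = next(k for k in range(n + 1) if binom_tail(k, n, p) <= alpha)
print(k_star, round(binom_tail(k_star, n, p), 4))   # 72, 0.0084 (attained level)
print(round(1 - binom_tail(k_star, n, q), 2))       # type II error Q[k < 72] ~ 0.62
```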
For completeness, we introduce two more notions concerning statistical tests and relate them to the previous discussion.

Definition 4.6. Let P, Q be as in Lemma 4.3. We consider the respective Neyman-Pearson test φ*.

(i) Assume Ω ⊆ N_0 and that the likelihood quotient L(ω) of Q with respect to P is increasing in ω. Let ω̃ ∈ Ω be the observed value of the test. The p-value of the test φ* given ω̃ is given by

p-value = P[{ω ∈ Ω ; ω ≥ ω̃}]. (4.27)

(ii) If Ω ⊆ N_0 and the likelihood quotient L(ω) of Q with respect to P is decreasing in ω and ω̃ ∈ Ω is the observed value of the test, then the p-value of the test φ* given ω̃ is

p-value = P[{ω ∈ Ω ; ω ≤ ω̃}]. (4.28)

(iii) The power of the test φ* is given by

1 − β = Q[φ* = 1]. (4.29)

Let us shortly discuss these notions: The p-value of a test with a given observation ω̃ ∈ Ω is the smallest level of significance under which the hypothesis H_0 can be rejected. Note that it depends on the test and the realized observation ω̃. In other words:

ω ∈ C*_α ⇔ p-value for ω < α. (4.30)

The power of the test φ* is simply the probability that H_0 is rejected when H_1 is true. In (4.29), β stands for the probability of a type II error (H_0 is not rejected when H_1 is true).

To summarize, here is a step-by-step procedure to construct and evaluate a Neyman-Pearson test of the simple hypothesis H_0 = {P} against H_1 = {Q}:

1. A level of significance α ∈ (0, 1), depending on the statistical problem in question, is given, for instance α = 0.01. This level is an upper bound for the type I error P[φ = 1].

2. The optimal test (for which Q[φ = 0] is minimal) is given by the Neyman-Pearson lemma:

φ*(ω) = 1, if L(ω) = q(ω)/p(ω) ≥ c*_α; φ*(ω) = 0, if L(ω) < c*_α. (4.31)

3. The value c*_α is determined by the condition P[φ* = 1] = P[L ≥ c*_α] = α̃, with α̃ ≤ α the largest possible value out of the set {P[L ≥ c] ; c ∈ [0, ∞]} ∩ [0, α].

4. If Ω ⊆ N_0 and L(k) ≡ L(ω) is monotonically increasing in k, we can simplify the test to

φ*(k) = 1, if k ≥ k*_α; φ*(k) = 0, if k < k*_α. (4.32)

The value k*_α is then determined by P[{k ∈ Ω ; k ≥ k*_α}] = α̃, where α̃ ≤ α is the largest possible value out of the set {P[{k, k+1, ...}] ; k ∈ Ω} ∩ [0, α]. Similar arguments can be made if L(k) is monotonically decreasing in k.

5. Depending on the nature of the likelihood quotient, we can check whether the test φ* is in fact a uniformly most powerful test of H_0: {P} against a larger class of measures H̃_1: {Q_θ}_θ.

5. Continuous distributions

(Reference: [1, Chapter 5], or [2, Sections 1.2, 2.5.2, 2.6])

In this section, we introduce continuous distributions on Ω ⊆ R. Specifically, we want to be able to talk about probability spaces like (R, F, P) or ([0, 1], F, P) with appropriate choices of F and P. This requires some more details about σ-algebras, which we will present without proofs.

Example 5.1. Consider the arrival of a train with delay. We assume that its arrival is "uniformly distributed" between 1 PM and 2 PM. How can we model this? Suppose 0 corresponds to 1 PM and 1 corresponds to 2 PM. We split [0, 1) into half-open intervals of equal length 1/n, n ∈ {2, 3, 4, ...}, whose leftmost points are

{j/n ; 0 ≤ j ≤ n − 1} = {0, 1/n, 2/n, ..., 1 − 1/n} ⊆ [0, 1). (5.1)

The probability for the train to arrive within one of these intervals Δ_j = [j/n, (j+1)/n), 0 ≤ j ≤ n − 1, should be 1/n. For 0 ≤ a < b < 1, we should have approximately

P[[a, b)] ≈ ∑_{j : a ≤ j/n < b} 1/n → ∫_a^b 1 dx, as n → ∞.
(ii) Let Y ∼ E(λ) be an exponentially distributed random variable with parameter λ > 0. Then the cumulative distribution function of Y is given by

F_Y(x) = ∫_{−∞}^x λ e^{−λt} 1_{[0,∞)}(t) dt = 0 for x < 0, and 1 − e^{−λx} for x ≥ 0. (6.16)

[Figure 6.2: Cumulative distribution function of Y ∼ E(λ).]

(iii) In the previous examples (i) and (ii), the random variable X is discrete and Y is continuous. Let us stress that this is not a dichotomy. For instance, let Y ∼ N(0, 1), and let

Z = Y · 1_{{Y ≥ 0}}. (6.17)

Here we have

F_Z(x) = 0 for x < 0, and F_Z(x) = Φ(x) = ∫_{−∞}^x φ(t) dt for x ≥ 0. (6.18)

Note that Z is neither continuous (P[Z = 0] = 1/2), nor discrete (it can attain all values in [0, ∞)).

[Figure 6.3: Cumulative distribution function of Z as defined in (6.17).]

End of Lecture 9

We will now collect some general facts about cumulative distribution functions.

Lemma 6.7. Let X : (Ω, F) → (R, B(R)) be a real random variable. Its cumulative distribution function F = F_X satisfies the following properties:

(i) F(x) ∈ [0, 1] for all x ∈ R.
(ii) F is non-decreasing.
(iii) F is right continuous, i.e.

lim_{ε↓0} F(x + ε) = F(x). (6.19)

(iv) lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

Proof. Claim (i) follows since F(x) = P[X ∈ (−∞, x]] ∈ [0, 1]. For claim (ii), we use the fact that for x ≤ x', we have (−∞, x] ⊆ (−∞, x'] and so

F(x) = P_X[(−∞, x]] ≤ P_X[(−∞, x']] = F(x'). (6.20)

Claim (iii) follows from the fact that for any probability measure Q and a sequence (A_n)_{n∈N} ⊆ F with A_1 ⊇ A_2 ⊇ A_3 ⊇ ..., we have

Q[⋂_{n=1}^∞ A_n] = lim_{n→∞} Q[A_n], (6.21)

which is an exercise. We apply this to the probability measure Q = P_X and the sets A_n = (−∞, x_n], where x_n ↓ x for some x ∈ R. Then

F_X(x_n) = P_X[(−∞, x_n]] → P_X[(−∞, x]] = F_X(x), as n → ∞, (6.22)

since ⋂_{n=1}^∞ (−∞, x_n] = (−∞, x]. For (iv), we first see that for a sequence (A_n)_{n∈N} ⊆ F with A_1 ⊆ A_2 ⊆ A_3 ⊆ ..., we have

Q[⋃_{n=1}^∞ A_n] = lim_{n→∞} Q[A_n], (6.23)

taking complements in (6.21). Now we consider a sequence of real numbers (a_n)_{n∈N} with a_n → ∞ as n → ∞. Then

F_X(a_n) = P_X[(−∞, a_n]] → P_X[R] = 1, as n → ∞, (6.24)

since R = ⋃_{n=1}^∞ (−∞, a_n] and (−∞, a_n] ⊆ (−∞, a_{n+1}] for all n ∈ N. The other claim follows similarly, again by using (6.21). □

In fact, the above properties characterize distribution functions in the following sense:

Theorem 6.8. Let F : R → R satisfy the properties (i)-(iv) of Lemma 6.7. Then there exists a probability space (Ω, F, P) and a random variable X : (Ω, F) → (R, B(R)) such that F = F_X. The law P_X of X is uniquely determined by F.

Proof. We define X : ((0, 1), B((0, 1))) → (R, B(R)) as follows:

X(ω) = sup{y ∈ R ; F(y) < ω}. (6.25)

Note that

{ω ∈ (0, 1) ; X(ω) ≤ x} = {ω ∈ (0, 1) ; ω ≤ F(x)}, x ∈ R. (6.26)

Indeed, if ω ≤ F(x), then x ∉ {y ∈ R ; F(y) < ω}, which implies x ≥ X(ω). On the other hand, if ω ∈ (0, 1) with F(x) < ω, since F is right continuous, there is ε > 0 with F(x + ε) < ω. Therefore X(ω) ≥ x + ε > x. This means that F(x) < ω implies X(ω) > x; consequently, x ≥ X(ω) implies ω ≤ F(x). We then equip the space ((0, 1), B((0, 1))) with the uniform distribution P = U((0, 1)). (Footnote: Here, U((0, 1)) stands for the uniform distribution on the open interval (0, 1), viewed as a probability measure on ((0, 1), B((0, 1))). Since continuous distributions give zero mass to points, this is essentially the same distribution as the uniform distribution U([0, 1]) on [0, 1], viewed as a probability measure on (R, B(R)).) Then the law of X has cumulative distribution function given by F. Indeed, by (6.26):

F_X(x) = P[X ≤ x] = P[(0, F(x)]] = F(x). (6.27)

The proof of the second part (uniqueness of P_X) is omitted.
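The construction (6.25) in the proof of Theorem 6.8 is exactly the inverse transform sampling method. A Python sketch (not from the notes) specializing it to the exponential distribution, where F(x) = 1 − e^{−λx} and the generalized inverse is F^{−1}(u) = −ln(1 − u)/λ:

```python
# Sampling from E(lambda) via the construction of Theorem 6.8.
import math
import random

lam = 2.0
def sample_exponential():
    u = random.random()                  # uniform on [0, 1)
    return -math.log(1 - u) / lam        # generalized inverse of F

n = 10 ** 5
xs = [sample_exponential() for _ in range(n)]
x0 = 0.5
empirical = sum(x <= x0 for x in xs) / n
print(empirical, 1 - math.exp(-lam * x0))  # both ~ F(0.5) = 1 - e^{-1} ~ 0.632
```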
Let us briefly explain the role of the discontinuity points of a cumulative distribution function F. If we look at (i), (iii) in Example 6.6, we see that a jump of the cumulative distribution function at a point x ∈ R corresponds to the probability P_X[{x}] = P[X = x]. More formally:

Lemma 6.9. Let X : (Ω, F) → (R, B(R)) be a real random variable with cumulative distribution function F = F_X. Then, for every x ∈ R, we have

P[X = x] = F(x) − F(x−), (6.28)

where F(x−) (the left limit of F at x) is defined as

F(x−) = lim_{ε↓0} F(x − ε). (6.29)

Proof. Note that since F is non-decreasing, the limit in (6.29) is well defined and equal to lim_{n→∞} F(x − 1/n). The claim (6.28) then follows from (6.21), the fact that

{x} = ⋂_{n=1}^∞ (x − 1/n, x] (6.30)

and

F(x) − F(x − 1/n) = P[(x − 1/n, x]]. (6.31)

□

To summarize the previous results, we see that a cumulative distribution function F uniquely determines a probability measure P on (R, B(R)) and vice versa. Let us also stress the fact that the cumulative distribution function is really associated to the law of a random variable, and not the random variable itself. This motivates the following definition.

Definition 6.10. Let (Ω, F, P) be a probability space and X, Y : (Ω, F) → (R, B(R)) two random variables. We say that X and Y are equal in distribution or identically distributed if

P_X = P_Y (⇔ F_X(x) = F_Y(x), for all x ∈ R). (6.32)

This is denoted as X =_d Y.

Example 6.11.

(i) Consider again throwing two dice. We use the probability space (6.1). The results of the first and second die are given by the random variables

X : {1, ..., 6}^2 → {1, ..., 6}, X(ω_1, ω_2) = ω_1,
Y : {1, ..., 6}^2 → {1, ..., 6}, Y(ω_1, ω_2) = ω_2. (6.33)

We see that X ∼ U({1, ..., 6}) and Y ∼ U({1, ..., 6}), so X =_d Y. Note that of course X ≠ Y, since for instance X(1, 2) = 1 ≠ 2 = Y(2, 1).

(ii) Let X be any continuous random variable and f_X the probability density of its law. Assume that f_X is an even function, i.e.

f_X(x) = f_X(−x), for all x ∈ R. (6.34)

Then X =_d −X. Indeed, we have

F_{−X}(x) = P[−X ≤ x] = P[X ≥ −x] = 1 − P[X < −x]
          = 1 − ∫_{−∞}^{−x} f_X(t) dt
          = 1 − ∫_x^∞ f_X(−t) dt
          = 1 − ∫_x^∞ f_X(t) dt (6.35)
          = P[X ≤ x] = F_X(x).

Here we repeatedly used the results of Lemma 5.9. A more concrete example: if X ∼ N(0, σ²), σ > 0, we have that −X ∼ N(0, σ²) as well.

The example above already gives a hint that random variables are a useful tool for algebraic manipulations. We will see more of this in the next section.

6.3. Transformation of random variables

Example 6.12. We measure the temperature of a liquid in °C and want to transform it into °F:

°C: random variable X, °F: random variable Y.

We can use the known formula

Y = (9/5) · X + 32, more generally Y = a · X + b, (6.36)

for a ≠ 0, b ∈ R. We assume that X is a continuous random variable with probability density (distribution) function f_X (F_X), for instance X ∼ N(µ, σ²). What is the distribution of Y, i.e. what do f_Y or F_Y look like? For simplicity, we also assume a > 0:

F_Y(y) = P_Y[(−∞, y]] = P[Y ≤ y] = P[aX + b ≤ y] = P[X ≤ (y−b)/a] = F_X((y−b)/a)
⇒ f_Y(y) = (d/dy) F_X((y−b)/a) = (1/a) · f_X((y−b)/a). (6.37)

If Y = a · X + b with general a ≠ 0, we have

f_Y(y) = (1/|a|) · f_X((y−b)/a). (6.38)

In the special case where X ∼ N(µ, σ²), we have

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
⇒ f_Y(y) = (1/(√(2π) σ |a|)) e^{−(y−b−aµ)²/(2σ²a²)}
⇒ Y = aX + b ∼ N(aµ + b, a²σ²). (6.39)
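A quick Monte Carlo check of (6.39), as a Python sketch (not from the notes), using the Celsius-to-Fahrenheit map of Example 6.12:

```python
# If X ~ N(mu, sigma^2), then aX + b has mean a*mu + b and variance a^2 sigma^2.
import random
import statistics

mu, sigma, a, b = 0.0, 2.0, 9 / 5, 32.0   # the Celsius-to-Fahrenheit map
ys = [a * random.gauss(mu, sigma) + b for _ in range(10 ** 5)]

print(round(statistics.mean(ys), 2))      # ~ a*mu + b = 32.0
print(round(statistics.variance(ys), 2))  # ~ a^2 sigma^2 = 12.96
```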
Formula (6.38) is the linear transformation rule for continuous random variables. We now show a general transformation rule for continuous random variables.

Theorem 6.13. Let X be a continuous random variable and f_X the probability density function of its law. Suppose that g : R → R is strictly increasing or strictly decreasing and differentiable. (Footnote: To be very precise, we also need to make sure that Y = g(X) is still a random variable, i.e. that it is F−B(R)-measurable. This follows from the fact that g is differentiable and thus B(R)−B(R)-measurable (in fact continuity is sufficient) and the simple fact that the composition of measurable functions is measurable.) Then

Y = g(X) (6.40)

is also a continuous random variable and has density

f_Y(y) = f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)|, if y = g(x) with f_X(x) > 0, and f_Y(y) = 0 else. (6.41)

Proof. Let g be strictly increasing. Then, for [a, b) ⊆ {g(x) ; f_X(x) > 0}, we have

P_Y[[a, b)] = P[g(X) ∈ [a, b)] = P[X ∈ [g^{−1}(a), g^{−1}(b))]
            = ∫_{g^{−1}(a)}^{g^{−1}(b)} f_X(x) dx = ∫_a^b f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| dy, (6.42)

where the last integrand is f_Y(y). The proof for g strictly decreasing is similar. □

Example 6.14. Let X ∼ U([0, 1]) and consider Y = exp(X) = e^X. The function exp satisfies the requirements of the previous theorem, and its inverse is log. Moreover, f_X(x) ≠ 0 if and only if x ∈ [0, 1]. We have

f_Y(y) = (d/dy) log(y) = 1/y for y ∈ [1, e], and f_Y(y) = 0 for y ∉ [1, e]. (6.43)

In the next example, we use the same method as in Theorem 6.13 to introduce the χ²-distribution (with one degree of freedom).

Example 6.15. Let X ∼ N(0, 1) and Y = X². We want to calculate f_Y(y). Unfortunately the function g : R → R, x ↦ x², is not strictly increasing / decreasing, but we can still use the same idea as in the proof of the transformation rule. Indeed, we see that g^{−1}(y) = √y and (d/dy) g^{−1}(y) = 1/(2√y) for y > 0. Then, for 0 ≤ a < b < ∞, we find

P[Y ∈ [a, b)] = P[X ∈ [√a, √b)] + P[X ∈ (−√b, −√a]]
             = 2 P[X ∈ [√a, √b)]
             = 2 ∫_a^b (1/√(2π)) e^{−y/2} · (1/(2√y)) dy. (6.44)

We used the symmetry of X ∼ N(0, 1) and the fact that P_X does not give mass to points. It follows that

f_Y(y) = (1/√(2π)) y^{−1/2} e^{−y/2} 1_{[0,∞)}(y). (6.45)

We say that a random variable Y with a law given by this density is χ²-distributed (with one degree of freedom).
Definition 7.2. (i) Let X be a discrete real random variable with values in Ω_X (⊆ R) and let p_X be the probability mass function of its law P_X. We define the expectation of X as

    E[X] = Σ_{ω∈Ω_X} ω·p_X(ω),    (7.4)

provided that Σ_{ω∈Ω_X} |ω| p_X(ω) < ∞.

(ii) Let X be a continuous real random variable and let f_X be the probability density function of its law P_X. We define the expectation of X as

    E[X] = ∫_{−∞}^{∞} x·f_X(x) dx,    (7.5)

provided that ∫_{−∞}^{∞} |x| f_X(x) dx < ∞.

End of Lecture 10

Remark 7.3. (i) For real random variables X that are neither discrete nor continuous (such as the one we saw in Example 6.6, (iii)), we typically cannot easily define the expectation by a formula as above. We refer to [2, Section 4.1] for the case of general X.

(ii) The expectation only depends on the law P_X of X. In other words: if X =ᵈ Y and the expectation of X exists, then the expectation of Y exists as well and E[X] = E[Y].

Let us give a couple of examples.

Example 7.4. (i) Let X = c ∈ R. Then

    E[X] = c,    (7.6)

since X is a discrete random variable with Ω_X = {c} and P[X = c] = 1.

(ii) Let (Ω, F, P) be a probability space and A ∈ F. Then the indicator function 1_A, defined by

    1_A(ω) = 1 for ω ∈ A,    1_A(ω) = 0 for ω ∉ A,    (7.7)

is a random variable, and

    E[1_A] = P[A].    (7.8)

Indeed, 1_A⁻¹(B) ∈ {∅, A, Aᶜ, Ω} for every B ∈ B(R), and Ω_{1_A} = {0, 1}. Therefore we have

    E[1_A] = 0·P[1_A = 0] + 1·P[1_A = 1] = P[A].    (7.9)

(iii) Let X ∼ Pois(λ), where λ > 0. The corresponding probability mass function is p_X(k) = (λᵏ/k!) e^{−λ} for k ∈ N₀, and

    E[X] = Σ_{k=0}^∞ k·p_X(k) = λ Σ_{k=1}^∞ (λ^{k−1}/(k − 1)!) e^{−λ} = λ e^{−λ} e^{λ} = λ.    (7.10)

(iv) Let X ∼ U([a, b]) for a < b. The corresponding probability density function is f_X(x) = (1/(b − a)) 1_{[a,b]}(x). We have

    E[X] = ∫_a^b x·(1/(b − a)) dx = (1/(b − a)) [x²/2]_a^b = (a + b)/2.    (7.11)

(v) Let X ∼ N(µ, σ²), µ ∈ R, σ > 0. Here the probability density function is f_X(x) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}. Substituting y = x − µ,

    E[X] = ∫_{−∞}^∞ x·(1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)} dx
         = ∫_{−∞}^∞ y·(1/(√(2π)σ)) e^{−y²/(2σ²)} dy + µ ∫_{−∞}^∞ (1/(√(2π)σ)) e^{−y²/(2σ²)} dy
         = 0 + µ = µ,    (7.12)

where we used that the first integrand is odd, and that the second integrand is again a probability density function (of N(0, σ²)), so its integral is one.

(vi) X ∼ Geo(p), p ∈ (0, 1), has expectation

    E[X] = 1/p.    (7.13)

(vii) X ∼ E(λ), λ > 0, has expectation

    E[X] = 1/λ.    (7.14)

(viii) X ∼ Bin(n, p), n ∈ N, p ∈ (0, 1), has expectation

    E[X] = np.    (7.15)

(ix) Consider the probability distribution P on N characterized by the probability mass function p(k) = (6/π²)·(1/k²).¹ Let X ∼ P. Then the expectation of X does not exist. Indeed,

    Σ_{k=1}^∞ k·p_X(k) = Σ_{k=1}^∞ (6/π²)·(1/k) = ∞.    (7.16)

The claims in (vi) and (vii) are exercises. Claim (viii) will be shown very easily later, after introducing the notion of independent random variables.

¹ The prefactor is chosen since Σ_{k=1}^∞ 1/k² = π²/6. This can be shown using Fourier series.
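Two of the expectations above can also be checked numerically. The following sketch (not part of the original notes; the parameter values λ = 2.5, a = 1, b = 5 and the truncation/grid sizes are arbitrary choices) approximates the defining sum (7.4) and integral (7.5):

    import math

    # (iii) X ~ Pois(lam): truncated sum of k * p_X(k), building the pmf
    # recursively to avoid overflow; should return approximately lam, cf. (7.10).
    lam = 2.5
    term = math.exp(-lam)          # p_X(0)
    mean_pois = 0.0
    for k in range(1, 100):
        term *= lam / k            # p_X(k) = lam^k / k! * e^{-lam}
        mean_pois += k * term
    print(mean_pois)               # ~ 2.5

    # (iv) X ~ U([a, b]): midpoint Riemann sum of x * f_X(x);
    # should return approximately (a + b) / 2, cf. (7.11).
    a, b, n = 1.0, 5.0, 100_000
    h = (b - a) / n
    mean_unif = sum((a + (i + 0.5) * h) / (b - a) * h for i in range(n))
    print(mean_unif)               # ~ 3.0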
Theorem 7.5. Let g : R → R.

(i) If X is a discrete real random variable with probability mass function (p_X(ω))_{ω∈Ω_X}, then

    E[g(X)] = Σ_{ω∈Ω_X} g(ω) p_X(ω),    provided that Σ_{ω∈Ω_X} |g(ω)| p_X(ω) < ∞.    (7.17)

(ii) If X is a continuous real random variable with probability density function f_X (and g is piecewise continuous²), then

    E[g(X)] = ∫_{−∞}^∞ g(x) f_X(x) dx,    provided that ∫_{−∞}^∞ |g(x)| f_X(x) dx < ∞.    (7.18)

Proof. For (i), let Y = g(X), so that Ω_Y = {y₁, y₂, ...}, and let A_i = g⁻¹({y_i}) for i ∈ N. Clearly Ω_X = ⋃_{i=1}^∞ A_i, and the A_i are pairwise disjoint. We have

    E[g(X)] = E[Y] = Σ_{i=1}^∞ y_i·P_Y[{y_i}] = Σ_{i=1}^∞ Σ_{ω_j∈A_i} y_i·p_X(ω_j)
            = Σ_{i=1}^∞ Σ_{ω_j∈A_i} g(ω_j)·p_X(ω_j) = Σ_{ω∈Ω_X} g(ω) p_X(ω).    (7.19)

For (ii), the proof of the general case is more complicated and relies on measure theory. For the special case where g is strictly increasing or strictly decreasing and differentiable, we can however use Theorem 6.13 and see that, for Y = g(X),

    E[g(X)] = ∫_{−∞}^∞ y f_Y(y) dy = ∫_{−∞}^∞ y f_X(g⁻¹(y)) |(d/dy) g⁻¹(y)| dy = ∫_{−∞}^∞ g(x) f_X(x) dx.    (7.20)

² This can be weakened to requiring that g : (R, B(R)) → (R, B(R)) is measurable.

Corollary 7.6. Let X be a real random variable with finite expectation E[X]. Then, for a, b ∈ R,

    E[aX + b] = a·E[X] + b.    (7.21)

Proof. We only prove this statement for X continuous or discrete. Without loss of generality, assume that X is continuous (the discrete case proceeds analogously). Consider g(x) = ax + b. Then

    E[g(X)] = ∫_{−∞}^∞ (ax + b) f_X(x) dx = a ∫_{−∞}^∞ x f_X(x) dx + b ∫_{−∞}^∞ f_X(x) dx = a·E[X] + b,    (7.22)

since the last integral equals one. A similar calculation shows that the integral ∫_{−∞}^∞ |g(x)| f_X(x) dx is finite.

Theorem 7.7. Let X, Y be two real random variables, both with finite expectation. Then

    E[X + Y] = E[X] + E[Y].    (7.23)

We will show this later, after introducing the joint distribution of random variables in the next section.

7.2. Variance

Definition 7.8. Let X be a continuous or discrete real random variable.

(i) For k ∈ N, the k-th moment of X is defined by

    µ_k = E[Xᵏ],    if E[|X|ᵏ] < ∞.    (7.24)

(ii) Assume that X has a finite second moment, E[X²] < ∞. We define the variance of X as

    Var[X] = E[(X − E[X])²].    (7.25)

We also define the standard deviation of X by

    σ(X) = √(Var[X]).    (7.26)

The variance measures how much the distribution of X typically spreads around its expectation: if it is large, the distribution is well spread out; if it is small, the distribution is concentrated around the expectation. Both the k-th moment and the variance only depend on the law P_X of X, just as the expectation does.

Proposition 7.9. Let X be a continuous or discrete real random variable with E[X²] < ∞ and let a, b ∈ R. Then

(i)
    Var[X] = E[X²] − E[X]² = µ₂ − µ₁².    (7.27)

(ii)
    Var[aX + b] = a²·Var[X].    (7.28)

Proof. We first prove (i):

    Var[X] = E[(X − E[X])²] = E[X² − 2X·E[X] + E[X]²] = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².    (7.29)

For (ii), we calculate

    Var[aX + b] = E[(aX + b − a·E[X] − b)²] = E[a²(X − E[X])²] = a²·Var[X].    (7.30)

End of Lecture 11

Let us give some examples.

Example 7.10. (i) Let X ∼ Ber(p), p ∈ (0, 1). We have

    E[X] = 0·(1 − p) + 1·p = p,
    E[X²] = 0²·(1 − p) + 1²·p = p,
    Var[X] = p − p² = p(1 − p).    (7.31)

(ii) Let X ∼ U({1, ..., 6}). We have

    E[X] = Σ_{k=1}^6 k·P[X = k] = 3.5,
    E[X²] = Σ_{k=1}^6 k²·P[X = k] = 91/6,
    ⇒ Var[X] = 91/6 − 49/4 = 70/24 = 35/12 ≈ 2.92.    (7.32)

(iii) Let X ∼ N(0, 1). We already saw that E[X] = 0 (see (7.12)). Using integration by parts, we now evaluate

    Var[X] = E[(X − E[X])²] = ∫_{−∞}^∞ x²·φ(x) dx
           = (1/√(2π)) ∫_{−∞}^∞ x² e^{−x²/2} dx
           = (1/√(2π)) [−x e^{−x²/2}]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^∞ e^{−x²/2} dx
           = 0 + 1 = 1.    (7.33)
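The identity Var[X] = µ₂ − µ₁² from Proposition 7.9 (i) is easy to verify exactly for the fair die of Example 7.10 (ii). A short sketch (not part of the original notes), using exact rational arithmetic from Python's standard library:

    from fractions import Fraction

    # Verify Var[X] = E[X^2] - E[X]^2 = 91/6 - 49/4 = 35/12 for X ~ U({1,...,6}).
    p = Fraction(1, 6)                           # P[X = k] for each face k
    mu1 = sum(k * p for k in range(1, 7))        # E[X]   = 7/2
    mu2 = sum(k * k * p for k in range(1, 7))    # E[X^2] = 91/6
    print(mu1, mu2, mu2 - mu1**2)                # 7/2  91/6  35/12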
Now let Y = σX + µ for µ ∈ R, σ > 0. Then

    Y ∼ N(µ, σ²)    (by (6.39)),    and    Var[Y] = σ²·Var[X] = σ²    (by (7.28)).    (7.34)

We have established that

    Z ∼ N(µ, σ²)    ⇒    E[Z] = µ,  Var[Z] = σ².    (7.35)

In other words: the standard deviation of N(µ, σ²) is exactly σ.

(iv) Let X ∼ Pois(λ). Recall that E[X] = λ. Then

    E[X(X − 1)] = Σ_{k=0}^∞ k(k − 1)·(λᵏ/k!) e^{−λ} = λ² Σ_{k=2}^∞ (λ^{k−2}/(k − 2)!) e^{−λ} = λ²
    ⇒ E[X²] = E[X(X − 1)] + E[X] = λ² + λ
    ⇒ Var[X] = E[X²] − E[X]² = λ² + λ − λ² = λ.    (7.36)

(v) X ∼ Bin(n, p), n ∈ N, p ∈ (0, 1), has variance

    Var[X] = np(1 − p).    (7.37)

We now argue why the variance is indeed a useful quantification of how spread out the distribution is. We study the expression P[|X − E[X]| ≥ ε] for ε > 0. The following inequality is called the Markov inequality.

Theorem 7.11. Let X be a non-negative random variable with finite expectation. Then, for ε > 0,

    P[X ≥ ε] ≤ E[X]/ε.    (7.38)

Proof. We prove the case where X is discrete; the case of continuous X is similar. Let Ω_X = {ω₁, ω₂, ...}. Then

    P[X ≥ ε] = Σ_{i : ωᵢ ≥ ε} P[X = ωᵢ] ≤ Σ_{i : ωᵢ ≥ ε} (ωᵢ/ε)·P[X = ωᵢ] ≤ (1/ε) Σ_{i=1}^∞ ωᵢ·P[X = ωᵢ] = (1/ε)·E[X].    (7.39)

From the Markov inequality, we obtain the following result, called the Chebyshev inequality.

Theorem 7.12. Let E[X²] < ∞. For any a ∈ R and ε > 0, one has

    P[|X − a| ≥ ε] ≤ E[(X − a)²]/ε².    (7.40)

In particular, for a = E[X] one has

    P[|X − E[X]| ≥ ε] ≤ Var[X]/ε².    (7.41)

Proof. We apply the Markov inequality (7.38) to the non-negative random variable (X − a)². It follows that

    P[|X − a| ≥ ε] = P[(X − a)² ≥ ε²] ≤ E[(X − a)²]/ε²,    (7.42)

where (7.38) was used in the last step.

Chebyshev's inequality gives a bound on how likely it is that a random variable deviates by a certain amount from its expectation. In particular, we have the following “kσ-rules”:

Corollary 7.13. Let E[X²] < ∞.

(i) If σ = σ(X) = √(Var[X]) > 0, we have, for k > 0,

    P[|X − E[X]| ≥ kσ] ≤ 1/k².    (7.43)

In particular,

    P[|X − E[X]| ≥ 2σ] ≤ 1/4,    P[|X − E[X]| ≥ 3σ] ≤ 1/9.    (7.44)

(ii) If Var[X] = 0, then

    P[X = E[X]] = 1.    (7.45)

Remark 7.14. The Chebyshev inequality is very general, but only gives rather rough bounds. Consider for instance X ∼ N(µ, σ²) for µ ∈ R, σ > 0. Then (X − µ) ...
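The point of Remark 7.14 can be seen numerically. The following sketch (not part of the original notes; it assumes NumPy, and the standard normal case with a fixed seed is an arbitrary choice) compares the Chebyshev bound (7.43) with the empirical tail probability:

    import numpy as np

    # Compare the Chebyshev bound 1/k^2 with the actual probability
    # P[|X - mu| >= k*sigma] for X ~ N(0, 1), estimated by simulation.
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=1_000_000)

    for k in (1, 2, 3):
        empirical = np.mean(np.abs(x) >= k)
        print(k, empirical, 1 / k**2)   # e.g. k = 2: ~0.0455 vs. bound 0.25

For the normal distribution the true tail probabilities are far smaller than the Chebyshev bounds, illustrating how rough the general inequality is.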

Explanation & Answer


4. Let X₁, ..., X₅₀ be i.i.d. random variables with E[X₁] = Var[X₁] = 2. Let X̄_50 = (1/50) Σ_{i=1}^{50} Xᵢ. Based on Chebyshev's inequality, determine the smallest value of c > 0 such that P[|X̄_50 − E[X̄_50]| ≥ c] ≤ 1/100. [1 Point]

(A) c = 10√2.
(B) c = 10.
(C) c = 4.
(D) c = 2.
(E) c = 1.
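As a self-check (not part of the original exam or its posted solution), the Chebyshev computation can be verified in a few lines of Python; the step Var[X̄_50] = Var[X₁]/50 uses the fact that variances of sums of independent random variables add, which the notes establish when treating sums of independent random variables:

    import math

    # Var of the i.i.d. average: Var[X_1] / 50 = 2 / 50.
    var_mean = 2 / 50
    # Chebyshev (7.41): P[|X̄ - E[X̄]| >= c] <= var_mean / c^2 <= 1/100
    # requires c^2 >= 100 * var_mean.
    c = math.sqrt(100 * var_mean)
    print(c)   # 2.0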
5. Let X₁, ..., X₂₀ be real random variables with Xᵢ ∼ Ber(1/5) for all 1 ≤ i ≤ 20. Consider Y = Σ_{i=1}^{20} Xᵢ. Which of the following claims is true? [1 Point]

(A) Y ∼ Bin(20, 1/5).
(B) E[5Y + 5] = 25.
(C) Var[5Y + 5] = 80.
(D) All of (A) – (C) are true.
(E) None of (A) – (C) are true.
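Again as a self-check (not part of the original exam), the quantities in options (B) and (C) can be computed under the assumption that the Xᵢ are independent, so that Y ∼ Bin(20, 1/5) with E[Y] = np and Var[Y] = np(1 − p):

    # Binomial model for Y, assuming independence of the X_i.
    n, p = 20, 1 / 5
    ey = n * p                 # E[Y] = 4
    vy = n * p * (1 - p)       # Var[Y] = 16/5
    print(5 * ey + 5)          # E[5Y + 5] = 25, by linearity (7.21)
    print(5**2 * vy)           # Var[5Y + 5] = 80, by (7.28)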
6. Which of the following statements is false? [1 Point]

(A) For the calculation of the p-value of a statistical test, we must know the observed outcome.
(B) For the calculation of the p-value of a statistical test, we must know the significance level α.
(C) If for a statistical test the p-value is larger than the significance level α, we cannot reject H₀ at level α.
(D) For the calculation of the p-value of a statistical test, we do not need to know the power 1 − β.
(E) The power 1 − β of a statistical test is the probability that H₀ is rejected if it is false.
7. Let Z be a continuous real random variable with the probabili...

