New York University − Courant Institute, MATH-UA 235
Probability and Statistics
Maximilian Nitzschner
04/06/2021
Disclaimer:
These are lecture notes for the course Probability and Statistics (MATH-UA 235), given at
New York University in Spring 2021.
The primary textbook reference for this course is [1]. For some advanced topics and further
reading, especially concerning more mathematical details, the book [2] may also be helpful.
These notes are preliminary and may contain typos. If you see any mistakes or think that the
presentation is unclear and could be improved, please send an email to:
maximilian.nitzschner@cims.nyu.edu. All comments and suggestions are appreciated.
Contents

0. Motivation
   0.1. What is probability theory?
   0.2. What is statistics?
1. Outcomes, events and probability
   1.1. Sample spaces
   1.2. Events, σ-algebras
   1.3. Probability
   1.4. Elementary Combinatorics
2. Conditional probability and stochastic independence
   2.1. Conditional probability
   2.2. The law of total probability and Bayes’ theorem
   2.3. Stochastic independence
3. Discrete distributions
4. Introduction to statistical tests and Neyman-Pearson lemma
   4.1. Basic notions of statistical tests
   4.2. The Neyman-Pearson lemma
5. Continuous distributions
6. Random variables
   6.1. Definition of random variables
   6.2. Law and cumulative distribution of a real random variable
   6.3. Transformation of random variables
7. Expectation, variance and higher moments of random variables
   7.1. Expectation
   7.2. Variance
8. Joint distributions and independence of random variables
   8.1. Joint distributions of random variables
   8.2. Independence of random variables
9. Covariance and correlation
10. Operations with random variables
   10.1. Extremes
   10.2. Sums of independent random variables
11. Poisson processes
12. Stochastic convergence and the weak law of large numbers
   12.1. The law of large numbers
   12.2. Moment estimators
A. Appendix
   A.1. Multiple integrals
0. Motivation
The purpose of probability theory and statistics is to study systems that involve randomness.
0.1. What is probability theory?
Consider as a very simple example throwing a single die, which is a process with a random
outcome. A first objective will be to develop the mathematical description of characteristics of
such a random experiment: This is the specification of a stochastic model. Loosely speaking:
Probability theory is concerned with the description of random phenomena using
stochastic models.
(0.1)
Here are some examples of phenomena calling for a probabilistic description:
• throwing a (fair) die or coin multiple times;
• describing the random movement of a particle in Zd, d ≥ 1 (random walk): at every time step, the particle moves randomly to one of its neighboring sites, with “equal probability”;
Figure 0.1.: Left panel: Possible jumps for a particle at the origin; right panel: Position of the
particle after 24 steps.
Some typical questions we could ask in this context are for instance:
• What is the average of the numbers shown by the die after a large number of throws?
• Where will the random particle be after a large number of steps?
• What is the approximate probability that the sum of the numbers coming up when throwing the die 1000 times exceeds 5000?
• Will the random particle ever come back to the origin?
In this course we develop techniques to answer some of these questions. Notably, we will see
the law of large numbers and the central limit theorem, which address the first three questions
above.
0.2. What is statistics?
Let us assume we have some outcomes of a random experiment following a distribution with
an unknown parameter. Is it possible to guess or reconstruct this parameter from observations?
Again, loosely speaking:
Statistics is concerned with methods to draw conclusions from given random observations. (0.2)
Here are some questions for which statistics is useful.
• Assume we have 100 measurements of a certain physical quantity following a known distribution (for instance the lifetime of a light bulb). How do we effectively estimate the parameter underlying this distribution?
• If we throw a coin 100 times and obtain “heads” 75 times, is it reasonable to claim that the coin is not a fair coin? What would we base this decision on?
The first question is an example of an estimation problem, whereas the second question motivates
the study of statistical tests. We will develop some of the theory of estimators, tests and other
statistical methods in this course.
1. Outcomes, events and probability
(Reference: [1, Chapter 2], or [2, Sections 1.1-1.2, 2.1-2.3])
Our primary objective is to construct a mathematical model for a random experiment. Conceptually, this involves the specification of three quantities:
• a set of outcomes or sample space Ω ≠ ∅; an element ω ∈ Ω should be interpreted as a possible realization / measurement of the random experiment;
• a class of events F ⊆ P(Ω), called a σ-algebra; an event A ∈ F is a subset of Ω, and we aim at specifying its probability;
• a probability measure P, which is a map from F to [0, 1] that assigns a probability P[A] to any given event A ∈ F.
The triple (Ω, F, P) is called a probability space. In the following sections, we give precise
definitions of these objects and present examples.
1.1. Sample spaces
Definition 1.1. A non-empty set Ω consisting of the possible realizations of a random experiment is called a set of outcomes or sample space. An element ω ∈ Ω is called an outcome.
Example 1.2.
(i) Tossing a coin: The possible outcomes are heads and tails, which we denote by H and T respectively. In this case, we have

    Ω1 = {H, T}.    (1.1)

(ii) Rolling a die: The outcomes are the integer numbers from 1 to 6, so

    Ω2 = {1, 2, 3, 4, 5, 6}.    (1.2)

(iii) Tossing a coin and rolling a die: We define the sample space as the Cartesian product of Ω1 and Ω2, namely

    Ω3 = Ω1 × Ω2 = {(ω1, ω2) ; ω1 ∈ Ω1, ω2 ∈ Ω2}
       = {(H, 1), (H, 2), ..., (H, 6), (T, 1), (T, 2), ..., (T, 6)}.    (1.3)
(iv) n-fold coin toss (where n ∈ N = {1, 2, ...}): Here, we need to record the outcome as an n-tuple:

    Ω4 = {(H, H, ..., H, H), (H, H, ..., H, T), (H, H, ..., T, H), ..., (T, T, ..., T, T)}
       = {(ω1, ..., ωn) ; ωi ∈ {H, T} for 1 ≤ i ≤ n}
       = Ω1 × ... × Ω1 (n factors) = Ω1^n.    (1.4)

(v) Tossing a coin infinitely many times: The natural choice for outcomes is similar to the previous example, but with sequences of infinite length rather than n-tuples. More precisely,

    Ω5 = Ω1^N = {(ω1, ω2, ...) ; ωi ∈ {H, T} for i ∈ N}.    (1.5)

(vi) The number of customers in a shop during a given day:

    Ω6 = N0 = {0, 1, 2, ...}.    (1.6)

(vii) The lifetime of a light bulb:

    Ω7 = R0^+ = [0, ∞).    (1.7)
Let us point out that the sample spaces Ω1, Ω2, Ω3, Ω4 and Ω6 are countable¹, whereas Ω5 and Ω7 are uncountable.
1.2. Events, σ-algebras
Suppose that we have fixed a sample space Ω. In general we are interested in the occurrence of
events that consist of a certain selection of outcomes. For instance, consider rolling a die once
(recall from Example 1.2, (ii) that
Ω2 = {1, 2, 3, 4, 5, 6}
is a reasonable choice for the sample space for this random experiment). The event
A = “the upper face of the die shows an even number”
(1.8)
can then be expressed as
A = {2, 4, 6} ⊆ Ω2 .
(1.9)
Naive definition: An event is a subset A ⊆ Ω of the sample space.
This works in the case where Ω is countable (in particular, if Ω is finite), but leads to an important
complication when Ω is uncountable (see Example 1.2, (v) and (vii)). It turns out that if we
allow every subset A ⊆ Ω for an uncountable Ω, we cannot define a probability for A without
running into problems. Fortunately, we can restrict our attention to smaller classes of subsets.
¹A set S is countable if there exists a surjective (onto) map ρ : N → S. This includes the case of finite S.
Definition 1.3. Let Ω ≠ ∅. The power set P(Ω) is the set of all subsets of Ω, i.e.

    P(Ω) = {A ; A ⊆ Ω}.    (1.10)

A σ-algebra on Ω is a subset F ⊆ P(Ω) that fulfills the following properties:

(S1) Ω ∈ F.
(S2) If A ∈ F, then A^c = Ω \ A ∈ F.
(S3) If Aj ∈ F for every j ∈ N, then ⋃_{j=1}^∞ Aj = A1 ∪ A2 ∪ A3 ∪ ... ∈ F.

A set A ∈ F is called an event. If ω ∈ A, we say that the event A occurs (for the outcome ω). If ω ∉ A, we say that A does not occur (for the outcome ω).
Remark 1.4.
(i) The power set P(Ω) itself is a σ-algebra.
(ii) The event Ω always occurs in a random experiment, since ω ∈ Ω is always true. On the other hand, the event ∅ = Ω^c never occurs, since ω ∈ ∅ can never be true.
(iii) In the previous definition, (S2) should be understood as follows: If A ∈ F is an event, then A^c, which has the interpretation that A does not occur, should also be an event. Similarly, (S3) means: If A1, A2, A3, ... are events, then ⋃_{j=1}^∞ Aj, which has the interpretation that at least one of the Aj occurs, should also be an event.
End of Lecture 1
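For a finite sample space, the axioms (S1)-(S3) can be checked mechanically, since countable unions reduce to finite ones. As an illustrative sketch (not part of the original notes), the following Python snippet verifies the axioms for the power set of a small Ω and for the trivial σ-algebra {∅, Ω}:

```python
from itertools import chain, combinations

def power_set(omega):
    """All subsets of omega, represented as frozensets."""
    s = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))}

def is_sigma_algebra(F, omega):
    """Check (S1)-(S3) for a collection F of subsets of a finite omega.
    For finite omega, closure under countable unions reduces to closure
    under unions of pairs."""
    omega = frozenset(omega)
    if omega not in F:                 # (S1)
        return False
    for A in F:
        if omega - A not in F:         # (S2): closed under complements
            return False
    for A in F:
        for B in F:
            if A | B not in F:         # (S3): closed under (finite) unions
                return False
    return True

omega = {1, 2, 3}
print(is_sigma_algebra(power_set(omega), omega))                     # power set
print(is_sigma_algebra({frozenset(), frozenset(omega)}, omega))      # trivial σ-algebra
```

Both collections pass; dropping, say, a complement from a collection makes the check fail.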
We draw some simple conclusions from Definition 1.3.
Proposition 1.5. Let Ω ≠ ∅ and let F ⊆ P(Ω) be a σ-algebra.

(i) ∅ ∈ F.
(ii) If Aj ∈ F for every j ∈ N, then ⋂_{j=1}^∞ Aj ∈ F.
(iii) If A, B ∈ F, then A ∪ B ∈ F, A ∩ B ∈ F and A \ B ∈ F.
Proof. We first prove (i): Since Ω ∈ F by (S1) and ∅ = Ω^c = Ω \ Ω, we have that ∅ ∈ F by (S2).

We turn to (ii): By de Morgan’s rules², we have that

    (⋂_{j=1}^∞ Aj)^c = ⋃_{j=1}^∞ Aj^c ∈ F,    (1.11)

since each Aj^c ∈ F by (S2), and the countable union is in F by (S3). Therefore, we have again by (S2) that

    ⋂_{j=1}^∞ Aj = ((⋂_{j=1}^∞ Aj)^c)^c ∈ F.    (1.12)

We now prove (iii): Set A1 = A = Ã1, A2 = B = Ã2 and Aj = ∅, Ãj = Ω for j ≥ 3 (which are all in F, using the assumption, (i) and (S1)). We then see that

    A ∪ B = A ∪ B ∪ ∅ ∪ ∅ ∪ ... = ⋃_{j=1}^∞ Aj ∈ F,    (1.13)
    A ∩ B = A ∩ B ∩ Ω ∩ Ω ∩ ... = ⋂_{j=1}^∞ Ãj ∈ F,    (1.14)

where we used (S3) and (ii), respectively. Finally, we have that

    A \ B = A ∩ B^c ∈ F,    (1.15)

since B^c ∈ F by (S2).

²The de Morgan rules state that for any collection {Ui}_{i∈I} of subsets Ui ⊆ U, one has

    (⋃_{i∈I} Ui)^c = ⋂_{i∈I} Ui^c,    (⋂_{i∈I} Ui)^c = ⋃_{i∈I} Ui^c.
We illustrate the set operations using again the example of rolling a single die.
Example 1.6. We use (Ω, F) = ({1, 2, 3, 4, 5, 6}, P({1, 2, 3, 4, 5, 6})) and consider the events
A = “the upper face of the die shows an even number” = {2, 4, 6},
B = “the upper face of the die shows a prime number” = {2, 3, 5},
C = “the upper face of the die shows an odd number” = {1, 3, 5}.
From this, we obtain
B^c = {1, 4, 6},    A ∪ B = {2, 3, 4, 5, 6},    A ∩ B = {2},    A ∩ C = ∅.
We see that the set B^c describes the event that B does not occur, the set A ∪ B describes the event that A or B occurs³, and A ∩ B describes the event that A and B both occur. The fact that A ∩ C is the empty set corresponds to the fact that the events A and C are mutually exclusive.
³As always in mathematics, the word “or” has a non-exclusive meaning: it includes the case where A and B both occur.
Figure 1.1.: Graphical representation of intersection, union and complement of sets (first three panels) and an example of two disjoint sets.
1.3. Probability

Definition 1.7. Let Ω ≠ ∅ and let F ⊆ P(Ω) be a σ-algebra on Ω. A function P : F → [0, 1] is called a probability measure (or simply a probability) if the following properties are fulfilled:

(P1) P[Ω] = 1 (normalization).
(P2) If (Aj)_{j∈N} is a sequence of events Aj ∈ F that are pairwise disjoint, namely Aj ∩ Ak = ∅ for every j, k ∈ N with j ≠ k, then

    P[⋃_{j=1}^∞ Aj] = Σ_{j=1}^∞ P[Aj]    (σ-additivity).    (1.16)

The triple (Ω, F, P) is called a probability space.
Example 1.8. A very natural class of examples is given by considering

    Ω finite, Ω ≠ ∅,    F = P(Ω),    (1.17)

and choosing the probability measure as follows:

    P : P(Ω) → [0, 1],    P[A] = |A| / |Ω|,    (1.18)

where | · | denotes the cardinality (i.e. the number of elements) of a set. The probability measure P is the (discrete) uniform distribution on Ω. The resulting probability space (Ω, P(Ω), P) is sometimes called a Laplace probability space. It is characterized by the fact that

    P[{ω}] = 1 / |Ω|,    for every ω ∈ Ω,    (1.19)

meaning that every outcome has the same probability.
Concrete example: We roll a die twice and are interested in the probability that the number 6 shows up at least once. Assuming that the die is fair, we consider the probability space (Ω, F, P) given by

    Ω = {1, 2, 3, 4, 5, 6}^2 = {(1, 1), (1, 2), ..., (1, 6), (2, 1), ..., (2, 6), ..., (6, 6)},
    F = P(Ω),
    P[A] = |A| / |Ω| = |A| / 36,    for all A ∈ P(Ω),    (1.20)

and the event in question is given by

    B = “At least one 6 shows up”
      = {(1, 6), (2, 6), (3, 6), (4, 6), (5, 6), (6, 6), (6, 5), (6, 4), (6, 3), (6, 2), (6, 1)}.    (1.21)
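As a quick cross-check (an illustrative sketch, not part of the original notes), one can enumerate the Laplace space Ω = {1, ..., 6}² in Python and count the outcomes of B directly:

```python
from fractions import Fraction
from itertools import product

# Laplace space for two rolls of a fair die: all 36 ordered pairs.
omega = list(product(range(1, 7), repeat=2))

# Event B: at least one of the two rolls shows a 6.
B = [w for w in omega if 6 in w]

prob_B = Fraction(len(B), len(omega))
print(len(B), prob_B)  # 11 outcomes, so P[B] = 11/36
```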
We clearly have that

    P[“At least one 6 shows up”] = P[B] = 11/36.    (1.22)

Let us now give some elementary but important properties of probabilities.
Proposition 1.9. Let (Ω, F, P) be a probability space and A, B, Aj ∈ F for j ∈ N. Then the following properties hold:

(i) P[∅] = 0,
(ii) P[A^c] = 1 − P[A],
(iii) If A ⊆ B, then P[A] ≤ P[B],
(iv) P[A ∪ B] = P[A] + P[B] − P[A ∩ B],
(v) P[⋃_{j=1}^∞ Aj] ≤ Σ_{j=1}^∞ P[Aj].
Proof. We start with the proof of (i): Since ∅ = ∅ ∪ ∅ ∪ ∅ ∪ ... (and clearly ∅ ∩ ∅ = ∅), we see by (P2) that

    P[∅] = Σ_{j=1}^∞ P[∅],    (1.23)

which can only be true if P[∅] = 0.

For (ii), note that A and A^c are disjoint and fulfill A ∪ A^c = Ω. We set A1 = A, A2 = A^c and Aj = ∅ for j ≥ 3, so that

    1 = P[Ω] = P[A ∪ A^c ∪ ∅ ∪ ∅ ∪ ...] = P[A] + P[A^c] + P[∅] + P[∅] + ... = P[A] + P[A^c],    (1.24)

using (P1) in the first equality, (P2) in the second, and (i) in the last.

For the proof of (iii), consider B̃ = B \ A (= {ω ∈ B ; ω ∉ A}), so that A ∪ B̃ = B and A ∩ B̃ = ∅. We find by the same argument as for (ii):

    P[B] = P[A ∪ B̃ ∪ ∅ ∪ ∅ ∪ ...] = P[A] + P[B̃] ≥ P[A],    (1.25)

since P[B̃] ≥ 0. Note that this calculation shows the stronger statement

    A ⊆ B  ⇒  P[B \ A] = P[B] − P[A].    (1.26)

For (iv), we define D = B \ (A ∩ B) and note that A ∩ B ⊆ B, A ∪ B = A ∪ D (?) and A ∩ D = ∅.

Argument for (?):

    ω ∈ A ∪ B  ⇔  ω ∈ A or ω ∈ B  ⇔  ω ∈ A or ω ∈ B \ (A ∩ B)  ⇔  ω ∈ A ∪ D.

Thus, we see that

    P[A ∪ B] = P[A ∪ D ∪ ∅ ∪ ∅ ∪ ...] = P[A] + P[D] = P[A] + P[B] − P[A ∩ B],    (1.27)

where the second equality uses (P2), and the last one uses (1.26) with A ∩ B ⊆ B, i.e. P[D] = P[B] − P[A ∩ B].

Finally, we prove (v). We define the sets

    B1 = A1,    Bn = An \ ⋃_{j=1}^{n−1} Aj,    n ≥ 2.    (1.28)

The sets Bj, j ∈ N, are pairwise disjoint and fulfill ⋃_{j=1}^∞ Aj = ⋃_{j=1}^∞ Bj as well as Bj ⊆ Aj for every j ∈ N. Therefore we have that

    P[⋃_{j=1}^∞ Aj] = P[⋃_{j=1}^∞ Bj] = Σ_{j=1}^∞ P[Bj] ≤ Σ_{j=1}^∞ P[Aj],    (1.29)

using (P2) in the second equality and (iii) in the inequality.
End of Lecture 2
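On a finite Laplace space, the identities of Proposition 1.9 can be verified numerically. The following sketch (an illustration under the two-dice model of Example 1.8, not part of the original notes) checks the complement rule, inclusion-exclusion and the union bound with exact arithmetic:

```python
from fractions import Fraction
from itertools import product

# Laplace space of two fair dice; P[A] = |A| / |Omega|.
omega = set(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] % 2 == 0}   # first die shows an even number
B = {w for w in omega if sum(w) <= 7}     # sum of both dice is at most 7

assert P(omega - A) == 1 - P(A)                  # (ii) complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)        # (iv) inclusion-exclusion
assert P(A | B) <= P(A) + P(B)                   # (v) union bound
print(P(A), P(B), P(A | B))
```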
1.4. Elementary Combinatorics

As seen with the example of rolling dice in the previous section, in many elementary cases, the assumption that all (finitely many) outcomes are equally likely is justified. In this situation, the probability space is uniquely characterized by the number |Ω| ∈ N. We want to develop effective methods to count the number of outcomes of Ω and of events A ⊆ Ω.
Remark 1.10. If N ∈ N random experiments with finite sample spaces Ω1, Ω2, ..., ΩN are performed successively, an appropriate choice for the sample space of the combined experiment is given by the Cartesian product

    Ω = Π_{j=1}^N Ωj := Ω1 × Ω2 × ... × ΩN = {(ω1, ..., ωN) ; ωj ∈ Ωj, 1 ≤ j ≤ N}.    (1.30)

The cardinality of Ω is given by

    |Ω| = Π_{j=1}^N |Ωj| = |Ω1| · |Ω2| · ... · |ΩN|.    (1.31)

We already saw this in Example 1.2, (iii).
Let us motivate some less elementary results with the following example.

Example 1.11. How likely is it that two persons in this room / in this Zoom call have their birthdays on the same date? To make this problem mathematically tractable, we make some simplifying assumptions:

• We assume that every birthday is equally likely, and we ignore leap years.
• We also suppose that the birthdays are equally distributed and independent from each other.

Probabilistic Model:

    Ω = {1, 2, ..., 365}^r    (r is the number of persons),
    F = P(Ω),
    P[A] = |A| / |Ω|,    A ⊆ Ω.    (1.32)

The event we are interested in, and its complement, are

    A = {(ω1, ..., ωr) ∈ Ω ; there exist i ≠ j with ωi = ωj},
    A^c = {(ω1, ..., ωr) ∈ Ω ; ωi ≠ ωj for all i ≠ j}.    (1.33)

We estimate P[A] (using exp(x) ≈ 1 + x for |x| ≪ 1):

    P[A] = 1 − P[A^c] = 1 − |A^c| / |Ω|
         = 1 − (365 · 364 · ... · (365 − r + 1)) / 365^r
         = 1 − 1 · (1 − 1/365) · ... · (1 − (r−1)/365)
         ≈ 1 − exp(− Σ_{k=1}^{r−1} k/365) = 1 − exp(−r(r−1)/730) ≈ 1 − exp(−r²/730),    (1.34)

since exp(r/730) ≈ 1. For r = 30, r = 40 and r = 50 we find P[A] ≈ 0.71, P[A] ≈ 0.89 and P[A] ≈ 0.97, respectively.
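The exact product formula and the exponential approximation from (1.34) are easy to compare numerically; the following Python sketch (not part of the original notes) does so for a few group sizes:

```python
import math

def birthday_exact(r):
    """P[at least one shared birthday] among r people, via the exact product."""
    p_distinct = 1.0
    for k in range(r):
        p_distinct *= (365 - k) / 365
    return 1 - p_distinct

def birthday_approx(r):
    """The approximation 1 - exp(-r(r-1)/730) from (1.34)."""
    return 1 - math.exp(-r * (r - 1) / 730)

for r in (30, 40, 50):
    print(r, round(birthday_exact(r), 3), round(birthday_approx(r), 3))
```

For these values of r the two columns agree to within about 0.01, consistent with the estimates in the text.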
In the calculation of |Ω| and |A^c|, we see instances of “selecting an ordered sample from a set, with and without repetitions” (the set in question being {1, 2, 3, ..., 365}, from which r elements are drawn). In many situations we are also interested in selecting an unordered sample from a set. All situations are recorded in the following proposition.
Proposition 1.12. The number of choices of a sample of size r ∈ N out of {1, 2, ..., n} is given as follows:

                  with repetitions       without repetitions
    ordered       n^r                    n!/(n−r)!
    unordered     (n+r−1 choose r)       (n choose r)

For the case without repetitions, we additionally require r ≤ n. In the table above we used the notations k! = k · (k − 1) · ... · 1 for k ∈ N (and 0! = 1), as well as (n choose k) = n! / (k!(n−k)!) for 0 ≤ k ≤ n.
Proof.

• Ordered samples, with repetitions: This is a special case of Remark 1.10. More precisely, we use

    Ω1 = {(ω1, ..., ωr) ; ωj ∈ {1, 2, ..., n}} = {1, 2, ..., n}^r,    (1.35)

with |Ω1| = n^r.

• Ordered samples, without repetitions: Here we use

    Ω2 = {(ω1, ..., ωr) ; ωj ∈ {1, 2, ..., n}, ωi ≠ ωj for i ≠ j},    (1.36)

with |Ω2| = n · (n − 1) · ... · (n − r + 1) = n!/(n−r)!.

• Unordered samples, without repetitions: Here we use

    Ω3 = {{ω1, ..., ωr} ; ωj ∈ {1, 2, ..., n}, ωi ≠ ωj for i ≠ j}.    (1.37)

Here r! · |Ω3| = |Ω2| = n!/(n−r)! holds: this is because for r ∈ {1, ..., n} different elements ω1, ..., ωr, there are exactly r! possibilities of reordering them.

• Unordered samples, with repetitions: The sample space can be written as

    Ω4 = {(ω1, ..., ωr) ; ωj ∈ {1, 2, ..., n}, 1 ≤ ω1 ≤ ω2 ≤ ... ≤ ωr ≤ n}.    (1.38)

We visualize an element of Ω4 as follows: We separate the n numbers 1, ..., n by n − 1 lines (|), and for each instance of one of these numbers within the sequence (ω1, ..., ωr), we put a dot (•) in the respective bin.

Example: Let n = 6, r = 5. The element (1, 1, 3, 4, 6) ∈ Ω4 corresponds to the string

    • • | | • | • | | •,

and the element (2, 2, 2, 5, 5) ∈ Ω4 corresponds to the string

    | • • • | | | • • |.

The number of different strings therefore corresponds to the number of choices of a set of r elements (the positions of the dots) out of a set with n + r − 1 elements (the positions in the string of dots and lines), which is exactly (n+r−1 choose r) by the previous steps.
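The four counts in Proposition 1.12 can be checked against brute-force enumeration; the following Python sketch (an illustration, not part of the original notes) does so for n = 6, r = 3:

```python
import math
from itertools import product, permutations, combinations, combinations_with_replacement

n, r = 6, 3
items = range(1, n + 1)

# Brute-force enumerations of the four sampling schemes from {1, ..., n}.
counts = {
    "ordered, with reps":      len(list(product(items, repeat=r))),
    "ordered, without reps":   len(list(permutations(items, r))),
    "unordered, without reps": len(list(combinations(items, r))),
    "unordered, with reps":    len(list(combinations_with_replacement(items, r))),
}

# The closed forms from Proposition 1.12.
formulas = {
    "ordered, with reps":      n ** r,
    "ordered, without reps":   math.perm(n, r),          # n! / (n-r)!
    "unordered, without reps": math.comb(n, r),          # C(n, r)
    "unordered, with reps":    math.comb(n + r - 1, r),  # C(n+r-1, r)
}

assert counts == formulas
print(counts)
```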
Example 1.13. A committee of 12 persons consists of 3 representatives of group A, 4 of group B and 5 of group C. We want to choose a subcommittee of 5 persons uniformly at random. What is the probability that this subcommittee consists of

• one member of group A,
• two members of group B,
• two members of group C?

We denote this event by E. Note that we do not specify the order within the groups and, obviously, there are no repetitions in the choice of the members. Thus we have

• (3 choose 1) choices for the member from group A,
• (4 choose 2) choices for the members from group B,
• (5 choose 2) choices for the members from group C,

and of course (12 choose 5) choices for the entire subcommittee. Therefore the probability we look for is

    P[E] = (3 choose 1) · (4 choose 2) · (5 choose 2) / (12 choose 5) = 5/22 ≈ 0.23.    (1.39)
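The arithmetic in (1.39) can be reproduced with exact fractions; a short sketch (not part of the original notes):

```python
import math
from fractions import Fraction

# Example 1.13: choose 1 of 3 (group A), 2 of 4 (group B), 2 of 5 (group C).
favorable = math.comb(3, 1) * math.comb(4, 2) * math.comb(5, 2)
total = math.comb(12, 5)

p = Fraction(favorable, total)
print(p, float(p))  # 5/22 ≈ 0.227
```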
2. Conditional probability and stochastic independence
(Reference: [1, Chapter 3], or [2, Sections 3.1, 3.3])
In this chapter we introduce the notion of conditional probability. Intuitively, the idea is that the
existence of “partial knowledge” should influence how we determine the likelihood of a given
outcome.
2.1. Conditional probability
Let us start with a very easy example.
Example 2.1. We throw two dice and ask for the probability that the sum of the numbers of both dice is smaller than or equal to 7. We call this event A. Assuming that the dice are fair, this experiment is modelled by

    (Ω, F, P) = ({1, 2, 3, 4, 5, 6}^2, P(Ω), P),    P[ · ] = | · | / 36.    (2.1)

Of course, A is given by

    A = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5),
         (3, 1), (3, 2), (3, 3), (3, 4), (4, 1), (4, 2), (4, 3), (5, 1), (5, 2), (6, 1)},    (2.2)

so P[A] = 21/36 = 7/12. Now imagine we are given the information that one of the dice shows the number 6. We call this event B, i.e.

    B = {(1, 6), (2, 6), (3, 6), (4, 6), (5, 6), (6, 6), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}.    (2.3)

If we already know that B happens, how likely is A? Clearly, the only outcomes of A that can still have occurred are those in

    A ∩ B = {(1, 6), (6, 1)}.    (2.4)

Thus, knowing that B occurred, we should now estimate the probability that A occurs by restricting the sample space Ω to B, so:

    (|A ∩ B| / |Ω|) / (|B| / |Ω|) = |A ∩ B| / |B| = 2/11.    (2.5)

We elevate the term on the left-hand side to a definition.
Definition 2.2. Let (Ω, F, P) be a probability space. Assume that the event B ∈ F has a positive probability P[B] > 0. We define the conditional probability of A ∈ F given B by

    P[A|B] = P[A ∩ B] / P[B].    (2.6)

Remark 2.3.
(i) If the events A and B are mutually exclusive (A ∩ B = ∅), then we always have P[A|B] = 0, whenever the latter is defined.
(ii) One can rewrite equation (2.6) as

    P[A ∩ B] = P[B] · P[A|B].    (2.7)

This is sometimes called the multiplication theorem.
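On a finite Laplace space, Definition 2.2 amounts to counting inside the conditioning event. As a sketch (not part of the original notes), the following Python snippet recomputes Example 2.1:

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))

def cond(A, B):
    """Conditional probability P[A | B] = P[A ∩ B] / P[B], requiring P[B] > 0."""
    assert P(B) > 0
    return P(A & B) / P(B)

A = {w for w in omega if sum(w) <= 7}   # sum of both dice at most 7
B = {w for w in omega if 6 in w}        # at least one die shows a 6
print(cond(A, B))  # 2/11
```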
Proposition 2.4. Let (Ω, F, P) be a probability space. Assume that the event B ∈ F has a
positive probability P[B] > 0. Then P[ · |B] defines a probability distribution on (Ω, F) as well.
Proof. The proof is given as an exercise.
End of Lecture 3
2.2. The law of total probability and Bayes’ theorem
In this section, we fix a probability space (Ω, F, P).
The following result is known as the law of total probability.
Theorem 2.5. Let B1, ..., Bn ∈ F with P[Bj] > 0 for all 1 ≤ j ≤ n and ⋃_{j=1}^n Bj = Ω, with Bj ∩ Bk = ∅ for every j ≠ k. Then we have for all A ∈ F that

    P[A] = Σ_{j=1}^n P[Bj] P[A|Bj].    (2.8)
Proof. Note that

    P[A] = P[A ∩ Ω] = P[A ∩ ⋃_{j=1}^n Bj] = P[⋃_{j=1}^n (A ∩ Bj)]
         = Σ_{j=1}^n P[A ∩ Bj] = Σ_{j=1}^n P[Bj] P[A|Bj],    (2.9)

where the fourth equality follows from Proposition 1.9, (iv) (the sets A ∩ Bj are pairwise disjoint), and the last one from (2.7).
We can right away combine the previous theorem with the definition of the conditional
probability to obtain the following result, called Bayes’ theorem:
Theorem 2.6. Under the same assumptions as in Theorem 2.5, we have for every 1 ≤ k ≤ n that

    P[Bk|A] = P[Bk] P[A|Bk] / Σ_{j=1}^n P[Bj] P[A|Bj].    (2.10)
Proof. Using (2.6) in the first equality, and (2.7) together with (2.8) in the second,

    P[Bk|A] = P[Bk ∩ A] / P[A] = P[Bk] P[A|Bk] / Σ_{j=1}^n P[Bj] P[A|Bj].    (2.11)
Example 2.7. Biochemical tests for a certain marker / antigen / disease / ... within a population are never absolutely reliable. Let T denote the event “the test is positive”, and let M be the event “a given individual has the marker”. We assume that the test in question satisfies

    P[T|M] = 0.99    (sensitivity),
    P[T^c|M^c] = 0.99    (specificity),    (2.12)

and the marker we are looking for is such that, for the population under consideration,

    P[M] = 0.01    (prevalence).    (2.13)

What is the probability that someone who tests positive actually has the marker / antigen / disease?

Observe that P[T|M^c] = 1 − P[T^c|M^c] = 1 − 0.99 = 0.01. We use Bayes’ theorem 2.6:

    P[M|T] = P[T|M] · P[M] / (P[T|M] · P[M] + P[T|M^c] · P[M^c])
           = (0.99 · 0.01) / (0.99 · 0.01 + 0.01 · 0.99)
           = 1/2.    (2.14)

We see that because the trait under consideration is so rare, half of those who test positive do not actually have this trait, even though the test is fairly reliable!
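The computation in (2.14) is a one-line application of Bayes’ theorem; the following sketch (not part of the original notes) packages it as a function so the effect of the prevalence can be explored:

```python
def posterior(sensitivity, specificity, prevalence):
    """P[M | T] via Bayes' theorem, with P[T | M^c] = 1 - specificity."""
    p_T_given_M = sensitivity
    p_T_given_not_M = 1 - specificity
    numerator = p_T_given_M * prevalence
    denominator = numerator + p_T_given_not_M * (1 - prevalence)
    return numerator / denominator

print(posterior(0.99, 0.99, 0.01))   # 0.5, as in Example 2.7
print(posterior(0.99, 0.99, 0.10))   # with higher prevalence, the posterior rises
```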
Remark 2.8. Both the law of total probability and Bayes’ theorem remain valid if we have countably many pairwise disjoint B1, B2, ... ∈ F with P[Bj] > 0 for all j ∈ N and ⋃_{j=1}^∞ Bj = Ω. In this case, (2.8) and (2.10) become

    P[A] = Σ_{j=1}^∞ P[Bj] P[A|Bj], and    (2.15)

    P[Bk|A] = P[Bk] P[A|Bk] / Σ_{j=1}^∞ P[Bj] P[A|Bj],    (2.16)

respectively.
2.3. Stochastic independence

We now introduce the notion of stochastic independence of events, which is one of the central concepts in probability theory and statistics. Again, we fix a probability space (Ω, F, P) in this section.

Heuristics: The events A, B ∈ F should be independent if the occurrence of A has no influence on the occurrence of B, and vice versa. Specifically, if A happens, it should be neither more nor less likely that B occurs, and vice versa, so

    P[A] = P[A|B] = P[A ∩ B] / P[B],
    P[B] = P[B|A] = P[A ∩ B] / P[A],    (2.17)

where we implicitly assumed that P[A], P[B] > 0. We turn this reasoning into a definition.
Definition 2.9.
(i) The events A, B ∈ F are called (stochastically) independent if

    P[A ∩ B] = P[A] · P[B].    (2.18)

(ii) Let n ∈ N, n ≥ 2. The events A1, A2, ..., An ∈ F are called jointly (stochastically) independent if for every {i1, ..., im} ⊆ {1, ..., n} with i1, ..., im pairwise distinct,

    P[Ai1 ∩ ... ∩ Aim] = P[Ai1] · ... · P[Aim].    (2.19)
Remark 2.10.
(i) Note that both definitions include the case that a given event has probability zero. If we assume that P[A] > 0 and P[B] > 0, then (2.17) and (2.18) are equivalent.
(ii) The events ∅ and Ω are independent from any other given event. Intuitively, they contain “no additional information” on the probability.
(iii) Equation (2.19) means that the occurrence of any subset of the events A1, ..., An does not give additional information on the occurrence of the others. For instance,

    P[A1|A2 ∩ ... ∩ An] = P[A1 ∩ A2 ∩ ... ∩ An] / P[A2 ∩ ... ∩ An]
                        = Π_{j=1}^n P[Aj] / Π_{j=2}^n P[Aj] = P[A1],    (2.20)

provided that P[A2 ∩ ... ∩ An] > 0.
(iv) We stress that stochastic independence of two events A and B has nothing to do with them being disjoint as sets! In fact, if A and B are disjoint and independent, then

    0 = P[∅] = P[A ∩ B] = P[A] · P[B],    (2.21)

so unless P[A] = 0 or P[B] = 0, disjoint events A and B are not independent.

We illustrate the concept of independence with a number of examples.
Example 2.11.
(i) We draw a card uniformly at random from a standard card deck¹, with

    Ω = {(i, j) ; i ∈ {♣, ♠, ♦, ♥}, j ∈ {1, 2, ..., 13}},    (2.22)

equipped with the discrete uniform distribution. Consider the events

    A = {(♥, j) ; j ∈ {1, ..., 13}} = “drawing a ♥-card”,
    B = {(i, 1) ; i ∈ {♣, ♠, ♦, ♥}} = “drawing an ace”.    (2.23)

Clearly, we have that

    A ∩ B = {(♥, 1)}.    (2.24)

With this, we see that

    P[A] = |A| / |Ω| = 13/52 = 1/4,    P[B] = |B| / |Ω| = 4/52 = 1/13,
    P[A ∩ B] = |A ∩ B| / |Ω| = 1/52 = P[A] · P[B].    (2.25)

So A and B are independent.
(ii) Consider tossing a fair coin twice, with Ω = {(H, H), (H, T), (T, H), (T, T)} and the uniform distribution, and the events

    A = “Heads” comes up in the first round = {(H, H), (H, T)},    P[A] = 1/2,
    B = “Heads” comes up in the second round = {(H, H), (T, H)},    P[B] = 1/2,
    C = “Heads” comes up exactly once = {(H, T), (T, H)},    P[C] = 1/2.

Then

    P[A ∩ B] = 1/4 = P[A] · P[B],    P[A ∩ C] = 1/4 = P[B ∩ C].

However, P[A ∩ B ∩ C] = P[∅] = 0 ≠ P[A] · P[B] · P[C]. This shows that the events A, B and C are pairwise independent (meaning that any two events out of {A, B, C} are independent), but not jointly independent.
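This distinction between pairwise and joint independence can be checked by enumeration. As a sketch (not part of the original notes), the following Python snippet verifies it on the four-outcome space of example (ii):

```python
from fractions import Fraction
from itertools import product, combinations

omega = set(product("HT", repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}          # heads in the first round
B = {w for w in omega if w[1] == "H"}          # heads in the second round
C = {w for w in omega if w.count("H") == 1}    # heads exactly once

# Pairwise independence holds for every pair of events...
for X, Y in combinations([A, B, C], 2):
    assert P(X & Y) == P(X) * P(Y)

# ...but joint independence fails.
assert P(A & B & C) != P(A) * P(B) * P(C)
print(P(A & B & C), P(A) * P(B) * P(C))  # 0 versus 1/8
```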
We finish this section with the following result:

Theorem 2.12. Let A1, A2, ..., An ∈ F be jointly independent. Then B1, B2, ..., Bn with Bi ∈ {Ai, Ai^c} for 1 ≤ i ≤ n are also jointly independent.

¹With 52 French-suited playing cards.
Proof. We only show the case n = 2 (the general case follows by induction). Since (A1 ∩ A2) ∪ (A1 ∩ A2^c) = A1 is a disjoint union, we have

    P[A1 ∩ A2] + P[A1 ∩ A2^c] = P[A1],    (2.26)

and since P[A1 ∩ A2] = P[A1] · P[A2] by independence, this yields

    P[A1 ∩ A2^c] = P[A1] · (1 − P[A2]) = P[A1] · P[A2^c].

By exchanging the roles of A1 and A2, we also have

    P[A1^c ∩ A2] = P[A1^c] · P[A2].    (2.27)

We can finally use the same argument as in (2.26) (which implied the independence of A1 and A2^c from the independence of A1 and A2) to infer the independence of A1^c and A2^c from the independence of A1^c and A2.
End of Lecture 4
3. Discrete distributions
(Reference: [1, Chapter 4], or [2, Sections 2.1-2.5.1])
In the present chapter, we define the most important discrete distributions.
Definition 3.1. Let (Ω, P(Ω), P) be a probability space with a countable sample space Ω.¹ The probability measure P is then called a discrete distribution. The function

    p : Ω → [0, 1],    p(ω) = P[{ω}]    (3.1)

is the probability mass function of the distribution.

Obviously, if we are given a discrete distribution on (Ω, P(Ω)), the probability mass function p is uniquely determined. Conversely, every collection (p(ω))_{ω∈Ω} of non-negative numbers with Σ_{ω∈Ω} p(ω) = 1 determines a unique probability measure on (Ω, P(Ω)), which is the statement of the following proposition.
Proposition 3.2. Let Ω be countable and p : Ω → [0, 1] a map fulfilling

    Σ_{ω∈Ω} p(ω) = 1.    (3.2)

Then the map

    P : P(Ω) → [0, 1],    A ↦ P[A] = Σ_{ω∈A} p(ω)    (3.3)

defines a probability measure on (Ω, P(Ω)).
Proof. We first remark that since Ω is countable and the real numbers (p(ω))_{ω∈Ω} are non-negative, we can take any enumeration Ω = {ω1, ω2, ...} and define

    Σ_{ω∈Ω} p(ω) = Σ_{i=1}^N p(ωi),    if Ω = {ω1, ..., ωN} is finite,
    Σ_{ω∈Ω} p(ω) = lim_{N→∞} Σ_{i=1}^N p(ωi),    if Ω = {ω1, ω2, ...} is infinite,    (3.4)

and the value of the series does not depend on the choice of the enumeration. Moreover, since A ⊆ Ω is also countable, the expression for P[A] in (3.3) is well-defined and in [0, 1].

The condition (P1) is immediate by (3.2). For (P2), we consider Aj ∈ P(Ω) for j ∈ N pairwise disjoint and use

    P[⋃_{j=1}^∞ Aj] = Σ_{ω ∈ ⋃_{j=1}^∞ Aj} p(ω) = Σ_{j=1}^∞ Σ_{ω∈Aj} p(ω) = Σ_{j=1}^∞ P[Aj].    (3.5)

In the second equality, we used again the fact that the (p(ω))_{ω∈Ω} are non-negative.

¹We recall again that finite sets are countable by our convention.
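Proposition 3.2 translates directly into code for a finite Ω: a probability measure is obtained from a pmf by summing over events. A minimal sketch (not part of the original notes), using the fair die as pmf:

```python
from fractions import Fraction

def measure_from_pmf(pmf):
    """Given a pmf p on a finite Omega (as a dict), return the map A -> P[A]
    from Proposition 3.2."""
    assert all(v >= 0 for v in pmf.values())
    assert sum(pmf.values()) == 1
    return lambda A: sum(pmf[w] for w in A)

# pmf of a fair die
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
P = measure_from_pmf(pmf)
print(P({2, 4, 6}))  # 1/2
```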
We will now present some of the most important discrete distributions.
The discrete uniform distribution U(Ω)

    Ω finite,    p(ω) = 1 / |Ω|.    (3.6)

This is just giving a name to the distribution considered already multiple times, see Example 1.8.
The Bernoulli distribution Ber(p)

    Ω = {0, 1},    p(1) = p ∈ [0, 1],    p(0) = 1 − p.    (3.7)

The Bernoulli distribution models random experiments in which a “success” occurs with probability p and a “failure” occurs with probability 1 − p (for instance, tossing a biased coin). Such experiments are also called Bernoulli experiments.
The Binomial distribution Bin(n, p)

    Ω = {0, 1, ..., n},   p(k) = \binom{n}{k} p^k (1 − p)^{n−k},   p ∈ [0, 1], n ∈ N.   (3.8)

Note that

    Σ_{k=0}^n p(k) = Σ_{k=0}^n \binom{n}{k} p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1.   (3.9)
The binomial distribution extends the Bernoulli distribution in the following way: it models
how many out of n independent experiments with the same success parameter p ∈ [0, 1] are
successful. To explain this, consider the auxiliary probability space

    ({0, 1}^n, P({0, 1}^n), Q),   Q[{(ω_1, ..., ω_n)}] = p^{Σ_{j=1}^n ω_j} · (1 − p)^{n − Σ_{j=1}^n ω_j},   (3.10)

where the first factor is p^{# of successes} and the second is (1 − p)^{# of failures}.
Here, the string (ω1 , ..., ωn ) ∈ {0, 1}n stands for the successes and failures of the experiment
in the order observed, i.e.
    ω_j = { 1, if the jth experiment is a success,
          { 0, if the jth experiment is a failure.   (3.11)
For instance, the string (1, 0, 0, 1) means that the first and last of four experiments are successes,
whereas the second and third experiments are failures. By the product structure in (3.10), the
experiments are independent. Now consider the event (for 0 ≤ k ≤ n)
    E_k = { (ω_1, ..., ω_n) ∈ {0, 1}^n ; Σ_{j=1}^n ω_j = k } = "exactly k successes".   (3.12)
We have

    Q[E_k] = |E_k| p^k (1 − p)^{n−k} = \binom{n}{k} p^k (1 − p)^{n−k}.   (3.13)
Example 3.3. We throw a die 4 times and we are interested in the number of times that the number
six shows up. This is modelled by the binomial distribution Bin(4, 1/6). In the description (3.10),
"0" stands for the occurrence of a number other than six (failure), whereas "1" stands for the
occurrence of a six (success). In this example, we have
    Probability                                Outcomes in {0, 1}^4
    p(0) = (5/6)^4                             (0, 0, 0, 0)
    p(1) = \binom{4}{1} (1/6) (5/6)^3          (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
    p(2) = \binom{4}{2} (1/6)^2 (5/6)^2        (1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1)
    p(3) = \binom{4}{3} (1/6)^3 (5/6)          (0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)
    p(4) = (1/6)^4                             (1, 1, 1, 1)
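The bookkeeping in the table can be checked mechanically. The following sketch (our own code, not part of the text; the variable names are ours) enumerates all 16 strings in {0, 1}^4 and verifies that the counts and probabilities match the binomial formula from (3.13):

```python
from itertools import product
from math import comb

p = 1 / 6                                     # success probability: rolling a six
counts = {k: 0 for k in range(5)}
prob = {k: 0.0 for k in range(5)}
for omega in product([0, 1], repeat=4):       # all 16 outcome strings
    k = sum(omega)                            # number of successes
    counts[k] += 1
    prob[k] += p**k * (1 - p)**(4 - k)        # Q[{omega}] as in (3.10)

for k in range(5):
    assert counts[k] == comb(4, k)            # |E_k| = binom(4, k)
    assert abs(prob[k] - comb(4, k) * p**k * (1 - p)**(4 - k)) < 1e-12
```

Summing `prob` over all k gives 1, in line with (3.9).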
We can also order the outcomes in the form of a "tree diagram": starting from the root, each of
the 4 levels branches into "0" with probability 1 − p and "1" with probability p, and the 16
leaves are the strings (0, 0, 0, 0) through (1, 1, 1, 1).

Figure 3.1.: Tree diagram of 4 successive independent Bernoulli experiments.
Remark 3.4. The above example shows that the same question, "How likely is it that the number
six comes up exactly twice when rolling a die four times?", is treated much more efficiently on
the probability space

    (Ω = {0, 1, 2, 3, 4}, P(Ω), P),   P = Bin(4, 1/6),

where we simply have

    P["2 sixes"] = P[{2}] = \binom{4}{2} · (1/6)^2 · (5/6)^2,
than on the probability space
    (Ω̃ = {0, 1}^n, P(Ω̃), Q),   Q[{(ω_1, ..., ω_n)}] = p^{Σ_{j=1}^n ω_j} (1 − p)^{n − Σ_{j=1}^n ω_j},

where

    Q["2 sixes"] = Q[{(1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1)}]
                 = \binom{4}{2} · (1/6)^2 · (5/6)^2.
The information we need is already contained in the space (Ω = {0, 1, 2, 3, 4}, P(Ω), P). This
concept of a reduction in complexity will motivate the study of random variables later.
The Geometric distribution Geo(p)

    Ω = N,   p(k) = (1 − p)^{k−1} p,   p ∈ (0, 1).   (3.14)

Note that

    Σ_{k=1}^∞ p(k) = Σ_{k=1}^∞ (1 − p)^{k−1} p = p / (1 − (1 − p)) = 1.   (3.15)
The geometric distribution describes the number of repetitions of a Bernoulli experiment (with
success parameter p ∈ (0, 1)) until the first success.
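As a quick sanity check on (3.15) (our own sketch, with an arbitrary choice of p), the partial sums of the geometric mass function have the closed form 1 − (1 − p)^K and tend to 1:

```python
p = 0.3                                        # an arbitrary success parameter in (0, 1)
total = 0.0
for k in range(1, 51):
    total += (1 - p)**(k - 1) * p              # geometric pmf from (3.14)
    assert abs(total - (1 - (1 - p)**k)) < 1e-12   # closed form of the partial sum
assert total > 1 - 1e-7                        # the series tends to 1, confirming (3.15)
```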
The Hypergeometric distribution H(N, M, n)

    Ω = {0, ..., n},   p(k) = \binom{M}{k} \binom{N−M}{n−k} / \binom{N}{n},   N, M, n ∈ N, 0 ≤ n, M ≤ N.   (3.16)
The hypergeometric distribution should be understood as follows: out of a set of N elements,
M elements have a certain favorable property. We choose uniformly at random an unordered
sample of 0 ≤ n ≤ N elements out of the large set without repetitions. Then p(k) denotes the
probability that exactly 0 ≤ k ≤ n of them have the favorable property (this probability is always
zero if M < k, which can happen if n > M).
Example 3.5. In an urn there are 10 balls, three of which are green and seven red. We draw (at once)
four balls from the urn. Here

    N = 10 (# of balls in urn),   M = 3 (# of green balls),   n = 4 (# of balls drawn).

The probability that exactly two of the balls drawn are green is

    p(2) = \binom{3}{2} \binom{7}{2} / \binom{10}{4}.   (3.17)
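Formula (3.17) is a one-liner with Python's `math.comb` (a sketch of ours, not from the text):

```python
from math import comb

# Hypergeometric probability from Example 3.5
N, M, n, k = 10, 3, 4, 2
p2 = comb(M, k) * comb(N - M, n - k) / comb(N, n)
assert abs(p2 - 0.3) < 1e-12   # 3 * 21 / 210 = 0.3
```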
The Poisson distribution Pois(λ)

    Ω = N_0,   p(k) = (λ^k / k!) e^{−λ},   λ > 0.   (3.18)

Note that

    Σ_{k=0}^∞ p(k) = Σ_{k=0}^∞ (λ^k / k!) e^{−λ} = e^{−λ} · e^{λ} = 1.   (3.19)
The Poisson distribution is a natural distribution for modelling events that in principle can
occur infinitely often (for instance the number of goals in a football game, or the number of
raindrops falling in a given area during a given time, ...).
An important application is the following approximation of the Binomial distribution by the
Poisson distribution:
Proposition 3.6. Let λ > 0 be fixed and p_n = λ/n for n ∈ N. Then, for every k ∈ N_0, it holds that

    lim_{n→∞} \binom{n}{k} p_n^k (1 − p_n)^{n−k} = (λ^k / k!) e^{−λ}.   (3.20)
Proof. We compute

    \binom{n}{k} p_n^k (1 − p_n)^{n−k} = (n! / (k!(n − k)!)) · (λ^k / n^k) · (1 − λ/n)^n · (1 − λ/n)^{−k}
                                       = (λ^k / k!) · (n/n) · ((n − 1)/n) · ... · ((n − k + 1)/n) · (1 − λ/n)^n · (1 − λ/n)^{−k}   (3.21)
                                       → (λ^k / k!) · 1 · e^{−λ} · 1,   as n → ∞.

End of Lecture 5
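Proposition 3.6 can be illustrated numerically; the sketch below (our code, with an arbitrary λ) compares the Bin(n, λ/n) mass function with its Poisson limit for growing n:

```python
from math import comb, exp, factorial

lam = 2.0

def binom_pmf(n, k):
    # mass function of Bin(n, lam/n) as in (3.20)
    p = lam / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

poisson = [lam**k / factorial(k) * exp(-lam) for k in range(6)]
errors = {n: max(abs(binom_pmf(n, k) - poisson[k]) for k in range(6))
          for n in (10, 100, 10_000)}
assert errors[10_000] < errors[100] < errors[10]   # the approximation improves with n
assert errors[10_000] < 1e-3
```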
4. Introduction to statistical tests and
Neyman-Pearson lemma
(Reference: [1, Chapters 25-26], or [2, Sections 10.1-10.2])
In this section, we introduce the notion of statistical tests and prove the Neyman-Pearson
lemma for discrete distributions. We motivate this using an example.
4.1. Basic notions of statistical tests
Example 4.1. We consider a certain drug A that has a known efficacy of 60%. We want to
evaluate the
Claim: A new (more expensive) drug B has an efficacy of 70%.
To see whether the claim is valid, the drug B is tested on 100 persons. We choose the model

    Ω = {0, 1, ..., n},   n = 100,   (4.1)

and the outcome ω ∈ Ω is the number of persons out of the 100 with a positive reaction to the
administered drug B.
Question: Do we have

    P = Bin(n, p),   p = 0.6   (null hypothesis H_0),
    or Q = Bin(n, q),   q = 0.7   (alternative hypothesis H_1)?   (4.2)
The challenge is to determine a statistical test

    φ : Ω → {0, 1}   (4.3)

that decides between H_0 and H_1 based on the test result, and which should be "optimal" in a
suitable sense. Here, φ(ω) = 0 models keeping the null hypothesis given the observed outcome ω,
and φ(ω) = 1 models rejecting the null hypothesis given the observed outcome ω.
Definition 4.2. Let Ω be countable and φ : Ω → {0, 1} a statistical test. A type I error occurs
if we falsely reject the null hypothesis H_0, i.e. φ(ω) = 1 when H_0 is correct. A type II error
occurs if we falsely do not reject the null hypothesis H_0, i.e. φ(ω) = 0 when H_1 is correct.
The null hypothesis H_0 is chosen such that falsely rejecting it (a type I error) should be unlikely.   (4.4)

This means: when we set up statistical tests, a false rejection of H_0, i.e. a type I error, is
considered more problematic than a false non-rejection, i.e. a type II error. We illustrate
this in two examples:
28
4. Introduction to statistical tests and Neyman-Pearson lemma
• Going mushroom hunting: Suppose we are unsure whether certain collected wild mushrooms are edible or poisonous. Then

    H_0: mushrooms are poisonous,   H_1: mushrooms are edible.

  We have chosen H_0 in such a way that falsely rejecting it is the more severe error!
• Let us consider the previous Example 4.1. We want to avoid a situation in which a new
  and more expensive drug is authorized while being no more effective than the standard
  one. Accordingly, we should define

    H_0: drug B has the same efficacy 0.6 as drug A,
    H_1: drug B has the higher efficacy 0.7.

  This is of course nothing else than (4.2).
How likely are the errors of type I and II? We consider the case of simple hypotheses,
where H_0 corresponds to the distribution P and H_1 to the distribution Q. This is in line with our
Example 4.1. The probability of a type I error is given by

    P[φ = 1] = P[{ω ∈ Ω ; φ(ω) = 1}].   (4.5)

The probability of a type II error, on the other hand, is

    Q[φ = 0] = Q[{ω ∈ Ω ; φ(ω) = 0}].   (4.6)
If we consider the extreme case

    φ_extreme(ω) = 0   for all ω ∈ Ω,   (4.7)

where we never reject H_0, we see that the type I error has probability

    P[φ_extreme = 1] = P[∅] = 0.   (4.8)

On the other hand, the type II error of course has probability

    Q[φ_extreme = 0] = Q[Ω] = 1.   (4.9)
For a meaningful test, we need to do the following:

• Specify an upper bound α ∈ (0, 1) for the type I error, that is, require

    P[φ = 1] ≤ α.   (4.10)

  The value α is called the significance level for the test. Typical levels in practice are
  α = 0.05 or α = 0.01.
• Minimize the value of Q[φ = 0] under the constraint P[φ = 1] ≤ α. If a solution
  φ* : Ω → {0, 1} exists, it is called the best / most powerful test at level α.

We return to Example 4.1: heuristically, given α ∈ (0, 1), we should try to look for a test
φ : Ω = {0, ..., 100} → {0, 1} of the form

    φ(ω) = { 1, ω ∈ C_α = {k_α, k_α + 1, ..., 100} ⇔ ω ≥ k_α,
           { 0, ω ∈ C_α^c = {0, 1, ..., k_α − 1} ⇔ ω < k_α,   (4.11)

for some k_α ∈ {0, ..., 100} to be determined. The range

    C_α = {k_α, k_α + 1, ..., 100}   (4.12)

is called the critical region of the test φ.
End of Lecture 6
4.2. The Neyman-Pearson lemma
In this section, we construct optimal tests for two simple hypotheses against each other. This is
done in the following result, known as the Neyman-Pearson lemma.
Lemma 4.3. Let Ω be countable and P and Q two discrete probability distributions on (Ω, P(Ω)),
with probability mass functions p and q, respectively. We let

    L(ω) = q(ω)/p(ω) ∈ [0, ∞],   ω ∈ Ω,   (4.13)

(where x/0 := ∞ for x ≥ 0) be the likelihood quotient of Q with respect to P. Assume that there
exists a test φ* : Ω → {0, 1} with

    φ*(ω) = { 1, L(ω) ≥ c_α*,
            { 0, L(ω) < c_α*,   (4.14)

and

    P[φ* = 1] = P[L(ω) ≥ c_α*] = α.   (4.15)
Then every other test φ : Ω → {0, 1} with P[φ = 1] ≤ α fulfills

    Q[φ = 0] ≥ Q[φ* = 0].   (4.16)

Proof. Consider the two critical regions

    C_α* = {ω ∈ Ω ; φ*(ω) = 1},   C_α = {ω ∈ Ω ; φ(ω) = 1}.   (4.17)
Now note that

    ω ∈ C_α*  ⇔  q(ω) − c_α* p(ω) ≥ 0
    ⇒  Σ_{ω∈C_α*} [q(ω) − c_α* p(ω)] ≥ Σ_{ω∈C_α} [q(ω) − c_α* p(ω)]
    ⇒  Q[φ* = 1] − c_α* P[φ* = 1] ≥ Q[φ = 1] − c_α* P[φ = 1]
    ⇒  Q[φ* = 1] − Q[φ = 1] ≥ c_α* (P[φ* = 1] − P[φ = 1]) ≥ 0   (since P[φ* = 1] = α, P[φ = 1] ≤ α and c_α* ≥ 0)
    ⇒  Q[φ = 0] = 1 − Q[φ = 1] ≥ 1 − Q[φ* = 1] = Q[φ* = 0].
Example 4.4. We continue Example 4.1. Here we have, for k ∈ {0, 1, 2, ..., n}:

    H_0: p(k) = \binom{n}{k} p^k (1 − p)^{n−k},   p = 0.6,
    H_1: q(k) = \binom{n}{k} q^k (1 − q)^{n−k},   q = 0.7,

    ⇒ L(k) = q(k)/p(k) = (q/p)^k · ((1 − q)/(1 − p))^{n−k} = ( (q/(1−q)) / (p/(1−p)) )^k · ((1 − q)/(1 − p))^n ≥ c_α*.   (4.18)

Since q > p, we have (q/(1−q)) / (p/(1−p)) > 1, and so

    L(k) ≥ c_α*  ⇔  k ≥ k_α*.   (4.19)

Then choose k_α* such that

    P[φ* = 1] = P[L(k) ≥ c_α*] = P[k ≥ k_α*] = α.   (4.20)

We consider n = 100, p = 0.6 and α ≤ 0.01.¹ We require

    P[k ≥ k_α*] = P[{k_α*, ..., n}] = Σ_{k=k_α*}^n \binom{n}{k} p^k (1 − p)^{n−k} \overset{!}{=} α.   (4.21)
We compute that for k_α* = 72, we have α = 0.0084. The optimal test at this level is given by

    φ* : {0, 1, ..., 100} → {0, 1},   φ*(k) = { 1, k ≥ 72 (H_0 is rejected),
                                             { 0, k < 72 (H_0 is not rejected).   (4.22)

For the type I error, we have the probability

    P[φ* = 1] = P[k ≥ k_α*] = 0.0084.   (4.23)

¹ Not every α ∈ (0, 1) can be chosen, see also Remark 4.5, (i) below.
The type II error is given by

    Q[φ* = 0] = Q[k < k_α*] = Σ_{k=0}^{k_α*−1} \binom{100}{k} q^k (1 − q)^{100−k} ≈ 0.62.   (4.24)

Note that the type II error is very large, but cannot be decreased, since the test is optimal by
the Neyman-Pearson Lemma 4.3. To decrease this error, we need to increase n.
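The two error probabilities from (4.23) and (4.24) can be recomputed directly; the sketch below is ours, and the asserted numerical values are the ones stated in the text:

```python
from math import comb

def binom_tail(n, p, k0):
    """P[X >= k0] for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

n, p, q, k_star = 100, 0.6, 0.7, 72
alpha = binom_tail(n, p, k_star)        # type I error, cf. (4.23)
beta = 1 - binom_tail(n, q, k_star)     # type II error, cf. (4.24)
assert abs(alpha - 0.0084) < 1e-3       # matches the value stated above
assert 0.58 < beta < 0.66               # approximately 0.62
```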
Remark 4.5. (i) Let us stress again that in the discrete set-up, not all significance levels
α ∈ (0, 1) can be chosen. The reason is that the probability α = P[φ = 1] can only attain
countably many values. Again in the above example, where P = Bin(100, 0.6), we have

    P[{70, 71, ..., 100}] = 0.0248,
    P[{71, 72, ..., 100}] = 0.0148,
    P[{72, 73, ..., 100}] = 0.0084,   (4.25)
    P[{73, 74, ..., 100}] = 0.0046.

The values on the right-hand side in (4.25) are some of the possible choices for α. In
practice, if we are looking for a test at significance level α = 0.01, we look for the largest
α ≤ 0.01 for which a test exists. From (4.25), we see that in the case of Examples 4.1
and 4.4, this means we take α = 0.0084 and k_α* = 72.
(ii) We used q = 0.7 to establish the equivalence (4.19), but in fact the only information we
used was that q > p. In other words, we could strengthen the test by not testing
H_0 : P = Bin(100, 0.6) against H_1 : Q = Bin(100, 0.7), but instead testing

    P = Bin(n, p),   p = 0.6   (null hypothesis H_0),
    or Q = Bin(n, q),   q > 0.6   (alternative hypothesis H̃_1).   (4.26)

The decisive factor here is that the likelihood quotient L(k) = q(k)/p(k) is monotone in k, which
is why (4.19) is valid. We say that the test φ* is a uniformly most powerful test for
H_0 : P = Bin(n, p) with p = 0.6 against H_1 : Q = Bin(n, q) with q > 0.6.
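The threshold search described in Remark 4.5, (i), taking the largest attainable level below the target, can be sketched as follows (our own code, not from the text):

```python
from math import comb

def tail(n, p, k0):
    # P[{k0, k0 + 1, ..., n}] under Bin(n, p), as in (4.25)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

target = 0.01
# smallest threshold whose tail probability does not exceed the target level
k_star = next(k for k in range(101) if tail(100, 0.6, k) <= target)
assert k_star == 72                                # as found in Examples 4.1 and 4.4
assert abs(tail(100, 0.6, k_star) - 0.0084) < 1e-3
```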
For completeness, we introduce two more notions concerning statistical tests and relate them
to the previous discussion.
Definition 4.6. Let P, Q be as in Lemma 4.3. We consider the respective Neyman-Pearson test φ*.

(i) Assume Ω ⊆ N_0 and that the likelihood quotient L(ω) of Q with respect to P is increasing
in ω. Let ω̃ ∈ Ω be the observed value of the test. The p-value of the test φ* given ω̃ is
given by

    p-value = P[{ω ∈ Ω ; ω ≥ ω̃}].   (4.27)

(ii) If Ω ⊆ N_0 and the likelihood quotient L(ω) of Q with respect to P is decreasing in ω and
ω̃ ∈ Ω is the observed value of the test, then the p-value of the test φ* given ω̃ is

    p-value = P[{ω ∈ Ω ; ω ≤ ω̃}].   (4.28)

(iii) The power of the test φ* is given by

    1 − β = Q[φ* = 1].   (4.29)
Let us briefly discuss these notions: the p-value of a test with a given observation ω̃ ∈ Ω is
the smallest level of significance under which the hypothesis H_0 can be rejected. Note that it
depends on the test and the realized observation ω̃. In other words:

    ω̃ ∈ C_α*  ⇔  p-value for ω̃ ≤ α.   (4.30)

The power of the test φ* is simply the probability that H_0 is rejected when H_1 is true. In (4.29),
β stands for the probability of a type II error (H_0 is not rejected when H_1 is true).
To summarize: here is a step-by-step procedure to construct and evaluate a Neyman-Pearson
test of the simple hypothesis H_0 = {P} against H_1 = {Q}:

1. A level of significance α ∈ (0, 1), depending on the statistical problem in question, is
given, for instance α = 0.01. This level is an upper bound for the type I error P[φ = 1].

2. The optimal test (for which Q[φ = 0] is minimal) is given by the Neyman-Pearson lemma:

    φ*(ω) = { 1, L(ω) = q(ω)/p(ω) ≥ c_α*,
            { 0, L(ω) < c_α*.   (4.31)

3. The value c_α* is determined by the condition P[φ* = 1] = P[L(ω) ≥ c_α*] = α̃, with α̃ ≤ α
the largest possible value out of the set {P[L(ω) ≥ c] ; c ∈ [0, ∞]} ∩ [0, α].

4. If Ω ⊆ N_0 and L(k) ≡ L(ω) is monotonically increasing in k, we can simplify the test to

    φ*(k) = { 1, k ≥ k_α*,
            { 0, k < k_α*.   (4.32)

The value k_α* is then determined by P[{k ∈ Ω ; k ≥ k_α*}] = α̃, where α̃ ≤ α is the largest
possible value out of the set {P[{k, k + 1, ...}] ; k ∈ Ω} ∩ [0, α]. Similar arguments can
be made if L(k) is monotonically decreasing in k.

5. Depending on the nature of the likelihood quotient, we can check whether the test φ*
is in fact a uniformly most powerful test of H_0 : {P} against a larger class of measures
H̃_1 : {Q_θ}_θ.
5. Continuous distributions
(Reference: [1, Chapter 5], or [2, Sections 1.2, 2.5.2, 2.6])
In this section, we introduce continuous distributions on Ω ⊆ R. Specifically, we want to be
able to talk about probability spaces like (R, F, P) or ([0, 1], F, P) with appropriate choices
of F and P. This requires some more details about σ-algebras, which we will present without
proofs.
Example 5.1. Consider the arrival of a train with delay. We assume that its arrival is "uniformly
distributed" between 1 PM and 2 PM. How can we model this? Suppose 0 corresponds to 1 PM and 1 to 2 PM.
We split [0, 1) into half-open intervals of equal length 1/n, n ∈ {2, 3, 4, ...}, whose leftmost points
are

    { j/n ; 0 ≤ j ≤ n − 1 } = { 0, 1/n, 2/n, ..., 1 − 1/n } ⊆ [0, 1).   (5.1)

The probability for the train to arrive within one of these intervals Δ_j = [j/n, (j+1)/n),
0 ≤ j ≤ n − 1, should be 1/n. For 0 ≤ a < b < 1, we should have approximately

    P[[a, b)] ≈ Σ_{j : a ≤ j/n < b} 1/n = (1/n) · #{j ; a ≤ j/n < b}  −→  ∫_a^b 1 dx,   as n → ∞.
a≤ n 0. Then the cumulative distribution function of Y is given by
(
Z x
0,
x < 0,
FY (x) =
λe−λt 1[0,∞) (t)dt =
(6.16)
−λx
1−e
, x ≥ 0.
−∞
43
6. Random variables
FY (x)
1
x
Figure 6.2.: Cumulative distribution function of Y ∼ E(λ)
(iii) In the previous examples (i) and (ii), the random variable X is discrete and Y is continuous.
Let us stress that this is not a dichotomy. For instance, let Y ∼ N(0, 1) and

    Z = Y · 1_{Y ≥ 0}.   (6.17)

Here we have

    F_Z(x) = { 0, x < 0,
             { Φ(x) = ∫_{−∞}^x φ(t) dt, x ≥ 0.   (6.18)

Note that Z is neither continuous (P[Z = 0] = 1/2), nor discrete (it can attain all values in
[0, ∞)).

Figure 6.3.: Cumulative distribution function of Z as defined in (6.17).
End of Lecture 9
We will now collect some general facts about cumulative distribution functions.
Lemma 6.7. Let X : (Ω, F) → (R, B(R)) be a real random variable. Its cumulative distribution
function F = FX satisfies the following properties:
(i) F (x) ∈ [0, 1] for all x ∈ R.
(ii) F is non-decreasing.
(iii) F is right continuous, i.e.

    lim_{ε↓0} F(x + ε) = F(x).   (6.19)

(iv) lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.
Proof. Claim (i) follows since F(x) = P[X ∈ (−∞, x]] ∈ [0, 1].
For claim (ii), we use the fact that for x ≤ x′, we have (−∞, x] ⊆ (−∞, x′] and so

    F(x) = P_X[(−∞, x]] ≤ P_X[(−∞, x′]] = F(x′).   (6.20)

Claim (iii) follows from the fact that for any probability measure Q and a sequence (A_n)_{n∈N} ⊆ F
with A_1 ⊇ A_2 ⊇ A_3 ⊇ ..., we have

    Q[⋂_{n=1}^∞ A_n] = lim_{n→∞} Q[A_n],   (6.21)

which is an exercise. We apply this to the probability measure Q = P_X and the sets A_n =
(−∞, x_n], where x_n ↓ x for some x ∈ R. Then

    F_X(x_n) = P_X[(−∞, x_n]] → P_X[(−∞, x]] = F_X(x),   as n → ∞,   (6.22)

since ⋂_{n=1}^∞ (−∞, x_n] = (−∞, x].
For (iv), we first see that for a sequence (A_n)_{n∈N} ⊆ F with A_1 ⊆ A_2 ⊆ A_3 ⊆ ..., we have

    Q[⋃_{n=1}^∞ A_n] = lim_{n→∞} Q[A_n],   (6.23)

by taking complements in (6.21). Now we consider a sequence of real numbers (a_n)_{n∈N} with
a_n → ∞ as n → ∞. Then

    F_X(a_n) = P_X[(−∞, a_n]] → P_X[R] = 1,   as n → ∞,   (6.24)

since R = ⋃_{n=1}^∞ (−∞, a_n] and (−∞, a_n] ⊆ (−∞, a_{n+1}] for all n ∈ N. The other claim follows
similarly, again by using (6.21).
In fact, the above properties characterize distribution functions in the following sense:
Theorem 6.8. Let F : R → R satisfy the properties (i) – (iv) of Lemma 6.7. Then there exists a
probability space (Ω, F, P) and a random variable X : (Ω, F) → (R, B(R)) such that F = FX .
The law PX of X is uniquely determined by F .
Proof. We define X : ((0, 1), B((0, 1))) → (R, B(R)) as follows:

    X(ω) = sup{y ∈ R ; F(y) < ω}.   (6.25)
Note that

    {ω ∈ (0, 1) ; X(ω) ≤ x} = {ω ∈ (0, 1) ; ω ≤ F(x)},   x ∈ R.   (6.26)

Indeed, if ω ≤ F(x), then x ∉ {y ∈ R ; F(y) < ω}, which implies x ≥ X(ω).
On the other hand, if ω ∈ (0, 1) with F(x) < ω, then since F is right continuous, there is ε > 0
with F(x + ε) < ω. Therefore X(ω) ≥ x + ε > x. This means that F(x) < ω implies that
X(ω) > x. Consequently, x ≥ X(ω) implies ω ≤ F(x).
We then equip the space ((0, 1), B((0, 1))) with the uniform distribution P = U((0, 1)).¹
Then the law of X has cumulative distribution function given by F. Indeed:

    F_X(x) = P[X ≤ x] = P[(0, F(x)]] = F(x),   (6.27)

using (6.26) in the second equality.
The proof of the second part (uniqueness of P_X) is omitted.
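The construction (6.25) is the basis of inverse-CDF sampling. For the exponential distribution function F(x) = 1 − e^{−λx}, the generalized inverse has the closed form −log(1 − ω)/λ, so sampling ω uniformly from (0, 1) produces E(λ)-distributed values. The sketch below (ours, with an arbitrary λ) checks this empirically:

```python
import random
from math import exp, log

lam = 2.0
random.seed(0)
# X(omega) = sup{y : F(y) < omega} = -log(1 - omega)/lam for the exponential CDF
samples = [-log(1 - random.random()) / lam for _ in range(100_000)]

x = 1.0
empirical = sum(s <= x for s in samples) / len(samples)
assert abs(empirical - (1 - exp(-lam * x))) < 0.01   # F(1) = 1 - e^{-2}
```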
Let us briefly explain the role of the discontinuity points of a cumulative distribution function
F. If we look at (i), (iii) in Example 6.6, we see that a jump of the cumulative distribution
function at a point x ∈ R corresponds to the probability P_X[{x}] = P[X = x]. More formally:

Lemma 6.9. Let X : (Ω, F) → (R, B(R)) be a real random variable with cumulative distribution
function F = F_X. Then, for every x ∈ R, we have

    P[X = x] = F(x) − F(x−),   (6.28)

where F(x−) (the left limit of F at x) is defined as

    F(x−) = lim_{ε↓0} F(x − ε).   (6.29)

Proof. Note that since F is non-decreasing, the limit in (6.29) is well defined and equal to
lim_{n→∞} F(x − 1/n). The claim (6.28) then follows from (6.21), the fact that

    {x} = ⋂_{n=1}^∞ (x − 1/n, x],   (6.30)

and

    F(x) − F(x − 1/n) = P[(x − 1/n, x]].   (6.31)
To summarize the previous results, we see that a cumulative distribution function F uniquely
determines a probability measure P on (R, B(R)) and vice versa. Let us also stress the fact that
the cumulative distribution function is really associated to the law of a random variable, and
not to the random variable itself. This motivates the following definition.

¹ Here, U((0, 1)) stands for the uniform distribution on the open interval (0, 1), viewed as a probability measure
on ((0, 1), B((0, 1))). Since continuous distributions give zero mass to points, this is essentially the same
distribution as the uniform distribution U([0, 1]) on [0, 1], viewed as a probability measure on (R, B(R)).
Definition 6.10. Let (Ω, F, P) be a probability space and X, Y : (Ω, F) → (R, B(R)) two
random variables. We say that X and Y are equal in distribution or identically distributed if

    P_X = P_Y   (⇔ F_X(x) = F_Y(x) for all x ∈ R).   (6.32)

This is denoted as X =d Y.
Example 6.11. (i) Consider again throwing two dice. We use the probability space (6.1). The
results of the first and second die are given by the random variables

    X : {1, ..., 6}² → {1, ..., 6},   X(ω_1, ω_2) = ω_1,
    Y : {1, ..., 6}² → {1, ..., 6},   Y(ω_1, ω_2) = ω_2.   (6.33)

We see that X ∼ U({1, ..., 6}) and Y ∼ U({1, ..., 6}), so X =d Y. Note that of course
X ≠ Y as functions, since for instance X(1, 2) = 1 ≠ 2 = Y(1, 2).
(ii) Let X be any continuous random variable and f_X the probability density of its law.
Assume that f_X is an even function, i.e.

    f_X(x) = f_X(−x)   for all x ∈ R.   (6.34)

Then X =d −X. Indeed, we have

    F_{−X}(x) = P[−X ≤ x] = P[X ≥ −x] = 1 − P[X < −x]
              = 1 − ∫_{−∞}^{−x} f_X(t) dt
              = 1 − ∫_x^∞ f_X(−t) dt   (substituting t ↦ −t)
              = 1 − ∫_x^∞ f_X(t) dt   (by (6.34))   (6.35)
              = P[X ≤ x] = F_X(x).

Here we repeatedly used the results of Lemma 5.9. A more concrete example: if X ∼
N(0, σ²), σ > 0, then −X ∼ N(0, σ²) as well.
The example above already gives a hint that random variables are a useful tool for algebraic
manipulations. We will see more of this in the next section.
6.3. Transformation of random variables
Example 6.12. We measure the temperature of a liquid in °C and want to transform it into °F:

    °C: random variable X,   °F: random variable Y.
We can use the known formula

    Y = (9/5) · X + 32,   more generally Y = a · X + b,   (6.36)

for a ≠ 0, b ∈ R. We assume that X is a continuous random variable with probability density
function f_X (and distribution function F_X), for instance X ∼ N(µ, σ²). What is the distribution of Y,
i.e. what do f_Y or F_Y look like? For simplicity, we first assume a > 0. Then

    F_Y(y) = P_Y[(−∞, y]] = P[Y ≤ y] = P[aX + b ≤ y] = P[X ≤ (y − b)/a] = F_X((y − b)/a)
    ⇒  f_Y(y) = (d/dy) F_X((y − b)/a) = (1/a) · f_X((y − b)/a).   (6.37)

If Y = a · X + b with general a ≠ 0, we have

    f_Y(y) = (1/|a|) · f_X((y − b)/a).   (6.38)
In the special case where X ∼ N(µ, σ²), we have

    f_X(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}
    ⇒  f_Y(y) = (1/(√(2π) σ|a|)) e^{−(y−b−aµ)²/(2σ²a²)}
    ⇒  Y = aX + b ∼ N(aµ + b, a²σ²).   (6.39)
Formula (6.38) is the linear transformation rule for continuous random variables. We now
show a general transformation rule for continuous random variables.
Theorem 6.13. Let X be a continuous random variable and f_X the probability density function
of its law. Suppose that g : R → R is strictly increasing or strictly decreasing and differentiable.²
Then

    Y = g(X)   (6.40)

is also a continuous random variable and has density

    f_Y(y) = { f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)|,   if y = g(x) with f_X(x) > 0,
             { 0,   else.   (6.41)
Proof. Let g be strictly increasing. Then, for [a, b) ⊆ {g(x) ; f_X(x) > 0}, we have

    P_Y[[a, b)] = P[g(X) ∈ [a, b)] = P[X ∈ [g^{−1}(a), g^{−1}(b))]
                = ∫_{g^{−1}(a)}^{g^{−1}(b)} f_X(x) dx = ∫_a^b f_X(g^{−1}(y)) (d/dy) g^{−1}(y) dy,   (6.42)

and the integrand in the last expression is f_Y(y). The proof for g strictly decreasing is similar.
² To be very precise, we also need to make sure that Y = g(X) is still a random variable (i.e. that it is F − B(R)-measurable).
This follows from the fact that g is differentiable and thus B(R) − B(R)-measurable (in fact
continuity is sufficient) and the simple fact that the composition of measurable functions is measurable.
Example 6.14. Let X ∼ U([0, 1]) and consider Y = exp(X) = e^X. The function exp satisfies
the requirements of the previous theorem, and its inverse is log. Moreover, f_X(x) ≠ 0 if and
only if x ∈ [0, 1]. We have

    f_Y(y) = { (d/dy) log(y) = 1/y,   y ∈ [1, e],
             { 0,   y ∉ [1, e].   (6.43)
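The density (6.43) can be cross-checked by simulation: it integrates to F_Y(y) = log(y) on [1, e]. The sketch below is our own, not part of the text:

```python
import random
from math import exp, log

random.seed(1)
samples = [exp(random.random()) for _ in range(100_000)]   # Y = e^X with X ~ U([0,1])
for y in (1.5, 2.0, 2.5):
    empirical = sum(s <= y for s in samples) / len(samples)
    assert abs(empirical - log(y)) < 0.01    # F_Y(y) = integral of 1/t over [1, y]
```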
In the next example, we use the same method as in Theorem 6.13 to introduce the χ2
distribution (with one degree of freedom).
Example 6.15. Let X ∼ N(0, 1) and Y = X². We want to calculate f_Y(y). Unfortunately the
function g : R → R, x ↦ x², is not strictly increasing / decreasing, but we can still use the
same idea as in the proof of the transformation rule. Indeed, on (0, ∞) we have g^{−1}(y) = √y and
(d/dy) g^{−1}(y) = 1/(2√y) for y > 0. Then, for 0 ≤ a < b < ∞, we find

    P[Y ∈ [a, b)] = P[X ∈ [√a, √b)] + P[X ∈ (−√b, −√a]]
                  = 2 P[X ∈ [√a, √b)]
                  = 2 ∫_a^b (1/√(2π)) e^{−y/2} · (1/(2√y)) dy.   (6.44)

We used the symmetry of X ∼ N(0, 1) and the fact that P_X does not give mass to points. It
follows that

    f_Y(y) = (1/√(2π)) y^{−1/2} e^{−y/2} 1_{[0,∞)}(y).   (6.45)

We say that a random variable Y with law given by this density is χ²-distributed (with one
degree of freedom).
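The χ²(1) law can also be checked by squaring simulated standard normals; for y > 0, P[Y ≤ y] = P[−√y ≤ X ≤ √y] = erf(√(y/2)). A sketch of ours:

```python
import random
from math import erf, sqrt

random.seed(2)
samples = [random.gauss(0, 1) ** 2 for _ in range(100_000)]   # Y = X^2
for y in (0.5, 1.0, 2.0):
    empirical = sum(s <= y for s in samples) / len(samples)
    assert abs(empirical - erf(sqrt(y / 2))) < 0.01
```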
7. Expectation, variance and higher
moments of random variables
(Reference: [1, Chapter 7], or [2, Sections 4.1, 4.3])
7.1. Expectation
We now introduce the notion of the expectation or expected value of a real random variable. The
idea is to somehow quantify the “typical” or “average” value of a random variable X. Let us
motivate the definition with an example.
Example 7.1. We consider a game where we throw a die once, and get the following rewards:

• $1 if the die shows 1 or 2,
• $2 if the die shows 3 or 4,
• $4 if the die shows 5, and
• $8 if the die shows 6.

What would be a "fair price" for playing this game? We would like the stakes to be such that we
do not lose money on average by playing. Assume that we play n ∈ N times, and n_1, n_2, ..., n_6
denote the number of 1s, 2s, ..., 6s that show up in n rounds. Our return (in $) after n rounds is

    1 · n_1 + 1 · n_2 + 2 · n_3 + 2 · n_4 + 4 · n_5 + 8 · n_6.   (7.1)

So, the average return in one round is

    1 · (n_1/n) + 1 · (n_2/n) + 2 · (n_3/n) + 2 · (n_4/n) + 4 · (n_5/n) + 8 · (n_6/n).   (7.2)

The idea is now that for large n, the relative fractions n_i/n should be close to P[X = i] for
X ∼ U({1, ..., 6}). This gives us the value

    E[X] = 1 · P[X = 1] + 1 · P[X = 2] + 2 · P[X = 3] + 2 · P[X = 4]
           + 4 · P[X = 5] + 8 · P[X = 6] = 18 · (1/6) = 3.   (7.3)

This suggests that the "fair" price to play the game is $3.
This example motivates the definition of the expectation.
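The long-run-average heuristic behind (7.2) and (7.3) is easy to test by simulation (our own sketch, not from the text):

```python
import random

reward = {1: 1, 2: 1, 3: 2, 4: 2, 5: 4, 6: 8}   # payouts from Example 7.1
random.seed(3)
n = 200_000
avg = sum(reward[random.randint(1, 6)] for _ in range(n)) / n
assert abs(avg - 3.0) < 0.05    # the long-run average return approaches 3
```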
7. Expectation, variance and higher moments of random variables
Definition 7.2. (i) Let X be a discrete real random variable with values in Ω_X (⊆ R) and
let p_X be the probability mass function of its law P_X. We define the expectation of X as

    E[X] = Σ_{ω∈Ω_X} ω · p_X(ω),   (7.4)

if Σ_{ω∈Ω_X} |ω| p_X(ω) < ∞.

(ii) Let X be a continuous real random variable and let f_X be the probability density function
of its law P_X. We define the expectation of X as

    E[X] = ∫_{−∞}^∞ x · f_X(x) dx,   (7.5)

if ∫_{−∞}^∞ |x| f_X(x) dx < ∞.
End of Lecture 10
Remark 7.3. (i) For real random variables X that are neither discrete nor continuous (such
as the one we saw in Example 6.6, (iii)), we typically cannot easily define the expectation by a
formula as above. We refer to [2, Section 4.1] for the case of general X.

(ii) The expectation only depends on the law P_X of X. In other words: if X =d Y and the
expectation of X exists, then the expectation of Y exists as well and E[X] = E[Y].
Let us give a couple of examples.

Example 7.4. (i) Let X = c ∈ R. Then

    E[X] = c,   (7.6)

since X is a discrete random variable with Ω_X = {c} and P[X = c] = 1.

(ii) Let (Ω, F, P) be a probability space and A ∈ F. Then 1_A, defined by

    1_A(ω) = { 1, ω ∈ A,
             { 0, ω ∉ A,   (7.7)

is a random variable and

    E[1_A] = P[A].   (7.8)

Indeed, 1_A^{−1}(B) ∈ {∅, A, A^c, Ω} for every B ∈ B(R), and Ω_{1_A} = {0, 1}. Therefore, we
have

    E[1_A] = 0 · P[1_A = 0] + 1 · P[1_A = 1] = P[A].   (7.9)
(iii) X ∼ Pois(λ), where λ > 0. The corresponding probability mass function is p_X(k) =
(λ^k / k!) e^{−λ} (for k ∈ N_0), and

    E[X] = Σ_{k=0}^∞ k · p_X(k) = Σ_{k=1}^∞ k · (λ^k / k!) e^{−λ} = λ e^{−λ} Σ_{k=1}^∞ λ^{k−1} / (k − 1)! = λ e^{−λ} e^{λ} = λ.   (7.10)
(iv) X ∼ U([a, b]) for a < b. The corresponding probability density function is f_X(x) =
(1/(b − a)) 1_{[a,b]}(x). We have

    E[X] = ∫_a^b x · (1/(b − a)) dx = (1/(b − a)) [x²/2]_a^b = (a + b)/2.   (7.11)
(v) X ∼ N(µ, σ²), µ ∈ R, σ > 0. Here the probability density function is f_X(x) =
(1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}, and, substituting y = x − µ,

    E[X] = ∫_{−∞}^∞ x · (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx
         = ∫_{−∞}^∞ y · (1/(√(2π) σ)) e^{−y²/(2σ²)} dy + µ ∫_{−∞}^∞ (1/(√(2π) σ)) e^{−y²/(2σ²)} dy   (7.12)
         = 0 + µ = µ,

where the first integral vanishes by symmetry and we used that the integrand of the second
integral (∗) is again a probability density function, namely of N(0, σ²), and therefore its
integral is one.
(vi) X ∼ Geo(p), p ∈ (0, 1), has expectation

    E[X] = 1/p.   (7.13)

(vii) X ∼ E(λ), λ > 0, has expectation

    E[X] = 1/λ.   (7.14)

(viii) X ∼ Bin(n, p), n ∈ N, p ∈ (0, 1), has expectation

    E[X] = np.   (7.15)

(ix) Consider the probability distribution P on N characterized by the probability mass function
p(k) = (6/π²) · (1/k²).¹ Let X ∼ P. Then the expectation of X does not exist. Indeed:

    Σ_{k=1}^∞ k · p_X(k) = Σ_{k=1}^∞ (6/π²) · (1/k) = ∞.   (7.16)

The claims in (vi) and (vii) are exercises. Claim (viii) will be shown very easily later, after
introducing the notion of independent random variables.
Theorem 7.5. Let g : R → R.

¹ The prefactor is chosen since Σ_{k=1}^∞ 1/k² = π²/6. This can be shown using Fourier series.
(i) If X is a discrete real random variable with probability mass function (p_X(ω))_{ω∈Ω_X}, then

    E[g(X)] = Σ_{ω∈Ω_X} g(ω) p_X(ω),   if Σ_{ω∈Ω_X} |g(ω)| p_X(ω) < ∞.   (7.17)

(ii) If X is a continuous real random variable with probability density function f_X (and g
piecewise continuous²), then

    E[g(X)] = ∫_{−∞}^∞ g(x) f_X(x) dx,   if ∫_{−∞}^∞ |g(x)| f_X(x) dx < ∞.   (7.18)

Proof. For (i), let Y = g(X); then Ω_Y = {y_1, y_2, ...}. Let A_i = g^{−1}({y_i}) for i ∈ N. Clearly,
Ω_X = ⋃_{i=1}^∞ A_i, and the A_i are pairwise disjoint. We have

    E[g(X)] = E[Y] = Σ_{i=1}^∞ y_i · P_Y[{y_i}] = Σ_{i=1}^∞ y_i Σ_{ω_j∈A_i} p_X(ω_j)
            = Σ_{i=1}^∞ Σ_{ω_j∈A_i} g(ω_j) · p_X(ω_j)   (7.19)
            = Σ_{ω∈Ω_X} g(ω) p_X(ω).
For (ii), the proof of the general case is more complicated and relies on measure theory. For
the special case where g is strictly increasing / strictly decreasing and differentiable, we can
however use Theorem 6.13 and see that for Y = g(X),

    E[g(X)] = ∫_{−∞}^∞ y f_Y(y) dy = ∫_{−∞}^∞ y f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)| dy = ∫_{−∞}^∞ g(x) f_X(x) dx.   (7.20)
Corollary 7.6. Let X be a real random variable with finite expectation E[X]. Then, for a, b ∈ R,

    E[aX + b] = a E[X] + b.   (7.21)

Proof. We only prove this statement for X continuous or discrete. Without loss of generality,
assume that X is continuous (the discrete case proceeds analogously). Consider g(x) = ax + b.
Then

    E[g(X)] = ∫_{−∞}^∞ (ax + b) f_X(x) dx = a ∫_{−∞}^∞ x f_X(x) dx + b ∫_{−∞}^∞ f_X(x) dx = a E[X] + b.   (7.22)

A similar calculation shows that the integral ∫_{−∞}^∞ |g(x)| f_X(x) dx is finite.

² This can be weakened to requiring that g : (R, B(R)) → (R, B(R)) is measurable.
Theorem 7.7. Let X, Y be two real random variables, both with finite expectation. Then

    E[X + Y] = E[X] + E[Y].    (7.23)
We will show this later after introducing the joint distribution of random variables in the
next section.
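Although the proof is deferred, a quick Monte Carlo sketch (an illustration added here, not from the notes) shows that (7.23) requires no independence: below, Y = X² is a deterministic function of X, yet the sample means still add up.

```python
import random

# Sketch: empirical check of E[X + Y] = E[X] + E[Y] with Y = X^2,
# i.e. X and Y heavily dependent. X ~ U(0, 1), so E[X] = 1/2, E[Y] = 1/3.
random.seed(1)
N = 100_000
xs = [random.uniform(0.0, 1.0) for _ in range(N)]

Ex = sum(xs) / N
Ey = sum(x * x for x in xs) / N
Exy = sum(x + x * x for x in xs) / N

# Equality holds up to floating-point rounding of the three sums.
print(abs(Exy - (Ex + Ey)) < 1e-9)  # True
```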
7.2. Variance
Definition 7.8. Let X be a continuous or discrete real random variable.
(i) For k ∈ ℕ, the k-th moment of X is defined by

    µ_k = E[X^k],    if E[|X|^k] < ∞.    (7.24)

(ii) Assume that X has a finite second moment, E[X²] < ∞. We define the variance of X as

    Var[X] = E[(X − E[X])²].    (7.25)

We also define the standard deviation of X by

    σ(X) = √(Var[X]).    (7.26)
The variance measures how much the distribution of X typically spreads around its expectation: if it is large, the distribution is well spread out; if it is small, the distribution is concentrated around the expectation. As with the expectation, both the k-th moments and the variance depend only on the law P_X of X.
Proposition 7.9. Let X be a continuous or discrete real random variable with E[X²] < ∞ and a, b ∈ ℝ. Then

(i)

    Var[X] = E[X²] − E[X]² = µ₂ − µ₁².    (7.27)

(ii)

    Var[aX + b] = a² Var[X].    (7.28)
Proof. We first prove (i):

    Var[X] = E[(X − E[X])²] = E[X² − 2X E[X] + E[X]²]
           = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².    (7.29)
For (ii), we calculate

    Var[aX + b] = E[(aX + b − a E[X] − b)²] = E[a²(X − E[X])²] = a² Var[X].    (7.30)
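The identities (7.27) and (7.28) can be verified exactly on a small discrete example (an added sketch, not part of the notes); here X is uniform on {1, ..., 6}, and a = 3, b = 5 are arbitrary choices.

```python
from fractions import Fraction

# Sketch: exact check of Var[X] = E[X^2] - E[X]^2 (7.27) and
# Var[aX + b] = a^2 Var[X] (7.28) for X uniform on {1,...,6}.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
E = lambda h: sum(h(w) * p for w, p in pmf.items())

var = E(lambda x: x * x) - E(lambda x: x) ** 2   # (7.27)

a, b = 3, 5
# distribution of aX + b, then its variance computed from scratch
pmf_ab = {a * w + b: p for w, p in pmf.items()}
E2 = lambda h: sum(h(w) * p for w, p in pmf_ab.items())
var_ab = E2(lambda y: y * y) - E2(lambda y: y) ** 2

print(var, var_ab == a * a * var)  # 35/12 True
```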
End of Lecture 11
Let us give some examples.
Example 7.10.
(i) Let X ∼ Ber(p), p ∈ (0, 1). We have

    E[X] = 0 · (1 − p) + 1 · p = p,    E[X²] = 0² · (1 − p) + 1² · p = p,    (7.31)
    Var[X] = p − p² = p(1 − p).
(ii) Let X ∼ U({1, ..., 6}). We have

    E[X] = Σ_{k=1}^6 k · P[X = k] = 3.5,    E[X²] = Σ_{k=1}^6 k² · P[X = k] = 91/6,    (7.32)

    ⇒ Var[X] = 91/6 − 49/4 = 70/24 = 35/12 ≈ 2.92.
(iii) Let X ∼ N(0, 1). We already saw that E[X] = 0 (see (7.12)). We now evaluate, integrating by parts,

    Var[X] = E[(X − E[X])²] = ∫_{−∞}^{∞} x² φ(x) dx    (using E[X] = 0)
           = (1/√(2π)) ∫_{−∞}^{∞} x² e^{−x²/2} dx
           = (1/√(2π)) [−x e^{−x²/2}]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx
           = 0 + 1 = 1.    (7.33)

Now let Y = σX + µ for µ ∈ ℝ, σ > 0. Then

    Y ∼ N(µ, σ²)    (by (6.39)),
    Var[Y] = σ² Var[X] = σ²    (by (7.28)).    (7.34)

We have established that

    Z ∼ N(µ, σ²)  ⇒  E[Z] = µ, Var[Z] = σ².    (7.35)

In other words: the standard deviation of N(µ, σ²) is exactly σ.
(iv) Let X ∼ Pois(λ). Recall that E[X] = λ. Then

    E[X(X − 1)] = Σ_{k=0}^∞ k(k − 1) (λ^k / k!) e^{−λ} = λ² Σ_{k=2}^∞ (λ^{k−2} / (k − 2)!) e^{−λ} = λ²,    (7.36)

    ⇒ E[X²] = E[X(X − 1)] + E[X] = λ² + λ,
    ⇒ Var[X] = E[X²] − E[X]² = λ² + λ − λ² = λ.
(v) X ∼ Bin(n, p) with n ∈ ℕ, p ∈ (0, 1) has variance

    Var[X] = np(1 − p).    (7.37)
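The variances in (i), (iv) and (v) can likewise be checked numerically via (7.27) (an added sketch; the parameter values p = 0.3, λ = 2, n = 10 are arbitrary, and the Poisson sum is truncated at k = 50, where the remaining mass is negligible).

```python
import math

# Sketch: variance via (7.27), var = E[X^2] - E[X]^2, from a pmf dictionary.
def variance(pmf):
    m1 = sum(w * p for w, p in pmf.items())
    m2 = sum(w * w * p for w, p in pmf.items())
    return m2 - m1 * m1

p = 0.3                      # Bernoulli(p): expect p(1 - p) = 0.21
ber = variance({0: 1 - p, 1: p})

lam = 2.0                    # Pois(lam), truncated at k = 50: expect lam = 2
poi = variance({k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(51)})

n, q = 10, 0.25              # Bin(n, q): expect n*q*(1 - q) = 1.875
binom = variance({k: math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)})

print(round(ber, 6), round(poi, 6), round(binom, 6))  # 0.21 2.0 1.875
```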
We now argue why the variance is indeed a useful quantification of how spread out the distribution is. We study the expression P[|X − E[X]| ≥ ε] for ε > 0. The following inequality is called the Markov inequality.
Theorem 7.11. Let X be a non-negative random variable with finite expectation. Then, for ε > 0,

    P[X ≥ ε] ≤ E[X] / ε.    (7.38)
Proof. We prove the case where X is discrete; the case of continuous X is similar. Let Ω_X = {ω₁, ω₂, ...}. Then

    P[X ≥ ε] = Σ_{i≥1: ω_i ≥ ε} P[X = ω_i]
             ≤ Σ_{i≥1: ω_i ≥ ε} (ω_i / ε) P[X = ω_i]    (7.39)
             ≤ (1/ε) Σ_{i=1}^∞ ω_i P[X = ω_i] = (1/ε) E[X].
From the Markov inequality, we have the following result, called the Chebyshev inequality.

Theorem 7.12. Let E[X²] < ∞. For any a ∈ ℝ and ε > 0, one has

    P[|X − a| ≥ ε] ≤ E[(X − a)²] / ε².    (7.40)

In particular, for a = E[X] one has

    P[|X − E[X]| ≥ ε] ≤ Var[X] / ε².    (7.41)
Proof. We apply the Markov inequality (7.38) to the non-negative random variable (X − a)². It follows that

    P[|X − a| ≥ ε] = P[(X − a)² ≥ ε²] ≤ E[(X − a)²] / ε²,    (7.42)

where the inequality uses (7.38).
Chebyshev’s inequality gives a bound on how likely it is that a random variable deviates by a certain amount from its expectation. In particular, we have the following “kσ-rules”:
Corollary 7.13. Let E[X²] < ∞.

(i) If σ = σ(X) = √(Var[X]) > 0, we have, for k > 0,

    P[|X − E[X]| ≥ kσ] ≤ 1/k².    (7.43)

In particular:

    P[|X − E[X]| ≥ 2σ] ≤ 1/4,    P[|X − E[X]| ≥ 3σ] ≤ 1/9.    (7.44)

(ii) If Var[X] = 0, then

    P[X = E[X]] = 1.    (7.45)
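To see how conservative the 2σ-rule in (7.44) can be (an added sketch, not from the notes): for X ∼ N(0, 1), the exact tail probability P[|X| ≥ 2σ] = 2(1 − Φ(2)) ≈ 0.046, far below the Chebyshev bound 1/4. Here Φ is expressed through the error function math.erf.

```python
import math

# Sketch: exact two-sided normal tail vs. the Chebyshev bound from (7.44).
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF

exact = 2.0 * (1.0 - Phi(2.0))   # P[|X| >= 2] for X ~ N(0,1)
cheby = 1.0 / 4.0                # bound from (7.44) with k = 2

print(round(exact, 4), exact <= cheby)  # 0.0455 True
```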
Remark 7.14. The Chebyshev inequality is very general, but only gives rather rough bounds. Consider for instance X ∼ N(µ, σ²) for µ ∈ ℝ, σ > 0. Then
X −µ