Sampling Distribution for a
Mean or Proportion
When we take a sample from a population we often do so with the notion that our sample will provide a good estimate of a population parameter. Thus the mean of the sample is expected to be about the same as the mean of the population. In other words, our sample will represent the population well.
However, rarely will the sample estimate match the mean of the population exactly, and if we take different samples, even if they are taken randomly with the same sample size, we would likely get slightly different estimates of the population mean or proportion. In fact, we would expect that samples will vary from one to another and we can demonstrate this mathematically and empirically.
chapter
9
On November 4, 2008, Barack Obama was elected as the 44th president of the United States. Usually presidents start out their terms with high approval ratings, but the business of being president can be difficult, and most experience a decline in approval ratings within the first year. Figure 9.1 shows the approval rating for Barack Obama for his first year in office. The data come from repeated samples of approximately 1,600 adults in the United States from the Gallup Poll (source: www.pollingreport.com).
Figure 9.1 Daily Estimate of President Obama’s Approval Ratings from January 21, 2010 to March 8, 2010 by the Gallup Poll
The variability in favorability estimates in Figure 9.1 reflects changes that take place over time. A single company, the Gallup organization, takes each value from a different sample of approximately 1,600 adults in the United States. However, many polling companies working for different news organizations also take polls. If we looked at approval ratings across several polling organizations at roughly the same time period (February 3, 2010 to March 3, 2010) we see that the estimates are similar, but there are differences. Figure 9.2 shows the approval ratings estimated by seven different polling organizations. Each poll used slightly different methodologies and sample sizes, but all would be considered reasonable estimates of the approval rating for President Obama in roughly the same time period. These estimates vary from a low of 46% to a high of 53%. These differences are not large, but politically it makes a difference whether the approval rating is above or below 50%, and these estimates are not consistent.
The results in Figure 9.2 are somewhat like a small sampling distribution for the proportion of U.S. citizens who support President Obama in the period February 3, 2010 to March 3, 2010. We expect estimates from a population to vary from sample to sample. This sampling variation of proportions, just like that for means, tends to follow a normal distribution. The estimates will center around the true population proportion, but there will be variability from sample to sample.
This is perhaps one of the most important chapters in the book. Sampling distributions are key to the logic of inference and we need to have a good sense of how they work in order to move forward to topics such as confidence intervals and hypothesis tests. A sampling distribution is based on the notion of taking many, many samples from a population, all of the same size n, and making an estimate from each sample. The distribution of these estimates is the sampling distribution. In a research setting we typically only take one sample, but we can think of our sample as one of many possible samples that can be taken from the population, and our estimate as one of many estimates that could have been made. In this chapter we will learn that most sampling distributions tend to follow a specific probability distribution and we can use the known probabilities of that distribution to make inferences. One such distribution is the normal distribution, and that will be our focus in this chapter as we look at the sampling distribution for the mean.
The variability in the sample estimates above is called sampling error, and
is an expected outcome of taking a sample to represent the population.
Figure 9.2 Seven Estimates of President Obama’s Approval Ratings from February 3, 2010 to March 3, 2010
The sample estimate of the mean or proportion will not exactly match
the population mean or proportion. We will expect some variability from
sample to sample. The key is whether we know what the variability will
look like and how it behaves. It turns out that statisticians do know how
it behaves and the way it is distributed. This leads us to the sampling distribution of the mean—the distribution of means generated from many
samples of the same size n.
As we move forward in the discussion of sampling distributions, it is important to remember that a parameter is a numerical descriptive measure of the population, such as the mean or the variance. We use Greek letters to represent population parameters. These parameters are hardly ever known; you are generally doing the research to gain an estimate or understanding of the population parameter. In contrast, a sample statistic is a numerical descriptive measure from a sample, i.e., based on the observations in the sample. We will want the sample to be derived from a random process in order to feel our sample represents the population. Inferential statistics requires the sample be drawn in a random fashion.
A Sampling Distribution Experiment: Rolling a Die Three Times
I want you to be involved in a simple sampling experiment. All you need is a single die (you can get one from several board games in your home). Here is what I want you to do. A table is provided to record your rolls:
1. Toss a die three times
2. Each time you toss the die, note and record the face value: either 1, 2, 3, 4, 5, or 6
3. Calculate the mean and median for the three rolls (the median will always be the middle value)
4. Repeat this experiment ten times
Table 9.1 is given below for you to fill out. For example, if I roll the sequence 5, 1, 6, the mean is (5 + 1 + 6)/3 = 12/3 = 4 and the median is 5, the middle value in an ordered sequence.
Table 9.1 Recording Table for an Experiment of Rolling a Die Three Times and Calculating the Mean and Median
| Sample | Roll 1 | Roll 2 | Roll 3 | Mean | Median |
|--------|--------|--------|--------|------|--------|
| 1      |        |        |        |      |        |
| 2      |        |        |        |      |        |
| 3      |        |        |        |      |        |
| 4      |        |        |        |      |        |
| 5      |        |        |        |      |        |
| 6      |        |        |        |      |        |
| 7      |        |        |        |      |        |
| 8      |        |        |        |      |        |
| 9      |        |        |        |      |        |
| 10     |        |        |        |      |        |
We are going to think of each three-roll sequence as a different sample of
three rolls. Each sample will generate a mean, which we will think of as our
sample estimate of the population mean. Each three-roll sequence is a different sample and each is likely to generate a different sample estimate of the
mean. Using many samples brings us into the realm of a sampling distribution—many samples of the same size n. We will also compute the median
as a way to compare the mean versus the median as a way to estimate the
population mean. I know it may seem silly to use the median from the sample to estimate the mean of the population, but think of it as an alternative
estimate of the central tendency of the population.
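If you do not have a die at hand, the experiment can be mimicked on a computer. The sketch below is only an illustration: the function names `roll_sample` and `run_experiment` and the seed value are my own choices, and Python's pseudo-random generator stands in for a physical die.

```python
import random
import statistics

def roll_sample(rng, n_rolls=3):
    """Roll a fair six-sided die n_rolls times and return the face values."""
    return [rng.randint(1, 6) for _ in range(n_rolls)]

def run_experiment(n_samples=10, n_rolls=3, seed=None):
    """Return one (rolls, mean, median) tuple per three-roll sample."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    results = []
    for _ in range(n_samples):
        rolls = roll_sample(rng, n_rolls)
        results.append((rolls, statistics.mean(rolls), statistics.median(rolls)))
    return results

# Print a filled-in version of the recording table (seed is arbitrary).
for i, (rolls, mean, median) in enumerate(run_experiment(seed=2010), start=1):
    print(f"Sample {i:2d}: rolls={rolls}  mean={mean:.2f}  median={median}")
```

Each printed row corresponds to one row of the recording table. Running it again with a different seed gives a different set of ten sample means, which is the sample-to-sample variability this chapter is about.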
Our experiment uses a very simple approach: multiple samples of size three (the sample size is 3, n = 3). I chose this approach because using probability theory from past modules I can work out exactly what the mean and variance are for the roll of a die. I can also work out all the possible combinations of this sampling distribution, and I can do it using an a priori probability of 1/6 for each face of the die. If you look at your own simple experiment, the values of the mean can range from 1 (rolling three ones) to 6 (rolling three sixes). Remember, each three-roll sequence will be thought of as a sample.
Using probability theory we can note the possible outcomes of this experiment as a discrete random variable. We start with the mean and variance of rolling a single die, represented in the following table (Table 9.2).
A priori we have the following expectation:
$$\text{Mean} = E(X) = 1(.1667) + 2(.1667) + 3(.1667) + 4(.1667) + 5(.1667) + 6(.1667) = 3.500$$
$$\text{Variance} = E(X - \mu)^2 = (1 - 3.5)^2(.1667) + (2 - 3.5)^2(.1667) + (3 - 3.5)^2(.1667) + (4 - 3.5)^2(.1667) + (5 - 3.5)^2(.1667) + (6 - 3.5)^2(.1667) = 2.916667$$
$$\text{Standard Deviation} = \sqrt{2.916667} = 1.7078$$
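The expectation above can be checked with a few lines of code. This is a minimal sketch, assuming a fair die; it uses exact fractions (1/6 rather than the rounded .1667) so that rounding does not creep into the result, and the variable names are mine.

```python
from fractions import Fraction
import math

faces = range(1, 7)
p = Fraction(1, 6)  # a priori probability of each face of a fair die

# E(X): each face value weighted by its probability
mean = sum(x * p for x in faces)

# Var(X): squared deviations from the mean, weighted by probability
variance = sum((x - mean) ** 2 * p for x in faces)

std_dev = math.sqrt(variance)

print(f"E(X)   = {float(mean):.3f}")
print(f"Var(X) = {float(variance):.6f}")
print(f"SD(X)  = {std_dev:.4f}")
```

The exact variance is 35/12, which is the 2.916667 shown above. Plugging in the rounded probability .1667 would give a slightly different figure in the later decimal places.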
There are 6*6*6 = 216 different combinations of outcomes of rolling a die three times. In essence, there are 216 different sample combinations that can result from rolling a die three times. Rather than work with many random samples, I can work out all the possible combinations of each roll and the mean and median of each of those combinations. I can take the mean and median of each possible outcome and examine the summary statistics using JMP.
Table 9.2 Recording Table for an Experiment of Rolling a Die Three Times and Calculating the Mean and Median
| X | P(X)  |
|---|-------|
| 1 | .1667 |
| 2 | .1667 |
| 3 | .1667 |
| 4 | .1667 |
| 5 | .1667 |
| 6 | .1667 |
Figure 9.3 JMP Output for the Descriptive Statistics for the Mean of the 216 Different Outcomes of Rolling a Die Three Times
Notice several things from Figure 9.3, which is the sampling distribution of rolling three dice and calculating the mean.
• There are 216 observations in the count, the number of possible outcomes of rolling a die three times.
• The minimum value is 1 (a mean from rolling three ones) and the maximum is 6 (the mean of rolling three sixes).
• The mean of the sample means is in fact the population mean of 3.5. Thus the mean of the sampling distribution will equal the mean of the population.
• The median of the sampling distribution is also 3.5, which agrees exactly with the estimate using the mean.
• The histogram shows a symmetrical, mound-shaped distribution with the center at 3.5.
• The standard deviation for this variable is .989.
We noted that the standard deviation of the population was 1.7078, which is considerably larger than the standard deviation of the means of each sample. However, if we divide this figure by the square root of the sample size for our experiment (n = 3, so the square root of 3 = 1.7321), we get the following result: 1.7078/1.7321 = .986 or rounded off to .99, which is very close to the value of .989 in the table. The standard deviation of a sampling distribution is called the standard error. I will return in a bit to the standard error and why we used the square root of the sample size to make this calculation. For now we can note that the standard deviation of our sampling distribution for the mean is smaller than the population value and it can be expressed as a function of the sample size.
As we move forward we are going to think of our sample estimate, i.e., the
mean, as a reasonable estimator of the population value, μ. However, there
may be more than one estimator available to us. And the question arises,
which estimator is the best one? In anticipation of this question, let us allow
that the sample median might be a better estimate of the population parameter than the mean. In other words, we will consider whether the sample
median might be a preferred way to estimate the mean of a population.
Figure 9.4 JMP Output for the Descriptive Statistics for the Median of the 216 Different Outcomes of Rolling a Die Three Times
Figure 9.4 shows the JMP output for the median of each three-roll sample of a die. The average of the medians from each of the outcomes is also 3.5, the population value. The histogram also shows a symmetrical, mound-shaped distribution for the median. However, the standard deviation of the 216 median values is 1.374, considerably higher than the value for the distribution of the means. This shows that the sampling distribution of the median has a larger variance than that of the mean. We will say that the mean has minimum variance in comparison to the median as an estimator of the population value, μ.
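The comparison of the mean and the median over all 216 outcomes can be reproduced without JMP. The following is a sketch in Python rather than the JMP workflow used for the figures; it assumes the reported standard deviations use the usual n − 1 denominator, which is what `statistics.stdev` computes.

```python
from itertools import product
import statistics

# Every ordered outcome of rolling a die three times: 6 * 6 * 6 = 216
outcomes = list(product(range(1, 7), repeat=3))

# One mean and one median per three-roll "sample"
means = [sum(rolls) / 3 for rolls in outcomes]
medians = [statistics.median(rolls) for rolls in outcomes]

summary = {
    "count": len(outcomes),
    "mean_of_means": statistics.mean(means),
    "mean_of_medians": statistics.mean(medians),
    "sd_of_means": statistics.stdev(means),      # n - 1 denominator
    "sd_of_medians": statistics.stdev(medians),  # n - 1 denominator
}

for name, value in summary.items():
    print(f"{name:16s} {value:.4f}")
```

Both estimators center on 3.5, but the standard deviation of the medians (about 1.374) is clearly larger than that of the means (about 0.988), which is the minimum variance argument in numerical form.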
The Sampling Distribution of the Mean
Sample statistics are random variables. By this we mean that the sample statistic, in this case the mean, that is estimated from each sample will vary from sample to sample. Sample statistics have a probability distribution based on repeating the sampling experiment many times. We will get slightly different sample statistics each time, even though we would consider each as a reasonable estimate of the population mean. Repeating the experiment (obtaining a sample and calculating the mean of the sample) many times results in a sampling distribution. The sampling distribution of a sample statistic calculated from a sample of n measurements is the probability distribution of the statistic. In order to better know what the probability distribution is, we need to know how the sampling distribution is distributed. I know that is a mouthful, but I hope to make it clear shortly.
I should start with the notion of an estimator. It is not an easy concept for some students to grasp because our conclusion will seem so obvious. The estimator is the strategy we use to estimate the population parameter. It could
be as simple as a random guess or a combination of guesses from an expert
panel. For example, if we want an estimate of the average systolic blood
pressure of all adult males in Delaware in March of 2010 I could simply guess
that it is 125. My guess may not be particularly good, but it is an estimate.
I might ask a panel of doctors and nurses in Delaware to help me estimate the
average, and they might end up saying that a better estimate is 131. It would
not be too much of a stretch to think that their estimate is better than mine,
but that still would not say this alternative strategy gave a good estimate.
After all, doctors and nurses tend to see a disproportionate number of people
who are sick or need assistance. Their estimate might not represent the true
average. An even better way to estimate the population parameter is to take
a random sample of adult males, take their blood pressure, and then take the
sample average as the estimate of the population value. This makes sense—
use the sample mean to represent the population mean.
In most cases, our estimator will be the sample statistic. As I said, this seems
obvious. We will use the sample mean to make an estimate of the population
parameter μ. But statisticians often need something more than saying “it is
obvious” as evidence that an estimator is a good one. One set of criteria is
that we want our estimators to be BLUE. BLUE stands for best linear unbiased estimator. Without going into too much detail, here are the basics for a
BLUE estimator.
Best
Best refers to a criterion that our estimator has minimum variance. This means that if we took repeated measures of the sample estimates, the variability around the true value would be less than that of any other estimator.
Linear
Linear refers to a class of estimators that tend to be simple and straightforward. All things being equal, a simple estimate is thought to be better.
Unbiased
Unbiased refers to a property where our estimates from sample to sample tend to cluster around the true value without missing the target in one direction or another.
Estimator
This reflects the notion that there could be many different ways to estimate a population parameter, and we will pick the one that is best, linear, and unbiased.
One of the best analogies for an estimator that I have heard is to think of it as hitting a bull’s eye by shooting a rifle (Kmenta, Jan, Elements of Econometrics, Macmillan Publishing Co., New York, 1971). The bull’s eye on the target is the population parameter. The rifle is an estimator. We prefer to pick a rifle that is straightforward and simple rather than one that is complicated, such as a machine gun. This makes the rifle linear. While we would like to hit the bull’s eye every time, we know that will not be possible. However, we want our shots to center around the bull’s eye and not miss systematically, such as above right or below left. This would make our rifle unbiased. And finally, we want a rifle that tends to have a tight pattern around the bull’s eye rather than a wider dispersion. We want the rifle to have minimum variance around the target.
There is one last component to our analogy with target shooting that is important to our discussion of estimators and sampling distributions. The further away one is from the target, the less accurate the shots will be. Think of the distance from the target as being inversely related to the sample size. A large sample size implies you are close to the target and therefore better able to have a tight pattern around the bull’s eye. A small sample would
imply being further away. All things being equal, we will tend to make good
estimates if our estimator uses a larger sample, has minimum variance, is
unbiased, and is linear (think of this as being simple).
We will use the sample mean, taken from a random sample, as our estimator of the population mean. Here is what we would like to see in our sampling distribution of the mean. If the sample mean is a good estimator of the
population mean, we would expect the values of the sample means taken
from many samples to cluster around the true population mean. We would
not want them to tend to cluster at a point above or below the true value,
or else we would consider them to be a biased estimate. And, we might
say our estimator is “good” if the cluster of the sample means around the
population mean is tighter than the sampling distribution of some other
possible estimator. This property is called minimum variance. In relation to
the sample mean as being a good estimator of the population parameter μ,
we already showed with the die example that the variance/standard deviation of the sampling distribution of the mean is smaller than the sampling
distribution of the median.
Let us use the following example to set up our discussion of a sampling distribution. Suppose we are looking at the blood pressure for the population of
adult males (ages 18 to 85) in Delaware in 2010. We believe there is an average blood pressure of this population, designated as µ. However, because
it would be expensive and extremely difficult (perhaps even impossible) to get information on all males, we want to take a sample to estimate µ.
If we take a sample of 300 adult males on a random basis, we can use the sample mean (the sum of all 300 values divided by 300) as our estimator of the population mean. Likewise, we can use the sample variance (using n − 1 as the denominator) as an unbiased estimator of the variance of the population.
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \quad \text{with} \quad s = \sqrt{s^2}$$
The standard deviation represents the average deviation around the sample mean. But we only took one sample out of an infinite number of possible samples. A reasonable question would be: what is the spread of our estimator (i.e., the sample mean)? In other words, if we took many, many samples of 300 and recorded the mean of each sample, what would the distribution of these sample means look like?
It turns out that the sampling distribution is a normal distribution with μ = 150 and a standard deviation equal to $\sigma/\sqrt{n}$, a function of the sample size (more shown later). Sampling theory tells us that the mean of the sampling distribution (many samples of the same sample size n) will equal the population mean. However, we have to remember that if we could take an infinite number of samples, each sample would yield a different sample mean. Yet, each one would be expressed as a reasonable estimate of the true population mean. So, if we were able to take repeated samples, each of sample size n, what would be the standard deviation of the sample estimates? The variance of the sampling distribution will equal the variance of the population divided by the sample size. We can convert the variance of the sampling distribution to a standard deviation by dividing sigma by the square root of n.
$$\mu_{\bar{X}} = \frac{\sum_{i=1}^{\infty} \bar{X}_i}{\infty} = \mu \quad \text{and} \quad \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
The latter value is called the standard error of the mean. The standard error
of the mean is the standard deviation of a sampling distribution of a mean
from a population with parameters equal to μ and σ (mu and sigma). If we do
not know sigma, we use the unbiased sample estimate of s to estimate the
sampling variance of the mean.
$$\mu_{\bar{X}} = \frac{\sum_{i=1}^{\infty} \bar{X}_i}{\infty} = \mu \quad \text{and} \quad s_{\bar{X}}^2 = \frac{s^2}{n} \quad \text{and} \quad s_{\bar{X}} = \frac{s}{\sqrt{n}}$$
These relationships are expressed in the equations above. How these formulas are derived is beyond the scope of this course. We have to accept it on
faith that the statisticians who worked out these formulas knew what they
were doing. We benefit from their work. We will be able to demonstrate this
with simulations of taking many, many sample means from a population of
a known μ and σ.
A Simulated Example of a Sampling Distribution
The notion of taking repeated samples seems strange to some people. It is important to remember that we rarely would ever take more than a single sample in a research project. What is important is that we can think of taking more than one sample, or that our sample is one of many samples that can be taken randomly from the population. Statisticians can tell us what the sampling distribution would look like for some estimators, such as the mean.
Let us look at an example of taking many samples from a population that is distributed normally with μ = 150 and σ = 30 (X ~ N(150, 30)). I used Excel and JMP to help with this example, and I took 1,000 samples of sample size 49 (n = 49). A sample size of 49 is considered “large” in classic statistical theory, though it might seem moderate or even small in modern research. For each sample I calculated the mean, median, and standard deviation.
First, let us look at just 10 samples from this exercise. Table 9.3 shows the results for just 10 samples. I will use this to make a point. The sample mean for each of the 10 samples tends to be close to the population parameter of 150, but none of the samples equals 150 exactly. The sample estimates range from a low of 145.527 to a high of 154.929. The fact that some estimates are lower and some are higher is to be expected, especially if the estimator is to be considered unbiased. The estimates for the standard deviation also center around the population parameter of 30.
Table 9.3 Descriptive Statistics of 10 Samples from a Population ~N(150, 30) with Sample Size n = 49
| Column | N  | Mean    | Std Dev | Minimum | Maximum |
|--------|----|---------|---------|---------|---------|
| S1     | 49 | 146.351 | 29.147  | 91.7885 | 205.789 |
| S2     | 49 | 153.872 | 28.115  | 109.944 | 224.931 |
| S3     | 49 | 146.783 | 30.342  | 81.7468 | 215.869 |
| S4     | 49 | 154.929 | 27.867  | 89.9620 | 209.204 |
| S5     | 49 | 149.236 | 32.380  | 57.6809 | 222.263 |
| S6     | 49 | 152.505 | 28.552  | 92.7285 | 206.216 |
| S7     | 49 | 154.126 | 35.044  | 73.6521 | 246.100 |
| S8     | 49 | 145.527 | 33.043  | 60.0667 | 213.805 |
| S9     | 49 | 152.183 | 25.252  | 80.3653 | 203.197 |
| S10    | 49 | 152.078 | 30.652  | 91.0217 | 225.725 |
Figure 9.5 shows the sampling distribution of the 1,000 samples of size 49.
This is a large number of samples, but it is not quite the same as a sampling distribution. However, for our purposes, using 1,000 samples of size
49 taken from a population that is normally distributed will work very well.
If we look at the histogram we can see that the distribution very much
resembles a normal distribution. In fact, I super-imposed a normal distribution over the histogram and it fits very well. There are a few extreme
values from our sample estimates, from a low of 135.697 to a high of
161.954. However, most estimates fit very well. The mean of the sample
means is 149.942, which is very close to the population parameter of 150.
The center of our sampling distribution very much reflects the population value. The spread of our sampling distribution has a standard deviation of 4.202. This is considerably
smaller than the population value of 30.
However, we stated earlier that the sampling distribution will have a standard deviation equal to σ divided by the square root of n. In our case this value would be:
$$\text{Standard Error} = \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{30}{\sqrt{49}} = \frac{30}{7} = 4.286$$
Our example value of 4.202 comes pretty close to the expected value of 4.286. Our simulation of 1,000 samples of size 49 from a population ~N(150, 30) shows the following.
• The sampling distribution follows a normal distribution.
• The center of the distribution (i.e., the mean of the means) is very close to the population value of 150.
• The standard deviation of the sample estimates, referred to as a standard error, is less than the standard deviation of the population by a factor of the square root of the sample size.
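The three points above can be checked with a small simulation. The sketch below is not the Excel and JMP procedure used for the figures; it is a stand-in using Python's standard library, and the function name `simulate_sample_means` and the seed are my own choices, so the exact numbers will differ from those reported in the text.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, n, n_samples, seed=None):
    """Draw n_samples samples of size n from N(mu, sigma); return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_samples)
    ]

mu, sigma, n = 150, 30, 49
means = simulate_sample_means(mu, sigma, n, n_samples=1000, seed=9)

print(f"mean of the sample means : {statistics.fmean(means):.3f}")
print(f"SD of the sample means   : {statistics.stdev(means):.3f}")
print(f"theoretical sigma/sqrt(n): {sigma / math.sqrt(n):.3f}")
print(f"lowest and highest mean  : {min(means):.3f}, {max(means):.3f}")
```

The mean of the simulated means should land near 150 and their standard deviation near 30/7, though neither will match exactly because 1,000 samples is still a finite number.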
Figure 9.5 Sampling Distribution Statistics of 1,000 Samples from a Population ~N(150, 30) with Sample Size n = 49
I also calculated a sampling distribution based on 1,000 samples from the same population, but with a sample size of 16. This would be considered a small sample. This new sampling distribution also resembles a normal distribution and the center is very close to the population parameter of 150. However, because the sample size of each estimate is smaller, i.e., 16, there is more spread in this sampling distribution. The standard error is 7.349 (read this as the standard deviation of the sampling distribution). We expected it to be:
$$\text{Standard Error} = \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{30}{\sqrt{16}} = \frac{30}{4} = 7.500$$
One of the sample estimates is as low as 123.727 and one is as high as 173.565. With a smaller sample size, even when drawing from the same population, we expect more variability from sample to sample. Sample size is a key factor in how well we can make estimates from a population.
It is important to note that the standard error is smaller than the standard deviation of the population. We expect that the standard deviation of a sampling distribution of the estimator (in this case the mean) will be smaller than that of the population or the samples themselves. This is because we expect some variability across samples, but not as much as we would find in the population. Thus the standard error is smaller than the standard deviation for the population.
The size of the standard error depends upon two things.
1. The size of n (as n gets larger the standard error gets smaller)
2. The variance of the population variable itself. We can think of this as the homogeneity of the population.
The larger the sample size, and the more homogeneous the population, the smaller the standard error will be for our estimator.
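Both effects are easy to see by evaluating σ/√n for a few combinations. In this sketch the helper name `standard_error` is mine, and the values of σ and n other than 30, 16, and 49 are hypothetical choices for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)

# sigma = 30 is the population used in the chapter; sigma = 15 is a
# hypothetical, more homogeneous population added for comparison.
for sigma in (30, 15):
    for n in (16, 49, 100, 400):
        print(f"sigma={sigma:2d}  n={n:3d}  standard error={standard_error(sigma, n):.3f}")
```

Note that the standard error shrinks with the square root of n, not with n itself: quadrupling the sample size only cuts the standard error in half.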
Figure 9.6 Sampling Distribution Statistics of 1,000 Samples from a Population ~N(150, 30) with Sample Size n = 16
I have used the following table to help students remember that there are three distinct things we need to keep in mind when dealing with sampling distributions (Table 9.4).
1. The population variable we are interested in, which often is unobserved but we can think that it exists.
2. Our sample, which we collect on a random basis and which we can observe.
3. The sampling distribution, which is theoretical (we do not observe it), but we know what it will look like (distributed normally) and its mean and standard deviation based on statistical theory.
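Table 9.4 Comparison of the Population, Sample, and Sampling Distribution
|                    | Population | Sample | Sampling Distribution |
|--------------------|------------|--------|-----------------------|
| Referred to as:    | Parameters | Statistics | Statistics |
| How it is Viewed   | Real but not observed | Observed from our sample | Theoretical from sampling theory |
| Mean               | $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ | $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ | $\mu_{\bar{X}} = \frac{\sum_{i=1}^{\infty} \bar{X}_i}{\infty}$ |
| Variance           | $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$ | $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ | $\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$ |
| Standard Deviation | $\sigma$ | $s$ | $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ |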
Two Theorems about the Sampling Distribution
In the case of the sampling distribution of the mean, we use two theorems
that help us understand the distribution of the sample means and thus ultimately help in making inferences from a sample to a population. The first one depends upon the population variable being distributed normally, or at least approximately normal. The second theorem, the central limit theorem, relaxes this assumption as long as the sample size is sufficiently large.
Theorem 1: Concerning a Variable That Is Normally Distributed. If repeated samples of size n are drawn from a variable X that is distributed normally with mean μ and variance σ², the sampling distribution of the mean will also be a normal distribution with mean and variance equal to:
$$\mu_{\bar{X}} = \frac{\sum_{i=1}^{\infty} \bar{X}_i}{\infty} = \mu$$
$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$
As long as the variable of interest is distributed normally in the population, the sampling distribution will also be normally distributed. This applies to small samples and large samples, and thus we can use the normal
distribution to find probabilities associated with our estimate. Adjustments will be made depending upon whether σ 2 is known or whether we
use the sample estimate s 2, but Theorem 1 provides a basis for inference
for small sample problems via a confidence interval (Chapter 10) and
hypothesis test (Chapter 11). And, as long as the variable is approximately
normally distributed, as in a symmetric, mound-shaped distribution,
the sampling distribution of the mean will also be normally distributed.
However, if the variable is not normally distributed, the small sample test
is not valid.
Theorem 2: The Central Limit Theorem. The Central Limit Theorem says that
even if the variable in the population is not normally distributed, the sampling distribution for the mean will be normally distributed as long as the
sample size is sufficiently large. And, the larger the sample size, the more
normal the sampling distribution gets. This offers a tremendous advantage to us because many variables of interest are not normally distributed.
While we have a limitation for small sample sizes, large samples will still be
applicable for making inferences under the central limit theorem. The key is
what is a large sample? As it turns out, as we approach a sample size of 30,
the central limit theorem starts to take effect. A sample of 30 or more is not
really that great a burden, so the central limit theorem has very important
uses for making inferences in statistics. No matter how our variable is distributed in the population, the sampling distribution of the mean will become more and more like a normal distribution as the sample size increases, and almost always it will be ok to use the normal distribution for making inferences with a sample size greater than 30.
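The central limit theorem can be watched in action with a population that is clearly not normal. The sketch below is an illustration under my own assumptions: I picked an exponential population (strongly skewed to the right) and use skewness as a rough gauge of how far the sampling distribution is from the symmetric shape of a normal curve. The function names, the seed, and the choice of 4,000 samples per sample size are all hypothetical.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_means(n, n_samples, rng):
    """Means of n_samples samples of size n from a right-skewed exponential population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(n_samples)
    ]

rng = random.Random(30)
skew_by_n = {}
for n in (2, 10, 30):
    means = sample_means(n, 4000, rng)
    skew_by_n[n] = skewness(means)
    print(f"n={n:2d}  mean of means={statistics.fmean(means):.3f}  "
          f"skewness of means={skew_by_n[n]:.3f}")
```

With n = 2 the sample means are still noticeably skewed, while by n = 30 the skewness is much closer to zero, which is the sense in which the sampling distribution becomes more normal as the sample size grows.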
How to Use the Sampling Distribution to Make an Inference
Up till now we have not really explained how we make an inference. In the last section of this chapter I will present the basic strategy that we will use in making an inference, but only in a very basic form. In Chapters 10 and 11 I will provide details on confidence intervals and hypothesis tests that will use the approach outlined here. All of the ideas presented here depend upon the sample being drawn in a random fashion. This means that every subject has an equal or near equal chance of being selected and there are no biases that would lead to one subject or a group of subjects having a greater chance of being selected. The most basic random sample is called a simple random sample, but there are other types, which are for the most part acceptable randomly based samples. However, if the sample is based on a convenience sample of available subjects, or some other means that is not random, the probabilities of our inference are not well known and the inference will not hold.
The following will be our strategy to make inferences from a sample to a
population.
1. We draw a random sample.
2. We think of our sample as one of many possible samples of size n from a
population with parameters µ and σ.
3. We use knowledge of the probabilities associated with the theoretical
sampling distribution to make a probability statement.
a. If the variable is distributed normally in the population (or we have
strong reason to believe it is so), we can assume the sampling distribution of the mean is distributed normally to make inferences from the
sample to the population, even if the sample size is small (n < 30).
b. If the variable is not distributed normally, but our sample size (n) is
large enough (n ≥ 30), we can assume the sampling distribution of the
mean is distributed normally under the central limit theorem. Then we
can use this information to make an inference.
Most inferences will be made using a rare event approach. In this strategy,
we see how rare it was to take a sample and come up with the observed
estimate (either a mean, proportion, or some other estimate) when compared
to some other hypothesized value. We use our sample as the basis for an
inference by comparing it to a hypothesized mean from a population. We will
test to see how close or far away our sample estimate is to the hypothesized
mean in the context of a sampling distribution. Both the confidence interval
and hypothesis approaches will be explored in detail in future modules.
Let us look at an example to illustrate the basic logic of inference using a
rare event approach. LCD televisions have a backlight that is expected to last
between 30,000 and 60,000 hours. Let us say within the population the mean
is 45,000 hours with a standard deviation of 7500 hours. This would mean
that the backlight of an LCD television should last for 12.3 years if it were
used 10 hours a day, 365 days a year. For the sake of argument, we will
assume that the population distribution is approximately normal—LCD Life
~N(45,000, 7,500).
Let us say we are involved in a consumer study of Brand X LCD televisions.
We have a process that allows us to simulate television use on a random
sample of 40 televisions so that we have a good indication of how long
the Brand X TVs last. The results of our sample experiment yield a mean of
38,000 with a standard deviation of 6,000. We want to know if our sample is
unusually low from the perspective of a sampling distribution of LCD television backlight lifespan. And we know the sampling distribution will follow a
normal distribution with expected values for μ and σ based on the population
and the sample size. A key point is that we have to think of our sample of 40
TVs as one of many possible samples that could have been drawn from this
population. The question we are asking is:
What is the chance (or probability) that we drew a random sample of 40 TVs
which resulted in a mean backlight life of 38,000 hours, if the true value of the
population is distributed normally with μ = 45,000 and σ = 7,500?
Expressed this way, the problem becomes a sampling distribution problem.
We did not ask for the probability of any one TV, but instead that the sample
mean would equal 38,000. We already know that if we took repeated samples
from this population, we would have some sample means that are greater
than 45,000 and some that are less. And we could use a z-score and the
standard normal table to find the probability associated with a value for any
sample mean. In this problem we want to know the probability associated
with a sample mean of 38,000 from a sample of 40. The z-score for this problem is the following.
$$ Z = \frac{38{,}000 - 45{,}000}{7{,}500 / \sqrt{40}} = \frac{-7{,}000}{1{,}185.85} = -5.90 $$
Notice that I used the standard error in the denominator of the z-score since
it is the standard deviation of the sampling distribution. I am thinking of my
sample as one of many possible samples from the sampling distribution with
μ = 45,000 and a standard error of 7,500/SQRT(40). If I had thought of 38,000 as one value in
my sample, the z-score would have been very different.
Z = (38,000 – 45,000)/6,000 = −1.167.
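The two z-scores can be checked side by side. The following is a minimal sketch using the numbers from the LCD example. The variable names are my own, and the second z-score divides by the sample standard deviation of 6,000, as the text does.

```python
import math

mu, sigma = 45_000, 7_500        # hypothesized population mean and standard deviation
n, xbar, s = 40, 38_000, 6_000   # sample size, sample mean, sample standard deviation

standard_error = sigma / math.sqrt(n)

# 38,000 treated as a sample mean: divide by the standard error.
z_mean = (xbar - mu) / standard_error

# 38,000 treated as a single observation: divide by a standard deviation.
z_single = (xbar - mu) / s

print(round(standard_error, 2), round(z_mean, 2), round(z_single, 3))
```

The only difference between the two calculations is the denominator, which is why the same distance of 7,000 hours looks ordinary for one television but extreme for a mean of 40 televisions.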
This is not such an unusual value for a sample observation if the mean
were 45,000. However, we are looking at a sample mean of 38,000, not an
individual value. This makes it a sampling distribution problem and we use
the standard error instead of the standard deviation.
It turns out the value of 38,000 is 5.90 standard deviations below the mean
in the sampling distribution. This is a very large absolute value of a z-score!
In fact, looking at the standard normal table, our value is off the chart. Based
on the furthest reaches of our table, we note that the probability of this event
is less than .0007. I arrived at this figure because our table goes to a z-score
of 3.19 with a probability of .4993. We want the area into the left hand tail,
so I subtract this value from .5 (.5 − .4993 = .0007). We figure the probability
in these problems as the area in one or both tails of the distribution. In this
case we want the left hand side of the tail below a value of 38,000. Using
a program like Excel, I can calculate the exact probability out into the left
tail as .0000000018. It is an extremely rare event to take a sample of 40 and
observe a sample mean of 38,000 or less if the true population values are
really μ = 45,000 and σ = 7,500.
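The tail area that falls off the printed table can be computed directly. This is a sketch that stands in for the Excel calculation mentioned above. It assumes the standard identity that the normal cumulative probability equals 0.5 * erfc(−z/SQRT(2)), and the helper name normal_cdf is hypothetical.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

z = (38_000 - 45_000) / (7_500 / math.sqrt(40))
left_tail = normal_cdf(z)

# Smallest tail area that can be read from a table ending at z = 3.19.
table_bound = 0.5 - 0.4993

print(f"{left_tail:.2e}", round(table_bound, 4))
```

The computed tail area should be on the order of two in a billion, far below the .0007 bound available from the table.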
This event is so rare as to cast doubt upon the population values. It is beyond
belief to take a sample of 40 and observe a sample mean of 38,000 if the
true population mean is 45,000. There must be something different about my
sample. Perhaps my sample comes from a different population and Brand X
does not have near the LCD backlight life of other LCD televisions. Based on
the results of our test, we would conclude that Brand X has a lower backlight
life than other LCD televisions.
In this example we conducted a hypothesis test. We will cover this in detail in
Chapter 11. A hypothesis test is based on a point estimate from a sample;
in this case the sample mean is 38,000. We test our sample estimate against
a null value, in this case 45,000, which is the stated mean for all LCD televisions. We found that it would be a very rare event to draw a sample of 40
from a population where μ = 45,000 and σ = 7,500 and get a sample mean of
38,000. In fact, it is so rare that we can reject the notion that it came from a
population with μ = 45,000. As a result, we might conclude that Brand X LCD
backlights last much less time.
An alternative to a point estimate and hypothesis test is an interval estimate
using a confidence interval. In this approach we put a bound of error around
our sample estimate based on sampling theory. To do this we take the sample
mean and put a plus and minus interval around it based on so many standard
errors. The number of standard errors is based on a probability level that
we have confidence in, e.g., 95% confidence. Conversely, 1 minus the confidence
level, or 1 − .95 = .05, is the chance of being wrong in our conclusion. For
example, if we want a 95% confidence interval around our estimate we use a
value of 1.96 standard errors added and subtracted from the mean. The 1.96
comes from the normal distribution and marks off the central 95% of the values in the
distribution. What we are saying, in so many words, is that we have a 95%
chance that the population mean is within this interval. The precise definition
of the confidence interval will come in Chapter 10. With a 95% confidence
interval we have a 5% chance of being wrong.
$$ 38{,}000 \pm 1.96 \times \frac{6{,}000}{\sqrt{40}} = 38{,}000 \pm 1.96 \times 948.683 = 38{,}000 \pm 1{,}859.419 $$
For Brand X TVs, our confidence interval says that we are 95% sure that the
true population value would lie between 36,140.58 hours and 39,859.42 hours.
We do not know exactly where the population mean lies, but we think it is in
this interval. But we could be wrong. In any case, 45,000 is nowhere near this
interval and we can safely conclude that Brand X backlights last much less
time than the industry norm.
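The interval for Brand X can be reproduced with a few lines of code. This is a minimal sketch, assuming the large-sample form used above (sample mean plus or minus 1.96 standard errors, with the sample standard deviation in the standard error). The function name confidence_interval is hypothetical.

```python
import math

def confidence_interval(xbar, s, n, z=1.96):
    """Return (low, high) for xbar plus or minus z standard errors."""
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin

low, high = confidence_interval(38_000, 6_000, 40)
print(round(low, 2), round(high, 2))

# The industry mean of 45,000 hours lies outside the interval.
print(low <= 45_000 <= high)
```

Changing z changes the confidence level. A larger multiplier gives a wider interval and a smaller chance of being wrong.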
Summary
Sampling distributions and the logic underlying them form the basis for
inference in statistics. I refer to this as the “logic of inference” when I talk
with students. It is the notion that we need to think of our observed sample
as being one of many possible samples from the population. And in that
framework, an estimate from my sample may be thought of as a reasonable estimate of the population parameter, but rarely will it exactly equal the
population value. The difference is referred to as sampling error. We expect
that the sampling error will be small, and it will vary from sample to sample.
We can show this by doing a simulation of repeated samples from a known
population. The typical size of the sampling error, measured by the standard error, will be a function of the
population parameter σ and the sample size n.
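To see how the standard error depends on σ and n, here is a small sketch using σ = 7,500 from the LCD example. The sample sizes 10, 40, and 160 are my own choices for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

sigma = 7_500  # population standard deviation from the LCD example
for n in (10, 40, 160):
    print(n, round(standard_error(sigma, n), 2))
```

Because n enters through a square root, the sample size has to be quadrupled to cut the standard error in half.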
If we know the distribution of the sampling distribution, we can use that
information to make an inference. It turns out that the sampling distribution
of the mean follows a normal distribution. We can use that information to
place a bound of error around our estimate in a confidence interval, or we
can conduct a hypothesis test for our estimate. In either case the inference is
placed within a probability framework. We will seek to put the probabilities
vastly in our favor, as in being 95% sure, but there is always a small chance
we will be wrong in our conclusions. Chapter 10 will deal with confidence
intervals and Chapter 11 with hypothesis tests.
For the rest of the book, we will not deal with certainties. Our estimates will
be made in a probability framework. Inferential statistics from a sample are
not about certainty. We can always be wrong in our inferences. If you want to
be certain, you need to get the population rather than a sample. However, we
will keep the probability of being wrong rather small. And for the most part,
we can live with a small chance of error.
Sampling Distribution Problems
1. In a sampling distribution for a mean, if we know the population mean and
standard deviation (μ and σ), then the distribution of the sample means follows a normal distribution, with a mean equal to μ and the standard deviation
equal to σ/SQRT(n). Note that σ/SQRT(n) is called the standard error. Based
on this information, if we took a sample of size n (n is given in the problem as
some number), what is the probability that the sample mean is greater than
(or less than) some value? This is simply a z-score and normal distribution
problem, similar to what we did in chapter 8. However, there is one important change with these problems. Now the denominator of the z-score is the
standard error and not σ. That said, answer the following questions.
A manufacturing process produces a product weighing an average of 150
grams, with a standard deviation of 12 grams (i.e., μ = 150 and σ = 12). If
the plant manager takes a sample of 36 observations, what is the probability that:
a. The sample mean is less than 148?
b. The sample mean is greater than 155?
c. The sample mean is between 147 and 153?
d. Suppose the sample size is now 100. How does the standard error
change? Recalculate the probability in part b and show how it changes
when the sample size is increased.
2. In a sampling distribution for a mean, if we know the population mean
and standard deviation (μ and σ), then the distribution of the sample
means follows a normal distribution, with a mean equal to μ and the
standard deviation equal to σ/SQRT(n). Note that σ/SQRT(n) is called the
standard error. Based on this information, if we took a sample of size n (n
is given in the problem as some number), what is the probability that the
sample mean is greater than (or less than) some value? This is simply a
z-score and normal distribution problem, similar to what we did in chapter 8. However, there is one important change with these problems. Now
the denominator of the z-score is the standard error and not σ. That said,
answer the following questions.
A manufacturing process produces a product that contains an average of
3.5 liters of liquid with a standard deviation of 0.25 liters (i.e., μ = 3.5 and
σ = 0.25). If the plant manager takes a sample of 64 observations, what is
the probability that:
a. The sample mean is greater than 3.55?
b. The sample mean is less than 3.6?
c. The sample mean is between 3.45 and 3.5?
d. The sample mean is exactly equal to 3.5?
3. In a sampling distribution for a mean, if we know the population mean
and standard deviation (μ and σ), then the distribution of the sample
means follows a normal distribution, with a mean equal to μ and the
standard deviation equal to σ/SQRT(n). Note that σ/SQRT(n) is called the
standard error. Based on this information, if we took a sample of size n (n
is given in the problem as some number), what is the probability that the
sample mean is greater than (or less than) some value? This is simply a
z-score and normal distribution problem, similar to what we did in chapter 8. However, there is one important change with these problems. Now
the denominator of the z-score is the standard error and not σ. That said,
answer the following questions.
A manufacturing process produces a product that contains an average of
3.5 liters of liquid with a standard deviation of 0.25 liters (i.e., μ = 3.5 and
σ = 0.25). If the plant manager takes a sample of 64 observations, what is
the probability that:
a. The sample mean is less than 3.55?
b. The sample mean is greater than 3.6?
c. The sample mean is between 3.45 and 3.75?
d. The sample mean is exactly equal to 3.75?
4. In a sampling distribution for a mean, if we know the population mean
and standard deviation (μ and σ), then the distribution of the sample
means follows a normal distribution, with a mean equal to μ and the
standard deviation equal to σ/SQRT(n). Note that σ/SQRT(n) is called the
standard error. Based on this information, if we took a sample of size n (n
is given in the problem as some number), what is the probability that the
sample mean is greater than (or less than) some value? This is simply a
z-score and normal distribution problem, similar to what we did in chapter 8. However, there is one important change with these problems. Now
the denominator of the z-score is the standard error and not σ. That said,
answer the following questions.
Assume the systolic blood pressure of young adults in the U.S. aged 20
to 30 years follows a normal distribution, with μ = 113.7 and σ = 11.7. If we
take a random sample of 150 young adults, what is the probability that:
a. The sample mean is between 113 and 115?
b. The sample mean is less than 111?
c. The sample mean is greater than 111?
d. The sample mean is greater than 116.5?
5. In a sampling distribution for a mean, if we know the population mean
and standard deviation (μ and σ), then the distribution of the sample
means follows a normal distribution with a mean equal to μ and the standard deviation equal to σ/SQRT(n). Note that σ/SQRT(n) is called the standard error. Based on this information, if we took a sample of size n (n is
given in the problem as some number), what is the probability that the
sample mean is greater than (or less than) some value? This is simply a
z-score and normal distribution problem, similar to what we did in chapter 8. However, there is one important change with these problems. Now
the denominator of the z-score is the standard error and not σ. That said,
answer the following questions.
Suppose the lifespan of the top-of-the-line car battery follows a normal
distribution with μ = 48.1 months and σ = 4.4 months based on regular
use and maintenance. If we take a random sample of 40 batteries, what
is the probability that:
a. The sample mean is between 45 and 51?
b. The sample mean is less than 46?
c. The sample mean is greater than 50?
d. The sample mean is greater than 46?
6. Suppose we work for a company that makes a popcorn product containing 1.2 ounces of unpopped kernels in a microwavable bag. However,
no manufacturing process is perfect, and there is variability from bag
to bag, which the factory manager seeks to keep to a minimum. Bags
that are underfilled can lead to consumer complaints and lawsuits, while
bags that are overfilled can result in lost profits and affect the quality of
the popping process. Based on previous experience, the distribution of
the popcorn bags is distributed normally with a mean of 1.21 oz. and a
standard deviation of 0.22, bag~N(1.21, 0.22). R
a. Suppose the manager takes a sample of 16 bags and observes a sample
mean of 1.20 with a standard deviation of 0.20. One of the bags in the
sample weighs 1.3 ounces. Calculate a z-score for the value of 1.3 and
interpret its meaning.
b. The manager asks the following question: “If the mean and standard
deviation of the population are true as given (i.e., μ = 1.21 and σ = 0.22),
and I took a random sample of 16 bags, what is the probability that the
sample mean would be greater than 1.3 oz.?” Note: This problem is a
sampling distribution problem. Use the standard error in calculating a
z-score.
c. The manager asks the same question as in part b, but in reference to 49
bags. He chooses a larger sample size. Compare the probabilities from
using a sample of 16 bags and a sample of 49 bags. Explain why the answers are
different.
7. The data below are means taken from random samples from a population that is distributed normally with μ = 75 and σ = 8. Samples of 30
observations were randomly drawn, and the mean was calculated for
each sample. This was done in Excel using the NORMINV and the RAND
functions. To demonstrate sampling distributions, 30 sample means were
examined (below). This is a small sample of the sample means, but nonetheless, it gives us insight into sampling distribution theory.
Sample Means
72.9   74.3   75.6
72.3   74.3   75.6
72.5   74.7   75.6
73.5   74.8   75.8
73.5   74.8   76.0
73.7   74.9   76.3
73.7   75.0   76.6
73.8   75.3   76.7
74.1   75.3   77.0
74.2   75.5   77.4
Count   30
Sum     2245.70
SumSq   168154.93
Min     72.30
Max     77.40
Q1      73.88
Q3      75.60
a. Construct a stem-and-leaf plot of the 30 sample means.
b. Calculate the mean, median, and mode of the sample means.
c. Calculate the variance and the standard deviation of the sample means.
d. Describe in words the distribution of the sample means using your
stem-and-leaf plot and the descriptive statistics. What is your conclusion regarding this small example of a sampling distribution from a
variable that is distributed normally?
e. Sampling theory tells us that the standard deviation of the sampling distribution should be σ/SQRT(n). For this data, that would be
8/SQRT(30) = 1.461. Compare the standard deviation you calculated for
the 30 means with this figure.
8. The data below are means taken from random samples from a population that is distributed normally with μ = 75 and σ = 8. Samples of 60
observations were randomly drawn, and the mean was calculated for
each sample. This was done in Excel using the NORMINV and the RAND
functions. To demonstrate sampling distributions, 30 sample means were
examined (below). This is a small sample of the sample means, but nonetheless, it gives us insight into sampling distribution theory.
Sample Means, Samples of 60
73.2   74.6   75.9
73.6   74.9   75.9
73.8   75.1   75.9
73.8   75.3   76.0
74.1   75.4   76.0
74.2   75.4   76.0
74.3   75.5   76.1
74.4   75.6   76.1
74.4   75.8   76.6
74.6   75.9   77.0
Count   30
Sum     2255.40
SumSq   169587.96
Min     73.20
Max     77.00
Q1      74.40
Q3      75.90
a. Construct a stem-and-leaf plot of the 30 sample means.
b. Calculate the mean, median, and mode of the sample means.
c. Calculate the variance and the standard deviation of the sample means.
d. Describe in words the distribution of the sample means using your
stem-and-leaf plot and the descriptive statistics. What is your conclusion regarding this small example of a sampling distribution from a
variable that is distributed normally?
e. Sampling theory tells us that the standard deviation of the sampling distribution should be σ/SQRT(n). For this data, that would be
8/SQRT(60) = 1.033. Compare the standard deviation you calculated for
the 30 means with this figure.
9. The data below are means taken from random samples from a population that is distributed as a uniform continuous distribution with parameters 0 and 10. A uniform continuous distribution would generate a
histogram that looks like a rectangle. With parameters 0 and 10 the mean
of this distribution is 5 and the standard deviation is 2.89 (think of these
as μ = 5 and σ = 2.89). A uniform distribution is decidedly not normal
or bell shaped and, as such, provides a good illustration of whether the
sampling distribution resembles a normal distribution. Samples of 36
observations were randomly drawn, and the mean was calculated for
each sample. This was done in Excel using the RAND function. To demonstrate sampling distributions, 36 different sample means were examined
(below). This is a small sample of the sample means, but nonetheless, it
gives us insight into sampling distribution theory.
Sample Means, Samples of 36
4.1   4.8   5.2
4.2   4.9   5.2
4.2   4.9   5.2
4.4   4.9   5.3
4.4   4.9   5.3
4.5   4.9   5.3
4.5   4.9   5.5
4.6   5.0   5.5
4.7   5.0   5.5
4.8   5.1   5.6
4.8   5.1   5.6
4.8   5.1   6.2
Count   36
Sum     178.90
SumSq   896.11
Min     4.10
Max     6.20
Q1      4.78
Q3      5.23
a. Construct a stem-and-leaf plot of the 36 sample means.
b. Calculate the mean, median, and mode of the sample means.
c. Calculate the variance and the standard deviation of the sample means.
d. Describe in words the distribution of the sample means using your
stem-and-leaf plot and the descriptive statistics. What is your conclusion regarding this small example of a sampling distribution from a
variable that is not distributed normally?
e. Sampling theory tells us that the standard deviation of the sampling
distribution should be σ/SQRT(n). For this data, that would be 2.89/
SQRT(36) = 0.482. Compare the standard deviation you calculated for
the 36 means with this figure.
10. The data below are means taken from random samples from a population
that is distributed as a uniform continuous distribution with parameters 0
and 10. A uniform continuous distribution would generate a histogram that
looks like a rectangle. With parameters 0 and 10, the mean of this distribution is 5, and the standard deviation is 2.89 (think of these as μ = 5 and σ =
2.89). A uniform distribution is decidedly not normal or bell shaped and, as
such, provides a good illustration of whether the sampling distribution resembles a normal distribution. Samples of 20 observations were randomly
drawn and the mean was calculated for each sample. (Note: Samples of 20
should not generate normal distributions based on the Central Limit Theorem, but they should still be close.) This was done in Excel using the RAND
function. To demonstrate sampling distributions, 36 different sample means
were examined (below). This is a small sample of the sample means, but
nonetheless, it gives us insight into sampling distribution theory.
Sample Means, Samples of 20
3.8   4.9   5.4
3.9   4.9   5.5
4.3   4.9   5.5
4.5   4.9   5.5
4.5   5.0   5.6
4.5   5.0   5.6
4.5   5.0   5.8
4.6   5.1   5.8
4.6   5.2   5.8
4.7   5.3   5.9
4.8   5.3   6.1
4.8   5.3   6.4
Count   36
Sum     183.2
SumSq   944.5
Min     3.8
Max     6.4
Q1      4.7
Q3      5.5
a. Construct a stem-and-leaf plot of the 36 sample means.
b. Calculate the mean, median, and mode of the sample means.
c. Calculate the variance and the standard deviation of the sample means.
d. Describe in words the distribution of the sample means using your stem-and-leaf plot and the descriptive
statistics. What is your conclusion regarding this small example of a sampling distribution from a variable
that is not distributed normally?
e. Sampling theory tells us that the standard deviation of the sampling distribution should be σ/SQRT(n). For this data, with samples of 20 observations, that would be 2.89/SQRT(20) =
0.646. Compare the standard deviation you calculated for the 36 means
with this figure.
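Problems 7 through 10 describe sample means generated in Excel with NORMINV and RAND. For readers who would rather generate their own sets of sample means, the following is a rough Python analogue using only the standard library. It is a sketch under stated assumptions, not the procedure used to produce the tables above: the function names, the seed, and the choice of 2,000 repetitions are all hypothetical.

```python
import random
import statistics

def simulate_sample_means(draw, n, reps, seed=1):
    """Draw `reps` samples of size n with draw(rng) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

normal = statistics.NormalDist(mu=75, sigma=8)

def draw_normal(rng):
    # Analogue of NORMINV(RAND(), 75, 8): invert the normal CDF at a uniform draw.
    u = rng.random()
    while u == 0.0:          # inv_cdf is not defined at exactly 0
        u = rng.random()
    return normal.inv_cdf(u)

def draw_uniform(rng):
    # Analogue of RAND() * 10: a uniform value between 0 and 10.
    return rng.random() * 10

cases = [("normal", draw_normal, 30, 8.0),
         ("uniform", draw_uniform, 36, 10 / 12 ** 0.5)]

for label, draw, n, sigma in cases:
    means = simulate_sample_means(draw, n, reps=2000)
    observed = statistics.stdev(means)   # standard deviation of the simulated means
    expected = sigma / n ** 0.5          # standard error from sampling theory
    print(label, round(observed, 3), round(expected, 3))
```

With only 30 or 36 sample means, as in the problems, the observed standard deviation will wander further from σ/SQRT(n) than it does with 2,000 repetitions. That extra variability is part of what the exercises are meant to show.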
Continuous Random Variables
and the Normal Distribution
chapter 8
Unlike discrete random variables, continuous random variables can take on any
point in an interval. Thus the probability distribution is continuous and is
referred to as a probability density function or PDF (also represented as f (x)). In
contrast to a discrete random variable, it is not particularly useful to think of
a probability for a particular value of a continuous random variable. In fact,
with a continuous random variable the probability that X equals any value is
zero: P (X = k) = 0. Instead, we will tend to think of the probability that X falls
between two values, is greater than a value, or is less than a value. We can do
this by finding areas under the probability density function curve.
In this chapter we will focus exclusively on one continuous random variable, the normal distribution. There are other continuous random variables,
such as the uniform distribution and the exponential distribution. However,
the normal distribution has important significance in statistics, especially in
inferential statistics. For much of the remainder of this course we will be
working with some aspect of the normal distribution in order to make an
inference from a sample to the population.
Normal Distribution
The normal distribution is one particular bell-shaped,
symmetrical distribution. The normal distribution is defined by a mathematical formula (see
below) which is specified by two key parameters, the mean and the standard
deviation. The formula for the normal distribution is given as:
$$ P(X) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(X-\mu)^2/(2\sigma^2)} $$
While the formula looks daunting, the key point is that the only things that
vary in this formula are the mean (mu) and the standard deviation (sigma). For
every distribution with a mean and a standard deviation there is a different
normal curve. Thus, there are an infinite number of normal curves. If X is a
random variable and is distributed as a normal variable then it is designated
as: X ~ N(mean, std Dev).
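The formula is less daunting once it is written as a function. Below is a minimal sketch that evaluates the normal density for a given mean and standard deviation. The function name normal_pdf is my own, and the example uses μ = 100 with σ = 10 and σ = 5.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

for sigma in (10, 5):
    # Height of the curve at the mean for each standard deviation.
    print(sigma, round(normal_pdf(100, 100, sigma), 4))
```

Halving σ doubles the height of the curve at the mean. The total area under the curve stays at 1, so a narrower curve has to be taller.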
For every combination of a mean and standard deviation there is a different
normal curve. Figures 8.1 and 8.2 show two normal distributions for μ = 100
and σ = 10 and 5. Changing σ from 10 to 5 changes the shape of the distribution, but both are normal distributions.
figure 8.1
Probability Density Function of a Normal Distribution with Mean 100 and Standard Deviation 10
figure 8.2
Probability Density Function of a Normal Distribution with Mean 100 and Standard Deviation 5
Since there are an infinite number of normal distributions depending upon
the combination of μ and σ, calculating probabilities for any one distribution
would be tedious. However, it is possible to transform any variable into a
z-score. Remember, a z-score is a transformation for a particular value in a
variable where we subtract the mean and divide by the standard deviation
(see Chapter 4).
$$ z = \frac{x_i - \bar{X}}{s} $$
Any variable where all values are transformed into z-scores will have a
mean equal to zero and a standard deviation equal to 1. This transformation will allow us to work with one normal distribution with μ = 0 and σ = 1.
This is referred to as the standard normal distribution and we will use the
designation, X ~ N(0, 1).
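The claim that z-scores always have a mean of zero and a standard deviation of 1 is easy to verify. The sketch below uses a small set of made-up values (hypothetical data, not from the text) and the sample standard deviation s.

```python
import statistics

values = [88, 95, 100, 104, 113, 120]   # hypothetical data for illustration

xbar = statistics.fmean(values)
s = statistics.stdev(values)            # sample standard deviation

z_scores = [(x - xbar) / s for x in values]

# Apart from floating-point rounding, these are 0 and 1.
print(round(statistics.fmean(z_scores), 10), round(statistics.stdev(z_scores), 10))
```

The same result holds for any set of values with some variation, whatever their original units.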
Properties of the Normal Distribution
The normal distribution reflects a probability and as such the total area under
the curve is equal to 1.0. This distribution is a symmetrical, bell-shaped curve.
This means that the area to the right of the center is equal to .5 and it is
exactly identical to the area on the left hand side of the curve. The normal
distribution is a continuous distribution so it will not be easy to reflect values
and outcomes in a frequency table. There are an infinite number of values
in the distribution. Instead we will think of solving for areas under the curve
between two values, or greater than or less than particular values. There will
in fact be a normal distribution table that we will learn about and work from.
This table will be tied to the standard normal distribution with a mean of zero
and a standard deviation of 1.0. And because the curve is symmetrical, we
often only tabulate probabilities for one half of the curve.
The normal distribution is expressed as a mathematical formula which is
specified by the mean and standard deviation. We tend to think that many
variables can be approximated by the normal distribution, but in reality, perhaps few things are defined as exactly normally distributed. However, there
are many continuous variables that reflect a mound-shaped symmetrical distribution with a center that is close to the mean and a shape that reflects
the standard deviation. And the normal distribution can reasonably reflect
the shape of those distributions. These can be the height of adult males, the
amount of soda in a bottle from a manufacturing process, or the distribution
of intelligence in a population.
Here are some basic properties of the normal distribution.
• The area under a probability density function (PDF) of the normal distribution is equal to 1.0.
• The normal distribution is defined by the mean (μ) and the standard deviation (σ). We will think of these as parameter values, and thus use μ and σ, but at times I will simply refer to them as the mean and the standard deviation of the distribution.
• The center of the distribution is specified by the mean.
• The mean equals the median which also equals the mode. All measures of central tendency that we have discussed are equal in the normal distribution.
• Most of the values in a variable that is normally distributed are found close to the center of the distribution. A little more than 2/3 of the values are within plus or minus one standard deviation from the mean. This gives the distribution its unique “bell” shape.
• The normal distribution has an infinite range of values, but the values out in the tails of the distribution are increasingly rare. Most values fall within two standard deviations of the mean.
• There are other known dimensions to the normal distribution which relate to other measures of center and spread of the data. For example, the mean is located at the 50th percentile and the IQR (interquartile range) is 1.349 standard deviations wide (from .6745 below to .6745 above the mean/median).
• With the normal distribution, we can now be much more precise about the empirical rule and the areas associated with one, two, or three standard deviations around the mean.
°° Plus or minus 1 standard deviation around the mean accounts for a probability of .6826; 68.26% of the values should fall within 1 standard deviation of the mean.
°° Plus or minus 2 standard deviations around the mean accounts for a probability of .9544; 95.44% of the values should fall within 2 standard deviations of the mean.
°° Plus or minus 3 standard deviations around the mean accounts for a probability of .9974; 99.74% of the values should fall within 3 standard deviations of the mean. In other words, virtually all the values will be within three standard deviations.
• In a normal distribution, it is a rare event for a value to be more than two
standard deviations away from the mean. It is an extremely rare event for
a value to be more than three standard deviations from the mean.
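The properties listed above can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `cumulative` and `area_within` are illustrative choices. It uses only the standard library. Because it works with unrounded values, its results can differ from the table-based figures (.6826, .9544, .9974) in the fourth decimal place.

```python
import math
from statistics import NormalDist

def cumulative(z: float) -> float:
    """P(Z <= z) for the standard normal distribution (mean 0, SD 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_within(k: float) -> float:
    """Probability of falling within k standard deviations of the mean."""
    return cumulative(k) - cumulative(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} SD: {area_within(k):.4f}")

# Width of the interquartile range in standard deviation units:
# twice the z-score at the 75th percentile.
z_75 = NormalDist().inv_cdf(0.75)
iqr_width = 2 * z_75
print(f"z at 75th percentile: {z_75:.4f}, IQR width: {iqr_width:.3f}")
```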
Since its properties are defined by a formula, we can define a priori probabilities associated with the curve. We can work out the probability of finding a value of 110 or higher if the mean is 100 and the standard deviation is 10, but this probability will be different if the mean is 100 and the standard deviation is 5. However, there is an easy way around this problem. If we convert our normally distributed variable to a z-score, we make it possible to use one table of probabilities for the normal distribution. This is called the standard normal distribution. Remember, for z-scores if we convert all the values in our variable to a z-score it would result in a new variable with a mean equal to zero and a standard deviation equal to one. This will allow us to use one table, the standard normal table, to solve probability problems with variables that are distributed normally.
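To see why the z-score conversion helps, here is a small Python sketch (an illustration, not the text's own method) of the example just described: the probability of a value of 110 or higher when the mean is 100 and the standard deviation is either 10 or 5. The helper names `z_score` and `upper_tail` are hypothetical. Once each case is converted to a z-score, the same standard normal calculation serves both.

```python
import math

def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def upper_tail(z: float) -> float:
    """P(Z >= z) for the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z_sd10 = z_score(110, 100, 10)   # 1.0 standard deviation above the mean
z_sd5 = z_score(110, 100, 5)     # 2.0 standard deviations above the mean

p_sd10 = upper_tail(z_sd10)
p_sd5 = upper_tail(z_sd5)

print(f"SD = 10: z = {z_sd10:.2f}, P(X >= 110) = {p_sd10:.4f}")
print(f"SD = 5:  z = {z_sd5:.2f}, P(X >= 110) = {p_sd5:.4f}")
```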
Standard Normal Table
In order to solve probability problems, you need to be able to understand and work with the standard normal table. A full copy of the standard normal table calculated from Excel is given in the Appendix, but we will start to learn how to use the table by looking at a partial table, in Table 8.1. The organization of this table is that it reflects probabilities from the center of the distribution, where the mean equals zero, out into the right hand tail of the distribution. This can be seen in Figure 8.3, which shows 1/2 of the standard normal distribution with a mean of 0 and a standard deviation of 1.0. The right hand side of the distribution, out into the right hand tail, reflects .5 probability and is identical in size and shape to the left hand side of the distribution. You should be aware that there are other ways to organize the standard normal table which would reflect probabilities from the left hand side up to the center, or from the left tail to the right hand tail. We will use the format found in Table 8.1.
The rows in Table 8.1 reflect the whole number and the first decimal place of a z-score. The columns reflect the second decimal place. For example, the shaded cell in Table 8.1 reflects the row 0.3 and the column .03 for a z-score of 0.33. The number .1293 in the table reflects the probability from the center of the distribution, where μ = 0, out .33 standard deviations to the right. The area under the standard normal curve for a z-score of .33 is .1293. We can read this as follows: the probability of falling between the center and .33 standard deviations to the right is .1293.
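The row-and-column lookup can be mimicked in a few lines of Python. This is only a sketch under the assumption that a table cell is the area from the center out to z, rounded to four decimal places; the names `center_to_z` and `table_cell` are made up for illustration.

```python
import math

def center_to_z(z: float) -> float:
    """Area under the standard normal curve from the center (0) out to |z|."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

def table_cell(z: float):
    """Split a z-score into its table row and column and return the cell value.

    The row is the whole number plus first decimal place, the column is the
    second decimal place, and the cell is rounded to four decimals.
    """
    hundredths = round(abs(z) * 100)      # e.g. 0.33 -> 33
    row = (hundredths // 10) / 10         # e.g. 33 -> 0.3
    column = (hundredths % 10) / 100      # e.g. 33 -> 0.03
    return row, column, round(center_to_z(hundredths / 100), 4)

row, column, probability = table_cell(0.33)
print(f"row {row}, column {column}, probability {probability}")
```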
Figure 8.3 A Graph Showing the Right Hand Side of the Standard Normal Distribution
Table 8.1 A Partial Standard Normal Table
Standard Normal Curve Probability Distribution
The table is based on the upper right 1/2 of the normal distribution; total area shown is .5
The z-score values are represented by the column value + row value, up to two decimal places
The probabilities up to the z-score are in the cells
Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
You can solve for other probabilities. For example, find the probability associated with the following z-scores. In most cases I am expressing the probability of the z-value as a range between two values, which is the way we tend to think of these problems. Some of the probabilities will be a little tricky, but try to find an answer and I will give the correct answer and an explanation. All of these can be solved by looking at Table 8.1.
a. P (0 ≤ Z ≤ 1.0) =
b. P (0 ≤ Z ≤ .89) =
c. P (0 ≤ Z ≤ 1.62) =
d. P (−1 ≤ Z ≤ 0) =
e. P (Z = 1) =
f. P (Z ≥ 1) =
The answers are given below, along with a brief explanation. Problems d, e, and f were a little tricky. I did this to push you a little to see if you could handle them, and to lead into the next discussion.
a. P (0 ≤ Z ≤ 1.00) = .3413
This comes directly from Table 8.1. The probability associated with the area from the center of the distribution out to 1.00 standard deviation is .3413.
b. P (0 ≤ Z ≤ .89) = .3133
This comes directly from Table 8.1. The probability associated with the area from the center of the distribution out to .89 standard deviations is .3133.
c. P (0 ≤ Z ≤ 1.62) = .4474
This comes directly from Table 8.1. The probability associated with the area from the center of the distribution out to 1.62 standard deviations is .4474.
d. P (−1.00 ≤ Z ≤ 0.00) = .3413
This one is a little tricky, but it also comes from Table 8.1 and knowledge that the left hand side of the distribution, where we would deal with negative z-scores, is identical to the right hand side.
e. P (Z = 1.00) = .0000
In a continuous distribution, the probability of any single value is thought to be zero. This definition was given in the first paragraph of this chapter.
f. P (Z ≥ 1.00) = .1587
We can solve this problem from Table 8.1 and the knowledge that the total area to the right of the center is .5. From Table 8.1 we can find that the probability from the center to 1 standard deviation above the mean is .3413. The area after this, that a z-value is greater than 1.00, is equal to .5 − .3413 = .1587.
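The six answers can be reproduced with a short Python sketch. It assumes a single helper, here called `center_to_z` (a hypothetical name), that plays the role of the table by returning the area from the center out to a z-score; the symmetry argument for problem d and the subtraction for problem f are then one line each.

```python
import math

def center_to_z(z: float) -> float:
    """Area from the center of the standard normal curve out to |z|.

    Taking the absolute value reflects the symmetry of the curve: the
    left hand side is a mirror image of the right hand side.
    """
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

answers = {
    "a": center_to_z(1.00),         # P(0 <= Z <= 1.00)
    "b": center_to_z(0.89),         # P(0 <= Z <= .89)
    "c": center_to_z(1.62),         # P(0 <= Z <= 1.62)
    "d": center_to_z(-1.00),        # P(-1.00 <= Z <= 0): same as problem a
    "e": 0.0,                       # a single point has no area
    "f": 0.5 - center_to_z(1.00),   # what is left in the right hand tail
}

for label, probability in answers.items():
    print(f"{label}. {probability:.4f}")
```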
The answer to problem d shows that the left hand side of the distribution is equal to the right hand side. I do not need a table with negative z-values in order to solve for these probabilities as long as I remember that the left hand side of the distribution is a mirror image of the right hand side. Problem f is a reminder that sometimes I want a different probability than what is in the table. In the case of problem f, I wanted the rest of the probability out in the right hand tail. To calculate this probability, I need to subtract the table probability from .5. I call these types of calculations normal table gymnastics.
A Closer Look at the Standard Normal Table
Table 8.2 shows the full probabilities for the standard normal distribution in the upper right hand side of the distribution. The table provides the probabilities from the center of the distribution, the mean, out to a particular z-score. The table allows for up to two decimal places for a z-score, and four decimal places of a probability. In some cases we may prefer more precision and we will have to estimate between values or interpolate, but the precision of the table will work well for most of our applications.
Let us begin by solving for the precise probabilities associated with one, two, or three standard deviations about the mean in the normal distribution. We noted in Chapter 4 that the empirical rule states that in a symmetrical, mound-shaped distribution, approximately 68% of the values should be within plus or minus 1 standard deviation from the mean, 95% should be within 2 standard deviations, and 99.7% should be within 3 standard deviations. Now we can precisely solve these values for the normal distribution, which is a particular symmetrical, mound-shaped distribution.
Table 8.2 Probabilities Found under One-Half of the Standard Normal Distribution
Standard Normal Curve Probability Distribution
The table is based on the upper right half of the normal distribution; total area shown is .5
The z-score values are represented by the column value + row value, up to two decimal places
The probabilities up to the z-score are in the cells
Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
2.1   0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
2.2   0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
2.3   0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
2.4   0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
2.5   0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
2.6   0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
2.7   0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
2.8   0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
2.9   0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
3.0   0.4987  0.4987  0.4987  0.4987  0.4987  0.4987  0.4988  0.4988  0.4988  0.4988
3.1   0.4990  0.4991  0.4991  0.4991  0.4992  0.4992  0.4992  0.4992  0.4993  0.4993
Probabilities computed using Microsoft Excel
From Table 8.2, a z-value of 1.00 is associated with a probability of .3413. This is the area of the distribution from the center out to one standard deviation above the mean. Since the standard normal distribution is a symmetric distribution, the area 1 standard deviation below the mean is also .3413. Thus the total area plus or minus 1 standard deviation around the mean is:
P (X ± 1σ) = .3413 + .3413 = .6826
This is very similar to what we said in the empirical rule; approximately 68% of the area is within 1 standard deviation about the mean. Figure 8.4 shows the area within plus or minus one standard deviation about the mean (in white). The area out in the tails (shaded in black) is equal to:
P (X ≤ −1σ or X ≥ 1σ) = (.5 − .3413)*2 = .1587*2 = .3174
We could have also solved for the areas in the tails by subtracting .6826 from 1.0.
Figure 8.4 Probabilities Associated with Plus or Minus 1.0 Standard Deviation about the Mean in the Standard Normal Table
Similarly we can solve for plus or minus two standard deviations or plus or minus three standard deviations by the following calculations (see below). These calculations result in similar probabilities as noted from the empirical rule. Plus or minus 2 standard deviations captures about 95% of the observations while plus or minus 3 standard deviations captures 99.7% of the values in a normally distributed variable. Because these values are so important for inferential statistics, I have also included the graphs for each interval (Figures 8.5 and 8.6). The area in the tails for more than three standard deviations can hardly be seen in the graph because it is so small. Any value that far out in the distribution is an extremely rare event. The areas out in the tails of the distribution, beyond two or three standard deviations from the mean, will be very important as we shift to inference in upcoming chapters.
P (X ± 2σ) = .4772 + .4772 = .9544
P (X ≤ −2σ or X ≥ 2σ) = (.5 − .4772)*2 = .0228*2 = .0456
P (X ± 3σ) = .4987 + .4987 = .9974
P (X ≤ −3σ or X ≥ 3σ) = (.5 − .4987)*2 = .0013*2 = .0026
Figure 8.5 Probabilities Associated with Plus or Minus 2.0 Standard Deviations about the Mean in the Standard Normal Table
Figure 8.6 Probabilities Associated with Plus or Minus 3.0 Standard Deviations about the Mean in the Standard Normal Table
Normal Table Gymnastics
The organization of a standard normal table with probabilities in half of the probability distribution is a choice that makes some problems easy to solve while other ones require a few calculations. This approach is similar to the discussion for the cumulative binomial probability tables; it will require some normal table gymnastics. This section will go through some typical problems that we may face with the standard normal distribution and how we might solve for probabilities.
In general, I tell students to use the following strategy when working with a problem involving the standard normal table. If you follow this approach you will have success in solving these problems.
1. Draw out the problem using the normal distribution
2. Calculate the relevant z-scores
3. Look up probabilities in the standard normal table
4. Do any necessary calculations (the gymnastics)
Normal Probability Problem 1. Suppose X ~ N(100, 10). What is the probability that X is greater than 115?
In this problem we are looking for the area in the right hand tail that is greater than 115. If we were drawing this problem out, the curve would reflect the area shaded in Figure 8.7. This area requires us to calculate a z-score, find the probability in the standard normal table, and subtract that probability from .5 in order to find the remaining probability in the right hand side of the distribution. This is because our standard normal table provides probabilities from the center of the distribution out into the right hand tail. The answer shown in Figure 8.7 is .0668; 6.68% of the values should fall above 115 for a variable distributed normally with a mean of 100 and a standard deviation of 10.
1. Draw out the problem using the normal distribution
2. Calculate the relevant z-scores: Z = (115 − 100)/10 = 1.50
3. Look up probabilities in the standard normal table: 1.50 in the standard normal table is associated with a probability of .4332.
4. Do any necessary calculations (the gymnastics): We want the probability out into the right hand tail. This requires us to subtract .4332 from .5.
P (X > 115) = .5 − .4332 = .0668
Figure 8.7 Solution for the Probability X > 115 for a Variable ~N(100, 10)
Normal Probability Problem 2. Suppose X ~ N(100, 10). What is the probability that X is less than 115?
In this problem we are looking for the area left of 115. If we were drawing
this problem out, the drawing would reflect the area shaded in Figure 8.8.
This area requires us to calculate a z-score, find the probability in the standard normal table, and add that probability to .5 in order to find the remaining probability to the left of 115. This is because our standard normal table
provides probabilities from the center of the distribution out into the right
hand tail and we want to add the whole left hand side of the distribution. The
answer shown in Figure 8.8 is .9332 – 93.32% of the values should fall below
115 for a variable distributed normally with a mean of 100 and a standard
deviation of 10.
1. Draw out the problem using the normal distribution
2. Calculate the relevant z-scores: Z = (115 − 100)/10 = 1.50
3. Look up probabilities in the standard normal table: 1.50 in the standard normal table is associated with a probability of .4332.
4. Do any necessary calculations (the gymnastics): We want the probability to the left of 115. This requires us to add .4332 to .5.
P (X < 115) = .5 + .4332 = .9332
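Figure 8.8 Solution for the Probability X < 115 for a Variable ~N(100, 10)
Normal Probability Problem 3. Suppose X ~ N(100, 10). What is the probability that X is less than 80?
P (X < 80) = .5 − .4772 = .0228
Figure 8.9 Solution for the Probability X < 80 for a Variable ~N(100, 10)
Normal Probability Problem 4. Suppose X ~ N(100, 10). What is the probability that X is between 80 and 115?
If we were drawing this problem out, the drawing would reflect the area shaded in Figure 8.10. This area requires us to calculate two z-scores, find the probability in the standard normal table associated with each of them, and add them together. The answer shown in Figure 8.10 is .9104; 91.04% of the values should fall between 80 and 115 for a variable distributed normally with a mean of 100 and a standard deviation of 10.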
1. Draw out the problem using the normal distribution
2. Calculate the relevant z-scores:
Z1 = (80 − 100)/10 = −2.00
Z2 = (115 − 100)/10 = 1.50
3. Look up probabilities in the standard normal table: −2.00 (converted to 2.00) in the standard normal table is associated with a probability of .4772. 1.50 in the standard normal table is associated with a probability of .4332.
4. Do any necessary calculations (the gymnastics): We want the probability between the two values. This requires us to add .4772 and .4332.
P (80 < X < 115) = .4772 + .4332 = .9104
Figure 8.10 Solution for the Probability X > 80 and X < 115 for a Variable ~N(100, 10)
Normal Probability Problem 5. Suppose X ~ N(100, 10). What is the probability that X is between 110 and 125?
In this problem we are looking for the area between two values that are
on the same side of the mean (100). If we were drawing this problem out,
the curve would reflect the area shaded in Figure 8.11. This area requires us
to calculate two z-scores, find the probability in the standard normal table associated with each of them, and then subtract the smaller value from the larger value to get the probability between them. The answer shown in Figure 8.11 is .1525; 15.25% of the values should fall between 110 and 125 for a variable distributed normally with a mean of 100 and a standard deviation of 10.
1. Draw out the problem using the normal distribution
2. Calculate the relevant z-scores:
Z1 = (110 − 100)/10 = 1.00
Z2 = (125 − 100)/10 = 2.50
3. Look up probabilities in the standard normal table: 1.00 in the standard normal table is associated with a probability of .3413. 2.50 in the standard normal table is associated with a probability of .4938.
4. Do any necessary calculations (the gymnastics): We want the probability between the two values. This requires us to subtract .3413 from .4938.
P (110 < X < 125) = .4938 − .3413 = .1525
Figure 8.11 Solution for the Probability X > 110 and X < 125 for a Variable ~N(100, 10)
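The gymnastics in the normal probability problems above follow one pattern, which the Python sketch below tries to capture. It is an illustration under my own assumptions rather than the text's procedure: `table_area` stands in for the half table, and `prob_between` (a hypothetical name) treats the area on the left of the mean as negative so that adding and subtracting fall out of a single subtraction. Because the code does not round the table values, the last result comes out near .1524 rather than the table-based .1525.

```python
import math
from typing import Optional

def table_area(z: float) -> float:
    """Area from the center out to |z|, as the half table would give it."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

def prob_between(lower: Optional[float], upper: Optional[float],
                 mean: float, sd: float) -> float:
    """P(lower < X < upper) for X ~ N(mean, sd); None means an open end."""
    def signed_area(x: float) -> float:
        # Area between the mean and x, counted as negative below the mean.
        z = (x - mean) / sd
        return math.copysign(table_area(z), z)

    low = -0.5 if lower is None else signed_area(lower)
    high = 0.5 if upper is None else signed_area(upper)
    return high - low

# X ~ N(100, 10), where 10 is the standard deviation.
right_of_115 = prob_between(115, None, 100, 10)     # .5 - area(1.50)
left_of_115 = prob_between(None, 115, 100, 10)      # .5 + area(1.50)
left_of_80 = prob_between(None, 80, 100, 10)        # .5 - area(2.00)
from_80_to_115 = prob_between(80, 115, 100, 10)     # area(2.00) + area(1.50)
from_110_to_125 = prob_between(110, 125, 100, 10)   # area(2.50) - area(1.00)

for name, value in [("P(X > 115)", right_of_115),
                    ("P(X < 115)", left_of_115),
                    ("P(X < 80)", left_of_80),
                    ("P(80 < X < 115)", from_80_to_115),
                    ("P(110 < X < 125)", from_110_to_125)]:
    print(f"{name} = {value:.4f}")
```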
Calculating a Value of X from a Given Percentile
Another problem we might solve in the standard normal table is calculating the value of X at a given percentile. Suppose X ~ N(50, 5) and we want to find the value of X at the 75th percentile. We can use the standard normal table to help solve this problem, but we will use it in a slightly different way. In these problems we will search for a probability in the table and read the z-value associated with it. Then we use the z-value to calculate the value of X. A partial standard normal table is provided below to help in these problems (Table 8.3). The shaded areas reflect probabilities for the problems.
Here are the basic steps to solving for a value of X at a given percentile.
1. Draw out the desired area.
2. Find the probability in the table associated with the desired percentile.
3. Read the z-score based on the percentile, being careful to keep the right sign.
4. Solve for the value of X based on the z-score.
The formulas we need are:
Z = (X − μ)/σ and X = (σ · Z) + μ
Percentile Problem 1. The height of adult males is distributed approximately
normally with a mean of 176 cm and a standard deviation of 7.8, Male height
~N(176, 7.8). What is the height at the 80th percentile?
Table 8.3 Partial Probabilities Found under One-Half of the Standard Normal Distribution for Solving Percentile Problems
Standard Normal Curve Probability Distribution
The table is based on the upper right half of the normal distribution; total area shown is .5
The z-score values are represented by the column value + row value, up to two decimal places
The probabilities up to the z-score are in the cells
Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
Probabilities computed using Microsoft Excel
To solve this we look for a probability of .3000 inside the standard normal table. Why do we look for .3000? The 80th percentile represents the .50 area of the left hand side of the distribution along with an additional .30 probability of the right hand side. While there is not an exact probability of .30 in Table 8.3, we can see a value of .2995 in the table (highlighted in Table 8.3) which is very close. This will do for our purposes. The z-value associated with this probability is .84 (I read from the probability out to the row and column margins). We use this value to solve for the value of X. The answer is found in Figure 8.12. A value of 182.55 cm is the height of males at the 80th percentile.
1. Draw out the desired area.
2. Find the probability in the table associated with the desired percentile: We are looking for a probability close to .3000 in the standard normal distribution. A probability of .2995 can be found, which is very close.
3. Read the z-score based on the percentile, being careful to keep the right sign: The z-value associated with .2995 is .84.
4. Solve for the value of X based on the z-score: X = 7.8*.84 + 176 = 182.55
Figure 8.12 Solution for Solving the 80th Percentile for the Average Height of Males (in Centimeters) That Is ~N(176, 7.8)
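The 80th percentile calculation can also be sketched in Python. This is an illustration, not the text's method: it shows the table-based arithmetic with z = .84 next to the value from `statistics.NormalDist.inv_cdf` in the Python standard library, which solves for the percentile directly. The variable names are my own. The two answers differ slightly because .84 is only the closest z-score available in the table.

```python
from statistics import NormalDist

mean_height = 176.0   # cm
sd_height = 7.8       # cm

# Table-based approach: the closest table probability to .3000 is .2995,
# which corresponds to z = .84. Then X = (sigma * Z) + mu.
z_from_table = 0.84
x_from_table = sd_height * z_from_table + mean_height

# Direct approach: ask the standard library for the 80th percentile.
z_exact = NormalDist().inv_cdf(0.80)
x_exact = NormalDist(mu=mean_height, sigma=sd_height).inv_cdf(0.80)

print(f"table: z = {z_from_table:.2f}, X = {x_from_table:.2f} cm")
print(f"exact: z = {z_exact:.4f}, X = {x_exact:.2f} cm")
```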
Percentile Problem 2. The height of adult males is distributed approximately
normally with a mean of 176 cm and a standard deviation of 7.8, Male height
~N(176, 7.8). What is the height at the 30th percentile?
To solve this we look for a probability of .2000 inside the standard normal table. Why do we look for .2000? The 30th percentile represents a .30 area from the left hand side of the distribution. But the table works from the center out to the tails. So a probability of .2000 is the area from the center and .3000 is the remaining area out in the left hand tail of the distribution. While there is not an exact probability of .2000 in Table 8.3, we can see a value of .1985 and a second value of .2019 in the table (highlighted in Table 8.3). Roughly in the middle of these two values would be .2000, so I will use a z-value of −.525 (the middle of −.52 and −.53). You should note that the z-score is negative for this problem because we are dealing with a value below the mean. We use this value to solve for the value of X. The answer is found in Figure 8.13. A value of 171.91 cm is the height of males at the 30th percentile.
1. Draw out the desired area.
2. Find the probability in the table associated with the desired percentile: We are looking for a probability close to .2000 in the standard normal distribution. Probabilities of .1985 and .2019 can be found which are very close, and we will use a z-value in the middle of these two probabilities.
3. Read the z-score based on the percentile, being careful to keep the right sign: The z-value associated with .2000 is −.525.
4. Solve for the value of X based on the z-score: X = 7.8*(−.525) + 176 = 171.91
Figure 8.13 Solution for Solving the 30th Percentile for the Average Height of Males (in Centimeters) That Is ~N(176, 7.8)
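Taking the middle of the two z-values is a quick form of interpolation. A slightly more careful version, linear interpolation between the two table entries, is sketched below in Python. The function name `interpolate_z` is hypothetical, and linear interpolation is my assumption about how one might refine the estimate; it lands very close to the midpoint used above.

```python
def interpolate_z(target: float, z_low: float, p_low: float,
                  z_high: float, p_high: float) -> float:
    """Linearly interpolate a z-score between two neighboring table entries."""
    fraction = (target - p_low) / (p_high - p_low)
    return z_low + fraction * (z_high - z_low)

# Table 8.3 entries on either side of .2000: z = .52 -> .1985, z = .53 -> .2019
z_magnitude = interpolate_z(0.2000, 0.52, 0.1985, 0.53, 0.2019)

# The 30th percentile is below the mean, so the z-score is negative.
z_30th = -z_magnitude
height_30th = 7.8 * z_30th + 176

print(f"z = {z_30th:.4f}, height = {height_30th:.2f} cm")
```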
The Normal Approximation to the Binomial Distribution
In Chapter 7 we noted that proportions for yes/no or other dichotomous variables can be thought of as binomial random variables. Remember for a binomial the focus is on N, the sample size; p, the probability of a success in a random trial; and q, the probability of a failure (1 − p). It turns out that when N is large relative to the size of p, the binomial distribution appears like a normal distribution. Even though the binomial is a discrete distribution, the continuous normal distribution provides a reasonable approximation and can be far easier to calculate when dealing with probability problems and later in inference.
The relationship between N and p and how they affect how normal the distribution appears to be is easy to show. Figure 8.14 shows the binomial distribution for p = .2 when N = 5 and N = 25. As N gets larger, the distribution looks more like a symmetrical, mound-shaped distribution. And the larger N gets, for any value of p, the more the distribution approximates a normal distribution. The graph on the right looks more like a symmetrical distribution. If we showed the same plot for N = 100 it would appear even more like the normal distribution.
Figure 8.14 The Binomial Distribution for p = .20 and N = 5 and N = 25
The general rule of thumb for using the normal distribution to approximate probabilities for proportions is that both p*N and q*N must equal at least 5.0.
We require both N*p ≥ 5 and N*q ≥ 5
Note that in the scenario depicted in Figure 8.14, when N = 5 the calculation of N*p equals only 1. However, in the second calculation, with N = 25, the calculation of 25*.2 equals 5. Our rule of thumb focuses on p or q, whichever reflects the smaller proportion. As a result, rare events (i.e., with p or q less than .05) will require large sample sizes in order to use the normal approximation. My own feeling is that when we use this approach we should err on the safe side and be very careful when dealing with rare events. I prefer a calculation of 10 or more in order to use the normal approximation. Small sample problems should use the binomial distribution to calculate probabilities.
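The rule of thumb is simple enough to express as a small check. The Python sketch below is illustrative only; the function name `normal_approx_ok` and its `threshold` parameter are my own, with 5 as the general rule and 10 as the more cautious preference described above.

```python
def normal_approx_ok(n: int, p: float, threshold: float = 5.0) -> bool:
    """Check the rule of thumb using the smaller of p and q = 1 - p."""
    q = 1.0 - p
    # A tiny tolerance guards against floating point rounding at the boundary.
    return n * min(p, q) >= threshold - 1e-9

print(normal_approx_ok(5, 0.2))                    # N*p = 1, too small
print(normal_approx_ok(25, 0.2))                   # N*p = 5, meets the rule
print(normal_approx_ok(25, 0.2, threshold=10.0))   # fails the stricter rule
print(normal_approx_ok(50, 0.6))                   # smaller of 30 and 20 is 20
```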
Table 8.4 provides calculations for different combinations of N and p to demonstrate what type of sample is needed for the normal approximation to work. The probabilities only go to .5 because we only need to work with the lower of p or q to generate the needed sample size. The shaded area represents all the combinations of N and p that generate a calculation of 5.0 or higher and the darkest shading shows the combinations that generate a value of 10 or higher. It is clear from this table that the closer p or q is to .5, the smaller the sample size needed for the approximation.
Let us look at a proportion problem and see how the normal approximation to the binomial distribution would work in practice. The proportion answering yes to a public policy question is believed to be .60. If we randomly sampled 50 people, what is the probability that more than 35 people would support this measure? This proportion can be thought of as a binomial random variable with p = .6 and N = 50. The expected value and variance for this distribution are given below.
Expected Value: E(X) = 50*.6 = 30.0
Variance: E[(X − 30)²] = 50*.6*.4 = 12.0
Standard Deviation: SQRT(12.0) = 3.4641
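As a rough sketch of how this problem could be set up, the Python below computes the expected value, variance, and standard deviation, and then compares the exact binomial probability of more than 35 successes with two normal approximations. The continuity correction (using 35.5 rather than 35) is my own assumption about how the approximation might be refined; it is not taken from the text above, and the variable names are illustrative.

```python
import math

n, p = 50, 0.6
q = 1.0 - p

expected_value = n * p          # 50 * .6
variance = n * p * q            # 50 * .6 * .4
sd = math.sqrt(variance)

def upper_tail(z: float) -> float:
    """P(Z >= z) for the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Exact binomial probability of 36 or more successes ("more than 35").
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(36, n + 1))

# Normal approximation, first treating 35 as the cutoff ...
approx_plain = upper_tail((35 - expected_value) / sd)
# ... then with a continuity correction (an assumption, see the lead-in).
approx_corrected = upper_tail((35.5 - expected_value) / sd)

print(f"E(X) = {expected_value:.1f}, variance = {variance:.1f}, SD = {sd:.4f}")
print(f"exact binomial:        {exact:.4f}")
print(f"normal, cutoff 35:     {approx_plain:.4f}")
print(f"normal, cutoff 35.5:   {approx_corrected:.4f}")
```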
table 8.4
Combinations of p and N to Generate the Minimum Requirement for ...