7 Correlation and Causality
LEARNING GOALS
7.1 Seeking Correlation Define correlation, explore correlations with scatterplots,
and understand the correlation coefficient as a measure of the strength of a
correlation.
7.2 Interpreting Correlations Be aware of important cautions concerning the
interpretation of correlations, especially the effects of outliers, the effects of grouping
data, and the crucial fact that correlation does not necessarily imply causality.
7.3 Best-Fit Lines and Prediction Become familiar with the concept of a best-fit
line, recognize when such lines have predictive value and when they do not, and
understand the general concept of multiple regression.
7.4 The Search for Causality Understand the difficulty of establishing causality
from correlation, and investigate guidelines that can be used to help establish
confidence in causality.
FOCUS TOPICS
p. 271 Focus on Education: What Helps Children Learn to Read?
p. 273 Focus on Environment: What Is Causing Global Warming?
Does smoking cause lung cancer? Are drivers more dangerous when on their cell
phones? Is human activity causing global warming? A major goal of many statistical
studies is to search for relationships among different variables so that researchers can
then determine whether one factor causes another. Once a relationship is discovered, we
can try to determine whether there is an underlying cause. In this chapter, we will study
relationships known as correlations and explore how they are important to the more
difficult task of searching for causality.
The person who knows “how” will always have a job. The person who knows “why” will
always be his boss.
—Diane Ravitch
7.1 SEEKING CORRELATION
What does it mean when we say that smoking causes lung cancer? It certainly
does not mean that you’ll get lung cancer if you smoke a single cigarette. It does not
even mean that you’ll definitely get lung cancer if you smoke heavily for many years, as
some heavy smokers do not get lung cancer. Rather, it is a statistical statement meaning
that you are much more likely to get lung cancer if you smoke than if you don’t smoke.
How did researchers learn that smoking causes lung cancer? The process began with
informal observations, as doctors noticed that a surprisingly high proportion of their
patients with lung cancer were smokers. These observations led to carefully conducted
studies in which researchers compared lung cancer rates among smokers and
nonsmokers. These studies showed clearly that heavier smokers were more likely to get
lung cancer. In more formal terms, we say that there is a correlation between the
variables amount of smoking and likelihood of lung cancer. A correlation is a special
type of relationship between variables, in which a rise or fall in one goes along with a
corresponding rise or fall in the other.
Smoking is one of the leading causes of statistics. —Fletcher Knebel
Definition
A correlation exists between two variables when higher values of one variable
consistently go with higher values of another variable or when higher values of one
variable consistently go with lower values of another variable.
Here are a few other examples of correlations:
• There is a correlation between the variables height and weight for people; that is,
taller people tend to weigh more than shorter people.
• There is a correlation between the variables demand for apples and price of
apples; that is, demand tends to decrease as price increases.
• There is a correlation between practice time and skill among piano players; that is,
those who practice more tend to be more skilled.
It’s important to realize that establishing a correlation between two variables
does not mean that a change in one variable causes a change in the other. The
correlation between smoking and lung cancer did not by itself prove that smoking
causes lung cancer. We could imagine, for example, that some gene predisposes a
person both to smoking and to lung cancer. Nevertheless, identifying the correlation was
the crucial first step in learning that smoking causes lung cancer. We will discuss the
difficult task of establishing causality later in this chapter. For now, we concentrate on
how we look for, identify, and interpret correlations.
BY THE WAY
Smoking is linked to many serious diseases besides lung cancer, including heart disease
and emphysema. Smoking is also linked with many less lethal health conditions, such as
premature skin wrinkling and sexual impotence.
TIME OUT TO THINK
Suppose there really were a gene that made people prone to both smoking and lung
cancer. Explain why we would still find a strong correlation between smoking and lung
cancer in that case, but would not be able to say that smoking causes lung cancer.
Scatterplots
Table 7.1 lists data for a sample of gem-store diamonds—their prices and several
common measures that help determine their value. Because advertisements for
diamonds often quote only their weights (in carats), we might suspect a correlation
between the weights and the prices. We can look for such a correlation by making
a scatterplot (or scatter diagram) showing the relationship between the
variables weight and price.
TABLE 7.1 Prices and Characteristics of a Sample
of 23 Diamonds from Gem Dealers

Diamond   Price     Weight (carats)   Depth   Table   Color   Clarity
1         $6,958    1.00              60.5    65      3       4
2         $5,885    1.00              59.2    65      5       4
3         $6,333    1.01              62.3    55      4       4
4         $4,299    1.01              64.4    62      5       5
5         $9,589    1.02              63.9    58      2       3
6         $6,921    1.04              60.0    61      4       4
7         $4,426    1.04              62.0    62      5       5
8         $6,885    1.07              63.6    61      4       3
9         $5,826    1.07              61.6    62      5       5
10        $3,670    1.11              60.4    60      9       4
11        $7,176    1.12              60.2    65      2       3
12        $7,497    1.16              59.5    60      5       3
13        $5,170    1.20              62.6    61      6       4
14        $5,547    1.23              59.2    65      7       4
15        $7,521    1.29              59.6    59      6       2
16        $7,260    1.50              61.1    65      6       4
17        $8,139    1.51              63.0    60      6       4
18        $12,196   1.67              58.7    64      3       5
19        $14,998   1.72              58.5    61      4       3
20        $9,736    1.76              57.9    62      8       2
21        $9,859    1.80              59.6    63      5       5
22        $12,398   1.88              62.9    62      6       2
23        $11,008   2.03              62.0    63      8       3

Notes: Weight is measured in carats (1 carat = 0.2 gram). Depth is defined as 100 times
the ratio of height to diameter. Table is the size of the upper flat surface. (Depth and
table determine “cut.”) Color and clarity are each measured on standard scales, where 1
is best. For color, 1 = colorless, and increasing numbers indicate more yellow. For
clarity, 1 = flawless, and 6 indicates that defects can be seen by eye.
BY THE WAY
The word karats (with a k) used to describe gold does not have the same meaning as the
term carats (with a c) for diamonds and other gems. A carat is a measure of weight
equal to 0.2 gram. Karats are a measure of the purity of gold: 24-karat gold is 100% pure
gold; 18-karat gold is 75% pure (and 25% other metals); 12-karat gold is 50% pure (and
50% other metals); and so on.
Definition
A scatterplot (or scatter diagram) is a graph in which each point represents the values
of two variables.
Figure 7.1 shows the scatterplot, which can be constructed with the following
procedure.
1. We assign one variable to each axis and label each axis with values that comfortably
fit all the data. Sometimes the axis selection is arbitrary, but if we suspect that one
variable depends on the other, then we plot the explanatory variable on the
horizontal axis and the response variable on the vertical axis. In this case, we expect
the diamond price to depend at least in part on its weight; we therefore say
that weight is the explanatory variable (because it helps explain the price)
and price is the response variable (because it responds to changes in the explanatory
variable). We choose a range of 0 to 2.5 carats for the weight axis and $0 to $16,000
for the price axis.
Figure 7.1 Scatterplot showing the relationship
between the variables price and weight for the
diamonds in Table 7.1. The dashed lines show
how we find the position of the point for
Diamond 10.
2. For each diamond in Table 7.1, we plot a single point at the horizontal position
corresponding to its weight and the vertical position corresponding to its price. For
example, the point for Diamond 10 goes at a position of 1.11 carats on the horizontal
axis and $3,670 on the vertical axis. The dashed lines on Figure 7.1 show how we
locate this point.
3. (Optional) We can label some (or all) of the data points, as is done for Diamonds
10, 16, and 19 in Figure 7.1.
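The three plotting steps above can also be sketched in code. The following minimal Python sketch (the variable names are our own) carries out steps 1–3 for the data in Table 7.1:

```python
# Steps 1-3 of the scatterplot construction, applied to Table 7.1.
# Weight (carats) is the explanatory variable; price (dollars) is the response.
weights = [1.00, 1.00, 1.01, 1.01, 1.02, 1.04, 1.04, 1.07, 1.07, 1.11, 1.12,
           1.16, 1.20, 1.23, 1.29, 1.50, 1.51, 1.67, 1.72, 1.76, 1.80, 1.88, 2.03]
prices = [6958, 5885, 6333, 4299, 9589, 6921, 4426, 6885, 5826, 3670, 7176,
          7497, 5170, 5547, 7521, 7260, 8139, 12196, 14998, 9736, 9859, 12398, 11008]

# Step 1: axis ranges chosen to comfortably fit all the data.
weight_axis = (0.0, 2.5)   # carats
price_axis = (0, 16000)    # dollars
assert all(weight_axis[0] <= w <= weight_axis[1] for w in weights)
assert all(price_axis[0] <= p <= price_axis[1] for p in prices)

# Step 2: each diamond becomes one (horizontal, vertical) point.
points = list(zip(weights, prices))

# Step 3 (optional): label selected points; Diamond 10 is list index 9.
print("Diamond 10 plots at", points[9])
```

Handing the `points` list to any plotting tool (a spreadsheet chart or a graphing library) then produces the scatterplot of Figure 7.1.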
Scatterplots get their name because the way in which the points are scattered may reveal
a relationship between the variables. In Figure 7.1, we see a general upward trend
indicating that diamonds with greater weight tend to be more expensive. The correlation
is not perfect. For example, the heaviest diamond is not the most expensive. But the
overall trend seems fairly clear.
TIME OUT TO THINK
Identify the points in Figure 7.1 that represent Diamonds 3, 7, and 23.
EXAMPLE
Color and Price
Using the data in Table 7.1, create a scatterplot to look for a correlation between a
diamond’s color and price. Comment on the correlation.
SOLUTION We expect price to depend on color, so we plot the explanatory
variable color on the horizontal axis and the response variable price on the vertical axis
in Figure 7.2. (You should check a few of the points against the data in Table 7.1.) The
points appear much more scattered than in Figure 7.1. Nevertheless, you may notice a
weak trend diagonally downward from the upper left toward the lower right. This trend
represents a weak correlation in which diamonds with more yellow color (higher
numbers for color) are less expensive. This trend is consistent with what we would
expect, because colorless diamonds appear to sparkle more and are generally considered
more desirable.
Figure 7.2 Scatterplot for the color and price data
in Table 7.1.
TIME OUT TO THINK
Thanks to a large bonus at work, you have a budget of $6,000 for a diamond ring. A
dealer offers you the following two choices for that price. One diamond weighs 1.20
carats and has color = 4. The other weighs 1.18 carats and has color = 3. Assuming all
other characteristics of the diamonds are equal, which would you choose? Why?
Types of Correlation
We have seen two examples of correlation. Figure 7.1 shows a fairly strong correlation
between weight and price, while Figure 7.2 shows a weak correlation between color
and price. We are now ready to generalize about types of correlation. Figure 7.3 shows
eight scatterplots for variables called x and y. Note the following key features of these
diagrams:
• Parts a to c show positive correlations: The values of y tend to increase with
increasing values of x. The correlation becomes stronger as we proceed from a to c. In
fact, c shows a perfect positive correlation, in which all the points fall along a straight
line.
• Parts d to f show negative correlations: The values of y tend to decrease with
increasing values of x. The negative correlation becomes stronger as we proceed from
d to f. In fact, f shows a perfect negative correlation, in which all the points fall along
a straight line.
• Part g shows no correlation between x and y: Values of x do not appear to be linked
to values of y in any way.
• Part h shows a nonlinear relationship: x and y appear to be related but the
relationship does not correspond to a straight line. (Linear means along a straight
line, and nonlinear means not along a straight line.)
Figure 7.3 Types of correlation seen on
scatterplots.
Types of Correlation
Positive correlation: Both variables tend to increase (or decrease) together.
Negative correlation: The two variables tend to change in opposite directions, with
one increasing while the other decreases.
No correlation: There is no apparent (linear) relationship between the two variables.
Nonlinear relationship: The two variables are related, but the relationship results in
a scatterplot that does not follow a straight-line pattern.
TECHNICAL NOTE
In this text we use the term correlation only for linear relationships. Some statisticians
refer to nonlinear relationships as “nonlinear correlations.” There are techniques for
working with nonlinear relationships that are similar to those described in this text for
linear relationships.
EXAMPLE
Life Expectancy and Infant Mortality
Figure 7.4 shows a scatterplot for the variables life expectancy and infant mortality in
16 countries. What type of correlation does it show? Does this correlation make sense?
Does it imply causality? Explain.
Figure 7.4 Scatterplot for life expectancy and
infant mortality data.
Source: United Nations.
SOLUTION The diagram shows a moderate negative correlation in which countries
with lower infant mortality tend to have higher life expectancy. It is
a negative correlation because the two variables vary in opposite directions. The
correlation makes sense because we would expect that countries with better health care
would have both lower infant mortality and higher life expectancy. However, it
does not imply causality between infant mortality and life expectancy: We would not
expect that a concerted effort to reduce infant mortality would increase life expectancy
significantly unless it was part of an overall effort to improve health care. (Reducing
infant mortality will slightly increase life expectancy because having fewer infant deaths
tends to raise the mean age of death for the population.)
Measuring the Strength of a Correlation
For most purposes, it is enough to state whether a correlation is strong, weak, or
nonexistent. However, sometimes it is useful to describe the strength of a correlation in
more precise terms. Statisticians measure the strength of a correlation with a number
called the correlation coefficient, represented by the letter r. The correlation
coefficient is easy to calculate in principle (see the optional section on p. 243), but the
actual work is tedious unless you use a calculator or computer.
We can explore the interpretation of correlation coefficients by studying Figure 7.3,
which shows the value of the correlation coefficient r for each scatterplot. Notice that
the correlation coefficient is always between –1 and 1. When points in a scatterplot lie
close to an ascending straight line, the correlation coefficient is positive and close to 1.
When all the points lie close to a descending straight line, the correlation coefficient is
negative with a value close to –1. Points that do not fit any type of straight-line pattern
or that lie close to a horizontal straight line (indicating that the y values have no
dependence on the x values) result in a correlation coefficient close to 0.
Properties of the Correlation Coefficient, r
• The correlation coefficient, r, is a measure of the strength of a correlation. Its value
can range only from –1 to 1.
• If there is no correlation, the points do not follow any ascending or descending
straight-line pattern, and the value of r is close to 0.
• If there is a positive correlation, the correlation coefficient is positive (0 < r ≤ 1):
Both variables increase together. A perfect positive correlation (in which all the
points on a scatterplot lie on an ascending straight line) has a correlation
coefficient r = 1. Values of r close to 1 indicate a strong positive correlation and
positive values closer to 0 indicate a weak positive correlation.
• If there is a negative correlation, the correlation coefficient is negative (–1 ≤ r < 0):
When one variable increases, the other decreases. A perfect negative correlation (in
which all the points lie on a descending straight line) has a correlation coefficient r =
–1. Values of r close to –1 indicate a strong negative correlation and negative values
closer to 0 indicate a weak negative correlation.
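These properties can be checked numerically. The sketch below, a minimal Python illustration of our own, computes r as an average of products of standard scores (the approach developed in the optional section later in this chapter) and verifies the boxed properties on three small made-up data sets:

```python
from statistics import mean, stdev

def correlation_r(x, y):
    """r as the sum of products of standard scores, divided by n - 1."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
print(correlation_r(x, [5, 8, 11, 14, 17]))  # ascending straight line: r is 1 (up to rounding)
print(correlation_r(x, [10, 8, 6, 4, 2]))    # descending straight line: r is -1 (up to rounding)
print(correlation_r(x, [2, 1, 3, 1, 2]))     # no straight-line pattern: r is near 0
```

Note that no matter what numbers you feed in, the result always lands between –1 and 1, as the properties box states.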
TECHNICAL NOTE
For the methods of this section, there is a requirement that the two variables result in
data having a “bivariate normal distribution.” This basically means that for any fixed
value of one variable, the corresponding values of the other variable have a normal
distribution. This requirement is usually very difficult to check, so the check is often
reduced to verifying that both variables result in data that are normally distributed.
EXAMPLE
U.S. Farm Size
Figure 7.5 shows a scatterplot for the variables number of farms and mean farm
size in the United States. Each dot represents data from a single year between 1950 and
2000; on this diagram, the earlier years generally are on the right and the later years on
the left. Estimate the correlation coefficient by comparing this diagram to those
in Figure 7.3 and discuss the underlying reasons for the correlation.
Figure 7.5 Scatterplot for farm size data.
Source: U.S. Department of Agriculture.
SOLUTION The scatterplot shows a strong negative correlation that most closely
resembles the scatterplot in Figure 7.3f, suggesting a correlation coefficient around r =
–0.9. The correlation shows that when there were fewer farms, they tended to have a
larger mean size, and when there were more farms, they tended to have a smaller mean
size. This trend reflects a basic change in the nature of farming: Prior to 1950, most
farms were small family farms. Over time, these small farms were replaced by large
farms owned by agribusiness corporations.
BY THE WAY
In 1900, more than 40% of the U.S. population worked on farms; by 2000, less than 2%
of the population worked on farms.
EXAMPLE
Accuracy of Weather Forecasts
The scatterplots in Figure 7.6 show two weeks of data comparing the actual high
temperature for the day with the same-day forecast (part a) and the three-day forecast
(part b). Estimate the correlation coefficient for each data set and discuss what these
coefficients imply about weather forecasts.
Figure 7.6 Comparison of actual high
temperatures with (a) same-day and (b) three-day
forecasts.
SOLUTION If every forecast were perfect, each actual temperature would equal the
corresponding forecasted temperature. This would result in all points lying on a straight
line and a correlation coefficient of r = 1. In Figure 7.6a, in which the forecasts were
made at the beginning of the same day, the points lie fairly close to a straight line,
meaning that same-day forecasts are closely related to actual temperatures. By
comparing this scatterplot to the diagrams in Figure 7.3, we can reasonably estimate
this correlation coefficient to be about r = 0.8. The correlation is weaker in Figure
7.6b, indicating that forecasts made three days in advance aren’t as close to actual
temperatures as same-day forecasts. This correlation coefficient is about r = 0.6. These
results are unsurprising because we expect longer-term forecasts to be less accurate.
TIME OUT TO THINK
For further practice, visually estimate the correlation coefficients for the data for
diamond weight and price (Figure 7.1) and diamond color and price (Figure 7.2).
Calculating the Correlation Coefficient (Optional
Section)
The formula for the (linear) correlation coefficient r can be expressed in several
different ways that are all algebraically equivalent, which means that they produce the
same value. The following expression has the advantage of relating more directly to the
underlying rationale for r:

r = Σ{[(x − x̄)/sx] × [(y − ȳ)/sy]} / (n − 1)
USING TECHNOLOGY—SCATTERPLOTS AND
CORRELATION COEFFICIENTS
EXCEL The screen shot below shows the process for making a scatterplot like that
in Figure 7.1:
1. Enter the data, which are shown in Columns B (weight) and C (price).
2. Select the columns for the two variables on the scatterplot; in this case, Columns B
and C.
3. Choose “XY Scatter” as the chart type, with no connecting lines. You can then use
the “chart options” (which comes up with a right-click in the graph) to customize the
design, axis range, labels, and more.
4. To calculate the correlation coefficient, shown in row 26, use the built-in function
CORREL.
5. [Optional] The straight line on the graph, called a best-fit line, is added by
choosing the option to “Add Trendline”; be sure to choose the “linear” option for the
trendline. You’ll also find options that add the two items shown in the upper left of
the graph: the equation of the line and the value R2, which is the square of the
correlation coefficient. Best-fit lines and R2 are discussed in Section 7.3.
Microsoft Excel 2008 for Mac.
STATDISK Enter the paired data in columns of the STATDISK Data Window.
Select Analysis from the main menu bar, then select the option Correlation and
Regression. Select the columns of data to be used, then click on the Evaluate button.
The STATDISK display will include the value of the linear correlation coefficient r and
other results. A scatterplot can also be obtained by clicking on the PLOT button.
TI-83/84 Plus Enter the paired data in lists L1 and L2, then press STAT and
select TESTS. Using the option of LinRegTTest will result in several displayed values,
including the value of the linear correlation coefficient r.
To obtain a scatterplot, press 2nd, then Y= (for STAT PLOT). Press ENTER to turn
Plot 1 on, then select the first graph type, which resembles a scatterplot. Set the X list
and Y list labels to L1 and L2 and press ZOOM, then select ZoomStat and press
ENTER.
In the above expression, division by n − 1 (where n is the number of pairs of data) shows
that r is a type of average, so it does not increase simply because more pairs of data
values are included. The symbol sx denotes the standard deviation of the x values (or the
values of the first variable), and sy denotes the standard deviation of the y values. The
expression (x − x̄)/sx is in the same format as the standard score introduced in Section
5.2. By using the standard scores for x and y, we ensure that the value of r does not
change simply because a different scale of values is used. The key to understanding the
change simply because a different scale of values is used. The key to understanding the
rationale for r is to focus on the product of the standard scores for x and the standard
scores for y. Those products tend to be positive when there is a positive correlation, and
they tend to be negative when there is a negative correlation. For data with no
correlation, some of the products are positive and some are negative, with the net effect
that the sum is relatively close to 0.
The following alternative formula for r has the advantage of simplifying calculations, so
it is often used whenever manual calculations are necessary. The following formula is
also easy to program into statistical software or calculators:

r = [nΣxy − (Σx)(Σy)] / {√[n(Σx²) − (Σx)²] × √[n(Σy²) − (Σy)²]}
This formula is straightforward to use, at least in principle: First calculate each of the
required sums, then substitute the values into the formula. Be sure to note that (Σx2) and
(Σx)2 are not equal: (Σx2) tells you to first square all the values of the variable x and then
add them; (Σx)2 tells you to add the x values first and then square this sum. In other
words, perform the operation within the parentheses first. Similarly, (Σy2) and (Σy)2 are
not the same.
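Because the two formulas are algebraically equivalent, they must agree on any data set, and this can be checked numerically. The following Python sketch (our own, not part of the text) implements both versions and confirms that they give the same value for the weight and price data of Table 7.1:

```python
from math import sqrt
from statistics import mean, stdev

def r_standard_scores(x, y):
    """First formula: sum of products of standard scores, divided by n - 1."""
    n, mx, my = len(x), mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

def r_computational(x, y):
    """Alternative formula built from the sums n, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    Sxy = sum(xi * yi for xi, yi in zip(x, y))
    Sxx = sum(xi * xi for xi in x)   # (Σx²): square each value first, then add
    Syy = sum(yi * yi for yi in y)
    return (n * Sxy - Sx * Sy) / (sqrt(n * Sxx - Sx ** 2) * sqrt(n * Syy - Sy ** 2))

# Weights (carats) and prices (dollars) of all 23 diamonds in Table 7.1.
weights = [1.00, 1.00, 1.01, 1.01, 1.02, 1.04, 1.04, 1.07, 1.07, 1.11, 1.12,
           1.16, 1.20, 1.23, 1.29, 1.50, 1.51, 1.67, 1.72, 1.76, 1.80, 1.88, 2.03]
prices = [6958, 5885, 6333, 4299, 9589, 6921, 4426, 6885, 5826, 3670, 7176,
          7497, 5170, 5547, 7521, 7260, 8139, 12196, 14998, 9736, 9859, 12398, 11008]

r1 = r_standard_scores(weights, prices)
r2 = r_computational(weights, prices)
print(r1)                      # a strong positive correlation, as Figure 7.1 suggests
assert abs(r1 - r2) < 1e-9     # the two expressions agree, up to rounding
```

The computational version touches each data pair only once and needs no standard deviations, which is why it is preferred for hand calculation.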
Section 7.1 Exercises
Statistical Literacy and Critical Thinking
1.
Correlation. In the context of correlation, what does r measure, and what is it
called?
2.
Scatterplot. What is a scatterplot, and how does it help us investigate
correlation?
3.
Correlation. After computing the correlation coefficient r from 50 pairs of data,
you find that r = 0. Does it follow that there is no relationship between the two
variables? Why or why not?
4.
Scatterplot. One set of paired data results in r = 1 and a second set of paired
data results in r = –1. How do the corresponding scatterplots differ?
Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense
(or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of
these statements have definitive answers, so your explanation is more important than
your chosen answer.
5.
Births. A study showed that for one town, as the stork population increased, the
number of births in the town also increased. It therefore follows that the increase
in the stork population caused the number of births to increase.
6.
Positive Effect. An engineer for a car company finds that by reducing the
weights of various cars, mileage (mi/gal) increases. Because this is a positive
result, we say that there is a positive correlation.
7.
Correlation. Two studies both found a correlation between low birth weight
and weakened immune systems. The second study had a much larger sample
size, so the correlation it found must be stronger.
8.
Interpreting r. In investigating correlations between many different pairs of
variables, in each case the correlation coefficient r must fall between –1 and 1.
Concepts and Applications
Types of Correlation. Exercises 9–16 list pairs of variables. For each pair, state
whether you believe the two variables are correlated. If you believe they are correlated,
state whether the correlation is positive or negative. Explain your reasoning.
9.
Weight/Cost. The weights and costs of 50 different bags of apples
10.
IQ/Hat Size. The IQ scores and hat sizes of randomly selected adults
11.
Weight/Fuel Efficiency. The total weights of airliners flying from New York to
San Francisco and the fuel efficiency as measured in miles per gallon
12.
Weight/Fuel Consumption. The total weights of airliners flying from New
York to San Francisco and the total amounts of fuel that they consume
13.
Points and DJIA. The total number of points scored in Super Bowl football
games and the changes in the Dow Jones Industrial stock index in the years
following those games
14.
Altitude/Temperature. The outside air temperature and the altitude of
aircraft
15.
Height/SAT Score. The heights and SAT scores of randomly selected subjects
who take the SAT
16.
Golf Score/Prize Money. Golf scores and prize money won by professional
golfers
17.
Crickets and Temperature. One classic application of correlation involves the
association between the temperature and the number of times a cricket chirps in
a minute. The scatterplot in Figure 7.7 shows the relationship for eight different
pairs of temperature/chirps data. Estimate the correlation coefficient and
determine whether there appears to be a correlation between the temperature
and the number of times a cricket chirps in a minute.
Figure 7.7 Scatterplot for cricket chirps and
temperature.
Source: Based on data from The Song of Insects by George W. Pierce, Harvard
University Press.
18.
Two-Day Forecast. Figure 7.8 shows a scatterplot in which the actual high
temperature for the day is compared with a forecast made two days in advance.
Estimate the correlation coefficient and discuss what these data imply about
weather forecasts. Do you think you would get similar results if you made similar
diagrams for other two-week periods? Why or why not?
Figure 7.8
19.
Safe Speeds? Consider the following table showing speed limits and death rates
from automobile accidents in selected countries.
Country         Death rate (per 100 million vehicle-miles)   Speed limit (miles per hour)
Norway          3.0                                          55
United States   3.3                                          55
Finland         3.4                                          55
Britain         3.5                                          70
Denmark         4.1                                          55
Canada          4.3                                          60
Japan           4.7                                          55
Australia       4.9                                          65
Netherlands     5.1                                          60
Italy           6.1                                          75
Source: D. J. Rivkin, New York Times.
a. Construct a scatterplot of the data.
b. Briefly characterize the correlation in words (for example, strong positive
correlation, weak negative correlation) and estimate the correlation coefficient of the
data. (Or calculate the correlation coefficient exactly with the aid of a calculator or
software.)
c. In the newspaper, these data were presented in an article titled “Fifty-five mph
speed limit is no safety guarantee.” Based on the data, do you agree with this claim?
Explain.
20.
Population Growth. Consider the following table showing percentage change
in population and birth rate (per 1,000 of population) for 10 states over a period
of 10 years.
State           Percentage change in population   Birth rate
Nevada          50.1%                             16.3
California      25.7%                             16.9
New Hampshire   20.5%                             12.5
Utah            17.9%                             21.0
Colorado        14.0%                             14.6
Minnesota       7.3%                              13.7
Montana         1.6%                              12.3
Illinois        0%                                15.5
Iowa            –4.7%                             13.0
West Virginia   –8.0%                             11.4
Source: U.S. Census Bureau and Department of Health and Human Services.
a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation
coefficient.
c. Overall, does birth rate appear to be a good predictor of a state’s population growth
rate? If not, what other factor(s) may be affecting the growth rate?
21.
Brain Size and Intelligence. The table below lists brain sizes (in cm³) and
Wechsler IQ scores of subjects (based on data from “Brain Size, Head Size, and
Intelligence Quotient in Monozygotic Twins,” by Tramo et al., Neurology, Vol. 50,
No. 5). Is there sufficient evidence to conclude that there is a linear correlation
between brain size and IQ score? Does it appear that people with larger brains
are more intelligent?
Brain Size   IQ
965          90
1,029        85
1,030        86
1,285        102
1,049        103
1,077        97
1,037        124
1,068        125
1,176        102
1,105        114
a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation
coefficient.
c. Do these data suggest that people with larger brains are more intelligent? Explain.
22.
Movie Data. Consider the following table showing total box office receipts and
total attendance for all American films.
Year   Total Gross Receipts (billions of dollars)   Tickets Sold (billions)
2001   8.4                                          1.49
2002   9.2                                          1.58
2003   9.2                                          1.53
2004   9.4                                          1.51
2005   8.8                                          1.38
2006   9.2                                          1.41
2007   9.7                                          1.40
2008   9.6                                          1.34
2009   10.6                                         1.41
2010   10.6                                         1.34
Source: Motion Picture Association of America.
a. Construct a scatterplot of the data.
b. Briefly characterize the correlation in words and estimate the correlation
coefficient.
23.
TV Time. Consider the following table showing the average hours of television
watched in households in five categories of annual income.
Household income      Weekly TV hours
Less than $30,000     56.3
$30,000 – $40,000     51.0
$40,000 – $50,000     50.5
$50,000 – $60,000     49.7
More than $60,000     48.7
Source: Nielsen Media Research.
a. Construct a scatterplot for the data. To locate the dots, use the midpoint of each
income category. Use a value of $25,000 for the category “less than $30,000,” and
use $70,000 for “more than $60,000.”
b. Briefly characterize the correlation in words and estimate the correlation
coefficient.
c. Suggest a reason why families with higher incomes watch less TV. Do you think
these data imply that you can increase your income simply by watching less TV?
Explain.
24.
January Weather. Consider the following table showing January mean
monthly precipitation and mean daily high temperature for ten Northern
Hemisphere cities (National Oceanic and Atmospheric Administration).
City         Mean daily high temperature for January (°F)   Mean January precipitation (inches)
Athens       54                                             2.2
Bombay       88                                             0.1
Copenhagen   36                                             1.6
Jerusalem    55                                             5.1
London       44                                             2.0
Montreal     21                                             3.8
Oslo         30                                             1.7
Rome         54                                             3.3
Tokyo        47                                             1.9
Vienna       34                                             1.5
Source: The New York Times Almanac.
a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation
coefficient.
c. Can you draw any general conclusions about January temperatures and
precipitation from these data? Explain.
25.
Retail Sales. Consider the following table showing one year’s total sales
(revenue) and profits for eight large retailers in the United States.
Company      Total sales (billions of dollars)   Profits (billions of dollars)
Wal-Mart     315.6                               11.2
Kroger       60.6                                0.98
Home Depot   81.5                                5.8
Costco       60.1                                1.1
Target       52.6                                2.4
Starbucks    7.8                                 0.6
The Gap      16.0                                1.1
Best Buy     30.8                                1.1
Source: Fortune.com.
a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation
coefficient.
c. Discuss your observations. Does higher sales volume necessarily translate into
greater earnings? Why or why not?
26.
Calories and Infant Mortality. Consider the following table showing mean
daily caloric intake (all residents) and infant mortality rate (per 1,000 births) for
10 countries.
Country         Mean daily calories   Infant mortality rate (per 1,000 births)
Afghanistan     1,523                 154
Austria         3,495                 6
Burundi         1,941                 114
Colombia        2,678                 24
Ethiopia        1,610                 107
Germany         3,443                 6
Liberia         1,640                 153
New Zealand     3,362                 7
Turkey          3,429                 44
United States   3,671                 7
a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c. Discuss any patterns you observe and any general conclusions that you can reach.
Properties of the Correlation Coefficient. For Exercises 27 and 28, determine
whether the given property is true, and explain your answer.
27.
Interchanging Variables. The correlation coefficient remains unchanged if
we interchange the variables x and y.
28.
Changing Units of Measurement. The correlation coefficient remains
unchanged if we change the units used to measure x, y, or both.
PROJECTS FOR
THE INTERNET & BEYOND
29.
Unemployment and Inflation. Use the Bureau of Labor Statistics Web page
to find monthly unemployment rates and inflation rates over the past year.
Construct a scatterplot for the data. Do you see any trends?
30.
Success in the NFL. Find last season’s NFL team statistics. Construct a table
showing the following for each team: number of wins, average yards gained on
offense per game, and average yards allowed on defense per game. Make
scatterplots to explore the correlations between offense and wins and between
defense and wins. Discuss your findings. Do you think that there are other team
statistics that would yield stronger correlations with the number of wins?
31.
Statistical Abstract. Explore the “frequently requested tables” at the Web site
for the Statistical Abstract of the United States. Choose data that are of interest
to you and explore at least two correlations. Briefly discuss what you learn from
the correlations.
32.
Height and Arm Span. Select a sample of at least eight people and measure
each person’s height and arm span. (When you measure arm span, the person
should stand with arms extended like the wings on an airplane.) Using the paired
sample data, construct a scatterplot and estimate or calculate the value of the
correlation coefficient. What do you conclude?
33.
Height and Pulse Rate. Select a sample of at least eight people and record
each person’s pulse rate by counting the number of heartbeats in 1 minute. Also
record each person’s height. Using the paired sample data, construct a scatterplot
and estimate or calculate the value of the correlation coefficient. What do you
conclude?
IN THE NEWS
34.
Correlations in the News. Find a recent news report that discusses some type
of correlation. Describe the correlation. Does the article give any sense of the
strength of the correlation? Does it suggest that the correlation reflects any
underlying causality? Briefly discuss whether you believe the implications the
article makes with respect to the correlation.
35.
Your Own Positive Correlations. Give examples of two variables that you
expect to be positively correlated. Explain why the variables are correlated and
why the correlation is (or is not) important.
36.
Your Own Negative Correlations. Give examples of two variables that you
expect to be negatively correlated. Explain why the variables are correlated and
why the correlation is (or is not) important.
7.2 INTERPRETING CORRELATIONS
Statistics show that of those who contract the habit of eating, very few survive. —
Wallace Irwin
Researchers sifting through statistical data are constantly looking for meaningful
correlations, and the discovery of a new and surprising correlation often leads to a flood
of news reports. You may recall hearing about some of these discovered correlations:
dark chocolate consumption correlated with reduced risk of heart disease; musical
talent correlated with good grades in mathematics; or eating less correlated with
increased longevity. Unfortunately, the task of interpreting such correlations is far more
difficult than discovering them in the first place. Long after the news reports have faded,
we may still be unsure of whether the correlations are significant and, if so, whether
they tell us anything of practical importance. In this section, we discuss some of the
common difficulties associated with interpreting correlations.
Beware of Outliers
Examine the scatterplot in Figure 7.9. Your eye probably tells you that there is a
positive correlation in which larger values of x tend to mean larger values of y. Indeed, if
you calculate the correlation coefficient for these data, you’ll find that it is a relatively
high r = 0.880, suggesting a very strong correlation.
Figure 7.9 How does the outlier affect the
correlation?
However, if you place your thumb over the data point in the upper right corner
of Figure 7.9, the apparent correlation disappears. In fact, without this data point, the
correlation coefficient is zero! In other words, removing this one data point changes the
correlation coefficient from r = 0.880 to r = 0.
This example shows that correlations can be very sensitive to outliers. Recall that
an outlier is a data value that is extreme compared to most other values in a data set
(see Section 4.1). We must therefore examine outliers and their effects carefully before
interpreting a correlation. On the one hand, if the outliers are mistakes in the data set,
they can produce apparent correlations that are not real or mask the presence of real
correlations. On the other hand, if the outliers represent real and correct data points,
they may be telling us about relationships that would otherwise be difficult to see.
Note that while we should examine outliers carefully, we should not remove them unless
we have strong reason to believe that they do not belong in the data set. Even in that
case, good research principles demand that we report the outliers along with an
explanation of why we thought it legitimate to remove them.
EXAMPLE 1 Masked Correlation
You’ve conducted a study to determine how the number of calories a person consumes
in a day correlates with time spent in vigorous bicycling. Your sample consisted of ten
women cyclists, all of approximately the same height and weight. Over a period of two
weeks, you asked each woman to record the amount of time she spent cycling each day
and what she ate on each of those days. You used the eating records to calculate the
calories consumed each day. Figure 7.10 shows a scatterplot with each woman’s mean
time spent cycling on the horizontal axis and mean caloric intake on the vertical axis. Do
higher cycling times correspond to higher intake of calories?
Figure 7.10 Data from the cycling study.
SOLUTION If you look at the data as a whole, your eye will probably tell you that there
is a positive correlation in which greater cycling time tends to go with higher caloric
intake. But the correlation is very weak, with a correlation coefficient of r = 0.374.
However, notice that two points are outliers: one representing a cyclist who cycled about
a half-hour per day and consumed more than 3,000 calories, and the other representing
a cyclist who cycled more than 2 hours per day on only 1,200 calories. It’s difficult to
explain the two outliers, given that all the women in the sample have similar heights and
weights. We might therefore suspect that these two women either recorded their data
incorrectly or were not following their usual habits during the two-week study. If we can
confirm this suspicion, then we would have reason to delete the two data points as
invalid. Figure 7.11 shows that the correlation is quite strong without those two outlier
points, and suggests that the number of calories consumed rises by a little more than
500 calories for each hour of cycling. Of course, we should not remove the outliers
without confirming our suspicion that they were invalid data points, and we should
report our reasons for leaving them out.
Figure 7.11 The data from Figure 7.10 without the
two outliers.
Beware of Inappropriate Grouping
Correlations can also be misinterpreted when data are grouped inappropriately. In some
cases, grouping data hides correlations. Consider a (hypothetical) study in which
researchers seek a correlation between hours of TV watched per week and high school
grade point average (GPA). They collect the 21 data pairs in Table 7.2.
The scatterplot (Figure 7.12) shows virtually no correlation; the correlation coefficient
for the data is about r = –0.063. The lack of correlation seems to suggest that TV
viewing habits are unrelated to academic achievement. However, one astute researcher
realizes that some of the students watched mostly educational programs, while others
tended to watch comedies, dramas, and movies. She therefore divides the data set into
two groups, one for the students who watched mostly educational television and one for
the other students. Table 7.3 shows her results with the students divided into these two
groups.
Figure 7.12 The full set of data concerning hours of
TV and GPA shows virtually no correlation.
TABLE 7.2 Hours of TV and High School GPA (hypothetical data)

Hours per week of TV   GPA
2                      3.2
4                      3.0
4                      3.1
5                      2.5
5                      2.9
5                      3.0
6                      2.5
7                      2.7
7                      2.8
8                      2.7
9                      2.5
9                      2.9
10                     3.4
12                     3.6
12                     2.5
14                     3.5
14                     2.3
15                     3.7
16                     2.0
20                     3.6
20                     1.9
Now we find two very strong correlations (Figure 7.13): a strong positive correlation
for the students who watched educational programs (r = 0.855) and a strong negative
correlation for the other students (r = –0.951). The moral of this story is that the
original data set hid an important (hypothetical) correlation between TV and GPA:
Watching educational TV correlated positively with GPA and watching non-educational
TV correlated negatively with GPA. Only when the data were grouped appropriately
could this discovery be made.
TABLE 7.3 Hours of TV and High School GPA—Grouped Data (hypothetical data)

Group 1: watched educational programs
Hours per week of TV   GPA
5                      2.5
7                      2.8
8                      2.7
9                      2.9
10                     3.4
12                     3.6
14                     3.5
15                     3.7
20                     3.6

Group 2: watched regular TV
Hours per week of TV   GPA
2                      3.2
4                      3.0
4                      3.1
5                      2.9
5                      3.0
6                      2.5
7                      2.7
9                      2.5
12                     2.5
14                     2.3
16                     2.0
20                     1.9

BY THE WAY
Children ages 2–5 watch an average of 26 hours of television per week, while children ages 6–11 watch an average of 20 hours of television per week (Nielsen Media Research). Adult viewership averages more than 25 hours per week. If the average adult replaced television time with a job paying just $8 per hour, his or her annual income would rise by more than $10,000.
Figure 7.13 These scatterplots show the same data
as Figure 7.12, separated into the two groups
identified in Table 7.3.
In other cases, a data set may show a stronger correlation than actually exists among
subgroups. Consider the (hypothetical) data in Table 7.4, showing the relationship
between the weights and prices of selected cars. Figure 7.14 shows the scatterplot.
The data set as a whole shows a strong correlation; the correlation coefficient is r =
0.949. However, on closer examination, we see that the data fall into two rather distinct
categories corresponding to light and heavy cars. If we analyze these subgroups
separately, neither shows any correlation: The light cars alone (top six in Table 7.4)
have a correlation coefficient r = 0.019 and the heavy cars alone (bottom six in Table
7.4) have a correlation coefficient r = –0.022. You can see the problem by looking
at Figure 7.14. The apparent correlation of the full data set occurs because of the
separation between the two clusters of points; there’s no correlation within either
cluster.
TABLE 7.4 Car Weights and Prices (hypothetical data)

Weight (pounds)   Price (dollars)
1,500             9,500
1,600             8,000
1,700             8,200
1,750             9,500
1,800             9,200
1,800             8,700
3,000             29,000
3,500             25,000
3,700             27,000
4,000             31,000
3,600             25,000
3,200             30,000
Figure 7.14 Scatterplot for the car weight and price
data in Table 7.4.
TIME OUT TO THINK
Suppose you were shopping for a compact car. If you looked at only the overall data and
correlation coefficient from Figure 7.14, would it be reasonable to consider weight as
an important factor in price? What if you looked at the data for light and heavy cars
separately? Explain.
CASE STUDY Fishing for Correlations
Oxford physician Richard Peto submitted a paper to the British medical
journal Lancet showing that heart-attack victims had a better chance of survival if they
were given aspirin within a few hours after their heart attacks. The editors
of Lancet asked Peto to break down the data into subsets, to see whether the benefits of
the aspirin were different for different groups of patients. For example, was aspirin
more effective for patients of a certain age or for patients with certain dietary habits?
Breaking the data into subsets can reveal important facts, such as whether men and
women respond to the treatment differently. However, Peto felt that the editors were
asking him to divide his sample into too many subgroups. He therefore objected to the
request, arguing that it would result in purely coincidental correlations. Writing about
this story in the Washington Post, journalist Rick Weiss said, “When the editors
insisted, Peto capitulated, but among other things he divided his patients by zodiac birth
signs and demanded that his findings be included in the published paper. Today, like a
warning sign to the statistically uninitiated, the wacky numbers are there for all to see:
Aspirin is useless for Gemini and Libra heart-attack victims but is a lifesaver for people
born under any other sign.”
The moral of this story is that a “fishing expedition” for correlations can often produce
them. That doesn’t make the correlations meaningful, even though they may appear
significant by standard statistical measures.
Correlation Does Not Imply Causality
Perhaps the most important caution about interpreting correlations is one we’ve already
mentioned: Correlation does not necessarily imply causality. In general,
correlations can appear for any of the following three reasons.
Possible Explanations for a Correlation
1. The correlation may be a coincidence.
2. Both correlation variables might be directly influenced by some common
underlying cause.
3. One of the correlated variables may actually be a cause of the other. But note that,
even in this case, it may be just one of several causes.
For example, the correlation between infant mortality and life expectancy in Figure
7.4 is a case of common underlying cause: Both variables respond to the underlying
variable quality of health care. The correlation between smoking and lung cancer
reflects the fact that smoking causes lung cancer (see the discussion in Section 7.4).
Coincidental correlations are also quite common; Example 2 below discusses one such
case.
Caution about causality is particularly important in light of the fact that many statistical
studies are designed to look for causes. Because these studies generally begin with the
search for correlations, it’s tempting to think that the work is over as soon as a
correlation is found. However, as we will discuss in Section 7.4, establishing causality
can be very difficult.
EXAMPLE 2 How to Get Rich in the Stock Market (Maybe)
Every financial advisor has a strategy for predicting the direction of the stock market.
Most focus on fundamental economic data, such as interest rates and corporate profits.
But an alternative strategy might rely on a famous correlation between the Super Bowl
winner in January and the direction of the stock market for the rest of the year: The
stock market tends to rise when a team from the old, pre-1970 NFL wins the Super Bowl
and tends to fall when the winner is not from the old NFL. This correlation successfully
matched 28 of the first 32 Super Bowls to the stock market, which made the “Super
Bowl Indicator” a far more reliable predictor of the stock market than any professional
stock broker during the same period. In fact, detailed calculations show that the
probability of such success by pure chance is less than 1 in 100,000. Should you
therefore make a decision about whether to invest in the stock market based on the NFL
origins of the most recent Super Bowl winner?
SOLUTION The extremely strong correlation might make it seem like a good idea to
base your investments on the Super Bowl Indicator, but sometimes you need to apply a
bit of common sense. No matter how strong the correlation might be, it seems
inconceivable to imagine that the origin of the winning team actually causes the stock
market to move in a particular direction. The correlation is undoubtedly a coincidence,
and the fact that its probability of occurring by pure chance was less than 1 in 100,000 is
just another illustration of the fact that you can turn up surprising correlations if you go
fishing for them. This fact was borne out in more recent Super Bowls: Following Super
Bowl 32, the indicator successfully predicted the stock market direction in only 5 of the
next 10 years—exactly the fraction that would be expected by pure chance.
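The "less than 1 in 100,000" figure is a straightforward binomial calculation: if each year's match between the Super Bowl Indicator and the market were a fair coin flip, the probability of matching at least 28 of 32 years is the number of ways to get 28 or more heads out of 32 flips, divided by the 2^32 equally likely outcomes.

```python
from math import comb

# Favorable outcomes: 28, 29, 30, 31, or 32 matches out of 32 "coin flips"
favorable = sum(comb(32, k) for k in range(28, 33))
p = favorable / 2**32

print(favorable)        # 41449
print(p)                # about 9.65e-06, i.e., less than 1 in 100,000
```

This is exactly the sense in which a fishing expedition can turn up a "1 in 100,000" coincidence: with thousands of candidate indicators available, a few such long shots are bound to hit.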
CASE STUDY Oat Bran and Heart Disease
If you buy a product that contains oat bran, there’s a good chance that the label will tout
the healthful effects of eating oats. Indeed, several studies have found correlations in
which people who eat more oat bran tend to have lower rates of heart disease. But does
this mean that everyone should eat more oats?
Not necessarily. Just because oat bran consumption is correlated with reduced risk of
heart disease does not mean that it causes reduced risk of heart disease. In fact, the
question of causality is quite controversial in this case. Other studies suggest that people
who eat a lot of oat bran tend to have generally healthful diets. Thus, the correlation
between oat bran consumption and reduced risk of heart disease may be a case of a
common underlying cause: Having a healthy diet leads people both to consume more
oat bran and to have a lower risk of heart disease. In that case, for some people, adding
oat bran to their diets might be a bad idea because it could cause them to gain weight,
and weight gain is associated with increased risk of heart disease.
This example shows the importance of using caution when considering issues of
correlation and causality. It may be a long time before medical researchers know for
sure whether adding oat bran to your diet actually causes a reduced risk of heart disease.
Useful Interpretations of Correlation
In discussing uses of correlation that might lead to wrong interpretations, we have
described the effects of outliers, inappropriate groupings, fishing for correlations, and
incorrectly concluding that correlation implies causality. But there are many correct and
useful interpretations of correlation, some of which we have already studied. So while
you should be cautious in interpreting correlations, they remain a valuable tool in any
field in which statistical research plays a role.
Section 7.2 Exercises
Statistical Literacy and Critical Thinking
1.
Correlation and Causality. In clinical trials of the drug Lisinopril, it is found
that increased dosages of the drug correlated with lower blood pressure levels.
Based on the correlation, can we conclude that Lisinopril treatments cause lower
blood pressure? Why or why not?
2.
SIDS. An article in the New York Times on infant deaths included a statement
that, based on the study results, putting infants to sleep in the supine position
decreased deaths due to SIDS (sudden infant death syndrome). What is wrong
with that statement?
3.
Outliers. When studying salaries paid to CEOs of large companies, it is found
that almost all of them range from a few hundred thousand dollars to several
million dollars, but one CEO is paid a salary of $1. Is that salary of $1 an outlier?
In general, how might outliers affect conclusions about correlation?
4.
Scatterplot. Does a scatterplot reveal anything about a cause and effect
relationship between two variables?
Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense
(or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of
these statements have definitive answers, so your explanation is more important than
your chosen answer.
5.
Scatterplot. A set of paired sample data results in a correlation coefficient
of r = 0, so the scatterplot will show that there is no pattern of the plotted points.
6.
Causation. If we have 20 pairs of sample data with a correlation coefficient of 1,
then we know that one of the two variables is definitely the cause of the other.
7.
Causation. If we conduct a study showing that there is a strong negative
correlation between resting pulse rate and amounts of time spent in rigorous
exercise, we can conclude that decreases in resting pulse rates are somehow
associated with increases in exercise.
8.
Causation. If we have two variables with one being the direct cause of the other,
then there may or may not be a correlation between those two variables.
Concepts and Applications
Correlation and Causality. Exercises 9–16 make statements about a correlation. In
each case, state the correlation clearly. (For example, we might state that “there is a
positive correlation between variable A and variable B.”) Then state whether the
correlation is most likely due to coincidence, a common underlying cause, or a direct
cause. Explain your answer.
9.
Guns and Crime Rate. In one state, the number of unregistered handguns
steadily increased over the past several years, and the crime rate increased as
well.
10.
Running and Weight. It has been found that people who exercise regularly by
running tend to weigh less than those who do not run, and those who run longer
distances tend to weigh less than those who run shorter distances.
11.
Study Time. Statistics students find that as they spend more time studying,
their test scores are higher.
12.
Vehicles and Waiting Time. It has been found that as the number of
registered vehicles increases, the time drivers spend sitting in traffic also
increases.
13.
Traffic Lights and Car Crashes. It has been found that as the number of
traffic lights increases, the number of car crashes also increases.
14.
Galaxies. Astronomers have discovered that, with the exception of a few nearby
galaxies, all galaxies in the universe are moving away from us. Moreover, the
farther the galaxy, the faster it is moving away. That is, the more distant a galaxy,
the greater the speed at which it is moving away from us.
15.
Gas and Driving. It has been found that as gas prices increase, the distances
vehicles are driven tend to get shorter.
16.
Melanoma and Latitude. Some studies have shown that, for certain ethnic
groups, the incidence of melanoma (the most dangerous form of skin cancer)
increases as latitude decreases.
17.
Outlier Effects. Consider the scatterplot in Figure 7.15.
Figure 7.15
a. Which point is an outlier? Ignoring the outlier, estimate or compute the correlation coefficient for the remaining points.
b. Now include the outlier. How does the outlier affect the correlation coefficient? Estimate or compute the correlation coefficient for the complete data set.
18.
Outlier Effects. Consider the scatterplot in Figure 7.16.
Figure 7.16
a. Which point is an outlier? Ignoring the outlier, estimate or compute the correlation coefficient for the remaining points.
b. Now include the outlier. How does the outlier affect the correlation coefficient? Estimate or compute the correlation coefficient for the complete data set.
19.
Grouped Shoe Data. The following table gives measurements of weight and
shoe size for 10 people (including both men and women).
a. Construct a scatterplot for the data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that shoe size and weight are correlated? Explain.

Weight (pounds)   Shoe size
105               6
112               4.5
115               6
123               5
135               6
155               10
165               11
170               9
180               10
190               12

b. You later learn that the first five data values in the table are for women and the next five are for men. How does this change your view of the correlation? Is it still reasonable to conclude that shoe size and weight are correlated?
20.
Grouped Temperature Data. The following table shows the average January
high temperature and the average July high temperature for 10 major cities
around the world.
City           January high   July high
Berlin         35             74
Geneva         39             77
Kabul          36             92
Montreal       21             78
Prague         34             74
Auckland       73             56
Buenos Aires   85             57
Sydney         78             60
Santiago       85             59
Melbourne      78             56

a. Construct a scatterplot for the data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that January and July temperatures are correlated for these cities? Explain.
b. Notice that the first five cities in the table are in the Northern Hemisphere and the next five are in the Southern Hemisphere. How does this change your view of the correlation? Would you now conclude that January and July temperatures are correlated for these cities? Explain.
21.
Birth and Death Rates. Figure 7.17 shows the birth and death rates for
different countries, measured in births and deaths per 1,000 population.
Figure 7.17Birth and death rates for different
countries.
Source: United Nations.
a. Estimate the correlation coefficient and discuss whether there is a strong correlation between the variables.
b. Notice that there appear to be two groups of data points within the full data set. Make a reasonable guess as to the makeup of these groups. In which group might you find a relatively wealthy country like Sweden? In which group might you find a relatively poor country like Uganda?
c. Assuming that your guess about groups in part b is correct, do there appear to be correlations within the groups? Explain. How could you confirm your guess about the groups?
22.
Reading and Test Scores. The following (hypothetical) data set gives the
number of hours 10 sixth-graders read per week and their performance on a
standardized verbal test (maximum of 100).
Reading time per week   Verbal test score
1                       50
1                       65
2                       56
3                       62
3                       65
4                       60
5                       75
6                       50
10                      88
12                      38

a. Construct a scatterplot for these data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that reading time and test scores are correlated? Explain.
b. Suppose you learn that five of the children read only comic books while the other five read regular books. Make a guess as to which data points fall in which group. How could you confirm your guess about the groups?
c. Assuming that your guess in part b is correct, how does it change your view of the correlation between reading time and test scores? Explain.
PROJECTS FOR
THE INTERNET & BEYOND
23.
Football-Stock Update. Find data for recent years concerning the Super Bowl
winner and the end-of-year change in the stock market (positive or negative). Do
recent results still agree with the correlation described in Example 2? Explain.
24.
Real Correlations.
a. Describe a real situation in which there is a positive correlation that is the result of coincidence.
b. Describe a real situation in which there is a positive correlation that is the result of a common underlying cause.
c. Describe a real situation in which there is a positive correlation that is the result of a direct cause.
d. Describe a real situation in which there is a negative correlation that is the result of coincidence.
e. Describe a real situation in which there is a negative correlation that is the result of a common underlying cause.
f. Describe a real situation in which there is a negative correlation that is the result of a direct cause.
IN THE NEWS
25.
Misinterpreted Correlations. Find a recent news report in which you believe
that a correlation may have been misinterpreted. Describe the correlation, the
reported interpretation, and the problems you see in the interpretation.
26.
Well-Interpreted Correlations. Find a recent news report in which you
believe that a correlation has been presented with a reasonable interpretation.
Describe the correlation and the reported interpretation, and explain why you
think the interpretation is valid.
7.3 BEST-FIT LINES AND PREDICTION
Suppose you are lucky enough to win a 1.5-carat diamond in a contest. Based on the
correlation between weight and price in Figure 7.1, it should be possible to predict the
approximate value of the diamond. We need only study the graph carefully and decide
where a point corresponding to 1.5 carats is most likely to fall. To do this, it is helpful to
draw a best-fit line (also called a regression line) through the data, as shown
in Figure 7.18. This line is a “best fit” in the sense that, according to a standard
statistical measure (which we discuss shortly), the data points lie closer to this line than
to any other straight line that we could draw through the data.
Figure 7.18 Best-fit line for the data from Figure
7.1.
BY THE WAY
The term regression comes from an 1877 study by Sir Francis Galton. He found that the
heights of boys with short or tall fathers were closer to the mean than were the heights
of their fathers. He therefore said that the heights of the children regress toward the
mean, from which we get the term regression. The term is now used even for data that
have nothing to do with a tendency to regress toward a mean.
Definition
The best-fit line (or regression line) on a scatterplot is a line that lies closer to the data
points than any other possible line (according to a standard statistical measure of
closeness).
Of all the possible straight lines that can be drawn on a diagram, how do you know
which one is the best-fit line? In many cases, you can make a good estimate of the best-fit line simply by looking at the data and drawing the line that visually appears to pass
closest to all the data points. This method involves drawing the best-fit line “by eye.” As
you might guess, there are methods for calculating the precise equation of a best-fit line
(see the optional topic at the end of this section), and many computer programs and
calculators can do these calculations automatically. For our purposes in this text, a fit by
eye will generally be sufficient.
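For reference, the calculation that computer programs perform is the standard least-squares one: the slope is the sum of (x − x̄)(y − ȳ) divided by the sum of (x − x̄)², and the intercept is ȳ minus the slope times x̄. The sketch below applies these formulas to made-up diamond data (not the data of Figure 7.18); only the formulas themselves are standard.

```python
def best_fit_line(x, y):
    """Return (slope, intercept) of the least-squares best-fit line."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    # slope = sum of (x - x_mean)(y - y_mean) over sum of (x - x_mean)^2
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical diamond weights (carats) and prices (dollars)
carats = [0.5, 0.8, 1.0, 1.2, 1.5]
prices = [2000, 3500, 4800, 6200, 8600]

m, b = best_fit_line(carats, prices)
print(round(m), round(b))       # 6621 -1601: dollars per carat and intercept
print(round(m * 1.5 + b))       # 8330: predicted price of a 1.5-carat diamond
```

As with any best-fit line, this prediction is a statistical statement about the mean price of many such diamonds, not a guarantee for any individual one.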
Predictions with Best-Fit Lines
We can use the best-fit line in Figure 7.18 to predict the price of a 1.5-carat diamond.
As indicated by the dashed lines in the figure, the best-fit line predicts that the diamond
will cost about $9,000. Notice, however, that two actual data points in the figure
correspond to 1.5-carat diamonds, and both of these diamonds cost less than $9,000.
That is, although the predicted price of $9,000 sounds reasonable, it is certainly not
guaranteed. In fact, the degree of scatter among the data points in this case tells us that
we should not trust the best-fit line to predict accurately the price for any individual
diamond. Instead, the prediction is meaningful only in a statistical sense: It tells us that
if we examined many 1.5-carat diamonds, their mean price would be about $9,000.
This is only the first of several important cautions about interpreting predictions with
best-fit lines. A second caution is to beware of using best-fit lines to make predictions
that go beyond the bounds of the available data. Figure 7.19 shows a best-fit line for
the correlation between infant mortality and longevity from Figure 7.4. According to
this line, a country with a life expectancy of more than about 80 years would have
a negative infant mortality rate, which is impossible.
It is a capital mistake to theorize before one has data. —Arthur Conan Doyle
Figure 7.19 A best-fit line for the correlation
between infant mortality and longevity
from Figure 7.4.
Source: United Nations.
A third caution is to avoid using best-fit lines from old data sets to make predictions
about current or future results. For example, economists studying historical data found
a strong negative correlation between unemployment and the rate of inflation.
According to this correlation, inflation should have risen dramatically in the mid-2000s
when the unemployment rate fell below 6%. But inflation remained low, showing that
the correlation from old data did not continue to hold.
Fourth, a correlation discovered with a sample drawn from a particular population
cannot generally be used to make predictions about other populations. For example, we
can’t expect that the correlation between aspirin consumption and heart attacks in an
experiment involving only men will also apply to women.
It’s tough to make predictions, especially about the future. —attributed to Niels
Bohr, Yogi Berra, and others
Fifth, remember that we can draw a best-fit line through any data set, but that line is
meaningless when the correlation is not significant or when the relationship is
nonlinear. For example, there is no correlation between shoe size and IQ, so we could
not use shoe size to predict IQ.
Cautions in Making Predictions from Best-Fit
Lines
1. Don’t expect a best-fit line to give a good prediction unless the correlation is strong
and there are many data points. If the sample points lie very close to the best-fit line,
the correlation is very strong and the prediction is more likely to be accurate. If the
sample points lie away from the best-fit line by substantial amounts, the correlation
is weak and predictions tend to be much less accurate.
2. Don’t use a best-fit line to make predictions beyond the bounds of the data points
to which the line was fit.
3. A best-fit line based on past data is not necessarily valid now and might not result
in valid predictions of the future.
4. Don’t make predictions about a population that is different from the population
from which the sample data were drawn.
5. Remember that a best-fit line is meaningless when there is no significant
correlation or when the relationship is nonlinear.
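As a concrete illustration of Cautions 1 and 2, the following Python sketch fits a best-fit line to made-up diamond-style data (not the actual values from Figure 7.18) and flags any prediction that falls outside the range of the data:

```python
import numpy as np

# Hypothetical weight (carats) and price ($) pairs, loosely echoing Figure 7.18.
weights = np.array([0.5, 0.7, 1.0, 1.0, 1.2, 1.5, 1.5, 1.8, 2.0])
prices  = np.array([1500, 2300, 4300, 5200, 6100, 8200, 8700, 11000, 13500])

# Least-squares best-fit line: price ~ slope * weight + intercept.
slope, intercept = np.polyfit(weights, prices, 1)

def predict(carats):
    return slope * carats + intercept

# Caution 2: trust predictions only within the bounds of the data.
lo, hi = weights.min(), weights.max()
for w in (1.5, 5.0):
    tag = "within data range" if lo <= w <= hi else "extrapolation -- do not trust"
    print(f"{w} carats -> ${predict(w):,.0f} ({tag})")
```

Note that even the "within data range" prediction is meaningful only in the statistical sense discussed earlier: it estimates the mean price of many such diamonds, not the price of any individual one.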
EXAMPLE
Valid Predictions?
State whether the prediction (or implied prediction) should be trusted in each of the
following cases, and explain why or why not.
a. You’ve found a best-fit line for a correlation between the number of hours per day
that people exercise and the number of calories they consume each day. You’ve used
this correlation to predict that a person who exercises 18 hours per day would
consume 15,000 calories per day.
b. There is a well-known but weak correlation between SAT scores and college
grades. You use this correlation to predict the college grades of your best friend from
her SAT scores.
c. Historical data have shown a strong negative correlation between national birth
rates and affluence. That is, countries with greater affluence tend to have lower birth
rates. These data predict a high birth rate in Russia.
d. A study in China has discovered correlations that are useful in designing museum
exhibits that Chinese children enjoy. A curator suggests using this information to
design a new museum exhibit for Atlanta-area school children.
e. Scientific studies have shown a very strong correlation between children’s
ingesting of lead and mental retardation. Based on this correlation, paints containing
lead were banned.
f. Based on a large data set, you’ve made a scatterplot for salsa consumption (per
person) versus years of education. The diagram shows no significant correlation, but
you’ve drawn a best-fit line anyway. The line predicts that someone who consumes a
pint of salsa per week has at least 13 years of education.
SOLUTION
a. No one exercises 18 hours per day on an ongoing basis, so this much exercise must
be beyond the bounds of any data collected. Therefore, a prediction about someone
who exercises 18 hours per day should not be trusted.
b. The fact that the correlation between SAT scores and college grades is weak means
there is much scatter in the data. As a result, we should not expect great accuracy if
we use this weak correlation to make a prediction about a single individual.
c. We cannot automatically assume that the historical data still apply today. In fact,
Russia currently has a very low birth rate, despite also having a low level of affluence.
d. The suggestion to use information from the Chinese study for an Atlanta exhibit
assumes that predictions made from correlations in China also apply to Atlanta.
However, given the cultural differences between China and Atlanta, the curator’s
suggestion should not be considered without more information to back it up.
e. Given the strength of the correlation and the severity of the consequences, this
prediction and the ban that followed seem quite reasonable. In fact, later studies
established lead as an actual cause of mental retardation, making the rationale
behind the ban even stronger.
f. Because there is no significant correlation, the best-fit line and any predictions
made from it are meaningless.
BY THE WAY
In the United States, lead was banned from house paint in 1978 and from food cans in
1991, and a 25-year phaseout of lead in gasoline was completed in 1995. Nevertheless,
many young children—especially children living in poor areas—still have enough lead in
their blood to damage their health. Major sources of ongoing lead hazards include paint
in older housing and soil near major roads, which has high lead content from past use of
leaded gasoline.
EXAMPLE
Will Women Be Faster Than Men?
Figure 7.20 shows data and best-fit lines for both men’s and women’s world record
times in the 1-mile race. Based on these data, predict when the women’s world record
will be faster than the men’s world record. Comment on the prediction.
Figure 7.20 World record times in the mile (men
and women).
SOLUTION If we accept the best-fit lines as drawn, the women’s world record will
equal the men’s world record by about 2040. However, this is not a valid prediction
because it is based on extending the best-fit lines beyond the range of the actual data. In
fact, notice that the most recent world records (as of 2011) date all the way back to 1999
for men and 1996 for women, while the best-fit lines predict that the records should
have fallen by several more seconds since those dates.
The Correlation Coefficient and Best-Fit Lines
Earlier, we discussed the correlation coefficient as one way of measuring the strength of
a correlation. We can also use the correlation coefficient to say something about the
validity of predictions with best-fit lines.
For mathematical reasons (not discussed in this text), the square of the correlation
coefficient, or r2, is the proportion of the variation in a variable that is accounted for by
the best-fit line (or, more technically, by the linear relationship that the best-fit line
expresses). For example, the correlation coefficient for the diamond weight and price
data (see Figure 7.18) turns out to be r = 0.777. If we square this value, we get r2 =
0.604, which we can interpret as follows: About 0.6, or 60%, of the variation in the
diamond prices is accounted for by the best-fit line relating weight and price. That
leaves 40% of the variation in price that must be due to other factors, presumably such
things as depth, table, color, and clarity—which is why predictions made with the
best-fit line in Figure 7.18 are not very precise.
A best-fit line can give precise predictions only in the case of a perfect correlation (r = 1
or r = –1); we then find r2 = 1, which means that 100% of the variation in a variable can
be accounted for by the best-fit line. In this special case of r2 = 1, predictions should be
exactly correct, except for the fact that the sample data might not be a true
representation of the population data.
Best-Fit Lines and r2
The square of the correlation coefficient, or r2, is the proportion of the variation in a
variable that is accounted for by the best-fit line.
TECHNICAL NOTE
Statisticians call r2 the coefficient of determination.
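The relationship between r and r² can be verified directly with a short computation. The paired data here are invented for illustration, not the actual diamond values:

```python
import numpy as np

# Illustrative paired data (not the actual weight/price values).
x = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 1.7, 2.0])
y = np.array([1800, 2900, 4600, 5000, 9200, 8800, 14000])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
r_squared = r ** 2            # proportion of variation accounted for

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"About {r_squared:.0%} of the variation in y is accounted for "
      f"by the best-fit line; the remaining {1 - r_squared:.0%} is due to other factors.")
```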
EXAMPLE
Retail Hiring
You are the manager of a large department store. Over the years, you’ve found a strong
correlation between your September sales and the number of employees you’ll need to
hire for peak efficiency during the holiday season; the correlation coefficient is 0.950.
This year your September sales are fairly strong. Should you start advertising for help
based on the best-fit line?
SOLUTION In this case, we find that r2 = (0.950)2 = 0.903, which means that 90% of the
variation in the number of peak employees can be accounted for by a linear relationship
with September sales. That leaves only 10% of the variation in the number of peak
employees unaccounted for. Because 90% is so high, we conclude that the best-fit line
accounts for the data quite well, so it seems reasonable to use it to predict the number of
employees you’ll need for this year’s holiday season.
EXAMPLE
Voter Turnout and Unemployment
Political scientists are interested in knowing what factors affect voter turnout in
elections. One such factor is the unemployment rate. Data collected in presidential
election years since 1964 show a very weak negative correlation between voter turnout
and the unemployment rate, with a correlation coefficient of about r = –0.1 (Figure
7.21). Based on this correlation, should we use the unemployment rate to predict voter
turnout in the next presidential election?
Figure 7.21 Data on voter turnout and
unemployment, 1964–2008.
Source: U.S. Bureau of Labor Statistics.
SOLUTION The square of the correlation coefficient is r2 = (–0.1)2 = 0.01, which means
that only about 1% of the variation in the data is accounted for by the best-fit line.
Nearly all of the variation in the data must therefore be explained by other factors. We
conclude that unemployment is not a reliable predictor of voter turnout.
Multiple Regression
If you’ve ever purchased a diamond, you might have been surprised that we found such
a weak correlation between color and price in Figure 7.2. Surely a diamond cannot be
very valuable if it has poor color quality. Perhaps color helps to explain why the
correlation between weight and price is not perfect. For example, maybe differences in
color explain why two diamonds with the same weight can have different prices. To
check this idea, it would be nice to look for a correlation between the price and some
combination of weight and color together.
All who drink his remedy recover in a short time, except those whom it does not
help, who all die. Therefore, it is obvious that it fails only in incurable cases. —
Galen, Roman “doctor”
TIME OUT TO THINK
Check this idea in Table 7.1. Notice, for example, that Diamonds 4 and 5 have nearly
identical weights, but Diamond 4 costs only $4,299 while Diamond 5 costs $9,589. Can
differences in their color explain the different prices? Study other examples in Table
7.1 in which two diamonds have similar weights but different prices. Overall, do you
think that the correlation with price would be stronger if we used weight and color
together instead of either one alone? Explain.
There is a method for investigating a correlation between one variable (such as price)
and a combination of two or more other variables (such as weight and color). The
technique is called multiple regression, and it essentially allows us to find a best-fit
equation that relates three or more variables (instead of just two). Because it involves
more than two variables, we cannot make simple diagrams to show best-fit equations for
multiple regression. However, it is still possible to calculate a measure of how well the
data fit a linear equation. The most common measure in multiple regression is
the coefficient of determination, denoted R2. It tells us how much of the scatter in the
data is accounted for by the best-fit equation. If R2 is close to 1, the best-fit equation
should be very useful for making predictions within the range of the data values. If R2 is
close to zero, then predictions with the best-fit equation are essentially useless.
Definition
The use of multiple regression allows the calculation of a best-fit equation that
represents the best fit between one variable (such as price) and a combination of two or
more other variables (such as weight and color). The coefficient of determination, R2,
tells us the proportion of the scatter in the data accounted for by the best-fit equation.
In this text, we will not describe methods for finding best-fit equations by multiple
regression. However, you can use the value of R2 to interpret results from multiple
regression. For example, the correlation between price and weight and color
together results in a value of R2 = 0.79. This is somewhat higher than the r2 = 0.61 that
we found for the correlation between price and weight alone. Statisticians who study
diamond pricing know that they can get stronger correlations by including additional
variables in the multiple regression (such as depth, table, and clarity). Given the billions
of dollars spent annually on diamonds, you can be sure that statisticians play prominent
roles in helping diamond dealers realize the largest possible profits.
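Although this text does not describe methods for finding best-fit equations by multiple regression, the idea can be sketched in a few lines. The data below are hypothetical (not the values from Table 7.1), and the fit uses ordinary least squares with two predictors; the color grades here are an invented 0-to-6 scale:

```python
import numpy as np

# Hypothetical diamonds: weight (carats), color grade (lower = better), price ($).
weight = np.array([1.0, 1.0, 1.2, 1.2, 1.5, 1.5, 1.8, 2.0])
color  = np.array([1,   5,   2,   6,   1,   5,   3,   2  ])
price  = np.array([5200, 3900, 6800, 5100, 9600, 7400, 10800, 13900])

# Design matrix with a column of 1s for the intercept term.
X = np.column_stack([np.ones_like(weight), weight, color])
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)

# R^2: proportion of the variation in price accounted for by the best-fit equation.
predicted = X @ coeffs
ss_res = np.sum((price - predicted) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
R_squared = 1 - ss_res / ss_tot
print(f"R^2 = {R_squared:.3f}")
```

A useful property of least squares is that adding a predictor can never decrease R², which is why the combination of weight and color fits at least as well as weight alone.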
BY THE WAY
One study of alumni donations found that, in developing a multiple regression equation,
one should include these variables: income, age, marital status, whether the donor
belonged to a fraternity or sorority, whether the donor is active in alumni affairs, the
donor’s distance from the college, and the nation’s unemployment rate, used as a
measure of the economy (Bruggink and Siddiqui, “An Econometric Model of Alumni
Giving: A Case Study for a Liberal Arts College,” The American Economist, Vol. 39, No.
2).
EXAMPLE
Alumni Contributions
You’ve been hired by your college’s alumni association to research how past
contributions were associated with alumni income and years that have passed since
graduation. It is found that R2 = 0.36. What does that result tell us?
SOLUTION With R2 = 0.36, we conclude that 36% of the variation in past
contributions can be explained by the variation in alumni income and years since
graduation. It follows that 64% of the variation in past contributions can be explained by
factors other than alumni income level and years since graduation. Because such a large
proportion of the variation can be explained by other factors, it would make sense to try
to identify any other factors that might have a strong effect on past contributions.
Finding Equations for Best-Fit Lines (Optional
Section)
The mathematical technique for finding the equation of a best-fit line is based on the
following basic ideas. If we draw any line on a scatterplot, we can measure
the vertical distance between each data point and that line. One measure of how well the
line fits the data is the sum of the squares of these vertical distances. A large sum means
that the vertical distances of data points from the line are fairly large and hence the line
is not a very good fit. A small sum means the data points lie close to the line and the fit is
good. Of all possible lines, the best-fit line is the line that minimizes the sum of the
squares of the vertical distances. Because of this property, the best-fit line is sometimes
called the least squares line.
You may recall that the equation of any straight line can be written in the general form

y = mx + b

where m is the slope of the line and b is the y-intercept of the line. The formulas for the
slope and y-intercept of the best-fit line are as follows:

slope: m = r × (sy / sx)        y-intercept: b = ȳ − m × x̄

In the above expressions, r is the correlation coefficient, sx denotes the standard
deviation of the x values (or the values of the first variable), sy denotes the standard
deviation of the y values, x̄ represents the mean of the values of the variable x,
and ȳ represents the mean of the values of the variable y. Because these formulas are
tedious with manual calculations, we usually use a calculator or computer to find the
slope and y-intercept of best-fit lines. Statistical software packages and some
calculators, such as the TI-83/84 Plus family of calculators, are designed to
automatically generate the equation of a best-fit line.
When software or a calculator is used to find the slope and intercept of the best-fit line,
results are commonly expressed in the format y = b0 + b1x, where b0 is the intercept
and b1 is the slope, so be careful to correctly identify those two values.
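These formulas can be checked numerically. The following sketch, using made-up data, computes the slope and y-intercept from r, the standard deviations, and the means, then compares the result with numpy's least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)  # sample standard deviations

m = r * (sy / sx)           # slope of the best-fit line
b = y.mean() - m * x.mean() # y-intercept of the best-fit line

# The same line from numpy's built-in least-squares routine:
m_np, b_np = np.polyfit(x, y, 1)
print(f"m = {m:.4f} (polyfit: {m_np:.4f}), b = {b:.4f} (polyfit: {b_np:.4f})")
```

The two methods agree because the least-squares slope equals r multiplied by the ratio of the standard deviations, and the line always passes through the point (x̄, ȳ).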
Section 7.3 Exercises
Statistical Literacy and Critical Thinking
1.
Best-Fit Line. What is a best-fit line (also called a regression line)? How is a
best-fit line useful?
2.
r2. For a study involving paired sample data, it is found that r = –0.4. What is the
value of r2? In general, what is r2 called, what does it measure, and how can it be
interpreted? That is, what does its value tell us about the variables?
3.
Regression. An investigator has data consisting of heights of daughters and the
heights of the corresponding mothers and fathers. She wants to analyze the data
to see the effect that the height of the mother and the height of the father has on
the height of the daughter. Should she use a (linear) regression or multiple
regression? What is the basic difference between (linear) regression and multiple
regression?
4.
R2. Using data described in Exercise 3, it is found that R2 = 0.68. Interpret that
value. That is, what does that value tell us about the data?
Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense
(or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of
these statements have definitive answers, so your explanation is more important than
your chosen answer.
5.
r2 Value. A value of r2 = 1 is obtained from a sample of paired data with one
variable representing the amount of gas (gallons) purchased and the total cost of
the gas.
6.
r2 Value. A value of r2 = –0.040 is obtained from a sample of men, with each
pair of data consisting of the height in inches and the SAT score for one man.
7.
Height and Weight. Using data from the National Health Survey, the equation
of the best-fit line for women’s heights and weights is obtained, and it shows that
a woman 120 inches tall is predicted to weigh 430 pounds.
8.
Old Faithful. Using paired sample data consisting of the duration time (in
seconds) of eruptions of Old Faithful geyser and the time interval (in minutes)
after the eruption, a value of r2 = 0.926 is calculated, indicating that about 93% of
the variation in the interval after eruption can be explained by the relationship
between those two variables as described by the best-fit line.
Concepts and Applications
Best-Fit Lines on Scatterplots. For Exercises 9–12, do the following.
a. Insert a best-fit line in the given scatterplot.
b. Estimate or compute r and r2. Based on your value for r2, determine how much of
the variation in the variable can be accounted for by the best-fit line.
c. Briefly discuss whether you could make valid predictions from this best-fit line.
9.
Use the scatterplot for color and price in Figure 7.2.
10.
Use the scatterplot for life expectancy and infant mortality in Figure 7.4.
11.
Use the scatterplot for number of farms and size of farms in Figure 7.5.
12.
Use both scatterplots for actual and predicted temperature in Figure 7.6.
Best-Fit Lines. Exercises 13–20 refer to the tables in the Section 7.1 Exercises. In
each case, do the following.
a. Construct a scatterplot and, based on visual inspection, draw the best-fit line by
eye.
b. Briefly discuss the strength of the correlation. Estimate or compute r and r2. Based
on your value for r2, identify how much of the variation in the variable can be
accounted for by the best-fit line.
c. Identify any outliers on the scatterplot and discuss their effects on the strength of
the correlation and on the best-fit line.
d. For this case, do you believe that the best-fit line gives reliable predictions outside
the range of the data on the scatterplot? Explain.
13.
Use the data in Exercise 19 of Section 7.1.
14.
Use the data in Exercise 20 of Section 7.1.
15.
Use the data in Exercise 21 of Section 7.1.
16.
Use the data in Exercise 22 of Section 7.1.
17.
Use the data in Exercise 23 of Section 7.1. To locate the points, use the
midpoint of each income category; use a value of $25,000 for the category “less
than $30,000,” and use a value of $70,000 for the category “more than
$60,000.”
18.
Use the data in Exercise 24 of Section 7.1.
19.
Use the data in Exercise 25 of Section 7.1.
20.
Use the data in Exercise 26 of Section 7.1.
PROJECTS FOR
THE INTERNET & BEYOND
21.
Lead Poisoning. Research lead poisoning, its sources, and its effects. Discuss
the correlations that have helped researchers understand lead poisoning. Discuss
efforts to prevent it.
22.
Asbestos. Research asbestos, its sources, and its effects. Discuss the
correlations that have helped researchers understand adverse health effects from
asbestos exposure. Discuss efforts to prevent those adverse health effects.
23.
Worldwide Population Indicators. The following table gives five population
indicators for eleven selected countries. Study these data and try to identify
possible correlations. Doing additional research if necessary, discuss the possible
correlations you have found, speculate on the reasons for the correlations, and
discuss whether they suggest a causal relationship. Birth and death rates are per
1,000 population; fertility rate is per woman.
Country         Birth rate   Death rate   Life expectancy   Percent urban   Fertility rate
Afghanistan         50           22              43               20              6.9
Argentina           21            8              72               88              2.6
Australia           15            7              78               85              1.9
Canada              14            7              78               77              1.6
Egypt               29            8              64               45              3.4
El Salvador         30            6              68               45              3.1
France              13            9              78               73              1.6
Israel              21            7              77               91              2.8
Japan               10            7              79               78              1.5
Laos                45           15              51               22              6.7
United States       16            9              76               76              2.0
Source: The New York Times Almanac.
IN THE NEWS
24.
Predictions in the News. Find a recent news report in which a correlation is
used to make a prediction. Evaluate the validity of the prediction, considering all
of the cautions described in this section. Overall, do you think the prediction is
valid? Why or why not?
25.
Best-Fit Line in the News. Although scatterplots are rare in the news, they
are not unheard of. Find a scatterplot of any kind in a news article (recent or
not). Draw a best-fit line by eye. Discuss what predictions, if any, can be made
from your best-fit line.
26.
Your Own Multiple Regression. Come up with an example from your own
life or work in which a multiple regression analysis might reveal important
trends. Without actually doing any analysis, describe in words what you would
look for through the multiple regression and how the answers might be useful.
7.4 THE SEARCH FOR CAUSALITY
A correlation may suggest causality, but by itself a correlation
never establishes causality. Much more evidence is required to establish that one
factor causes another. Earlier, we found that a correlation between two variables may be
the result of either (1) coincidence, (2) a common underlying cause, or (3) one variable
actually having a direct influence on the other. The process of establishing causality is
essentially a process of ruling out the first two explanations.
In principle, we can rule out the first two explanations by conducting experiments:
• We can rule out coincidence by repeating the experiment many times (or by using a
large number of subjects in the experiment). Because coincidences occur randomly,
the same coincidence is unlikely to occur in repeated trials of an experiment.
• We can rule out a common underlying cause by controlling and randomizing the
experiment to eliminate the effects of confounding variables (see Section 1.3). If the
controls rule out confounding variables, any remaining effects must be caused by the
variables of interest.
Unfortunately, these ideas are often difficult to put into practice. In the case of ruling
out coincidence, it may be too time-consuming or expensive to repeat an experiment a
sufficient number of times. To rule out a common underlying cause, the experiment
must control for everything except the variables of interest, and this is often impossible.
Moreover, there are many cases in which experiments are impractical or unethical, so
we can gather only observational data. Because observational studies cannot definitively
establish causality, we must find other ways of trying to establish causality.
Establishing Causality
Suppose you have discovered a correlation and suspect causality. How can you test your
suspicion? Let’s return to the issue of smoking and lung cancer. The strong correlation
between smoking and lung cancer did not by itself prove that smoking causes lung
cancer. In principle, we could have looked for proof with a controlled experiment. But
such an experiment would be unethical because it would require forcing a group of
randomly selected people to smoke cigarettes. So how was smoking established as a
cause of lung cancer?
The answer involves several lines of evidence. First, researchers found correlations
between smoking and lung cancer among many groups of people: women, men, and
people of different races and cultures. Second, among groups of people that seemed
otherwise identical, lung cancer was found to be more rare in nonsmokers. Third, people
who smoked more and for longer periods of time were found to have higher rates of lung
cancer. Fourth, when researchers accounted for other potential causes of lung cancer
(such as exposure to radon gas or asbestos), they found that almost all the remaining
lung cancer cases occurred among smokers (or people exposed to second-hand smoke).
BY THE WAY
Statistical methods cannot prove that smoking causes cancer, but statistical methods
can be used to identify an association, and physical proof of causation can then be
sought by researchers. Dr. David Sidransky of Johns Hopkins University and other
researchers found a direct physical link that involves mutations of a specific gene among
smokers. Molecular analysis of genetic changes allows researchers to determine whether
cigarette smoking is the cause of a cancer. (See “Association Between Cigarette Smoking
and Mutation of the p53 Gene in Squamous-Cell Carcinoma of the Head and Neck,” by
Brennan, Boyle et al., New England Journal of Medicine, Vol 332, No. 11.)
These four lines of evidence made a strong case, but still did not rule out the possibility
that some other factor, such as genetics, predisposes people both to smoking and to lung
cancer. However, two additional lines of evidence made this possibility highly unlikely.
One line of evidence came from animal experiments. In controlled experiments, animals
were divided into randomly chosen treatment and control groups. The experiments still
found a correlation between inhalation of cigarette smoke and lung cancer, which seems
to rule out a genetic factor, at least in the animals. The final line of evidence came from
biologists studying small samples of human lung tissue. The biologists discovered the
basic process by which ingredients in cigarette smoke create cancer-causing mutations.
This process does not appear to depend in any way on specific genetic factors, making
it all but certain that lung cancer is caused by smoking and not by any preexisting
genetic factor. The fact that second-hand smoke exposure is also associated with some
cases of lung cancer further argues against a genetic factor (since second-hand smoke
affects non-smokers) but is consistent with the idea that ingredients in cigarette smoke
create cancer-causing mutations.
The following box summarizes these ideas about establishing causality. Generally
speaking, the case for causality is stronger when more of these guidelines are met.
Guidelines for Establishing Causality
If you suspect that a particular variable (the suspected cause) is causing some effect:
1. Look for situations in which the effect is correlated with the suspected cause even
while other factors vary.
2. Among groups that differ only in the presence or absence of the suspected cause,
check that the effect is similarly present or absent.
3. Look for evidence that larger amounts of the suspected cause produce larger
amounts of the effect.
4. If the effect might be produced by other potential causes (besides your suspected
cause), make sure that the effect still remains after accounting for these other
potential causes.
5. If possible, test the suspected cause with an experiment. If the experiment cannot
be performed with humans for ethical reasons, consider doing the experiment with
animals, cell cultures, or computer models.
6. Try to determine the physical mechanism by which the suspected cause produces
the effect.
BY THE WAY
The first four of these guidelines are called Mill’s methods after John Stuart Mill
(1806–1873). Mill was a leading scholar of his time and an early advocate of women’s
right to vote. In philosophy, the four methods are called, respectively, the methods of
agreement, difference, concomitant variation, and residues.
TIME OUT TO THINK
There’s a great deal of controversy concerning whether animal experiments are ethical.
What is your opinion of animal experiments? Defend your opinion.
CASE STUDY Air Bags and Children
By the mid-1990s, passenger-side air bags had become commonplace in cars. Statistical
studies showed that the air bags saved many lives in moderate- to high-speed collisions.
But a disturbing pattern also appeared. In at least some cases, young children, especially
infants and toddlers in child car seats, were killed by air bags in low-speed collisions.
At first, many safety advocates found it difficult to believe that air bags could be the
cause of the deaths. But the observational evidence became stronger, meeting the first
four guidelines for establishing causality. For example, the greater risk to infants in
child car seats fit Guideline 3, because it indicated that being closer to the air bags
increased the risk of death. (A child car seat sits on top of the built-in seat, thereby
putting a child closer to the air bags than the child would be otherwise.)
To seal the case, safety experts undertook experiments using dummies. They found that
children, because of their small size, often sit where they could be easily hurt by the
explosive opening of an air bag. The experiments also showed that an air bag could
impact a child car seat hard enough to cause death, thereby revealing the physical
mechanism by which the deaths occurred.
BY THE WAY
Based on these studies, the government now recommends that child car seats never be
used on the front seat and that children under age 12 (or under 4 feet, 9 inches tall) sit in
the back seat whenever possible.
CASE STUDY Cardiac Bypass Surgery
Cardiac bypass surgery is performed on people who have severe blockage of arteries that
supply the heart with blood (the coronary arteries). If blood flow stops in these arteries,
a patient may suffer a heart attack and die. Bypass surgery essentially involves grafting
new blood vessels onto the blocked arteries so that blood can flow around the blocked
areas. By the mid-1980s, many doctors were convinced that the surgery was prolonging
the lives of their patients.
However, a few early retrospective studies turned up a disconcerting result: Statistically,
the surgery appeared to be making little difference. In other words, patients who had the
surgery seemed to be faring no better on average than similar patients who did not have
it. If this were true, it meant that the surgery was not worth the pain, risk, and expense
involved.
Because these results flew in the face of what many doctors thought they had observed
in their own patients, researchers began to dig more deeply. Soon, they found
confounding variables that had not been accounted for in the early studies. For example,
they found that patients getting the surgery tended to have more severe blockage of their
arteries, apparently because doctors recommended the surgery more strongly to these
patients. Because these patients were in worse shape to begin with, a comparison of
longevity between them and other patients was not really valid.
More important, the research soon turned up substantial differences in the results
among patients who had the surgery in different hospitals. In particular, a few hospitals
were achieving remarkable success with bypass surgery and their patients fared far
better than patients who did not have the surgery or had it at other hospitals. Clearly,
the surgical techniques used by doctors at the successful hospitals were somehow
different and superior. Doctors studied the differences to ensure that all doctors could
be trained in the superior techniques.
In summary, the confounding variables of amount of blockage and surgical
technique had prevented the early studies from finding a real correlation between
cardiac bypass surgery and prolonged life. Today, cardiac bypass surgery is accepted as
a cause of prolonged life in patients with blocked coronary arteries. It is now among the
most common types of surgery, and it typically adds decades to the lives of the patients
who undergo it.
BY THE WAY
As you might guess, it is also difficult to define reasonable doubt. For criminal trials, the
Supreme Court endorsed this guidance from Justice Ruth Bader Ginsburg: “Proof
beyond a reasonable doubt is proof that leaves you firmly convinced of the defendant’s
guilt. There are very few things in this world that we know with absolute certainty, and
in criminal cases the law does not require proof that overcomes every possible doubt. If,
based on your consideration of the evidence, you are firmly convinced that the
defendant is guilty of the crime charged, you must find him guilty. If on the other hand,
you think there is a real possibility that he is not guilty, you must give him the benefit of
the doubt and find him not guilty.”
Hidden Causality
So far we have discussed how to establish causality after first discovering a correlation.
However, sometimes a correlation—or the lack of a correlation—can hide an underlying
causality. As the next case study shows, such hidden causality often occurs because of
confounding variables.
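To see concretely how a confounding variable can hide a real benefit, here is a small simulation in the spirit of the bypass-surgery discussion. The numbers are entirely hypothetical, not from the actual studies: sicker patients are both more likely to receive surgery and more likely to die, so a naive comparison makes the surgery look useless even though it helps within every severity group.

```python
import random

random.seed(0)

# Hypothetical survival probabilities (NOT real medical data):
# surgery improves survival within each severity group, but
# doctors recommend surgery far more often to severe patients.
P_SURVIVE = {  # (severity, had_surgery) -> survival probability
    ("mild", False): 0.90, ("mild", True): 0.95,
    ("severe", False): 0.50, ("severe", True): 0.60,
}
P_SURGERY = {"mild": 0.2, "severe": 0.8}

patients = []
for _ in range(100_000):
    severity = random.choice(["mild", "severe"])
    surgery = random.random() < P_SURGERY[severity]
    survived = random.random() < P_SURVIVE[(severity, surgery)]
    patients.append((severity, surgery, survived))

def survival_rate(group):
    return sum(p[2] for p in group) / len(group)

# Naive comparison (ignores severity): surgery looks worse,
# because the surgery group is dominated by severe patients.
naive_surgery = survival_rate([p for p in patients if p[1]])
naive_none = survival_rate([p for p in patients if not p[1]])
print(f"naive: surgery {naive_surgery:.2f} vs none {naive_none:.2f}")

# Stratified comparison: surgery helps within each severity group.
for sev in ["mild", "severe"]:
    with_s = survival_rate([p for p in patients if p[0] == sev and p[1]])
    without = survival_rate([p for p in patients if p[0] == sev and not p[1]])
    print(f"{sev}: surgery {with_s:.2f} vs none {without:.2f}")
```

Comparing like with like (mild patients with mild patients, severe with severe) reverses the naive conclusion, which is exactly the kind of correction the later bypass-surgery studies had to make.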
Confidence in Causality
The six guidelines offer us a way to examine the strength of a case for causality, but we
often must make decisions before a case for causality is fully established. Consider, for
example, the well-known case of global warming. It may never be possible to prove
beyond all doubt that the burning of fossil fuels is causing global warming (see the Focus
on Environment at the end of this chapter), so we must decide whether to act while we
still face some uncertainty about causation. How much must we know before we decide
to act?
In other areas of statistics, accepted techniques help us deal with this type of uncertainty
by allowing us to calculate a numerical level of confidence or significance. But there are
no accepted ways to assign such numbers to the uncertainty that comes with questions
of causality. Fortunately, another area of study has dealt with practical problems of
causality for hundreds of years: our legal system. You may be familiar wi...
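As a sketch of the kind of numerical significance calculation mentioned above, here is one simple technique, a permutation test, applied to made-up data (the data and the choice of method are illustrative assumptions, not material from this chapter). It asks: if there were really no relationship, how often would shuffled data show a correlation this strong?

```python
import random

random.seed(1)

def correlation(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: hours studied vs. exam score for 10 students.
hours = [1, 2, 2, 3, 4, 5, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 70, 75, 74, 80, 85]

r_observed = correlation(hours, scores)

# Permutation test: shuffle the scores many times; the p-value is the
# fraction of shuffles whose correlation is at least as strong.
trials = 10_000
count = 0
for _ in range(trials):
    shuffled = scores[:]
    random.shuffle(shuffled)
    if abs(correlation(hours, shuffled)) >= abs(r_observed):
        count += 1
print(f"r = {r_observed:.3f}, permutation p-value ~ {count / trials:.4f}")
```

A tiny p-value lets us quantify confidence that the correlation is not a fluke of chance; but as the text emphasizes, no such number can by itself tell us that one variable *causes* the other.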