Safe Speeds?, Statistic Help

Mathematics

Description

Correlation (r-value), Scatterplots, Trendline (Regression) Equation, Prediction

HOW TO VIDEO Dr Ami Gates MOOT Intro Stats Correlation Scatterplots Prediction Trendline (Regression Line & Prediction) Excel

https://www.youtube.com/watch?v=P6LwhM_pDuA

HOW TO VIDEO Dr Ami Gates Using Excel Correlation scatterplots regression

https://www.youtube.com/watch?v=nKAd1U_1yl8

HOW TO Video Dr Ami Gates Data Analysis from Google Forms And Correlation in Excel

https://www.youtube.com/watch?v=UDRtF1Evq8I

In Unit 5, there are three main topics (problem types): describing correlations, measuring correlations, and evaluating scatterplots. You will be exposed to all three topics by either formulating a response (main post) or commenting on a classmate’s response to each one.

In your original response (main post), you will choose one type of problem to solve and post a complete solution. Then, in your responses to two classmates, you will comment on posts which involve the other two types of problems. This Discussion topic has multiple parts. Please read and complete each part.

To complete your main post, you may choose any exercise in Chapter 7. As soon as you choose a particular problem to solve from Chapter 7, write it out in full in the Discussion Area. Include the page number and the problem number along with your post. Be sure not to choose a problem that one of your classmates has already posted. This means that before you can choose a problem, you will have to review all current posts. Once you choose, quickly post your problem selection by writing out the entire question and including the page and problem number.

Next, write a complete solution, showing all steps used in solving the problem. Attach any graphs that you might need using Add/Remove. Explain each step as if you were the expert explaining it to a novice.

Unformatted Attachment Preview

7 Correlation and Causality

LEARNING GOALS

• 7.1 Seeking Correlation: Define correlation, explore correlations with scatterplots, and understand the correlation coefficient as a measure of the strength of a correlation.
• 7.2 Interpreting Correlations: Be aware of important cautions concerning the interpretation of correlations, especially the effects of outliers, the effects of grouping data, and the crucial fact that correlation does not necessarily imply causality.
• 7.3 Best-Fit Lines and Prediction: Become familiar with the concept of a best-fit line, recognize when such lines have predictive value and when they do not, and understand the general concept of multiple regression.
• 7.4 The Search for Causality: Understand the difficulty of establishing causality from correlation, and investigate guidelines that can be used to help establish confidence in causality.

FOCUS TOPICS

• Focus on Education: What Helps Children Learn to Read?
• Focus on Environment: What Is Causing Global Warming?

Does smoking cause lung cancer? Are drivers more dangerous when on their cell phones? Is human activity causing global warming? A major goal of many statistical studies is to search for relationships among different variables so that researchers can then determine whether one factor causes another. Once a relationship is discovered, we can try to determine whether there is an underlying cause. In this chapter, we will study relationships known as correlations and explore how they are important to the more difficult task of searching for causality.

The person who knows “how” will always have a job. The person who knows “why” will always be his boss. (Diane Ravitch)

7.1 SEEKING CORRELATION

What does it mean when we say that smoking causes lung cancer? It certainly does not mean that you’ll get lung cancer if you smoke a single cigarette.
It does not even mean that you’ll definitely get lung cancer if you smoke heavily for many years, as some heavy smokers do not get lung cancer. Rather, it is a statistical statement meaning that you are much more likely to get lung cancer if you smoke than if you don’t smoke.

How did researchers learn that smoking causes lung cancer? The process began with informal observations, as doctors noticed that a surprisingly high proportion of their patients with lung cancer were smokers. These observations led to carefully conducted studies in which researchers compared lung cancer rates among smokers and nonsmokers. These studies showed clearly that heavier smokers were more likely to get lung cancer. In more formal terms, we say that there is a correlation between the variables amount of smoking and likelihood of lung cancer. A correlation is a special type of relationship between variables, in which a rise or fall in one goes along with a corresponding rise or fall in the other.

Smoking is one of the leading causes of statistics. (Fletcher Knebel)

Definition: A correlation exists between two variables when higher values of one variable consistently go with higher values of another variable or when higher values of one variable consistently go with lower values of another variable.

Here are a few other examples of correlations:

• There is a correlation between the variables height and weight for people; that is, taller people tend to weigh more than shorter people.
• There is a correlation between the variables demand for apples and price of apples; that is, demand tends to decrease as price increases.
• There is a correlation between practice time and skill among piano players; that is, those who practice more tend to be more skilled.

It’s important to realize that establishing a correlation between two variables does not mean that a change in one variable causes a change in the other.
The correlation between smoking and lung cancer did not by itself prove that smoking causes lung cancer. We could imagine, for example, that some gene predisposes a person both to smoking and to lung cancer. Nevertheless, identifying the correlation was the crucial first step in learning that smoking causes lung cancer. We will discuss the difficult task of establishing causality later in this chapter. For now, we concentrate on how we look for, identify, and interpret correlations.

BY THE WAY
Smoking is linked to many serious diseases besides lung cancer, including heart disease and emphysema. Smoking is also linked with many less lethal health conditions, such as premature skin wrinkling and sexual impotence.

TIME OUT TO THINK
Suppose there really were a gene that made people prone to both smoking and lung cancer. Explain why we would still find a strong correlation between smoking and lung cancer in that case, but would not be able to say that smoking causes lung cancer.

Scatterplots

Table 7.1 lists data for a sample of gem-store diamonds: their prices and several common measures that help determine their value. Because advertisements for diamonds often quote only their weights (in carats), we might suspect a correlation between the weights and the prices. We can look for such a correlation by making a scatterplot (or scatter diagram) showing the relationship between the variables weight and price.
TABLE 7.1 Prices and Characteristics of a Sample of 23 Diamonds from Gem Dealers

Diamond   Price     Weight (carats)   Depth   Table   Color   Clarity
1         $6,958    1.00              60.5    65      3       4
2         $5,885    1.00              59.2    65      5       4
3         $6,333    1.01              62.3    55      4       4
4         $4,299    1.01              64.4    62      5       5
5         $9,589    1.02              63.9    58      2       3
6         $6,921    1.04              60.0    61      4       4
7         $4,426    1.04              62.0    62      5       5
8         $6,885    1.07              63.6    61      4       3
9         $5,826    1.07              61.6    62      5       5
10        $3,670    1.11              60.4    60      9       4
11        $7,176    1.12              60.2    65      2       3
12        $7,497    1.16              59.5    60      5       3
13        $5,170    1.20              62.6    61      6       4
14        $5,547    1.23              59.2    65      7       4
15        $7,521    1.29              59.6    59      6       2
16        $7,260    1.50              61.1    65      6       4
17        $8,139    1.51              63.0    60      6       4
18        $12,196   1.67              58.7    64      3       5
19        $14,998   1.72              58.5    61      4       3
20        $9,736    1.76              57.9    62      8       2
21        $9,859    1.80              59.6    63      5       5
22        $12,398   1.88              62.9    62      6       2
23        $11,008   2.03              62.0    63      8       3

Notes: Weight is measured in carats (1 carat = 0.2 gram). Depth is defined as 100 times the ratio of height to diameter. Table is the size of the upper flat surface. (Depth and table determine “cut.”) Color and clarity are each measured on standard scales, where 1 is best. For color, 1 = colorless, and increasing numbers indicate more yellow. For clarity, 1 = flawless, and 6 indicates that defects can be seen by eye.

BY THE WAY
The word karats (with a k) used to describe gold does not have the same meaning as the term carats (with a c) for diamonds and other gems. A carat is a measure of weight equal to 0.2 gram. Karats are a measure of the purity of gold: 24-karat gold is 100% pure gold; 18-karat gold is 75% pure (and 25% other metals); 12-karat gold is 50% pure (and 50% other metals); and so on.

Definition: A scatterplot (or scatter diagram) is a graph in which each point represents the values of two variables.

Figure 7.1 shows the scatterplot, which can be constructed with the following procedure.

1. We assign one variable to each axis and label the axis with values that comfortably fit all the data. Sometimes the axis selection is arbitrary, but if we suspect that one variable depends on the other then we plot the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this case, we expect the diamond price to depend at least in part on its weight; we therefore say that weight is the explanatory variable (because it helps explain the price) and price is the response variable (because it responds to changes in the explanatory variable). We choose a range of 0 to 2.5 carats for the weight axis and $0 to $16,000 for the price axis.
2. For each diamond in Table 7.1, we plot a single point at the horizontal position corresponding to its weight and the vertical position corresponding to its price. For example, the point for Diamond 10 goes at a position of 1.11 carats on the horizontal axis and $3,670 on the vertical axis. The dashed lines on Figure 7.1 show how we locate this point.
3. (Optional) We can label some (or all) of the data points, as is done for Diamonds 10, 16, and 19 in Figure 7.1.

Figure 7.1 Scatterplot showing the relationship between the variables price and weight for the diamonds in Table 7.1. The dashed lines show how we find the position of the point for Diamond 10.

Scatterplots get their name because the way in which the points are scattered may reveal a relationship between the variables. In Figure 7.1, we see a general upward trend indicating that diamonds with greater weight tend to be more expensive. The correlation is not perfect. For example, the heaviest diamond is not the most expensive. But the overall trend seems fairly clear.

TIME OUT TO THINK
Identify the points in Figure 7.1 that represent Diamonds 3, 7, and 23.

EXAMPLE Color and Price
Using the data in Table 7.1, create a scatterplot to look for a correlation between a diamond’s color and price. Comment on the correlation.
SOLUTION We expect price to depend on color, so we plot the explanatory variable color on the horizontal axis and the response variable price on the vertical axis in Figure 7.2. (You should check a few of the points against the data in Table 7.1.) The points appear much more scattered than in Figure 7.1. Nevertheless, you may notice a weak trend diagonally downward from the upper left toward the lower right. This trend represents a weak correlation in which diamonds with more yellow color (higher numbers for color) are less expensive. This trend is consistent with what we would expect, because colorless diamonds appear to sparkle more and are generally considered more desirable.

Figure 7.2 Scatterplot for the color and price data in Table 7.1.

TIME OUT TO THINK
Thanks to a large bonus at work, you have a budget of $6,000 for a diamond ring. A dealer offers you the following two choices for that price. One diamond weighs 1.20 carats and has color = 4. The other weighs 1.18 carats and has color = 3. Assuming all other characteristics of the diamonds are equal, which would you choose? Why?

Types of Correlation

We have seen two examples of correlation. Figure 7.1 shows a fairly strong correlation between weight and price, while Figure 7.2 shows a weak correlation between color and price. We are now ready to generalize about types of correlation. Figure 7.3 shows eight scatterplots for variables called x and y. Note the following key features of these diagrams:

• Parts a to c show positive correlations: The values of y tend to increase with increasing values of x. The correlation becomes stronger as we proceed from a to c. In fact, c shows a perfect positive correlation, in which all the points fall along a straight line.
• Parts d to f show negative correlations: The values of y tend to decrease with increasing values of x. The negative correlation becomes stronger as we proceed from d to f. In fact, f shows a perfect negative correlation, in which all the points fall along a straight line.
• Part g shows no correlation between x and y: Values of x do not appear to be linked to values of y in any way.
• Part h shows a nonlinear relationship: x and y appear to be related but the relationship does not correspond to a straight line. (Linear means along a straight line, and nonlinear means not along a straight line.)

Figure 7.3 Types of correlation seen on scatterplots.

Types of Correlation

• Positive correlation: Both variables tend to increase (or decrease) together.
• Negative correlation: The two variables tend to change in opposite directions, with one increasing while the other decreases.
• No correlation: There is no apparent (linear) relationship between the two variables.
• Nonlinear relationship: The two variables are related, but the relationship results in a scatterplot that does not follow a straight-line pattern.

TECHNICAL NOTE
In this text we use the term correlation only for linear relationships. Some statisticians refer to nonlinear relationships as “nonlinear correlations.” There are techniques for working with nonlinear relationships that are similar to those described in this text for linear relationships.

EXAMPLE Life Expectancy and Infant Mortality
Figure 7.4 shows a scatterplot for the variables life expectancy and infant mortality in 16 countries. What type of correlation does it show? Does this correlation make sense? Does it imply causality? Explain.

Figure 7.4 Scatterplot for life expectancy and infant mortality data. Source: United Nations.

SOLUTION The diagram shows a moderate negative correlation in which countries with lower infant mortality tend to have higher life expectancy. It is a negative correlation because the two variables vary in opposite directions. The correlation makes sense because we would expect that countries with better health care would have both lower infant mortality and higher life expectancy.
However, it does not imply causality between infant mortality and life expectancy: We would not expect that a concerted effort to reduce infant mortality would increase life expectancy significantly unless it was part of an overall effort to improve health care. (Reducing infant mortality will slightly increase life expectancy because having fewer infant deaths tends to raise the mean age of death for the population.)

Measuring the Strength of a Correlation

For most purposes, it is enough to state whether a correlation is strong, weak, or nonexistent. However, sometimes it is useful to describe the strength of a correlation in more precise terms. Statisticians measure the strength of a correlation with a number called the correlation coefficient, represented by the letter r. The correlation coefficient is easy to calculate in principle (see the optional section on p. 243), but the actual work is tedious unless you use a calculator or computer.

We can explore the interpretation of correlation coefficients by studying Figure 7.3, which shows the value of the correlation coefficient r for each scatterplot. Notice that the correlation coefficient is always between –1 and 1. When points in a scatterplot lie close to an ascending straight line, the correlation coefficient is positive and close to 1. When all the points lie close to a descending straight line, the correlation coefficient is negative with a value close to –1. Points that do not fit any type of straight-line pattern or that lie close to a horizontal straight line (indicating that the y values have no dependence on the x values) result in a correlation coefficient close to 0.

Properties of the Correlation Coefficient, r

• The correlation coefficient, r, is a measure of the strength of a correlation. Its value can range only from –1 to 1.
• If there is no correlation, the points do not follow any ascending or descending straight-line pattern, and the value of r is close to 0.
• If there is a positive correlation, the correlation coefficient is positive (0 < r ≤ 1): Both variables increase together. A perfect positive correlation (in which all the points on a scatterplot lie on an ascending straight line) has a correlation coefficient r = 1. Values of r close to 1 indicate a strong positive correlation and positive values closer to 0 indicate a weak positive correlation.
• If there is a negative correlation, the correlation coefficient is negative (–1 ≤ r < 0): When one variable increases, the other decreases. A perfect negative correlation (in which all the points lie on a descending straight line) has a correlation coefficient r = –1. Values of r close to –1 indicate a strong negative correlation and negative values closer to 0 indicate a weak negative correlation.

TECHNICAL NOTE
For the methods of this section, there is a requirement that the two variables result in data having a “bivariate normal distribution.” This basically means that for any fixed value of one variable, the corresponding values of the other variable have a normal distribution. This requirement is usually very difficult to check, so the check is often reduced to verifying that both variables result in data that are normally distributed.

EXAMPLE U.S. Farm Size
Figure 7.5 shows a scatterplot for the variables number of farms and mean farm size in the United States. Each dot represents data from a single year between 1950 and 2000; on this diagram, the earlier years generally are on the right and the later years on the left. Estimate the correlation coefficient by comparing this diagram to those in Figure 7.3 and discuss the underlying reasons for the correlation.

Figure 7.5 Scatterplot for farm size data. Source: U.S. Department of Agriculture.

SOLUTION The scatterplot shows a strong negative correlation that most closely resembles the scatterplot in Figure 7.3f, suggesting a correlation coefficient around r = –0.9.
The correlation shows that when there were fewer farms, they tended to have a larger mean size, and when there were more farms, they tended to have a smaller mean size. This trend reflects a basic change in the nature of farming: Prior to 1950, most farms were small family farms. Over time, these small farms were replaced by large farms owned by agribusiness corporations.

BY THE WAY
In 1900, more than 40% of the U.S. population worked on farms; by 2000, less than 2% of the population worked on farms.

EXAMPLE Accuracy of Weather Forecasts
The scatterplots in Figure 7.6 show two weeks of data comparing the actual high temperature for the day with the same-day forecast (part a) and the three-day forecast (part b). Estimate the correlation coefficient for each data set and discuss what these coefficients imply about weather forecasts.

Figure 7.6 Comparison of actual high temperatures with (a) same-day and (b) three-day forecasts.

SOLUTION If every forecast were perfect, each actual temperature would equal the corresponding forecasted temperature. This would result in all points lying on a straight line and a correlation coefficient of r = 1. In Figure 7.6a, in which the forecasts were made at the beginning of the same day, the points lie fairly close to a straight line, meaning that same-day forecasts are closely related to actual temperatures. By comparing this scatterplot to the diagrams in Figure 7.3, we can reasonably estimate this correlation coefficient to be about r = 0.8. The correlation is weaker in Figure 7.6b, indicating that forecasts made three days in advance aren’t as close to actual temperatures as same-day forecasts. This correlation coefficient is about r = 0.6. These results are unsurprising because we expect longer-term forecasts to be less accurate.

TIME OUT TO THINK
For further practice, visually estimate the correlation coefficients for the data for diamond weight and price (Figure 7.1) and diamond color and price (Figure 7.2).
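A visual estimate of r can be checked numerically. The following is a minimal sketch, assuming Python with only its standard library (the text itself uses Excel, STATDISK, and the TI-83/84). It computes r for the diamond data in Table 7.1 as the sum of the products of the standard scores for the two variables, divided by n – 1, which is the form discussed in the optional section that follows. The function and variable names are illustrative choices, not part of the text.

```python
from math import sqrt

# Data transcribed from Table 7.1 (23 diamonds, in table order).
weights = [1.00, 1.00, 1.01, 1.01, 1.02, 1.04, 1.04, 1.07, 1.07, 1.11,
           1.12, 1.16, 1.20, 1.23, 1.29, 1.50, 1.51, 1.67, 1.72, 1.76,
           1.80, 1.88, 2.03]
prices = [6958, 5885, 6333, 4299, 9589, 6921, 4426, 6885, 5826, 3670,
          7176, 7497, 5170, 5547, 7521, 7260, 8139, 12196, 14998, 9736,
          9859, 12398, 11008]
colors = [3, 5, 4, 5, 2, 4, 5, 4, 5, 9,
          2, 5, 6, 7, 6, 6, 6, 3, 4, 8,
          5, 6, 8]


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys):
    """Correlation coefficient r: sum of products of standard scores over n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_std(xs), sample_std(ys)
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


r_weight = correlation(weights, prices)  # explanatory variable: weight
r_color = correlation(colors, prices)    # explanatory variable: color

print(f"weight vs. price: r = {r_weight:.2f}")
print(f"color vs. price:  r = {r_color:.2f}")
```

Comparing the two printed values with your visual estimates from Figures 7.1 and 7.2 is a quick way to calibrate your eye against the panels of Figure 7.3.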
USING TECHNOLOGY: SCATTERPLOTS AND CORRELATION COEFFICIENTS

EXCEL
The screen shot (Microsoft Excel 2008 for Mac) shows the process for making a scatterplot like that in Figure 7.1:

1. Enter the data, which are shown in Columns B (weight) and C (price).
2. Select the columns for the two variables on the scatterplot; in this case, Columns B and C.
3. Choose “XY Scatter” as the chart type, with no connecting lines. You can then use the “chart options” (which come up with a right-click in the graph) to customize the design, axis range, labels, and more.
4. To calculate the correlation coefficient, shown in row 26, use the built-in function CORREL.
5. [Optional] The straight line on the graph, called a best-fit line, is added by choosing the option to “Add Trendline”; be sure to choose the “linear” option for the trendline. You’ll also find options that add the two items shown in the upper left of the graph: the equation of the line and the value R², which is the square of the correlation coefficient. Best-fit lines and R² are discussed in Section 7.3.

STATDISK
Enter the paired data in columns of the STATDISK Data Window. Select Analysis from the main menu bar, then select the option Correlation and Regression. Select the columns of data to be used, then click on the Evaluate button. The STATDISK display will include the value of the linear correlation coefficient r and other results. A scatterplot can also be obtained by clicking on the PLOT button.

TI-83/84 Plus
Enter the paired data in lists L1 and L2, then press STAT and select TESTS. Using the option of LinRegTTest will result in several displayed values, including the value of the linear correlation coefficient r. To obtain a scatterplot, press 2nd, then Y= (for STAT PLOT). Press ENTER twice to turn Plot 1 on, then select the first graph type, which resembles a scatterplot. Set the X list and Y list labels to L1 and L2 and press ZOOM, then select ZoomStat and press ENTER.

Calculating the Correlation Coefficient (Optional Section)

The formula for the (linear) correlation coefficient r can be expressed in several different ways that are all algebraically equivalent, which means that they produce the same value. The following expression has the advantage of relating more directly to the underlying rationale for r:

$$ r = \frac{\sum \left[ \dfrac{(x - \bar{x})}{s_x} \cdot \dfrac{(y - \bar{y})}{s_y} \right]}{n - 1} $$

In the above expression, division by n – 1 (where n is the number of pairs of data) shows that r is a type of average, so it does not increase simply because more pairs of data values are included. The symbol s_x denotes the standard deviation of the x values (or the values of the first variable), and s_y denotes the standard deviation of the y values. The expression (x – x̄)/s_x is in the same format as the standard score introduced in Section 5.2. By using the standard scores for x and y, we ensure that the value of r does not change simply because a different scale of values is used. The key to understanding the rationale for r is to focus on the product of the standard scores for x and the standard scores for y. Those products tend to be positive when there is a positive correlation, and they tend to be negative when there is a negative correlation. For data with no correlation, some of the products are positive and some are negative, with the net effect that the sum is relatively close to 0.

The following alternative formula for r has the advantage of simplifying calculations, so it is often used whenever manual calculations are necessary. The following formula is also easy to program into statistical software or calculators:

$$ r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{n (\sum x^2) - (\sum x)^2} \; \sqrt{n (\sum y^2) - (\sum y)^2}} $$

This formula is straightforward to use, at least in principle: First calculate each of the required sums, then substitute the values into the formula. Be sure to note that (Σx²) and (Σx)² are not equal: (Σx²) tells you to first square all the values of the variable x and then add them; (Σx)² tells you to add the x values first and then square this sum.
In other words, perform the operation within the parentheses first. Similarly, (Σy²) and (Σy)² are not the same.

Section 7.1 Exercises

Statistical Literacy and Critical Thinking

1. Correlation. In the context of correlation, what does r measure, and what is it called?
2. Scatterplot. What is a scatterplot, and how does it help us investigate correlation?
3. Correlation. After computing the correlation coefficient r from 50 pairs of data, you find that r = 0. Does it follow that there is no relationship between the two variables? Why or why not?
4. Scatterplot. One set of paired data results in r = 1 and a second set of paired data results in r = –1. How do the corresponding scatterplots differ?

Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

5. Births. A study showed that for one town, as the stork population increased, the number of births in the town also increased. It therefore follows that the increase in the stork population caused the number of births to increase.
6. Positive Effect. An engineer for a car company finds that by reducing the weights of various cars, mileage (mi/gal) increases. Because this is a positive result, we say that there is a positive correlation.
7. Correlation. Two studies both found a correlation between low birth weight and weakened immune systems. The second study had a much larger sample size, so the correlation it found must be stronger.
8. Interpreting r. In investigating correlations between many different pairs of variables, in each case the correlation coefficient r must fall between –1 and 1.

Concepts and Applications

Types of Correlation. Exercises 9–16 list pairs of variables. For each pair, state whether you believe the two variables are correlated. If you believe they are correlated, state whether the correlation is positive or negative. Explain your reasoning.

9. Weight/Cost. The weights and costs of 50 different bags of apples
10. IQ/Hat Size. The IQ scores and hat sizes of randomly selected adults
11. Weight/Fuel Efficiency. The total weights of airliners flying from New York to San Francisco and the fuel efficiency as measured in miles per gallon
12. Weight/Fuel Consumption. The total weights of airliners flying from New York to San Francisco and the total amounts of fuel that they consume
13. Points and DJIA. The total number of points scored in Super Bowl football games and the changes in the Dow Jones Industrial stock index in the years following those games
14. Altitude/Temperature. The outside air temperature and the altitude of aircraft
15. Height/SAT Score. The heights and SAT scores of randomly selected subjects who take the SAT
16. Golf Score/Prize Money. Golf scores and prize money won by professional golfers

17. Crickets and Temperature. One classic application of correlation involves the association between the temperature and the number of times a cricket chirps in a minute. The scatterplot in Figure 7.7 shows the relationship for eight different pairs of temperature/chirps data. Estimate the correlation coefficient and determine whether there appears to be a correlation between the temperature and the number of times a cricket chirps in a minute.

Figure 7.7 Scatterplot for cricket chirps and temperature. Source: Based on data from The Song of Insects by George W. Pierce, Harvard University Press.

18. Two-Day Forecast. Figure 7.8 shows a scatterplot in which the actual high temperature for the day is compared with a forecast made two days in advance. Estimate the correlation coefficient and discuss what these data imply about weather forecasts. Do you think you would get similar results if you made similar diagrams for other two-week periods? Why or why not?

Figure 7.8

19. Safe Speeds?
Consider the following table showing speed limits and death rates from automobile accidents in selected countries.

Country         Death rate (per 100 million vehicle-miles)   Speed limit (miles per hour)
Norway          3.0                                          55
United States   3.3                                          55
Finland         3.4                                          55
Britain         3.5                                          70
Denmark         4.1                                          55
Canada          4.3                                          60
Japan           4.7                                          55
Australia       4.9                                          65
Netherlands     5.1                                          60
Italy           6.1                                          75

Source: D. J. Rivkin, New York Times.

a. Construct a scatterplot of the data.
b. Briefly characterize the correlation in words (for example, strong positive correlation, weak negative correlation) and estimate the correlation coefficient of the data. (Or calculate the correlation coefficient exactly with the aid of a calculator or software.)
c. In the newspaper, these data were presented in an article titled “Fifty-five mph speed limit is no safety guarantee.” Based on the data, do you agree with this claim? Explain.

20. Population Growth. Consider the following table showing percentage change in population and birth rate (per 1,000 of population) for 10 states over a period of 10 years.

State           Percentage change in population   Birth rate
Nevada          50.1%                             16.3
California      25.7%                             16.9
New Hampshire   20.5%                             12.5
Utah            17.9%                             21.0
Colorado        14.0%                             14.6
Minnesota       7.3%                              13.7
Montana         1.6%                              12.3
Illinois        0%                                15.5
Iowa            –4.7%                             13.0
West Virginia   –8.0%                             11.4

Source: U.S. Census Bureau and Department of Health and Human Services.

a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c. Overall, does birth rate appear to be a good predictor of a state’s population growth rate? If not, what other factor(s) may be affecting the growth rate?

21. Brain Size and Intelligence. The table below lists brain sizes (in cm³) and Wechsler IQ scores of subjects (based on data from “Brain Size, Head Size, and Intelligence Quotient in Monozygotic Twins,” by Tramo et al., Neurology, Vol. 50, No. 5). Is there sufficient evidence to conclude that there is a linear correlation between brain size and IQ score? Does it appear that people with larger brains are more intelligent?

Brain Size   IQ
965          90
1,029        85
1,030        86
1,285        102
1,049        103
1,077        97
1,037        124
1,068        125
1,176        102
1,105        114

a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c. Do these data suggest that people with larger brains are more intelligent? Explain.

22. Movie Data. Consider the following table showing total box office receipts and total attendance for all American films.

Year   Total Gross Receipts (billions of dollars)   Tickets Sold (billions)
2001   8.4                                          1.49
2002   9.2                                          1.58
2003   9.2                                          1.53
2004   9.4                                          1.51
2005   8.8                                          1.38
2006   9.2                                          1.41
2007   9.7                                          1.40
2008   9.6                                          1.34
2009   10.6                                         1.41
2010   10.6                                         1.34

Source: Motion Picture Association of America.

a. Construct a scatterplot of the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.

23. TV Time. Consider the following table showing the average hours of television watched in households in five categories of annual income.

Household income      Weekly TV hours
Less than $30,000     56.3
$30,000 – $40,000     51.0
$40,000 – $50,000     50.5
$50,000 – $60,000     49.7
More than $60,000     48.7

Source: Nielsen Media Research.

a. Construct a scatterplot for the data. To locate the dots, use the midpoint of each income category. Use a value of $25,000 for the category “less than $30,000,” and use $70,000 for “more than $60,000.”
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c. Suggest a reason why families with higher incomes watch less TV. Do you think these data imply that you can increase your income simply by watching less TV? Explain.

24. January Weather.
Consider the following table showing January mean monthly precipitation and mean daily high temperature for ten Northern Hemisphere cities (National Oceanic and Atmospheric Administration).

City          Mean daily high temperature for January (°F)    Mean January precipitation (inches)
Athens        54                                              2.2
Bombay        88                                              0.1
Copenhagen    36                                              1.6
Jerusalem     55                                              5.1
London        44                                              2.0
Montreal      21                                              3.8
Oslo          30                                              1.7
Rome          54                                              3.3
Tokyo         47                                              1.9
Vienna        34                                              1.5

Source: The New York Times Almanac.

a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c. Can you draw any general conclusions about January temperatures and precipitation from these data? Explain.

25. Retail Sales. Consider the following table showing one year’s total sales (revenue) and profits for eight large retailers in the United States.

Company       Total sales (billions of dollars)    Profits (billions of dollars)
Wal-Mart      315.6                                11.2
Kroger        60.6                                 0.98
Home Depot    81.5                                 5.8
Costco        60.1                                 1.1
Target        52.6                                 2.4
Starbucks     7.8                                  0.6
The Gap       16.0                                 1.1
Best Buy      30.8                                 1.1

Source: Fortune.com.

a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c. Discuss your observations. Does higher sales volume necessarily translate into greater earnings? Why or why not?

26. Calories and Infant Mortality. Consider the following table showing mean daily caloric intake (all residents) and infant mortality rate (per 1,000 births) for 10 countries.

Country          Mean daily calories    Infant mortality rate (per 1,000 births)
Afghanistan      1,523                  154
Austria          3,495                  6
Burundi          1,941                  114
Colombia         2,678                  24
Ethiopia         1,610                  107
Germany          3,443                  6
Liberia          1,640                  153
New Zealand      3,362                  7
Turkey           3,429                  44
United States    3,671                  7

a. Construct a scatterplot for the data.
b. Briefly characterize the correlation in words and estimate the correlation coefficient.
c.Discuss any patterns you observe and any general conclusions that you can reach. Properties of the Correlation Coefficient. For Exercises 27 and 28, determine whether the given property is true, and explain your answer. 27. Interchanging Variables. The correlation coefficient remains unchanged if we interchange the variables x and y. 28. Changing Units of Measurement. The correlation coefficient remains unchanged if we change the units used to measure x, y, or both. PROJECTS FOR THE INTERNET & BEYOND 29. Unemployment and Inflation. Use the Bureau of Labor Statistics Web page to find monthly unemployment rates and inflation rates over the past year. Construct a scatter-plot for the data. Do you see any trends? 30. Success in the NFL. Find last season’s NFL team statistics. Construct a table showing the following for each team: number of wins, average yards gained on offense per game, and average yards allowed on defense per game. Make scatterplots to explore the correlations between offense and wins and between defense and wins. Discuss your findings. Do you think that there are other team statistics that would yield stronger correlations with the number of wins? 31. Statistical Abstract. Explore the “frequently requested tables” at the Web site for the Statistical Abstract of the United States. Choose data that are of interest to you and explore at least two correlations. Briefly discuss what you learn from the correlations. 32. Height and Arm Span. Select a sample of at least eight people and measure each person’s height and arm span. (When you measure arm span, the person should stand with arms extended like the wings on an airplane.) Using the paired sample data, construct a scatterplot and estimate or calculate the value of the correlation coefficient. What do you conclude? 33. Height and Pulse Rate. Select a sample of at least eight people and record each person’s pulse rate by counting the number of heartbeats in 1 minute. Also record each person’s height. 
Using the paired sample data, construct a scatterplot and estimate or calculate the value of the correlation coefficient. What do you conclude? IN THE NEWS 34. Correlations in the News. Find a recent news report that discusses some type of correlation. Describe the correlation. Does the article give any sense of the strength of the correlation? Does it suggest that the correlation reflects any underlying causality? Briefly discuss whether you believe the implications the article makes with respect to the correlation. 35. Your Own Positive Correlations. Give examples of two variables that you expect to be positively correlated. Explain why the variables are correlated and why the correlation is (or is not) important. 36. Your Own Negative Correlations. Give examples of two variables that you expect to be negatively correlated. Explain why the variables are correlated and why the correlation is (or is not) important. 7.2 INTERPRETING CORRELATIONS • Statistics show that of those who contract the habit of eating, very few survive. — Wallace Irwin Researchers sifting through statistical data are constantly looking for meaningful correlations, and the discovery of a new and surprising correlation often leads to a flood of news reports. You may recall hearing about some of these discovered correlations: dark chocolate consumption correlated with reduced risk of heart disease; musical talent correlated with good grades in mathematics; or eating less correlated with increased longevity. Unfortunately, the task of interpreting such correlations is far more difficult than discovering them in the first place. Long after the news reports have faded, we may still be unsure of whether the correlations are significant and, if so, whether they tell us anything of practical importance. In this section, we discuss some of the common difficulties associated with interpreting correlations. Beware of Outliers Examine the scatterplot in Figure 7.9. 
Your eye probably tells you that there is a positive correlation in which larger values of x tend to mean larger values of y. Indeed, if you calculate the correlation coefficient for these data, you’ll find that it is a relatively high r = 0.880, suggesting a very strong correlation. Figure 7.9 How does the outlier affect the correlation? However, if you place your thumb over the data point in the upper right corner of Figure 7.9, the apparent correlation disappears. In fact, without this data point, the correlation coefficient is zero! In other words, removing this one data point changes the correlation coefficient from r = 0.880 to r= 0. This example shows that correlations can be very sensitive to outliers. Recall that an outlier is a data value that is extreme compared to most other values in a data set (see Section 4.1). We must therefore examine outliers and their effects carefully before interpreting a correlation. On the one hand, if the outliers are mistakes in the data set, they can produce apparent correlations that are not real or mask the presence of real correlations. On the other hand, if the outliers represent real and correct data points, they may be telling us about relationships that would otherwise be difficult to see. Note that while we should examine outliers carefully, we should not remove them unless we have strong reason to believe that they do not belong in the data set. Even in that case, good research principles demand that we report the outliers along with an explanation of why we thought it legitimate to remove them. EXAMPLE Masked Correlation You’ve conducted a study to determine how the number of calories a person consumes in a day correlates with time spent in vigorous bicycling. Your sample consisted of ten women cyclists, all of approximately the same height and weight. Over a period of two weeks, you asked each woman to record the amount of time she spent cycling each day and what she ate on each of those days. 
You used the eating records to calculate the calories consumed each day. Figure 7.10 shows a scatterplot with each woman’s mean time spent cycling on the horizontal axis and mean caloric intake on the vertical axis. Do higher cycling times correspond to higher intake of calories? Figure 7.10 Data from the cycling study. SOLUTION If you look at the data as a whole, your eye will probably tell you that there is a positive correlation in which greater cycling time tends to go with higher caloric intake. But the correlation is very weak, with a correlation coefficient of r = 0.374. However, notice that two points are outliers: one representing a cyclist who cycled about a half-hour per day and consumed more than 3,000 calories, and the other representing a cyclist who cycled more than 2 hours per day on only 1,200 calories. It’s difficult to explain the two outliers, given that all the women in the sample have similar heights and weights. We might therefore suspect that these two women either recorded their data incorrectly or were not following their usual habits during the two-week study. If we can confirm this suspicion, then we would have reason to delete the two data points as invalid. Figure 7.11 shows that the correlation is quite strong without those two outlier points, and suggests that the number of calories consumed rises by a little more than 500 calories for each hour of cycling. Of course, we should not remove the outliers without confirming our suspicion that they were invalid data points, and we should report our reasons for leaving them out. Figure 7.11 The data from Figure 7.10 without the two outliers. Beware of Inappropriate Grouping Correlations can also be misinterpreted when data are grouped inappropriately. In some cases, grouping data hides correlations. Consider a (hypothetical) study in which researchers seek a correlation between hours of TV watched per week and high school grade point average (GPA). 
They collect the 21 data pairs in Table 7.2. The scatterplot (Figure 7.12) shows virtually no correlation; the correlation coefficient for the data is about r = –0.063. The lack of correlation seems to suggest that TV viewing habits are unrelated to academic achievement. However, one astute researcher realizes that some of the students watched mostly educational programs, while others tended to watch comedies, dramas, and movies. She therefore divides the data set into two groups, one for the students who watched mostly educational television and one for the other students. Table 7.3 shows her results with the students divided into these two groups.

Figure 7.12 The full set of data concerning hours of TV and GPA shows virtually no correlation.

TABLE 7.2 Hours of TV and High School GPA (hypothetical data)

Hours per week of TV    GPA
2                       3.2
4                       3.0
4                       3.1
5                       2.5
5                       2.9
5                       3.0
6                       2.5
7                       2.7
7                       2.8
8                       2.7
9                       2.5
9                       2.9
10                      3.4
12                      3.6
12                      2.5
14                      3.5
14                      2.3
15                      3.7
16                      2.0
20                      3.6
20                      1.9

Now we find two very strong correlations (Figure 7.13): a strong positive correlation for the students who watched educational programs (r = 0.855) and a strong negative correlation for the other students (r = –0.951). The moral of this story is that the original data set hid an important (hypothetical) correlation between TV and GPA: Watching educational TV correlated positively with GPA and watching non-educational TV correlated negatively with GPA. Only when the data were grouped appropriately could this discovery be made.

TABLE 7.3 Hours of TV and High School GPA: Grouped Data (hypothetical data)

Group 1: watched educational programs

Hours per week of TV    GPA
5                       2.5
7                       2.8
8                       2.7
9                       2.9
10                      3.4
12                      3.6
14                      3.5
15                      3.7
20                      3.6

Group 2: watched regular TV

Hours per week of TV    GPA
2                       3.2
4                       3.0
4                       3.1
5                       2.9
5                       3.0
6                       2.5
7                       2.7
9                       2.5
12                      2.5
14                      2.3
16                      2.0
20                      1.9

BY THE WAY Children ages 2–5 watch an average of 26 hours of television per week, while children ages 6–11 watch an average of 20 hours of television per week (Nielsen Media Research). Adult viewership averages more than 25 hours per week. If the average adult replaced television time with a job paying just $8 per hour, his or her annual income would rise by more than $10,000.

Figure 7.13 These scatterplots show the same data as Figure 7.12, separated into the two groups identified in Table 7.3.

In other cases, a data set may show a stronger correlation than actually exists among subgroups. Consider the (hypothetical) data in Table 7.4, showing the relationship between the weights and prices of selected cars. Figure 7.14 shows the scatterplot. The data set as a whole shows a strong correlation; the correlation coefficient is r = 0.949. However, on closer examination, we see that the data fall into two rather distinct categories corresponding to light and heavy cars. If we analyze these subgroups separately, neither shows any correlation: The light cars alone (top six in Table 7.4) have a correlation coefficient r = 0.019 and the heavy cars alone (bottom six in Table 7.4) have a correlation coefficient r = –0.022. You can see the problem by looking at Figure 7.14. The apparent correlation of the full data set occurs because of the separation between the two clusters of points; there’s no correlation within either cluster.

TABLE 7.4 Car Weights and Prices (hypothetical data)

Weight (pounds)    Price (dollars)
1,500              9,500
1,600              8,000
1,700              8,200
1,750              9,500
1,800              9,200
1,800              8,700
3,000              29,000
3,500              25,000
3,700              27,000
4,000              31,000
3,600              25,000
3,200              30,000

Figure 7.14 Scatterplot for the car weight and price data in Table 7.4.

TIME OUT TO THINK Suppose you were shopping for a compact car. If you looked at only the overall data and correlation coefficient from Figure 7.14, would it be reasonable to consider weight as an important factor in price? What if you looked at the data for light and heavy cars separately? Explain.
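The grouping effect in Table 7.4 can be checked numerically. The following Python sketch recomputes the correlation coefficient for all twelve cars and then for the light and heavy groups separately. It assumes that the r quoted in the text is the usual Pearson correlation coefficient; the helper name `pearson_r` and the variable names are illustrative and do not come from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Table 7.4 (hypothetical data from the text): weight in pounds, price in dollars.
light_weights = [1500, 1600, 1700, 1750, 1800, 1800]
light_prices = [9500, 8000, 8200, 9500, 9200, 8700]
heavy_weights = [3000, 3500, 3700, 4000, 3600, 3200]
heavy_prices = [29000, 25000, 27000, 31000, 25000, 30000]

r_all = pearson_r(light_weights + heavy_weights, light_prices + heavy_prices)
r_light = pearson_r(light_weights, light_prices)
r_heavy = pearson_r(heavy_weights, heavy_prices)

print(f"all cars:   r = {r_all:.3f}")
print(f"light cars: r = {r_light:.3f}")
print(f"heavy cars: r = {r_heavy:.3f}")
```

With the data entered as above, the three values should come out close to the figures quoted in the text (about 0.949 for all cars, and values near zero for each group alone), which is the pattern the two separated clusters in Figure 7.14 would produce.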
CASE STUDY Fishing for Correlations

Oxford physician Richard Peto submitted a paper to the British medical journal Lancet showing that heart-attack victims had a better chance of survival if they were given aspirin within a few hours after their heart attacks. The editors of Lancet asked Peto to break down the data into subsets, to see whether the benefits of the aspirin were different for different groups of patients. For example, was aspirin more effective for patients of a certain age or for patients with certain dietary habits?

Breaking the data into subsets can reveal important facts, such as whether men and women respond to the treatment differently. However, Peto felt that the editors were asking him to divide his sample into too many subgroups. He therefore objected to the request, arguing that it would result in purely coincidental correlations. Writing about this story in the Washington Post, journalist Rick Weiss said, “When the editors insisted, Peto capitulated, but among other things he divided his patients by zodiac birth signs and demanded that his findings be included in the published paper. Today, like a warning sign to the statistically uninitiated, the wacky numbers are there for all to see: Aspirin is useless for Gemini and Libra heart-attack victims but is a lifesaver for people born under any other sign.”

The moral of this story is that a “fishing expedition” for correlations can often produce them. That doesn’t make the correlations meaningful, even though they may appear significant by standard statistical measures.

Correlation Does Not Imply Causality

Perhaps the most important caution about interpreting correlations is one we’ve already mentioned: Correlation does not necessarily imply causality. In general, correlations can appear for any of the following three reasons.

Possible Explanations for a Correlation
1. The correlation may be a coincidence.
2.
Both correlation variables might be directly influenced by some common underlying cause. 3. One of the correlated variables may actually be a cause of the other. But note that, even in this case, it may be just one of several causes. For example, the correlation between infant mortality and life expectancy in Figure 7.4 is a case of common underlying cause: Both variables respond to the underlying variable quality of health care. The correlation between smoking and lung cancer reflects the fact that smoking causes lung cancer (see the discussion in Section 7.4). Coincidental correlations are also quite common; Example 2 below discusses one such case. Caution about causality is particularly important in light of the fact that many statistical studies are designed to look for causes. Because these studies generally begin with the search for correlations, it’s tempting to think that the work is over as soon as a correlation is found. However, as we will discuss in Section 7.4, establishing causality can be very difficult. EXAMPLE (Maybe) How to Get Rich in the Stock Market Every financial advisor has a strategy for predicting the direction of the stock market. Most focus on fundamental economic data, such as interest rates and corporate profits. But an alternative strategy might rely on a famous correlation between the Super Bowl winner in January and the direction of the stock market for the rest of the year: The stock market tends to rise when a team from the old, pre-1970 NFL wins the Super Bowl and tends to fall when the winner is not from the old NFL. This correlation successfully matched 28 of the first 32 Super Bowls to the stock market, which made the “Super Bowl Indicator” a far more reliable predictor of the stock market than any professional stock broker during the same period. In fact, detailed calculations show that the probability of such success by pure chance is less than 1 in 100,000. 
Should you therefore make a decision about whether to invest in the stock market based on the NFL origins of the most recent Super Bowl winner?

SOLUTION The extremely strong correlation might make it seem like a good idea to base your investments on the Super Bowl Indicator, but sometimes you need to apply a bit of common sense. No matter how strong the correlation might be, it seems inconceivable that the origin of the winning team actually causes the stock market to move in a particular direction. The correlation is undoubtedly a coincidence, and the fact that its probability of occurring by pure chance was less than 1 in 100,000 is just another illustration of the fact that you can turn up surprising correlations if you go fishing for them. This fact was borne out in more recent Super Bowls: Following Super Bowl 32, the indicator successfully predicted the stock market direction in only 5 of the next 10 years, exactly the fraction that would be expected by pure chance.

CASE STUDY Oat Bran and Heart Disease

If you buy a product that contains oat bran, there’s a good chance that the label will tout the healthful effects of eating oats. Indeed, several studies have found correlations in which people who eat more oat bran tend to have lower rates of heart disease. But does this mean that everyone should eat more oats? Not necessarily. Just because oat bran consumption is correlated with reduced risk of heart disease does not mean that it causes reduced risk of heart disease. In fact, the question of causality is quite controversial in this case. Other studies suggest that people who eat a lot of oat bran tend to have generally healthful diets. Thus, the correlation between oat bran consumption and reduced risk of heart disease may be a case of a common underlying cause: Having a healthy diet leads people both to consume more oat bran and to have a lower risk of heart disease.
In that case, for some people, adding oat bran to their diets might be a bad idea because it could cause them to gain weight, and weight gain is associated with increased risk of heart disease. This example shows the importance of using caution when considering issues of correlation and causality. It may be a long time before medical researchers know for sure whether adding oat bran to your diet actually causes a reduced risk of heart disease. Useful Interpretations of Correlation In discussing uses of correlation that might lead to wrong interpretations, we have described the effects of outliers, inappropriate groupings, fishing for correlations, and incorrectly concluding that correlation implies causality. But there are many correct and useful interpretations of correlation, some of which we have already studied. So while you should be cautious in interpreting correlations, they remain a valuable tool in any field in which statistical research plays a role. Section 7.2 Exercises Statistical Literacy and Critical Thinking 1. Correlation and Causality. In clinical trials of the drug Lisinopril, it is found that increased dosages of the drug correlated with lower blood pressure levels. Based on the correlation, can we conclude that Lisinopril treatments cause lower blood pressure? Why or why not? 2. SIDS. An article in the New York Times on infant deaths included a statement that, based on the study results, putting infants to sleep in the supine position decreased deaths due to SIDS (sudden infant death syndrome). What is wrong with that statement? 3. Outliers. When studying salaries paid to CEOs of large companies, it is found that almost all of them range from a few hundred thousand dollars to several million dollars, but one CEO is paid a salary of $1. Is that salary of $1 an outlier? In general, how might outliers affect conclusions about correlation? 4. Scatterplot. 
Does a scatterplot reveal anything about a cause and effect relationship between two variables? Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer. 5. Scatterplot. A set of paired sample data results in a correlation coefficient of r = 0, so the scatterplot will show that there is no pattern of the plotted points. 6. Causation. If we have 20 pairs of sample data with a correlation coefficient of 1, then we know that one of the two variables is definitely the cause of the other. 7. Causation. If we conduct a study showing that there is a strong negative correlation between resting pulse rate and amounts of time spent in rigorous exercise, we can conclude decreases in resting pulse rates are somehow associated with increases in exercise. 8. Causation. If we have two variables with one being the direct cause of the other, then there may or may not be a correlation between those two variables. Concepts and Applications Correlation and Causality. Exercises 9–16 make statements about a correlation. In each case, state the correlation clearly. (For example, we might state that “there is a positive correlation between variable A and variable B.”) Then state whether the correlation is most likely due to coincidence, a common underlying cause, or a direct cause. Explain your answer. 9. Guns and Crime Rate. In one state, the number of unregistered handguns steadily increased over the past several years, and the crime rate increased as well. 10. Running and Weight. It has been found that people who exercise regularly by running tend to weigh less than those who do not run, and those who run longer distances tend to weigh less than those who run shorter distances. 11. Study Time. 
Statistics students find that as they spend more time studying, their test scores are higher. 12. Vehicles and Waiting Time. It has been found that as the number of registered vehicles increases, the time drivers spend sitting in traffic also increases. 13. Traffic Lights and Car Crashes. It has been found that as the number of traffic lights increases, the number of car crashes also increases. 14. Galaxies. Astronomers have discovered that, with the exception of a few nearby galaxies, all galaxies in the universe are moving away from us. Moreover, the farther the galaxy, the faster it is moving away. That is, the more distant a galaxy, the greater the speed at which it is moving away from us. 15. Gas and Driving. It has been found that as gas prices increase, the distances vehicles are driven tend to get shorter. 16. Melanoma and Latitude. Some studies have shown that, for certain ethnic groups, the incidence of melanoma (the most dangerous form of skin cancer) increases as latitude decreases. 17. Outlier Effects. Consider the scatterplot in Figure 7.15. Figure 7.15 • • a.Which point is an outlier? Ignoring the outlier, estimate or compute the correlation coefficient for the remaining points. b.Now include the outlier. How does the outlier affect the correlation coefficient? Estimate or compute the correlation coefficient for the complete data set. 18. Outlier Effects. Consider the scatterplot in Figure 7.16. Figure 7.16 • • a.Which point is an outlier? Ignoring the outlier, estimate or compute the correlation coefficient for the remaining points. b.Now include the outlier. How does the outlier affect the correlation coefficient? Estimate or compute the correlation coefficient for the complete data set. 19. Grouped Shoe Data. The following table gives measurements of weight and shoe size for 10 people (including both men and women). • a.Construct a scatterplot for the data. Estimate or compute the correlation coefficient. 
Based on this correlation coefficient, would you conclude that shoe size and weight are correlated? Explain.

Weight (pounds)    Shoe size
105                6
112                4.5
115                6
123                5
135                6
155                10
165                11
170                9
180                10
190                12

b. You later learn that the first five data values in the table are for women and the next five are for men. How does this change your view of the correlation? Is it still reasonable to conclude that shoe size and weight are correlated?

20. Grouped Temperature Data. The following table shows the average January high temperature and the average July high temperature for 10 major cities around the world.

City            January high    July high
Berlin          35              74
Geneva          39              77
Kabul           36              92
Montreal        21              78
Prague          34              74
Auckland        73              56
Buenos Aires    85              57
Sydney          78              60
Santiago        85              59
Melbourne       78              56

a. Construct a scatterplot for the data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that January and July temperatures are correlated for these cities? Explain.
b. Notice that the first five cities in the table are in the Northern Hemisphere and the next five are in the Southern Hemisphere. How does this change your view of the correlation? Would you now conclude that January and July temperatures are correlated for these cities? Explain.

21. Birth and Death Rates. Figure 7.17 shows the birth and death rates for different countries, measured in births and deaths per 1,000 population.

Figure 7.17 Birth and death rates for different countries. Source: United Nations.

a. Estimate the correlation coefficient and discuss whether there is a strong correlation between the variables.
b. Notice that there appear to be two groups of data points within the full data set. Make a reasonable guess as to the makeup of these groups. In which group might you find a relatively wealthy country like Sweden? In which group might you find a relatively poor country like Uganda?
c. Assuming that your guess about groups in part b is correct, do there appear to be correlations within the groups? Explain. How could you confirm your guess about the groups?

22. Reading and Test Scores. The following (hypothetical) data set gives the number of hours 10 sixth-graders read per week and their performance on a standardized verbal test (maximum of 100).

Reading time per week (hours)    Verbal test score
1                                50
1                                65
2                                56
3                                62
3                                65
4                                60
5                                75
6                                50
10                               88
12                               38

a. Construct a scatterplot for these data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that reading time and test scores are correlated? Explain.
b. Suppose you learn that five of the children read only comic books while the other five read regular books. Make a guess as to which data points fall in which group. How could you confirm your guess about the groups?
c. Assuming that your guess in part b is correct, how does it change your view of the correlation between reading time and test scores? Explain.

PROJECTS FOR THE INTERNET & BEYOND

23. Football-Stock Update. Find data for recent years concerning the Super Bowl winner and the end-of-year change in the stock market (positive or negative). Do recent results still agree with the correlation described in Example 2? Explain.

24. Real Correlations.
a. Describe a real situation in which there is a positive correlation that is the result of coincidence.
b. Describe a real situation in which there is a positive correlation that is the result of a common underlying cause.
c. Describe a real situation in which there is a positive correlation that is the result of a direct cause.
d. Describe a real situation in which there is a negative correlation that is the result of coincidence.
e. Describe a real situation in which there is a negative correlation that is the result of a common underlying cause.
f.Describe a real situation in which there is a negative correlation that is the result of a direct cause. IN THE NEWS 25. Misinterpreted Correlations. Find a recent news report in which you believe that a correlation may have been misinterpreted. Describe the correlation, the reported interpretation, and the problems you see in the interpretation. 26. Well-Interpreted Correlations. Find a recent news report in which you believe that a correlation has been presented with a reasonable interpretation. Describe the correlation and the reported interpretation, and explain why you think the interpretation is valid. 7.3 BEST-FIT LINES AND PREDICTION Suppose you are lucky enough to win a 1.5-carat diamond in a contest. Based on the correlation between weight and price in Figure 7.1, it should be possible to predict the approximate value of the diamond. We need only study the graph carefully and decide where a point corresponding to 1.5 carats is most likely to fall. To do this, it is helpful to draw a best-fit line (also called a regression line) through the data, as shown in Figure 7.18. This line is a “best fit” in the sense that, according to a standard statistical measure (which we discuss shortly), the data points lie closer to this line than to any other straight line that we could draw through the data. Figure 7.18 Best-fit line for the data from Figure 7.1. BY THE WAY The term regression comes from an 1877 study by Sir Francis Galton. He found that the heights of boys with short or tall fathers were closer to the mean than were the heights of their fathers. He therefore said that the heights of the children regress toward the mean, from which we get the term regression. The term is now used even for data that have nothing to do with a tendency to regress toward a mean. 
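One common choice for the “standard statistical measure” of closeness is the least-squares criterion, which picks the line that minimizes the sum of the squared vertical distances between the data points and the line. The text has not yet named its measure, so treat that choice as an assumption here. Because the diamond data behind Figure 7.18 are not listed in this section, the sketch below instead fits a line to the Group 1 data from Table 7.3 and uses it for one prediction inside the range of the data. The function and variable names are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sum of products of deviations divided by the
    # sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Table 7.3, Group 1 (hypothetical data from the text).
hours = [5, 7, 8, 9, 10, 12, 14, 15, 20]
gpa = [2.5, 2.8, 2.7, 2.9, 3.4, 3.6, 3.5, 3.7, 3.6]

slope, intercept = fit_line(hours, gpa)

# 10 hours per week lies inside the observed range of 5 to 20 hours.
predicted_gpa = slope * 10 + intercept

print(f"best-fit line: GPA = {slope:.3f} * hours + {intercept:.2f}")
print(f"predicted GPA at 10 hours per week: {predicted_gpa:.2f}")
```

Writing the slope as a ratio of sums of deviations keeps the code close to a hand calculation, so each intermediate quantity can be checked against a calculator or spreadsheet.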
Definition

The best-fit line (or regression line) on a scatterplot is a line that lies closer to the data points than any other possible line (according to a standard statistical measure of closeness).

Of all the possible straight lines that can be drawn on a diagram, how do you know which one is the best-fit line? In many cases, you can make a good estimate of the best-fit line simply by looking at the data and drawing the line that visually appears to pass closest to all the data points. This method involves drawing the best-fit line “by eye.” As you might guess, there are methods for calculating the precise equation of a best-fit line (see the optional topic at the end of this section), and many computer programs and calculators can do these calculations automatically. For our purposes in this text, a fit by eye will generally be sufficient.

Predictions with Best-Fit Lines

We can use the best-fit line in Figure 7.18 to predict the price of a 1.5-carat diamond. As indicated by the dashed lines in the figure, the best-fit line predicts that the diamond will cost about $9,000. Notice, however, that two actual data points in the figure correspond to 1.5-carat diamonds, and both of these diamonds cost less than $9,000. That is, although the predicted price of $9,000 sounds reasonable, it is certainly not guaranteed. In fact, the degree of scatter among the data points in this case tells us that we should not trust the best-fit line to predict accurately the price for any individual diamond. Instead, the prediction is meaningful only in a statistical sense: It tells us that if we examined many 1.5-carat diamonds, their mean price would be about $9,000.

This is only the first of several important cautions about interpreting predictions with best-fit lines. A second caution is to beware of using best-fit lines to make predictions that go beyond the bounds of the available data.
Figure 7.19 shows a best-fit line for the correlation between infant mortality and longevity from Figure 7.4. According to this line, a country with a life expectancy of more than about 80 years would have a negative infant mortality rate, which is impossible. • It is a capital mistake to theorize before one has data. —Arthur Conan Doyle Figure 7.19 A best-fit line for the correlation between infant mortality and longevity from Figure 7.4. Source: United Nations. A third caution is to avoid using best-fit lines from old data sets to make predictions about current or future results. For example, economists studying historical data found a strong negative correlation between unemployment and the rate of inflation. According to this correlation, inflation should have risen dramatically in the mid-2000s when the unemployment rate fell below 6%. But inflation remained low, showing that the correlation from old data did not continue to hold. Fourth, a correlation discovered with a sample drawn from a particular population cannot generally be used to make predictions about other populations. For example, we can’t expect that the correlation between aspirin consumption and heart attacks in an experiment involving only men will also apply to women. • It’s tough to make predictions, especially about the future. —attributed to Niels Bohr, Yogi Berra, and others Fifth, remember that we can draw a best-fit line through any data set, but that line is meaningless when the correlation is not significant or when the relationship is nonlinear. For example, there is no correlation between shoe size and IQ, so we could not use shoe size to predict IQ. Cautions in Making Predictions from Best-Fit Lines • • • • • 1. Don’t expect a best-fit line to give a good prediction unless the correlation is strong and there are many data points. If the sample points lie very close to the best-fit line, the correlation is very strong and the prediction is more likely to be accurate. 
If the sample points lie away from the best-fit line by substantial amounts, the correlation is weak and predictions tend to be much less accurate.
2. Don’t use a best-fit line to make predictions beyond the bounds of the data points to which the line was fit.
3. A best-fit line based on past data is not necessarily valid now and might not result in valid predictions of the future.
4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.
5. Remember that a best-fit line is meaningless when there is no significant correlation or when the relationship is nonlinear.

EXAMPLE Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

a. You’ve found a best-fit line for a correlation between the number of hours per day that people exercise and the number of calories they consume each day. You’ve used this correlation to predict that a person who exercises 18 hours per day would consume 15,000 calories per day.
b. There is a well-known but weak correlation between SAT scores and college grades. You use this correlation to predict the college grades of your best friend from her SAT scores.
c. Historical data have shown a strong negative correlation between national birth rates and affluence. That is, countries with greater affluence tend to have lower birth rates. These data predict a high birth rate in Russia.
d. A study in China has discovered correlations that are useful in designing museum exhibits that Chinese children enjoy. A curator suggests using this information to design a new museum exhibit for Atlanta-area school children.
e. Scientific studies have shown a very strong correlation between children’s ingesting of lead and mental retardation. Based on this correlation, paints containing lead were banned.
f.
Based on a large data set, you’ve made a scatterplot for salsa consumption (per person) versus years of education. The diagram shows no significant correlation, but you’ve drawn a best-fit line anyway. The line predicts that someone who consumes a pint of salsa per week has at least 13 years of education.

SOLUTION

a. No one exercises 18 hours per day on an ongoing basis, so this much exercise must be beyond the bounds of any data collected. Therefore, a prediction about someone who exercises 18 hours per day should not be trusted.
b. The fact that the correlation between SAT scores and college grades is weak means there is much scatter in the data. As a result, we should not expect great accuracy if we use this weak correlation to make a prediction about a single individual.
c. We cannot automatically assume that the historical data still apply today. In fact, Russia currently has a very low birth rate, despite also having a low level of affluence.
d. The suggestion to use information from the Chinese study for an Atlanta exhibit assumes that predictions made from correlations in China also apply to Atlanta. However, given the cultural differences between China and Atlanta, the curator’s suggestion should not be considered without more information to back it up.
e. Given the strength of the correlation and the severity of the consequences, this prediction and the ban that followed seem quite reasonable. In fact, later studies established lead as an actual cause of mental retardation, making the rationale behind the ban even stronger.
f. Because there is no significant correlation, the best-fit line and any predictions made from it are meaningless.

BY THE WAY In the United States, lead was banned from house paint in 1978 and from food cans in 1991, and a 25-year phaseout of lead in gasoline was completed in 1995. Nevertheless, many young children, especially children living in poor areas, still have enough lead in their blood to damage their health.
Major sources of ongoing lead hazards include paint in older housing and soil near major roads, which has high lead content from past use of leaded gasoline.

EXAMPLE Will Women Be Faster Than Men?

Figure 7.20 shows data and best-fit lines for both men’s and women’s world record times in the 1-mile race. Based on these data, predict when the women’s world record will be faster than the men’s world record. Comment on the prediction.

Figure 7.20 World record times in the mile (men and women).

SOLUTION If we accept the best-fit lines as drawn, the women’s world record will equal the men’s world record by about 2040. However, this is not a valid prediction because it is based on extending the best-fit lines beyond the range of the actual data. In fact, notice that the most recent world records (as of 2011) date all the way back to 1999 for men and 1996 for women, while the best-fit lines predict that the records should have fallen by several more seconds since those dates.

The Correlation Coefficient and Best-Fit Lines

Earlier, we discussed the correlation coefficient as one way of measuring the strength of a correlation. We can also use the correlation coefficient to say something about the validity of predictions with best-fit lines. For mathematical reasons (not discussed in this text), the square of the correlation coefficient, or r², is the proportion of the variation in a variable that is accounted for by the best-fit line (or, more technically, by the linear relationship that the best-fit line expresses). For example, the correlation coefficient for the diamond weight and price data (see Figure 7.18) turns out to be r = 0.777. If we square this value, we get r² = 0.604, which we can interpret as follows: About 0.6, or 60%, of the variation in the diamond prices is accounted for by the best-fit line relating weight and price.
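The arithmetic behind this interpretation is easy to check. The short Python sketch below squares the correlation coefficient quoted in the text and reports the share of variation left to other factors; the function name variation_explained is a hypothetical helper chosen for illustration, not something from the textbook.

```python
def variation_explained(r):
    """Return r squared, the proportion of variation accounted for by the best-fit line."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r


# r = 0.777 is the diamond weight-and-price correlation quoted in the text.
diamond_r2 = variation_explained(0.777)
unexplained = 1.0 - diamond_r2

print(f"accounted for by the line: {diamond_r2:.3f}")
print(f"left to other factors:     {unexplained:.3f}")
```

Because the sign of r disappears when it is squared, a negative correlation of the same size would account for the same proportion of the variation.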
That leaves 40% of the variation in price that must be due to other factors, presumably such things as depth, table, color, and clarity, which is why predictions made with the best-fit line in Figure 7.18 are not very precise. A best-fit line can give precise predictions only in the case of a perfect correlation (r = 1 or r = –1); we then find r² = 1, which means that 100% of the variation in a variable can be accounted for by the best-fit line. In this special case of r² = 1, predictions should be exactly correct, except for the fact that the sample data might not be a true representation of the population data.

Best-Fit Lines and r²

The square of the correlation coefficient, or r², is the proportion of the variation in a variable that is accounted for by the best-fit line.

TECHNICAL NOTE Statisticians call r² the coefficient of determination.

EXAMPLE Retail Hiring

You are the manager of a large department store. Over the years, you’ve found a strong correlation between your September sales and the number of employees you’ll need to hire for peak efficiency during the holiday season; the correlation coefficient is 0.950. This year your September sales are fairly strong. Should you start advertising for help based on the best-fit line?

SOLUTION In this case, we find that r² = 0.950² = 0.903, which means that 90% of the variation in the number of peak employees can be accounted for by a linear relationship with September sales. That leaves only 10% of the variation in the number of peak employees unaccounted for. Because 90% is so high, we conclude that the best-fit line accounts for the data quite well, so it seems reasonable to use it to predict the number of employees you’ll need for this year’s holiday season.

EXAMPLE Voter Turnout and Unemployment

Political scientists are interested in knowing what factors affect voter turnout in elections. One such factor is the unemployment rate.
Data collected in presidential election years since 1964 show a very weak negative correlation between voter turnout and the unemployment rate, with a correlation coefficient of about r = –0.1 (Figure 7.21). Based on this correlation, should we use the unemployment rate to predict voter turnout in the next presidential election?

Figure 7.21 Data on voter turnout and unemployment, 1964–2008. Source: U.S. Bureau of Labor Statistics.

SOLUTION The square of the correlation coefficient is r² = (–0.1)² = 0.01, which means that only about 1% of the variation in the data is accounted for by the best-fit line. Nearly all of the variation in the data must therefore be explained by other factors. We conclude that unemployment is not a reliable predictor of voter turnout.

Multiple Regression

If you’ve ever purchased a diamond, you might have been surprised that we found such a weak correlation between color and price in Figure 7.2. Surely a diamond cannot be very valuable if it has poor color quality. Perhaps color helps to explain why the correlation between weight and price is not perfect. For example, maybe differences in color explain why two diamonds with the same weight can have different prices. To check this idea, it would be nice to look for a correlation between the price and some combination of weight and color together.

All who drink his remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases. (Galen, Roman “doctor”)

TIME OUT TO THINK

Check this idea in Table 7.1. Notice, for example, that Diamonds 4 and 5 have nearly identical weights, but Diamond 4 costs only $4,299 while Diamond 5 costs $9,589. Can differences in their color explain the different prices? Study other examples in Table 7.1 in which two diamonds have similar weights but different prices.
Overall, do you think that the correlation with price would be stronger if we used weight and color together instead of either one alone? Explain.

There is a method for investigating a correlation between one variable (such as price) and a combination of two or more other variables (such as weight and color). The technique is called multiple regression, and it essentially allows us to find a best-fit equation that relates three or more variables (instead of just two). Because it involves more than two variables, we cannot make simple diagrams to show best-fit equations for multiple regression. However, it is still possible to calculate a measure of how well the data fit a linear equation. The most common measure in multiple regression is the coefficient of determination, denoted R². It tells us how much of the scatter in the data is accounted for by the best-fit equation. If R² is close to 1, the best-fit equation should be very useful for making predictions within the range of the data values. If R² is close to zero, then predictions with the best-fit equation are essentially useless.

Definition

The use of multiple regression allows the calculation of a best-fit equation that represents the best fit between one variable (such as price) and a combination of two or more other variables (such as weight and color). The coefficient of determination, R², tells us the proportion of the scatter in the data accounted for by the best-fit equation.

In this text, we will not describe methods for finding best-fit equations by multiple regression. However, you can use the value of R² to interpret results from multiple regression. For example, the correlation between price and weight and color together results in a value of R² = 0.79. This is somewhat higher than the r² = 0.60 that we found for the correlation between price and weight alone.
Statisticians who study diamond pricing know that they can get stronger correlations by including additional variables in the multiple regression (such as depth, table, and clarity). Given the billions of dollars spent annually on diamonds, you can be sure that statisticians play prominent roles in helping diamond dealers realize the largest possible profits.

BY THE WAY One study of alumni donations found that, in developing a multiple regression equation, one should include these variables: income, age, marital status, whether the donor belonged to a fraternity or sorority, whether the donor is active in alumni affairs, the donor’s distance from the college, and the nation’s unemployment rate, used as a measure of the economy (Bruggink and Siddiqui, “An Econometric Model of Alumni Giving: A Case Study for a Liberal Arts College,” The American Economist, Vol. 39, No. 2).

EXAMPLE Alumni Contributions

You’ve been hired by your college’s alumni association to research how past contributions were associated with alumni income and years that have passed since graduation. It is found that R² = 0.36. What does that result tell us?

SOLUTION With R² = 0.36, we conclude that 36% of the variation in past contributions can be explained by the variation in alumni income and years since graduation. It follows that 64% of the variation in past contributions can be explained by factors other than alumni income level and years since graduation. Because such a large proportion of the variation can be explained by other factors, it would make sense to try to identify any other factors that might have a strong effect on past contributions.

Finding Equations for Best-Fit Lines (Optional Section)

The mathematical technique for finding the equation of a best-fit line is based on the following basic ideas. If we draw any line on a scatterplot, we can measure the vertical distance between each data point and that line.
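To make the idea of vertical distances concrete, here is a minimal Python sketch that measures them for two candidate lines and adds up their squares. The data points and both candidate lines are invented purely for illustration, and the function name is a hypothetical helper rather than anything defined in the text.

```python
# Hypothetical (x, y) pairs, invented for illustration only.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]


def sum_squared_vertical_distances(slope, intercept, data):
    """Add up the squared vertical gaps between each point and the line y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data)


# Two candidate lines, as if drawn "by eye" (both are assumptions).
total_a = sum_squared_vertical_distances(2.0, 0.0, points)  # y = 2x
total_b = sum_squared_vertical_distances(1.0, 2.0, points)  # y = x + 2

print(f"line A: {total_a:.2f}")
print(f"line B: {total_b:.2f}")
```

For these made-up points, line A gives a far smaller total than line B, which matches what the eye suggests: the points sit much closer to line A.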
One measure of how well the line fits the data is the sum of the squares of these vertical distances. A large sum means that the vertical distances of data points from the line are fairly large and hence the line is not a very good fit. A small sum means the data points lie close to the line and the fit is good. Of all possible lines, the best-fit line is the line that minimizes the sum of the squares of the vertical distances. Because of this property, the best-fit line is sometimes called the least squares line.

You may recall that the equation of any straight line can be written in the general form

\[ y = mx + b \]

where m is the slope of the line and b is the y-intercept of the line. The formulas for the slope and y-intercept of the best-fit line are as follows:

\[ \text{slope} = m = r \, \frac{s_y}{s_x} \]

\[ \text{y-intercept} = b = \bar{y} - (m \times \bar{x}) \]

In the above expressions, r is the correlation coefficient, \(s_x\) denotes the standard deviation of the x values (or the values of the first variable), \(s_y\) denotes the standard deviation of the y values, \(\bar{x}\) represents the mean of the values of the variable x, and \(\bar{y}\) represents the mean of the values of the variable y. Because these formulas are tedious with manual calculations, we usually use a calculator or computer to find the slope and y-intercept of best-fit lines. Statistical software packages and some calculators, such as the TI-83/84 Plus family of calculators, are designed to automatically generate the equation of a best-fit line. When software or a calculator is used to find the slope and intercept of the best-fit line, results are commonly expressed in the format \(y = b_0 + b_1 x\), where \(b_0\) is the intercept and \(b_1\) is the slope, so be careful to correctly identify those two values.

Section 7.3 Exercises

Statistical Literacy and Critical Thinking

1. Best-Fit Line. What is a best-fit line (also called a regression line)? How is a best-fit line useful?

2. r². For a study involving paired sample data, it is found that r = –0.4. What is the value of r²?
In general, what is r² called, what does it measure, and how can it be interpreted? That is, what does its value tell us about the variables?

3. Regression. An investigator has data consisting of heights of daughters and the heights of the corresponding mothers and fathers. She wants to analyze the data to see the effect that the height of the mother and the height of the father has on the height of the daughter. Should she use a (linear) regression or multiple regression? What is the basic difference between (linear) regression and multiple regression?

4. R². Using data described in Exercise 3, it is found that R² = 0.68. Interpret that value. That is, what does that value tell us about the data?

Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

5. r² Value. A value of r² = 1 is obtained from a sample of paired data with one variable representing the amount of gas (gallons) purchased and the total cost of the gas.

6. r² Value. A value of r² = –0.040 is obtained from a sample of men, with each pair of data consisting of the height in inches and the SAT score for one man.

7. Height and Weight. Using data from the National Health Survey, the equation of the best-fit line for women’s heights and weights is obtained, and it shows that a woman 120 inches tall is predicted to weigh 430 pounds.

8. Old Faithful. Using paired sample data consisting of the duration time (in seconds) of eruptions of Old Faithful geyser and the time interval (in minutes) after the eruption, a value of r² = 0.926 is calculated, indicating that about 93% of the variation in the interval after eruption can be explained by the relationship between those two variables as described by the best-fit line.

Concepts and Applications

Best-Fit Lines on Scatterplots.
For Exercises 9–12, do the following.

a. Insert a best-fit line in the given scatterplot.
b. Estimate or compute r and r². Based on your value for r², determine how much of the variation in the variable can be accounted for by the best-fit line.
c. Briefly discuss whether you could make valid predictions from this best-fit line.

9. Use the scatterplot for color and price in Figure 7.2.
10. Use the scatterplot for life expectancy and infant mortality in Figure 7.4.
11. Use the scatterplot for number of farms and size of farms in Figure 7.5.
12. Use both scatterplots for actual and predicted temperature in Figure 7.6.

Best-Fit Lines. Exercises 13–20 refer to the tables in the Section 7.1 Exercises. In each case, do the following.

a. Construct a scatterplot and, based on visual inspection, draw the best-fit line by eye.
b. Briefly discuss the strength of the correlation. Estimate or compute r and r². Based on your value for r², identify how much of the variation in the variable can be accounted for by the best-fit line.
c. Identify any outliers on the scatterplot and discuss their effects on the strength of the correlation and on the best-fit line.
d. For this case, do you believe that the best-fit line gives reliable predictions outside the range of the data on the scatterplot? Explain.

13. Use the data in Exercise 19 of Section 7.1.
14. Use the data in Exercise 20 of Section 7.1.
15. Use the data in Exercise 21 of Section 7.1.
16. Use the data in Exercise 22 of Section 7.1.
17. Use the data in Exercise 23 of Section 7.1. To locate the points, use the midpoint of each income category; use a value of $25,000 for the category “less than $30,000,” and use a value of $70,000 for the category “more than $60,000.”
18. Use the data in Exercise 24 of Section 7.1.
19. Use the data in Exercise 25 of Section 7.1.
20. Use the data in Exercise 26 of Section 7.1.

PROJECTS FOR THE INTERNET & BEYOND

21. Lead Poisoning.
Research lead poisoning, its sources, and its effects. Discuss the correlations that have helped researchers understand lead poisoning. Discuss efforts to prevent it.

22. Asbestos. Research asbestos, its sources, and its effects. Discuss the correlations that have helped researchers understand adverse health effects from asbestos exposure. Discuss efforts to prevent those adverse health effects.

23. Worldwide Population Indicators. The following table gives five population indicators for eleven selected countries. Study these data and try to identify possible correlations. Doing additional research if necessary, discuss the possible correlations you have found, speculate on the reasons for the correlations, and discuss whether they suggest a causal relationship. Birth and death rates are per 1,000 population; fertility rate is per woman.

Country         Birth rate   Death rate   Life expectancy   Percent urban   Fertility rate
Afghanistan     50           22           43                20              6.9
Argentina       21           8            72                88              2.6
Australia       15           7            78                85              1.9
Canada          14           7            78                77              1.6
Egypt           29           8            64                45              3.4
El Salvador     30           6            68                45              3.1
France          13           9            78                73              1.6
Israel          21           7            77                91              2.8
Japan           10           7            79                78              1.5
Laos            45           15           51                22              6.7
United States   16           9            76                76              2.0

Source: The New York Times Almanac.

IN THE NEWS

24. Predictions in the News. Find a recent news report in which a correlation is used to make a prediction. Evaluate the validity of the prediction, considering all of the cautions described in this section. Overall, do you think the prediction is valid? Why or why not?

25. Best-Fit Line in the News. Although scatterplots are rare in the news, they are not unheard of. Find a scatterplot of any kind in a news article (recent or not). Draw a best-fit line by eye. Discuss what predictions, if any, can be made from your best-fit line.

26. Your Own Multiple Regression. Come up with an example from your own life or work in which a multiple regression analysis might reveal important trends.
Without actually doing any analysis, describe in words what you would look for through the multiple regression and how the answers might be useful. 7.4 THE SEARCH FOR CAUSALITY A correlation may suggest causality, but by itself a correlation never establishes causality. Much more evidence is required to establish that one factor causes another. Earlier, we found that a correlation between two variables may be the result of either (1) coincidence, (2) a common underlying cause, or (3) one variable actually having a direct influence on the other. The process of establishing causality is essentially a process of ruling out the first two explanations. In principle, we can rule out the first two explanations by conducting experiments: • • • We can rule out coincidence by repeating the experiment many times (or by using a large number of subjects in the experiment). Because coincidences occur randomly, the same coincidence is unlikely to occur in repeated trials of an experiment. • We can rule out a common underlying cause by controlling and randomizing the experiment to eliminate the effects of confounding variables (see Section 1.3). If the controls rule out confounding variables, any remaining effects must be caused by the variables of interest. Unfortunately, these ideas are often difficult to put into practice. In the case of ruling out coincidence, it may be too time-consuming or expensive to repeat an experiment a sufficient number of times. To rule out a common underlying cause, the experiment must control for everything except the variables of interest, and this is often impossible. Moreover, there are many cases in which experiments are impractical or unethical, so we can gather only observational data. Because observational studies cannot definitively establish causality, we must find other ways of trying to establish causality. Establishing Causality Suppose you have discovered a correlation and suspect causality. How can you test your suspicion? 
Let’s return to the issue of smoking and lung cancer. The strong correlation between smoking and lung cancer did not by itself prove that smoking causes lung cancer. In principle, we could have looked for proof with a controlled experiment. But such an experiment would be unethical because it would require forcing a group of randomly selected people to smoke cigarettes. So how was smoking established as a cause of lung cancer? The answer involves several lines of evidence. First, researchers found correlations between smoking and lung cancer among many groups of people: women, men, and people of different races and cultures. Second, among groups of people that seemed otherwise identical, lung cancer was found to be more rare in nonsmokers. Third, people who smoked more and for longer periods of time were found to have higher rates of lung cancer. Fourth, when researchers accounted for other potential causes of lung cancer (such as exposure to radon gas or asbestos), they found that almost all the remaining lung cancer cases occurred among smokers (or people exposed to second-hand smoke). BY THE WAY Statistical methods cannot prove that smoking causes cancer, but statistical methods can be used to identify an association, and physical proof of causation can then be sought by researchers. Dr. David Sidransky of Johns Hopkins University and other researchers found a direct physical link that involves mutations of a specific gene among smokers. Molecular analysis of genetic changes allows researchers to determine whether cigarette smoking is the cause of a cancer. (See “Association Between Cigarette Smoking and Mutation of the p53 Gene in Squamous-Cell Carcinoma of the Head and Neck,” by Brennan, Boyle et al., New England Journal of Medicine, Vol 332, No. 11.) These four lines of evidence made a strong case, but still did not rule out the possibility that some other factor, such as genetics, predisposes people both to smoking and to lung cancer. 
However, two additional lines of evidence made this possibility highly unlikely. One line of evidence came from animal experiments. In controlled experiments, animals were divided into randomly chosen treatment and control groups. The experiments still found a correlation between inhalation of cigarette smoke and lung cancer, which seems to rule out a genetic factor, at least in the animals. The final line of evidence came from biologists studying small samples of human lung tissue. The biologists discovered the basic process by which ingredients in cigarette smoke create cancer-causing mutations. This process does not appear to depend in any way on specific genetic factors, making it all but certain that lung cancer is caused by smoking and not by any preexisting genetic factor. The fact that second-hand smoke exposure is also associated with some cases of lung cancer further argues against a genetic factor (since second-hand smoke affects non-smokers) but is consistent with the idea that ingredients in cigarette smoke create cancer-causing mutations. The following box summarizes these ideas about establishing causality. Generally speaking, the case for causality is stronger when more of these guidelines are met. Guidelines for Establishing Causality If you suspect that a particular variable (the suspected cause) is causing some effect: • • • • • 1. Look for situations in which the effect is correlated with the suspected cause even while other factors vary. 2. Among groups that differ only in the presence or absence of the suspected cause, check that the effect is similarly present or absent. 3. Look for evidence that larger amounts of the suspected cause produce larger amounts of the effect. 4. If the effect might be produced by other potential causes (besides your suspected cause), make sure that the effect still remains after accounting for these other potential causes. 5. If possible, test the suspected cause with an experiment. 
If the experiment cannot be performed with humans for ethical reasons, consider doing the experiment with animals, cell cultures, or computer models.
6. Try to determine the physical mechanism by which the suspected cause produces the effect.

BY THE WAY The first four guidelines above are called Mill’s methods after John Stuart Mill (1806–1873). Mill was a leading scholar of his time and an early advocate of women’s right to vote. In philosophy, the four methods are called, respectively, the methods of agreement, difference, concomitant variation, and residues.

TIME OUT TO THINK

There’s a great deal of controversy concerning whether animal experiments are ethical. What is your opinion of animal experiments? Defend your opinion.

CASE STUDY Air Bags and Children

By the mid-1990s, passenger-side air bags had become commonplace in cars. Statistical studies showed that the air bags saved many lives in moderate- to high-speed collisions. But a disturbing pattern also appeared. In at least some cases, young children, especially infants and toddlers in child car seats, were killed by air bags in low-speed collisions. At first, many safety advocates found it difficult to believe that air bags could be the cause of the deaths. But the observational evidence became stronger, meeting the first four guidelines for establishing causality. For example, the greater risk to infants in child car seats fit Guideline 3, because it indicated that being closer to the air bags increased the risk of death. (A child car seat sits on top of the built-in seat, thereby putting a child closer to the air bags than the child would be otherwise.) To seal the case, safety experts undertook experiments using dummies. They found that children, because of their small size, often sit where they could be easily hurt by the explosive opening of an air bag.
The experiments also showed that an air bag could impact a child car seat hard enough to cause death, thereby revealing the physical mechanism by which the deaths occurred.

BY THE WAY Based on these studies, the government now recommends that child car seats never be used on the front seat and that children under age 12 (or under 4 feet, 9 inches tall) sit in the back seat whenever possible.

CASE STUDY Cardiac Bypass Surgery

Cardiac bypass surgery is performed on people who have severe blockage of arteries that supply the heart with blood (the coronary arteries). If blood flow stops in these arteries, a patient may suffer a heart attack and die. Bypass surgery essentially involves grafting new blood vessels onto the blocked arteries so that blood can flow around the blocked areas. By the mid-1980s, many doctors were convinced that the surgery was prolonging the lives of their patients. However, a few early retrospective studies turned up a disconcerting result: Statistically, the surgery appeared to be making little difference. In other words, patients who had the surgery seemed to be faring no better on average than similar patients who did not have it. If this were true, it meant that the surgery was not worth the pain, risk, and expense involved.

Because these results flew in the face of what many doctors thought they had observed in their own patients, researchers began to dig more deeply. Soon, they found confounding variables that had not been accounted for in the early studies. For example, they found that patients getting the surgery tended to have more severe blockage of their arteries, apparently because doctors recommended the surgery more strongly to these patients. Because these patients were in worse shape to begin with, a comparison of longevity between them and other patients was not really valid. More important, the research soon turned up substantial differences in the results among patients who had the surgery in different hospitals.
In particular, a few hospitals were achieving remarkable success with bypass surgery and their patients fared far better than patients who did not have the surgery or had it at other hospitals. Clearly, the surgical techniques used by doctors at the successful hospitals were somehow different and superior. Doctors studied the differences to ensure that all doctors could be trained in the superior techniques.

In summary, the confounding variables of amount of blockage and surgical technique had prevented the early studies from finding a real correlation between cardiac bypass surgery and prolonged life. Today, cardiac bypass surgery is accepted as a cause of prolonged life in patients with blocked coronary arteries. It is now among the most common types of surgery, and it typically adds decades to the lives of the patients who undergo it.

BY THE WAY As you might guess, it is also difficult to define reasonable doubt. For criminal trials, the Supreme Court endorsed this guidance from Justice Ruth Bader Ginsburg: “Proof beyond a reasonable doubt is proof that leaves you firmly convinced of the defendant’s guilt. There are very few things in this world that we know with absolute certainty, and in criminal cases the law does not require proof that overcomes every possible doubt. If, based on your consideration of the evidence, you are firmly convinced that the defendant is guilty of the crime charged, you must find him guilty. If on the other hand, you think there is a real possibility that he is not guilty, you must give him the benefit of the doubt and find him not guilty.”

Hidden Causality

So far we have discussed how to establish causality after first discovering a correlation. However, sometimes a correlation, or the lack of a correlation, can hide an underlying causality. As the next case study shows, such hidden causality often occurs because of confounding variables.
Confidence in Causality

The six guidelines offer us a way to examine the strength of a case for causality, but we often must make decisions before a case of causality is fully established. Consider, for example, the well-known case of global warming. It may never be possible to prove beyond all doubt that the burning of fossil fuels is causing global warming (see the Focus on Environment at the end of this chapter), so we must decide whether to act while we still face some uncertainty about causation. How much must we know before we decide to act?

In other areas of statistics, accepted techniques help us deal with this type of uncertainty by allowing us to calculate a numerical level of confidence or significance. But there are no accepted ways to assign such numbers to the uncertainty that comes with questions of causality. Fortunately, another area of study has dealt with practical problems of causality for hundreds of years: our legal system. You may be familiar wi...

Explanation & Answer


Statistics
Exercise 19 page 21
Safe Speeds? Consider the following table showing speed limits and death rates from automobile
accidents in selected countries.

Country          Death rate (per 100 million vehicle-miles) (y)    Speed limit (miles per hour) (x)
Norway           3.0                                               55
United States    3.3                                               55
Finland          3.4                                               55
Britain          3.5                                               70
Denmark          4.1                                               55
Canada           4.3                                               60
Japan            4.7                                               55
Australia        4.9                                               65
Netherlands      5.1                                               60
Italy            6.1                                               75

Source: D. J. Rivkin, New York Times.
Questions
a. Construct a scatterplot of the data.
b. Briefly characterize the correlation in words (for example, strong positive correlation, weak negative
correlation) and estimate the correlation coefficient of the data. (Or calculat...
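
For part (b), the correlation coefficient and a best-fit line can be checked numerically. The following is a minimal sketch, not the posted solution: it assumes the table values exactly as listed above, and the names `speed_limit`, `death_rate`, `pearson_r`, and `best_fit_line` are illustrative choices, not anything taken from the textbook.

```python
import math

# Data copied from the exercise table (assumed to be transcribed correctly):
# x = speed limit (miles per hour), y = death rate (per 100 million vehicle-miles).
speed_limit = [55, 55, 55, 70, 55, 60, 55, 65, 60, 75]
death_rate = [3.0, 3.3, 3.4, 3.5, 4.1, 4.3, 4.7, 4.9, 5.1, 6.1]


def _sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy): sums of squared and cross deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    sxx, syy, sxy = _sums_of_squares(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def best_fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    sxx, _, sxy = _sums_of_squares(xs, ys)
    slope = sxy / sxx
    intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)
    return slope, intercept


r = pearson_r(speed_limit, death_rate)
slope, intercept = best_fit_line(speed_limit, death_rate)
print(f"r = {r:.3f}")
print(f"death rate = {slope:.4f} * speed limit + {intercept:.3f}")
```

With these ten data points the script should report r of roughly 0.59, which would be described in words as a moderate positive correlation. Because there are so few points, a single country such as Italy or Britain can shift r noticeably, so the value is best read as a rough estimate.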

