Homework help for statistics

User Generated

nw1008

Mathematics

mgsc6200

Northeastern University

Description

20 questions attached. Course Book - Managers guide to Statistics. These are homework questions. Need to be concise but informative. Show calculations for each question. Attaching course weekly lecture notes to help with solving questions.

Unformatted Attachment Preview

1. A company has estimated the standard deviation of individual sales transactions to be $2000. Approximately 100 sales transactions take place each week. What is an estimate of the standard error of the average weekly sales?

2. A commodity trader who usually makes around 25 transactions per week obtained the following graph showing the average transaction size in $1000 for each of the last 20 weeks. Each dot represents the weekly average (in $1000) of about 25 transactions. By examining the graph, would you estimate the standard error for the weekly average of transaction sizes to be closest to $2,000, $20,000, or $50,000?

3. A media rating company that publishes television ratings uses a random sample of 1600 households and finds that 20% watched a particular show in a given week. What is the approximate margin of error?

4. Polls showed that the two main candidates in the 2004 presidential election were nearly tied on the day before the election. To predict the winner, a newspaper would like to have a poll which has margin of error of 5%. Roughly how large a sample would be needed for such a poll? (Hint: since candidates are nearly tied, each has roughly 50% of the vote.)

5. Using a random sample of 100 workers, researchers calculate a 95% confidence interval for the average hourly wage earned by construction workers in the city of Boston. This interval they calculate is from $18 to $26. Does this mean we can say that roughly 95% of construction workers in the city earn hourly wages between $18 and $26?

6. A hotel has 180 rooms. Past experience shows that only 84% of people who reserve rooms actually show up. If the hotel takes 200 reservations, what is the chance there will be enough rooms for the people who actually show up? (Hint: You want to find the chance that, in the sample of 200 reservations, the percentage of those who show up will be 90% (180/200) or less, knowing that true population percentage is 84%.)

7. A health insurer wants to know if having a nurse available to answer simple questions over a telephone hotline could cut costs by eliminating unnecessary doctor visits. Records show the yearly cost of doctor visits has an SD of $300. They randomly select 225 families, give them access to the hotline, and record the costs of their doctor visits. In this sample, they find the average yearly cost per family to be $820. Find a 95% confidence interval for the yearly population average cost of doctor visits per family.

8. A health insurer wants to know if having a nurse available to answer simple questions over a telephone hotline could cut costs by eliminating unnecessary doctor visits. Past records show the yearly average cost of doctor visits per family is around $875 with an SD of $300. They randomly select 225 families, give them access to the hotline, and record the yearly costs of their doctor visits. In this sample, they find the average yearly cost per family to be $840. We want to test the null hypothesis that the hotline does not reduce costs, that is, average cost is $875. What would be the conclusion based on this data?

9. Market researchers would like to know if consumers can taste the difference between a product made with low calorie oil and the same product made with regular oil. A random sample of 310 people are blindfolded and given both products to taste. Overall, 270 people correctly guess which is the low-calorie product. Find a 95% confidence interval for the population percentage of people who can correctly guess the low-calorie product.

10. Market researchers would like to know if consumers can taste the difference between a product made with low calorie oil and the same product made with regular oil. A random sample of 400 people are blindfolded and given both products to taste. Overall, 230 people correctly guess which is the low-calorie product. Is there evidence of a difference in taste of the two products?

11. Suppose that the correlation coefficient between two variables X and Y is estimated to be 0.82 using a random sample, but no other information about the variables is provided. Which of the following statements is most appropriate about the relationship between them?

12. Suppose that the correlation coefficient between two variables is nearly zero. What can we say about the relationship between them?

13. A commodity trader finds a positive correlation between wheat prices and soybean prices, both measured as dollars per pound. Would you expect the correlation to be higher, lower, or the same if prices for wheat were in dollars per pound and prices for soybeans were in dollars per kilogram? (1 kilogram is 2.2 pounds.)

14. Based on a random sample of adults, there is a small but positive correlation between income and height for adults, and these data tend to follow a gently sloping line. Does this mean people who earn more money tend to be a bit taller on average than people who earn less?

15. The following regression equation has been estimated for Y using X as the independent variable: Y = 20 X + 64. Now we would like to estimate the regression equation for X using Y as the independent variable.

16. Store records show that people who live closer to a grocery store tend to purchase more. The relationship between the distance to the store in miles (X) and the amount a customer spends over the year in dollars (Y) fits the following regression line: Y = -100 X + 1000. Interpret the coefficient -100 in this regression equation.

17. Store records show that people who live closer to a store tend to purchase more. The relationship between the distance to the store in miles (X) and the amount a customer spends over the year in dollars (Y) fits the following regression line: Y = -100 X + 1000. If a customer lives 3 miles from the store, what is the predicted amount that the customer spends at the store annually?

18. Greg is analyzing data on household budget for selected years over a 60-year period from data published by the U.S. Bureau of Labor Statistics. Greg’s dependent variable (Y) is the annual household expenditures for food at home in the U.S. (in $1,000s), and his independent variable (X) is the annual household income (in $1,000s). The data fit the following regression line: Y = 41.372 + 0.06172 X. What are the predicted annual household expenditures when the household income is $70,000?

19. U.S. News & World Report publishes the average starting salary and the average GMAT score for MBA graduates at each of the top 80 ranked schools in the country. The data fit the following regression line: Y = 288X - 111,408 where X is the average GMAT score for students at a school and Y is the average starting salary for graduates from that same school. What is a proper interpretation of the coefficient 288 in the regression equation?

20. Consider the following two variables: X = Length of hospital stay by any given patient (days), Y = Cost of the patient’s hospital stay (dollars). What do you expect about the relationship between these variables?

MGSC 6200: Information Analysis

Note: This is the text-only version of this week’s lecture. All media (i.e. videos, flash presentations, and PowerPoints) and learning activities (i.e. assigned readings, assignments, and discussions) are accessible only through the online course.

Week 3: Correlation and Regression

Introduction

Log in to the course to view video and alternative version.

Week 3 Overview

This week, we move from considering a single variable by itself to exploring if two variables are related to each other. We shall learn how we can measure the strength of a relationship if it is of a simple type called a linear relationship. In that case, we shall also learn how to obtain and use the line representing the relationship.
In discussing relationships, we shall also learn that there is an important difference between being related to and being caused by another variable. Two variables X and Y may be related in different ways. Here are some possibilities:

1. A change in X causes a change in Y.
2. A change in Y causes a change in X.
3. A third variable (or a set of variables) causes a change in both X and Y.

Because of the third possibility, we cannot conclude that the existence of a relationship between two variables indicates that one causes the other. Thus, an observed relationship can be due to causes other than the two variables we are considering.

Week 3 has two lessons:

Lesson 1: Correlation Between Two Variables
Lesson 2: Simple Regression

After concluding the two lessons, you will submit the following for assessment purposes:

Week 3 Homework Assignment (due on Day 7 by 9:00 a.m. E.T.)
Week 3 Quiz (duration is 60 minutes, will be open 8:00 p.m. E.T. on Day 5 and will remain open for 52 hours)

Week 3 Learning Activities

Learning Activity | Description | Due Date | Point Value and Percentage
Schedule exam time with ProctorU | Scheduling ProctorU at least 72 hours before the exam. | Day 1 | ~
Week 3 Lesson 1 | Correlation Between Two Variables | ~ | ~
Week 3 Reading 1 | Read Sections 1 - 6 of Chapter 5 of your textbook | ~ | ~
Practice Exercises 1 | Use Microsoft Excel to complete the exercise | ~ | ~
Practice Exercises 2 | Use Microsoft Excel to complete the exercise | ~ | ~
Practice Exercises 3 | Use Microsoft Excel to complete the exercise | ~ | ~
Practice Exercises 4 | Exercises 13, 17, 21 at the end of Chapter 5 | ~ | ~
Week 3 Lesson 2 | Simple Regression | ~ | ~
Week 3 Reading 2 | Read Chapter 6 of your textbook | ~ | ~
Practice Exercises 5 | Exercises 4, 11 at the end of Chapter 6 | ~ | ~
Practice Exercises 6 | Use Microsoft Excel to complete the exercise | ~ | ~
Week 3 Assignment 1 - Homework Assignment | Chapter 5 Exercises 6 (page 170), 9 (pages 176-177), and Exercise WK3_X1; Chapter 6 Exercises 9 (page 199), 13 (a), (b), (c) (pages 200-201) and Exercise WK3_X2 | Day 7 at 9:00 a.m. E.T. | 12 points; 12%
Week 3 Quiz 1 | ProctorU is required to take this quiz. You have 60 minutes to complete the quiz | Available at 8:00 p.m. E.T. on Day 5 for 52 hours. | 10 points; 10%

Lesson 1: Correlation Between Two Variables

Lesson Overview

In this lesson, we shall study how two variables of interest may be related to each other, and if so, how we can measure the strength of their relationship. We shall also learn some caveats in interpreting and measuring the strength of relationships. In discussing relationships, we shall also learn that there is an important difference between being related to and being caused by another variable.

When you finish this lesson, you should be able to:

1. Understand what we mean by two variables being related to each other.
2. Understand that correlation does not mean causation.
3. Understand the difference between linear and other types of relationships.
4. Understand that the correlation coefficient measures the strength of a linear relationship.

Lesson Readings: Week 3 Reading 1

Important! Before you begin this lesson, read Sections 1 - 6 of Chapter 5 of your textbook. Please do not continue until you have done so.

Note on Section 6 in Chapter 5: Section 6 in Chapter 5 of your textbook gives an explicit formula for the population correlation coefficient between two variables. This section also has some discussion of the SD of the sum of two correlated variables. Reading this section is completely optional, and you can omit it without any adverse consequences to your course grade. It will not be included in any quiz or assignment.

Lesson 1: Correlation Between Two Variables

Presentation: Log in to the course to access interactive course content and alternative version.
Lesson 1: Correlation Between Two Variables

Practice Exercises

Practice Exercises 1

Exercise: Please start Excel on your computer, open the file fourdatasets.xls from the Data Sets, and verify that r = 0.816 in each data set using the CORREL function you saw earlier:

=CORREL(X-values, Y-values)

Practice Exercises 2

Exercise: Before continuing, please use the Chart facility under the Insert menu in Microsoft Excel to obtain the four scatterplots you just saw in Figure 2 on the previous page for the Anscombe data set in fourdatasets.xls from the Data Sets.

Practice Exercises 3

Exercise: Using the Anscombe Data Set a in fourdatasets.xls from the Data Sets, do the following and calculate the correlation coefficient in each case:

1. Add 5 to all X and all Y values.
2. Subtract 20 from all X values but leave the Y values unchanged.
3. Divide all X and all Y values by 4.
4. Leave the X values as they are but multiply all Y values by 4.

Practice Exercises 4

Before continuing, do the following additional practice exercises: Exercises 13 (page 179), 17 (page 181), 21 (pages 182-183) at the end of Chapter 5.

When you are ready, continue to Week 3 Lesson 2.

Lesson 2: Simple Regression

Lesson Overview

In this lesson, we shall study how to find and use the straight line that best fits the sample data on two variables X and Y. This line is called the regression line. We shall learn that the regression line that best fits a sample data set is really an estimate of the regression line for the population. Then we shall learn how to test if a regression line passes a test of significance, and if so, how to use it for predicting the value of Y for any given value of X.

Learning Objectives

When you finish this lesson, you should be able to:

1. Describe the meaning of the regression line.
2. Define the terms: regression coefficient, coefficient of determination, and residual.
3. Demonstrate the most commonly used way of obtaining the best fitting line.
4. Explain how to test if the regression line is statistically significant.

Week 3 Reading 2

Important! Before you begin this lesson, read Chapter 6 of your textbook. This lesson will build upon Chapter 10 with only a reinforcement of the important highlights. Please do not continue to the next page until you have read Chapter 6.

Chapter 6 Concepts

As you just read, the regression line is the line that best fits the general pattern in a scatterplot of data on two variables X and Y. Keep in mind that each X, Y pair in the data set is one observation, shown as a single point in the scatterplot.

To the right is a scatterplot with a regression line superimposed on it. The line is meant to show the general pattern in the data. Observe that individual observations are scattered around the line and few, if any, actually fall on the line.

The equation of the regression line is Y = m X + b. The letter m designates the slope, and b designates the intercept of the line. The definition of slope as rise divided by run, as shown in this figure, means that slope is simply "the change in Y per unit change in X."

Perhaps the first questions we would be asking at this point are:

1. How do we determine the regression line (i.e. find its slope and intercept)?
2. How do we use the regression line, or, what do we do with it?

We will begin with the second question and come back to the first question shortly.

Uses of Regression

There are two basic uses of regression:

1) To forecast or predict the value of a variable Y when the value of X is given.
2) To understand and explain the variability in Y in terms of an explanatory variable X (or several explanatory variables, which is the subject of our next lesson on multiple regression).

The variable Y, which we want to predict or explain, is sometimes called the response or dependent variable.
The variable X is called the predictor or explanatory variable.

1) Forecasting

If we are successful in obtaining the best-fitting regression line (i.e., its slope and intercept have been found), then we can predict the value of Y for any given X value by plugging that value of X into the regression equation:

Predicted Y = m (given X) + b

It is worth emphasizing that the Predicted Y value is the Y-coordinate of that point on the regression line whose X-coordinate is the given X value.

This point is worth thinking about for a moment. Since observations can be scattered around the regression line, we cannot say that Predicted Y value is the actual Y value of a data point whose X-coordinate is the given X value. We can say, however, that Predicted Y is the estimated average of Y values whose X-coordinate is the given X value. So, the slope of the regression line should really be interpreted as the change we expect to see in the average value of Y per unit change in the value of X. Therefore, think of Predicted Y as the estimated average of Y values, given the X value.

When we predict the average value of Y in this way, it is often desirable to also give a confidence interval for it. Details of doing this are beyond the scope of this course, but such confidence intervals are routinely provided by statistical software. Computer software also routinely provides an estimate of the standard deviation of data values from the regression line. This estimate is called the standard error for the regression line, and it shows the typical vertical deviation of data points from the line.

Another point is worth emphasizing: If we have obtained the regression equation for Y with X being the predictor, we cannot just solve the equation Y = m X + b for X and use it as the regression equation for X with Y being the predictor. The reason is that we now have to treat Y as a given value and estimate the average of X.
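This last point can be checked numerically. The following is a minimal Python sketch, not part of the course materials: the helper name `least_squares` and the five data points are made up for illustration. It fits Y on X, then X on Y, and compares the second slope with the one obtained by simply solving the first equation for X.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least squares line predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data, chosen only so that the points do not fall exactly on a line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

m_yx, b_yx = least_squares(x, y)  # regression of Y on X
m_xy, b_xy = least_squares(y, x)  # regression of X on Y (roles swapped)

# Slope you would get by merely solving Y = m X + b for X.
inverted_slope = 1 / m_yx

print("Y on X slope:", m_yx)
print("X on Y slope:", m_xy)
print("Slope from solving for X:", inverted_slope)
```

For this made-up data, both fitted slopes come out at 0.8, while solving the first equation for X would give a slope of 1.25. The two approaches agree only when every point lies exactly on the line.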
Please re-read Section 2 in Chapter in your textbook for a more detailed discussion of this point.

Practice Exercises 5

Before continuing, do the following additional practice exercises: Exercises 4 (page 198) and 11 (page 200) at the end of Chapter 6.

2) Explaining the variability of Y in terms of X

Once we obtain the slope and the intercept of the regression line, we can use it to say something about Y in terms of X. For example, we can say that "for each unit change in X, we expect the average of Y to change by m." The sign of the slope also allows us to say if the change will be an increase or decrease.

It turns out that we can say more. For example, we can also make a statement like:

"the variable X explains . . . percent of the variability in Y"

or a statement like:

". . . percent of the variability in Y is associated with X"

The percentage explained is given by something called the coefficient of determination, which we shall explain below.

One more fact is worth emphasizing before we return to our first question (namely, determining the regression line). In practice, we almost always have to work with sample data since population data is usually unavailable. If all we have is a sample, and we wish to make statements about the population, then we have to justify generalizing what we learn from our sample to the population. Just as we did in week two regarding a population mean or percentage, we have to be concerned with confidence intervals and hypothesis tests about a regression line estimated from sample data.

Let us try to explain this in a graphical way. This is a conceptual graph of an unknown population, shown as an ellipse, and a random sample taken from that population, shown with the red dots. For emphasis, the ellipse representing the population data has been drawn with its long axis parallel to the X axis.
As a matter of fact, the long axis of the ellipse is the population regression line. The red line is the sample regression line that best fits the sample data. Observe that the sample regression line has a positive slope, whereas the population regression line has zero slope. If you knew this to be true at the beginning, would you use the red line to say how X and Y are related in the population? Your answer should be a resounding NO!

Of course, we do not know this at the beginning. All we have on hand is sample data, and we would be looking at the sample points and the red line alone, without the ellipse.

Hopefully, you can now see that, even though the best fitting sample regression line has positive slope, we still have to test the hypothesis that the population regression line has zero slope. We shall learn how to do this shortly.

Determining the Best Fitting Regression Line

Although there are several methods of estimating the slope and the intercept of the regression line, the most common method is the ordinary least squares method. Here is the basic idea of this method:

Find that line for which the sum of the squares of vertical deviations from the line is as small as possible.

The graph below shows the vertical deviations from a regression line. These vertical deviations are also called the residuals. For each and every possible regression line, these deviations would be squared and added up, and we would choose the line for which this sum is the smallest.

Obviously, this is a perfect job for a computer and not a rewarding job for a student or a professor. In Microsoft Excel, slope and intercept calculations are built into the software as functions. If the Y-values are in the range A1:A100 and the X-values are in the range B1:B100, then you can obtain the slope and the intercept as follows:

=SLOPE(A1:A100, B1:B100)
=INTERCEPT(A1:A100, B1:B100)
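To make the least squares idea concrete, here is a small Python sketch. It is an illustration only, not the course's method (the course uses Excel); the function names `fit_line` and `sum_squared_residuals` and the six data points are hypothetical. It computes the slope and intercept from the standard closed-form least squares formulas, which should correspond to what SLOPE and INTERCEPT return, and then checks that nudging the line in any direction makes the sum of squared vertical deviations larger.

```python
def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical deviations of the points from the line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

def fit_line(x, y):
    """Closed-form ordinary least squares slope and intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical sample of six observations (X-values and Y-values).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

slope, intercept = fit_line(x, y)
best = sum_squared_residuals(x, y, slope, intercept)

# Try eight nearby lines; each should have a larger sum of squares.
nudged = [
    sum_squared_residuals(x, y, slope + ds, intercept + di)
    for ds in (-0.1, 0.0, 0.1)
    for di in (-0.1, 0.0, 0.1)
    if (ds, di) != (0.0, 0.0)
]

print("slope:", slope, "intercept:", intercept)
print("sum of squares at the fitted line:", best)
print("smallest sum of squares among nudged lines:", min(nudged))
```

Comparing against a handful of nearby lines is of course not a proof that the fitted line is the minimum; it is only meant to show what "as small as possible" means in practice.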
This will get the basic computations done, but you can do even better, much better in fact, by using the Regression choice under Data Analysis in the Tools menu in Excel. This will give you much more information than just the slope and the intercept. For example, it also provides the standard error for the regression line, which shows the typical magnitude of the residuals from the line. Excel's Regression menu is illustrated in Chapter 11 of your textbook, but not in Chapter 10. It is available only if the Analysis ToolPak has been installed and added in.

Important Note

The remainder of this lesson is background material for regression, and it does not appear in your textbook until Chapters 11 and 12. This material will NOT be included in this week's quiz or homework assignment. It is provided for information purposes only. This material involves the following questions about regression:

1. How well does the estimated regression line fit our data?
2. Is the slope of the regression line different than zero?
3. Is ordinary linear regression appropriate for our data?

You do not have to learn all of the details related to these questions, but you should understand why these questions are important, and how they are commonly answered.

How Well Does The Estimated Regression Line Fit The Data?

A commonly used number for answering this question is called the coefficient of determination. To explain it, we need the following mathematical fact, which you do not have to remember for this week's quiz:

SS Total = SS Residuals + SS Regression

Notes:

1. SS stands for "Sum of Squares." The mathematical fact says that SS Total for all Y values is equal to SS Residuals plus SS Regression.
2. SS Total shows the squared deviations of Y values from their average, so we can think of SS Total as a measure of the variability of Y. In fact, if we divide SS Total by n - 1 and take the square root of the result, we would find SD of Y.
3. Since SS Residuals is a measure of the deviations from the regression line, we can think of it as unexplained variability. SS Regression is found by subtracting this unexplained variability from total variability SS Total, so we can think of SS Regression as that part of variability of Y which is explained by the regression line (Explained = Total - Unexplained).

The coefficient of determination is now defined as follows:

r² = SS Regression / SS Total

We can get an intuitive feel for r² by making the following observations:

• If the regression line is horizontal, then Predicted Y is equal to Average Y for any data point, and therefore, SS Regression = 0. In this case, r² = 0 since its numerator is zero.
• If the regression line fits the data perfectly (i.e. all data points fall exactly on the line with no deviation), then SS Residuals = 0. In this case, SS Regression = SS Total, and r² = 1.

So, the coefficient of determination r² is always a number between 0 and 1, with zero showing no fit, and 1 showing perfect fit. The closer r² is to 1, the better the fit. It shows the fraction of the variability of Y explained by the regression equation. Multiplied by 100, it shows the percentage of variability in Y explained by the regression equation (or by X).

Here is an interesting fact that can be mathematically proven (which we'll leave out of this lesson): If we calculate the correlation coefficient r as described in chapter five and square it, we would obtain the coefficient of determination r². Conversely, if we first calculate r² as described above in this lesson, then take its square root, and then give it the sign of the slope m, then we would obtain the correlation coefficient r between X and Y.

Is the Slope of the Regression Line Different Than Zero?

Here is a fact that is worth reiterating: If the regression line is horizontal, then the Predicted value of Y is the same with any X value. In other words, X provides no useful information for predicting Y.
Regardless of the value of X, the predicted value of Y is the average of Y values. Therefore, the correlation coefficient (or the coefficient of determination) is zero precisely when the slope of the regression line is zero (and vice versa). Therefore, testing the null hypothesis that slope is zero is the same as testing if the correlation coefficient is zero (or if the coefficient of determination is zero). Once we test one of these three hypotheses, we will have tested all three.

Null Hypothesis: Slope of the population regression line is zero.
Alternative Hypothesis: Slope of the population regression line is not zero.

In order to conclude that the slope of the population regression line is not zero, the null hypothesis should be rejected on the basis of sample data. Only then can we say that X and Y are linearly related.

Remember the three methods of hypothesis testing from week two? We can use any of them here, and we shall illustrate them a bit later in this lesson. To obtain the necessary statistics, confidence intervals, or p-values, the use of computer software is helpful and often essential. You would be well-advised not to do these calculations manually, or with a simple calculator.

Is Linear Regression Appropriate For Our Data?

In order to be confident that linear regression is an appropriate fit for our sample data, we need to check to see if the standard assumptions of regression analysis appear to be satisfied. What are these assumptions and how do we check them? The standard assumptions of linear regression are the following:

• The pattern of relationship between X and Y is linear.
• For any given X value, Y values have a Normal distribution around the regression line (or equivalently, residuals have a Normal distribution with mean zero).
• Residuals in different observations are independent of one another (i.e. they have no effect on one another).
• Residuals have the same standard deviation regardless of the X value.

To check the first assumption, we know that we can look at the scatterplot of Y against X. How do we check the other assumptions? Here are two basic tools that are helpful in this task:

1. Obtain a scatterplot of residuals on the vertical axis against Predicted Y values on the horizontal axis. This "residual scatterplot" should look random with no discernible pattern.
2. Obtain a histogram of residuals. This histogram should look approximately bell shaped. If the sample size is small, this may not be very helpful since the histogram will probably not look bell-shaped, but it is a useful tool with large sets of data.

A Detailed Example

Let us answer the five basic questions we asked in this lesson (using Data Set a in the Anscombe data that we encountered in the last lesson) with the help of Excel's Regression menu. This graphic reflects the data set and the scatterplot.

Below is the regression output table from Excel (some formatting has been applied to cells). The table was obtained by selecting Regression in the box that appears when you select Data Analysis under Tools. You must have Analysis ToolPak added-in for Data Analysis to be available (you can see pages 217-220 in your textbook for the steps involved in obtaining the output table. To get the Residuals and Predicted Y values, Residual checkbox needs to be checked in the Regression dialog box shown on page 220). In this lesson, we shall focus only on the highlighted cells in the table.

And, below is the residual scatterplot (residuals against Predicted Y):

1. How do we determine the regression line, i.e. find its slope and intercept?
The slope and intercept are found in the bottom left cells in the regression output table, under the heading Coefficients where we see the following values:

Intercept = 3.00
Slope = 0.50 (it is positive)

The slope is interpreted as follows: For every increase of 1.0 in the value of X, the average Y is predicted to increase by 0.50. Similarly, if X is decreased by 1.0, the average Y is expected to decrease by 0.50. If the value of X is increased by 0.1, then the average Y is expected to increase by 0.1 × 0.50 = 0.05, and so on.

2. How do we use the regression line?

The regression equation can be used to predict the value of Y for any given X, or to explain the variability in Y using X. To do the former, let us say we want to predict Y for X = 8. Plugging this into the regression equation, we obtain:

Predicted value of Y = 0.50 X + 3.00 = 0.50 × 8 + 3.00 = 7.00

The predicted value of 7.00 is the estimated average of Y when X is equal to 8. In the sample data set, there was one point for which X = 8. The actual Y for that point was 6.95, so the residual for this point is 6.95 - 7.00 = -0.05. See if you can locate this residual in the residual scatterplot. Remember that not all points with X = 8 will have this same residual.

The output table also shows the standard error for the regression line as 1.24 (to the right of the cell labeled Standard Error). This can be interpreted as the typical magnitude of residuals. We would expect about 95% of all residuals to be within plus or minus 2 standard errors from the regression line, and almost all residuals to be within plus or minus 3 standard errors.

To say how much of the variability of Y is associated with X, we should look at the coefficient of determination. In the output table, this is found to the right of the cell labeled R Square where we see the value 0.67. This means that the regression line (or the variable X) explains 67% of the total variability in Y.

3. How well does the estimated regression line fit our data?

The coefficient of determination of 0.67 is moderately high, so the regression line fits the data reasonably well. The square root of 0.67 is 0.82 which is the correlation coefficient between X and Y. It is positive since the slope is positive.

Note: In Excel's regression output table, the square root of R Square is always reported as Multiple R and as a positive number, even if only one X variable is being used as we are doing here. This can be a bit confusing, and also inaccurate if the estimated slope is negative.

4. Is the slope of the regression different than zero?

To test the null hypothesis that the population regression slope is zero, we can look at any one of three things in the lower part of the regression output table:

• Look at the t Stat value for X where we see 4.24. This is the distance in standard units between the estimated slope of 0.50 and the hypothesized slope of 0 in the null hypothesis. This value is well outside the range from -2 to +2. With α = 0.05, the null hypothesis is rejected, and we conclude that the population slope is not zero.
• Look in the P-value column where we see 0.00. This is much smaller than α = 0.05, so we reach the same conclusion.
• Look at the 95% confidence interval for the population slope. The interval extends from 0.23 to 0.77. This interval does not include 0, so again, we reach the same conclusion.

Note: We are usually not concerned with the population intercept, so hypothesis testing about the intercept is not important. For the record, it is done in the same way we just did for slope.

5. Is the ordinary linear regression model appropriate for our data?

This question cannot be answered satisfactorily with a very small set of data, as in this example where we only have 11 data points. However, the randomly scattered appearance of the points in the residual scatterplot is encouraging.
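The figures quoted in this example can be cross-checked outside Excel. The sketch below is a minimal illustration in plain Python, not the course's procedure. It assumes that Data Set a in fourdatasets.xls matches the first data set of the published Anscombe quartet (the values typed into the code), which is consistent with the 11 data points and the observation with X = 8 and Y = 6.95 mentioned above. The variable names are illustrative.

```python
import math

# Assumed data: first data set of the published Anscombe quartet.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least squares slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

predicted = [slope * xi + intercept for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Sum-of-squares decomposition: Total = Residuals + Regression.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_residuals = sum(e ** 2 for e in residuals)
ss_regression = ss_total - ss_residuals

r_squared = ss_regression / ss_total
# Standard error for the regression line, using n - 2 degrees of freedom.
std_error = math.sqrt(ss_residuals / (n - 2))
# t statistic for the null hypothesis that the population slope is zero.
t_stat = slope / (std_error / math.sqrt(sxx))
# Residual of the single observation with X = 8.
residual_at_8 = residuals[x.index(8)]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"R Square = {r_squared:.2f}, standard error = {std_error:.2f}")
print(f"t Stat = {t_stat:.2f}, residual at X = 8: {residual_at_8:.2f}")
```

If the assumption about the data holds, the printed slope, intercept, R Square, standard error, and t Stat should agree with the values read from the Excel output table in this example.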
If we had more data, it would be helpful to also obtain a histogram of residuals and see if it looks bell shaped.

Practice Exercises 6
Before continuing, do the following practice exercise. The last three bullets in this exercise are optional this week, and they will be assigned again next week.

• Open the marketing.xls file from the Data Sets in Excel.
• Get a scatterplot of Y = Sales against X = Price.
• Use Tools/Data Analysis/Regression menus to get the regression output table. In the Regression dialog box, make sure the checkbox for Residuals is checked.
• Get a scatterplot of Residuals against Predicted Y.
• Answer the five basic questions as we did in the example on the previous page.

Solutions to Week 3 Practice Exercises
It is expected that you attempt the assigned textbook problems before you look at the Textbook Solutions for Week 3.

Week 3 Assignment 1 - Homework Assignment
Total Points Possible: 12 Points
Complete the following textbook exercises by 9:00 a.m. E.T. on Day 7.

• Chapter 5 Exercises 6 (page 170), 9 (pages 176-177), and Exercise WK3_X1
• Chapter 6 Exercises 9 (page 199), 13 (a), (b), (c) (pages 200-201) and Exercise WK3_X2

Exercise WK3_X1
A manager of a large chain of nationwide sporting goods stores would like to know which of the following factors has the strongest link to sales: Age (median age of customer base in years), HS (percentage of customer base with a high school diploma), College (percentage of customer base with a college diploma), Growth (annual population growth rate of customer base over the past 10 years), and Income (median family income of customer base in dollars). Looking at the file WK3_X1.xls, what do you conclude?

The data stored in the file WK3_X1.xls are the monthly sales totals from a random sample of 38 stores in the franchise. All stores in the franchise, and thus within the sample, are approximately the same size and carry the same merchandise.
The county, or in some cases the counties, from which the store draws the majority of its customers is referred to here as the customer base. For each of the 38 stores, demographic information about the customer base is provided. The data are real, but the name of the franchise is not used, at the request of the company.

Exercise WK3_X2
Management of a soft-drink bottling company wants to develop a method for allocating delivery costs to customers. Although one cost clearly relates to travel time within a particular route, another cost variable reflects the time required to unload the cases of soft drink at the delivery point. A sample of 20 deliveries within a territory was selected. The delivery times (in minutes) and the number of cases delivered were recorded in the file WK3_X2.xls. An analyst computes the delivery time per case delivered and averages these to get 0.33 minutes. Using this, he gives an estimated delivery time of 50 minutes for 150 cases to be delivered. Is this a reasonable delivery time? Explain why or why not. Looking at the data in the file, can you give a better estimate? Make sure that you justify your estimate.

Each question in the graded homework assignment will be evaluated using the following:

Points
Thorough display of work and correct answer: 2
Display of work and partially correct answer: 1
Insufficient display of work, or incorrect answer, or unanswered question: 0

Due Date: Sunday, Day 7, at 9:00 a.m. E.T.

To upload a file, select 'Browse My Computer', locate the file you wish to upload, and double-click on the file name. The file will appear in the 'Attached files' section. If you agree to abide by Northeastern University’s Academic Honesty and Integrity Policy, select the 'Submit' button to upload your file.

Submit your answers using the drop box located in the Assignments area of the course. The answers will be available on Day 7 at 9:01 a.m. E.T. after you submit your assignment.
Answers to Week 3 Assignment 1 - Homework Assignment

Week 3 Quiz Instructions
IMPORTANT: Quiz/Exam Advisory
Value: 10 Points
Available at 8:00 p.m. E.T. on Day 7 for 52 hours.
Time limit: 60 minutes

If you are using the Microsoft Internet Explorer 8 or 9 browser, you will need to change your browser settings to "Compatibility View" before taking your exam/quiz. See Internet Explorer 8 Compatibility View for instructions on how to change compatibility view. Alternatively, you can download the Mozilla Firefox browser to use to take your exam/quiz. If you require further assistance, please contact the Help Desk: 1-866-291-8058

ProctorU
The exam is password protected and can only be unlocked by ProctorU. Please be certain to schedule your exam time at least 72 hours before taking the exam. You will need to know your course's CRN number when registering in order to register for the correct section. The CRN number can be found in the top left corner of Blackboard and is usually a 5 digit number following the course code. Also, it is crucial that you check all of the technical requirements for ProctorU prior to the time of your exam and verify that your computer system meets ProctorU Technical Requirements. You can do this by visiting ProctorU Technical Requirements and using the Test It Out tool. There is also a 'Connect to a live person' option on the Technical Requirements page where you can have further testing done on your system by a live ProctorU representative. This is required to ensure you will be able to complete your exam without issue.

Week 3 Quiz 1
Please read these rules first!

• This is a timed quiz that has 10 questions.
• You have up to 60 minutes to complete the quiz. Once you start, you can temporarily stop and then continue, but the clock will NOT stop running.
• Each question will be displayed on a separate page.
In order to leave ample time for each question, try not to spend more than 6 minutes for any one question.
• Note: Blackboard is configured to display the same quiz instructions above each question. These instructions remain constant throughout the quiz, so once you have read them, you will not need to spend your quiz time reading the instructions above each question.
• Below each question, there will be some choices and you are asked to select one of them as your answer. In some questions, it may be necessary to do a few calculations to help you select an answer. For numeric answers, select the answer that is closest to your calculation.

Before you begin, have your working tools easily accessible, such as your book, calculator, table of areas under the Normal curve, and a pen and some paper.

To attempt this quiz, follow these instructions:
o Select 'Begin' to start the quiz.
o After answering each question, select the 'Save answer' button.
o After answering and saving the last question, select 'Save and submit'.

Note: Week 3 Quiz will not be displayed on Week 3 menu until 8:00 p.m. E.T. on Day 5. Correct answer feedback will be available at 9:00 a.m. E.T. the following day.

Explanation of Answers to Week 3 Quiz 1
Select here for the answers and feedback to Week 3 Quiz 1

Note: This is the text-only version of this week’s lecture. All media (i.e. videos, flash presentations, and PowerPoints) and learning activities (i.e. assigned readings, assignments, and discussions) are accessible only through the online course.

Week 2: Fundamentals of Statistical Inference

Week 2 Overview
The real usefulness of the science of statistics comes from the fact that, under known assumptions, we can generalize what we learn from a set of sample data to the population from which the sample was taken. In other words, statistics teaches us how to generalize the knowledge we gain from a sample, which we have observed, to the entire population, which we have not observed. This week, we shall learn the basics of how to do this.

Week 2 consists of three lessons:
Lesson 1: Sampling Variability and Standard Error
Lesson 2: Confidence Intervals
Lesson 3: Hypothesis Testing

After concluding the three lessons, you will submit the following for assessment purposes:
Week 2 Homework Assignment (due on Day 7 by 9:00 a.m. E.T.)
Week 2 Quiz (duration is 60 minutes; it will open at 8:00 p.m. E.T. on Day 5 and will remain open for 52 hours).

Week 2 Learning Activities

Learning Activity | Description | Due Date | Point Value and Percentage
ProctorU Scheduling | Schedule exam time with ProctorU at least 72 hours before the exam. | Day 1 | ~
Week 2 Lesson 1 | Sampling Variability And Standard Error | ~ | ~
Week 2 Reading 1 | Read Sections 1 and 2 of Chapter 9 of your textbook | ~ | ~
Practice Exercises 1 | Exercises 1, 2, 5 at the end of Section 2 of Chapter 9 | ~ | ~
Week 2 Reading 2 | Read Section 3 of Chapter 9. Omit Technical Note. | ~ | ~
Practice Exercises 2 and Reading | Exercises 8, 9 at the end of Section 3 of Chapter 9. Read Section 4 of Chapter 9 | ~ | ~
Practice Exercises 3 | Exercises 10, 11, 14 at the end of Section 4 of Chapter 9 | ~ | ~
Week 2 Lesson 2 | Confidence Intervals | ~ | ~
Week 2 Reading 3 | Read pages 335-339 of Chapter 10 | ~ | ~
Week 2 Reading 4 | Read pages 339-344 of Chapter 10 | ~ | ~
Practice Exercises 4 | Exercises 10, 12, 16 at the end of Chapter 10 | ~ | ~
Week 2 Lesson 3 | Hypothesis Testing | ~ | ~
Week 2 Reading 5 | Read Chapter 11, but omit Sections 5 and 6 | ~ | ~
Practice Exercises 5 | Exercise 2 at the end of Section 3, Exercise 4 at the end of Section 4, and Exercise 17 at the end of Chapter 11 | ~ | ~
Week 2 Assignment 1 - Homework Assignment | Chapter 9 Exercises 4 (page 310), 22 (page 329), 28 (page 330); Chapter 10 Exercises 17 (page 352), 20 (page 353), 28 (page 355); Chapter 11 Exercises 11 (page 386), 13 (page 387), and Exercise X. | Day 7 at 9:00 a.m. E.T. | 18 points; 18%
Week 2 Quiz 1 | ProctorU is required to take this quiz. You have 60 minutes to complete the quiz. | Available at 8:00 p.m. E.T. on Day 5 for 52 hours. | 10 points; 10%

Lesson 1: Sampling Variability And Standard Error

Lesson Overview
In this lesson, you will learn the basic facts about the way sample results are expected to vary from sample to sample. This knowledge is important because it allows us to generalize sample results to the population from which the sample was taken. We need sample data from only a tiny fraction of a large population in order to predict the population mean, population percentage, or any other unknown characteristic of the population. For example, we can predict the outcome of an election after only a few votes have been counted. There is no magic in statistics, but this is as close as it comes, and it is based on the facts you will learn in this lesson.
Lesson Objectives
When you finish this lesson, you should be able to:
1. Define the meaning of variability in a sample statistic such as the sample average or sample percentage.
2. Describe the meaning of the standard error for a sample average or sample percentage.
3. Comprehend what is meant by margin of error, and how it can be used to determine the necessary sample size.

Week 2 Reading 1
Important: Before you begin this lesson, read Sections 1 and 2 of Chapter 9 of your textbook (do not read the rest of the chapter yet). Please do not continue to the next page until you have read Sections 1 and 2.

Law of Averages For a Sample Percentage and Sample Mean
As you have just read, there is a law of averages for a sample percentage, which says the following:

With a large, randomly selected sample, an observed percentage in the sample will tend to be very close to the true percentage in the population from which the sample was taken. The larger the sample, the closer the sample percentage will be to the population percentage.

Here is an example to illustrate this law:

• A randomly selected sample of 500 voters was asked if they support stem cell research. When the answers were tallied, there were 310 Yes (or 310/500 = 62 percent), and 190 No answers (or 190/500 = 38 percent).

Using the law of averages in this situation, we can be pretty confident that around 62 percent of all voters in the population support stem cell research. We can make this statement even though we have data only for 500 voters among millions of voters. The population percentage may not be exactly 62 percent, but it will be near this value. Of course, we can ask "How near?" One answer to this question is called the margin of error. Later in this lesson, we shall learn how to find the margin of error.
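The law of averages for a sample percentage can be illustrated with a small simulation. The sketch below is only an illustration under a stated assumption: it pretends that the true population percentage is exactly 62 percent (in the voter example, 62 percent is the sample result, and the true value is unknown), and the function name is hypothetical.

```python
import random


def sample_percentage(true_pct, n, rng):
    """Percentage of 'Yes' answers in one random sample of size n.

    true_pct is an ASSUMED population percentage, used only for simulation.
    """
    yes = sum(1 for _ in range(n) if rng.random() < true_pct / 100)
    return 100 * yes / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for n in (50, 500, 5000, 50000):
        # Larger samples tend to land closer to the assumed 62 percent.
        print(n, round(sample_percentage(62, n, rng), 1))
```

Running it a few times with different seeds shows the small samples bouncing around noticeably while the large ones stay close to 62, which is the pattern the law of averages describes.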
Similar to a sample percentage, there is also a law of averages for the sample mean:

With a large, randomly selected sample, the sample average tends to be very close to the true population average µ. The larger the sample, the closer the sample average will be to µ.

Here is an example to illustrate this law:

• In a random sample of 500 workers in the United States, it was found that the average number of hours they work each week for pay is 48 hours.

In this case, the law of averages tells us that, in the entire population of workers (tens of millions), the average hours worked per week is near 48. How near? Again, an answer is provided by the margin of error, and we shall learn how to find it shortly.

In order for the law of averages to be applicable, the sample must be large enough (more than about 30, but the larger the better). Also, the sample must be selected randomly. This means that, in selecting the sample, we do not favor some members of the population and shy away from others (intentionally or unintentionally). In other words, we do not want to have any selection bias (remember it from the first lesson?). Random selection avoids selection bias, and it ensures that the sample is representative of the population. Biased, unrepresentative samples are useless from a statistical viewpoint, no matter how large they may be.

Practice Exercises 1
Before continuing, do the following practice exercises: Exercises 1, 2, 5 at the end of Section 2 of Chapter 9 (pages 309-311)

Sampling Variability
In order to answer the "How near?" question, we first have to answer another important question. It involves the variability that we should expect to see in a sample statistic, such as a sample percentage or sample average:

• How would we expect a sample percentage to vary from sample to sample if we repeatedly take many random samples from the same population?
The same question about the sample average would be:

• How would we expect a sample average to vary from sample to sample if we repeatedly take many random samples from the same population?

Believe it or not, answers to these hypothetical questions are known (and proven as mathematical facts) without actually taking many random samples to see what would happen. We shall give these facts shortly, after introducing the term "standard error."

Variability in a sample statistic, such as a sample percentage or sample average, is measured with its standard error. Conceptually, standard error is the same as standard deviation, but the difference in terminology helps to emphasize that standard error is used to measure variability in a sample statistic, and standard deviation is used to measure variability in the data values from which the statistic is calculated.

Standard Error: Measures variability in a sample statistic, such as a sample average or sample percentage.
Standard Deviation: Measures variability in the data values from which the statistic is calculated.

Week 2 Reading 2
Before continuing, read Section 3 of Chapter 9. You can omit the "Technical Note: Finite Population Correction Factor" mentioned at the end of the section, since it is usually not very important in practice.

Sample Percentage
As you just read in Section 3 of Chapter 9, we can say the following about a sample percentage:

SE for % = sqrt(p x (100 - p) / n), where p is the true population percentage and n is the sample size.
ME for % = 2 x SE for %.

Let us do an example.

Example:
• In a longitudinal study of a random sample of 400 new business establishments, it was found that 272 were still in existence after two years, and 174 were still in existence after four years. (Note: U.S. Bureau of Labor Statistics is a main source of such statistics. Numbers given in this example are approximately accurate but simplified to serve as an example.)
The law of averages tells us that the percentage of all new businesses surviving after two years is near 272/400 = 68 percent, and the percentage of all new businesses that survive after four years is near 174/400 = 44 percent (rounded).

To continue with the example, let us calculate the standard error and margin of error for the sample percentage of new businesses that are still in existence after four years. Note that the true population percentage is unknown, but we can use the sample percentage as its rough estimate in calculating SE:

SE for % = sqrt(44 x (100 - 44) / 400) = sqrt(6.16) = 2.5% (rounded)
ME for % = 2 x 2.5% = 5%

The standard error of 2.5% says that if we repeatedly took random samples of 400 new businesses, then the typical deviation of the sample percentage from the true population percentage would be about 2.5%. The margin of error of 5% says that we can reasonably expect the true population percentage to be within plus or minus 5% of the sample percentage (i.e. between 44 - 5 = 39% and 44 + 5 = 49%). This is an answer to the "How near?" question we asked earlier. Also, as we shall discuss in the next lesson, this interval from 39% to 49% is called a 95% confidence interval for the true population percentage.

Practice Exercises 2
Before continuing, do the following practice exercises: Exercises 8, 9 at the end of Section 3 of Chapter 9 (pages 320-321)
After doing these exercises, read Section 4 of Chapter 9.

Sample Average
As you just read in Section 4 of Chapter 9, we can say the following about a sample average:

SE for avg = SD of population / sqrt(n), where n is the sample size.
ME for avg = 2 x SE for avg.

Let us do an example:

• In a random sample of 500 workers in the United States, it was found that the average number of hours they work each week for pay is 48 hours. A rough estimate of the population standard deviation of weekly hours worked is 6 hours.

We used this example earlier when we stated that the law of averages tells us that the sample average of 48 hours is likely to be near the population average of hours worked (and vice versa).
To say a little bit more than this, let us calculate the SE and ME for the sample average:

SE for avg = 6 / sqrt(500) = 0.27 hours (rounded)
ME for avg = 2 x 0.27 = 0.54 hours

The standard error of 0.27 hours says that if we repeatedly took random samples of 500 workers, then the typical deviation of the sample average from the true population average would be about 0.27 hours. The margin of error of 0.54 hours says that we can reasonably expect the true population average to be within plus or minus 0.54 hours of the sample average (i.e. between 48 - 0.54 = 47.46 hours and 48 + 0.54 = 48.54 hours). Again, this is an answer to the "How near?" question we asked earlier. As we shall discuss in the next lesson, this interval from 47.46 hours to 48.54 hours is also called a 95% confidence interval for the true population mean.

Determining Necessary Sample Size
With our knowledge of ME and a minimum use of algebra, we can obtain another useful result, which is the answer to the following question:

• If we want to estimate an unknown population mean or percentage within a specified margin of error, how large a random sample do we need?

To answer this for estimating a population mean, we note that:

ME = 2 x SD of population / sqrt(Size of sample)
or, solving for the sample size,
Size of sample = (2 x SD of population / ME)^2

Observe that we need the SD of population in order to carry out this calculation. When SD of population is unknown, we need at least a rough estimate.

Example:
• How large a sample size do we need in order to estimate the average starting salary of MBAs in the U.S. within a margin of error of $2000? Assume that the population standard deviation is $10,000.

Applying the last formula, we get:
• Size of sample = (2 x 10000/2000)^2 = 100

Thus, we need a random sample of 100 MBA starting salaries in the U.S., preferably in the same year.

In a similar way, we can find the necessary sample size for estimating a population percentage.
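The calculations for a sample average in this part of the lesson can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard relationships SE = SD / sqrt(n) and ME = 2 x SE, which agree with the figures quoted for the workers and MBA examples; the function names are hypothetical.

```python
import math


def se_of_average(sd, n):
    """Standard error of a sample average: population SD over sqrt(n)."""
    return sd / math.sqrt(n)


def margin_of_error(se):
    """Margin of error as used in this lesson: twice the standard error."""
    return 2 * se


def sample_size_for_mean(sd, me):
    """Smallest sample size whose margin of error is at most `me`."""
    return math.ceil((2 * sd / me) ** 2)


if __name__ == "__main__":
    se = se_of_average(6, 500)               # weekly hours example
    print(round(se, 2))                      # 0.27
    print(round(margin_of_error(se), 2))     # 0.54
    print(sample_size_for_mean(10000, 2000)) # 100 (MBA salary example)
```

Rounding the sample size up rather than to the nearest integer is a design choice here: a slightly larger sample keeps the margin of error at or below the target.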
Omitting the algebra, the result is:

Size of sample = (2 / ME)^2 x p x (100 - p), where p is the true population percentage and ME is the desired margin of error in percentage points.

Observe that we need the true population percentage in order to carry out this calculation, but the true percentage is what we are trying to estimate in the first place. To get around this chicken-and-egg situation, we can use a very rough estimate of the true percentage, if one is available, or simply use 50% as the true percentage when we have no idea what it may be. This will give a somewhat higher-than-necessary sample size, but at least we shall have some idea of the largest sample size we need.

Example:
• How large a sample size do we need in order to estimate the percentage of large businesses that offer a company-sponsored retirement plan to their employees within a margin of error of 5%? Assume that we have no idea of the true percentage in the population of all large businesses.

Applying the last formula, we get:
• Size of sample = (2/5)^2 x 50 x (100 - 50) = 400

So, we need a random sample of 400 large businesses. You may want to see for yourself that the sample size will not change greatly if we use a different value for the true population percentage, but the sample size will be somewhat lower than 400. Try 40% or 60% and see what happens (you should find about 384).

Practice Exercises 3
Before continuing, do the following practice exercises: Exercises 10, 11, 14 at the end of Section 4 of Chapter 9 (pages 326-327)

When you are ready, continue to Week 2 Lesson 2.

Lesson 2: Confidence Intervals

Lesson Overview
When population characteristics of interest are unknown, such as a population mean or percentage, we can estimate them using sample data. We know, however, that we cannot expect to obtain perfectly error-free estimates from sample data. In this lesson, you will learn how to estimate a population mean or percentage, including how to find a confidence interval for each.

Lesson Objectives
When you finish this lesson, you should be able to:
1.
Describe the concept of a sampling distribution and the central limit theorem.
2. Define the meaning of a confidence interval for a population mean or percentage.
3. Determine a confidence interval for a population mean or percentage.

Week 2 Reading 3
Important! Before you begin this lesson, read pages 335-339 of Chapter 10 in your textbook.

Sampling Distribution
As you just read, the Central Limit Theorem is a key fact in understanding confidence intervals. In order to understand what it says, we need to understand an important concept called a sampling distribution, which will now be our focus.

The Concept of a Sampling Distribution
Sampling variability and margin of error are concepts that are parts of a bigger picture called a sampling distribution. A sampling distribution is a histogram of a sample statistic, such as a sample percentage or a sample average, obtained from many random samples taken from the same population. It is a rather abstract concept, so we shall try to explain it in the following way:

Imagine that you engage in the following hypothetical activity: Take a random sample of size n from a population of interest, calculate the sample average, and put the sample back into the population. Then, take another random sample, calculate the average, and put the sample back into the population. Do it again, and again, and again. If we continue in this way, we would end up with a large number of sample averages, one from each sample that was taken and put back. The histogram of these sample averages is called the sampling distribution of the sample average. The sampling distribution of a sample percentage is similar, but we calculate a sample percentage from each sample.

The concept of a sampling distribution is abstract and a bit confusing to students of statistics. Please think about the hypothetical activity of repeatedly taking random samples from a population, putting each one back before taking the next sample.
This is not something we can actually carry out in practice. In fact, don't try to do this at home! It will take too long, and you will even forget why you are doing it. We can simulate the activity using a computer, but even then, the key is your understanding that this is about the distribution of sample averages obtained from repeated sampling.

Central Limit Theorem
There is a very famous result in statistics that tells us that these sampling distributions are bell-shaped or at least approximately so. This result is called the Central Limit Theorem. More precisely, it says that, regardless of the shape of the distribution of data values in the population, the histogram of sample averages from repeated samples would look approximately normal if the sample size is reasonably large.

This important mathematical fact is illustrated in the Sample Distributions and The Central Limit Theorem (PDF). The best advice we can give you to help eliminate the confusion is this: look at the left side of the PDF when you are asked to think about the individual data values in a population. Look at the right side when you are thinking about sample averages (or sample percentages) taken from the population.

Two observations are useful:

1. Notice that the distribution of sample averages on the right is a narrower bell-shaped curve than the bell-shaped population curve on the left (please stop and look at the diagram again to make sure you see this). This is not a coincidence. You have seen in the last lesson that the standard error of the narrower curve on the right is equal to:

SE for avg = SD of population / sqrt(n)

As long as we have a sample size of at least 2 (usually larger), SE for avg will be smaller than SD of population, and therefore, the sampling distribution is narrower (has less variability) than the population distribution. In fact, the larger the sample size, the smaller the standard error, and the narrower the sampling distribution.

2.
The center of the sampling distribution on the right is exactly the same as the mean µ of the population distribution. This is the basis for what we called the law of averages in the last lesson. Sample averages will vary around µ, and the larger the sample size, the closer to µ they will be.

If the distribution of sample averages is bell-shaped, then we learned (in the lesson on the Normal distribution) that 95% of sample averages will fall within plus or minus 1.96 standard errors from the mean. Rounding 1.96 to 2, we get the margin of error that you saw in the last lesson. We described ME as the largest distance you would reasonably expect to see between the sample average and the population average. As you can now surmise, "reasonably expect" means that we can say with approximately 95% confidence that the largest distance is within the margin of error.

It may seem that the Central Limit Theorem deals only with sample averages of quantitative data, so it is not useful for sample percentages. This is definitely not true. A sample percentage is really a sample average in disguise. To see why, imagine that we have the following gender data for a small sample of 7 individuals (pictured on the right). In the table, we added a third column labeled Gend, where gender has been re-coded as Male = 0 and Female = 1. At the bottom of the Gend column, you see the average of the 7 values in that column.

• avg = (0+1+1+0+1+0+1)/7 = 4/7 = 0.57

This is the proportion of females in this sample, which is 57% (after multiplying by 100). Therefore, we can think of a sample percentage as 100 times the sample average of a variable whose value is 0 or 1. Consequently, the Central Limit Theorem applies to sample percentages just as well as it applies to sample averages.

Let us summarize what we learned thus far about the margin of error ME.
For a sample average, ME is the largest distance you would reasonably expect to see between the sample average and the population average. For a sample percentage, it is the largest distance you would reasonably expect to see between the sample percentage and the population percentage. In either case, ME is twice the standard error, and using the Central Limit Theorem, we explained that "reasonably expect" means "we are 95% confident." Now, we are ready to discuss confidence intervals.

Week 2 Reading 4
Before continuing, read pages 339-344 of Chapter 10. Please do not continue to the next page until you have done so.

Confidence Interval for a Population Mean or Percentage
Putting together what we learned thus far, we can make the statement:

• We are 95% confident that the distance between the sample average and the population average is no larger than ME.

This is the same as the following statement:

• We are 95% confident that the sample average falls in the range from µ - ME to µ + ME.

This in turn is the same as the statement:

• We are 95% confident that µ falls in the range from (sample average) - ME to (sample average) + ME.

When the mean µ of a population is unknown and we want to estimate it using sample data, the last statement provides an interval estimate with 95% confidence. Therefore, the interval from (sample average) - ME to (sample average) + ME is a 95% confidence interval for the population mean.

Please note: If we want a level of confidence other than 95%, then we cannot use plus or minus twice the standard error to calculate ME. We need to use the appropriate z-value that we can look up in the Normal table. For example:

• for a 99.7% confidence interval, we should use plus or minus 3 times the standard error,
• for a 99% confidence interval, we should use plus or minus 2.58 times the standard error,
• for a 90% confidence interval, we should use plus or minus 1.64 times the standard error, and so on.
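The z-values listed above, and the resulting confidence intervals, can be reproduced with Python's standard library instead of a Normal table. This is a minimal sketch with hypothetical function names; it uses the exact z-value (about 1.96 for 95%) rather than the rounded value of 2, so its 95% intervals come out slightly narrower than the ones computed by hand in the previous lesson.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided z multiplier for a confidence level, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def ci_for_mean(avg, sd, n, confidence=0.95):
    """Confidence interval for a population mean (large-sample, z-based)."""
    se = sd / math.sqrt(n)
    z = z_value(confidence)
    return avg - z * se, avg + z * se


def ci_for_percentage(pct, n, confidence=0.95):
    """Confidence interval for a population percentage (pct on a 0-100 scale)."""
    se = math.sqrt(pct * (100 - pct) / n)
    z = z_value(confidence)
    return pct - z * se, pct + z * se


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.99, 0.997):
        print(level, round(z_value(level), 2))
    print(ci_for_mean(48, 6, 500))      # weekly hours example
    print(ci_for_percentage(44, 400))   # four-year business survival example
```

Passing a different `confidence` argument, such as 0.99, widens the interval, which reflects the trade-off between confidence and precision.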
To summarize:
A 95% confidence interval is the sample average plus or minus the margin of error (which is twice the standard error).
A 99.7% confidence interval is the sample average plus or minus three times the standard error.
A 90% confidence interval is the sample average plus or minus 1.64 times the standard error.

To avoid any misinterpretation, it is worth emphasizing that:
A 95% confidence interval does not mean 95% of the population values fall in that interval, or that we are 95% confident someone from the population falls in that interval.
A 95% confidence interval does mean we are 95% confident the population average or population percentage falls in that interval.

Technical Note
This is not discussed in your textbook until Section 6 of Chapter 9, and it will not be included in any quiz!

Calculation of a confidence interval requires SE for avg or SE for %, which in turn requires knowledge of the true population standard deviation or true population percentage. If either of these population values is unknown, and if the sample size is large, we can just use the sample standard deviation or sample percentage in the calculations. On the other hand, if the sample size is really small (under 30), then an adjustment needs to be made in the calculations. This adjustment uses the values from a slightly different bell-shaped distribution, called the t-distribution, instead of z-values from the standard Normal distribution. This technicality is not covered in your textbook. The t-distribution is very similar to, and just slightly wider than, the z-distribution (unless the sample size is very small, in the single digits, in which case the difference is more substantial).
The purpose of this note is merely to make you aware that, technically speaking, it is more correct to use t-values instead of z-values with very small sample sizes and unknown population standard deviation. Again, this is not required knowledge for grading purposes in this course.

Practice Exercises 4
Before continuing, do the following practice exercises: Exercises 10, 12, 16 at the end of Chapter 10 (pages 350, 352).
When you are ready, continue to Week 2 Lesson 3.

Lesson 3: Hypothesis Testing

Lesson Overview
In statistics, hypotheses are conjectures or claims about unknown population characteristics. If we have access to representative sample data, we can test whether a hypothesis is supported or not supported on the basis of sample data. In this lesson, you shall learn how to test hypotheses about a population mean or percentage. The basic ideas you shall learn are applicable for other hypotheses as well, but the details depend on the situation at hand, and we shall not cover them in this lesson.

Lesson Objectives
When you finish this lesson, you should be able to:
1. Explain null and alternative hypotheses.
2. Define what is meant by statistically significant evidence against a null hypothesis.
3. Demonstrate how to conduct hypothesis tests.

Week 2 Reading 5
Important! Before you begin this lesson, read Chapter 11, but omit Sections 5 and 6. Please do not continue to the next page until you have done so.

A Hypothesis
A hypothesis is a conjecture or claim about an unknown situation of interest. Hypotheses can be generated on the basis of actual observations or on the basis of previously untested theories. For each and every hypothesis, there is a dual hypothesis, which is the opposite of the conjecture or claim being made. Here are a few examples of hypotheses and their opposites, stated in such pairs:

A new drug is effective in treating an illness. New drug is not effective in treating the illness.
A new production method reduces costs. New method does not reduce costs.
A new training method improves worker productivity. New method does not improve worker productivity.
A majority of voters support stem cell research. A majority of voters do not support stem cell research.
A person accused of having committed a crime is guilty. The accused person is not guilty.
GMAT scores of MBA candidates are increasing over time. GMAT scores are not increasing over time.
The variables "Price" and "Quantity Sold" of a product are related to each other. Price and Quantity Sold are not related.

In each of these examples, the first statement represents a new claim for which supporting evidence is needed, and the second says "not so" or "the claim is not true." In statistics, the new claim for which supporting evidence is needed is called the alternative hypothesis or research hypothesis, and its opposite asserting "not so" is called the null hypothesis. This terminology is standard in statistics and you should know it. In each of the examples above, the second statement in the pair would be the null hypothesis, and the first statement would be the alternative hypothesis.

Statistical Approach to Hypothesis Testing
In order to see if a new claim is supported by data, here is how we proceed:
We state the new claim as the alternative hypothesis, and its opposite as the null hypothesis.
We then take a skeptical view of the new claim, and give the benefit of the doubt to the null hypothesis. This means that we shall not reject the null hypothesis unless the sample data provides sufficiently strong evidence against the null hypothesis.
Then we collect a random sample of data, or use a random sample if it is available.
Next, using a clearly defined method, we determine if the sample data provides sufficiently strong evidence against the null hypothesis.
If so, we say that we found statistically significant evidence against the null hypothesis and it is rejected (and the alternative hypothesis is supported). Otherwise, the evidence against the null hypothesis is weak and it cannot be rejected. In that case, we conclude that there is not enough evidence to support the alternative hypothesis.

We shall translate this approach to specific steps a little later in this lesson. At this point, it is important that you understand the approach just described. For example, it is important to keep in mind that we can only reject or not reject the null hypothesis based on evidence; we cannot conclude for sure that the null hypothesis is true or false. Statistics has no magical powers to assert the truth of any statement about a population when all we have is sample data. So our conclusion can possibly be incorrect.

What do we mean by giving the benefit of the doubt to the null hypothesis? We mean that there should be a small chance of rejecting it if it is in fact true. Since we may never know if the null hypothesis is actually true or false, we have to work with the probability of incorrectly rejecting it. The largest value of this probability that we are willing to tolerate is called the significance level of the test, and it is denoted by the Greek letter alpha, α.

Please note: Just as it is possible to incorrectly reject a true null hypothesis (with probability α), it is also possible to incorrectly fail to reject a false null hypothesis. Statisticians refer to the first of these two types of error as Type I error, and to the second as Type II error (with some probability β). This terminology is not of importance to non-statisticians, so it is not essential for you to learn. It turns out that once the sample size is fixed and α is selected, there is not much one can do about the probability of Type II error. So, again, you need not worry about it.
The exact statement of the null and alternative hypotheses can have several possible forms. One basic form is as follows:
Null hypothesis: Population mean (or percentage) is equal to a hypothesized value.
Alternative hypothesis: Population mean (or percentage) is not equal to the hypothesized value.
Since "not equal to" can occur either as "greater than" or as "less than" the hypothesized value, the test of these hypotheses is often called a two-tailed test. When the alternative hypothesis posits only "greater than" or only "less than", the test is called a one-tailed test.

Steps in Conducting a Test
The following are the general steps in any hypothesis test, not just hypotheses about a population mean or percentage:

1. Tentatively assume that the null hypothesis is true, and select a value for the significance level. Great precision in the choice of α is not very important. The most commonly used value is 0.05, but smaller or larger values could be selected, and there are no specific rules for this choice. Remember that the smaller the value of α, the greater the benefit of the doubt we are giving to the null hypothesis. In situations where the consequences of the rejection of a true null hypothesis are very serious (e.g. falsely convicting an innocent person, declaring a nuclear power plant safe when it is not, etc.), a very small value of α should be used (such as 0.01 or 0.001 or even smaller). Larger values can be used when consequences are not so serious.

2. Compute the appropriate sample statistic to be used for the test and ask "Can such a result be reasonably expected if the null hypothesis is true?" Reasonably expected usually means "with 95% confidence" or "with 5% significance," but again, the exact value of the confidence level is determined by the selected value of α. To be more precise, it is (1 - α) x 100%. The appropriate statistic for testing hypotheses about a population mean is the sample mean.
For testing hypotheses about a population percentage, the sample percentage is the appropriate statistic.

3. If the answer in the previous step is Yes, then the evidence against the null hypothesis is not statistically significant. The null hypothesis cannot be rejected, and there is not enough evidence to support the alternative hypothesis. If the answer is No, then the observed sample statistic is highly unlikely to have occurred if the null hypothesis is true. Then, there is statistically significant evidence against the null hypothesis, so it should be rejected and the alternative hypothesis is supported.

In the second step of the process just described, there are three different ways of judging whether a sample result is as we would reasonably expect if the null hypothesis is true. In reality, all three ways are equivalent; that is, using the same hypotheses, sample data, and significance level α, they all reach the same conclusion. In a given situation, one may be a little bit easier to carry out than the others, in which case that could be a reason for choosing it over the others. Otherwise, there is no reason to choose one over the others. If you face a hypothesis testing situation, in this course or elsewhere, pick the method that is easiest for you.

We now describe and give an example of each method. In the descriptions, we assume that the null and alternative hypotheses have been identified, a random sample of data has been obtained, and the significance level α has been selected.

Method 1
To reach a conclusion, use the distance of the sample average from the hypothesized value of the population average under the null hypothesis (if the test is about a population percentage, find the distance of the sample percentage from the hypothesized population percentage).
Then, the rule for reaching a conclusion is as follows (see page 189 of textbook):

With a significance level of α = 0.05 or 5%, a sample average (or percentage) that is
… more than two standard errors away from what you would have expected under the null hypothesis is called statistically significant, and can be considered beyond the reasonable range of chance variation. In this case, the null hypothesis should be rejected and the alternative hypothesis is supported.
… within two standard errors of what you would expect under the null hypothesis is viewed as being consistent with chance variation. The null hypothesis cannot be rejected.

If we want to use a significance level other than α = 0.05, then we need to use the corresponding z-value instead of the z-value of two.

We will now give two examples, one involving hypothesis testing about a population mean and another about a population percentage. We will then apply Method 1 in each example. Later, we will also apply Methods 2 and 3 to these same examples.

Example 1a. Hypothesis testing about population mean
A health insurer wants to know if having a nurse available to answer simple questions over a telephone hotline could cut the costs of doctor visits. Records show the annual average cost per family is typically around $900 with an SD of $100. The insurer decides to conduct an experiment where they randomly select 400 families, give them access to a hotline, and record their health care costs. After a year, they observe that, in the sample families with hotline access, the average cost per family is $850. Is there sufficient evidence that the hotline reduced the cost of doctor visits?

Answer to Example 1a using Method 1:
Let us first formulate the hypotheses. The benefit of the doubt should be given to the assertion that the hotline does not reduce costs.
Null hypothesis: Average annual cost of doctor visits for all families with access to a hotline is not less than $900 (that is, having a hotline has not reduced costs).
Alternative hypothesis: Average annual cost of doctor visits for all families with access to a hotline is less than $900 (that is, hotline has reduced costs).

Let us use α = 0.05 or 5%. We may observe that this is a one-tailed test since the alternative hypothesis is one sided. To find how far the sample average of $850 is below the hypothesized value of $900, we first calculate the standard error:

SE for avg = SD/√n = 100/√400 = $5

The sample average of $850 is (850 - 900)/5 = -10.0, that is, 10 standard errors below $900. If the population average were in fact $900 (or more), we could not reasonably expect a sample average to be this far below the population average by chance (we would expect it to be within about two standard errors). We must reject the null hypothesis and conclude that the hotline does reduce the cost of doctor visits. The alternative hypothesis is supported by the sample data.

Example 1b. Hypothesis testing about population percentage
In 2005 the Small Business Administration became concerned¹ that a high percentage of the billions of dollars in special loans earmarked for 9/11 terrorism recovery may have been given to unqualified businesses. Out of 59 randomly sampled loans, only nine appeared to be given to qualified businesses. Is there sufficient evidence that most loans were being given to unqualified businesses?

Answer to Example 1b using Method 1:
Let us first formulate the hypotheses. The benefit of the doubt should be given to the assertion that a majority of loans are being given to qualified businesses.
Null hypothesis: 51% or more of the loans are being given to qualified businesses.
Alternative hypothesis: Less than 51% of the loans are being given to qualified businesses.

Let us assume that α = 0.05 or 5%.
We may observe that this is also a one-tailed test since the alternative hypothesis is one sided. The sample percentage in this case is (9/59) x 100 = 15%. To find how far this sample percentage is below the hypothesized value of 51%, we first calculate the standard error:

SE for % = √(15 x 85/59) ≈ 4.6%

The sample percentage of 15% is (15 - 51)/4.6 = -7.8, that is, 7.8 standard errors below 51%. If the population percentage were in fact 51% (or more), we could not reasonably expect a sample percentage to be this far below the population percentage by chance (we would expect it to be within about two standard errors). We must reject the null hypothesis and conclude that the majority of loans are not being made to qualified businesses. The alternative hypothesis is supported by the sample data.

¹ SBA Finds 9/11 Loan Recipients Ineligible, by Larry Margasak, Associated Press Writer, Thu Dec 29, 7:14 AM ET.

Method 2
Tentatively assuming the null hypothesis is true, determine how likely it is to observe a sample average (or sample percentage) as extreme as the one we observed in our sample. This is called the p-value or observed significance of the sample statistic. If the p-value is very small, then the observed value of the statistic (sample average or percentage) is too extreme to have occurred by chance under the null hypothesis. Here is the simple rule for this case:

If the observed p-value of the sample average (or percentage) is less than the significance level α, then the evidence against the null hypothesis is statistically significant, and the null hypothesis should be rejected. If the p-value is greater than α, then the null hypothesis cannot be rejected.

Calculation of the p-value is a technical detail that is best done with computer software because the calculation is slightly different depending on whether the test is two-tailed or one-tailed. If the test is one-tailed, the p-value is found as an area at one tail of the sampling distribution of the test statistic.
In a two-tailed test, the p-value is the sum of two areas, half at the upper tail and the other half at the lower tail. Neither this technical detail, nor the exact calculation of the p-value is important for you to learn, and they will not be included in any quiz.

On the other hand, if the p-value has been calculated, you should know how to use it to reach a conclusion because it is so simple. The null hypothesis is rejected if the p-value is smaller than α, and the null hypothesis cannot be rejected if the p-value is larger than α. For example, if the p-value has been calculated as 0.023, and if we have selected a significance level of α = 0.05, then we should reject the null hypothesis since 0.023 is smaller than 0.05. In fact, the null hypothesis would be rejected with any value of α larger than 0.023, and not rejected with any value of α smaller than 0.023.

Using Method 2 in Example 1a.
In applying Method 1 to this example, we calculated that the sample average of $850 is (850 - 900)/5 = -10.0, that is, 10 standard errors below the hypothesized population average of $900. In this case, the p-value is the area lying below -10 standard errors under the Normal curve (which is the sampling distribution of the sample average by the Central Limit Theorem). Minus 10 standard errors is so far below the mean that this area is practically zero. And, since a p-value of practically zero is smaller than α = 0.05, we must reject the null hypothesis and conclude that having a hotline reduces the cost of doctor visits.

Using Method 2 in Example 1b.
In applying Method 1 to this example, we calculated that the sample percentage of 15% is 7.8 standard errors below the hypothesized population percentage of 51%. The p-value is the area lying below -7.8 standard errors under the Normal curve representing the sampling distribution of the sample percentage. Minus 7.8 standard errors is so far below the mean that this area is practically zero.
And, since a p-value of practically zero is smaller than α = 0.05, we must reject the null hypothesis and conclude that the majority of loans are not being made to qualified businesses.

Method 3
Find a (1 - α) x 100 percent confidence interval for the population average (or percentage). This interval shows the plausible values of the average (or percentage) based on the sample data we have. Therefore, if the hypothesized value of the population average (or percentage) is somewhere in the confidence interval, then the hypothesized value is plausible and we cannot reject the null hypothesis. If it is not anywhere in the confidence interval, then there is sufficient evidence to reject the null hypothesis.

If the (1 - α) x 100 percent confidence interval does not contain the value(s) of the population average (or percentage) specified in the null hypothesis, then the evidence against the null hypothesis is statistically significant, and the null hypothesis should be rejected. Otherwise, the null hypothesis cannot be rejected.

Using Method 3 in Example 1a.
Since we are using α = 0.05, we should first find a 95% confidence interval for the mean cost of doctor visits for all families with access to a hotline. As we learned in the last lesson, this interval is the sample average plus or minus two standard errors (ME = 2 x $5 = $10). Therefore, this interval is from $850 - $10 to $850 + $10, or from $840 to $860. This interval does not contain $900 or any larger value, as specified in the null hypothesis. Therefore, we must reject the null hypothesis.

Using Method 3 in Example 1b.
Since we are using α = 0.05, we should first find a 95% confidence interval for the population percentage of loans awarded to qualified businesses. This interval is the sample percentage 15% plus or minus two standard errors (ME = 2 x 4.6 = 9.2%). Therefore, this interval is from 15% - 9.2% to 15% + 9.2%, or from 5.8% to 24.2%.
This interval does not contain 51% or any larger percentage, as specified in the null hypothesis. Therefore, we must reject the null hypothesis, and conclude that a majority of loans are not being awarded to qualified businesses.

Remember: If you face a hypothesis testing situation, pick the method that is easiest for you to use. All three methods are equivalent, and you only need to use one of them in any given situation. If the p-value has been calculated (as statistical software often provides), then Method 2 is probably the simplest because all you have to do is compare the p-value with the significance level α.

Practice Exercises 5
Before continuing, do the following practice exercises: Exercise 2 at the end of Section 3 (page 370), Exercise 4 at the end of Section 4 (page 373), and Exercise 17 (page 388) at the end of Chapter 11.

Solutions to Week 2 Practice Exercises
It is expected that you attempt the assigned textbook problems before you look at the Textbook Solutions for Week 2.

Week 2 Assignment 1 - Homework Assignment
Total Points Possible: 18 Points
Complete the following textbook exercises by 9:00 a.m. E.T. on Day 7.
Chapter 9 Exercises 4 (page 310), 22 (page 329), 28 (page 330)
Chapter 10 Exercises 17 (page 352), 20 (page 353), 28 (page 355)
Chapter 11 Exercises 11 (page 386), 13 (page 387), and Exercise X.

Exercise X
Where do CFOs get their money news? According to Robert Half International, 47% get their money news from newspapers, 15% get it from communication/colleagues, 12% get it from television, 11% from the Internet, 9% from magazines, 5% from radio, and 1% don't know. Suppose a researcher wants to test these results. She randomly samples 67 CFOs and finds that 40 of them get their money news from newspapers. Does the test show enough evidence to reject the findings of Robert Half International?
Each question in the graded homework assignment will be evaluated using the following:

Points
Thorough display of work and correct answer: 2
Display of work and partially correct answer: 1
Insufficient display of work, or incorrect answer, or unanswered question: 0

Uploading Files
To upload a file, select 'Browse My Computer', locate the file you wish to upload, and double-click on the file name. The file will appear in the 'Attached files' section. If you agree to abide by Northeastern University's Academic Honesty and Integrity Policy, select the 'Submit' button to upload your file. Submit your answers using the dropbox located in the Assignments area of the course. The answers will be available on Day 7 at 9:01 a.m. E.T. after you submit your assignment.

IMPORTANT: Quiz/Exam Advisory
If you are using the Microsoft Internet Explorer 8 or 9 browser, you will need to change your browser settings to "Compatibility View" before taking your exam/quiz. See Internet Explorer 8 Compatibility View for instructions on how to change compatibility view. Alternatively, you can download the Mozilla Firefox browser to use to take your exam/quiz. If you require further assistance, please contact the Help Desk: 1-866-291-8058

ProctorU
The exam is password protected and can only be unlocked by ProctorU. Please be certain to schedule your exam time at least 72 hours before taking the exam. You will need to know your course's CRN number when registering in order to register for the correct section. The CRN number can be found in the top left corner of Blackboard and is usually a 5 digit number following the course code. Also, it is crucial that you check all of the technical requirements for ProctorU prior to the time of your exam and verify that your computer system meets ProctorU Technical Requirements. You can do this by visiting ProctorU Technical Requirements and using the Test It Out tool.
There is also a 'Connect to a live person' option on the Technical Requirements page where you can have further testing done on your system by a live ProctorU representative. This is required to ensure you will be able to complete your exam without issue.

Week 2 Quiz 1
Value: 10 Points
Available: 8:00 p.m. E.T. on Day 5 for 52 hours.
Time limit: 60 minutes

Please read these rules first!
This is a timed quiz that has 10 questions. You have up to 60 minutes to complete the quiz. Once you start, you can temporarily stop and then continue, but the clock will NOT stop running.
Each question will be displayed on a separate page. In order to leave ample time for each question, try not to spend more than 6 minutes for any one question.
Note: Blackboard is configured to display the same quiz instructions above each question. These instructions remain constant throughout the quiz, so once you have read them, you will not need to spend your quiz time reading the instructions above each question.
Below each question, there will be some choices and you are asked to select one of them as your answer. In some questions, it may be necessary to do a few calculations to help you select an answer. For numeric answers, select the answer that is closest to your calculation.
Before you begin, have your working tools easily accessible, such as your book, calculator, table of areas under the Normal curve and a pen and some paper.

To attempt this quiz, follow these instructions:
o Select 'Begin' to start the quiz.
o After answering each question, select the 'Save answer' button.
o After answering and saving the last question, select 'Save and submit'.

Note: Week 2 Quiz will not be displayed on Week 2 menu until 8:00 p.m. E.T. on Day 5. Correct answer feedback will be available at 9:00 a.m. E.T. the following day.

Explanation & Answer


1.

A company has estimated the standard deviation of individual sales transactions to be $2000.
Approximately 100 sales transactions take place each week. What is an estimate of the standard error of
the average weekly sales?
ANSWER:
The standard error of the sample average is given as s/sqrt(n), which is 2000/sqrt(100) = 200.
Thus, the standard error of the average weekly sales is $200.
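
As a quick check of the arithmetic above, here is a minimal Python sketch of the standard error of an average. The function name `standard_error_of_mean` is chosen for illustration only.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of a sample average: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# Question 1: SD of individual transactions = $2000, about 100 transactions per week
se = standard_error_of_mean(2000, 100)
print(se)  # 200.0
```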

2.

A commodity trader who usually makes around 25 transactions per week obtained the following graph
showing the average transaction size in $1000 for each of the last 20 weeks. Each dot represents the
weekly average (in $1000) of about 25 transactions. By examining the graph, would you estimate the
standard error for the weekly average of transaction sizes to be closest to $2,000, $20,000, or $50,000?

ANSWER:
The mean lies around the middle of the graph, and $2,000 is too small to be the standard error. From
the mean, the weekly averages deviate by up to around $30,000, so $50,000 is too large to be the standard
error. Hence, $20,000 is the answer.

3.

A media rating company that publishes television ratings uses a random sample of 1600 households and
finds that 20% watched a particular show in a given week. What is the approximate margin of error?
ANSWER:
We are given that a random sample of 1,600 households is used for the rating. That is, n = 1,600.
The percentage of households that watched the show in the week is 20%, so p = 20/100 = 0.2.
The margin of error with 95% confidence is obtained by substituting the critical value 1.96, p = 0.2 and
n = 1,600 into the margin of error formula:
E = 1.96*sqrt[0.2*(1-0.2)/1600] = 1.96*0.01 = 0.0196, or about 2%
Thus, the approximate margin of error is 2%.
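
The same calculation can be sketched in Python. This is only an illustration of the formula E = z*sqrt[p(1-p)/n]; the function name and the default z = 1.96 are assumptions made for the example.

```python
import math

def margin_of_error_proportion(p, n, z=1.96):
    """Margin of error for a sample proportion p based on n observations."""
    return z * math.sqrt(p * (1 - p) / n)

# Question 3: 20% of 1,600 sampled households watched the show
me = margin_of_error_proportion(0.20, 1600)
print(f"{me:.4f}")  # 0.0196, i.e. about 2 percentage points
```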

4.

Polls showed that the two main candidates in the 2004 presidential election were nearly tied on the day
before the election. To predict the winner, a newspaper would like to have a poll which has margin of
error of 5%. Roughly how large a sample would be needed for such a poll? (Hint: since candidates are
nearly tied, each has roughly 50% of the vote.)
ANSWER:
The sample size needed for the poll is calculated as follows:
The margin of error is 5%. That is, E = 0.05
The candidates are nearly tied, so each has roughly 50% of the vote. That is, p = 0.5
→ 0.05 = 1.96*sqrt[0.5*(1-0.5)/n] → 0.05/1.96 = sqrt(0.5*0.5)/sqrt(n) → sqrt(n) = 1.96*0.5/0.05 → sqrt(n) = 19.6,
so n = 384.16, which rounds up to 385
Thus, the newspaper would need a sample of about 385 people (roughly 400) for the poll.
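
Solving E = z*sqrt[p(1-p)/n] for n gives n = (z/E)^2 * p(1-p). The sketch below evaluates that expression; the function name is illustrative, and the z = 2 variant is the common rule-of-thumb shortcut rather than something stated in the question.

```python
import math

def sample_size_for_proportion(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n with z*sqrt(p(1-p)/n) <= margin, i.e. n = (z/margin)^2 * p(1-p), rounded up."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))

n_exact = sample_size_for_proportion(0.05)         # z = 1.96
n_rule = sample_size_for_proportion(0.05, z=2.0)   # rule of thumb, z = 2
print(n_exact, n_rule)  # 385 400
```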

5.

Using a random sample of 100 workers, researchers calculate a 95% confidence interval for the average
hourly wage earned by construction workers in the city of Boston. This interval they calculate is from $18
to $26. Does this mean we can say that roughly 95% of construction workers in the city earn hourly wages
between $18 and $26?
ANSWER:
No, it does not mean that roughly 95% of construction workers in the city earn hourly wages between $18
and $26. The interval describes the average: we are 95% confident that the average hourly wage falls
between $18 and $26.

6.

A hotel has 180 rooms. Past experience shows that only 84% of people who reserve rooms actually show
up. If the hotel takes 200 reservations, what is the chance there will be enough rooms for the people who
actually show up? (Hint: You want to find the chance that, in the sample of 200 reservations, the
percentage of those who show up will be 90% (180/200) or less, knowing that true population percentage
is 84%.)
ANSWER:
Here the proportion of people who show up = 0.84
std deviation = (p(1-p)/n)^(1/2) = (0.84*0.16/200)^(1/2) = 0.0259
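
As a rough check of this standard deviation, and of the normal-approximation probability the hint asks for, here is a minimal sketch. It assumes the normal approximation to the sample proportion with no continuity correction; the variable names are illustrative.

```python
import math
from statistics import NormalDist

p, n, rooms = 0.84, 200, 180

sd = math.sqrt(p * (1 - p) / n)   # SD of the sample proportion
z = (rooms / n - p) / sd          # z-score of 90% relative to 84%
prob = NormalDist().cdf(z)        # chance that at most 180 guests show up

print(f"sd = {sd:.4f}, z = {z:.2f}, P = {prob:.2f}")
```

Under these assumptions the probability comes out at about 0.99.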
Hence, the chance there will be enough rooms for the people who actually show up = probability at most
180 people show up = P(Z ≤ (0.90 - 0.84)/0.0259) = P(Z ≤ 2.31) ≈ 0.99, or about 99%.

7.

[...] 820 +/- 1.96*225/10 --> (775.9, 864.1)
Thus, the confidence interval will be approximately from $780 to $860
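
The interval arithmetic 820 +/- 1.96*225/10 can be checked with a short sketch. The sample size n = 100 is an assumption inferred from the divisor 10 = sqrt(100), and the function name is illustrative.

```python
import math

def mean_confidence_interval(mean: float, sd: float, n: int, z: float = 1.96):
    """Approximate confidence interval for a mean: mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# mean = 820, SD = 225, n = 100 (assumed from the "/10" in the calculation)
low, high = mean_confidence_interval(820, 225, 100)
print(f"({low:.1f}, {high:.1f})")  # (775.9, 864.1)
```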

8.

A health insurer wants to know if having a nurse available to answer simple questions over a telephone
hotline could cut costs by eliminating unnecessary doctor visits. Past records show the yearly average cost
of doctor visits per family is around $875 with an SD of $300. They randomly select 225 families, give
them access to the hotline, and record the yearly costs of their doctor visits. In this sample, they find the
average yearly cost per family to be $840. We want to test the null hypothesis that the hotline does not
reduce costs, that is, average cost is $875. What would be the conclusion based on this data?
ANSWER:
Null Hypothesis: The yearly average cost of doctor visits per family after giving them access to the hotline
is $875, that is, the hotline does not reduce costs.
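
One way to carry the test through is sketched below. It assumes a one-sided z-test that treats the historical SD of $300 as known, and a 5% significance level; neither choice is stated in the question, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, sd, n, xbar = 875, 300, 225, 840

se = sd / math.sqrt(n)           # standard error of the sample mean: 300/15 = 20
z = (xbar - mu0) / se            # test statistic: (840 - 875)/20 = -1.75
p_value = NormalDist().cdf(z)    # one-sided (lower-tail) p-value

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
print("reject null at 5%" if p_value < 0.05 else "do not reject null at 5%")
```

Under these assumptions the p-value is about 0.04, so the data would lead to rejecting the null hypothesis at the 5% level, though not at the 1% level.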

9.

Market researchers would like to know if consumers can taste the difference between a product made
with low calorie oil and the same product made with regular oil. A random sample of 310 people are
blindf...

