September 2, 2018

User Generated

cerggloynpx2005

Business Finance

Description

Minimum of 1,250 words, with three scholarly sources in APA format. My name is Adrienne.

1. This is an Excel exercise in which you chart the rank of your name as far back as the data allow. If your name is unique and does not fall within the top 1,000, pick another name from your family (such as your mother's or father's).

  • Each year the Social Security Administration (SSA) puts out a press release on the most popular baby names. This list is based on the names submitted to the SSA on applications for a Social Security number. Search the SSA website for most popular baby names: https://www.ssa.gov/OACT/babynames/#ht=2
  • Go to the website and, toward the bottom of the page, choose the option for Popularity of a Name. Enter your name and search up to 100 years back. The website will give you the rank of your name for each of the past 100 years. Some names fall in and out of favor over time; years are omitted if your name is not within the top 1,000 for that year. If your name does not show up, substitute the name of another family member.
  • The web table can be easily copied and pasted into Excel. Just grab the two columns (year and rank) and copy them. Then open Excel and paste the results in. It should go smoothly.
  • We will use Excel to create a chart of the rank of your name over the past 100 years. This is relatively easy to do in Excel. I am going to walk you through the steps.
  • The data are in reverse order (newest to oldest); we want to sort by year, oldest to newest. Select both columns (with the headers), then choose Data and Sort. The Sort dialog lets you indicate that the data have headers, pick the sorting variable, and choose ascending or descending order. If you make a mistake, click Edit and then Undo.
  • Select the rank column, starting at the earliest year, and then insert a graph (Insert, then choose the graph type; I recommend a line graph).
  • After the initial graph, you can set the Year column as the x-axis by clicking on the chart, selecting Chart Design, choosing Select Data, and editing the horizontal (Category) axis labels (use the icon in the right corner to select the rows and columns for the x-axis).
  • For the rest of the graph, you are on your own to add a connecting line, title, and subtitles. This is dressing up the chart to show to others. Explore with right clicks or double clicks and you will find a way forward, or use Google to search “Excel how do I _____?”
  • Submit your graph. (An optional scripted version of these steps is sketched below.)
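If you would rather script the chart than click through Excel, here is a minimal Python sketch of the same steps (sort by year, then draw a line chart). It assumes the two SSA columns have been saved to a hypothetical file named name_rank.csv with the headers Year and Rank; Excel remains perfectly acceptable for the actual submission.

```python
# Minimal sketch of the charting steps in Python (pandas + matplotlib).
# Assumes a hypothetical file "name_rank.csv" holding the Year and Rank
# columns copied from the SSA popularity table.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("name_rank.csv")
df = df.sort_values("Year")               # oldest to newest, like Data > Sort in Excel

plt.plot(df["Year"], df["Rank"], marker="o")
plt.gca().invert_yaxis()                  # rank 1 is the most popular, so put it at the top
plt.title("Popularity Rank of a Name by Year")
plt.xlabel("Year")
plt.ylabel("Rank (1 = most popular)")
plt.show()
```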

2. The following is a small data set of 25 observations. I want you to calculate some statistics by hand to cement the class material; you may use Excel to check your work.

  • Var: 28; 4; 27; 23; 17; 38; 21; 16; 28; 15; 23; 33; 34; 42; 42; 14; 14; 28; 22; 31; 18; 28; 17; 17; 30
  • Create a stem and leaf plot of the data (this is easy to do in Word by making a two-column table, entering the numbers, and bolding the stems).
  • Calculate the following (show your work): mean, median, and mode. (A short scripted check follows this list.)
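A quick scripted check of the hand calculations, using Python's statistics module and the 25 values listed above. The stem-and-leaf portion simply groups each value by its tens digit, mirroring the two-column layout suggested for Word; this is a check, not a substitute for showing your work.

```python
# Check the Problem 2 hand calculations with Python's statistics module.
from collections import defaultdict
from statistics import mean, median, mode

var = [28, 4, 27, 23, 17, 38, 21, 16, 28, 15, 23, 33, 34, 42, 42,
       14, 14, 28, 22, 31, 18, 28, 17, 17, 30]

print(mean(var), median(var), mode(var))   # 24.4, 23, and 28

# Quick stem-and-leaf: stems are tens digits, leaves are ones digits.
stems = defaultdict(list)
for x in sorted(var):
    stems[x // 10].append(x % 10)
for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```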

3. Below are the infant mortality data for 34 OECD countries. The Organisation for Economic Co-operation and Development (OECD) is an international economic organization of 34 countries founded in 1961 to stimulate economic progress and world trade. Infant mortality (the rate of death of children under 1 year of age per 1,000 live births) is a common measure of development. The data, taken from the OECD's website, appear in the table below.

  • Create a stem and leaf plot of the data.
  • Calculate the mean, median, and mode for this data.
  • Briefly describe the distribution—focus on the shape of the distribution and whether there are any outliers or strange values.
  • The sum of the values, Sum(x), for the 34 OECD countries is 128.20. (A one-line scripted check of the mean follows this list.)
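Only the sum is given here, so the mean can be checked directly as Sum(x)/n. A minimal sketch, assuming the sum above is exact:

```python
# Check the mean for Problem 3 directly from the given sum.
sum_x = 128.20   # Sum(x) as stated in the problem
n = 34
print(round(sum_x / n, 2))   # about 3.77 deaths per 1,000 live births
```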

4. Each year the Screen Actors Guild gives an award for the best actor and actress in a motion picture. The name and age of each winner have been recorded since 1996. The data for males and females are given below (sample size n = 20 for each group), along with the sum of the ages.

  • Construct a stem and leaf plot for each group to compare the distributions. You should use the same scale on both graphs to make a better comparison.
  • Calculate the measures of central tendency for each group. Use the full data from above, keeping all decimal places in the calculations, and report your answers to two decimal places. You can use Excel to help you do this.
  • Calculate the measures of variability for each group (range, variance, standard deviation, and coefficient of variation). Again, keep all decimal places in the calculations and report your answers to two decimal places. You can use Excel to help you do this.
  • Make a table showing the summary measures for each group, and compare the groups in words, summarizing your results. (A reusable scripted helper is sketched below.)
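Because the winners' ages are not reproduced in this preview, the sketch below is only a skeleton: a small helper that returns the requested summary measures for one group once the two lists of 20 ages are pasted in. The names male_ages and female_ages are placeholders, not names from the assignment.

```python
# Skeleton helper for Problem 4: summary measures for one group of ages.
from statistics import mean, median, stdev, variance

def summarize(ages):
    """Return the requested summary measures, rounded to two decimals."""
    m, sd = mean(ages), stdev(ages)
    return {
        "mean": round(m, 2),
        "median": round(median(ages), 2),
        "range": max(ages) - min(ages),
        "variance": round(variance(ages), 2),   # sample variance (n - 1 in the denominator)
        "std dev": round(sd, 2),
        "CV (%)": round(100 * sd / m, 2),
    }

male_ages = []    # paste in the 20 ages for the male winners
female_ages = []  # paste in the 20 ages for the female winners
# print(summarize(male_ages)); print(summarize(female_ages))
```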

5. This problem returns to the infant mortality data for the 34 OECD countries used in Problem 3 (the rate of death of children under 1 year of age per 1,000 live births, taken from the OECD's website). The data table appears below.

  • Previously, we calculated the mean, median, and mode for these data. Now add the range, variance, standard deviation, and coefficient of variation.
  • Briefly describe the distribution—focus on the shape of the distribution and whether there are any outliers or strange values.
  • The sum of the values, Sum(x), for the 34 OECD countries is 128.20, and the sum of the squared values, Sum(x^2), is 664.42. (A scripted check using these sums follows this list.)
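Because only Sum(x) and Sum(x^2) are given, one common shortcut is the computational formula s^2 = [Sum(x^2) - (Sum(x))^2 / n] / (n - 1). A minimal sketch, assuming the sums above are exact:

```python
# Problem 5: variance, standard deviation, and CV from the given sums.
from math import sqrt

n, sum_x, sum_x2 = 34, 128.20, 664.42

mean = sum_x / n
var = (sum_x2 - sum_x**2 / n) / (n - 1)   # computational (shortcut) formula
sd = sqrt(var)
cv = 100 * sd / mean

print(round(mean, 2), round(var, 2), round(sd, 2), round(cv, 2))
# roughly 3.77, 5.49, 2.34, and a CV of about 62%
```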

6. Todd Andrlik, founder and editor of Journal of the American Revolution, wrote a piece about how young many of the founding fathers were when the Declaration of Independence was first signed in 1776. There were 56 signers of the Declaration of Independence, and their ages are given below, sorted by age.

  • Calculate the measures of central tendency. In addition, calculate the range, variance, standard deviation, and coefficient of variation for these data. (The sum-based shortcut sketched after this list can check the variance and standard deviation.)
  • The sum of all the values is Sum(x) = 2,479.
  • The sum of the squares of all the values is Sum(x^2) = 116,015.
  • Briefly describe the distribution—focus on the shape of the distribution and whether there are any outliers or strange values.
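The same shortcut used in Problem 5 applies here with n = 56 and the sums given above; note that the range and the shape of the distribution still require the full list of ages. A brief sketch:

```python
# Problem 6: the same sum-based shortcut, applied to the 56 signers.
from math import sqrt

n, sum_x, sum_x2 = 56, 2479, 116015

mean = sum_x / n                           # about 44.27 years
var = (sum_x2 - sum_x**2 / n) / (n - 1)    # about 114.09
print(round(mean, 2), round(var, 2), round(sqrt(var), 2))
```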

Unformatted Attachment Preview

R I C A R D , A D R I E N N E 2 4 7 9 T S         COLORS: cyan magenta yellow black   KH Final Proof: 7-24-13 BOOK: 8.5x10.88 SPINE: _.80___ for Perfect Binding Introduction to Statistics, Plain and Simple Statistics is an important tool for many fields—business, the physical sciences, economics and the social sciences, engineering, and the biological sciences. It enables us to examine and test important research questions concerning individual variables and the relationships among a set of variables. The results, if used properly, can help make difficult decisions. There is hardly a field that does not use some form of statistical data analysis as a prime tool in the research process. Most likely, your own field of study uses statistical techniques in research and analysis of data and that is why you are required to take a course in statistics. R I C A R D , chapter 1 A D R I E N N E Many students approach statistics with some fear and trepidation. Common concerns involve anxiety over math skills, a worry of not being able to get the logic behind a statistical test, and a feeling of distrust for the relevance of statistics. In terms of the former concerns, modern desktop and laptop computers have made most of the calculations of statistics easy and painless. These tools enable us to focus on the more important aspects of sound data analysis practice and interpreting the results of our analysis correctly. While we will present formulas and equations, the emphasis in this book is on understanding how statistics will be applied, and gaining insight into the logic behind their use. As for the latter concern, the distrust of things statistical, I hope to help you gain some appreciation for the topic as we move through the course. Statistics can be and are very relevant in the research process, provided they are applied with respect and care. 2 4 7 9 T S There are many examples of the importance of statistics that we encounter each and every day. For example: • How do they know how much rain or snow has fallen in a given period of time? • Can a business person make decisions about the future by analyzing data from the past? • Can researchers ever get a good measurement of crowd size at a war or political protest? • Can a sales team make decisions on new products from a sample of consumers? • How do drug trials lead to the acceptance of a new drug that can be brought to market? The answer to each of these questions involves the use of statistics. In many cases the answers are estimates from samples, and they come with some concerns about the ability to truly measure the concept and the ability of the sample to represent the population. Some are direct measurements, such as the amount of snowfall, and some require a model to predict or K11352_Ilvento_CH01.indd 1 7/29/13 10:39 AM 2 chapter 1 Introduction to Statistics, Plain and Simple forecast ­future events. People in the field will use trial and error, experience, and ­theoretical mathematics to make reasonable estimates of the amount of snowfall, future trends in the economy, and whether a new drug can be effective. You will learn that the answers to these and many more questions contain some error. At times the conclusions will be wrong, and that is a chance we have to accept when working with samples and models. Again and again I will state that the inferential aspects of statistics are not about certainty. 
They are about making reasonable conclusions based on a sample of data and knowledge of statistical theory. Overall, statistics provide us with powerful tools for analyzing data and making decisions from an experiment or sample of data, but we can be wrong in our conclusions. R I C A R D , The focus of this book is on understanding the basics of statistics. I would like you to: • Gain an appreciation for how descriptive and inferential statistics are used in various fields of study, from business to agriculture, healthcare, and economics; • Learn how to analyze a set of data; • Learn how to present the data and make meaningful and coherent conclusions to others, and • Learn how to critique the use of statistics by others. A D R I E N N E This book provides an overview of most of the topics you might expect to find in a beginning statistics course. It is meant to be a general overview to build a foundation for further work. Obviously, anyone who wants to use statistics in research or a work environment will need to take additional courses and consult additional resources. The title of the book expresses my personal goal in this course: to help you understand statistics plainly and simply. I do not seek to cover every topic in statistics, or even as many topics one might find in other introductory textbooks. My goal is to build a good foundation for students to learn about statistics. It is my hope for many of you that you will take additional courses and gain more depth of knowledge about and insight into statistics and statistical techniques. We will begin with some general concepts and terms that you will need to know to begin to speak the language of research methods and statistics. Some ideas, such as measurement, are not statistical concepts per se, but they are important in understanding and interpreting statistics in research. This chapter will be decidedly non-mathematical. However, later chapters will include formulas and real data. 2 4 7 9 T S What Are Statistics? There are many concepts involving the word statistics, and there are many definitions of what it means. Statistics are thought to be the data itself (the government released the latest statistics on unemployment); a field of study within mathematics; and a set of tools used by many disciplines to analyze data. In its broadest sense, statistics are the science of data. The field refers to aspects of • Collecting data • Classifying, summarizing, and organizing data K11352_Ilvento_CH01.indd 2 7/29/13 10:39 AM chapter 1 Introduction to Statistics, Plain and Simple 3 • Analysis of data • Interpretation of the analysis of data It is important to note that statistics is both a field of study and the ­application of a set of tools to analyze data. Statisticians work primarily in developing and testing a set of tools to analyze data, especially in relation to making inferences from a sample of data to a population. Much of the work of statisticians focuses on derivations of techniques, theoretical proofs, and providing a literature as to the effectiveness and usefulness of various techniques. Statisticians also work directly with data and work in companies, government agencies, and in research institutions to apply statistical techniques to a range of data. R I C A R D , However, most of applications of statistics are done by people who are not statisticians. 
Sociologists, psychologists, economists, political scientists, biologists, physicians and nurses, and business analysts are not statisticians, but they rely on the work of statisticians to apply various techniques to data. Mostly likely, you are taking a course in statistics because your discipline feels you need some exposure to statistics to read the literature or even to participate in the research process. Most of the statistical techniques we learn in a statistics course help us describe and summarize our data. For example, the mean or average is a summary measure of central tendency of the data. With the mean, we can use a single statistical measure, and summarize 100, 1,000, or even 10,000 data points. In some cases we want to go further than simply summarizing the data—we want to make an inference from an experiment or a sample of data. In order to do so, we need to first distinguish between description and inference. A D R I E N N E Descriptive versus Inferential Statistics We make a distinction between two main approaches in the use of statistics for data analysis, descriptive versus inferential statistics. Descriptive ­statistics uses measures and graphs to summarize the data with an emphasis on parsimony. The strategy is to find summary measures which describe the data adequately and succinctly, be they a percentage, average, or a standard deviation. Descriptive statistics also involve describing the relationships between variables or sets of variables through the use of very sophisticated techniques, such as correlation, regression, factor analysis, and logistic regression. 2 4 7 9 T S Inferential statistics involves many of the same techniques as in descriptive statistics, but it goes a step further. Inferential statistics seeks to make statements from analysis of a sample or a set of experiments to a larger population. Almost all research is focused on using inferential statistics. In inferential statistics, we use some of the same techniques as in descriptive statistics, but now the focus is also on making estimates, decisions, predictions, or generalizations about a population from a smaller subset or sample. The sample can be a subset of a population as a cross-section of the population at a point in time, or as a sample in time or space. Inferential statistics are a powerful tool for research. They enable us to make statements about a large group from a much smaller sample. Thus, we can survey a sample of 1,000 people and make statements about 309 million people in the United  States, K11352_Ilvento_CH01.indd 3 7/29/13 10:39 AM 4 chapter 1 Introduction to Statistics, Plain and Simple as is typically done in survey research for elections (in March 2010, the e ­ stimated population of the United States was nearly 309 million people). Whenever we work with a set of experiments or a sample of data, there is a chance that the results we see are partly a function of the random fluctuations we expect from sample to sample. In other words, it is possible to get an unusual sample and the result we observe does not reflect the population values. Statisticians have worked out strategies to know how unusual a result is given a sampling process (primarily a random process) and a sample size. This is always done within a probabilistic framework with a chance of being right or wrong in our conclusion. In general, we want the probabilities of making a good decision to be as large as possible. When we deal with samples, it is never about certainty. 
There is always a chance of being wrong in our conclusions with sample estimates. R I C A R D , Populations and Samples A population is the total number of units involved in the research question. The units are the members (or elements) of the population. The population is what you are focused on when conducting a study—it is the group about which you would like to make conclusions. Even when a sample is used in research, the sample is expected to represent a population. Depending on the focus of a study, populations could be: • • • • • • • A D R I E N N E People Animals Cells Plants Courses Geographic places Objects The population should be clearly defined in any research endeavor. The population is defined by: 1. T  he purpose of the study - what are we trying to understand and what questions are we trying to answer? 2. The units and elements involved - what are the basic units that make up the population? 3. Geographic coverage - what is the particular geographic area of interest, a county, a state, or the whole country? 4. Time frame - is there a clear delineation of the time frame involved? Things do change over time and it is important to note the time frame for the population under consideration. 2 4 7 9 T S A census is when we collect data on all elements in a population. Sometimes it is difficult or impossible to get information on the entire population. An alternative is to take a sample of the population. A sample is a subset of the units or elements of a population. To be valid, we want our sample to represent the population. In other words, we want the characteristics of the sample to resemble the characteristics of the population so that we can make generalizations from the sample to the population. Samples are also defined by the same considerations as the population—the focus of the study, the units or elements involved, the geographic coverage, and the time frame. K11352_Ilvento_CH01.indd 4 7/29/13 10:39 AM chapter 1 Introduction to Statistics, Plain and Simple 5 Why Should We Sample? The major reason we sample is that sampling saves time, money, and other resources (for example, computation time on a computer). In some cases, it may actually be impossible to collect information on every element of the population and sampling becomes a reasonable alternative. Could we actually count every unemployed person in a nation of 305 million people, or everyone who is a supporter of the president? Sampling allows us to collect data for a research project and still have some confidence that the results represent something of value. R I C A R D , So, the most important reason we sample is because it works. We can design a study based on a smaller sample and have a very good chance that the data represents the population. In fact, a well conducted sample may actually be more accurate in providing estimates than population studies that attempt to get every subject in the population, but fall way short. Every ten years we conduct a census of the population in the United States. While the U.S. Constitution requires that we attempt to count every person via the census, most of the more interesting data—education, employment, poverty, and marital status—are based on a sample of the population. In fact, the Census Bureau has argued again and again that they could a better job in making estimates of the population with well designed samples than with their current attempts to count everyone. The reason is that it is very difficult to count everyone. 
A D R I E N N E A valuable property of a sample is that it is representative of the population. By this we mean that the sample characteristics resemble those possessed by the population. Inferential statistics require a sample to be representative of the population, and that can be done when the sample is drawn through a random process. A random sample is when each element or unit has the same chance of being selected. Classic statistical inference requires that the sample be selected through a random process. Measurement and Levels of Measurement 2 4 7 9 T S Measurement is the process of assigning a number or value to variables of the individual elements of the population (or sample). Measurement is a very important issue. Some measurement seems relatively straightforward—­ distance, weight, dollars spent. However, even straightforward measurements can come with some error and perhaps even bias. With other types of variables, the measurement is not so straightforward. How do we measure intelligence, anger, social networks, support for a policy, or willingness to pay for a product or service? These measures are more difficult to conceptualize and therefore their measurement is more difficult. Some of these may require multiple measures to fully assess the concept. Consider your grade in a course. It is supposed to measure your comprehension of the material in the course. But even that is debatable—some might argue that exams and assignments simply measure a person’s ability to memorize the material, and not the ability to comprehend it. Even so, few would be happy with a single test for a course as the final determinant of the grade. Most ­students prefer multiple tests or measures, averaged out, as a better indication of their performance. In essence, we would argue for multiple indicators to measure our grade in the course. K11352_Ilvento_CH01.indd 5 7/29/13 10:39 AM 6 chapter 1 Introduction to Statistics, Plain and Simple With measurement we must also deal with issues of validity (are we measuring what we think we are measuring) and reliability (is the measuring device consistent). A user of data is responsible for asking questions and in some cases doing preliminary analysis to determine if the measures are valid and consistent. The process of measurement is often complex—do not take it for granted. Levels of Measurement is the term we use to reflect that variables can be measured by numbers or classifications. There are various ways to characterize measurement of variables. An easy dichotomy in measurement is qualitative versus quantitative data. Qualitative data do not follow a natural numerical scale and thus are classified into categories such as male or female; customers versus non-customers; and race (white, African American, Asian, and so forth). Qualitative data are often called categorical data. R I C A R D , Quantitative data use measures that are recorded on a naturally occurring scale, such as age, income, or time. There is a continuous nature to these data. In the extreme case, the measurement is continuous and smaller and smaller increments can be used in making measurements, depending upon how accurate you need to be. For example, we can measure distance in yards, feet, inches, or fractions of inches. A more elaborate description involves three levels of measurement—nominal, ordinal, and continuous. Statistical programs, such as JMP, often use these three levels to characterize the data and determine the appropriate statistics. 
Nominal data (or categorical) have no implied order or superiority and can be thought of as qualitative. A middle ground is ordinal data, where there is an implied order or rank, but the distance between units is not well specified. Rankings, opinion questions that use ordered categories such as strongly agree to strongly disagree, and variables that use an ordered scale from one to ten are examples of ordinal data. A D R I E N N E Continuous data are the same as quantitative data. These data are measured on a continuous scale and the distance between measures is better understood. Most, but not all, of the advanced statistical techniques require continuous level data in order to meet the assumption of the method and to extract the most information from the data. 2 4 7 9 T S Levels of measurement are not trivial to the use of statistics. Many statistical techniques are predicated on certain levels of measurement of the variables involved. Some techniques or formulas assume a certain level is used and applying the technique to the wrong type of variable can lead to results that are biased or misleading. A software package such as JMP will change the techniques of an analysis based on level of measure for the key variables used. Sources of Statistical Data There are many ways to think of the type of research studies where statistics are employed. I will use a basic breakdown that that includes observational studies, experiments, and secondary data. From this perspective observational studies are any where the research observes or questions participants, but does not structure or manipulate the participants (such as assigning them to a treatment or control group). Field studies of nature as well as surveys would fall under observational studies. K11352_Ilvento_CH01.indd 6 7/29/13 10:39 AM chapter 1 Introduction to Statistics, Plain and Simple 7 In contrast, experiments involve the researcher actively manipulating the subjects, often into treatment and control groups, as a way to control for extraneous factors that may influence the outcome of the experiment. An experimental design, when conducted properly, can have less threats to validity and therefore we can have more confidence in drawing conclusions from the results. However, not all research lends itself to experimental designs. Thus, the need for observational studies. A third type of study uses data from published sources—secondary data, also known as existing data. In this case someone else collected the data and made it available to you. Economists often use existing data about the ­economy—sales, unemployment, interest rates—to develop statistical ­models that forecast the future. Likewise, climatologists use weather data to develop models and demographers use census data to study such things as migration. Sources of existing data include: • • • • • R I C A R D , Census of Population Current Population Survey Sports statistics Unemployment data The stock exchange A word of caution on studies using existing data. When you use data collected by someone else, most of the data decisions are out of your control. These are decision about whether to use a sample, the size of the sample, which data items to collect, at what geographic level the data will be available, and the time frame when the data will be collected. With existing data you are often a “data taker” and must settle on the decisions made by someone else. 
For example, you might want to analyze monthly data on unemployment by county, but the Bureau of Labor Statistics only has monthly data at the state level. At times you will need to compromise your study objectives in order to use these data. Working with existing data also will require you to become very familiar with data definitions and data decision before you use the data. A D R I E N N E 2 4 7 9 T S Critical Thinking with Statistics I urge you to be a critical thinker when looking at how statistics are used. When you read about a study, particularly in the news, you should be asking questions about the study. Statistics involves making critical decisions and rational thought as to how a set of data is: Sampled Measured Collected Analyzed Interpreted If the study or report does not tell you details about these decisions, you are limited in making a judgment on the validity and worth of the study. It is important to ask questions and be a critical thinker when it comes to how people use, or misuse, statistics. Throughout this book I will present applications and at times challenges for you to look at statistical results. I urge you to always question the data and results and see if the logic of the analysis makes sense. I will end Chapter 1 with an example of a critical K11352_Ilvento_CH01.indd 7 7/29/13 10:39 AM 8 chapter 1 Introduction to Statistics, Plain and Simple look at a ­measurement issue that all students in college face—grades! You should have some personal experience with this topic, and I hope a viewpoint. Grades in most U.S. universities tend to be a strange measurement. A Measurement Example: What Level of Measurement Are College Grades? Most U.S. universities have a very curious system of grading for courses and then ultimately for their system of grade point averaging (GPA). Almost every course, my courses included, use a point system that has an absolute zero and an upper bound. Some professors have a point system that goes beyond 100. In these systems the total points for the grade could be 200, 300, or a higher figure. Some requirements get more points, such as an exam, and some are worth less. The final grade is based on a percentage of the total points that each student earns, which converts their total points back to a 0 to 100 system. For example, if the course had 250 total points and a student earned 205 points, her grade would be: R I C A R D , Percentage = 205/250 x 100 = 82.0% I use a different system, where each exam, assignment, or quiz is weighted to yield a final score of 100. For example, an exam might be worth 15 points toward the final 100 points for the course. A student who receives an 85 on the exam gets a 85x.15 = 12.75 points toward her final grade. In this way the grade is converted to a scale of 0 to 100, with exams, assignments, and other requirements weighted differently toward the total. A D R I E N N E In either strategy, the grade can be thought of as a continuous level of measurement. However, most universities rely on a grading system with letter grades: A, B, C, D, and F for failure. Some universities include a plus and minus allowing for a wider range of grades, such as A, A-, B+, B , B-, C+, C, C-, D+, D, D-, and F. A letter grade system is clearly ordinal. Once we convert from the numerical system to letters, even if we use pluses and minuses, a lot of information is lost. A grade of an A in my class can be based on percentage of 93.1 or a 99—the letter grade makes no distinction between the two students. 
Both will receive an A for the course. Some information is clearly lost in such a system. In fact, some students realize that their grade will not change whether they turn in a final assignment or not, and forgo the assignment because their grade will not change. 2 4 7 9 T S This is the way I make the conversion from a continuous measure with a theoretical distribution of 0 to 100 to a letter grade. A AB+ B BC+ C CD+ D DF K11352_Ilvento_CH01.indd 8 93 to 100 90 to 92.9 87 to 89.9 83 to 86.9 80 to 82.9 77 to 79.9 73 to 76.9 70 to 72.9 67 to 69.9 63 to 66.9 60 to 62.9 Below 60 7/29/13 10:39 AM chapter 1 Introduction to Statistics, Plain and Simple 9 I would note that college professors have a tremendous amount of latitude as to the cut-off points for letter grades or pluses and minuses. One professor may use 60 as the cut-off for passing while another uses 65. Some professors curve the grades and a final grade of 40 might be passing. The variation from course to course can be enormous. At this point we moved from a continuous grade to an ordinal letter grade, which shows up on the transcript. However, universities do not stop there. Most use a grade point average (GPA) system which converts the letter grade back to a point system, weighted by the number of credits for the grade. At the University of Delaware we refer to these as quality points per credit. The University of Delaware uses the following system to convert grades. I am including the entire list just to show you how complicated grading can be with Pass/Fail options, incompletes, and listeners, to name a few. A AB+ B BC+ C CD+ D DF R I C A R D , Excellent 4.00 quality points per credit 3.67 quality points per credit 3.33 quality points per credit Good 3.00 quality points per credit 2.67 quality points per credit 2.33 quality points per credit Fair 2.00 quality points per credit 1.67 quality points per credit 1.33 quality points per credit Poor 1.00 quality points per credit 0.67 quality points per credit Failure 0.00 quality points per credit A D R I E N N E X - Failure, 0.00 quality points per credit (Academic Dishonesty) Z - Failure, 0.00 quality points per credit (Unofficial Withdrawal) L - Listener (Audit), Registration without credit or grade. Class attendance is required, but class participation is not. LW - Listener Withdrawn, A listener who does not attend sufficient class meetings to be eligible, in the judgment of the instructor, for the grade of L will receive the grade LW. 2 4 7 9 T S NR - No grade required. P – Passing, For specifically authorized courses. P grades are not calculated in indexes. (For further explanation, see Pass/Fail grade option section.) W - Official Withdrawal, Passing at time of withdrawal. The following temporary grades are used: I – Incomplete, For incomplete assignments, absences from the final or other examinations, or any other course work not completed by the end of the semester. S - Satisfactory progress, For thesis, research, dissertation, independent study, special problems, distance learning and other courses which span two semesters or in which assignments extend beyond the grading deadline in a given semester. U - Unsatisfactory progress, For thesis, research, dissertation, independent study, special problems, distance learning and other courses which K11352_Ilvento_CH01.indd 9 7/29/13 10:39 AM 10 chapter 1 Introduction to Statistics, Plain and Simple span two semesters or in which assignments extend beyond the grading deadline in a given semester. 
Temporary grades of S and U are recorded for work in progress pending completion of the project(s). Final grades are reported only at the end of the semester in which the work was completed. N - No grade reported by instructor. So, a student who gets a B+ in my class at the University of Delaware would get 3.67 quality points multiplied by 3 credits which equals 11.10 quality points toward their overall GPA. The overall GPA is the total number of quality points divided by the number of credits, which brings us back to a number between zero and four. In this system, credits that are Pass/Fail or granted from a test or another institution are not counted in the GPA, although they do count toward graduation. As a result, the ordinal system of letter grades is now converted back to something that looks like a continuous variable. The final GPA for a student often uses two or three decimal places to distinguish one student from another. R I C A R D , One might argue that it would be better to leave the number grades alone and calculate a GPA based on a 0 to 100 numerical system. Some high schools and some universities do that, which seems to make more sense to me from a statistical point of view. Each time we convert to a letter grade we lose information, even with a system that uses pluses and minuses. The same is true when we collapse a continuous variable into categories, such as age or income, that are converted into range categories. In such a coding system, instead of each subject having their age in years, they are given a category, such as 18 to 25, 26 to 34, and so forth. Whenever we collapse a variable into ordinal categories, some information is lost in the process. A D R I E N N E The discussion of measuring grades is actually more complicated than simply the conversion from a continuous to ordinal level of measurement. Every student knows that some courses are easier than others, so the meaning of a grade is different from course to course. While I can argue that comparing one student’s grade to another in a class is relatively straightforward, the same could not be said for comparing a grade in a course in Statistics to a grade in English. It is also complicated to determine how easy or hard a course is. It is a function of the background, experiences, and natural ability of the student in the subject matter, combined with the level of the course and the demands and grading philosophy of the instructor. There is no an easy answer as to which course is easier or more difficult, but every student knows that some courses are easier than others, at least for them. 2 4 7 9 T S I use this example to point out several things. First, measurement is a complicated process and should not be taken lightly. Even seemingly simple measurements such as a grade for a course are far more complicated than we might first realize. Second, the level of measurement does matter and information is lost or gained by the level of measurement. In the grade example we went from continuous to ordinal and then back to continuous. That may be one of the strangest measurement processes, but it is not unusual to go from continuous to ordinal. For example, surveys often do this when asking the subject’s age. And third, making comparisons across subjects on a particular measurement could be more difficult than we first realize. It may not be fair to compare grades across different subjects because the demands and expectations of the courses can be very different. 
K11352_Ilvento_CH01.indd 10 7/29/13 10:39 AM chapter 1 Introduction to Statistics, Plain and Simple 11 The meaning of an A in History might be different from the meaning of an A in Physics. To be clear, I am not arguing that one subject is necessarily easier than another, but I am saying that it might not be realistic to think the grade has the same meaning across subject matters without some understanding of the level of the course and the demands made on the students by the instructor. Additional Problems 1. There are many excellent sources of statistical information on the Internet. I list a few sites that contain interesting statistics on life in the United States and around the world. You might know of or find some other sources in your discipline. R I C A R D , The U.S. Bureau of the Census http://www.census.gov/ The home of a major data collector—the U.S. Bureau of the Census American Community Survey http://www.census.gov/acs/www/ A revolving survey conducted by the U.S. Census Bureau, used to make estimates of the population for cities, counties, states, and the nation between each 10-year census Statistical Abstract of the United States http://www.census.gov/ compendia/statab/ An annual publication of the U.S. Bureau of the Census containing facts and figures on a range of topics from the federal budget to education expenditures to births, deaths, marriages, and divorces (all the data can be downloaded to a spreadsheet) A D R I E N N E Current Population Survey http://www.census.gov/cps/ A large monthly survey of U.S. households that estimates issues of the labor force (unemployment, hours worked, occupations), basic demographic information (age, sex, marital status), and other social issues that affect households U.S. Dept. of Agriculture, Data and Statistics http://www.usda.gov/wps/ portal/usda/usdahome?navid=DATA_STATISTICS Sources of data on U.S. agriculture across three subagencies: Economic Research Service, Foreign Agricultural Service, and National Agricultural Statistics Service (NASS), which conducts the Census of Agriculture 2 4 7 9 T S DATA.Gov http://www.data.gov/home A one-stop place for a wide range of data collected by federal agencies Bureau of Economic Analysis http://www.bea.gov/ An agency focused on measures of the U.S. economy Digest of Education Statistics http://nces.ed.gov/programs/digest/ Statistical information covering American education from grade school through higher education National Center for Health Statistics http://www.cdc.gov/nchs/ The Centers for Disease Control and Prevention (CDC) site for health statistics in the United States and around the world a. Select a site from the list above (or one you have found) and explore it. Briefly describe the data available and in what format (tables, pdf files, downloadable data) they are available. K11352_Ilvento_CH01.indd 11 7/29/13 10:39 AM 12 chapter 1 Introduction to Statistics, Plain and Simple b. For one data source, select one variable and look up its definition. Explain how the data are collected (survey, model, full census count, or other means) and the details of the definition. For example, I looked up the 2007 Census of Agriculture and learned that it primarily collects its data through a mail survey of farms, followed up by telephone, Internet, and personal enumeration (face-to-face). The agency seeks to get a full count of all farms, but recognizes its mailing list is incomplete and its response rate is 85.2%. 
The agency uses statistical means to adjust for missing data and missing operations. I focused on the definition of farm. I was surprised to learn it was defined by relatively small sales: “an operation that produces, or would normally produce and sell, $1,000 or more of agricultural products per year.” This search took about 20 minutes to complete. R I C A R D , 2. Body Mass Index (BMI) is often used in health care discussions about weight and obesity; however, the measure is not without controversy. Read the following discussion and search the Internet for alternative viewpoints. Briefly summarize your own feelings about this measure, including the pros and cons of the current measure and whether you think it is a valid measure of obesity (write 2–3 paragraphs). Bo dy Mass In dex (BMI) A D R I E N N E The Body Mass Index has been around a long time. According to Jeremy Singer-Vine of Slate magazine, the BMI was first developed by Adolphe Quetelet, a Belgian mathematician who was trying to develop ideas about the “normal” person’s dimensions in 1832. In his work, he suggested that a person’s weight varied in proportion to a person’s height squared. He developed the following formula to express this relationship. Metric Formula: Weight (kg)/[Height (m)]2 Example: Weight = 68 kg, Height = 165 cm (1.65 m) Calculation: 68 ÷ (1.65)2 = 24.98 Nonmetric Formula: Weight (lbs)/[Height (in)]2 * 703 Example: Weight = 150 lbs, Height = 5’5” (65”) Calculation: [150 ÷ (65)2] x 703 = 24.96 2 4 7 9 T S About 100 years later, the measure caught on in the medical community. Ancel Keys published a paper in 1972 that used the Quetelet’s formula as the best predictor of body fat percentage. He coined the term Body Mass Index (BMI). Keys felt it was a good predictor of the percentage of body fat, but only in a general way, and it should not be used as an individual predictor of body fat. The advantage of the BMI was that it was easy to calculate, only requiring two fast and inexpensive measurements from a subject. Other methods of calculating percentage of body fat require more elaborate data collection strategies and are more invasive. Because of the ease of use, BMI has become a dominant body fat measurement and the leading obesity indicator in the United States since Keys’ article, despite his warnings to the contrary. According to Keys, the BMI was never intended as a personal measure of obesity, and it should not be used to diagnosis or treat an individual patient. So BMI has become a widely used measure of obesity as well as a controversial measure. K11352_Ilvento_CH01.indd 12 7/29/13 10:39 AM chapter 1 Introduction to Statistics, Plain and Simple 13 Here is the BMI website of the National Institutes of Health (NIH http://www. nhlbisupport.com/bmi/). NIH allows you to calculate your own BMI with an online calculator (sort of a self-diagnosis). The NIH uses the following BMI categories: Underweight = < 18.5 Normal weight = 18.5–24.9 Overweight = 25–29.9 Obese = ≥ 30 R I C A R D , According to the NIH, the limitations of BMI are expressed in this caution: “BMI is a reliable indicator of total body fat, which is related to the risk of disease and death. The score is valid for both men and women but it does have some limits. The limits are: • “It may overestimate body fat in athletes and others who have a muscular build. 
• “It may underestimate body fat in older persons and others who have lost muscle mass.” According to Singer-Vine, there is a growing controversy on the reliance of the BMI as an obesity measurement. He indicates, “Faulty readings could promote a negative self-image among healthy people and lead them to pursue unnecessary diets.” Almost anyone can now go online and find a BMI calculator and see their own personal BMI. Singer-Vine feels there is some danger in that. A D R I E N N E BMI is what statisticians would call an indicator variable. An indicator variable is defined as seeking to easily measure something that is complex in an easier, cheaper, and still meaningful way. There are other ways to measure body fat, but they are more costly and invasive (e.g., you have to get into a body of water or be pinched or touched). With the BMI, you only need a person’s height and weight. An indicator variable should be highly correlated (for now, think of correlated as related) with a more accurate measure to be considered valid. Thus measures of total body fat and BMI should agree across a wide sample of subjects. However, it is only an indicator and not the true measure of body fat. It is based on a model and not a direct measure. 2 4 7 9 T S Sources: National Institutes of Health. Retrieved from http://www.nhlbisupport.com/bmi/ Singer-Vine, J. (2009, July 20). Beyond BMI: Why doctors won’t stop using an outdated measure for obesity. Slate magazine. 3. Each year the Social Security Administration (SSA) issues a press release on the most popular baby names. This list is based on the names sent to the SSA when applying for a Social Security number. The website can be found here: http://www.ssa.gov/OACT/babynames/#ht=1 (or search for most popular baby names). a. Go to the website. Toward the bottom of the page, choose the option for Popularity of a Name. Enter your name and search for 100 years. The website returns the rank of your name for each of the past 100 years. Years are omitted if your name is not within the top 1,000 names of that year. b. The Wceb table can be easily copied and pasted into Excel. Grab the two columns (year and rank) and copy them. Then open Excel and paste in the results. It should go smoothly. K11352_Ilvento_CH01.indd 13 7/29/13 10:39 AM 14 chapter 1 Introduction to Statistics, Plain and Simple c. Use Excel to create a chart of the rank of your name over the past 100 years. In Excel insert a graph, choose Scatterplot, and name the column with the Y data (for this problem, it is the rank). For the rest of the graph, you are on your own to add a connecting line, title, and subtitles. Explore Excel help and the Internet to find a way forward. I have included a graph of my name rank over the past 100 years (it is becoming less common after being at or near the top 10 until the early 1960s). R I C A R D , A D R I E N N E 2 4 7 9 T S K11352_Ilvento_CH01.indd 14 7/29/13 10:39 AM Measures of Central Tendency R Most statistics, whether they refer to a single variable or a complex model I involving many variables, are designed to help us describe things in a more C simple manner. We seek summary statistics to describe our data. A useful concept when summarizing data is to find some way to measure the center A of the data. It has been referred to as the typical value, the average, or the R center. The central tendency of a variable is the tendency of the data to cluster or center about certain numerical values. 
Central tendency is in contrast D to another concept which will be discussed in the next chapter, variability or , focus on the mean, the the spread of the data. For central tendency we will mode, and the median. chapter 3 A D The Mean Rmeasurements ­divided The arithmetic mean or mean is the sum of the by the number of measurements contained in the I data set. As the symbol for the mean of a sample we use x with a bar over it. For a population, E we use the Greek letter, m (mu). N The formula for the mean is given below, represented in two different ways. N Both formulas would yield the same result. The first formula (to the left) is E you use to calculate the the more familiar one and is the one I recommend mean. The second formula (to the right) presents the mean like an expectation in probability, and thus connects the mean to probability theory coming 2 in a future chapter. n x= ∑x i =1 i n The sum of all the values, divided by the number of values 4 n 7 x = ∑ (x i /n) i =1 9 Tvalues weighted by the The sum of number of values S The first formula is the more familiar formula and reflects that the mean is the average observation. The second formula yields the same result and emphasizes the mean is a weighted summation with the weights being the probability of each observation in the data set (i.e., 1/n) and as such is an expectation of a probability distribution. Let us look at a small data example to see how the mean is calculated and to compare the two formulas. This data set has only ten observations for a variable, x. The values for x are in the second column and include values 21.0, K11352_Ilvento_CH03.indd 35 7/29/13 10:42 AM 36 chapter 3 Measures of Central Tendency table 3.1 A Small Data Example for Measures of Central Tendency OBS 1 2 3 4 5 6 7 8 9 10 n Sum Mean X 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 10 255 25.5 X/n R I C A R D , 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 10 25.5 22.0, 23.0, and so forth. The sum of these values, given at the bottom of the table, is the addition of each value and equals 255. The mean is calculated as A 255/10 = 25.5. D The third column contains each value divided by the sample size, which is Rat the bottom of this column, the sum of each 10 in this example. Looking value weighted by the sample I size (i.e., 1/10) is 25.5. This value is the mean. So whether we take the sum of each value and divide by the sample size, or we sum each value dividedEby the sample size, we generate the same value for the mean. N N As a measure of central tendency the mean has several advantages over other measures of centralEtendency, and a few negative attributes. The first advantage is that the mean uses information from all the values in a variable—all the values of the variable are added together and divided by 2 the sample size. 4 A second important property of the mean is that it has inferential proper7 which we can draw conclusions from our ties that are known and from sample. Furthermore, these 9 inferential properties are relatively simple and straightforward. By this we mean that we can make inferences from a sample to a population T for the mean. Some other descriptive statistics of central tendency do notShave inferential properties, but the mean does. We will learn more about the inferential properties of the mean in future modules. A third property of the mean is that it forms the basis for a number of other statistics known as product moment statistics, which include the variance, correlation, and regression coefficients. 
Thus the mean finds its way into many formulas used in an introductory statistics class, some of which are simple and others quite complex. The mean is an essential building block for statistics. K11352_Ilvento_CH03.indd 36 7/29/13 10:42 AM chapter 3 Measures of Central Tendency 37 Finally, a fourth property that is a disadvantage of the mean as a measure of central tendency is that it is sensitive to outliers and extremes in the data. The mean is “pulled” toward extreme values in the data and it is not as ­“resistant” to these values as are other measures of central tendency. If there are extremely high values in the data set relative to the other values, the mean will get larger than if they were not there. Likewise, the mean will be pulled toward extreme low values in the data set. The mean has two important mathematical properties that are important in statistics. The first is that the sum of the deviations about the mean equals zero. This means that if we took each value in a variable, subtracted the mean from each one, and added the results of the subtractions, the total would equal zero. If you think about this it might R make sense why this is so—the mean is the middle of the distribution and all the values are centered I around it. C The second property of the mean is that the sum of squared deviations about A the mean is a minimum. The latter property is called the least squares property. R subtract the mean from By this we mean if we take each value in the variable, that value, and then square the result, and finally add D all these calculations, we would have a value that represents the squared deviations around the mean. , This second property tells us that the sum of squared deviations around the mean is smaller than around any other value. For example, if we calculated A it would be larger than the sum of squared deviations around the median that calculated for the mean. The latter property isD exploited when looking at the spread of the data for the variance and for more sophisticated data analyR sis techniques such as regression. I E The Median N The median is the middle value when the measurements are arranged in N ascending order. It is a positional measure because it is based on the middle E we first must sort the case in a variable. In order to find the median value, data in ascending or descending order, find the position of the middle value, and then read that value (see below for more details). 2 The median is an intuitive measure of central tendency—the value at the 4 middle of the ordered data. However, the median is actually difficult to com7 pute because it requires you to sort the data. Fortunately, spreadsheets and statistical software packages will now calculate the median for us. 9 T so it is not used when The median has very limited inferential properties, making inferences from a sample or in hypothesisStesting. Nonetheless, the median is often used in skewed data because it is not as sensitive to outliers. The median is often the preferred measure of the center in data with extreme values, such as income. The median is one of many positional statistics. By position we mean that they are based on their order in the data. You first must find the position, and then read the value associated with the position. Other order statistics include percentiles (e.g., 90th percentile), deciles (the 10th, 20th, and so forth positions), and quartiles (the 25th, 50th, and 75th percentiles). 
K11352_Ilvento_CH03.indd 37 7/29/13 10:42 AM 38 chapter 3 Measures of Central Tendency Quartiles are used in box plots and in constructing the inter-quartile range. The first quartile is referred to as Q1 and is the 25th percentile. The second quartile is the 50th percentile and is the same as the median. The third quartile is the 75th percentile. We will be working with quartiles as a useful way to describe the range of the middle 50 percent of the values in a variable. As such the IQR is used to describe the spread of the data (see Chapter 4) and is used to form the whiskers in a box plot (see Chapter 2). Steps to Calculate the Median An order statistic requires the data to be sorted from lowest to highest. The next step is to find the position that is of interest—the ith value in the data set R that marks a certain position, such as the middle, a quartile (25 percentile), I For the median, we are looking for the middle or a decile (a 10th percentile). position or the 50th percentile. C A median. Here are the steps to find the R 1. Sort the data D of observations (denoted as n) 2. Do a count of the number 3. If the count is odd, the median is the (n + 1)/2 position. For example, if n = , 63, the median position is the (63 + 1)/2 = 32nd position in the ordered data. Count to the 32nd position in the sorted data and read the value that is there. The median is A the value of the 32nd position 4. If the count is even, there is not an exact middle. So, we have to take the D positions and call this the median. The middle average of the middle two two positions are the n/2 R position and the n/2 + 1 position. For example, if the count is 64, we need to find the values at the 32nd and 33rd posiI tions in an ordered data set (ordered from lowest to highest), and take the average of these two values. E N For the data in Table 3.1, the sample size n is equal to 10. Since it is even, we are looking for the fifth andNsixth values in an ordered list of the data. Since the data are already ordered E from lowest to highest, the median is the average of 25 and 26. Calculations for the median2 • n = 10, so use the 10/2 =45th and the 10/2 + 1 = 6th positions • Median = (25 + 26)/2 = 25.5 7 9 confuse the median position with the median With the median, student often value. Remember, first we T locate the median position(s) in the ordered and sorted data, and then we identify the median value. When the number of S the average of the two middle positions. ­observations is even, we take Mode The mode is the most frequently occurring value in a variable. While this is an intuitive concept of the center or a typical value for most of us, the mode has its limitations. The most frequent value may not be anywhere near other measures of center. And, in a continuously measured variable, there may not be a single most frequently occurring value, leaving the mode undefined. As a result, the mode is viewed as less useful than the mean or median. K11352_Ilvento_CH03.indd 38 7/29/13 10:42 AM chapter 3 39 Measures of Central Tendency ­ owever, the mode can provide some insights as to the most common value H in a variable and the shape of a distribution. In some cases there are multiple “modes,” referred to as bi-modal or tri-modal distributions. Multiple modes or groupings around a value may reflect different subgroups within a variable. Figure 3.1 below shows a bi-modal distribution with a histogram of weight of a group of 249 subjects. The distribution shows two distinct peaks around 120 and another around 150. 
Mode

The mode is the most frequently occurring value in a variable. While this is an intuitive concept of the center, or of a typical value, the mode has its limitations. The most frequent value may not be anywhere near the other measures of center. And in a continuously measured variable there may not be a single most frequently occurring value, leaving the mode undefined. As a result, the mode is viewed as less useful than the mean or the median.

However, the mode can provide some insight into the most common value in a variable and into the shape of a distribution. In some cases there are multiple "modes," referred to as bi-modal or tri-modal distributions. Multiple modes, or groupings around more than one value, may reflect different subgroups within a variable. Figure 3.1 shows a bi-modal distribution in a histogram of the weights of a group of 249 subjects. The distribution shows two distinct peaks, one around 120 and another around 150. The two "modes" reflect the center of weight for females and for males.

Figure 3.1: Histogram of Subject Weight

In continuous-level data there may not be any single value that is the most frequent, and thus technically the mode is undefined. The mode may make more sense in reference to qualitative data. With a qualitative variable, we refer to the modal class or category, which is the category with the most responses.

Outliers and the Measures of Central Tendency

Outliers can have a dramatic effect on some measures of central tendency and a minimal effect on others. In fact, we may choose one measure of central tendency over another based on the amount of spread in the data. For example, the median is often used when referring to house values or income because there is so much spread in those data.

Let us look at an example of how the spread of the data can influence the measures of central tendency. In this example we are primarily looking at the difference between the mean and the median and how they are influenced by extreme values in the data. The data we will use are the marriage rates for the 50 states and the District of Columbia (n = 51). The marriage rate is calculated as the number of marriages divided by the population in the state, expressed per 1,000 people. In the United States the marriage rate in 2005 was 7.6 marriages per 1,000 population. However, this rate varied by state. The rates for each state and the District of Columbia are presented in Table 3.2. The rates are sorted from lowest to highest, and it is easy to see that the rate for Nevada is much higher at 57.9 per 1,000 people. That is because many people travel to Nevada, specifically Las Vegas, to get married. Thus the number of marriages in Nevada reflects many people from other states rather than just people in Nevada. The next highest rate also belongs to a state known for marriages, Hawaii. After these two states, the rates drop dramatically.

Table 3.2: Marriage Rates by State, 2005 (marriages per 1,000 population)
District of Columbia 4.0; New Jersey 5.7; Mississippi 5.8; Pennsylvania 5.8; Illinois 5.8; Connecticut 5.9; Delaware 6.0; Minnesota 6.0; Michigan 6.0; Wisconsin 6.1; Massachusetts 6.1; California 6.4; Arizona 6.4; Ohio 6.5; Washington 6.5; New Mexico 6.7; New York 6.8; Kansas 6.9; North Dakota 6.9; Maryland 6.9; Georgia 6.9; Iowa 6.9; Indiana 7.0; Nebraska 7.0; Missouri 7.0; Rhode Island 7.1; New Hampshire 7.3; Oklahoma 7.3; Oregon 7.3; North Carolina 7.3; West Virginia 7.4; Montana 7.4; Colorado 7.5; Texas 7.7; Louisiana 8.1; Alaska 8.1; Virginia 8.2; Maine 8.3; South Carolina 8.3; South Dakota 8.4; Kentucky 8.7; Florida 8.9; Vermont 8.9; Alabama 9.2; Wyoming 9.5; Utah 9.6; Idaho 10.5; Tennessee 10.9; Arkansas 12.9; Hawaii 23.1; Nevada 57.9
Source: 2010 Statistical Abstract of the United States

Table 3.3 shows the measures of central tendency for the data in Table 3.2, both including and excluding Nevada, the largest outlier. This will enable us to see what happens to the mean and the median when we remove an extreme data point. The sum of the values for all 51 observations is 444.04, so the mean is 444.04/51 = 8.71. Note that this is slightly different from the overall U.S. rate, since each state is weighted the same in this calculation. The median value is 7.05.
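The Table 3.3 comparison is easy to reproduce with a short script, and it also previews the effect of dropping Nevada, which the next paragraph walks through. The sketch below uses the one-decimal rates as printed in Table 3.2; the book's own figures appear to be based on unrounded rates (its later table lists Nevada as 57.94, for example), so the numbers here come out very close to, but not exactly equal to, 8.71, 7.05, 7.72, and 7.04.

```python
import statistics

# Marriage rates per 1,000 population, 2005, as printed in Table 3.2
# (one decimal place; the last value, 57.9, is Nevada).
rates = [4.0, 5.7, 5.8, 5.8, 5.8, 5.9, 6.0, 6.0, 6.0, 6.1, 6.1, 6.4, 6.4,
         6.5, 6.5, 6.7, 6.8, 6.9, 6.9, 6.9, 6.9, 6.9, 7.0, 7.0, 7.0, 7.1,
         7.3, 7.3, 7.3, 7.3, 7.4, 7.4, 7.5, 7.7, 8.1, 8.1, 8.2, 8.3, 8.3,
         8.4, 8.7, 8.9, 8.9, 9.2, 9.5, 9.6, 10.5, 10.9, 12.9, 23.1, 57.9]

without_nevada = rates[:-1]   # drop the largest outlier

print("With Nevada   : n =", len(rates),
      " mean =", round(statistics.mean(rates), 2),
      " median =", statistics.median(rates))
print("Without Nevada: n =", len(without_nevada),
      " mean =", round(statistics.mean(without_nevada), 2),
      " median =", statistics.median(without_nevada))
```

Even from the rounded rates, the pattern in the text is clear: dropping one state moves the mean by roughly a full point while the median barely moves.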
However, when we remove Nevada from the data, the sum drops to 386.10 and the number of observations decreases to 50. Now the mean is 386.10/50 = 7.72. This may not seem like a huge drop, but it represents a decrease of (8.71 − 7.72)/8.71 = 0.1137, or about 11.4 percent, due to one extreme value. In contrast, the median hardly changes, with a new median of 7.04. As noted earlier, the mean is much more sensitive to extreme values in the data than the median. That is why the median is more often used when the data have extreme values or are skewed (see the next section for more on what we mean by "skew").

Table 3.3: Measures of Central Tendency for the Marriage Rate Data

            With Nevada   Without Nevada
Sum            444.04        386.10
Count           51.00         50.00
Mean             8.71          7.72
Median           7.05          7.04
Mode            #N/A          #N/A

Comparing the Mean, Median, and Mode

If we have a variable whose distribution is a symmetrical, bell-shaped curve, the mean, median, and mode will be very similar to one another. The normal distribution is a very special bell-shaped curve where the mean, median, and mode are equal to each other by definition. The symmetrical, bell-shaped curve is important in statistics because it allows us to make inferences about distributions of variables.

The distribution of a variable can give us insight into the measures of center as well as the spread of the data. The spread will be covered in Chapter 4, so here we focus on the distribution and the measures of center. The mean tends to be pulled by extreme values in the data, so whenever the mean is larger than the median, we tend to have extreme high values in the data. How far the mean is from the median reflects the extent of the outliers. Similarly, when the mean is less than the median, the mean is being pulled by extreme low values in the data.

This idea is captured in the skew of the data. The skew reflects a tail in the distribution pulled out by extreme values, either high or low. If the data are skewed to the right, there are a few extreme high values; in this case the mean is greater than the median because the mean is being pulled by those extreme values. If the data are skewed to the left, there are a few extreme low values and the mean is less than the median. Simply comparing the mean to the median can give us a sense of whether extreme values or outliers are present, and in which direction we can expect the skew. A histogram or a stem and leaf plot is an excellent way to look at the skew of a variable.

Figure 3.2 shows three examples: a skewed-left, a symmetrical, and a skewed-right distribution. The graphs provide a visual of the distribution of the variable and the notion of skew. The mean and median values are included for each distribution so you can see how the mean is pulled by outliers.

Figure 3.2: Skewed Left, Symmetrical, and Skewed Right Distributions
Skewed left distribution: the mean is less than the median (mean = 22.35, median = 25.00).
Symmetrical distribution: the mean is equal to the median (mean = 24.97, median = 25.00).
Skewed right distribution: the mean is greater than the median (mean = 26.85, median = 25.00).
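The mean-versus-median rule of thumb for spotting skew is easy to see with a toy example. The sketch below uses made-up "income-like" values (not from the chapter) so that one large value drags the mean above the median.

```python
import statistics

# Made-up values for illustration: most observations are modest,
# one is very large, producing a right-skewed distribution.
incomes = [31, 34, 36, 38, 41, 43, 45, 48, 52, 240]   # e.g., in $000s

mean = statistics.mean(incomes)
median = statistics.median(incomes)

print(f"mean = {mean:.1f}, median = {median:.1f}")
if mean > median:
    print("mean > median: a few extreme HIGH values, skewed to the right")
elif mean < median:
    print("mean < median: a few extreme LOW values, skewed to the left")
else:
    print("mean = median: roughly symmetrical")
```

Replace the 240 with, say, 55 and the two measures land close together, which is the symmetrical case shown in Figure 3.2.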
Summary

Measures of central tendency are a central concept in statistics. They give us a summary measure of the center of a distribution. We discussed three measures of the center: the mean, the median, and the mode.

The mean, or average, is by far the most common measure of the center of the data. It has important mathematical properties that are used in other summary measures, such as the variance and standard deviation. In fact, we will see the mean, or summary measures and statistical techniques based on the mean, throughout the rest of this course. However, we also noted that the mean is sensitive to extreme values in the data and can be misleading when outliers are present.

The median is an alternative measure of the center of the data and is a positional measure. By positional we mean that the median is found by sorting the data, noting the center position, and then reading the value at the center position. In comparison to the mean, the median is far more resistant to outliers in the data and is useful whenever we have highly skewed data, such as the income of people or the value of houses.

The mode, or most frequent value in a data set, was also noted as a measure of central tendency, but it is far less useful than the other two. For some variables the mode is undefined. However, the bunching of data around a particular value can be useful when looking at graphs of variables such as histograms, where we might think of the mode as the most frequent class or category in the graph.

We ended the chapter with a discussion of how the mean, median, and mode relate to each other in the distribution of a variable. In symmetrical, mound-shaped distributions, the mean tends to equal the median, which also tends to equal the mode. The normal distribution is one such mound-shaped distribution and is very important in statistics.

Additional Problems

1. Dr. Ilvento uses a smartphone app to track the distance he walks. While taking the same walk each time, he noticed that the distance stated in the app varied a lot, so he began to record the distance in an experiment. He also recorded the distance in a car and found it to be consistently 2.4 miles. The data are given in the stem-and-leaf plot below, measured in miles (n = 19 observations). Also included is summary information (e.g., the sum of the values).

Stem | Leaf
2.4  | 8
2.5  | 4 6 6 9
2.6  | 0 0 1 2 3 4 5 7 7 9
2.7  | 0 2
2.8  |
2.9  | 0
3.0  |
3.1  |
3.2  |
3.3  |
3.4  |
3.5  | 0
(2.4 | 8 represents 2.48)

Sum of x    50.930
Sum of x^2  137.365
Q1          2.595
Q3          2.680

a. Calculate the mean, median, and mode for these data.
b. What is the median position?
c. In your opinion, which measure of central tendency best represents the center of these data?

2. The table below presents the winning times for the women's Olympic 100-meter race from 1948 to 2012 (n = 17). The data are given in a stem-and-leaf plot, measured in seconds to two decimal places. Also included is summary information (e.g., the sum of the values).

Stem | Leaf
105  | 4
106  |
107  | 5 5 8
108  | 2
109  | 3 4 7
110  | 6 7 8 8
111  | 8
112  |
113  |
114  | 9
115  |
116  | 7
117  |
118  | 2
119  |
120  |
121  |
122  | 0
(109 | 3 represents 10.93)

Sum of x    189.130
Sum of x^2  2107.142
Q1          10.820
Q3          11.180

a. Calculate the mean, median, and mode for these data.
b. What is the median position?
c. In your opinion, which measure of central tendency best represents the center of these data?
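Problems like these can be checked in a few lines once the values are read off the plot. Below is a sketch using the Problem 1 walking distances as listed in the stem-and-leaf plot above; the list reproduces the stated sum of 50.930, which is a good sanity check before computing anything else.

```python
import statistics

# Walking distances (miles) read off the Problem 1 stem-and-leaf plot.
walks = [2.48,
         2.54, 2.56, 2.56, 2.59,
         2.60, 2.60, 2.61, 2.62, 2.63, 2.64, 2.65, 2.67, 2.67, 2.69,
         2.70, 2.72,
         2.90,
         3.50]

print("n          =", len(walks))                   # 19
print("sum        =", round(sum(walks), 3))         # should match 50.930
print("mean       =", round(statistics.mean(walks), 3))
print("median pos =", (len(walks) + 1) // 2)        # odd n: position (n + 1)/2
print("median     =", statistics.median(walks))
print("mode(s)    =", statistics.multimode(walks))  # more than one value can tie
```

If the printed sum does not match, re-read the plot before trusting any of the other numbers.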
3. An experiment was conducted concerning queuing (standing in line) methods at two similar fast food restaurants. In both Store A and Store B there was a single line, and customers were funneled toward the next available register. The experiment was done at off-peak times in Store A and during a rush hour in Store B. The number of minutes until the customer was served was recorded. The data are given below in two stem-and-leaf plots, measured in minutes to one decimal place (each stem is split: the plain stem holds leaves 0 through 4 and the starred stem holds leaves 5 through 9). Also included is summary information (e.g., the sum of the values).

Store A
Stem | Leaf
1    |
1*   | 6 7 9
2    | 0 4
2*   | 5 5 5 7 8 9
3    | 0 0 0 0
3*   | 6 6 8 9 9
4    | 2
4*   |
5    |
(4 | 2 represents 4.2)

Count       21.00
Sum of x    60.50
Sum of x^2  185.53
Q1          2.50
Q3          3.60

Store B
Stem | Leaf
1    |
1*   | 7
2    |
2*   | 5 8 8 9
3    | 1 1 3 3 4
3*   | 5 5 6 7 8 9
4    | 1 1 2 2
4*   | 7
5    |
(4 | 2 represents 4.2)

Count       21.00
Sum of x    72.20
Sum of x^2  257.58
Q1          3.10
Q3          3.90

a. Calculate the mean, median, and mode for each store.
b. Compare the results for the two stores.

4. The rate of cesarean births has increased dramatically in the U.S. (based on a report from the Centers for Disease Control and Prevention, NCHS Data Brief No. 35, March 2010). In 1996 the U.S. rate of cesarean births was 20.7%, and by 2007 this had increased to 31.8%. The rate varied by state. The data for 2007 are given below in a stem-and-leaf plot, measured as a rate with one decimal place. Also included is summary information (e.g., the sum of the values).

Stem | Leaf
38   | 3
37   | 2
36   | 2
35   | 2 9
34   | 6 6 8
33   | 1 1 3 4 5 5 6 7 7 8
32   | 0 1 1 2 6
31   |
30   | 0 1 3 3 4 7 8 9
29   | 0 4 4 4 8 8
28   | 2 4
27   |
26   | 2 2 4 6 8 9
25   | 0 8
24   | 0
23   | 3
22   | 2 6
(The stem is the whole number; the leaf is the decimal place. 22 | 2 represents 22.2.)

Count      51
Sum(x)     1561.40
Sum(x^2)   48538.38
Q1         28.30
Q2         30.70
Q3         33.50

a. Calculate the mean, median, and mode for the 2007 cesarean rates by state.
b. The U.S. rate is given as 31.8%. The mean rate you calculate from the data above is slightly different. Why do you think the two rates differ? (Hint: the difference is sometimes described as a weighted versus an unweighted mean.)

5. A recent exam in an introductory graduate statistics course resulted in the following distribution (scores are out of 100). The data are given in a stem-and-leaf plot. Also included is summary information (e.g., the sum of the values). Calculate the mean, median, and mode of the test scores.

Exam Results
Stem | Leaf
1    | 8
2    |
3    |
4    | 2
5    |
6    | 3
7    | 2 9
8    | 1 2 6 7 7 7 8 9 9
9    | 1 1 2 2 3 4 5 5 6 6 6 6 6 7 8 8 8 8
10   | 0 0 0
(9 | 1 represents 91)

Count     35
Sum(x)    3062.00
Sum(x^2)  277386.00
Q1        87.00
Q3        96.00

a. At least two of the scores are extreme values (18 and 42). Remove these values from the data and recalculate the mean, median, and mode. To do this, subtract each value from the Sum(x) and change the count from 35 to 33.
b. Did the results change when the outliers were removed?

6. Forbes magazine estimated the value of all major league baseball teams in March 2013, expressed in millions of dollars. A stem-and-leaf plot from JMP software accompanies this problem (not reproduced here), along with the summary information that follows. There are two large outliers: the New York Yankees, valued at $2.300 billion (expressed as 2300 in the data), and the Los Angeles Dodgers, valued at $1.615 billion (expressed as 1615 in the data).
Summary Statistics
Count   30
Sum     22,307.0
SumSq   20,882,751.0
Q1      559.8
Q2      627.5
Q3      752.5
Min     451.0
Max     2,300.0

a. Calculate the mean, median, and mode of the team values. (Note: the values are in millions, so a value of 1312 is $1.312 billion, or $1,312,000,000.)
b. At least two of the values are extreme (2300 and 1615). Remove these values from the data and recalculate the mean, median, and mode. To do this, subtract each value from the Sum(x) and change the count from 30 to 28.
c. Did the results change when the outliers were removed?

7. Scramble with Friends is a social word game on smartphones. It presents a 4x4 table of letters, and the object is to generate as many words as possible in 2 minutes to win. Letters for words must be contiguous, but they can go in any direction. Each game has three rounds, each lasting 2 minutes, and the second and third rounds have bonus options to increase the score. A player earns points based on word length, the particular letters used, and sometimes bonus letters or word points in rounds 2 and 3. Dr. Ilvento played and recorded his score for 60 games in 2012 and 60 games in 2013 (the number of rounds was 60 x 3 = 180 in each year). He recorded the number of points in each round as well as the number of words.

The data are the number of words per round in 2013 (n = 180). A histogram and a stem-and-leaf plot from JMP software accompany this problem (not reproduced here), along with the summary information below (e.g., the sum of the values).

Summary Statistics
Count     180
Sum(x)    10,062.00
Sum(x^2)  578,652.00
Q1        50.00
Q2        55.50
Q3        62.25

a. Calculate the mean, median, and mode for the number of words in 2013.
b. Briefly compare the three measures of central tendency for these data. Based on the plots and the measures of central tendency, what do they suggest about the distribution of words in 2013?

8. Scramble with Friends is a social word game on smartphones. It presents a 4x4 table of letters, and the object is to generate as many words as possible in 2 minutes to win. Letters for words must be contiguous, but they can go in any direction. Each game has three rounds, each lasting 2 minutes, and the second and third rounds have bonus options to increase the score. A player earns points based on word length, the particular letters used, and sometimes bonus letters or word points in rounds 2 and 3. Dr. Ilvento played and recorded his score for 60 games in 2012 and 60 games in 2013 (the number of rounds was 60 x 3 = 180 in each year). He recorded the number of points in each round as well as the number of words.

The data are the score per round in 2013 (n = 180). A histogram from JMP software accompanies this problem (not reproduced here), along with the summary information below (e.g., the sum of the values).

Summary Statistics
Count     180
Sum(x)    105,557.00
Sum(x^2)  72,512,839.00
Q1        429.00
Q2        513.00
Q3        689.00
Min       242.00
Max       2228.00

a. Calculate the mean, median, and mode for the score per round in 2013.
b. Briefly compare the three measures of central tendency for these data. Based on the histogram and the measures of central tendency, what do they suggest about the distribution of score per round in 2013?
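For Problems 7 and 8, the mean can be recovered directly from the summary information (Sum(x) divided by the count), and Q2 in the JMP output is the median, so the mean-versus-median comparison from earlier in the chapter can be applied without the raw data. A minimal sketch:

```python
# Mean from the summary information for Problems 7 and 8;
# Q2 in the JMP output is taken as the median.
problems = {
    "words per round (2013)": {"sum_x": 10_062.00, "n": 180, "median": 55.50},
    "score per round (2013)": {"sum_x": 105_557.00, "n": 180, "median": 513.00},
}

for name, s in problems.items():
    mean = s["sum_x"] / s["n"]
    gap = mean - s["median"]
    print(f"{name}: mean = {mean:.1f}, median = {s['median']}, mean - median = {gap:+.1f}")
```

A mean sitting well above the median, as it does for the scores, is the signature of a right-skewed distribution; for the word counts the two are nearly equal, suggesting a roughly symmetric distribution.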
9. Each year the Academy of Motion Picture Arts and Sciences picks a best male actor and a best female actor in a film. A stem-and-leaf plot from JMP software of the ages of the best female actor winners from 1934 to 2012 accompanies this problem (not reproduced here; n = 80, since there were two winners in 1968), along with the summary information below.

Summary Statistics
Count     80
Sum(x)    2,874.0
Sum(x^2)  113,902.0
Q1        28.8
Q2        33.0
Q3        40.3
Min       21.0
Max       80.0

a. Calculate the mean, median, and mode for the ages of the best female actor winners.
b. Briefly compare the three measures of central tendency for these data. Based on the stem-and-leaf plot and the measures of central tendency, what do they suggest about the distribution of ages for female winners?

10. Each year the Academy of Motion Picture Arts and Sciences picks a best male actor and a best female actor in a film. A stem-and-leaf plot from JMP software of the ages of the best male actor winners from 1934 to 2012 accompanies this problem (not reproduced here; n = 79), along with the summary information below.

Summary Statistics
Count     79
Sum(x)    3,458.0
Sum(x^2)  157,514.0
Q1        38.0
Q2        42.0
Q3        49.0
Min       29.0
Max       76.0

a. Calculate the mean, median, and mode for the ages of the best male actor winners.
b. Briefly compare the three measures of central tendency for these data. Based on the stem-and-leaf plot and the measures of central tendency, what do they suggest about the distribution of ages for male winners?
c. The average age for females who won best actor over the same time period is 35.9 years. Compare the two average ages.

Chapter 4: Measures of Variability

Central tendency tells only part of the story when describing a variable. Another aspect of data is its spread, or variability. We are still looking for summary statistics, simplified measures that help describe the data, but now we concentrate on why cases differ from one another. There are several intuitive measures of the spread of data, including the range, the inter-quartile range (IQR), the variance, the standard deviation, and the coefficient of variation.

A Simple Example of Why the Spread Is Important

Imagine we have two data sets. Data set 1 has a mean, median, and mode of 5, and Data set 2 also has a mean, median, and mode of 5. We might conclude they are one and the same, since all the measures of central tendency agree with each other. However, if we look at the data we see a different story.

Variable 1 has the values {2, 3, 4, 5, 5, 6, 7, 8}. The sum of X1 = 40 and n = 8, so the mean = 5.
Variable 2 has the values {5, 5, 5, 5, 5, 5, 5, 5}. The sum of X2 = 40 and n = 8, so the mean = 5.

However, all the values in Variable 2 are the same: there is no variability, and X2 is a constant! We need something more to describe a variable, namely its variability. Variability is the spread of the data, typically measured around the center of the data. If there is no variability, then X is a constant and is no longer a variable, because all the values are the same (the short sketch below verifies this).
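Here is a minimal check of the two data sets just described; it confirms that every measure of center agrees while the spread is completely different. (The spread measures used here, the range and the standard deviation, are defined in the sections that follow.)

```python
import statistics

x1 = [2, 3, 4, 5, 5, 6, 7, 8]   # Variable 1: same center, real spread
x2 = [5] * 8                    # Variable 2: a constant

for name, x in [("X1", x1), ("X2", x2)]:
    print(name,
          "mean =", statistics.mean(x),
          "median =", statistics.median(x),
          "mode =", statistics.mode(x),
          "range =", max(x) - min(x),
          "std dev =", round(statistics.stdev(x), 2))
```

Both variables have the same center; only the spread separates them.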
In this chapter we will focus on the following measures of variability:

• Range
• Inter-quartile range (IQR)
• Variance
• Standard deviation
• Coefficient of variation (CV)

We will also discuss the variance and standard deviation in relation to the mean, and how to interpret them for some types of variables using Chebyshev's rule and the Empirical Rule. These rules in turn will give us a new way to define what an outlier is.

As with the measures of central tendency, I will start with a small data set to illustrate the measures of spread, and then I will demonstrate using other data (Table 4.1). I will explain why I squared the X values when I get to the computational formula for the variance.

Table 4.1: A Small Data Example for the Measures of Variability

OBS    X      X squared
1      21.0     441.0
2      22.0     484.0
3      23.0     529.0
4      24.0     576.0
5      25.0     625.0
6      26.0     676.0
7      27.0     729.0
8      28.0     784.0
9      29.0     841.0
10     30.0     900.0

n = 10; Sum of X = 255; Mean = 25.5; Sum of X squared = 6585.0

Range

The range is a fairly intuitive measure of the spread of the data. It is calculated as the difference between the highest (maximum, or max) and lowest (minimum, or min) value in the data. The range provides a sense of the extremes in the data. It is an order statistic and depends on only the two most extreme values in the data. As such, the range may be seriously influenced by outliers.

The range for the sample data in Table 4.1 is calculated as:

Minimum = 21.0
Maximum = 30.0
Range = 30.0 − 21.0 = 9.0

One limitation of the range as a measure of spread is that it depends on only two values of the variable, the two most extreme ones. Thus it does not use much of the information in the variable. This leads to the second limitation: the range is sensitive to extreme values in the variable. One or two extreme values can have a large influence on the range. Another way to say this is that the range is sensitive to outliers.

Inter-Quartile Range

An alternative to the range is the inter-quartile range, which is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). The abbreviation for the inter-quartile range is the IQR. The inter-quartile range provides a sense of the range in the middle of the data and is not as sensitive to extreme values. It is also a positional statistic, because it depends on finding the positions of values in a variable that has been ordered from lowest to highest.

For the purposes of this course, we will not worry about how to actually calculate the 25th and 75th percentiles. The formulas are somewhat similar to the median's, but with small sample sizes the calculation can be complicated. However, I do want to give you some sense of the value of the IQR when dealing with outliers or extreme values in a variable.

The IQR for the sample variable in Table 4.1 is:

Q1 (the 25th percentile) is between the 2nd and 3rd observations = 22.75
Q3 (the 75th percentile) is between the 8th and 9th observations = 28.25
Inter-quartile range = 28.25 − 22.75 = 5.50

The IQR is an alternative to the range as a measure of the spread of the middle 50 percent of the values in a variable, that is, between the 25th and 75th percentiles. The IQR is also used in constructing box plots (see Chapter 2).
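The range and IQR for Table 4.1 can be confirmed in a few lines. The sketch below uses Python's statistics.quantiles, whose default settings happen to reproduce the 22.75 and 28.25 quoted above; other packages may use a slightly different quartile convention and give slightly different cut points.

```python
import statistics

x = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]    # the Table 4.1 values

rng = max(x) - min(x)
q1, q2, q3 = statistics.quantiles(x, n=4)       # quartile cut points
iqr = q3 - q1

print("range =", rng)            # 9
print("Q1 =", q1, " Q3 =", q3)   # 22.75 and 28.25 with this method
print("IQR =", iqr)              # 5.5
```

Quartile conventions differ between packages (Excel, JMP, and R each offer options), so small differences in Q1 and Q3 are normal; the IQR's resistance to outliers holds regardless.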
Variance

The concept of deviations around the mean is intuitively appealing as a measure of the spread of the data. If the mean is a good measure of central tendency, then it is reasonable to ask how different (or how far away) a particular value of a variable (X) is from the mean of X. Taking this idea a step further, we might ask what the average distance of all the values in the variable from the mean is. We start with this idea when calculating the variance as a statistical summary measure of spread in a variable.

Getting to an average deviation based on the mean can be a tricky thing. Remember, one of the properties of the mean is that the sum of deviations around the mean equals zero. As a result, we cannot simply calculate an average deviation around the mean, because that answer will always be zero (the first formula below).

Mean deviation (always zero):  Σ (xᵢ − x̄) / n

One alternative summary measure is the mean absolute difference, the average of the absolute differences between each value and the mean (the second formula below). This simple adjustment gets around the limitation that the sum of the deviations around the mean equals zero, because it makes each deviation positive, and it does generate a unique summary measure for each variable (rather than always equaling zero, as the mean deviation does).

Mean absolute difference:  Σ |xᵢ − x̄| / n

However, this approach does not have good inferential properties, and there is another approach that is viewed as more useful: the variance. The third approach is to square the differences from the mean, because the square always yields positive deviations around the mean. More specifically, we square the deviations around the mean and take an average squared deviation by dividing by the number of observations (by deviations, we mean the difference of each value from the mean). This is the variance.

Let us look at the formula for the variance more closely. We use the Greek symbol σ² (sigma squared) to represent the variance of a population, along with the population mean μ; the sample variance is written s².

Population variance:  σ² = Σ (xᵢ − μ)² / N
Sample variance:      s² = Σ (xᵢ − x̄)² / (n − 1)

When we calculate the sample variance we divide by n − 1. This has to do with a concept called degrees of freedom (see the box below). The need to divide by the degrees of freedom has to do with making inferences from a sample to the population: if we used n in the formula for the sample variance, we would tend to underestimate the population variance. This concept is difficult to explain fully at this point of the course, so for now you will have to accept it on faith.

The sample formula, with n − 1 in the denominator, is the one we will use throughout this course, since ultimately we will be interested in making inferences to a population. Be careful with your calculator (and with a spreadsheet such as Excel) if you are using a built-in function to calculate the variance: most calculators and spreadsheets offer both the population variance and the sample variance, and almost always in this course we will want the sample variance.

The numerator of the variance is the sum of the squared differences between each value and the mean. This numerator is called the total sum of squares (TSS, a term we will see again in ANOVA and regression). Since we square the deviations around the mean, the numerator is always positive. Once we divide by n − 1, the degrees of freedom based on the number of observations, the variance reflects the average squared deviation around the mean. A property of the mean is that the TSS about the mean is smaller than it would be about any other constant placed in the formula in place of the mean; this is called the minimum variance property of the mean.

Degrees of Freedom

When we are dealing with a sample from a population, and our ultimate goal is some sort of inference, the formulas for the variance and standard deviation must be adjusted for degrees of freedom. The adjustment to the sample formula uses n − 1 in the denominator. We will use the n − 1 formula for the variance (and the square root of that formula for the standard deviation) almost exclusively for the rest of this course. Degrees of freedom is an important concept in inferential statistics and will reappear in more advanced analyses. While it is a difficult concept to grasp at this level, think of it as a necessary adjustment when dealing with a sample: using n in the formula for the sample variance tends to underestimate the population variance. Note that the adjustment makes more of a difference when the sample size is small (less than 30) than when the sample is large (greater than 1,000).
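The three candidate measures just discussed (mean deviation, mean absolute difference, variance) are easy to compare on the Table 4.1 data. A minimal sketch, written out by hand rather than with library calls so each formula stays visible:

```python
x = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]   # Table 4.1 values
n = len(x)
mean = sum(x) / n

# 1) Mean deviation: always zero, so useless as a spread measure.
mean_dev = sum(xi - mean for xi in x) / n

# 2) Mean absolute difference: positive, but lacks good inferential properties.
mean_abs_diff = sum(abs(xi - mean) for xi in x) / n

# 3) Sample variance: average squared deviation, divided by n - 1
#    (the degrees of freedom) rather than n.
sample_var = sum((xi - mean) ** 2 for xi in x) / (n - 1)

print("mean deviation           =", mean_dev)              # 0.0
print("mean absolute difference =", mean_abs_diff)         # 2.5
print("sample variance          =", round(sample_var, 3))  # 9.167
```

The 9.167 here is the same sample variance worked out from Table 4.2 in the next section.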
Another way to describe the variance is that it is the mean squared deviation.

Total Sum of Squares:  TSS = Σ (xᵢ − x̄)²

Let us calculate the variance using the sample data from Table 4.1 (now referred to as Table 4.2). I have added several more columns to the sample data to make the calculation easier to follow.

Table 4.2: A Small Data Example for the Measures of Variability, with Additional Columns

OBS    X      X squared   X − Mean   Squared Dev.
1      21.0     441.0       −4.5        20.25
2      22.0     484.0       −3.5        12.25
3      23.0     529.0       −2.5         6.25
4      24.0     576.0       −1.5         2.25
5      25.0     625.0       −0.5         0.25
6      26.0     676.0        0.5         0.25
7      27.0     729.0        1.5         2.25
8      28.0     784.0        2.5         6.25
9      29.0     841.0        3.5        12.25
10     30.0     900.0        4.5        20.25

n = 10; Sum of X = 255; Mean = 25.5; Median = 25.5; Sum of X squared = 6585.0; Sum of (X − Mean) = 0.0; Sum of Squared Dev. = 82.5

The X − Mean column of Table 4.2 clearly shows that the sum of the deviations about the mean equals zero, and this is why that approach is not useful as a measure of spread or variability in the data. The sum of the squared deviations about the mean (the Squared Dev. column) is equal to 82.5. When I divide this value by the degrees of freedom (10 − 1 = 9), I get the sample variance, which is 9.167.

s² = Σ (xᵢ − 25.5)² / (10 − 1) = 82.5 / 9 = 9.167

Computational Formula for the Variance

The calculations for the variance can be tedious. Subtracting each value from the mean, squaring the result, adding each of these squared deviations, and then dividing by the degrees of freedom involves a lot of calculations. For the sake of this course, I will not require you to do this by hand very often; most of the time we will let a calculator or software calculate it for us. However, I want to introduce a computational formula for the variance that makes the computation more accurate, and perhaps easier. A computational formula is a modification of the original formula that results in the same answer, but it either makes the statistic easier to calculate or has less rounding error.
The two formulas below yield an equivalent result for the sample variance, and I will use both throughout the course.

s² = Σ (xᵢ − x̄)² / (n − 1)                  (the formula for the sample variance)
s² = [ Σ xᵢ² − (Σ xᵢ)² / n ] / (n − 1)       (the computational formula for the sample variance)

In terms of calculating the variance with the computational formula, I only need to give you:

• The sum of all the x values
• The sum of each x value squared
• The sample size (n)

With this information you should easily be able to calculate the mean and the variance. Let me demonstrate using the information from Table 4.2. Be sure that you see how each element of the computational formula is derived and that you know how to calculate it.

• The sum of all the x values = 255
• The sum of each x value squared = 6585
• The sample size (n) = 10

s² = [ Σ xᵢ² − (Σ xᵢ)² / n ] / (n − 1) = (6585 − 255²/10) / 9 = (6585 − 6502.5) / 9 = 9.167

Standard Deviation

Average squared deviations around the mean are awkward to discuss and difficult to interpret; anything in squared terms is hard to describe. Fortunately, it is relatively easy to convert the variance back into the variable's original terms: if we take the square root of the variance, we have a value that is no longer in squared units, bringing the measure back into the original metric of the variable. This new term is the standard deviation, or the average deviation around the mean. We use the Greek letter σ (sigma) to represent the population standard deviation and the term s to represent the sample standard deviation. The formula below shows the sample standard deviation; note that it is simply the square root of the variance.

s = sqrt[ Σ (xᵢ − x̄)² / (n − 1) ]

Interpreting the Standard Deviation

We can use the standard deviation to express the proportion of cases that should fall within one, two, or more standard deviations of the mean. We will use two theorems to help interpret the standard deviation:

1. Chebyshev's rule (also known as Tchebysheff's theorem)
2. The empirical rule

Chebyshev's rule is a mathematical theorem that applies to any variable, regardless of its distribution. It states that at least 3/4 of the values of a variable will fall within ±2 standard deviations of the mean. That does not mean it could not be more, but at least 3/4 of them will. Note also that Chebyshev's rule does not say the values will be distributed symmetrically around the mean. Chebyshev's rule also states that at least 8/9 of the measurements (about 89%) will fall within ±3 standard deviations of the mean.

Chebyshev's rule applies to any variable, regardless of its distribution, but it is not very specific. If our variable is symmetrical and mound-shaped in its distribution, we can use the empirical rule to make stronger statements when interpreting the standard deviation. By symmetrical we mean that the distribution is the same (or reasonably close) to the left and to the right of the mean. By mound-shaped we mean that the largest proportion of the observations is centered around the middle of the distribution, and the mean, median, and mode of the variable are reasonably close.

If our variable is symmetrical and mound-shaped, the empirical rule tells us that approximately 68% of the observations should fall within plus or minus one standard deviation of the mean, 95% within plus or minus two standard deviations, and nearly all the observations (99.7%) within plus or minus three standard deviations. We can express this as:

• 68% of the observations are within ±1·s of the mean
• 95% of the observations are within ±2·s of the mean
• 99.7% of the observations are within ±3·s of the mean

This rule allows us to say how likely or unlikely it would be to find a value that is a certain number of standard deviations away from the mean.

Z-Scores

The z-score approach is a method of transforming data to reflect the relative standing of a value in relation to the mean, in terms of the standard deviation. A z-score is calculated by subtracting the mean from a value and then dividing by the standard deviation:

zᵢ = (xᵢ − x̄) / s

The result represents the distance between a given measurement x and its mean, expressed in standard deviations. A positive z-score means the measurement is larger than the mean, while a negative z-score means it is smaller than the mean. By dividing by the standard deviation, we can say how far away a value is from its mean in a relative way; the relative expression is how far away in standard deviations.

If we were to convert an entire variable to z-scores (take each value, subtract the mean, and divide by the standard deviation), we would create a new variable that has a mean equal to zero and a standard deviation equal to one. The new variable would be in standardized units and would allow us to compare different values to each other in terms of how many standard deviations from the mean they are. A z-score transformation does not change the order of the data or the shape of its distribution, because we are subtracting and dividing by constant values (the mean and the standard deviation). A z-score transformation can help in interpreting a variable, in comparing variables measured on different scales, and in working with variables whose measurement is somewhat contrived or arbitrary, such as an index.

In terms of the empirical rule, z-scores have an even easier interpretation:

• Approximately 68% of the measurements will have a z-score between −1 and 1
• Approximately 95% of the measurements will have a z-score between −2 and 2
• Almost all the measurements (99.7%) will have a z-score between −3 and 3
Transforming to z-scores makes these types of problems even easier and leads to a rare event approach to statistics. The rare event approach is a basic strategy of data analysis that will fit very well with hypothesis testing later on. In the rare event approach, we start with a hypothesized frequency distribution that describes a population of measurements. In many cases the hypothesized distribution reflects a world where nothing is going on with the data, i.e., the status quo. Next we draw a sample of data from the population (most often in a random fashion). We then compare the sample statistic to the hypothesized frequency distribution to see how likely or unlikely it is that the sample came from that distribution. If the sample value is very unusual relative to the hypothesized value, we have strong evidence that our sample is different from the hypothesized value.

The empirical rule also gives us a rule of thumb for deciding whether a value is an outlier. If a value is more than three standard deviations away from the mean, it is extremely rare; in a probabilistic framework, we would say it is possible but not very probable. Thus, if we had a compact car that gets less than 29.8 mpg or more than 44.2 mpg, we might ask questions. Perhaps it is a performance car that belongs to a different population of compact cars, or, on the high end, a unique specialty hybrid. Or someone could have made a mistake in measuring the mpg or in entering the data into a computer. The fact that a value is extreme does not make it wrong or bad, but it should cause us to ask questions and examine it further.

Coefficient of Variation

Another way to express the standard deviation is in relation to the mean. The coefficient of variation (CV) is the ratio of the standard deviation to the absolute value of the mean, usually expressed as a percentage. By taking a ratio, we express the standard deviation relative to the mean, which provides a way of saying how much variability there is in a variable relative to the size of its mean. The higher the percentage, the more variability there is in the variable.

The CV is particularly useful when comparing the variability of different variables. For example, suppose we have a data set on customers and we want to compare the variability of their education level and their income. It would not be useful to compare the standard deviations, because the metric of income is so much larger than that of education. However, we could compare the CVs of the two variables and talk about which one has more variability. The CV formula is:

CV = ( s / |x̄| ) × 100

The Variance and Outliers

Because the variance is calculated from squared deviations around the mean, it can be sensitive to outliers in the data. Values that are far away from the mean produce large deviations, and once we square them they can contribute a great deal to the variance. The standard deviation is also sensitive to outliers, but somewhat less so, since it is the square root of the variance.

Let us return to the marriage rate data for the 50 states and the District of Columbia from Chapter 3 (see Table 4.3). Recall that the marriage rate is calculated as the total number of marriages divided by the population for each state and the District of Columbia. Nevada, because of its reputation for quick marriages, had a much higher marriage rate than the other states in 2005. We can calculate the measures of variability to see how much an outlier can influence the various measures of spread.

The maximum value is 57.94 (Nevada) and the minimum is 4.03 (District of Columbia). As a result, the range is 53.91, a substantial difference between the highest and lowest rates. As noted earlier, the range relies on the two most extreme values in the data. When Nevada is removed from the data, the range shrinks to 19.06; just one value had a considerable impact on the range.

In comparison, the inter-quartile range is much smaller. The first quartile is 6.47 and the third quartile is 8.30, giving an IQR of 1.83. We would interpret this as: the difference between the low and high values within the middle 50 percent of the data is only 1.83.
The middle 50 percent of the values has considerably less variability than the variable as a whole. When Nevada is removed from the data, the IQR barely changes (it becomes 1.82). One way to say this is that the IQR is resistant to outliers compared with the range.

Table 4.3: Marriage Rates by State, 2005 (this table repeats the data shown in Table 3.2, sorted from lowest to highest; source: 2010 Statistical Abstract of the United States)

When we examine the variance and standard deviation, the effect of the outlier Nevada is considerable. For the full data set, the total sum of squares is 6,695.12. When just one value, the Nevada rate, is removed, the sum of squares drops to 3,338.19. That is a 50 percent decrease from a single value! The variance for the full data is 56.58, while the variance for the reduced data is 7.28, nearly an eight-fold decrease. Likewise, the standard deviation declines from 7.52 to 2.70 once Nevada is removed from the data. The coefficient of variation shows the effect of the outlier nicely: the CV is 86.39 for the full data, meaning the standard deviation is about 86 percent of the mean, but once Nevada is removed the CV drops to 34.94, indicating the standard deviation is only about 35 percent of the mean.

Table 4.4: Measures of Variability for the 2005 Marriage Rate Data, with and without Nevada

Measure of Spread          With Nevada   Without Nevada
Sum                           444.04         386.10
Count                          51.00          50.00
Mean                            8.71           7.72
Median                          7.05           7.04
Mode                           #N/A           #N/A
Sum of Squares               6695.12        3338.19
Min                             4.03           4.03
Max                            57.94          23.09
Range                          53.91          19.06
Q1                              6.47           6.45
Q3                              8.30           8.27
Variance                       56.58           7.28
Std Deviation                   7.52           2.70
Coefficient of Variation       86.39          34.94

The marriage rate data, with or without Nevada, hardly have a symmetrical, mound-shaped distribution, so the empirical rule does not apply. However, it is still useful to calculate a z-score for Nevada to see just how many standard deviations it lies from the mean:

z(Nevada) = (57.94 − 8.71) / 7.52 = 6.55

The marriage rate for Nevada is 6.55 standard deviations above the mean. Z-scores above 3 are considered rare, so this value is very unusual. An extreme value more than three standard deviations from the mean is not wrong or bad in itself. However, it does indicate there is an extreme value in the data, and that this value can have great influence on some of the measures of central tendency and variability. In this case, Nevada is so unusual compared to all the other states that we might want to exclude it from any further analysis. The next highest value, for Hawaii, is only 1.91 standard deviations above the mean:

z(Hawaii) = (23.1 − 8.71) / 7.52 = 1.91
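Most of Table 4.4 can be reproduced with a short script. The sketch below recomputes the measures of spread from the one-decimal rates printed in Table 3.2, with and without Nevada. Because the book's table was clearly built from unrounded rates (it lists the minimum as 4.03 and the maximum as 57.94) and quartile conventions vary by package, the results come out close to, but not identical to, the printed values; the pattern of the comparison is the same.

```python
import statistics

# Marriage rates per 1,000 population, 2005 (Table 3.2, one decimal place);
# the final value, 57.9, is Nevada.
rates = [4.0, 5.7, 5.8, 5.8, 5.8, 5.9, 6.0, 6.0, 6.0, 6.1, 6.1, 6.4, 6.4,
         6.5, 6.5, 6.7, 6.8, 6.9, 6.9, 6.9, 6.9, 6.9, 7.0, 7.0, 7.0, 7.1,
         7.3, 7.3, 7.3, 7.3, 7.4, 7.4, 7.5, 7.7, 8.1, 8.1, 8.2, 8.3, 8.3,
         8.4, 8.7, 8.9, 8.9, 9.2, 9.5, 9.6, 10.5, 10.9, 12.9, 23.1, 57.9]

def spread_summary(data):
    mean = statistics.mean(data)
    s = statistics.stdev(data)                 # sample standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "n": len(data),
        "mean": round(mean, 2),
        "range": round(max(data) - min(data), 2),
        "IQR": round(q3 - q1, 2),
        "variance": round(statistics.variance(data), 2),
        "std dev": round(s, 2),
        "CV": round(s / abs(mean) * 100, 1),
    }

print("With Nevada   :", spread_summary(rates))
print("Without Nevada:", spread_summary(rates[:-1]))

# The z-score for Nevada, using the full-data mean and standard deviation:
mean_all = statistics.mean(rates)
s_all = statistics.stdev(rates)
print("z(Nevada) =", round((57.9 - mean_all) / s_all, 2))
```

Dropping a single state cuts the variance by roughly a factor of eight while the IQR stays put, which is exactly the contrast Table 4.4 is drawing.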
A Data Example: Summary Statistics for a Symmetrical, Mound-Shaped Variable

Let us use all the information we have learned thus far to describe a variable. This variable has 154 observations and is measured on a continuous level. We will use the software package JMP to generate the summary statistics. You will notice from the output that JMP generates more summary measures of central tendency and variability than we have discussed thus far, and that some we have discussed, such as the mode, are not included. However, the core measures, the mean, median, variance, and standard deviation, are all there. JMP also generates graphs such as a histogram, a box plot, and a stem and leaf plot for our use.

The mean for this variable is 24.97, which is nearly identical to the median. The histogram, box plot, and stem and leaf plot all show that the distribution is symmetric and mound-shaped. The range is 16 (33 − 17) and the IQR is 4 (27 − 24). The variance is 10.95 and the standard deviation is 3.31. There is not much spread in this variable, since the coefficient of variation is only 13.25.

If we calculate a range of plus or minus three standard deviations about the mean, we get the interval:

24.97 ± 3 × 3.31 = 24.97 ± 9.93 = 15.04 to 34.90

All the observations of this variable fall within this interval. I can also calculate a z-score for an individual observation. One observation is 18.0, and its z-score is:

z = (18.0 − 24.97) / 3.31 = −2.11

This observation is 2.11 standard deviations below the mean.

Summary Statistics from JMP for a Symmet...