Introduction to Statistics,
Plain and Simple
Statistics is an important tool for many fields—business, the physical
sciences, economics and the social sciences, engineering, and the biological
sciences. It enables us to examine and test important research questions concerning individual variables and the relationships among a set of variables.
The results, if used properly, can help make difficult decisions. There is hardly
a field that does not use some form of statistical data analysis as a prime tool
in the research process. Most likely, your own field of study uses statistical
techniques in research and analysis of data and that is why you are required
to take a course in statistics.
chapter
1
Many students approach statistics with some fear and trepidation. Common
concerns involve anxiety over math skills, a worry of not being able to get
the logic behind a statistical test, and a feeling of distrust for the relevance
of statistics. In terms of the former concerns, modern desktop and laptop
computers have made most of the calculations of statistics easy and painless. These tools enable us to focus on the more important aspects of sound
data analysis practice and interpreting the results of our analysis correctly.
While we will present formulas and equations, the emphasis in this book is
on understanding how statistics will be applied, and gaining insight into the
logic behind their use. As for the latter concern, the distrust of things statistical, I hope to help you gain some appreciation for the topic as we move
through the course. Statistics can be and are very relevant in the research
process, provided they are applied with respect and care.
There are many examples of the importance of statistics that we encounter
each and every day. For example:
• How do they know how much rain or snow has fallen in a given period of
time?
• Can a business person make decisions about the future by analyzing data
from the past?
• Can researchers ever get a good measurement of crowd size at a war or
political protest?
• Can a sales team make decisions on new products from a sample of
consumers?
• How do drug trials lead to the acceptance of a new drug that can be
brought to market?
The answer to each of these questions involves the use of statistics. In many
cases the answers are estimates from samples, and they come with some
concerns about the ability to truly measure the concept and the ability of
the sample to represent the population. Some are direct measurements,
such as the amount of snowfall, and some require a model to predict or
forecast future events. People in the field will use trial and error, experience,
and theoretical mathematics to make reasonable estimates of the amount
of snowfall, future trends in the economy, and whether a new drug can be
effective. You will learn that the answers to these and many more questions
contain some error. At times the conclusions will be wrong, and that is a
chance we have to accept when working with samples and models. Again
and again I will state that the inferential aspects of statistics are not about
certainty. They are about making reasonable conclusions based on a sample
of data and knowledge of statistical theory. Overall, statistics provide us with
powerful tools for analyzing data and making decisions from an experiment
or sample of data, but we can be wrong in our conclusions.
The focus of this book is on understanding the basics of statistics. I would
like you to:
• Gain an appreciation for how descriptive and inferential statistics are
used in various fields of study, from business to agriculture, healthcare,
and economics;
• Learn how to analyze a set of data;
• Learn how to present the data and communicate meaningful and coherent conclusions to others; and
• Learn how to critique the use of statistics by others.
This book provides an overview of most of the topics you might expect to
find in a beginning statistics course. It is meant to be a general overview to
build a foundation for further work. Obviously, anyone who wants to use
statistics in research or a work environment will need to take additional
courses and consult additional resources. The title of the book expresses my
personal goal in this course: to help you understand statistics plainly and
simply. I do not seek to cover every topic in statistics, or even as many topics
as one might find in other introductory textbooks. My goal is to build a good
foundation for students to learn about statistics. It is my hope for many of
you that you will take additional courses and gain more depth of knowledge
about and insight into statistics and statistical techniques.
We will begin with some general concepts and terms that you will need to
know to begin to speak the language of research methods and statistics.
Some ideas, such as measurement, are not statistical concepts per se, but
they are important in understanding and interpreting statistics in research.
This chapter will be decidedly non-mathematical. However, later chapters
will include formulas and real data.
What Are Statistics?
There are many concepts involving the word statistics, and there are many
definitions of what it means. Statistics are thought to be the data itself
(the government released the latest statistics on unemployment); a field of
study within mathematics; and a set of tools used by many disciplines to
analyze data.
In its broadest sense, statistics is the science of data. The field refers to
aspects of
• Collecting data
• Classifying, summarizing, and organizing data
• Analyzing data
• Interpreting the analysis of data
It is important to note that statistics is both a field of study and the application
of a set of tools to analyze data. Statisticians work primarily in developing and testing a set of tools to analyze data, especially in relation to making inferences from a sample of data to a population. Much of the work
of statisticians focuses on derivations of techniques, theoretical proofs,
and providing a literature as to the effectiveness and usefulness of various
techniques. Statisticians also work directly with data and work in companies, government agencies, and in research institutions to apply statistical
techniques to a range of data.
However, most applications of statistics are done by people who are not
statisticians. Sociologists, psychologists, economists, political scientists, biologists, physicians and nurses, and business analysts are not statisticians, but
they rely on the work of statisticians to apply various techniques to data. Most
likely, you are taking a course in statistics because your discipline feels you
need some exposure to statistics to read the literature or even to participate in
the research process.
Most of the statistical techniques we learn in a statistics course help us
describe and summarize our data. For example, the mean or average is a
summary measure of central tendency of the data. With the mean, we can
use a single statistical measure, and summarize 100, 1,000, or even 10,000
data points. In some cases we want to go further than simply summarizing
the data—we want to make an inference from an experiment or a sample of
data. In order to do so, we need to first distinguish between description and
inference.
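To make the idea of a summary measure concrete, here is a minimal Python sketch. The 10,000 "income" values are simulated and purely hypothetical; they stand in for any large set of data points, and the variable names are illustrative assumptions rather than anything used later in the book.

```python
import random
import statistics

# Hypothetical data: 10,000 simulated incomes (not real survey data).
rng = random.Random(1)
incomes = [rng.gauss(50_000, 12_000) for _ in range(10_000)]

# Two numbers summarize all 10,000 data points.
mean_income = statistics.mean(incomes)    # measure of central tendency
spread_income = statistics.stdev(incomes)  # measure of spread

print(f"n = {len(incomes)}")
print(f"mean = {mean_income:,.0f}")
print(f"standard deviation = {spread_income:,.0f}")
```

The point is parsimony: the mean and standard deviation describe the whole data set without listing every value.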
Descriptive versus Inferential Statistics
We make a distinction between two main approaches in the use of statistics
for data analysis: descriptive and inferential statistics. Descriptive statistics
uses measures and graphs to summarize the data with an emphasis on parsimony. The strategy is to find summary measures which describe the data
adequately and succinctly, be they a percentage, average, or a standard deviation. Descriptive statistics also involve describing the relationships between
variables or sets of variables through the use of very sophisticated techniques,
such as correlation, regression, factor analysis, and logistic regression.
Inferential statistics involves many of the same techniques as in descriptive statistics, but it goes a step further. Inferential statistics seeks to make
statements from analysis of a sample or a set of experiments to a larger
population. Almost all research is focused on using inferential statistics.
In inferential statistics, we use some of the same techniques as in descriptive
statistics, but now the focus is also on making estimates, decisions, predictions, or generalizations about a population from a smaller subset or sample.
The sample can be a cross-section of the population at a single point in time, or a sample taken across time or space. Inferential statistics are
a powerful tool for research. They enable us to make statements about a large
group from a much smaller sample. Thus, we can survey a sample of 1,000
people and make statements about 309 million people in the United States,
as is typically done in survey research for elections (in March 2010, the
estimated population of the United States was nearly 309 million people).
Whenever we work with a set of experiments or a sample of data, there is a
chance that the results we see are partly a function of the random fluctuations we expect from sample to sample. In other words, it is possible to get
an unusual sample and the result we observe does not reflect the population values. Statisticians have worked out strategies to know how unusual a
result is given a sampling process (primarily a random process) and a sample size. This is always done within a probabilistic framework with a chance
of being right or wrong in our conclusion. In general, we want the probabilities of making a good decision to be as large as possible. When we deal with
samples, it is never about certainty. There is always a chance of being wrong
in our conclusions with sample estimates.
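The sample-to-sample fluctuation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" population share of 0.52 is invented for illustration, and the function name `simulate_sample_shares` is hypothetical.

```python
import random
import statistics

def simulate_sample_shares(true_share, sample_size, n_samples, seed=0):
    """Draw repeated random samples and return the observed share in each.

    true_share is an assumed population proportion (hypothetical).
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(n_samples):
        # Each sampled person answers "yes" with probability true_share.
        yes = sum(1 for _ in range(sample_size) if rng.random() < true_share)
        shares.append(yes / sample_size)
    return shares

# 500 independent surveys of 1,000 people each.
shares = simulate_sample_shares(true_share=0.52, sample_size=1000, n_samples=500)

print("lowest sample share: ", min(shares))
print("highest sample share:", max(shares))
print("average of the sample shares:", round(statistics.mean(shares), 4))
print("spread of the sample shares: ", round(statistics.stdev(shares), 4))
```

Most samples land close to the assumed population value, but a few are noticeably off. That is the sense in which a conclusion drawn from one sample can be wrong.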
Populations and Samples
A population is the complete set of units involved in the research question.
The units are the members (or elements) of the population. The population
is what you are focused on when conducting a study—it is the group about
which you would like to make conclusions. Even when a sample is used in
research, the sample is expected to represent a population. Depending on
the focus of a study, populations could be:
• People
• Animals
• Cells
• Plants
• Courses
• Geographic places
• Objects
The population should be clearly defined in any research endeavor. The
population is defined by:
1. The purpose of the study - what are we trying to understand and what
questions are we trying to answer?
2. The units and elements involved - what are the basic units that make up
the population?
3. Geographic coverage - what is the particular geographic area of interest,
a county, a state, or the whole country?
4. Time frame - is there a clear delineation of the time frame involved?
Things do change over time and it is important to note the time frame for
the population under consideration.
A census is when we collect data on all elements in a population. Sometimes
it is difficult or impossible to get information on the entire population. An
alternative is to take a sample of the population. A sample is a subset of
the units or elements of a population. To be valid, we want our sample to
represent the population. In other words, we want the characteristics of the
sample to resemble the characteristics of the population so that we can make
generalizations from the sample to the population. Samples are also defined
by the same considerations as the population—the focus of the study, the
units or elements involved, the geographic coverage, and the time frame.
Why Should We Sample?
The major reason we sample is that sampling saves time, money, and other
resources (for example, computation time on a computer). In some cases,
it may actually be impossible to collect information on every element of
the population and sampling becomes a reasonable alternative. Could we
actually count every unemployed person in a nation of 305 million people,
or everyone who is a supporter of the president? Sampling allows us to
collect data for a research project and still have some confidence that the
results represent something of value.
So, the most important reason we sample is that it works. We can design
a study based on a smaller sample and have a very good chance that the
data represent the population. In fact, a well-conducted sample may actually be more accurate in providing estimates than population studies that
attempt to reach every subject in the population but fall well short. Every ten
years we conduct a census of the population in the United States. While
the U.S. Constitution requires that we attempt to count every person via the
census, most of the more interesting data (education, employment, poverty, and marital status) are based on a sample of the population. In fact,
the Census Bureau has argued again and again that it could do a better job
of making estimates of the population with well-designed samples than with
its current attempts to count everyone. The reason is that it is very difficult
to count everyone.
A valuable property of a sample is that it is representative of the population.
By this we mean that the sample characteristics resemble those possessed
by the population. Inferential statistics require a sample to be representative
of the population, and that can be done when the sample is drawn through
a random process. A random sample is when each element or unit has the
same chance of being selected. Classic statistical inference requires that the
sample be selected through a random process.
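As a rough sketch of what "the same chance of being selected" looks like in practice, the snippet below draws a simple random sample with Python's standard `random.sample`, which selects without replacement so that every element is equally likely to be chosen. The population of 500 student IDs is hypothetical.

```python
import random

# Hypothetical population: 500 student IDs.
population = [f"student_{i:03d}" for i in range(1, 501)]

# A seeded generator makes the draw reproducible for this illustration.
rng = random.Random(42)

# Simple random sample of 25 units, drawn without replacement.
sample = rng.sample(population, k=25)

print(len(sample), "units selected")
print(sample[:5])
```

No unit can appear twice, and no unit is favored over another; those are the two properties the definition above relies on.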
Measurement and Levels of Measurement
Measurement is the process of assigning a number or value to variables of
the individual elements of the population (or sample). Measurement is a very
important issue. Some measurement seems relatively straightforward—
distance, weight, dollars spent. However, even straightforward measurements
can come with some error and perhaps even bias.
With other types of variables, the measurement is not so straightforward.
How do we measure intelligence, anger, social networks, support for a policy,
or willingness to pay for a product or service? These measures are more difficult to conceptualize and therefore their measurement is more difficult. Some
of these may require multiple measures to fully assess the concept. Consider
your grade in a course. It is supposed to measure your comprehension of the
material in the course. But even that is debatable—some might argue that
exams and assignments simply measure a person’s ability to memorize the
material, and not the ability to comprehend it. Even so, few would be happy
with a single test for a course as the final determinant of the grade. Most
students prefer multiple tests or measures, averaged out, as a better indication of their performance. In essence, we would argue for multiple indicators
to measure our grade in the course.
With measurement we must also deal with issues of validity (are we measuring
what we think we are measuring) and reliability (is the measuring device
consistent). A user of data is responsible for asking questions and in some
cases doing preliminary analysis to determine if the measures are valid and
consistent. The process of measurement is often complex—do not take it for
granted.
Levels of Measurement is the term we use to reflect that variables can be
measured by numbers or classifications. There are various ways to characterize measurement of variables. An easy dichotomy in measurement is
qualitative versus quantitative data. Qualitative data do not follow a natural numerical scale and thus are classified into categories such as male or
female; customers versus non-customers; and race (white, African American,
Asian, and so forth). Qualitative data are often called categorical data.
Quantitative data use measures that are recorded on a naturally occurring
scale, such as age, income, or time. There is a continuous nature to these
data. In the extreme case, the measurement is continuous and smaller and
smaller increments can be used in making measurements, depending upon
how accurate you need to be. For example, we can measure distance in
yards, feet, inches, or fractions of inches.
A more elaborate description involves three levels of measurement—nominal,
ordinal, and continuous. Statistical programs, such as JMP, often use these
three levels to characterize the data and determine the appropriate statistics.
Nominal data (or categorical) have no implied order or superiority and can
be thought of as qualitative. A middle ground is ordinal data, where there
is an implied order or rank, but the distance between units is not well specified. Rankings, opinion questions that use ordered categories such as strongly
agree to strongly disagree, and variables that use an ordered scale from one
to ten are examples of ordinal data.
Continuous data are the same as quantitative data. These data are measured
on a continuous scale and the distance between measures is better understood. Most, but not all, of the advanced statistical techniques require continuous-level data in order to meet the assumptions of the method and to
extract the most information from the data.
Levels of measurement are not trivial to the use of statistics. Many statistical techniques are predicated on certain levels of measurement of the
variables involved. Some techniques or formulas assume a certain level is
used and applying the technique to the wrong type of variable can lead to
results that are biased or misleading. A software package such as JMP will
change the techniques of an analysis based on level of measure for the key
variables used.
Sources of Statistical Data
There are many ways to think of the type of research studies where statistics
are employed. I will use a basic breakdown that includes observational
studies, experiments, and secondary data. From this perspective, observational studies are any in which the researcher observes or questions participants
but does not structure or manipulate the participants (such as assigning them
to a treatment or control group). Field studies of nature as well as surveys
would fall under observational studies.
In contrast, experiments involve the researcher actively manipulating the
subjects, often into treatment and control groups, as a way to control for
extraneous factors that may influence the outcome of the experiment. An
experimental design, when conducted properly, can have fewer threats to validity and therefore we can have more confidence in drawing conclusions from
the results. However, not all research lends itself to experimental designs,
hence the need for observational studies.
A third type of study uses data from published sources—secondary data,
also known as existing data. In this case someone else collected the data
and made it available to you. Economists often use existing data about
the economy—sales, unemployment, interest rates—to develop statistical
models that forecast the future. Likewise, climatologists use weather data to
develop models and demographers use census data to study such things as
migration. Sources of existing data include:
• Census of Population
• Current Population Survey
• Sports statistics
• Unemployment data
• The stock exchange
A word of caution on studies using existing data: when you use data collected by someone else, most of the data decisions are out of your control.
These are decisions about whether to use a sample, the size of the sample, which data items to collect, at what geographic level the data will be
available, and the time frame when the data will be collected. With existing
data you are often a “data taker” and must settle for the decisions made
by someone else. For example, you might want to analyze monthly data
on unemployment by county, but the Bureau of Labor Statistics only has
monthly data at the state level. At times you will need to compromise your
study objectives in order to use these data. Working with existing data also
will require you to become very familiar with data definitions and data decisions before you use the data.
Critical Thinking with Statistics
I urge you to be a critical thinker when looking at how statistics are used.
When you read about a study, particularly in the news, you should be asking
questions about the study. Statistics involves making critical decisions and
rational thought as to how a set of data is:
• Sampled
• Measured
• Collected
• Analyzed
• Interpreted
If the study or report does not tell you details about these decisions, you
are limited in making a judgment on the validity and worth of the study.
It is important to ask questions and be a critical thinker when it comes to
how people use, or misuse, statistics. Throughout this book I will present
applications and at times challenges for you to look at statistical results.
I urge you to always question the data and results and see if the logic of
the analysis makes sense. I will end Chapter 1 with an example of a critical
look at a measurement issue that all students in college face—grades! You
should have some personal experience with this topic, and I hope a viewpoint. Grades in most U.S. universities tend to be a strange measurement.
A Measurement Example: What Level of Measurement
Are College Grades?
Most U.S. universities have a very curious system of grading for courses
and then ultimately for their system of grade point averaging (GPA). Almost
every course, my courses included, uses a point system that has an absolute
zero and an upper bound. Some professors have a point system that goes
beyond 100. In these systems the total points for the grade could be 200, 300,
or a higher figure. Some requirements get more points, such as an exam,
and some are worth less. The final grade is based on a percentage of the total
points that each student earns, which converts their total points back to a
0 to 100 system. For example, if the course had 250 total points and a student
earned 205 points, her grade would be:
Percentage = (205 / 250) × 100 = 82.0%
I use a different system, where each exam, assignment, or quiz is weighted
to yield a final score of 100. For example, an exam might be worth 15 points
toward the final 100 points for the course. A student who receives an 85 on
the exam gets 85 × 0.15 = 12.75 points toward her final grade. In this way the
grade is converted to a scale of 0 to 100, with exams, assignments, and other
requirements weighted differently toward the total.
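Both grading strategies reduce to simple arithmetic, sketched below in Python. The function names and the list of course components are illustrative assumptions; only the 205-out-of-250 example and the exam worth 15 points with a score of 85 come from the text.

```python
def percentage_grade(points_earned, total_points):
    """Total-points system: convert earned points to a 0 to 100 scale."""
    return points_earned * 100 / total_points

def weighted_final_grade(components):
    """Weighted system: each component is (score out of 100, weight).

    Weights are points toward the final 100 and must sum to 100.
    """
    total_weight = sum(weight for _, weight in components)
    if abs(total_weight - 100) > 1e-9:
        raise ValueError("weights must sum to 100")
    return sum(score * weight / 100 for score, weight in components)

# Example from the text: 205 of 250 points.
print(percentage_grade(205, 250))

# Hypothetical course: the first entry is the exam from the text
# (score 85, worth 15 points); the other components are invented.
components = [(85, 15), (90, 25), (78, 30), (92, 30)]
print(weighted_final_grade(components))
```

Either way the result is a number on a 0 to 100 scale, which is why the grade can be treated as continuous at this stage.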
In either strategy, the grade can be thought of as a continuous level of measurement. However, most universities rely on a grading system with letter
grades: A, B, C, D, and F for failure. Some universities include a plus and
minus, allowing for a wider range of grades, such as A, A-, B+, B, B-, C+, C,
C-, D+, D, D-, and F. A letter grade system is clearly ordinal. Once we convert
from the numerical system to letters, even if we use pluses and minuses, a
lot of information is lost. A grade of an A in my class can be based on a percentage of 93.1 or 99; the letter grade makes no distinction between the
two students. Both will receive an A for the course. In fact, some students realize that their letter grade
will not change whether they turn in a final assignment or not, and forgo the
assignment.
This is the way I make the conversion from a continuous measure with a
theoretical distribution of 0 to 100 to a letter grade.
Letter grade    Percentage range
A               93 to 100
A-              90 to 92.9
B+              87 to 89.9
B               83 to 86.9
B-              80 to 82.9
C+              77 to 79.9
C               73 to 76.9
C-              70 to 72.9
D+              67 to 69.9
D               63 to 66.9
D-              60 to 62.9
F               Below 60
I would note that college professors have a tremendous amount of latitude
as to the cut-off points for letter grades or pluses and minuses. One professor
may use 60 as the cut-off for passing while another uses 65. Some professors
curve the grades and a final grade of 40 might be passing. The variation from
course to course can be enormous.
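The conversion from a 0 to 100 score to a letter grade, using the cut-offs listed above, can be sketched as a simple lookup. This is one possible implementation, not a prescribed one; it assumes each band starts exactly at its lower bound (so a score such as 92.95 falls in the A- band), and the name `letter_grade` is hypothetical.

```python
# Lower bound of each band, from highest to lowest, per the cut-offs above.
CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]

def letter_grade(score):
    """Map a continuous 0-100 score to an ordinal letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"

# Two quite different scores collapse into the same letter.
print(letter_grade(93.1), letter_grade(99))
print(letter_grade(82.0))
```

A professor with different cut-offs would only need to change the `CUTOFFS` list, which shows how much the letter depends on choices made by the instructor.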
At this point we have moved from a continuous grade to an ordinal letter grade,
which shows up on the transcript. However, universities do not stop there.
Most use a grade point average (GPA) system which converts the letter grade
back to a point system, weighted by the number of credits for the grade. At
the University of Delaware we refer to these as quality points per credit. The
University of Delaware uses the following system to convert grades. I am
including the entire list just to show you how complicated grading can be
with Pass/Fail options, incompletes, and listeners, to name a few.
Grade   Description   Quality points per credit
A       Excellent     4.00
A-                    3.67
B+                    3.33
B       Good          3.00
B-                    2.67
C+                    2.33
C       Fair          2.00
C-                    1.67
D+                    1.33
D       Poor          1.00
D-                    0.67
F       Failure       0.00
X - Failure, 0.00 quality points per credit (Academic Dishonesty)
Z - Failure, 0.00 quality points per credit (Unofficial Withdrawal)
L - Listener (Audit), Registration without credit or grade. Class attendance is
required, but class participation is not.
LW - Listener Withdrawn, A listener who does not attend sufficient class
meetings to be eligible, in the judgment of the instructor, for the grade of L
will receive the grade LW.
NR - No grade required.
P – Passing, For specifically authorized courses. P grades are not calculated in
indexes. (For further explanation, see Pass/Fail grade option section.)
W - Official Withdrawal, Passing at time of withdrawal.
The following temporary grades are used:
I – Incomplete, For incomplete assignments, absences from the final or
other examinations, or any other course work not completed by the end
of the semester.
S - Satisfactory progress, For thesis, research, dissertation, independent
study, special problems, distance learning and other courses which span
two semesters or in which assignments extend beyond the grading deadline in a given semester.
U - Unsatisfactory progress, For thesis, research, dissertation, independent study, special problems, distance learning and other courses which
span two semesters or in which assignments extend beyond the grading
deadline in a given semester.
Temporary grades of S and U are recorded for work in progress pending
completion of the project(s). Final grades are reported only at the end of the
semester in which the work was completed.
N - No grade reported by instructor.
So, a student who gets an A- in my class at the University of Delaware would
get 3.67 quality points per credit multiplied by 3 credits, which equals 11.01 quality
points toward their overall GPA. The overall GPA is the total number of quality points divided by the number of credits, which brings us back to a number
between zero and four. In this system, credits that are Pass/Fail or granted
from a test or another institution are not counted in the GPA, although they
do count toward graduation. As a result, the ordinal system of letter grades is
now converted back to something that looks like a continuous variable. The
final GPA for a student often uses two or three decimal places to distinguish
one student from another.
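The GPA arithmetic can be sketched as follows, using the quality points per credit listed earlier. The transcript is hypothetical, and the function name `gpa` is an assumption. Grades without quality points, such as P, are simply left out of the calculation, as the text describes. For reference, an A- in a 3-credit course contributes 3.67 × 3 = 11.01 quality points.

```python
# Quality points per credit, from the University of Delaware list above.
QUALITY_POINTS = {
    "A": 4.00, "A-": 3.67,
    "B+": 3.33, "B": 3.00, "B-": 2.67,
    "C+": 2.33, "C": 2.00, "C-": 1.67,
    "D+": 1.33, "D": 1.00, "D-": 0.67,
    "F": 0.00, "X": 0.00, "Z": 0.00,
}

def gpa(courses):
    """courses is a list of (letter grade, credits).

    Grades with no quality points (for example P) are not counted.
    Returns None when no credits count toward the GPA.
    """
    counted = [(QUALITY_POINTS[g], c) for g, c in courses if g in QUALITY_POINTS]
    total_credits = sum(credits for _, credits in counted)
    if total_credits == 0:
        return None
    total_quality = sum(points * credits for points, credits in counted)
    return total_quality / total_credits

# Hypothetical transcript: the P course counts toward graduation
# but is excluded from the GPA.
transcript = [("A-", 3), ("B", 4), ("P", 3), ("C+", 3)]
print(round(gpa(transcript), 3))
```

The output is again a number between zero and four with decimal places, even though every input was an ordinal letter.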
One might argue that it would be better to leave the number grades alone and
calculate a GPA based on a 0 to 100 numerical system. Some high schools
and some universities do that, which seems to make more sense to me from
a statistical point of view. Each time we convert to a letter grade we lose
information, even with a system that uses pluses and minuses. The same is
true when we collapse a continuous variable into categories, such as age or
income, that are converted into range categories. In such a coding system,
instead of each subject having their age in years, they are given a category,
such as 18 to 25, 26 to 34, and so forth. Whenever we collapse a variable into
ordinal categories, some information is lost in the process.
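The loss of information from collapsing a variable can be seen in a few lines of Python. The first two age bands come from the text; the remaining bands, the example ages, and the name `age_category` are hypothetical choices for illustration.

```python
# "18 to 25" and "26 to 34" are from the text; later bands are assumed.
AGE_BANDS = [(18, 25), (26, 34), (35, 44), (45, 54), (55, 64)]

def age_category(age):
    """Collapse an age in years into an ordinal range category."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low} to {high}"
    return "other"

ages = [19, 24, 27, 33, 41]
categories = [age_category(age) for age in ages]

print(categories)
print(len(set(ages)), "distinct ages become",
      len(set(categories)), "distinct categories")
```

Once only the category is stored, a 19-year-old and a 24-year-old can no longer be told apart, and the original ages cannot be recovered.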
The discussion of measuring grades is actually more complicated than simply the conversion from a continuous to ordinal level of measurement. Every
student knows that some courses are easier than others, so the meaning of
a grade is different from course to course. While I can argue that comparing
one student’s grade to another in a class is relatively straightforward, the
same could not be said for comparing a grade in a course in Statistics to
a grade in English. It is also complicated to determine how easy or hard a
course is. It is a function of the background, experiences, and natural ability
of the student in the subject matter, combined with the level of the course
and the demands and grading philosophy of the instructor. There is no
easy answer as to which course is easier or more difficult, but every student
knows that some courses are easier than others, at least for them.
I use this example to point out several things. First, measurement is a complicated process and should not be taken lightly. Even seemingly simple
measurements such as a grade for a course are far more complicated than
we might first realize. Second, the level of measurement does matter and
information is lost or gained by the level of measurement. In the grade
example we went from continuous to ordinal and then back to continuous.
That may be one of the strangest measurement processes, but it is not
unusual to go from continuous to ordinal. For example, surveys often do
this when asking the subject’s age. And third, making comparisons across
subjects on a particular measurement could be more difficult than we
first realize. It may not be fair to compare grades across different subjects
because the demands and expectations of the courses can be very different.
The meaning of an A in History might be different from the meaning of an
A in Physics. To be clear, I am not arguing that one subject is necessarily
easier than another, but I am saying that it might not be realistic to think the
grade has the same meaning across subject matters without some understanding of the level of the course and the demands made on the students
by the instructor.
Additional Problems
1. There are many excellent sources of statistical information on the Internet. I list a few sites that contain interesting statistics on life in the United States and around the world. You might know of or find some other
sources in your discipline.
The U.S. Bureau of the Census http://www.census.gov/
The home of a major data collector—the U.S. Bureau of the Census
American Community Survey http://www.census.gov/acs/www/
A revolving survey conducted by the U.S. Census Bureau, used to make
estimates of the population for cities, counties, states, and the nation
between each 10-year census
Statistical Abstract of the United States http://www.census.gov/compendia/statab/
An annual publication of the U.S. Bureau of the Census containing facts
and figures on a range of topics from the federal budget to education
expenditures to births, deaths, marriages, and divorces (all the data can
be downloaded to a spreadsheet)
Current Population Survey http://www.census.gov/cps/
A large monthly survey of U.S. households that estimates issues of the
labor force (unemployment, hours worked, occupations), basic demographic information (age, sex, marital status), and other social issues
that affect households
U.S. Dept. of Agriculture, Data and Statistics http://www.usda.gov/wps/portal/usda/usdahome?navid=DATA_STATISTICS
Sources of data on U.S. agriculture across three subagencies: Economic
Research Service, Foreign Agricultural Service, and National Agricultural
Statistics Service (NASS), which conducts the Census of Agriculture
DATA.Gov http://www.data.gov/home
A one-stop place for a wide range of data collected by federal agencies
Bureau of Economic Analysis http://www.bea.gov/
An agency focused on measures of the U.S. economy
Digest of Education Statistics http://nces.ed.gov/programs/digest/
Statistical information covering American education from grade school
through higher education
National Center for Health Statistics http://www.cdc.gov/nchs/
The Centers for Disease Control and Prevention (CDC) site for health
statistics in the United States and around the world
a. Select a site from the list above (or one you have found) and explore it.
Briefly describe the data available and in what format (tables, pdf files,
downloadable data) they are available.
b. For one data source, select one variable and look up its definition. Explain how the data are collected (survey, model, full census count, or
other means) and the details of the definition. For example, I looked up
the 2007 Census of Agriculture and learned that it primarily collects its
data through a mail survey of farms, followed up by telephone, Internet, and personal enumeration (face-to-face). The agency seeks to get
a full count of all farms, but recognizes its mailing list is incomplete
and its response rate is 85.2%. The agency uses statistical means to
adjust for missing data and missing operations. I focused on the definition of farm. I was surprised to learn it was defined by relatively small
sales: “an operation that produces, or would normally produce and
sell, $1,000 or more of agricultural products per year.” This search took
about 20 minutes to complete.
2. Body Mass Index (BMI) is often used in health care discussions about
weight and obesity; however, the measure is not without controversy.
Read the following discussion and search the Internet for alternative
viewpoints. Briefly summarize your own feelings about this measure, including the pros and cons of the current measure and whether you think
it is a valid measure of obesity (write 2–3 paragraphs).
Body Mass Index (BMI)
The Body Mass Index has been around a long time. According to Jeremy Singer-Vine
of Slate magazine, the BMI was first developed by Adolphe Quetelet, a Belgian mathematician who was trying to develop ideas about the “normal” person’s dimensions in
1832. In his work, he suggested that a person’s weight varied in proportion to a person’s height squared. He developed the following formula to express this relationship.
Metric Formula: Weight (kg) / [Height (m)]^2
Example: Weight = 68 kg, Height = 165 cm (1.65 m)
Calculation: 68 ÷ (1.65)^2 = 24.98
Nonmetric Formula: Weight (lbs) / [Height (in)]^2 × 703
Example: Weight = 150 lbs, Height = 5’5” (65”)
Calculation: [150 ÷ (65)^2] × 703 = 24.96
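The two formulas above can be checked with a few lines of code. This is a minimal sketch in Python; the function names bmi_metric and bmi_imperial are my own labels for illustration and do not come from the NIH or from Quetelet.

```python
def bmi_metric(weight_kg, height_m):
    """Weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lb, height_in):
    """Weight in pounds divided by height in inches squared, times 703."""
    return weight_lb / height_in ** 2 * 703


# The two worked examples from the text
metric = bmi_metric(68, 1.65)
imperial = bmi_imperial(150, 65)

print(round(metric, 2))    # 24.98
print(round(imperial, 2))  # 24.96
```

The two results differ slightly because 150 lbs and 65 inches are only approximately equal to 68 kg and 1.65 m.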
About 140 years later, the measure caught on in the medical community. Ancel Keys
published a paper in 1972 that used Quetelet’s formula as the best predictor of
body fat percentage. He coined the term Body Mass Index (BMI). Keys felt it was a good
predictor of the percentage of body fat, but only in a general way, and it should not
be used as an individual predictor of body fat.
The advantage of the BMI was that it was easy to calculate, only requiring two fast and
inexpensive measurements from a subject. Other methods of calculating percentage
of body fat require more elaborate data collection strategies and are more invasive.
Because of the ease of use, BMI has become a dominant body fat measurement and
the leading obesity indicator in the United States since Keys’ article, despite his warnings to the contrary. According to Keys, the BMI was never intended as a personal
measure of obesity, and it should not be used to diagnose or treat an individual patient. So BMI has become a widely used measure of obesity as well as a controversial
measure.
Here is the BMI website of the National Institutes of Health (NIH http://www.nhlbisupport.com/bmi/). NIH allows you to calculate your own BMI with an online
calculator (sort of a self-diagnosis).
The NIH uses the following BMI categories:
Underweight = < 18.5
Normal weight = 18.5–24.9
Overweight = 25–29.9
Obese = ≥ 30
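As a rough sketch, the categories above can be written as a small Python function. The function name bmi_category is hypothetical. The published ranges leave a small gap between 24.9 and 25 and between 29.9 and 30; here I assume that any value below 25 counts as normal weight and any value below 30 counts as overweight.

```python
def bmi_category(bmi):
    """Return the BMI category label for a BMI value.

    Assumption: values in the gaps of the published ranges
    (e.g., 24.95) fall in the lower category.
    """
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal weight"
    if bmi < 30:
        return "Overweight"
    return "Obese"


# The metric example from the discussion (68 kg, 1.65 m) gives a BMI of about 24.98
print(bmi_category(24.98))  # Normal weight
```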
According to the NIH, the limitations of BMI are expressed in this caution: “BMI is a reliable indicator of total body fat, which is related to the risk of disease and death. The
score is valid for both men and women but it does have some limits. The limits are:
• “It may overestimate body fat in athletes and others who have a muscular build.
• “It may underestimate body fat in older persons and others who have lost muscle
mass.”
According to Singer-Vine, there is a growing controversy over the reliance on the BMI
as an obesity measurement. He indicates, “Faulty readings could promote a negative
self-image among healthy people and lead them to pursue unnecessary diets.” Almost anyone can now go online and find a BMI calculator and see their own personal
BMI. Singer-Vine feels there is some danger in that.
BMI is what statisticians would call an indicator variable. An indicator variable seeks to measure something complex in an easier, cheaper,
and still meaningful way. There are other ways to measure body fat, but they are
more costly and invasive (e.g., you have to get into a body of water or be pinched
or touched). With the BMI, you only need a person’s height and weight. An indicator
variable should be highly correlated (for now, think of correlated as related) with a
more accurate measure to be considered valid. Thus measures of total body fat and
BMI should agree across a wide sample of subjects. However, it is only an indicator
and not the true measure of body fat. It is based on a model and not a direct measure.
Sources:
National Institutes of Health. Retrieved from http://www.nhlbisupport.com/bmi/
Singer-Vine, J. (2009, July 20). Beyond BMI: Why doctors won’t stop using an outdated measure for obesity.
Slate magazine.
3. Each year the Social Security Administration (SSA) issues a press release
on the most popular baby names. This list is based on the names sent to
the SSA when applying for a Social Security number. The website can
be found here: http://www.ssa.gov/OACT/babynames/#ht=1 (or search for
most popular baby names).
a. Go to the website. Toward the bottom of the page, choose the option
for Popularity of a Name. Enter your name and search for 100 years.
The website returns the rank of your name for each of the past 100
years. Years are omitted if your name is not within the top 1,000 names
of that year.
b. The Web table can be easily copied and pasted into Excel. Grab the
two columns (year and rank) and copy them. Then open Excel and
paste in the results. It should go smoothly.
c. Use Excel to create a chart of the rank of your name over the past
100 years. In Excel insert a graph, choose Scatterplot, and name the
column with the Y data (for this problem, it is the rank). For the rest
of the graph, you are on your own to add a connecting line, title, and
subtitles. Explore Excel help and the Internet to find a way forward. I
have included a graph of my name rank over the past 100 years (it is
becoming less common after being at or near the top 10 until the early
1960s).
Measures of Central Tendency
Most statistics, whether they refer to a single variable or a complex model involving many variables, are designed to help us describe things in a more simple manner. We seek summary statistics to describe our data. A useful concept when summarizing data is to find some way to measure the center of the data. It has been referred to as the typical value, the average, or the center. The central tendency of a variable is the tendency of the data to cluster or center about certain numerical values. Central tendency is in contrast to another concept which will be discussed in the next chapter, variability or the spread of the data. For central tendency we will focus on the mean, the mode, and the median.
chapter
3
The Mean
The arithmetic mean or mean is the sum of the measurements divided by the number of measurements contained in the data set. As the symbol for the mean of a sample we use x with a bar over it, $\bar{x}$. For a population, we use the Greek letter μ (mu).
The formula for the mean is given below, represented in two different ways. Both formulas would yield the same result. The first formula is the more familiar one and is the one I recommend you use to calculate the mean. The second formula presents the mean like an expectation in probability, and thus connects the mean to probability theory coming in a future chapter.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The sum of all the values, divided by the number of values
$$\bar{x} = \sum_{i=1}^{n} \left( x_i / n \right)$$
The sum of values weighted by the number of values
The first formula is the more familiar formula and reflects that the mean
is the average observation. The second formula yields the same result and
emphasizes the mean is a weighted summation with the weights being the
probability of each observation in the data set (i.e., 1/n) and as such is an
expectation of a probability distribution.
Let us look at a small data example to see how the mean is calculated and to compare the two formulas. This data set has only ten observations for a variable, x. The values for x are in the second column and include values 21.0, 22.0, 23.0, and so forth. The sum of these values, given at the bottom of the table, is the addition of each value and equals 255. The mean is calculated as 255/10 = 25.5.
table 3.1
A Small Data Example for Measures of Central Tendency
OBS     X       X/n
1       21.0    2.1
2       22.0    2.2
3       23.0    2.3
4       24.0    2.4
5       25.0    2.5
6       26.0    2.6
7       27.0    2.7
8       28.0    2.8
9       29.0    2.9
10      30.0    3.0
n       10      10
Sum     255     25.5
Mean    25.5
The third column contains each value divided by the sample size, which is 10 in this example. Looking at the bottom of this column, the sum of each value weighted by the sample size (i.e., 1/10) is 25.5. This value is the mean. So whether we take the sum of each value and divide by the sample size, or we sum each value divided by the sample size, we generate the same value for the mean.
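The comparison of the two formulas can be sketched in a few lines of Python using the ten values from Table 3.1. The variable names are my own, and math.isclose is used for the second formula because dividing each value by n before adding can introduce tiny floating-point rounding differences.

```python
import math

# The ten observations of x from Table 3.1
x = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0]
n = len(x)

# First formula: sum all the values, then divide by n
mean_sum_then_divide = sum(x) / n

# Second formula: weight each value by 1/n, then sum
mean_weighted_sum = sum(xi / n for xi in x)

print(mean_sum_then_divide)                                   # 25.5
print(math.isclose(mean_sum_then_divide, mean_weighted_sum))  # True
```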
As a measure of central tendency the mean has several advantages over other measures of central tendency, and a few negative attributes. The first advantage is that the mean uses information from all the values in a variable: all the values of the variable are added together and divided by the sample size.
A second important property of the mean is that it has inferential properties that are known and from which we can draw conclusions from our sample. Furthermore, these inferential properties are relatively simple and straightforward. By this we mean that we can make inferences from a sample to a population for the mean. Some other descriptive statistics of central tendency do not have inferential properties, but the mean does. We will learn more about the inferential properties of the mean in future modules.
A third property of the mean is that it forms the basis for a number of other
statistics known as product moment statistics, which include the variance,
correlation, and regression coefficients. Thus the mean finds its way into
many formulas used in an introductory statistics class, some of which are
simple and others quite complex. The mean is an essential building block
for statistics.
Finally, a fourth property that is a disadvantage of the mean as a measure
of central tendency is that it is sensitive to outliers and extremes in the data.
The mean is “pulled” toward extreme values in the data and it is not as
“resistant” to these values as are other measures of central tendency. If there
are extremely high values in the data set relative to the other values, the
mean will get larger than if they were not there. Likewise, the mean will be
pulled toward extreme low values in the data set.
The mean has two mathematical properties that are important in statistics. The first is that the sum of the deviations about the mean equals zero. This means that if we took each value in a variable, subtracted the mean from each one, and added the results of the subtractions, the total would equal zero. If you think about this it might make sense why this is so: the mean is the middle of the distribution and all the values are centered around it.
The second property of the mean is that the sum of squared deviations about the mean is a minimum. The latter property is called the least squares property. By this we mean if we take each value in the variable, subtract the mean from that value, and then square the result, and finally add all these calculations, we would have a value that represents the squared deviations around the mean. This second property tells us that the sum of squared deviations around the mean is smaller than around any other value. For example, if we calculated the sum of squared deviations around the median it would be larger than that calculated for the mean (unless the median happens to equal the mean). The latter property is exploited when looking at the spread of the data for the variance and for more sophisticated data analysis techniques such as regression.
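Both properties are easy to check numerically. The sketch below uses a small made-up data set (the values 2, 4, 6, 8, and 30 are hypothetical and chosen only so that the mean and median differ), and the helper name sum_sq_dev is my own.

```python
import statistics

# Hypothetical data with one high value so the mean and median differ
x = [2, 4, 6, 8, 30]
mean = sum(x) / len(x)         # 50 / 5
median = statistics.median(x)  # middle value of the sorted data


def sum_sq_dev(values, center):
    """Sum of squared deviations of the values around a chosen center."""
    return sum((v - center) ** 2 for v in values)


# Property 1: deviations about the mean add up to zero
sum_of_deviations = sum(v - mean for v in x)

# Property 2: squared deviations are smallest around the mean
ss_mean = sum_sq_dev(x, mean)
ss_median = sum_sq_dev(x, median)

print(sum_of_deviations)
print(ss_mean, ss_median)
```

For these values the deviations sum to zero, and the sum of squared deviations is 520 around the mean (10) versus 600 around the median (6). Trying any other center in place of the mean gives a value of at least 520.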
The Median
The median is the middle value when the measurements are arranged in ascending order. It is a positional measure because it is based on the middle case in a variable. In order to find the median value, we first must sort the data in ascending or descending order, find the position of the middle value, and then read that value (see below for more details).
The median is an intuitive measure of central tendency: the value at the middle of the ordered data. However, the median is actually difficult to compute because it requires you to sort the data. Fortunately, spreadsheets and statistical software packages will now calculate the median for us.
The median has very limited inferential properties, so it is not used when making inferences from a sample or in hypothesis testing. Nonetheless, the median is often used in skewed data because it is not as sensitive to outliers. The median is often the preferred measure of the center in data with extreme values, such as income.
The median is one of many positional statistics. By position we mean that they are based on their order in the data. You first must find the position, and then read the value associated with the position. Other order statistics include percentiles (e.g., 90th percentile), deciles (the 10th, 20th, and so forth positions), and quartiles (the 25th, 50th, and 75th percentiles).
Quartiles are used in box plots and in constructing the inter-quartile range (IQR).
The first quartile is referred to as Q1 and is the 25th percentile. The second
quartile is the 50th percentile and is the same as the median. The third quartile is the 75th percentile. We will be working with quartiles as a useful way
to describe the range of the middle 50 percent of the values in a variable. As
such the IQR is used to describe the spread of the data (see Chapter 4) and is
used to form the whiskers in a box plot (see Chapter 2).
Steps to Calculate the Median
An order statistic requires the data to be sorted from lowest to highest. The next step is to find the position that is of interest, the ith value in the data set that marks a certain position, such as the middle, a quartile (25th percentile), or a decile (a 10th percentile). For the median, we are looking for the middle position or the 50th percentile.
Here are the steps to find the median.
1. Sort the data
2. Do a count of the number of observations (denoted as n)
3. If the count is odd, the median is the (n + 1)/2 position. For example, if n = 63, the median position is the (63 + 1)/2 = 32nd position in the ordered data. Count to the 32nd position in the sorted data and read the value that is there. The median is the value of the 32nd position
4. If the count is even, there is not an exact middle. So, we have to take the average of the middle two positions and call this the median. The middle two positions are the n/2 position and the n/2 + 1 position. For example, if the count is 64, we need to find the values at the 32nd and 33rd positions in an ordered data set (ordered from lowest to highest), and take the average of these two values.
For the data in Table 3.1, the sample size n is equal to 10. Since it is even, we are looking for the fifth and sixth values in an ordered list of the data. Since the data are already ordered from lowest to highest, the median is the average of 25 and 26.
Calculations for the median
• n = 10, so use the 10/2 = 5th and the 10/2 + 1 = 6th positions
• Median = (25 + 26)/2 = 25.5
With the median, students often confuse the median position with the median value. Remember, first we locate the median position(s) in the ordered and sorted data, and then we identify the median value. When the number of observations is even, we take the average of the values at the two middle positions.
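The steps above can be sketched as a short Python function. The name median_by_position is hypothetical. Note that the positions in the text count from 1, while Python lists count from 0, so the code subtracts 1 when reading a value.

```python
def median_by_position(values):
    """Find the median by locating the middle position(s) in sorted data."""
    ordered = sorted(values)           # Step 1: sort the data
    n = len(ordered)                   # Step 2: count the observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:                     # Step 3: odd count, one middle position
        position = (n + 1) // 2
        return ordered[position - 1]
    lower = n // 2                     # Step 4: even count, two middle positions
    upper = n // 2 + 1
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


# The ten values of x from Table 3.1: positions 5 and 6 hold 25 and 26
table_3_1 = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0]
print(median_by_position(table_3_1))  # 25.5
```

In practice a built-in function such as statistics.median in Python gives the same result; the point of the sketch is only to make the position and the value visibly separate steps.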
Mode
The mode is the most frequently occurring value in a variable. While this is
an intuitive concept of the center or a typical value for most of us, the mode
has its limitations. The most frequent value may not be anywhere near other
measures of center. And, in a continuously measured variable, there may not
be a single most frequently occurring value, leaving the mode undefined.
As a result, the mode is viewed as less useful than the mean or median.
However, the mode can provide some insights as to the most common value
in a variable and the shape of a distribution. In some cases there are multiple
“modes,” referred to as bi-modal or tri-modal distributions.
Multiple modes or groupings around a value may reflect different subgroups
within a variable. Figure 3.1 below shows a bi-modal distribution with a histogram of weight of a group of 249 subjects. The distribution shows two
distinct peaks, one around 120 and another around 150. The two “modes” reflect
the center of weight for females and for males.
figure 3.1
Histogram of Subject Weight
In continuous level data, there may not be any single value that is the most frequent, and thus technically the mode is undefined. The mode may make more sense in reference to qualitative data. With a qualitative variable, we refer to the modal class or category which represents the category with the most responses.
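As a small illustration, the Python sketch below counts how often each value occurs and returns the most frequent one(s). The function name modes and the example data are hypothetical. I assume that when every value occurs only once the mode is treated as undefined, which the function signals by returning an empty list.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once, so the mode is undefined
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))               # [2]
print(modes([1, 1, 2, 2, 3]))            # [1, 2]
print(modes([21.0, 22.0, 23.0]))         # []
print(modes(["A", "B", "B", "C", "B"]))  # ['B']
```

The second call shows a bi-modal result, and the last call shows the modal category of a qualitative variable.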
Outliers and the Measures of Central Tendency
Outliers can have a dramatic effect on some measures of central tendency, and a minimal effect on others. In fact, we may choose one measure of central tendency over another based on the amount of spread in the data. For example, the median is often used when referring to house values or income because there is so much spread in the data.
Let us look at an example of how the spread of the data can influence the measures of central tendency. In this example we are primarily looking at the difference between the mean and the median and how they are influenced by extreme values in the data. The data we will use for this is the marriage rate for the 50 states and the District of Columbia (n = 51). The marriage rate is calculated as the number of marriages divided by the population in the state, expressed per 1,000 people. In the United States the marriage rate in 2005 was 7.6 marriages per 1,000 population. However, this rate varied by state. The rates for each state and the District of Columbia are presented in Table 3.2.
The rates are sorted from lowest to highest, and it is easy to see that the rate for Nevada is much higher at 57.9 per 1,000 people. That is because many people travel to Nevada, specifically Las Vegas, to get married. Thus the number of marriages in Nevada reflects many people from other states rather than just people in Nevada. The next highest rate is also a state known for marriages, Hawaii. After these two states, the rates drop dramatically.
table 3.2
Marriage Rates by State, 2005
State                   Marriage Rate
District of Columbia    4.0
New Jersey              5.7
Mississippi             5.8
Pennsylvania            5.8
Illinois                5.8
Connecticut             5.9
Delaware                6.0
Minnesota               6.0
Michigan                6.0
Wisconsin               6.1
Massachusetts           6.1
California              6.4
Arizona                 6.4
Ohio                    6.5
Washington              6.5
New Mexico              6.7
New York                6.8
Kansas                  6.9
North Dakota            6.9
Maryland                6.9
Georgia                 6.9
Iowa                    6.9
Indiana                 7.0
Nebraska                7.0
Missouri                7.0
Rhode Island            7.1
New Hampshire           7.3
Oklahoma                7.3
Oregon                  7.3
North Carolina          7.3
West Virginia           7.4
Montana                 7.4
Colorado                7.5
Texas                   7.7
Louisiana               8.1
Alaska                  8.1
Virginia                8.2
Maine                   8.3
South Carolina          8.3
South Dakota            8.4
Kentucky                8.7
Florida                 8.9
Vermont                 8.9
Alabama                 9.2
Wyoming                 9.5
Utah                    9.6
Idaho                   10.5
Tennessee               10.9
Arkansas                12.9
Hawaii                  23.1
Nevada                  57.9
Source: 2010 Statistical Abstract of the United States
Table 3.3 shows the measures of central tendency for the data in Table 3.2 including and not including Nevada, which is the largest outlier. This will enable us to see what happens to the mean and the median when we remove an extreme data point. The sum of the values including all 51 observations is 444.04, and with the 51 observations the mean is 444.04/51 = 8.71. Note this is slightly different from the overall U.S. rate since each state is weighted the same in this calculation. The median value is 7.05.
However, when we remove Nevada from the data, the sum drops to 386.10 and the number of observations decreases to 50. Now the mean is 386.10/50 = 7.72. This does not seem like a huge drop, but it represents a (8.71 - 7.72)/8.71 = .1137 or 11.4% decrease due to one extreme value. In contrast, the median hardly changes, with a new median of 7.04. As noted earlier, the mean is much more sensitive to extreme values in the data when compared to the median. That is why the median is more often used when the data have extreme values or are skewed (see the next section for more on what we mean by “skew”).
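The arithmetic in this comparison can be reproduced with a short Python sketch. It starts from the sums, counts, and medians reported in the text rather than from the rounded rates in Table 3.2, and it rounds each mean to two decimal places before computing the percentage change, as the text does. The variable names are my own.

```python
# Sums, counts, and medians reported in the text
sum_with, n_with = 444.04, 51        # all states plus the District of Columbia
sum_without, n_without = 386.10, 50  # Nevada removed
median_with, median_without = 7.05, 7.04

# Means, rounded to two decimals as in the text
mean_with = round(sum_with / n_with, 2)
mean_without = round(sum_without / n_without, 2)

# Relative decrease from removing one extreme value
mean_drop = (mean_with - mean_without) / mean_with
median_drop = (median_with - median_without) / median_with

print(mean_with, mean_without)  # 8.71 7.72
print(round(mean_drop, 4))      # 0.1137
print(round(median_drop, 4))    # 0.0014
```

The mean falls by about 11.4 percent while the median falls by about 0.1 percent, which is the contrast the example is meant to show.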
Comparing the Mean, Median, and Mode
If we have a variable with a distribution that reflects a symmetrical, bell shaped curve, the mean, median, and mode would be very similar to one another. The normal distribution is a very special bell shaped curve where the mean, median, and mode are equal to each other by definition. The symmetrical, bell shaped curve is important in statistics because it allows us to make inferences about distributions of variables.
The distribution of a variable can give us insight to the measures of center as well as the spread of the data. The spread will be covered in Chapter 4, so we focus on the distribution and measures of center. The mean tends to be pulled by extreme values in the data, so whenever the mean is larger than the median, we tend to have extreme high values in the data. How far the mean is from the median reflects the extent of the outliers. Similarly, when the mean is less than the median, the mean is being pulled by extreme low values in the data.
This concept is captured in the skew of the data. The skew of the data reflects a tail in the distribution pulled by extreme values, either high or low. If the data are skewed to the right, there are a few extreme high values. In this case, the mean is greater than the median because the mean is being pulled by the extreme values. If the data are skewed to the left, there are a few extreme low values and the mean is less than the median. Simply comparing the mean to the median can give us a sense of the presence of extreme values or outliers, and in which direction we can expect the skew.
table 3.3
Measures of Central Tendency for the Marriage Rate Data
            With Nevada    Without Nevada
Sum         444.04         386.10
Count       51.00          50.00
Mean        8.71           7.72
Median      7.05           7.04
Mode        #N/A           #N/A
A histogram or a stem and leaf plot is an excellent way to look at skew in a
variable. Figure 3.2 shows examples of skewed left, symmetrical, and skewed right distributions. The graphs provide a visual of the
distribution of the variable and the notion of skew. The mean and median
values are included for each distribution so you can see how the mean is
pulled by outliers.
figure 3.2
Skewed Left, Symmetrical, and Skewed Right Distributions
Skewed Left Distribution: The Mean is less than the Median (Mean = 22.35, Median = 25.00)
Symmetrical Distribution: The Mean is equal to the Median (Mean = 24.97, Median = 25.00)
Skewed Right Distribution: The Mean is greater than the Median (Mean = 26.85, Median = 25.00)
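The comparison of the mean to the median can be written as a simple rule of thumb. The Python sketch below is only an illustration: the function name skew_direction and the tolerance of 0.05 are my own assumptions, not a standard test for skew. The tolerance is there because in real data a symmetrical distribution rarely has a mean exactly equal to its median.

```python
def skew_direction(mean, median, tolerance=0.05):
    """Suggest the direction of skew by comparing the mean to the median.

    tolerance is a hypothetical cutoff for treating the two as equal.
    """
    difference = mean - median
    if abs(difference) <= tolerance:
        return "roughly symmetrical"
    if difference > 0:
        return "skewed right"
    return "skewed left"


# Mean and median pairs from the three distributions in Figure 3.2
print(skew_direction(22.35, 25.00))  # skewed left
print(skew_direction(24.97, 25.00))  # roughly symmetrical
print(skew_direction(26.85, 25.00))  # skewed right
```

A rule like this is a quick first check; a histogram or stem and leaf plot is still the better way to judge the shape of a distribution.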
Summary
Measures of Central Tendency are a central concept to statistics. They give
us a summary measure of the center of a distribution. We discussed three
measures of the center—the mean, median, and the mode. The mean or
average is by far the most common measure of the center of the data. It
has important mathematical properties that are used in other summary measures, such as the variance and standard deviation. In fact, we will see the
mean or summary measures or statistical techniques based on the mean
throughout the rest of this course. However, we also noted that the mean is
sensitive to extreme values in the data and can be misleading when outliers
are present.
The median is an alternative measure of the center of the data that is a positional measure. By positional we mean that the median is found by sorting
the data, noting the center position, and then reading the value at the center
position. In comparison to the mean, the median is far more resistant to outliers in the data and is useful whenever we find highly skewed data, such as
income of people or the value of houses.
The mode, or most frequent value in a data set, was also noted as a measure of central tendency, but it is far less useful than the other two. In some variables, the mode is undefined. However, the bunching of data around a particular value can be useful in looking at graphs of variables such as histograms, where we might think of the mode as the most frequent class or category in the graph.
We ended the chapter with a discussion of how the mean, median, and mode relate to each other when looking at the distribution of some variables. It was noted that in symmetrical, mound-shaped distributions, the mean tends to equal the median which also tends to equal the mode. The normal distribution is one such mound-shaped distribution which is very important in statistics.
Additional Problems
1. Dr. Ilvento uses a smartphone app to track the distance he walks. While taking the same walk each time, he noticed the distance stated in the app varied a lot. So he began to record the distance in an experiment. He also recorded the distance in a car and found it to be consistently 2.4 miles. The data are given in a stem-and-leaf plot below, measured in miles (n = 19 observations). Also included is summary information (e.g., sum of the values).
Stem | Leaf
2.4  | 8
2.5  | 4 6 6 9
2.6  | 0 0 1 2 3 4 5 7 7 9
2.7  | 0 2
2.8  |
2.9  | 0
3.0  |
3.1  |
3.2  |
3.3  |
3.4  |
3.5  | 0
2.4|8 represents 2.48
Sum of x      50.930
Sum of x^2    137.365
Q1            2.595
Q3            2.680
a. Calculate the mean, median, and mode for this data.
b. What is the median position?
c. In your opinion, which measure of central tendency best represents the
center of this data?
2. The table below presents the winning times for the women’s Olympic 100-meter race from 1948 to 2012 (n = 17). The data are given in a stem-and-leaf plot, measured in seconds to 2 decimal places. Also included is summary information (e.g., sum of the values).
Stem | Leaf
105  | 4
106  |
107  | 5 5 8
108  | 2
109  | 3 4 7
110  | 6 7 8 8
111  | 8
112  |
113  |
114  | 9
115  |
116  | 7
117  |
118  | 2
119  |
120  |
121  |
122  | 0
109|3 represents 10.93
Sum of x      189.130
Sum of x^2    2107.142
Q1            10.820
Q3            11.180
a. Calculate the mean, median, and mode for this data.
b. What is the median position?
c. In your opinion, which measure of central tendency best represents the center of this data?
3. An experiment was conducted concerning queuing (standing in a line)
methods at two similar fast food restaurants. In both stores A and B there
was a single line and customers were funneled toward the next available register. The experiment was done at off-peak times in Store A and
during a rush hour in Store B. The number of minutes until the customer
was served was recorded. The data are given below in two stem-and-leaf
plots, measured in minutes to 2 decimal places. Also included is summary information (e.g., sum of the values).
Store A
Stem | Leaf
1    |
1*   | 6 7 9
2    | 0 4
2*   | 5 5 5 7 8 9
3    | 0 0 0 0
3*   | 6 6 8 9 9
4    | 2
4*   |
5    |
4|2 represents 4.2
Count         21.00
Sum of x      60.50
Sum of x^2    185.53
Q1            2.50
Q3            3.60
Store B
Stem | Leaf
1    |
1*   | 7
2    |
2*   | 5 8 8 9
3    | 1 1 3 3 4
3*   | 5 5 6 7 8 9
4    | 1 1 2 2
4*   | 7
5    |
4|2 represents 4.2
Count         21.00
Sum of x      72.20
Sum of x^2    257.58
Q1            3.10
Q3            3.90
a. Calculate the mean, median, and mode for each store.
b. Compare the results for each store.
4. The rate of cesarean births has increased dramatically in the U.S. (based
on a report from the Centers for Disease Control and Prevention, NCHS
Data Brief no. 35, March 2010). In 1996 the U.S. rate of cesarean births was
20.7%, and by 2007 this had increased to 31.8%. The rate varied by state.
The data for 2007 are given below in a stem-and-leaf plot, measured as a
rate with one decimal place. Also included is summary information (e.g.,
sum of the values).
Stem
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
Leaf
3
2
2
29
668
1134556778
01126
01334789
044488
24
224689
08
0
3
26
Stem is the whole number; leaf is the decimal place.
22|2 represents 22.2
Count       51
Sum(x)      1561.40
Sum(x^2)    48538.38
Q1          28.30
Q2          30.70
Q3          33.50
a. Calculate the mean, median, and mode for 2007 cesarean rates by state.
b. The U.S. rate is given as 31.8%. The mean rate you calculated from the data above is slightly different. Why do you think the two rates are different (Hint: The difference is sometimes referred to as a weighted or unweighted mean calculation)?
5. A recent exam in an introductory graduate statistics course resulted in
the following distribution (scores are based on 100). The data are given in
a stem-and-leaf plot. Also included is summary information (e.g., sum of
the values).
Exam Results
Stem Leaf
1 8
2
3
4 2
5
6 3
7 29
8 126777899
9 112234556666678888
10 0 0 0
9|1 represents 91
Count       35
Sum(x)      3062.00
Sum(x^2)    277386.00
Q1          87.00
Q3          96.00
a. Calculate the mean, median, and mode of the test scores.
b. At least two of the scores are extreme values (18 and 42). Remove these values from the data and recalculate the mean, median, and mode. To do this, subtract each value from the Sum(x), and modify the count to reflect 33 rather than 35.
c. Did the results change by removing the outliers?
6. Forbes magazine estimated the value of all major league baseball teams in March 2013, expressed in millions of dollars. A stem-and-leaf plot from JMP software is given below, along with summary information (e.g., sum of the values). There are two large outliers, the New York Yankees, valued at $2.300 billion (expressed as 2300 in the data) and the Los Angeles Dodgers, valued at $1.615 billion (expressed as 1615 in the data).
Summary Statistics
Count    30
Sum      22,307.0
SumSq    20,882,751.0
Q1       559.8
Q2       627.5
Q3       752.5
Min      451.0
Max      2,300.0
a. Calculate the mean, median, and mode of the team values (Note: The value is in millions, so a value of 1312 is $1.312 billion, or $1,312,000,000).
b. At least two of the values are extreme values (2300 and 1615). Remove these values from the data and recalculate the mean, median, and mode. To do this, subtract each value from the Sum(x), and modify the count to reflect 28 rather than 30.
c. Did the results change by removing the outliers?
7. Scramble with Friends is a social word game on smartphones. It is a 4x4 table of letters, and the object is to generate as many words as possible in 2 minutes to win. Letters for words must be contiguous but they can go in any direction. Each game has three rounds, and the second and third rounds have bonus options to increase the score. Each round lasts 2 minutes. A player earns points based on word length, the particular letters used, and sometimes bonus letters or word points in rounds 2 and 3.
Dr. Ilvento played and recorded his score for 60 games in 2012 and 60 games in 2013 (the number of rounds was 60*3 = 180 in each year). He recorded the number of points in each round as well as the number of words.
The data below are the number of words per round in 2013 (n = 180). A histogram and stem-and-leaf plot from JMP software are given below, along with summary information (e.g., sum of the values).
Summary Statistics
Count       180
Sum(x)      10,062.00
Sum(x^2)    578,652.00
Q1          50.00
Q2          55.50
Q3          62.25
a. Calculate the mean, median, and mode for the number of words in 2013.
b. Briefly compare the three measures of central tendency for this data. Based on the plots and the measures of central tendency, what do they suggest about the distribution of words in 2013?
8. Scramble with Friends is a social word game on smartphones. It is a 4x4 table of letters, and the object is to generate as many words as possible in 2 minutes to win. Letters for words must be contiguous but they can go in any direction. Each game has three rounds, and the second and third rounds have bonus options to increase the score. Each round lasts 2 minutes. A player earns points based on word length, the particular letters used, and sometimes bonus letters or word points in rounds 2 and 3.
Dr. Ilvento played and recorded his score for 60 games in 2012 and 60 games in 2013 (the number of rounds was 60*3 = 180 in each year). He recorded the number of points in each round as well as the number of words.
The data below are the score per round in 2013 (n = 180). A histogram from JMP software is given below, along with summary information (e.g., sum of the values).
Summary Statistics
Count       180
Sum(x)      105,557.00
Sum(x^2)    72,512,839.00
Q1          429.00
Q2          513.00
Q3          689.00
Min         242.00
Max         2228.00
a. Calculate the mean, median, and mode for the score per round in 2013.
b. Briefly compare the three measures of central tendency for this data.
Based on the histogram and the measures of central tendency, what do
they suggest about the distribution of score per round in 2013?
9. Each year the Academy of Motion Picture Arts and Sciences picks a best
male actor and best female actor in a film. Below is the stem-and-leaf
plot from JMP software of the ages for the best female actor from 1934 to
2012 (n = 80, since there were two winners in 1968).
Summary Statistics
Count       80
Sum(x)      2,874.0
Sum(x^2)    113,902.0
Q1          28.8
Q2          33.0
Q3          40.3
Min         21.0
Max         80.0
a. Calculate the mean, median, and mode for the ages of the best female actors in a film.
b. Briefly compare the three measures of central tendency for this data. Based on the stem-and-leaf plot and the measures of central tendency, what do they suggest about the distribution of age for females?
10. Each year the Academy of Motion Picture Arts and Sciences picks a best
male actor and best female actor in a film. Below is the stem-and-leaf
plot from JMP software of the ages for best male actor from 1934 to 2012
(n = 79).
Summary Statistics
Count       79
Sum(x)      3,458.0
Sum(x^2)    157,514.0
Q1          38.0
Q2          42.0
Q3          49.0
Min         29.0
Max         76.0
a. Calculate the mean, median, and mode for the ages of the best male actor winners.
b. Briefly compare the three measures of central tendency for this data. Based on the stem-and-leaf plot and the measures of central tendency, what do they suggest about the distribution of age for males?
c. The average age for females who won best actor over the same time period is 35.9 years. Compare the two average ages.
chapter 4
Measures of Variability
Central tendency only tells part of the story when describing a variable. Another aspect of data is the spread or variability of data. We are still looking for summary statistics of data (simplified measures that help describe the data), but we will concentrate on how much cases differ from one another. There are several intuitive measures of spread of data, including the range, the inter-quartile range (IQR), the variance, standard deviation, and the coefficient of variation.
A Simple Example of Why the Spread Is Important
Imagine we have two data sets. Data set 1 has a mean, median, and mode of 5, and Data set 2 also has a mean, median, and mode of 5. We might conclude they are one and the same since all the measures of central tendency agree with each other. However, if we look at the data we see a different story.
Variable 1 has the following values: {2, 3, 4, 5, 5, 6, 7, 8}. The sum of X1 = 40, and n = 8, so the mean = 5.
Variable 2 has the following values: {5, 5, 5, 5, 5, 5, 5, 5}. The sum of X2 = 40, and n = 8, so the mean = 5.
However, all the values are the same in variable 2: there is no variability. X2 is a constant! We need something more to help describe a variable, namely its variability. Variability is the spread of the data, typically around the measures of the center of the data. If there is no variability, then X is thought to be a constant and is no longer a variable, because all the values are the same.
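The comparison of the two variables can be checked with a few lines of Python. This is only a minimal sketch using the standard-library statistics module; the helper name describe is hypothetical and not part of the text. It shows that the two variables agree on every measure of center but differ in their range.

```python
import statistics

# The two variables from the example above
variable_1 = [2, 3, 4, 5, 5, 6, 7, 8]
variable_2 = [5, 5, 5, 5, 5, 5, 5, 5]


def describe(values):
    """Return the measures of center plus the range for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
    }


summary_1 = describe(variable_1)
summary_2 = describe(variable_2)

print("Variable 1:", summary_1)
print("Variable 2:", summary_2)
```

Both summaries report a mean, median, and mode of 5, yet only the first variable has any spread.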
In this chapter we will focus on the following measures of variability:
• Range
• Inter-quartile range (IQR)
• Variance
• Standard deviation
• Coefficient of variation (CV)
We will also discuss the variance and standard deviation in relation to the mean, and how to interpret them for some types of variables using Chebyshev’s rule and the Empirical Rule. These rules in turn will give us a new way to define what is an outlier.
As with the measures of central tendency, I will start with a small data set
to illustrate the measures of spread, and then I will demonstrate using other
data (Table 4.1). I will explain why I squared the X-values when I get to the
computational formula for the variance.
Range
The range is a fairly intuitive measure of the spread of the data. It is calculated
as the difference between the highest (maximum or max) and lowest (minimum or min) value in the data. The range provides a sense of the extremes in
the data. It is an order statistic and depends upon only the two most extreme
values in the data. As such, the range may be seriously influenced by outliers.
The range for the sample data in Table 4.1 is calculated as:
Minimum is 21.0
Maximum is 30.0
Range = 30.0 - 21.0 = 9.0
One of the limitations of the range as a measure of spread is that it depends upon only two values of the variable, the two most extreme values. Thus it does not use much information in the variable for its calculation. This leads to the second limitation of the range, which is that it is sensitive to extreme values in the variable. One or two extreme values can have a large influence on the range. Another way to say this is that the range is sensitive to outliers.
Inter-Quartile Range
An alternative to the range is the inter-quartile range, which is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). The abbreviation for the inter-quartile range is the IQR. The inter-quartile range provides a sense of the range in the middle of the data and is not as sensitive to extreme values in the data. It is also a positional statistic because it depends upon finding positions of values in a variable that has been ordered from lowest to highest.
table 4.1
A Small Data Example for Measures of Variability
OBS     X       X Squared
1       21.0    441.0
2       22.0    484.0
3       23.0    529.0
4       24.0    576.0
5       25.0    625.0
6       26.0    676.0
7       27.0    729.0
8       28.0    784.0
9       29.0    841.0
10      30.0    900.0
n       10
Sum     255     6585.0
Mean    25.5
For the purposes of this course, we will not worry about how to actually
calculate the 25th and 75th percentiles. The formulas are somewhat similar to the median, but with small sample sizes it can be complicated to
calculate. However, I do want to give you some sense of the value of the
IQR when dealing with outliers or extreme values in a variable.
The IQR for the sample variable in Table 4.1 is:
Q1 (the 25th percentile) is between the 2nd and 3rd observations = 22.75
Q3 (the 75th percentile) is between the 8th and 9th observations = 28.25
The inter-quartile range is: 28.25 - 22.75 = 5.50
The IQR becomes an alternative to the range as a measure of the spread of the middle 50 percent of the values in a variable, that is, between the 25th and 75th percentiles. The IQR is also used in constructing box plots (see Chapter 2).
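The range and IQR for the Table 4.1 data can be reproduced with a short Python sketch. The text does not give a quartile formula, so the choice of method="exclusive" in statistics.quantiles is an assumption: it locates each quartile at position p*(n + 1) in the ordered data, which happens to reproduce the 22.75 and 28.25 reported above. Other software may use a different rule and give slightly different quartiles for small samples.

```python
import statistics

# X values from Table 4.1, already ordered from lowest to highest
x = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0]

# Range: depends only on the two most extreme values
data_range = max(x) - min(x)

# Quartiles: "exclusive" interpolates at position p * (n + 1)
# (an assumed convention, chosen because it matches the text's values)
q1, q2, q3 = statistics.quantiles(x, n=4, method="exclusive")

# IQR: spread of the middle 50 percent of the values
iqr = q3 - q1

print(f"Range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```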
Variance
The concept of deviations around the mean can be intuitively appealing as a measure of spread of the data. If the mean is a good measure of central tendency, then it is reasonable to ask how different (or how far away) a particular value of a variable (X) is from the mean of X. Taking this idea a step further, we might ask what is the average distance of all values in the variable from the mean. We start with this idea when calculating the variance as a statistical summary measure of spread in a variable.
Getting to an average deviation in a variable based on the mean can be a tricky thing. Remember, one of the properties of the mean is that the sum of deviations around the mean equals zero. As a result, we cannot simply calculate an average deviation around the mean, because that answer will always be zero (see equation to the left below).
One alternative summary measure is the mean absolute difference, which is the average of the absolute differences between each value and the mean (see equation to the right below). This simple adjustment does get around the limitation that the sum of the deviations around the mean will equal zero by making each deviation a positive value. The mean absolute difference does generate a unique summary measure for each variable (rather than always equaling zero as with the mean deviation). However, this approach does not have good inferential properties and there is another approach that is viewed as being more useful: the variance.
$$\frac{\sum_{i=1}^{n} (x_i - \bar{X})}{n} \qquad\qquad \frac{\sum_{i=1}^{n} |x_i - \bar{X}|}{n}$$
A third approach would be to square the differences from the mean, because the square will always yield positive deviations around the mean. More specifically, we square the deviations around the mean (by deviations, we refer to the difference of each value from the mean) and take an average squared deviation by dividing by the number of observations. This formula is called the variance.
Let us look at the formula for the variance more closely. We will use the Greek symbol σ² (sigma-squared, see equation below) to represent the variance of a population along with the population mean μ. The sample term for the variance will be s² (s-squared, see equation below). When we calculate the sample variance we divide by n - 1. This has to do with a concept called degrees of freedom (see box below). The need to divide by the degrees of freedom has to do with making inferences from a sample to the population. If we used n in the formula for the sample variance we would tend to underestimate the population variance. This concept is difficult to understand at this point of the course, so for now you will just have to accept this on faith.
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad\qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{(n - 1)}$$
The formula on the right with n - 1 in the denominator will be the formula we will use throughout this course, since ultimately we will be interested in making an inference to a population. Be careful with your calculator (and a spreadsheet such as Excel) if you are using a function to calculate the variance. Most calculators will have a formula for both the population variance and the sample variance. Almost always in this course we will want the sample variance.
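The same caution applies to statistical software. As an illustration only (Python is not used in the text), the standard-library statistics module offers both versions: variance divides by n - 1 and pvariance divides by n. The sketch below applies both to the Table 4.1 data so the difference is visible.

```python
import statistics

# X values from Table 4.1
x = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0]

# Sample variance: sum of squared deviations divided by n - 1
sample_variance = statistics.variance(x)

# Population variance: sum of squared deviations divided by n
population_variance = statistics.pvariance(x)

print(f"Sample variance (n - 1):  {sample_variance:.3f}")
print(f"Population variance (n):  {population_variance:.3f}")
```

With only 10 observations the two results differ noticeably (82.5/9 versus 82.5/10), which is why it matters to know which one a given function returns.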
The numerator of the variance reflects the sum of the square of each value in the variable minus the mean (see the Total Sum of Squares equation below). The numerator is called the total sum of squares (TSS, a term we will see later in ANOVA and regression). Since we take the square of the deviations around the mean, the numerator can never be negative. Once we divide by n − 1, the degrees of freedom based on the number of observations, the variance reflects the average squared deviation around the mean. A property of the mean is that the TSS about the mean will be smaller than the sum of squared deviations about any other constant value that can be placed in the formula to replace the mean. This is called the minimum variance property of the mean.
Degrees of Freedom
When we are dealing with a sample of the population, and our ultimate goal is some sort of inference, the formula for the variance and standard deviation must be adjusted for degrees of freedom. The adjustment to the sample formula uses n − 1 in the denominator. We will use the n − 1 formula for the variance (and the square root of this formula for the standard deviation) almost exclusively for the rest of this course. Degrees of freedom is an important concept in inferential statistics and will be seen in more advanced analyses. While it is a difficult concept to comprehend at this level, think of it as a necessary adjustment when dealing with a sample. Using n in the formula for the sample variance tends to underestimate the population variance. Note that the adjustment makes more of a difference when the sample size is small (less than 30) than when the sample is large (greater than 1,000).
Another way to describe the variance is that it is the mean squared deviation.
$$\text{Total Sum of Squares} = \sum_{i=1}^{n} (x_i - \bar{X})^2$$
Let us calculate the variance using the sample data from Table 4.1 (now
referred to as Table 4.2). I provided several more columns of data for the
sample data to make this easier for us to calculate.
Table 4.2, column 4 (X-Mean) clearly shows that the sum of the deviations about the mean equals zero, and this is why this approach is not useful to calculate a measure of spread or variability in the data. The sum of the squared deviations about the mean (column 5) is equal to 82.5. When I divide this value by the degrees of freedom (10 − 1 = 9), I get the sample variance, which is 9.167.
$$s^2 = \frac{\sum_{i=1}^{10} (x_i - 25.5)^2}{(10 - 1)} = \frac{82.5}{9} = 9.167$$
table 4.2
A Small Data Example for Measures of Variability with Additional Columns
OBS       X       X Squared    X-Mean    Squared Dev.
1         21.0    441.0        −4.5      20.25
2         22.0    484.0        −3.5      12.25
3         23.0    529.0        −2.5      6.25
4         24.0    576.0        −1.5      2.25
5         25.0    625.0        −0.5      0.25
6         26.0    676.0        0.5       0.25
7         27.0    729.0        1.5       2.25
8         28.0    784.0        2.5       6.25
9         29.0    841.0        3.5       12.25
10        30.0    900.0        4.5       20.25
n         10
Sum       255     6585.0       0.0       82.5
Mean      25.5
Median    25.5
Computational Formula for the Variance
The calculations for the variance can be tedious. Subtracting the mean from each value and then squaring the result, adding each of these squared deviations, and then dividing by the degrees of freedom involves a lot of calculations. For the sake of this course, I will not require you to do this by hand very often. Most times we will let a calculator or software calculate this for us. However, I want to introduce a computational formula for the variance that makes the computation of the variance for a variable more accurate, and perhaps easier. A computational formula is a modification of the original formula that will result in the same answer, but it will either make the statistic easier to calculate or reduce rounding error.
The Formula for the Sample Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{(n - 1)}$$
The Computational Formula for the Sample Variance:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{(n - 1)}$$
The two formulas above will yield an equivalent result for the sample variance. I will use these formulas throughout the course. In terms of calculating the variance with the computational formula, I only need to give you:
• The sum of all the x values
• The sum of each x value squared
• The sample size (n)
With this information you should easily be able to calculate the mean and the variance. Let me demonstrate using the information from Table 4.2. Be sure that you see how each element of the computational formula is derived and that you know how to calculate it.
• The sum of all the x values = 255
• The sum of each x value squared = 6585
• The sample size (n) = 10
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{(n - 1)} = \frac{6585 - \dfrac{255^2}{10}}{9} = \frac{(6585 - 6502.5)}{9} = 9.167$$
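The worked calculation above translates directly into code. The following is a minimal Python sketch; the function name sample_variance_from_sums is hypothetical. It takes only the three summary quantities listed above and returns the sample mean and variance.

```python
def sample_variance_from_sums(sum_x, sum_x_squared, n):
    """Computational formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


# Summary quantities from Table 4.2
sum_x = 255
sum_x_squared = 6585
n = 10

mean = sum_x / n
s_squared = sample_variance_from_sums(sum_x, sum_x_squared, n)

print(f"Mean = {mean}")
print(f"Sample variance = {s_squared:.3f}")
```

One caution when moving from hand calculation to software: in floating-point arithmetic this formula subtracts two large, nearly equal numbers when the mean is large relative to the spread, so it can lose precision. That is a property of computer arithmetic rather than of the formula itself, and it does not affect small examples like this one.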
Standard Deviation
Average squared deviations around the mean are awkward to discuss and interpret. Anything in squared terms is difficult to describe. Fortunately, it is relatively easy to alter the variance and put it back into regular terms. If we take the square root of the variance, we have a value that is no longer in squared terms and we bring the measure back into the original metric terms of the variable. This new term is the standard deviation, which we can loosely think of as the average deviation around the mean. We use the Greek letter σ (sigma) to represent the population standard deviation and the term s to represent the sample standard deviation.
The formula for the standard deviation is shown below. Note that it is simply the square root of the variance.
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{(n - 1)}}$$
Interpreting the Standard Deviation
We can use the standard deviation to express the proportion of cases that
might fall within one or two or more standard deviations from the mean. We
will use two theorems to help interpret the standard deviation.
1. Chebyshev’s rule (also known as Tchebysheff’s theorem)
2. Empirical rule
Chebyshev’s rule is simply a mathematical theorem for any variable, regardless of its distribution. It states that at least 3/4 of the values within a variable will fall within ± 2 standard deviations from the mean. This does not mean it could not be more, but at least 3/4 of them will. Also note that Chebyshev’s rule does not say the values will be distributed symmetrically around the mean. Chebyshev’s rule also states that at least 8/9 of the measurements (about 89%) will fall within ± 3 standard deviations from the mean.
Chebyshev’s rule applies to any variable, regardless of its distribution, but it is not that specific. If our variable is symmetrical and mound-shaped in its distribution, we can use the empirical rule to make some statements to interpret the standard deviation. By symmetrical we mean that the distribution is the same (or reasonably close) to the left and right of the mean. By mound-shaped we mean that the largest proportion of the observations are centered around the middle of the distribution, and the mean, median, and mode of the variable are reasonably close.
If our variable is symmetrical and mound-shaped, the empirical rule tells us that approximately 68% of the observations should be within plus or minus 1 standard deviation of the mean; 95% should be within plus or minus 2 standard deviations; and nearly all the observations (99.7%) should be within plus or minus 3 standard deviations of the mean.
We can express this as:
• 68% of the observations are within ± 1*s of the mean
• 95% of the observations are within ± 2*s of the mean
• 99.7% of the observations are within ± 3*s of the mean
This rule allows us to say how likely or unlikely it would be to find a value that is a certain number of standard deviations away from the mean.
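The two rules can be put side by side in a short Python sketch. The text quotes Chebyshev’s rule only for 2 and 3 standard deviations; the general form used here, at least 1 - 1/k² of the values within k standard deviations for k greater than 1, is the standard statement of the theorem and is added as background rather than taken from this chapter. The function name chebyshev_lower_bound is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is only informative for k > 1")
    return 1 - 1 / k ** 2


# Approximate proportions from the empirical rule
# (only valid for symmetrical, mound-shaped variables)
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

# Chebyshev's guaranteed minimums (valid for any distribution)
chebyshev = {k: chebyshev_lower_bound(k) for k in (2, 3)}

for k in (2, 3):
    print(
        f"Within {k} standard deviations: "
        f"at least {chebyshev[k]:.1%} (Chebyshev), "
        f"about {empirical_rule[k]:.1%} (empirical rule)"
    )
```

The comparison shows why Chebyshev’s rule is described as not that specific: it guarantees only 75% within 2 standard deviations, while a mound-shaped variable typically has about 95% there.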
Z-scores
The z-score approach is a method of transforming data to reflect the relative standing of the value in relation to the mean, in terms of the standard deviation. A z-score is calculated by subtracting the mean from a value and then dividing by the standard deviation (see below).
$$z_i = \frac{(x_i - \bar{X})}{s}$$
The result represents the distance between a given measurement X and its mean, expressed in standard deviations. A positive z-score means that the measurement is larger than the mean while a negative z-score means that it is smaller than the mean. By dividing through by the standard deviation we are able to say how far away a value is from its mean in a relative way. The relative expression is how far away in standard deviations.
If we were to convert an entire variable to z-scores (take each value, subtract the mean, and divide by the standard deviation) we would create a new variable that has a mean equal to zero and a standard deviation equal to one. The new variable would be in standardized units and thus would allow us to compare different values to each other in terms of how many standard deviations away from the mean they are.
A z-score transformation does not change the order of the data or the shape of the distribution of the data. This is because we are subtracting and dividing through by constant values (i.e., the mean and standard deviation). Use of a z-score transformation can help in interpretation of a variable, comparison of variables measured on different scales, and in cases of variables whose measurement is somewhat contrived and arbitrary, such as an index.
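These properties of the z-score transformation are easy to verify numerically. The sketch below is an illustration in Python (not software used in the text) that standardizes the Table 4.1 data with the sample standard deviation and then checks that the new variable has a mean of zero, a standard deviation of one, and the same ordering as the original values.

```python
import statistics

# X values from Table 4.1
x = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0]

mean = statistics.mean(x)
s = statistics.stdev(x)  # sample standard deviation (n - 1 in the denominator)

# z-score: subtract the mean, then divide by the standard deviation
z = [(value - mean) / s for value in x]

z_mean = statistics.mean(z)
z_stdev = statistics.stdev(z)

print("z-scores:", [round(score, 2) for score in z])
print(f"Mean of z = {z_mean:.3f}, standard deviation of z = {z_stdev:.3f}")
```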
In terms of the empirical rule, z-scores have an even easier interpretation.
• Approximately 68% of the measurements will have a z-score between -1 and 1
• Approximately 95% of the measurements will have a z-score between -2 and 2
• Almost all the measurements (99.7%) will have a z-score between -3 and 3
Transforming to z-scores makes these types of problems even easier and leads to a rare event approach to statistics. The rare event approach is a basic strategy of data analysis which will fit very well with hypothesis testing later on. In the rare event approach, we start with a hypothesized frequency distribution to describe a population of measurements. In many cases the hypothesized distribution will reflect a world where nothing is going on with the data, i.e., the status quo. Next we draw a sample of data from the population (most often in a random fashion). We then compare the sample statistic to the hypothesized frequency distribution to see how likely or unlikely it is that the sample came from the hypothesized distribution. If the sample value is very unusual relative to the hypothesized value, we would have strong evidence that our sample is different from the hypothesized value.
The empirical rule also gives us a rule of thumb to determine if a value is an outlier. If a value is more than three standard deviations away from the mean, it is extremely rare. In a probabilistic framework, we would say that it is possible, but not very probable. Thus, if we had a compact car that gets less than 29.8 mpg or more than 44.2 mpg we might ask questions. Perhaps it is a performance car that is part of a different population of compact cars or, if it is on the high end, a specialty hybrid that is unique. Or, someone could have made a mistake in measuring mpg or in entering the data in a computer. The fact that a value is extreme does not make it wrong or bad, but it should cause us to ask questions and examine it further.
Coefficient of Variation
Another way to express the standard deviation is in relation to the mean. The coefficient of variation (CV) is the ratio of the standard deviation to the absolute value of the mean, usually expressed as a percentage. By taking a ratio, we express the standard deviation relative to the mean and it provides a way to say how much variability there is in a variable relative to the size of the mean. The higher the percentage is, the more variability there is in our variable.
The CV is particularly useful when comparing the variability of different variables. For example, suppose we have a data set on customers and we want to compare the variability of education level and their income. It would not be useful to compare the standard deviations because the metric on income is so much larger than that of education. However, we could compare the CVs for each variable and talk about which variable has more variability.
The CV formula is given below.
$$CV = \frac{s}{\bar{X}} \times 100$$
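A short Python sketch shows why the CV is useful for comparing variables on different metrics. The function name coefficient_of_variation is hypothetical, and the rescaled variable (the Table 4.1 values multiplied by 1,000) is an invented illustration, not data from the text. Rescaling multiplies the standard deviation by 1,000 but leaves the CV unchanged.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean) * 100


# X values from Table 4.1
x = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0]

# Hypothetical rescaling, e.g., thousands of dollars converted to dollars
x_rescaled = [value * 1000 for value in x]

cv_x = coefficient_of_variation(x)
cv_rescaled = coefficient_of_variation(x_rescaled)

print(f"Standard deviations: {statistics.stdev(x):.2f} vs {statistics.stdev(x_rescaled):.2f}")
print(f"CVs: {cv_x:.2f}% vs {cv_rescaled:.2f}%")
```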
The Variance and Outliers
Because the variance is calculated as squared deviations around the mean, it can be sensitive to outliers in the data. Values that are far away from the mean result in large deviations, and once we square them they can contribute a lot to the variance. The standard deviation is also sensitive to outliers, but somewhat less so since it is the square root of the variance.
Let us return to the marriage rate data for the 50 states and the District of Columbia from Chapter 3 (see Table 4.3). Recall the marriage rate is calculated as the total number of marriages divided by the population for each state and the District of Columbia. Nevada, because of its reputation for quick marriages, had a much higher marriage rate than the other states for 2005. We can calculate the measures of variability to see how much an outlier can influence the various measures of spread.
The maximum value is 57.94 (Nevada) and the minimum is 4.03 (District of Columbia). As a result the range is 53.91, a substantial difference between the highest and lowest rates. As noted earlier, the range relies on the two most extreme values in the data. When Nevada is removed from the data, the range reduces to 19.06. Just one value had a considerable impact on the range. In comparison, the inter-quartile range is much smaller. The first quartile is 6.47 and the third quartile is 8.30, with an IQR of 1.83. We would interpret this as follows: the difference between the low and high value within the middle 50 percent of the values is only 1.83. The middle 50 percent of the values has considerably less variability than the total variable. When Nevada is removed from the data, the IQR barely changes (now 1.82). One way to say this is that the IQR is resistant to outliers compared with the range.
table 4.3
Marriage Rates by State, 2005
State                   Marriage Rate
District of Columbia    4.0
New Jersey              5.7
Mississippi             5.8
Pennsylvania            5.8
Illinois                5.8
Connecticut             5.9
Delaware                6.0
Minnesota               6.0
Michigan                6.0
Wisconsin               6.1
Massachusetts           6.1
California              6.4
Arizona                 6.4
Ohio                    6.5
Washington              6.5
New Mexico              6.7
New York                6.8
Kansas                  6.9
North Dakota            6.9
Maryland                6.9
Georgia                 6.9
Iowa                    6.9
Indiana                 7.0
Nebraska                7.0
Missouri                7.0
Rhode Island            7.1
New Hampshire           7.3
Oklahoma                7.3
Oregon                  7.3
North Carolina          7.3
West Virginia           7.4
Montana                 7.4
Colorado                7.5
Texas                   7.7
Louisiana               8.1
Alaska                  8.1
Virginia                8.2
Maine                   8.3
South Carolina          8.3
South Dakota            8.4
Kentucky                8.7
Florida                 8.9
Vermont                 8.9
Alabama                 9.2
Wyoming                 9.5
Utah                    9.6
Idaho                   10.5
Tennessee               10.9
Arkansas                12.9
Hawaii                  23.1
Nevada                  57.9
Source: 2010 Statistical Abstract of the United States
table 4.4
When we examine the variance and standard deviation, the effect of the
outlier Nevada is considerable. For the full data set, the total sum of squares
is 6,695.12. When just one value is removed, the Nevada rate, the sum of
squares reduces to 3,338.19.This is a 50 percent decrease from just one value!
The variance for the full data is 56.58 while the variance for the reduced data
is 7.28, a nearly eight-fold decrease. Likewise the standard deviation declines
from 7.52 to 2.70 once Nevada is removed from the data. The coefficient of
variation shows the effect of the outlier nicely. The CV is 86.39 for the full
data, meaning the standard deviation is about 86 percent of the mean. Once
Nevada is removed, the CV drops to 34.94, indicatingthe standard deviation
is only 35 percent of the mean.
Measures of Variability for the 2005 Marriage Rate Data, with and without Nevada

Measure of Spread           With Nevada   Without Nevada
Sum                              444.04           386.10
Count                             51.00            50.00
Mean                               8.71             7.72
Median                             7.05             7.04
Mode                               #N/A             #N/A
Sum of Squares                  6695.12          3338.19
Min                                4.03             4.03
Max                               57.94            23.09
Range                             53.91            19.06
IQ1                                6.47             6.45
IQ3                                8.30             8.27
Variance                          56.58             7.28
Std Deviation                      7.52             2.70
Coefficient of Variation          86.39            34.94
The marriage rate data, with or without Nevada, hardly has a symmetrical,
mound-shaped distribution. Thus the empirical rule does not apply. However,
it might be useful to calculate a z-score for Nevada in the data to see just how
many standard deviations it is from the mean.
ZNevada = (57.94 - 8.71)/7.52 = 6.55
The marriage rate for Nevada is 6.55 standard deviations above the mean.
Z-scores that are above 3 are considered rare, so this value is very unusual.

An extreme value more than three standard deviations from the mean is not
wrong or bad in itself. However, it does indicate there is an extreme value in
the data, and that this value can have great influence on some of the measures
of central tendency and variability. In this case, Nevada is so unusual
compared to all the other states that we might want to exclude it from any
further analysis. The next highest value, for Hawaii, is only 1.91 standard
deviations above the mean:
ZHawaii = (23.1 - 8.71)/7.52 = 1.91
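The two z-score calculations follow the same pattern, which can be written as a small Python function. This is a minimal sketch using the rounded full-data mean and standard deviation quoted in the text. The name `z_score` and the cutoff of 3 used for the label are illustrative choices based on the rule of thumb stated above.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Full-data mean and standard deviation (Nevada included), as quoted in the text
MEAN, SD = 8.71, 7.52

z_nevada = z_score(57.94, MEAN, SD)
z_hawaii = z_score(23.1, MEAN, SD)

for state, z in [("Nevada", z_nevada), ("Hawaii", z_hawaii)]:
    # Rule of thumb from the text: z-scores above 3 are considered rare
    label = "rare" if abs(z) > 3 else "not rare"
    print(f"{state}: z = {z:.2f} ({label})")
```

Both z-scores use the mean and standard deviation of the full data, so Nevada's own value is part of the yardstick it is measured against.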
A Data Example: Summary Statistics for a Symmetrical,
Mound-Shaped Variable
Let us use all the information we have learned thus far to describe a variable.
This variable has 154 observations and is measured on a continuous level.
We will use the software package JMP to help generate the summary statistics. You will note from the output that JMP generates many more summary
statistics of central tendency or variability than we have discussed thus far,
and some that we have discussed, such as the mode, are not included. However, the core measures such as the mean, median, variance, and standard
deviation are included. JMP also generates graphs such as a histogram, a
box plot, and a stem and leaf plot for our use.
The mean for this variable is 24.97, which is nearly identical to the median.
The histogram, box plot, and the stem and leaf plot all show the distribution
is symmetric and mound-shaped. The range is 16 (33-17) and the IQR
is 4 (27-24). The variance is 10.95 and standard deviation is 3.31. There
is not much spread in this variable since the coefficient of variation is
only 13.25.
If we calculated a range of plus/minus 3 standard deviations about the mean,
we would have an interval of:
24.97 ± 3*3.31 = 24.97 ± 9.93 = 15.04 to 34.90.
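The interval arithmetic can be sketched in Python as follows. The function name `k_sd_interval` is hypothetical, and the smallest and largest observations (17 and 33) are taken from the range quoted earlier in this example.

```python
def k_sd_interval(mean, sd, k=3):
    """Return the (lower, upper) bounds of mean plus/minus k standard deviations."""
    half_width = k * sd
    return mean - half_width, mean + half_width


low, high = k_sd_interval(24.97, 3.31)

# Smallest and largest observations, from the range reported as 33 - 17
smallest, largest = 17, 33
all_inside = low <= smallest and largest <= high

print(f"Interval: {low:.2f} to {high:.2f}")
print(f"All observations inside the interval: {all_inside}")
```

Checking only the minimum and maximum is enough here: if both extremes lie inside the interval, every other observation does too.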
I can calculate a z-score for one of the observations. One observation is 18.0.
The z-score for this observation is:
Z = (18.0 - 24.97)/3.31 = -2.11
This observation is 2.11 standard deviations below the mean.
All the observations of this variable fall within the interval of plus/minus
3 standard deviations about the mean (15.04 to 34.90).
Summary Statistics from JMP for a Symmet...