© Jones & Bartlett Learning, LLC. NOT FOR SALE OR DISTRIBUTION.
[Chapter-opening flowchart. What do you want to do? Describe or make inferences. How many variables? Univariate, bivariate, or multivariate. What level of data? Nominal: central tendency, mode. Ordinal: central tendency, median. Interval/ratio: central tendency, mean.]
Chapter 4
Measures of Central
Tendency
Learning Objectives
■■ Understand the mode, median, and mean as measures of central tendency.
■■ Identify the proper measure of central tendency to use for each level of measurement.
■■ Explain how to calculate the mode, median, and mean.
4-1 Univariate Descriptive Statistics
Using frequency distributions and graphical representation, as in Chapter 3, helps researchers determine how the data is arranged and summarize it. Frequency distributions and graphs, however, cannot always tell the entire story. It is usually necessary to summarize the data further. Instead of summarizing entire distributions, it is more often efficient to compare only certain characteristics of the data. To conduct this comparison, it is helpful to know certain information, such as the form of the distribution, the average of the values, and how spread out they are within the distribution.
This is where univariate descriptive statistics come into play. Univariate descriptive statistics are used to describe and interpret the meaning of a distribution. They are called univariate because they pertain to only one variable at a time and do not attempt to measure relationships between variables. Univariate descriptive statistics make compact characterizations of distributions in terms of three properties of the data. First is the central tendency, which translates to the average, middle point, or most common value of the distribution. The second property is the dispersion of the data. This relates to how spread out the values are around the central measure. Finally, there is the form of the distribution. The form of a distribution relates to what the distribution would look like if displayed graphically. Included in the form of a distribution is the number of peaks, skewness, and kurtosis. In this chapter we address the first univariate descriptive procedure: measures of central tendency. Measures of dispersion and measures of the form of a distribution are covered in Chapters 5 and 6, respectively.
4-2 Measures of Central Tendency
Measures of central tendency examine where the central value is in a distribution or
the distribution’s most typical value. There are three common measures of central
tendency, one for each level of measurement (interval and ratio are combined). These
are the mode for nominal level data, the median for ordinal level data, and the mean
for interval and ratio level data.
Mode
At the lowest level of sophistication is the mode (symbolized by Mo). The mode is used primarily for nominal data to identify the category with the greatest number of cases. The mode is the most frequently occurring value, or case, in a distribution. It is the tallest column on a histogram or the peak on a polygon or line chart. The mode has the advantage of being spotted easily in a distribution, and is often used as a first indicator of the central tendency of a distribution.
The mode is the only measure of central tendency appropriate for nominal variables because it is simply a count of the values. Unlike other measures of central tendency, the mode explains nothing about the ordering of variables or variation within the variables. In fact, the mode ignores information about ordering and interval size even if it is available. So it is generally not advised to use the mode for ordinal or interval level data (unless it is used in addition to the median or mean) because too much information is lost.
There is no formula or calculation for the mode for either grouped or ungrouped data. The procedure is just to count the scores and determine the most frequently occurring value. Consider the data set in Table 4-1, which is the number of prisoner escapes from 15 prisons over a 10-year period. Here, there are 15 values, one for each prison. There are two 7’s, one 6, three 5’s, two 4’s, four 3’s, one 2, and two 1’s. The mode in this data set would be 3 escapes, because there are more 3’s than any other value.
7    5    4    3    2
7    5    4    3    1
6    5    3    3    1
Table 4-1 Ungrouped Data
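The counting procedure for the mode can be sketched in a few lines of Python. This is only an illustration of the tally described above, using the 15 values from Table 4-1; the variable names are chosen here for the example and are not part of the text.

```python
from collections import Counter

# Escape counts for the 15 prisons in Table 4-1.
escapes = [7, 5, 4, 3, 2, 7, 5, 4, 3, 1, 6, 5, 3, 3, 1]

# Tally how often each value occurs (value -> frequency).
counts = Counter(escapes)

# The mode is the value with the highest frequency, not the frequency itself.
mode_value, mode_frequency = counts.most_common(1)[0]
print(mode_value, mode_frequency)  # 3 4
```

Note that the mode is the value 3, not its frequency of 4.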
For grouped data, determining the mode is often even easier because the numbers are already counted. The data from Table 4-1 has been grouped in Table 4-2. What is the mode of this data set? Here you simply determine the category that has the highest frequency. In this case it would be the 3–4 category because it has a frequency of 6. If the
data were plotted on a bar chart or polygon, the distribution would look like that in
Figure 4-1.

X      f
7–8    2
5–6    4
3–4    6
1–2    3
Table 4-2 Modal Value for Grouped Data

[Figure 4-1 Bar Chart and Polygon of Grouped Data from Table 4-2. Vertical axis: frequency, 0 to 7; horizontal axis: categories 1–2, 3–4, 5–6, 7–8.]

How do you do that?
Obtaining Univariate Statistics in SPSS
Univariate statistics include measures of central tendency, measures of dispersion, and form. You can obtain all of these statistics in the same procedure in SPSS, and it is just an extension of the same procedure you used in Chapter 3 to obtain a frequency distribution. The steps to follow are:
1. Open a data set.
   a. Start SPSS.
   b. Select File, then Open, then Data.
   c. Select the file you want to open, then select Open.
2. Once the data is visible, select Analyze, then Descriptive Statistics, then Frequencies.
3. Make sure the Display Frequency Tables box is checked.
4. Select the variables you wish to include in your distribution and press the arrow button between the two windows.
5. Select the Statistics button at the bottom of the window.
6. Check the boxes of any of the univariate measures you want to include in your research.
   a. For measures of central tendency (this chapter), check the boxes in the frame Central Tendency, typically the mode, median, and mean.
   b. For measures of dispersion (Chapter 5), check the boxes in the frame Dispersion, typically the standard deviation, variance, and range.
   c. For measures of form (Chapter 6), check the boxes in the frame Distribution, specifically skewness and kurtosis.
7. Select Continue, then OK.
8. An output window should appear containing a distribution similar in format to Table 4-3.

What is your highest level of education?
Value Label              Value   Frequency   Percent   Valid Percent   Cumulative Percent
Less than High School      1         16        4.6          4.8               4.8
GED                        2         59       17.0         17.6              22.3
High School Graduate       3          8        2.3          2.4              24.7
Some College               4        117       33.7         34.8              59.5
College Graduate           5         72       20.7         21.4              81.0
Post Graduate              6         64       18.4         19.0             100.0
Missing                              11        3.2
Total                               347      100.0        100.0

N Valid                    336
N Missing                   11
Mean                      4.08
Median                    4.00
Mode                         4
Std. Deviation           1.460
Variance                 2.131
Skewness                 −.477
Std. Error of Skewness    .133
Kurtosis                 −.705
Std. Error of Kurtosis    .265
Range                        5
Table 4-3 Combination Table for Education from the 1993 Little Rock Community Policing Survey

Here, you can see that the category 3–4 has the highest bar on the bar chart
and it forms a hump in the polygon. This highest bar or hump indicates the mode for
that variable.
One caution is in order when discussing the mode. The mode is not the frequency of the number that occurs most often but rather the category (or class) itself. It is tempting to state that the mode in Table 4-2 is 6 because that is the highest frequency. This is not the mode, however; the mode is the category or value that has the highest frequency: in this case, 3–4.
Data that is in a frequency table also makes calculating the mode easy. What is the mode in the frequency table in Table 4-3? The mode in this case is 4, or some college. Note that in this case, the mode can be written as either 4 or some college. When using nominal or ordinal data where value labels are assigned to the values, the mode can be expressed as either the value (number) or the value label.
The histogram with a polygon overlay for the data in Table 4-3 is shown in Figure 4-2. As shown in the figure, the highest bar on the histogram or the hump in the polygon is at the 4 or some college level. The mode as calculated here is what is obtained from SPSS. In the output in Table 4-3, the mode is identified as some college (4), with a frequency of 117. Notice also that the median, mean, and other measures are also included in this table. This is typical univariate output from SPSS. It provides most of the univariate descriptive statistics discussed in this chapter and the two that follow. Table 4-3 may look somewhat daunting right now, but by the time you finish Chapter 6, a frequency table and univariate output such as this should be shorthand for everything you need to know about a distribution.
A distribution is not confined to having only one mode. There are often situations where a distribution will have several categories that have the same or similar frequencies. In these cases, the distribution can be said to be bimodal or even multimodal. It is also possible for a distribution to have no mode if the frequencies are the same for each category. If the data in Table 4-1 is modified, a bimodal, multimodal, and a data set
with no mode can be created, as shown in Figures 4-3 to 4-5.
[Figure 4-2 Histogram and Polygon of Education Responses. Vertical axis: Frequency, 0 to 140; horizontal axis: Education Type (Less Than High School, GED, High School Graduate, Some College, College Graduate, Post Graduate).]
[Figure 4-3 Bimodal Distribution. Vertical axis: frequency, 0 to 3; horizontal axis: values 1 to 7.]
[Figure 4-4 Multimodal Distribution. Vertical axis: frequency, 0 to 3; horizontal axis: values 1 to 7.]
[Figure 4-5 No-Modal Distribution. Vertical axis: frequency, 0 to 3; horizontal axis: values 2 to 7.]
In Figure 4-3, categories
3 and 4 both have the same frequency: 3. In this case, both the 3 and the 4 would be
the modes because each has the same (highest) value.
In Figure 4-4, the 3, 4, and 5 categories all have frequencies of 3. This means
that all three categories would be the mode for this distribution. When almost half of
the categories in the distribution represent the mode, its use as a measure of central
tendency is reduced.
In Figure 4-5, all of the categories have the same frequency. This does not happen very often, but it is possible, especially in survey research or with other data that have a limited range of categories. The mode as a measure of central tendency in this case is practically useless, although it would be beneficial as a way of stating that all of the values have the same frequency. Although each of the modes in Figures 4-3 (3 and 4), 4-4 (3, 4, and 5), and 4-5 (2 through 7) have the same frequency, that does not always have to be the rule. There is some debate as to what constitutes a bimodal or multimodal distribution. Some propose that the frequencies have to be the same for a distribution to be multimodal. Others argue that practically any peaks in a distribution can represent modes. For example, in Figure 4-1, some would argue that both the 1–2 category and the 3–4 category represent a mode. These people argue that any peak in a polygon, or any spike in the frequency, may represent a mode. In this text, only the category or categories with the highest frequencies will be designated the mode.
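The rule adopted in this text (only the category or categories with the highest frequency count as the mode) can be sketched as a small Python function. The function name and the bimodal and no-mode data sets below are hypothetical, made up for illustration; they are not the data behind Figures 4-3 to 4-5. Treating a distribution in which every category ties as having no mode follows the discussion above, and the function assumes a non-empty list.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency.

    Returns an empty list when every category has the same frequency,
    which this text treats as a distribution with no mode.
    """
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(f == top for f in counts.values()):
        return []
    return sorted(v for v, f in counts.items() if f == top)

table_4_1 = [7, 5, 4, 3, 2, 7, 5, 4, 3, 1, 6, 5, 3, 3, 1]
bimodal_example = [1, 2, 3, 3, 3, 4, 4, 4, 5]   # hypothetical data
no_mode_example = [2, 2, 3, 3, 4, 4]            # hypothetical data

print(modes(table_4_1))        # [3]
print(modes(bimodal_example))  # [3, 4]
print(modes(no_mode_example))  # []
```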
Measures of central tendency are among the oldest of all descriptive statistics. The mean, for example, can be traced back to Pythagoras in the 6th century BC, although its development is surely much earlier. Galton (1883) coined the term median during his work on percentiles, but the procedure was used before this by Fechner for arriving at a value of the “middlemost ordinate.” Finally, Karl Pearson reduced the concept of the “abscissa corresponding to the ordinate of maximum frequency” to the mode in his 1895 work.
As can be seen from these distributions, the mode quickly becomes ineffective when there are multiple modes and is worthless (except for an understanding of the nature of the distribution) when each category has a modal value. This is why the mode is not widely used as a measure of central tendency in statistics except for nominal level data.
Median
If the data is at least ordinal level, the median (symbolized by Me) may be a better choice for examining the central tendency of the distribution. The median is the point of the 50th percentile of the distribution. This means that the median is the exact midpoint of a distribution or the value that cuts the distribution into two equal parts. For the simple distribution 1, 2, 3, the median would be 2 because it cuts this distribution in half. Note that 2 is not the most frequently occurring value or the product of some formula but simply the value in the middle. The median will always be the middle value, but sometimes it will be necessary to resort to math to determine the exact middle value.
The median is used with ordinal level data because it does not imply distance between intervals, only direction: above the median or below it. Recall from “Variables and Measurement” (Chapter 2) that the nature of ordinal level data is that you can determine which category is greater than or less than another category, but there are not equal intervals so there is no way to determine how much greater or lesser the category is. The median also works on this principle, determining the midpoint of a distribution such that a category can be said to be less than, or greater than the median, but there is no way to tell by how much. For example, take the following two distributions:
1, 2, 3, 3, 4, 4, 5
1, 1, 1, 3, 10, 50, 100
Each has the same number of values, 7, although each has very different numbers. In this case, the modes would be different: 3 and 4 in the first; 1 in the second. Also, the means would be different: 3.14 in the first, 23.71 in the second. The median for both of these distributions, however, would be 3, the middle value in the distribution. In both distributions there are three values below the median and three values above the median.
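The comparison of these two distributions can be checked with Python's standard `statistics` module. This is a small sketch; the helper name `summarize` is chosen here for illustration.

```python
import statistics

first = [1, 2, 3, 3, 4, 4, 5]
second = [1, 1, 1, 3, 10, 50, 100]

def summarize(data):
    """Return (modes, mean rounded to 2 places, median) for a list of scores."""
    return (statistics.multimode(data),
            round(statistics.mean(data), 2),
            statistics.median(data))

print(summarize(first))   # ([3, 4], 3.14, 3)
print(summarize(second))  # ([1], 23.71, 3)
```

The modes and means differ sharply, while the median is 3 in both cases.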
The median may be used instead of the mean in a special circumstance where the distribution takes on the quality of being skewed. The mean (discussed in the next section) is often highly influenced by extreme scores. For example, if you were to calculate the mean, or average, age of four people who are 2, 3, 4, and 50 years old, the mean would be 14.75 years old. Obviously, 14.75 is not a good measure of the central value in this distribution, but because of the way the mean is calculated, that is the value that would be obtained. The median of that distribution would be 3.5, which is much more like the central value. Even in the example above, the mean of the second distribution is 23.71, which is not really representative of the distribution. Note, however, that if the variable is interval and the distribution is not skewed, some of the explanatory power of the data is lost using the median as the measure of central tendency.
Median for Ungrouped Data
Calculation of the median for ungrouped data is relatively simple. All that is needed is the N for the distribution. If the N is not given, simply count the number of scores (remember: do not add the scores, count them). The N is then placed in the formula $(N + 1)/2$. If the data from Table 4-1 were expanded, the median can be calculated as shown in Exhibit 4-1.
There are 23 values here. Adding 1 to this number and then dividing by 2 obtains the exact middle of the distribution, in this case the 12th value. Once this value is calculated, if the numbers are not arranged in order, you should do so. This ensures that the middle value of the distribution is actually in the middle and that the numbers are arranged in order. Then, beginning with the lowest value, simply count up the ungrouped data until the value obtained in the formula is reached (the 12th value in Exhibit 4-1). This is the median. In this case, counting to the 12th value would produce a score of 3. So the median number of escapes in this distribution is 3.
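The count-up procedure can be sketched in Python using the 23 scores from Exhibit 4-1. The sketch assumes an odd N, so that the formula gives a whole-number position; the variable names are illustrative only.

```python
# The 23 scores from Exhibit 4-1 (escapes per prison).
scores = [7, 6, 6, 6, 6, 5, 5, 5, 5, 4, 4, 3, 3, 3,
          2, 2, 2, 2, 2, 2, 2, 1, 1]

n = len(scores)                 # count the scores, do not add them
position = (n + 1) // 2         # (23 + 1) / 2 = 12, the value to count up to
ordered = sorted(scores)        # arrange from lowest to highest
median = ordered[position - 1]  # Python lists are 0-based

print(n, position, median)      # 23 12 3
```

The position (12) and the median (3) are two different numbers, which is the distinction drawn in the paragraphs that follow.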
There are several issues to note about the median. These are important to understand when interpreting the median. First, be careful when calculating the median because two different numbers must be dealt with. The value that is obtained from the formula is not the median but simply the number of values to count up in the distribution to find the median (or median class for grouped data). The median is the score or class that contains the number from the formula. In the example above, the median is not 12. Twelve is only the number to count up from the beginning of the distribution to find the median, which is 3.
Also, if there is more than one of the same score in the median class (there are three 3’s in Figure 4-5), the median is still that score even though it occurs more than once. The key value to look for is the middle value, regardless of how many there are of that particular category. This will be brought up again in terms of calculating the median for grouped data where the class interval is greater than 1.

Scores: 7, 6, 6, 6, 6, 5, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1
$$\frac{N + 1}{2} = \frac{23 + 1}{2} = \frac{24}{2} = 12$$
Exhibit 4-1 Ungrouped Data

Finally, unlike the mode, the median does not have to be a value in the distribution. For an odd number of scores (as in Exhibit 4-1) the median will be one of the scores
because it is the point that cuts the distribution in half. If there are an even number of scores, however, the median will fall in between two of the scores. For example, in the distribution 3, 4, 5, 6, 7, 8, 9, 10, the formula $(N + 1)/2$ would give a value of 4.5. This would put the median between the 6 and 7. When this occurs, the median is a value halfway between the two scores. In this case, the median would be 6.5. This holds true even if the two numbers do not have an interval of 1. For example, in the distribution 5, 6, 8, 10, 11, 12, the number from the formula puts the median between the score of 8 and the score of 10; therefore, the median would be 9.
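The even-N case can be sketched as follows. The function name `median_by_position` is hypothetical. It takes the scores on either side of the position given by the formula and returns the value halfway between them; when the position is a whole number, both sides are the same score. The sketch assumes numeric scores.

```python
import math

def median_by_position(scores):
    """Return (position, median) using the (N + 1) / 2 rule."""
    ordered = sorted(scores)
    position = (len(ordered) + 1) / 2            # 1-based; may end in .5
    lower = ordered[math.floor(position) - 1]    # score just below the position
    upper = ordered[math.ceil(position) - 1]     # score just above the position
    return position, (lower + upper) / 2

print(median_by_position([3, 4, 5, 6, 7, 8, 9, 10]))  # (4.5, 6.5)
print(median_by_position([5, 6, 8, 10, 11, 12]))      # (3.5, 9.0)
```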
Median for Grouped Data
For grouped data where the class interval is 1 or where the entire class can be used as the median, the process for finding the median is essentially the same as that for ungrouped data. The first step is to find the value to count up in the distribution using the formula $(N + 1)/2$. Then, simply count up the frequency of each class to find the median class. If the data from Exhibit 4-1 is grouped into a frequency distribution it would look like Exhibit 4-2.
The first step is to determine the midpoint using the formula. Since the data in Exhibit 4-2 has not changed, there are still 23 values (prisons). Plugging this value into the formula would, as before, result in a value of 12. Since the values are in a frequency distribution, they are probably already ordered. Although it is possible to find the median beginning from either the lowest or highest category, it is best for consistency to begin with the lowest value. In this case, you would begin with the class of 1 and count the frequencies until the 12th value is reached, which is 3. Note that it is possible here to count from 10 to 12 and still be in the 3 class. This is fine, as long as the value we are looking for, 12, is one of the numbers in that class. If the middle value from the calculation had been a 10, 11, or 12, the 3 class would still have been the median class.

X     f
7     1
6     4
5     4
4     2
3     3
2     7
1     2
N    23
$$\frac{N + 1}{2} = \frac{23 + 1}{2} = \frac{24}{2} = 12$$
Exhibit 4-2 Grouped Data with an Interval Class of 1
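Counting up the frequencies to find the median class can be sketched with a running (cumulative) total. The function name and the list-of-pairs layout are assumptions made for this example. The table must be ordered from the lowest class to the highest, as recommended above.

```python
def median_class(freq_table):
    """freq_table: list of (class_label, frequency), lowest class first.

    Returns (median_class_label, value_to_count_up_to).
    """
    n = sum(f for _, f in freq_table)
    target = (n + 1) / 2             # the value to count up to
    cumulative = 0
    for label, f in freq_table:
        cumulative += f              # cases counted so far
        if cumulative >= target:     # this class contains the target case
            return label, target

exhibit_4_2 = [(1, 2), (2, 7), (3, 3), (4, 2), (5, 4), (6, 4), (7, 1)]
exhibit_4_3 = [("1–5", 2), ("6–10", 3), ("11–15", 4), ("16–20", 5),
               ("21–25", 4), ("26–30", 3), ("31–35", 2)]

print(median_class(exhibit_4_2))  # (3, 12.0)
print(median_class(exhibit_4_3))  # ('16–20', 12.0)
```

The same function handles a class interval greater than 1 (Exhibit 4-3) when only the median class is needed.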
The same procedure is used if the data is grouped with a class interval greater
than 1 but where a median class is sufficient. The process for calculating the median
where only the median class is desired is shown in Exhibit 4-3. This frequency distribution has the same N as the previous distributions, so the first step will be the same:
calculating the value to count up to. Here again, the value is 12. Beginning with the
lowest class and counting up to 12 will put you in the 16–20 class. This class contains
between the 10th and 14th cases, but because it contains the value from the calculation, it is the median class.
Looking again at Table 4-3, the data in this distribution could be either nominal or ordinal. It could be argued, for example, that including a GED in the distribution disrupts the ordering of the categories such that the data should properly be called nominal. It could also be argued, however, that the categories are sufficiently ordered to be called ordinal. For that reason, and to ensure some consistency of examples, the same frequency distribution used to discuss the mode is used here to discuss output for the median.
The median in Table 4-3 is the same as the mode, some college (4). This was obtained in the same manner as described above:
$$\frac{347 + 1}{2} = 174$$
Counting up in the distribution (beginning at the 1 category, which is on the top in this example) puts the median in the 4th category (16 + 59 + 8 + 117). Since this category contains between the 83rd and 200th cases, it contains the 174th value. Also, since the category containing the median is sufficient in this instance, the median is said to be some college, or 4.
X         f
31–35     2
26–30     3
21–25     4
16–20     5
11–15     4
6–10      3
1–5       2
N        23
Step 1. Find the median interval: $(N + 1)/2 = 12$.
Step 2. Count up in the frequency for that class.
Step 3. That is the median class for this distribution (16–20).
Exhibit 4-3 Calculating Median Class for Grouped Data

Calculating an exact median for grouped data with a class interval greater than 1 is somewhat more complicated. Using the data set in Exhibit 4-3, the procedure to calculate an exact median begins as all others, with the formula $(N + 1)/2$. This produces the value 12, the same as in previous examples. This means that the 16–20 class is the median class. As stated above, we could count to 14 in this class, beyond the 12 needed to establish a median value. The question then becomes: Where in this class does the median lie? To find out requires interpolation within the class. Assuming the scores are evenly distributed within the class,¹ the formula for calculating the exact median is
$$Me = L_m + \left(\frac{0.5N - cf_{bm}}{f_m}\right) i$$
where $L_m$ is the lower limit of the median class, $cf_{bm}$ the cumulative frequency of the interval below the median class, $f_m$ the frequency of the median class, and $i$ the width of the interval of the median class. Using this formula with the data from Exhibit 4-3, the median is calculated as follows:
$$\begin{aligned}
Me &= 15.5 + \left(\frac{0.5(23) - 9}{5}\right)5 \\
   &= 15.5 + \left(\frac{11.5 - 9}{5}\right)5 \\
   &= 15.5 + \left(\frac{2.5}{5}\right)5 \\
   &= 15.5 + (0.5)5 \\
   &= 15.5 + 2.5 \\
   &= 18
\end{aligned}$$
The value of 15.5 is the lower limit of the 16–20 class. N is 23, as in all other examples. The cumulative frequency is determined by adding up all the frequencies below the class containing the median. In this case, there are three classes below the median class: 1–5, 6–10, and 11–15. The frequencies of these classes (2, 3, and 4, respectively) equal 9 ($cf_{bm}$). The frequency of the class containing the median in this case (16–20 class) is 5 ($f_m$). Finally, the interval is calculated by subtracting the lower limit of the median class from the upper limit (20.5 − 15.5 = 5). In this case, the result of the calculations shows that the exact midpoint of the distribution is 18.²
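The interpolation formula can be sketched as a short Python function. The function name and the tuple layout (lower real limit, upper real limit, frequency) are assumptions for this example. The sketch locates the median class with 0.5N, the quantity used inside the formula, which picks the same class as the (N + 1)/2 count for these data.

```python
def interpolated_median(classes):
    """classes: list of (lower_limit, upper_limit, frequency), lowest class first.

    Limits are the real class limits (for example 15.5 and 20.5 for 16-20).
    Implements Me = Lm + ((0.5N - cfbm) / fm) * i.
    """
    n = sum(f for _, _, f in classes)
    half = 0.5 * n
    cf_below = 0                              # cumulative frequency below the class
    for lower, upper, f in classes:
        if cf_below + f >= half:              # median falls in this class
            width = upper - lower             # i, the class width
            return lower + (half - cf_below) / f * width
        cf_below += f

exhibit_4_3 = [(0.5, 5.5, 2), (5.5, 10.5, 3), (10.5, 15.5, 4), (15.5, 20.5, 5),
               (20.5, 25.5, 4), (25.5, 30.5, 3), (30.5, 35.5, 2)]

print(interpolated_median(exhibit_4_3))  # 18.0
```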
Calculating the exact median in actual research is less often necessary. Most statistical programs report the exact median from the ungrouped data, or the researchers report only the median class or report the midpoint of the median category as the median. There are times, however, when it is necessary to determine the exact median from information in journal articles. For example, you may wish to know the exact median from Table 4-4. This table shows categories that are not only greater than 1 but are unequal. The procedure would be the same as discussed above, however. Here, the median category would be 51 to 75% [(40 + 1)/2 = 20.5]. Interpolating where in that class the exact median would be involves using the formula given above. Application of that formula for the data in Table 4-4 is shown below.

What portion of your professional research focuses on racial or ethnic issues?
Portion      Number   Percentage
0–10%           6         15
11–25%          1          2
26–50%         12         30
51–75%         14         35
Over 75%        7         18
Total          40
Source: Edwards, White, and Pezzella (1998).
Table 4-4 Research on Racial or Ethnic Issues

$$\begin{aligned}
Me &= L_m + \left(\frac{0.5N - cf_{bm}}{f_m}\right) i \\
   &= 50.5 + \left(\frac{0.5(40) - 19}{14}\right)25 \\
   &= 50.5 + \left(\frac{20 - 19}{14}\right)25 \\
   &= 50.5 + \left(\frac{1}{14}\right)25 \\
   &= 50.5 + 0.07(25) \\
   &= 50.5 + 1.75 \\
   &= 52.25
\end{aligned}$$
As you would expect, the median does not go very far into the median class in this example. This is evident because the frequency below the median class is 19, and the median falls at only the 20.5th case.
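A note on rounding in this example: the worked calculation rounds 1/14 to 0.07 before multiplying by the class width. If the fraction is instead carried through without rounding, the same formula gives a slightly larger value:

$$Me = 50.5 + \left(\frac{1}{14}\right)25 = 50.5 + \frac{25}{14} \approx 50.5 + 1.79 = 52.29$$

The difference is small, but it shows that the reported value depends on where the rounding is done.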
This process is complicated somewhat when the median class is open-ended. For example, what is the midpoint in a distribution where the upper category for annual income is $30,000 and greater? There are several methods of dealing with this issue. Probably the best is to attempt to determine what a reasonable midpoint might be. This is also shown in the example in Table 4-3, which has two open-ended categories: less than high school and postgraduate. This would make it difficult to determine, for example, where the midpoint of a postgraduate degree would lie (some graduate work, master’s degree, law degree, etc.). This would have to be a judgment call by the researcher based on theory and an understanding of the data.
Mean
A statistician is a person who stands in a bucket of ice water, sticks his head in an
oven, and says, “On average, I feel fine.”
—Unknown
The most popular measure of central tendency, both among statisticians and the general population, is the mean. The mean is used primarily for interval and ratio level data. Because it assumes equality of intervals, the mean is generally not used with nominal or ordinal level data. The mean is very important to statistical analysis because it is the basis, along with the variance (see the discussion of measures of dispersion in Chapter 5), of many of the formulas for higher-order statistical procedures. The mean also serves as a check on the integrity of the data. As discussed above, the mean is often heavily influenced by extreme scores. So if a 17 has been mistyped as 177, the mean will be much larger than expected. Mean scores outside what would be expected for the data should be a signal to recheck the data.
There are actually several different versions of the mean. The mean discussed in this chapter is the arithmetic mean (from here on, called the mean). There are variations of the mean that are less utilized in social science research and are not discussed here. These include the weighted mean, harmonic mean, and geometric mean.
The symbolic notation for the mean is different from the symbols that have been used to this point. The mean is symbolized either by $\mu$ or $\bar{X}$, depending on whether the data is a population or sample estimate (this distinction is used most often in Chapter 15 and beyond). It is interesting that descriptive statistics deals with a population, but it has become convention that the mean most commonly used in descriptive statistics is actually the symbol for the sample mean ($\bar{X}$). Since most texts use this notation for the mean in descriptive analyses, it will also be used here, even though the more proper notation would be the population mean ($\mu$).
The mean is simply the average of all the values in a distribution. To obtain the mean, add up the scores in a distribution and divide by N (just as in calculating an average). In statistical terms, the mean is calculated as
$$\bar{X} = \frac{\sum fx}{N}$$
where fx is calculated by multiplying X times the frequency for each value. In the example used in Exhibit 4-2, the mean would be calculated as in Exhibit 4-4. Here, each X is multiplied by the frequency for that category (7 × 1, 6 × 4, etc.). That creates an fx column in the table, which is then summed to obtain Σfx (84). That value is then divided by the N for the distribution (23) to obtain the mean for the distribution. In this case, there were 23 prisons that had a total of 84 escapes, so the mean (average) number of escapes for these 23 prisons was 3.65 escapes.

X     f     fx
7     1      7
6     4     24
5     4     20
4     2      8
3     3      9
2     7     14
1     2      2
N = 23      Σfx = 84
$$\bar{X} = \frac{\sum fx}{N} = \frac{84}{23} = 3.65$$
Exhibit 4-4 Calculating the Mean
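The fx calculation can be sketched in Python as follows. The function name is hypothetical. The first data set is the frequency distribution from Exhibit 4-2; the second uses the six valid education categories and their frequencies from Table 4-3.

```python
def mean_from_frequencies(freq_table):
    """freq_table: list of (X, f) pairs. Returns (sum of fx, N, mean)."""
    sum_fx = sum(x * f for x, f in freq_table)   # the summed fx column
    n = sum(f for _, f in freq_table)            # N is the sum of the frequencies
    return sum_fx, n, sum_fx / n

exhibit_4_2 = [(7, 1), (6, 4), (5, 4), (4, 2), (3, 3), (2, 7), (1, 2)]
sum_fx, n, mean = mean_from_frequencies(exhibit_4_2)
print(sum_fx, n, round(mean, 2))     # 84 23 3.65

table_4_3 = [(1, 16), (2, 59), (3, 8), (4, 117), (5, 72), (6, 64)]
sum_fx_ed, n_ed, mean_ed = mean_from_frequencies(table_4_3)
print(sum_fx_ed, n_ed, round(mean_ed, 2))  # 1370 336 4.08
```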
The procedure for calculating the mean for grouped and ungrouped data is the same. The only difference is that for grouped data where the class interval is greater than 1, the midpoint of the class is used as X.³ For example, in the frequency distribution in Exhibit 4-3, the midpoint of the 1–5 class would be 3 (the class width is 5.5 − 0.5 = 5; half of that, 2.5, added to the lower limit of 0.5 gives 3), and the midpoints of the next classes would be 8, 13, and so on. These are the values that would be used for X in the formula for the mean.
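A sketch of the grouped-data calculation for Exhibit 4-3 follows. It assumes the midpoint of a class is the value halfway between its lower and upper limits, which is the same point whether the stated limits (1 and 5) or the real limits (0.5 and 5.5) are used. The variable names are illustrative.

```python
# (lower stated limit, upper stated limit, frequency) for Exhibit 4-3.
classes = [(1, 5, 2), (6, 10, 3), (11, 15, 4), (16, 20, 5),
           (21, 25, 4), (26, 30, 3), (31, 35, 2)]

# Midpoint of each class: halfway between its limits.
midpoints = [(low + high) / 2 for low, high, _ in classes]

n = sum(f for _, _, f in classes)
sum_fx = sum(m * f for m, (_, _, f) in zip(midpoints, classes))
grouped_mean = sum_fx / n

print(midpoints)              # [3.0, 8.0, 13.0, 18.0, 23.0, 28.0, 33.0]
print(sum_fx, n, grouped_mean)  # 414.0 23 18.0
```

Because this distribution is symmetric, the grouped mean (18) matches the interpolated median found earlier for the same exhibit.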
I
The mean can be calculated from the example output that was used for the mode and median (as shown in Table 4-3). Note that this data is not interval or ratio level and is used here only to show the similarities and differences among the mean, median, and mode. Even though this is nominal/ordinal level data, SPSS treats it as interval level and uses the formula above for calculating the mean. In this example, each of the category values (1 through 6) is multiplied by the frequency for that category (1 × 16, 2 × 59, etc.). These fx values are summed to achieve a total of 1370. This sum is then divided by the number of valid cases (N minus the 11 missing values), which is 336. The result is 4.077, which is what SPSS reported (rounded to 4.08).
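As a quick check on the arithmetic above, the division can be written out in full. The subscripted symbol $N_{\text{valid}}$ is notation introduced here for the 336 non-missing cases; it does not appear in the text or in the SPSS output.

```latex
\bar{X} = \frac{\sum fx}{N_{\text{valid}}} = \frac{1370}{336} \approx 4.077
```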
In most cases in real research, the mean may not be accompanied by a frequency distribution, or the frequency distribution will be more for presentation than for analysis. In such cases, the mean may be reported alone, or it could be reported as part of a discussion or table of univariate statistics associated with the research.
The mean has several advantages over other measures of central tendency. From a practical standpoint, the mean is preferred because it is standardized. This means it can be compared across distributions. This is very beneficial when comparing similar data from different sources, such as the mean number of prisoners per institution in several states, because the values can be directly compared. The mean is also important because the sum of the deviations of the scores from the mean is always zero. That is, if the mean were subtracted from each value in a distribution, the sum of those differences would be zero. This is discussed in detail in Chapter 5. A final important characteristic of the mean is that the sum of the squared deviations from the mean is the smallest value for summed squared deviations (smaller than if the same calculations were made for the mode or median). This principle of sum of squares is very important to our discussions in Chapter 5 of the variance and sum of squares as they relate to regression lines.
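Both properties can be checked numerically. The sketch below reuses the Exhibit 4-4 escape data and assumes nothing beyond Python's standard library; the helper names `sum_of_deviations` and `sum_of_squares` are introduced here for illustration only.

```python
from fractions import Fraction
from statistics import median, mode

# Exhibit 4-4 distribution expanded into 23 individual scores.
freq = {7: 1, 6: 4, 5: 4, 4: 2, 3: 3, 2: 7, 1: 2}
scores = [x for x, f in freq.items() for _ in range(f)]

# Exact mean (84/23) as a fraction, so no floating-point rounding intrudes.
mean = Fraction(sum(scores), len(scores))


def sum_of_deviations(center):
    """Sum of signed deviations of every score from `center`."""
    return sum(x - center for x in scores)


def sum_of_squares(center):
    """Sum of squared deviations of every score from `center`."""
    return sum((x - center) ** 2 for x in scores)


print(sum_of_deviations(mean))                 # 0
print(round(float(sum_of_squares(mean)), 2))   # 75.22
print(sum_of_squares(median(scores)))          # 85  (median is 3)
print(sum_of_squares(mode(scores)))            # 138 (mode is 2)
```

The deviations around the mean cancel exactly, and the sum of squares around the mean is smaller than the same sum taken around the median or the mode.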
As discussed above, the greatest problem with the mean is that it is greatly influenced by extreme scores in the distribution. The example in the section on the median, where a mean age of 15 was obtained when all but one of the values was less than 5, shows how much the mean can be influenced by extreme scores. That is why the median is used in cases where the data is skewed.
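The effect of a single extreme score can be sketched as follows. The ages below are hypothetical values chosen only so that all but one fall below 5 and the mean works out to 15; they are an assumption, not the data from the earlier median example.

```python
from statistics import median

# Hypothetical ages (an assumption, not the chapter's data):
# four values under 5 plus one extreme score.
ages = [1, 2, 3, 4, 65]

mean_age = sum(ages) / len(ages)
median_age = median(ages)

print(mean_age)    # 15.0
print(median_age)  # 3

# Dropping the extreme score moves the mean a great deal
# but barely moves the median.
trimmed = ages[:-1]
print(sum(trimmed) / len(trimmed))  # 2.5
print(median(trimmed))              # 2.5
```

One value of 65 pulls the mean to 15, a figure larger than four of the five ages, while the median stays at 3.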
4-3 Selecting the Most Appropriate Measure of Central Tendency
The goal of many statistical analyses is to be able to develop summary statements, often about a large amount of data. Proper summarization depends on several factors, including the level of data, the nature of the data, the purpose of the summarization, and the interpretation.
The level of data has a substantial influence on which measure of central tendency should be used. As stated earlier, one measure is most appropriate for a particular level of data. The mode is most appropriate for nominal level data, and its use with ordinal and interval level data would result in a loss of power in terms of the information that could be gained from the data. The median is most appropriate with ordinal level data. Although it can be used with interval level data (especially skewed distributions), it should not be used with nominal level data because the rankings assumed in the median cannot be achieved with nominal level data. Finally, the mean should be used only with interval or ratio level data because it assumes equal intervals of the data that cannot be achieved by nominal and partially ordered ordinal level data. The exception here is that the mean can be used with dichotomized nominal level data because this type of data approximates interval level characteristics.
Selection of the most appropriate measure of central tendency is also sometimes based on the nature of the distribution. As discussed above, if a distribution is highly skewed, or if it can be determined that there are some extreme values (outliers) in the distribution that would make the mean inaccurate as a measure of central tendency, the median should be used rather than the mean.
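The guidance on level of data and skew can be condensed into a small lookup. The function `recommend_measure` and its parameters are hypothetical names used only for this sketch; it deliberately ignores the dichotomized nominal exception noted above and should be read as a summary of the rules of thumb, not a substitute for judgment.

```python
def recommend_measure(level, skewed=False):
    """Suggest a measure of central tendency for a level of measurement.

    `level` is one of "nominal", "ordinal", "interval", or "ratio".
    `skewed` flags a highly skewed distribution or one with outliers.
    """
    level = level.lower()
    if level == "nominal":
        return "mode"
    if level == "ordinal":
        return "median"
    if level in ("interval", "ratio"):
        # Extreme scores distort the mean, so fall back to the median.
        return "median" if skewed else "mean"
    raise ValueError(f"unknown level of measurement: {level!r}")


print(recommend_measure("nominal"))              # mode
print(recommend_measure("ordinal"))              # median
print(recommend_measure("ratio"))                # mean
print(recommend_measure("ratio", skewed=True))   # median
```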
Another criterion for choosing a measure of central tendency is the purpose of summarization, typically in terms of what you are trying to predict. Imagine that you were asked to state one measure that would best capture the nature of a distribution. How would you go about that? To put it another way, you might bet $100 to guess a number drawn at random from a distribution. Which number would you choose? One way to address these questions would be to find the score that would be at the “heart” of the distribution: the most common score, the one that cuts the distribution in half, or the average score. That is the goal and the role of measures of central tendency. There are several ways to go about this.
If you knew all the values in the distribution, you could calculate the mode easily and quickly. If you are interested in predicting an exact value, you should probably use the mode because it has the highest probability of occurring in any given distribution. Both the median and the mean may produce values that are not in the distribution, so if you must guess and be absolutely right as to the number, use the mode. For example, say you are taking a multiple-choice test and have no idea which answer to a certain question is correct. If you had the distribution of correct answers for that professor for that test, you would want to choose the modal answer rather than the median or mean. This is because you must get the answer correct or it does not count. As another example, consider a prediction based on driving a car around an obstruction placed in front of it. If tests occur over a number of drivers, the distribution would be bimodal: some steering to the left and some to the right. A suggested course of action would not be the median or mean, however, as that would have the vehicle crashing into the obstacle even though it minimized the error in steering.
If, on the other hand, you want to improve your prediction by getting closest to the number over several tries, thereby minimizing your error, the median might be a better choice. Here, whether you miss high or low is irrelevant; what is important is the size of the error. In a popular game show, contestants are given $7 and required to guess the exact digits in the price of a car. For each number by which a guess is off, they lose $1. If they have money left over after making all the guesses, they win the car; if they run out of money, they lose. The probability of each digit plays a big part in the first two or three digits. You would not want to guess 9 for the first digit, for example. If contestants are at the fourth or fifth digit, however, and still have money left, they may want to choose the median value (probably a 5) to minimize the error (loss of dollars). Being high or low does not matter here, only deviation from the number.
Finally, if you have the opportunity to average your misses over several guesses and the signs do matter (high guesses can offset low guesses), the mean is the best choice. The mean is good in that if you do not know a value, it is often best to choose the average. For example, if you had to guess the weight of a woman whom you had never seen, you should probably choose the mean weight for women because this would minimize the error. The mean is also practically the only choice when using estimates in higher-order analyses because the mathematical properties of both the mode and the median are such that they do not lend themselves to inclusion in other formulas. The mean is less efficient, however, with highly skewed distributions.
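The three prediction goals described above can be compared on one distribution. This sketch reuses the Exhibit 4-4 escape data purely for illustration; the scoring functions `exact_hits`, `total_absolute_error`, and `net_signed_error` are names introduced here, not terms from the text.

```python
from fractions import Fraction

# Exhibit 4-4 distribution expanded into 23 individual scores.
freq = {7: 1, 6: 4, 5: 4, 4: 2, 3: 3, 2: 7, 1: 2}
scores = [x for x, f in freq.items() for _ in range(f)]
candidates = range(1, 8)  # whole-number guesses from 1 through 7


def exact_hits(guess):
    """How many scores the guess matches exactly."""
    return scores.count(guess)


def total_absolute_error(guess):
    """Total size of the misses, ignoring whether they are high or low."""
    return sum(abs(x - guess) for x in scores)


def net_signed_error(guess):
    """Misses added up with their signs, so highs can offset lows."""
    return sum(x - guess for x in scores)


best_exact = max(candidates, key=exact_hits)              # the mode
best_absolute = min(candidates, key=total_absolute_error)  # the median
mean = Fraction(sum(scores), len(scores))                  # exactly 84/23

print(best_exact)              # 2
print(best_absolute)           # 3
print(net_signed_error(mean))  # 0
```

Guessing the mode (2) is right most often, guessing the median (3) gives the smallest total miss regardless of direction, and guessing the mean makes the high and low misses cancel exactly.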
The final criterion for selecting a particular measure of central tendency is the interpretation. If you choose the wrong level of measurement and base your measure of central tendency on that choice, your interpretation may very well not make sense. For example, for a nominal level variable such as paint color, the mode makes sense (more people chose red than any other color). The median does not make much sense, however. For example, if you say half or fewer of the respondents chose red, what does that mean? There is no reference point because there is no order. The same holds true for the mean. How could you interpret an average of 1.8 on paint color: that the average color chosen was slightly different from red? It is easier to use lower-level measures of central tendency with higher levels of measurement, but you lose some of the power of your interpretation. For example, it is technically correct to say the modal age in a class is 20, but it is not as precise as saying the average age is 22.4.
4-4 Conclusion
In this chapter, we introduced univariate analyses by discussing the first of the univariate descriptive statistics, measures of central tendency. Measures of central tendency are among the most used descriptive statistics and provide the most information. For example, if you were asked about a group of people, you might provide an answer in terms of an average age or average income.
The measures of central tendency provide the information that their name implies: a measure of the central value. Think of a seesaw. For a seesaw to work properly, it must have a balance point in the middle so the weight is distributed generally equally on each side (as in Figure 4-6). Here, the measure of central tendency is at the balance point of the distribution. The picture of a seesaw, however, could easily be replaced with a histogram of a frequency distribution. If only the X axis were retained, the seesaw could look like the bar chart in Figure 4-7. This distribution is unusual in that all three measures coincide. Mathematically, the mean equals 4. Since 4 is the most frequently occurring value, it is also the mode; and because 4 is the middlemost point in the distribution, it is also the median. If the values of the distribution were changed some, the balance point would have to shift to keep the balance of the distribution. For example, in Figure 4-8, the mean, median, and mode are at different points. This is because of the spread and alignment of the values in the distribution.
You can see that just knowing the measure of central tendency is not always enough. Sometimes it is also important to know how spread out the values are or how they are arranged in the distribution. This is the reason that more than measures of central tendency are needed for a proper description of data. In Chapter 5, we address how spread out the values are in the distribution, and in Chapter 6 we discuss the arrangement of the data within the distribution. Together, these three pieces of information make up the complete analysis of a single variable (univariate analysis).

Figure 4-6 Balancing a Distribution on the Measure of Central Tendency

Figure 4-7 Histogram of Balanced Frequency Distribution (X axis: values 1 through 7)

Figure 4-8 Histogram of Unbalanced Distribution (X axis: values 1 through 7, with the mode (Mo), mean, and median (Me) marked)
4-5 Key Terms
central tendency
dispersion
form
mean
median
mode
sum of squares

4-6 Summary of Equations

Median (Me) for ungrouped data

$$\frac{N + 1}{2}$$

Median (Me) for grouped data

$$Me = L_m + \left(\frac{0.5N - cf_{bm}}{f_m}\right) i$$

Mean ($\bar{X}$)

$$\bar{X} = \frac{\sum fx}{N}$$
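The grouped-data median formula can be sketched in code. The reading of the symbols used here is an assumption based on standard usage, since their definitions fall outside this part of the chapter: L_m is taken as the lower real limit of the median class, cf_bm as the cumulative frequency below that class, f_m as its frequency, and i as the width of the class interval. The function name `grouped_median` and the example classes are hypothetical and are not drawn from the chapter's exhibits or exercises.

```python
def grouped_median(classes):
    """Median for grouped data: Me = L_m + ((0.5N - cf_bm) / f_m) * i.

    `classes` is a list of (lower, upper, frequency) tuples in ascending
    order, with whole-number stated limits such as (11, 15, 8).
    """
    n = sum(f for _, _, f in classes)
    half = 0.5 * n
    cf_below = 0  # cumulative frequency below the current class (cf_bm)
    for lower, upper, f in classes:
        if f > 0 and cf_below + f >= half:
            real_lower = lower - 0.5    # L_m, assuming whole-number scores
            width = upper - lower + 1   # i, the class interval width
            return real_lower + (half - cf_below) / f * width
        cf_below += f
    raise ValueError("the distribution contains no cases")


# Hypothetical grouped distribution (N = 20), not taken from the chapter.
example = [(1, 5, 3), (6, 10, 5), (11, 15, 8), (16, 20, 4)]
print(grouped_median(example))  # 11.75
```

In the example, half of N is 10, which falls in the 11–15 class (8 cases lie below it), so the median is 10.5 + (2/8) × 5 = 11.75. As Note 1 points out, this rests on the assumption that the cases in the median class are spread evenly across it.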
4-7 Exercises
The exercises for this chapter and Chapters 5 and 6 use the same examples. This will allow you to work through problems using all three types of univariate descriptive statistics.

1. For the set of data below, calculate:
   a. The mode
   b. The median
   c. The mean

   6, 7, 8, 10, 10, 10, 12, 14

2. For the set of data below, calculate:
   a. The mode
   b. The median
   c. The mean

   7, 4, 2, 3, 4, 5, 8, 1, 9, 4

3. For the set of data below, calculate:
   a. The mode
   b. The median
   c. The mean

   Interval    Midpoint    Frequency
   90–100                      6
   80–89                       8
   70–79                       4
   60–69                       3
   50–59                       2

4. For the set of data below, calculate:
   a. The mode
   b. The median
   c. The mean

   Interval    f
   90–100      5
   80–89       7
   70–79       9
   60–69       4

5. For each of the variables in the frequency tables that follow (from the gang database), describe the level of measurement for each variable and how you determined your answer.

6. Using the frequency tables that follow (from the gang database), discuss the three measures of central tendency.

HOME: What type of house do you live in?

Value Label   Value   Frequency   Percent   Valid Percent   Cumulative Percent
House           1        280        81.6        82.4              82.4
Duplex          2          3          .9          .9              83.2
Trailer         3         34         9.9        10.0              93.2
Apartment       4         21         6.1         6.2              99.4
Other           5          2          .6          .6             100.0
Missing                    3          .9
Total                    343       100.0       100.0

N   Valid                     340
    Missing                     3
Mean                         1.41
Std. Error of Mean           .051
Median                          1
Mode                            1
Std. Deviation              0.945
Variance                     .892
Skewness                    2.001
Std. Error of Skewness       .132
Kurtosis                    2.613
Std. Error of Kurtosis       .264
Range                           5

ARREST: How many times have you been arrested?

Value     Frequency   Percent   Valid Percent   Cumulative Percent
0            243        70.8        86.2              86.2
1             23         6.7         8.2              94.3
2             10         2.9         3.5              97.9
3              3          .9         1.1              98.9
5              2          .6          .7              99.6
24             1          .3          .4             100.0
Missing       61        17.8
Total        343       100.0       100.0

N   Valid                     282
    Missing                    61
Mean                          .30
Std. Error of Mean           .093
Median                          0
Mode                            0
Std. Deviation              1.567
Variance                    2.455
Skewness                   12.692
Std. Error of Skewness       .145
Kurtosis                  187.898
Std. Error of Kurtosis       .289
Range                          24

TENURE: How long have you lived at your current address (months)?

Value     Frequency   Percent   Valid Percent   Cumulative Percent
1             14         4.1         4.3               4.3
2              6         1.7         1.8               6.1
3              4         1.2         1.2               7.3
4              4         1.2         1.2               8.6
5              6         1.7         1.8              10.4
6              6         1.7         1.8              12.2
7              1          .3          .3              12.5
8              3          .9          .9              13.5
9              2          .6          .6              14.1
10             1          .3          .3              14.4
11             1          .3          .3              14.7
12            11         3.2         3.4              18.0
14             1          .3          .3              18.3
18             5         1.5         1.5              19.9
21             1          .3          .3              20.2
24            30         8.7         9.2              29.4
30             1          .3          .3              29.7
31             1          .3          .3              30.0
32             1          .3          .3              30.3
36            22         6.4         6.7              37.0
42             1          .3          .3              37.3
48            12         3.5         3.7              41.0
60            24         7.0         7.3              48.3
72            14         4.1         4.3              52.6
76             1          .3          .3              52.9
84             8         2.3         2.4              55.4
96            18         5.2         5.5              60.9
108            4         1.2         1.2              62.1
120            9         2.6         2.8              64.8
132           11         3.2         3.4              68.2
144           21         6.1         6.4              74.6
156           13         3.8         4.0              78.6
168           11         3.2         3.4              82.0
170            5         1.5         1.5              83.5
180            7         2.0         2.1              85.6
182            2          .6          .6              86.2
186            1          .3          .3              86.5
192           14         4.1         4.3              90.8
198            1          .3          .3              91.1
204           24         7.0         7.3              98.5
216            3          .9          .9              99.4
240            2          .6          .6             100.0
Missing       16         4.7
Total        343       100.0       100.0

N   Valid                     327
    Missing                    16
Mean                        88.77
Std. Error of Mean          3.880
Median                         72
Mode                           24
Std. Deviation             70.164
Variance                 4923.055
Skewness                     .365
Std. Error of Skewness       .135
Kurtosis                   −1.284
Std. Error of Kurtosis       .269
Range                         239

SIBS: How many brothers and sisters do you have?

Value     Frequency   Percent   Valid Percent   Cumulative Percent
0             39        11.4        11.5              11.5
1            137        39.9        40.5              52.1
2             79        23.0        23.4              75.4
3             39        11.4        11.5              87.0
4             17         5.0         5.0              92.0
5             13         3.8         3.8              95.9
6              6         1.7         1.8              97.6
7              4         1.2         1.2              98.8
9              1          .3          .3              99.1
10             1          .3          .3              99.4
12             1          .3          .3              99.7
15             1          .3          .3             100.0
Missing        5         1.5
Total        343       100.0       100.0

N   Valid                     338
    Missing                     5
Mean                         1.94
Std. Error of Mean           .098
Median                          1
Mode                            1
Std. Deviation              1.801
Variance                    3.245
Skewness                    2.664
Std. Error of Skewness       .133
Kurtosis                   12.027
Std. Error of Kurtosis       .265
Range                          15

4-8 References
Edwards, W. J., White, N., Bennett, I., & Pezzella, F. (1998). Who has come out of the pipeline? African Americans in criminology and criminal justice. Journal of Criminal Justice Education, 9(2), 249–266.
Galton, F. (1883). Inquiries into Human Faculty and Its Development. London, England: Macmillan.
Pearson, K. (1895). Classification of asymmetrical frequency curves in general: Types actually occurring. Philosophical Transactions of the Royal Society of London (Series A, Vol. 186). London, England: Cambridge University Press.

4-9 Notes
1. This may not be a valid assumption, and it is possible, for example, that all the scores could be 14, but it would be impossible to calculate the median without deconstructing the values, so an assumption is made that all values in the median class are equally distributed.
2. For future reference, this formula is the same (except for the 0.5) as the one used for computing percentiles because the median is the 50th percentile of the distribution.
3. This procedure assumes closed intervals for each class. If you have a situation, say, where the oldest category of an age distribution is “6 and above,” it is more difficult to determine the midpoint. It is sometimes necessary to make an estimate of where the central value of the class might be.

Criminal Justice on the Web
Visit http://criminaljustice.jbpub.com/Stats4e to make full use of today’s teaching and technology! Our interactive Companion Website has been designed to specifically complement Statistics in Criminology and Criminal Justice: Analysis and Interpretation, 4th Edition. The resources available include a Glossary, Flashcards, Crossword Puzzles, Practice Quizzes, Weblinks, and Student Data Sets. Test yourself today!