WEEK 1, HW1 (PART 1): DESCRIPTIVE STATISTICS
We start with DESCRIPTIVE STATISTICS where we simply want to “see” what our data set looks like. Later in the course
we move on to INFERENTIAL STATISTICS where we try to learn what our sample data can tell us about the actual, entire
population from which our sample was taken.
INTRODUCTION (please read carefully and post questions if anything is not clear): There are 1001 expressions that
relate to statistics in our lives. My favorites are: “Life is a crap shoot”, “Pay your money, and take your chances”, and
“What could possibly go wrong?” (The last is the “mantra” of the Darwin Awards). Of course there are also phrases that
show how we ignore data and the statistical analysis of it. This is DENIAL, which we have all likely practiced (as many
did in this last election), at least in our own minds: “My mind is made up, don’t confuse me with the facts!!”, “I really
want that car; I don’t care about its safety rating or gas mileage!”, “I love fried chicken and pork BBQ; I don’t care
about the grease and salt!”, “I don’t need a flu shot!”, “So I’m overweight, smoke, and drink Coke™, who cares?!”, “I only
buy ‘organic’ produce, milk and eggs; it’s worth the much higher cost.” What statistical denials have YOU made, or are
YOU making?
Research suggests that our brain’s pre-frontal cortex does not mature or “kick-in” until our early twenties. This cortex is
where experiences are tied together and we start to see the possible consequences of our actions. Up to then, we are
“immortal”, as in the “Born to Be Free” song. Of course some life events (violence) speed up this process, which you
can imagine, and not always in a good way. Then again, some of us never “mature.” Moving on . . .
Most of our life decisions are (or should be) based on statistics: what is the safest car to buy, what picks should I make
for my fantasy team, what foods are healthiest, what medicine can best relieve my headache, can I afford this house,
what degree offers the highest job/salary potential, which lottery ticket should I buy, which political candidate will best
help me (preferably, which will be best for our country), etc.
We base many of these decisions on the ads we’ve seen or read. Those ads cite studies conducted on their products or
services. Those studies are statistically based (though NOT necessarily sound statistics). If you watch any TV show you
have seen countless ads for drugs that spend more time listing possible hazards than likely benefits. Wonder why? This
is a CYA deal. In the testing or actual use of their product, some persons have developed those conditions. We hope
they are rare occurrences, but we aren’t given that information (pay your money, take your chances).
To make all these corporate and personal decisions, both they and we need DATA. Keep in mind that the goal is to predict
what an entire POPULATION (e.g., age group) will do based on SAMPLES taken from that population. STATISTICS
ALLOWS US TO MAKE PREDICTIONS ABOUT A POPULATION BASED ON SAMPLES FROM THAT POPULATION. It gives us
the odds (the probability) of success (or failure).
Now, let’s assume you ARE a mature “critical thinker” who seeks out hard data and valid statistical analyses (good luck
with that). It’s out there in “peer-reviewed” studies and sound science research, if you look. BUT, it’s much easier to
find the more readily available, typically very biased data – the “alternative facts.” BUT, let’s be clear – there are NO
alternative facts. Facts are facts. There may be different interpretations of why something is a fact, but not that there is
a different fact.
So, how do we handle these different interpretations? We BALANCE THIS BIAS, meaning we look at the extreme views and
their supporting data and then form OUR own opinion from these extremes. This can work. This is CRITICAL THINKING
and is what education is all about. Unfortunately, as this topic stated up front, far too many of us simply pick the data
sources that match, for whatever reason, our personal biases, and that polarization certainly stops any compromise,
meaning progress, that would ultimately benefit us all. Moving on . . . (again):
Let’s talk “DATA”. What is it, how do we collect it, and most importantly, what makes it good, meaning valid? There
are two types of data: qualitative and quantitative.
• QUALITATIVE: Color of cars, taste of beer (hoppy, fruity, molasses), rankings like “unsatisfied, satisfied, very
satisfied”, numbers used as rankings like 1-4, $$$, etc.
• QUANTITATIVE: Heights, weights, income, home prices, IQ, test scores; almost anything that can be measured
mathematically (except numerical rankings). There are two types of quantitative data: discrete and continuous:
o DISCRETE: These are WHOLE numbers like number of children, where an average that is not a whole
number might sound ridiculous (e.g., average U.S. family has 2.6 children).
o CONTINUOUS: Numbers where fractions are realistic like heights, weights, age. Money can go either
way, but let’s go with continuous. Rounding off can create some error.
There are FOUR SCALES or LEVELS of MEASUREMENT used for the above data types, and this is important to remember
(likely Final Exam question; Illowsky p. 26).
• NOMINAL (scale or level): Qualitative data are measured on this scale. The unique characteristic is that no
statistical calculation works (would be invalid or nonsense) on NOMINAL data. Even putting the choices like
car colors in a particular order makes no real sense: red, yellow, white, blue or blue, white, red, yellow. So what
now?
• ORDINAL (scale or level): Qualitative data can also be measured on this scale. Here we have our RANKINGS
using choices like poor/fair/good or $$$ or even numbers 1-4. Data measured on this scale CAN be put into a
meaningful order in that $$$$ is logically higher than $$. HOWEVER, ORDINAL scale data, as with Nominal
scale data, can NOT be analyzed statistically. How much better is a restaurant with 3 “smiley faces”
than one with 2 smiley faces? We can’t calculate this and, more importantly, we don’t know what each ranking
was based on. People may like the style of food, its presentation, its quantity, or they may not like dirty
silverware or unclean restrooms. Who knows?? You may even be asked to rank each of these qualitative
areas, but they are still QUALITATIVE, hence these data cannot be analyzed statistically. Also, be careful with
numerical rankings like “1 – 5”. These are no more appropriate for statistical analysis than smiley faces.
• INTERVAL (scale or level): We have meaningful numbers. Some Quantitative data are measured on this scale,
BUT this scale has NO TRUE ZERO POINT. Temperatures are a good example. Differences in data DO make sense,
BUT ratios do not. You can calculate average summer/winter temperatures for an area, let’s say 80 °F /
20 °F, BUT we can NOT say that it is 4 times hotter in summer than winter. Why? Because 0 °F and 0 °C are
NOT absolute zero. Temperatures measured on the KELVIN scale DO go down to absolute zero (when essentially all
molecular motion stops). On this scale we CAN say that 100 K is twice as hot as 50 K, where “hot” refers to the
amount of molecular motion. This motion can be “seen” when you boil water and the molecules of water
actually have enough energy to “jump out” of the liquid phase and become steam (gas phase). (The other state
of matter is solid, like ice.) You can even “freeze” (solidify) the gas CO2 as dry ice.
• RATIO (scale or level): Now we’re talking!! We have a meaningful zero and we can do ALL the statistical
calculations that might apply to this data set. An example would be class grades based on points earned out of
100. This works for most courses with multiple-choice tests, but what about essay questions? Can you
statistically compare the grades in a course in which grades are based totally on multiple-choice exams to one
(in the same subject) in which the grade is based totally on essay question exams? NO! So be careful that you
are ALWAYS comparing apples to apples. This is where knowing what the data are based on is the FIRST critical
consideration in evaluating any statistical analysis.
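The temperature example under INTERVAL can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the 80 °F / 20 °F values are the ones used above. It shows that the difference (60 degrees) is meaningful, while the "4 times hotter" ratio disappears once the same temperatures are put on the Kelvin scale (the ratio comes out at roughly 1.13).

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

summer_f, winter_f = 80.0, 20.0  # the averages used in the INTERVAL example

# Differences ARE meaningful on an interval scale.
difference_f = summer_f - winter_f

# The naive ratio of Fahrenheit readings is NOT meaningful.
naive_ratio = summer_f / winter_f

# The same two temperatures on the Kelvin (ratio) scale.
summer_k = fahrenheit_to_kelvin(summer_f)
winter_k = fahrenheit_to_kelvin(winter_f)
kelvin_ratio = summer_k / winter_k

print(f"Difference: {difference_f:.0f} degrees F")
print(f"Naive Fahrenheit ratio: {naive_ratio:.1f}")
print(f"Kelvin ratio: {kelvin_ratio:.2f}")
```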
Data collection or “SAMPLING” is the next topic. What are we sampling? These are samples of a specific characteristic
of an entire POPULATION, and it is RARELY possible to sample an entire population. But, if we did and calculated the
mean of all those data, that mean would be considered a PARAMETER of the population. HOWEVER, the mean of a
sample is referred to as a STATISTIC. REMEMBER THIS (Final Exam likely).
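To make the PARAMETER vs. STATISTIC distinction concrete, here is a small Python sketch. The population is made up (1,000 randomly generated ages) purely for illustration; the point is only that the parameter uses every member, while the statistic uses a sample and will usually land near, but not exactly on, the parameter.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1,000 people (made-up values).
population = [random.randint(18, 65) for _ in range(1000)]

# PARAMETER: computed from every member of the population.
population_mean = statistics.mean(population)

# STATISTIC: computed from a simple random sample of 50 members.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```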
There are FIVE data collection or sampling protocols we will cover. The INTENT of all is to get a REPRESENTATIVE SAMPLE
of the population (methods: Illowsky p. 18):
• SIMPLE RANDOM sampling is the first. “Random” means that EVERY member of the population has an EQUAL chance
(probability) of being selected. Example: You have twenty grandchildren that you like equally well but you can only
afford to send $10 holiday presents to five (the rest get $5 each). Put all twenty names in a hat and draw five.
• STRATIFIED: Divide the population into logical groups (or strata, which means layers; a little confusing). You
want to determine the average age of students in each of the ten UMUC departments (let’s assume there are
only ten). Then, take a simple random sample of students from each Department.
• CLUSTER: This sampling method starts like Stratified in that all groups in a population are identified. BUT, then
we use simple random sampling to decide on only a portion (cluster) of those groups. Next, we use simple
random sampling to collect our data from each of the groups in that cluster.
• SYSTEMATIC: A little tricky. Remember that we want EVERY person or item in the population to have an equal
chance of being selected. So, this seems to require that we know the size of our population. We also need to
decide how many samples we can afford to take. Divide the population size by the sample size and save that
number (call it k). We then pick our starting point from a random numbers table or generator and proceed to collect
the desired data (information) from every k-th person or item (e.g., every k-th item on a conveyor belt for quality
assurance).
• CONVENIENCE: It is what it is. Poll the classmates, poll the neighbors, count the cars at a nearby intersection.
Some samples collected this way will produce valid statistical results, but MANY won’t. In
some cases this is deliberate BIAS and assumes that readers will NOT question or look into how the data were
collected.
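The five protocols above can be sketched in Python as follows. This is a rough illustration, not a prescribed method: the population (100 made-up student IDs spread over 10 departments) and all function names are assumptions invented for the example.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 student IDs spread over 10 departments.
population = [{"id": i, "dept": i % 10} for i in range(100)]

def simple_random(pop, n):
    """Every member has an equal chance of being chosen."""
    return random.sample(pop, n)

def group_by_dept(pop):
    """Split the population into its departments (the groups)."""
    groups = {}
    for person in pop:
        groups.setdefault(person["dept"], []).append(person)
    return groups

def stratified(pop, n_per_group):
    """Take a simple random sample from EVERY group."""
    chosen = []
    for members in group_by_dept(pop).values():
        chosen.extend(random.sample(members, n_per_group))
    return chosen

def cluster(pop, n_groups, n_per_group):
    """Randomly pick SOME groups, then sample within each picked group."""
    groups = group_by_dept(pop)
    picked = random.sample(sorted(groups), n_groups)
    chosen = []
    for dept in picked:
        chosen.extend(random.sample(groups[dept], n_per_group))
    return chosen

def systematic(pop, n):
    """Random start, then every k-th member, where k = population // n."""
    k = len(pop) // n
    start = random.randrange(k)
    return pop[start::k][:n]

def convenience(pop, n):
    """Whoever is easiest to reach: here, simply the first n members."""
    return pop[:n]

print(len(simple_random(population, 5)), "chosen by simple random sampling")
print(len(stratified(population, 2)), "chosen by stratified sampling")
print(len(cluster(population, 3, 4)), "chosen by cluster sampling")
print([p["id"] for p in systematic(population, 20)])
print([p["id"] for p in convenience(population, 5)])
```

Notice that only the convenience function never calls the random number generator, which is exactly why its results are the easiest to bias.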
One last issue with DATA SAMPLING is whether sampling is done WITH REPLACEMENT OR NOT. Taking a large number
of samples from a phone book might require going through the book multiple times. With simple random sampling, you
would possibly hit the same name twice (or even more). Does this matter? MAYBE. If you ignore a repeat
(non-replacement), you actually improve the odds (probability) for the other names or items.
FOR EXAMPLE, if 5 winners are pulled from 20 names in a hat and yours is one name out of the 20 in the hat, your odds
of winning on the first pick are 1/20 (=0.05 or 5%). If you did NOT win on the first pick and the winner’s name is NOT put
back in the hat, your odds improve to 1/19 (=0.053 = 5.3%) and continue to get better with each losing selection. BUT, if
the winners’ names are put back in the hat, your odds stay at 5% with each pull, as do the odds for the prior winners to
win again. For samples from a LARGE population, replacement is not that critical.
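The hat example can be worked through exactly with Python's fractions module. This is a sketch under the same assumptions as the example (20 names, 5 pulls, your name in the hat once); the variable names are made up. It lists your chance on each pull without replacement, then compares your overall chance of winning at least once under both schemes.

```python
from fractions import Fraction

names_in_hat = 20
winners_drawn = 5

# WITHOUT replacement: your chance on each pull, given you have not won yet.
# This works out to 1/20, 1/19, 1/18, 1/17, 1/16.
chance_per_pull = [Fraction(1, names_in_hat - pull) for pull in range(winners_drawn)]

# Chance of losing all five pulls without replacement.
lose_all_without = Fraction(1)
for pull in range(winners_drawn):
    remaining = names_in_hat - pull
    lose_all_without *= Fraction(remaining - 1, remaining)
win_without = 1 - lose_all_without  # simplifies to 5/20 = 1/4

# WITH replacement: every pull is a fresh 1-in-20 chance.
lose_all_with = Fraction(names_in_hat - 1, names_in_hat) ** winners_drawn
win_with = 1 - lose_all_with

print([str(c) for c in chance_per_pull])
print(f"Win at least once, without replacement: {float(win_without):.4f}")
print(f"Win at least once, with replacement:    {float(win_with):.4f}")
```

Without replacement your overall chance is 25%; with replacement it drops to about 22.6%, because some pulls can be "wasted" on a repeat winner.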
FINALLY, HERE ARE THIS WEEK’S PART 1 HOMEWORK PROBLEMS:
HW1 (part 1)- HOMEWORK PROBLEMS (SUBMIT TO THE ASSIGNMENT FOLDER BY 11:59 PM EST SUNDAY)
#1. You are the quality assurance person working an assembly line at a TV manufacturing plant. They produce 1000
TVs a day. IF THE TVs ARE ALL THE SAME MODEL, WHAT PERCENTAGE (think about the cost of testing) WOULD YOU
TEST (WHY?) AND HOW WOULD YOU SELECT THEM? (Don’t just say “randomly”. How do you do it randomly?) If the
inspector were lazy, how would they likely do it as a “convenience” sample? Lastly, if the 1000 TVs were 4 different
models, how would you sample then and what type of sampling would this be?
#2. You are going out to eat. There are three shopping malls nearby and each has up to five restaurants (these
restaurants are all different styles: e.g., Italian, Chinese, French). Here are their customer SATISFACTION ratings on a
scale of up to five +’s (five being the highest satisfaction). WHAT ASSUMPTIONS ARE YOU MAKING REGARDING WHAT
“SATISFACTION” MEANS?
Mall 1
Mall 2
Mall 3
(a) ++++
(a) +++
(a) +++++
(b) ++++
(b) ++
(b) ++++
(c) ++++
(c) +++
(c) +++
(d) ++
(d) +
(e) +++
#3. (a) What type of data and scale are involved here?
(b) Which Mall Restaurant did you pick? WHY?
(c) What issues could you encounter with your pick once you got there?
#4. What is a CONVENIENCE SAMPLE? Give an example of one and explain when it might be actually useful in giving a
picture of the entire population, and what could be misleading about it.
#5. You can find 20 RANDOM NUMBERS in a Table or you can generate them with software like Excel. The Excel
functions are “RAND” and “RANDBETWEEN”. With “RANDBETWEEN” you simply input the bottom and top of the range of
values you want your numbers to fall between, then copy the formula into as many cells as you want numbers.
For example, you may want twenty 2-digit numbers that fall between 00 and 99 (like “34”).
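If you work outside Excel, the same idea can be sketched in Python. This is an illustration only, not a course requirement: random.randint(0, 99) returns a whole number from 0 to 99 with both ends included, and the seed value is arbitrary and is there just so the run can be repeated.

```python
import random

random.seed(2024)  # arbitrary fixed seed, only so the sketch is repeatable

# Twenty whole numbers from 0 to 99 inclusive.
numbers = [random.randint(0, 99) for _ in range(20)]

# Show each as a 2-digit string so that 7 prints as "07".
print([f"{n:02d}" for n in numbers])
```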
TWO CONSIDERATIONS: (1) You must systematically use the random numbers in the Table or the ones generated.
You don’t “skip around” because that could un-randomize the values. (2) Let’s say you want 1000 names from a 50
page phone book. You reach the end of the book with your systematic selection and only have 800 names. What do
you do? Simple: start over in the book (loop). For example, if you were selecting names from every 15th page and you
reached the end of the book only 8 pages after your last selection, count the remaining 7 pages from the front and
continue on page 7 of the same book.
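The looping in consideration (2) is just wrap-around counting, which can be sketched in Python with the remainder operator. The function name is made up, and the first page (12) is an assumption chosen so that the last selection before the end of the 50-page book falls on page 42, leaving 8 pages, as in the example.

```python
def pages_to_visit(total_pages, step, first_page, how_many):
    """List the pages visited when taking every `step`-th page and looping."""
    pages = []
    page = first_page
    for _ in range(how_many):
        pages.append(page)
        # Move ahead `step` pages; wrap past the last page to the front.
        page = (page - 1 + step) % total_pages + 1
    return pages

print(pages_to_visit(total_pages=50, step=15, first_page=12, how_many=6))
# prints [12, 27, 42, 7, 22, 37]
```

After page 42 there are only 8 pages left, so the remaining 7 of the 15 are counted from the front and the next selection is page 7.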
One source of random-looking digits is the number π (the Greek letter pi) used in geometry:
3.141592653589793238462643383. . . Ignore the decimal point between 3 and 1 and you have THIS string of random
numbers: 3141592653589793238462643383.
USE THIS STRING (and loop it) to generate twenty, 3-digit (e.g. 314) random numbers AND EXPLAIN how you did it.
FYI: the number π used in geometry, as in the AREA of a circle = πr², behaves like a random number in that
its digits never settle into a repeating pattern: π = 3.141592653589793238462643383 . . . (If you want a million
decimal places check out: www.piday.org/million )
WEEK 1, HOMEWORK 1- PART 2: LANE C 1-2; ILLOWSKY C 2 SECTIONS 2.1 – 2.4
HW1 – Part 1 dealt with what data are and how to collect valid random data. We also talked about how data
“samples” are intended to give us an idea of an entire population. Of course sample size affects everything.
We now continue with DESCRIPTIVE STATISTICS concepts, such as DISPLAYING our data in the hopes of seeing
some pattern as distinctly as possible. So, who cares if there is a pattern? Well, if we see a bell curve shape
we likely have a NORMAL distribution and all our statistical analyses will work (give us valid results).
BUT, always remember that statistics proves NOTHING by itself. The calculated numbers simply give us
support for our hypothesis (our ideas). Remember too that as with stock prices, past data do NOT predict
future performance. Of course all statistics mean NOTHING IF the data are not true random samples that
reflect the entire population of concern.
OPEN THE EXCEL TABLE PROVIDED WITH THIS HOMEWORK AND ANSWER THE FOLLOWING FIVE (5)
QUESTIONS
#6. Draw a vertical (or horizontal) BAR CHART that compares each month’s total income AND total expenses
(TWO BARS PER MONTH FOR THE 12 MONTHS )
#7. Draw a PIE CHART showing the total ANNUAL cost split among these three EXPENSE categories: gas, food,
electricity (a hand drawing is fine). EXPLAIN HOW EXCEL OR YOU DETERMINED THE ANGLES OF THE 3
WEDGES IN THE PIE CHART (DUST OFF GEOMETRY AND TRIG)
#8. Draw a STEM & LEAF diagram of the MONTHLY INCOME numbers. Explain how you would handle this if
the expenses had more significant digits (e.g., $3189 instead of $3200, etc.). (This is a possible shortcoming of
displaying data in a stem & leaf diagram.)
#9. HAND DRAW a DOT PLOT for any one data column. (Few seem to get this one right) Look at the LANE
text example. Try putting the $-amounts along the bottom axis and a dot for each time a value occurs above
that number. You can do this sideways too (see Lane).
#10. SUMMARIZE (in your own words AFTER READING THE TEXTS) what each of the data displays (i.e., pie
chart, bar graph, stem & leaf, and dot plot) above is BEST suited for and what, if any, are its LIMITATIONS.
Which display do you feel gave you the best “picture” of the shape of the data distribution and/or the most
information about it?
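For the dot plot idea in #9 (values along the bottom axis, one dot stacked above a value each time it occurs), here is a rough text-only sketch in Python. The scores are made-up whole numbers, NOT data from the homework table, and the function name is invented for illustration; a hand drawing follows the same stacking logic.

```python
from collections import Counter

def text_dot_plot(values):
    """Build a text dot plot: one dot per occurrence, stacked above its value."""
    counts = Counter(values)
    # The axis covers every whole number from the smallest to the largest
    # value, so gaps in the data stay visible.
    axis = list(range(min(values), max(values) + 1))
    width = len(str(max(values))) + 1
    lines = []
    # Build rows from the tallest stack down to the row just above the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("." if counts[v] >= level else " ").rjust(width) for v in axis)
        lines.append(row.rstrip())
    lines.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(lines)

# Hypothetical quiz scores (made-up data).
scores = [6, 7, 7, 8, 8, 8, 10]
print(text_dot_plot(scores))
```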
WK1-HW1     INCOME      GAS     FOOD     UTIL     RENT   C-CARD  EXP TOT
JAN           3500      250      500      700     1400     1200     4050
FEB           3300      225      600      650     1400      900     3775
MAR           3000      260      550      600     1400      800     3610
APR           3500      200      450      550     1400      600     3200
MAY           3600      200      400      400     1400      450     2850
JUN           3800      175      450      500     1400      450     2975
JUL           3800      300      500      600     1400      900     3700
AUG           4000      250      500      700     1400     1000     3850
SEP           4100      260      375      500     1400      850     3385
OCT           4200      200      350      450     1400     1000     3400
NOV           4400      250      400      500     1400      800     3350
DEC           4500      270      600      600     1400     2000     4870
TOTALS       45700     2840     5675     6750    16800    10950    43015
MEAN       3808.33      237      473      563     1400      913     3585
MEDIAN        3800      250      475      575     1400      875     3505
MODE          3500      250      500      600     1400      900     none
VARIANCE    204470     1338     6984     9148        0   166875   288875
STD DEV        452       37       84       96        0      409      537
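As a cross-check on the summary rows of the table, the INCOME column can be run through Python's statistics module. This is a sketch, not the method used to build the table. The rounded variance and standard deviation agree with the table when the SAMPLE formulas (dividing by n - 1) are used, which suggests that is what the spreadsheet did. Note also that multimode reports both 3500 and 3800, since each occurs twice, while the table lists only 3500.

```python
import statistics

# Monthly INCOME values, JAN through DEC, copied from the table.
income = [3500, 3300, 3000, 3500, 3600, 3800,
          3800, 4000, 4100, 4200, 4400, 4500]

total = sum(income)
mean = statistics.mean(income)
median = statistics.median(income)
modes = statistics.multimode(income)    # every most-common value
variance = statistics.variance(income)  # SAMPLE variance (divides by n - 1)
std_dev = statistics.stdev(income)      # SAMPLE standard deviation

print(f"TOTAL    {total}")
print(f"MEAN     {mean:.2f}")
print(f"MEDIAN   {median}")
print(f"MODE(S)  {modes}")
print(f"VARIANCE {variance:.0f}")
print(f"STD DEV  {std_dev:.0f}")
```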