STA 106 Winter 2018
Homework 1 - Due Friday, Jan 19th

Book Homework (does not require R)

Note: This may be handwritten or typed. Answers should be clearly marked. Please put your name in the upper right corner.

1. For the following problems, identify all possible combinations of the two categorical variables.
(a) Grade (A, B, C, D, F) and employment status (Unemployed (U), Employed at least part time (E)).
(b) Cancer status (has cancer, or does not) and age group (Young, Middle, Old).
(c) Smoking status (Smoker, Non-smoker) and illegal drug activity (Recently used, Used in the past, Never used).
(d) Intelligence of dog (high, medium, low) and type of dog (Border Collie, German Shepherd, Dachshund).

2. Assume that Y is a random variable with mean µY = 4 and standard deviation σY = 8. Find the mean and standard deviation of each of the following random variables:
(a) U1 = 3 + 4Y
(b) U2 = −10 + 2Y
(c) U3 = 1/4 − Y
(d) U4 = 3/4 − (1/4)Y

3. Assume Y1, Y2, ..., Y10 denotes an independent random sample of size 10 from a population with mean 20 and standard deviation 5.
(a) Find the mean and variance of Ȳ.
(b) Find the mean and variance of Σ_{i=1}^{10} Yi.
(c) Find the mean and variance of Y∗ = a + bȲ, where a and b are constants.
(d) Find the mean and variance of Y∗ = 5 − 2 Σ_{i=1}^{10} Yi.

4. Suppose we take three independent random samples of size 100 from three independent populations. Let population i be normally distributed, with mean µi and standard deviation σi, i = 1, 2, 3. Identify the distribution of the following quantities, being as specific as you can (name the distribution if possible, find the mean, find the standard deviation).
(a) Ȳ1 − Ȳ2
(b) Ȳ1 + Ȳ2
(c) Ȳ1 + (Ȳ2 − Ȳ3)/2
(d) Ȳ1 + Ȳ3 − 2Ȳ2

5. For a random sample of 500 subjects, systolic blood pressure and smoking status (smoker or not) were recorded. The goal is to evaluate whether average systolic blood pressure differs by smoking status. Summary sample statistics on the dataset follow:

             Sample Mean   Sample Standard Deviation   Sample Size
Smokers      150.03        27.49                       266
Nonsmokers   139.18        27.49                       234

Assume that you use the pooled variance formula.
Source: Data are part of a larger case study for the 2003 Annual Meeting of the Statistical Society of Canada.
(a) State the appropriate null and alternative hypotheses.
(b) Find the test statistic and interpret the value.
(c) Find the approximate p-value using the t-table and interpret the value in terms of the problem.
(d) State your decision and conclusion in terms of the problem if α = 0.05.

6. Continue with problem 5.
(a) Interpret a Type I error in terms of the problem.
(b) Interpret a Type II error in terms of the problem.
(c) Calculate a 99% confidence interval for the true difference in average systolic blood pressure.
(d) Interpret the interval from (c) in terms of the problem, being as specific as you can.
(e) What is the largest difference between the two groups you would expect with 99% confidence? Be sure to specify the direction of the difference.

7. Three high schools participated in a study to evaluate the effectiveness of a new computer-based mathematics curriculum. In each school, four 24-student sections of freshman algebra were available for the study. The two types of instruction (standard, computer-based) were randomly assigned to the four sections in each of the three schools. At the end of the term, a standardized mathematics test was given to the 24 students in each section.
(a) Is this an experimental, observational, or mixed study? Explain.
(b) What is the primary variable of interest (the response variable)?
(c) What are the explanatory variables? Identify all levels of the explanatory variables (factor-levels) if appropriate.
(d) Identify all combinations of the explanatory variables.

8. A rehabilitation center researcher was interested in examining the relationship between physical fitness prior to surgery of persons undergoing corrective knee surgery and the time required in physical therapy until successful rehabilitation. Data on the number of days required for a successful completion of physical therapy, the prior physical fitness status (below average, average, above average), and the doctor they were randomly paired with (out of 3 possible doctors) were collected.
(a) Is this an experimental, observational, or mixed study? Explain.
(b) What is the primary variable of interest (the response variable)?
(c) What are the explanatory variables? Identify all levels of the explanatory variables (factor-levels) if appropriate.
(d) Identify all combinations of the explanatory variables.

R Homework (requires some use of R)

Note: You do not have to use R Markdown to turn in the homework, but the homework must be turned in in a reasonable format. The answers to the questions should be in the body of the homework, and the code used to obtain those answers should be in an appendix. There should be no code in the body of the homework. You can accomplish this in R, Word, LaTeX, Google Docs, etc.

I. Online you will find the file “GSK.csv”. The csv file has the following columns:
Column 1. sysbp: The systolic blood pressure of the subject (mmHg).
Column 2. gender: The gender, with levels F and M.
Column 3. married: Y if the subject was married, N if not.
Column 4. exercise: With levels L = low, M = medium, H = high.
Column 5. age: The age of the subject in years.
Column 6. stress: With levels LS = low, MS = medium, HS = high.
Column 7. educatn: With levels LE = low, ME = medium, HE = high.
Use this dataset in problems I, II, III, IV.
Source: Data are part of a larger case study for the 2003 Annual Meeting of the Statistical Society of Canada.
(a) Find the average systolic blood pressure by stress level. Which group had the highest average?
(b) Find the standard deviation of systolic blood pressure by stress level. Does it appear the standard deviations are approximately equal?
(c) Find the average age by exercise level. Which group has the lowest average age?
(d) Find the standard deviation of age by exercise level. Which group seems to differ the most from its group mean?

II. Continue with the “GSK.csv” dataset.
(a) Create a boxplot of systolic blood pressure by education level. Does there appear to be a trend? Explain your answer.
(b) Create a histogram of systolic blood pressure by marriage category. Does one group tend to vary more than the other? Explain your answer.
(c) Create a mosaic plot of exercise and stress level. Which exercise group had the highest proportion of subjects with high stress?
(d) Create a mosaic plot of marriage and stress level. Which marriage group had the highest proportion of low-exercise subjects?

III. Continue with the “GSK.csv” dataset. For the following problems, you must show results from either a plot, a table, or an aggregate command to back up your answers.
(a) Which exercise group had the most subjects?
(b) Which stress group had the most highly educated subjects?
(c) Which stress group had the highest average age?
(d) Which gender group had the lowest average systolic blood pressure?

IV. Continue with the “GSK.csv” dataset. Using R, and assuming equal variance by group, suppose we want to test whether the average systolic blood pressure is equal for married and non-married subjects.
(a) Find the test statistic.
(b) Find the exact p-value.
(c) Find the 95% confidence interval for the true difference.
(d) Interpret the confidence interval in terms of the problem.
(e) What is your conclusion about how systolic blood pressure may differ by marriage category? Explain in detail.
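
The R Homework problems ask for group summaries, tables, and plots backed by "a plot, a table, or an aggregate command". The following is a minimal sketch of those base-R calls, not a solution. It runs on a small invented data frame (`gsk`) that borrows three column names from the GSK.csv description. All values are made up for illustration, and with the real file the data frame would presumably come from `read.csv("GSK.csv")` instead.

```r
# Toy stand-in for GSK.csv. Every value below is invented for illustration.
gsk <- data.frame(
  sysbp   = c(120, 135, 150, 128, 142, 160, 118, 131),
  stress  = c("LS", "LS", "MS", "MS", "HS", "HS", "LS", "HS"),
  married = c("Y", "N", "Y", "N", "Y", "N", "Y", "N")
)

# Group mean and group standard deviation of a numeric column by a factor.
mean_by_stress <- aggregate(sysbp ~ stress, data = gsk, FUN = mean)
sd_by_stress   <- aggregate(sysbp ~ stress, data = gsk, FUN = sd)
print(mean_by_stress)
print(sd_by_stress)

# Two-way table of counts for two categorical columns,
# and row proportions (each row sums to 1).
counts <- table(gsk$married, gsk$stress)
row_props <- prop.table(counts, margin = 1)
print(counts)
print(row_props)

# Plots: a side-by-side boxplot and a mosaic plot of the two-way table.
boxplot(sysbp ~ stress, data = gsk,
        xlab = "Stress level", ylab = "Systolic blood pressure (mmHg)")
mosaicplot(counts, main = "Marriage by stress level (toy data)")
```

The same pattern applies to any numeric column and grouping column in the real dataset by changing the formula, for example `age ~ exercise`.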
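
Problem IV asks for a two-sample test "assuming equal variance by group". As a hedged sketch, base R's `t.test` with `var.equal = TRUE` gives the pooled-variance test, and the same statistic can be checked by hand from the pooled variance formula. The two vectors below are invented numbers, not values from GSK.csv, and the names `married_bp` and `unmarried_bp` are hypothetical.

```r
# Invented blood pressure readings for two groups (illustration only).
married_bp   <- c(128, 134, 141, 150, 122)
unmarried_bp <- c(131, 145, 138, 152, 149)

# Pooled-variance (equal variance) two-sample t-test.
fit <- t.test(married_bp, unmarried_bp, var.equal = TRUE, conf.level = 0.95)
print(fit$statistic)  # test statistic
print(fit$p.value)    # exact two-sided p-value
print(fit$conf.int)   # confidence interval for the difference in means

# The same test statistic computed by hand from the pooled variance.
n1 <- length(married_bp)
n2 <- length(unmarried_bp)
sp2 <- ((n1 - 1) * var(married_bp) + (n2 - 1) * var(unmarried_bp)) / (n1 + n2 - 2)
t_hand <- (mean(married_bp) - mean(unmarried_bp)) / sqrt(sp2 * (1 / n1 + 1 / n2))
print(t_hand)
```

With a data frame instead of two vectors, the formula interface `t.test(sysbp ~ married, data = ..., var.equal = TRUE)` is the usual equivalent when the grouping column has exactly two levels.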