1 Homework 7: Aggregated Data and High Dimension Anomalies
1.1 Purpose
Homework 7 is meant to give you some practice with WLS and with high-dimensional estimation problems.
Any line starting with “side note” is something for you to think about, but there is no need to answer it.
1.2 Weighted Least Squares
Let’s create data at the individual level according to the linear regression model Y = Xβ + ε with the following specifications:
• Let ε_i ∼ Normal(0, 20), independently across i.
• Let the sample size, n, be 1000.
• Let β = (0, 1.2)ᵀ (the first parameter is for the constant feature).
• Let’s assign each sample to one of 20 exclusive groups according to the following multinomial distribution: P(i ∈ Group k) ∝ 1/√k, where k ∈ {1, . . . , 20}.
• Define the non-constant feature X_i ∼ Binomial(n = 100, p = 0.5), independently for all i.
Create the aggregated data by simply averaging the data points within each group for the X and Y values. Let’s call these aggregated data values X̄_k and Ȳ_k.
Let the vectorized version of the data be denoted as Ȳ, and let X̄ be the 20 × 2 matrix whose k-th row is (1, X̄_k), i.e. a constant column followed by the column of group means X̄_1, . . . , X̄_20.
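The generation and aggregation steps above can be sketched as follows. This is a minimal sketch, not required code: the seed is arbitrary, and Normal(0, 20) is read here as variance 20 (the handout does not say whether 20 is the variance or the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n, K = 1000, 20
beta = np.array([0.0, 1.2])

# group assignment: P(i in group k) proportional to 1/sqrt(k)
ks = np.arange(1, K + 1)
probs = 1 / np.sqrt(ks)
probs /= probs.sum()
groups = rng.choice(ks, size=n, p=probs)

# individual-level data; Normal(0, 20) is read here as variance 20
X = rng.binomial(100, 0.5, size=n).astype(float)
eps = rng.normal(0.0, np.sqrt(20), size=n)
Y = beta[0] + beta[1] * X + eps

# aggregate: average X and Y within each group
Xbar = np.array([X[groups == g].mean() for g in ks])
Ybar = np.array([Y[groups == g].mean() for g in ks])
Xbar_design = np.column_stack([np.ones(K), Xbar])  # the 20 x 2 matrix X-bar
```

Keeping the group sizes alongside `Xbar` and `Ybar` will matter later, since the variance of each group mean depends on how many individuals were averaged.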
1.2.1 Q0
TRUE/FALSE: the aggregated data will always have a higher correlation between X and Y than the correlation at the individual level. Note: ignore the constant feature.
1.2.2 Q1
TRUE/FALSE: Ȳ = X̄β + γ, where γ only depends on ε.
1.2.3 Q2
Let γ = Ȳ − X̄β. TRUE/FALSE: E(γ | X̄) = 0.
1.2.4 Q3
What is the analytical expression for Var(γ_k | X) and Cov(γ_k, γ_m | X), where k ≠ m? Please express the solution in terms of Var(ε) = σ² and the sample sizes of the different groups. You should assume the group assignments are given.
1.2.5 Q4
TRUE/FALSE: using OLS on the aggregated data will produce unbiased estimates for β.
1.2.6 Q5
TRUE/FALSE: using OLS on the aggregated data vs. using OLS on the individual-level data will produce the exact same estimates for β.
1.2.7 Q6
If we only had access to the aggregate data, please produce the point-wise 95% confidence interval for β if we used OLS (i.e., pretend the variances are constant), and compare that to the interval created using WLS (i.e., the correct calculation).
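As a concrete point of comparison, here is one minimal sketch of how both intervals could be computed on simulated aggregate data. The seed and variable names are our choices, Normal(0, 20) is read as variance 20, and the normal quantile 1.96 stands in for the exact t quantile for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, K = 1000, 20
beta = np.array([0.0, 1.2])

# simulate individual-level data per the spec above
ks = np.arange(1, K + 1)
probs = 1 / np.sqrt(ks)
probs /= probs.sum()
groups = rng.choice(ks, size=n, p=probs)
X = rng.binomial(100, 0.5, size=n).astype(float)
Y = beta[0] + beta[1] * X + rng.normal(0, np.sqrt(20), size=n)

# aggregate to group means, keeping the group sizes
nk = np.array([(groups == g).sum() for g in ks])
Xbar = np.array([X[groups == g].mean() for g in ks])
Ybar = np.array([Y[groups == g].mean() for g in ks])
A = np.column_stack([np.ones(K), Xbar])

def wls_fit(A, y, w):
    """Weighted LS: minimize sum_k w_k (y_k - a_k' b)^2; w = 1 gives OLS."""
    W = np.diag(w)
    XtWX = A.T @ W @ A
    b = np.linalg.solve(XtWX, A.T @ W @ y)
    resid = y - A @ b
    sigma2 = (w * resid ** 2).sum() / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(XtWX)))
    return b, se

b_ols, se_ols = wls_fit(A, Ybar, np.ones(K))        # pretends constant variance
b_wls, se_wls = wls_fit(A, Ybar, nk.astype(float))  # weights = group sizes

for name, b, se in [("OLS", b_ols, se_ols), ("WLS", b_wls, se_wls)]:
    print(name, [(bj - 1.96 * sj, bj + 1.96 * sj) for bj, sj in zip(b, se)])
```

The choice of weights proportional to the group sizes follows from the variance structure asked about in Q3: a mean over more individuals is less noisy and should count for more.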
1.2.8 Q7
Continuing Q6, which one would you recommend using?
1.2.9 Q8
Compute the point-wise 95% confidence interval for β using the individual-level data and OLS.
Side note, you should wonder if using the individual data is always preferable
despite the calculation from Q3.
For the following problems, let’s change the data generation process slightly: let X_i ∼ Binomial(n = 100, p = (k − 10)/200 + 0.5), independently across i; i.e., group 1 is distributed according to p = −9/200 + 0.5, group 2 has p = −8/200 + 0.5, etc. There are still 20 groups.
Side note, you can imagine the groups are different neighborhoods. X is your parents’ income when you were born and Y is the base salary of your first job (all in weird units).
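Only the draw of X changes under this new process; a minimal sketch of the modified step (seed and names are our choices, group assignment as before):

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
ks = np.arange(1, 21)
p_k = (ks - 10) / 200 + 0.5  # group 1: p = -9/200 + 0.5 = 0.455, ..., group 20: p = 10/200 + 0.5 = 0.55

# same multinomial group assignment as before
probs = 1 / np.sqrt(ks)
probs /= probs.sum()
groups = rng.choice(ks, size=1000, p=probs)

# each individual draws X with its own group's success probability
X = rng.binomial(100, p_k[groups - 1])
```

The key difference from before is that the group means of X now genuinely differ across groups, rather than differing only by sampling noise.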
1.2.10 Q9
Compare the point-wise 95% confidence interval for β1 using OLS on the individual-level data vs. the method chosen in Q7 on the aggregate data. Which one would you recommend?
1.2.11 Q10
Using the individual-level data and OLS, please write the code that produces the point-wise 95% prediction interval for new Y values for each hypothetical X value, 0, 1, . . . , 100. Please make the interval center on the regression line. No need to report numbers; just the code is sufficient.
Again, the prediction interval is the interval that will capture 95% of the
cases when predicting new data points.
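One possible shape for such code is sketched below, under assumed simulated data (seed and the reading of Normal(0, 20) as variance 20 are our choices; 1.96 approximates the t quantile):

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 1000
X = rng.binomial(100, 0.5, size=n).astype(float)
Y = 0.0 + 1.2 * X + rng.normal(0, np.sqrt(20), size=n)

A = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
resid = Y - A @ b
sigma2 = resid @ resid / (n - 2)  # unbiased estimate of the noise variance
XtX_inv = np.linalg.inv(A.T @ A)

x_new = np.arange(0, 101, dtype=float)  # hypothetical X values 0, 1, ..., 100
A_new = np.column_stack([np.ones_like(x_new), x_new])
yhat = A_new @ b  # the interval is centered on the regression line

# prediction variance: sigma^2 * (1 + x0' (X'X)^{-1} x0)
pred_se = np.sqrt(sigma2 * (1 + np.einsum("ij,jk,ik->i", A_new, XtX_inv, A_new)))
lo, hi = yhat - 1.96 * pred_se, yhat + 1.96 * pred_se
```

The `1 +` term is what distinguishes a prediction interval from a confidence interval for the mean: it accounts for the noise in the new observation itself, not just the uncertainty in β̂.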
1.2.12 Q11
For this problem, assume you only have access to the aggregate data.
Side note: if you were to create a prediction interval based on the aggregate
data, you would need X̄new AND its corresponding group size (notice how WLS
assumes the weights are known). When you apply these intervals to individuals,
this is how ecological correlation mistakes are made.
Instead of creating an interval for Ȳ_new, let’s create an interval for Y_new | {X_new, X̄} by computing an interval that uses Var(Y_new − X_new β̂_wls | X̄, X_new), estimates σ̂² under our WLS setting, and is centered at X_new β̂_wls. Please create a plot that compares this interval to the interval implied by your code from Q10.
Side note: you should think about what is specific about this setup that allows us to do this. Is this calculation true for all WLS settings?
1.3 NOT-James-Stein’s estimator
Let’s define the MSE in estimating a high-dimensional vector, β, using an estimate β̂, as E(‖β − β̂‖²).
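This expectation can be approximated by simulation. A minimal sketch with a hypothetical estimator (the noise scale, seed, and names are purely illustrative, not part of any question below):

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
p = 100
beta = np.zeros(p)  # any fixed target vector works here

# hypothetical estimator: beta plus independent N(0, 1) noise per coordinate
draws = beta + rng.normal(0.0, 1.0, size=(5000, p))

# Monte Carlo approximation of E(||beta - beta_hat||^2)
mse = np.mean(np.sum((draws - beta) ** 2, axis=1))
# for this estimator the theoretical value is p * 1 = 100
```

Note the squared norm sums the per-coordinate squared errors, so in high dimensions even small per-coordinate noise accumulates.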
1.3.1 Q12
What is the theoretical MSE if we estimated an arbitrary β with the vector of 0’s?
Side note: do not overthink. This is just to show anything CAN be an
estimate for anything.
1.3.2 Q13
Under the usual regression settings, create the biased estimate β̂_γ = γ β̂_OLS, where β̂_OLS is the coefficient estimate from the regression. Calculate the theoretical mean squared error for γ β̂_OLS. Express the result in terms of γ, β, X, and σ², and simplify as much as possible.
Side note: you should know why this isn’t very useful in practice, because β and σ² are unknown.
1.3.3 Q14
Let Y = Xβ + ε, where β is the 0 vector. Let ε ∼ N(0, 10), n = 1000, and create 99 random features, all from a uniform random variable (between 0 and 1), and 1 constant feature for X. Let β̂_OLS be the usual regression estimate. Using the result above, with your simulated X values, write the code AND report the smallest value for γ before the MSE starts to increase again.
Side note: this is intentionally similar to the simultaneous inference case.
1.3.4 Q15
To shrink a vector Z to the origin (i.e., the vector of all 0’s), we can multiply Z by γ ∈ [0, 1). However, we can also shrink Z to an arbitrary vector µ by calculating γ(Z − µ) + µ.
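The two shrinkage operations above can be written as one small helper (a sketch; the function name is ours, and µ = 0 recovers plain scaling):

```python
import numpy as np

def shrink(Z, gamma, mu=0.0):
    """Shrink Z toward mu by factor gamma: gamma * (Z - mu) + mu."""
    return gamma * (np.asarray(Z, dtype=float) - mu) + mu

z = np.array([1.0, 3.0, -2.0])
print(shrink(z, 0.5))          # halfway to the origin -> [0.5, 1.5, -1.0]
print(shrink(z, 0.5, mu=2.0))  # halfway to the vector of 2's -> [1.5, 2.5, 0.0]
```

Geometrically, each coordinate moves a fraction (1 − γ) of the way from Z toward µ; γ = 1 leaves Z unchanged and γ = 0 collapses it to µ.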
Same as Q14, let Y = Xβ + ε, where β is the 0 vector. Let ε ∼ N(0, 10), n = 1000, and create 99 random features, all from a uniform random variable (between 0 and 1), and 1 constant feature for X. Let β̂_OLS be the usual regression estimate. Shrink β̂_OLS towards µ = 2, i.e., a vector containing 2’s, with γ = 0.99. Numerically approximate the MSE over 100 simulations for the shrunken estimator and the OLS estimator. Report which estimator you would prefer if you’re optimizing for MSE when estimating β.