Review 1st Semester Econometrics
Review Outline:
1. Population Model and Model Assumptions
2. Statistical Tests (t-test and F-test)
3. Changing Units of Measurement
4. Log and Level Models
5. Failure of Model Assumptions
6. Variance of OLS Estimator
7. Quadratics and Interactions
8. Indicator Variables
9. Fixed Effects
10. First Differencing
1.1 Population Model and Model Assumptions

Multiple Regression Model:

Y = β0 + β1 X1 + β2 X2 + · · · + βk Xk + u

where Y and X1 , X2 , ... are data, u is the error term, and β0 , β1 , . . . , βk are parameters
• Terminology
– Y: Dependent variable, or explained variable, or LHS variable
– X’s: Independent variable, or explanatory variables, or RHS variables
– Predicted Parameters: β̂0 , β̂1 , . . . , β̂k
– Ŷ : Predicted Dependent Variable, or fitted Y
Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + · · · + β̂k Xk
– û: residual
û = Y − Ŷ
• “Simple Regression Model”: Y = β0 + β1 X1 + u
– Simple because only one RHS variable
– β0 : intercept parameter
– β1 : slope parameter
– Estimated slope parameter:

β̂1 = Σ_{i=1}^{N} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{N} (Xi − X̄)²

Where: X̄ and Ȳ are the sample means, and the i subscript indicates the value of the variable for each observation i in the sample
– Can approximate the estimated effect on Y as: the change in the predicted Y is approximately equal to the product of the estimated slope parameter and the change in the RHS variable, or ΔŶ ≈ β̂1 ΔX1
• β̂1 has the same interpretation in a regression model with multiple RHS variables as in simple regression, except now we hold all other variables (X2 , . . . , Xk ) constant
• We can write the predicted value of Y, or fitted value of Y as:
Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + · · · + β̂k Xk
• Define the residual as: û = Y − Ŷ
• Parameter estimates are derived in the model by minimizing the “mistakes”
– Select β̂0 , β̂1 , . . . , β̂k to minimize Σ_{i=1}^{N} ûi²
– We can draw a simple picture that shows the regression line on our plotted sample of data
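To make the mechanics concrete, here is a minimal sketch in Python (the simulated data and true coefficient values are illustrative assumptions, not from the notes) that computes β̂1 and β̂0 by hand and checks them against a least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(0, 10, N)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, N)   # assumed true beta0 = 2, beta1 = 0.5

# Estimated slope: sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()             # the point (Ybar, Xbar) is on the line

# Cross-check against numpy's least-squares solver
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(N), X]), Y, rcond=None)
print(b0, b1)    # close to (2, 0.5)
print(coef)      # same two numbers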
We Have Six Basic Assumptions:
1. Linear in the parameters: Y = β0 + β1 X1 + β2 X2 + · · · + βk Xk + u
• At first this may seem very restrictive, but we will see that it is a surprisingly
flexible model
• Important to recognize that assumption is over the parameters
• We can still incorporate nonlinearities in the data
2. Random sampling: random sampling of n observations {(xi1 , xi2 , . . . , xik , Yi ) : i =
1, . . . , n}
• The idea is to learn about the world (i.e. “the population”) by taking a sample
• We want to test hypotheses about the population using the sample
• If the sample is non-random then we will not be able to make accurate statements
about the population
• Inference errors made using a non-random sample are an example of Selection Bias
3. No Perfect Multicollinearity:
(a) No variable is a constant.
(b) We cannot write one variable as a linear combination of other variables.
4. Zero Conditional Mean: E[u|X1 , X2 , . . . , Xk ] = 0
• Says that the expected value of the error term conditional on the RHS variables
in the model is zero
• Assumption can fail if: (a) there is a RHS variable that “should” be in the model but is omitted, or (b) if there is measurement error in a RHS variable
• We sometimes refer to failure due to (a) as Omitted Variable Bias, which is
closely related to Selection Bias
5. Homoskedasticity: Var[u|X1 , X2 , . . . , Xk ] = σ²
• Says that the variance of the error term conditional on the RHS variables in the
model is equal to a constant
• Implies that the error term variance for each observation is the same
• This assumption is not necessary for accurate parameter estimates
• This assumption is only necessary for doing hypothesis testing
• Failure of this assumption is called Heteroskedasticity
6. Normally distributed error term: u ∼ N (0, σ²)
• Says that the error term is distributed according to a Normal Distribution with
mean zero (from assumption 4) and constant variance (from assumption 5)
• This assumption is only necessary for doing hypothesis testing
Overview of OLS Assumptions
1. If assumptions 1-4 are valid then we can say that an estimator from our model (e.g. β̂1 ) is Unbiased:
• Unbiased because the expected value of the estimator (e.g. β̂1 ) is equal to the true value of the parameter (e.g. β1 ), or E[β̂1 ] = β1
• Note that in most statistical (econometric) analysis that uses (reasonably) large datasets we essentially replace the concept of unbiasedness with consistency: the estimated parameter from the sample converges in probability to the true population parameter (e.g. β̂1 →p β1 )
2. If assumptions 1-5 are valid then we can say that an estimator from our model (e.g. β̂1 ) is the Best Linear Unbiased Estimator (BLUE):
• This result follows from the Gauss-Markov Theorem
• Linear because the model is linear in parameters
• Unbiased because assumptions 1-4 are valid
• “Best” because the estimator has the smallest variance among all other linear unbiased estimators:
→ [Draw pictures of two estimated parameter distributions, where the β̂BLUE distribution has the smaller variance]
3. If assumptions 1-6 are valid then we can do statistical tests: e.g. t-test, F-test:
• Note that we can “easily” relax the normality assumption and still conduct
statistical tests
Three OLS Facts

1. Σ_{i=1}^{n} ûi = 0
• The sum of all the residuals (i.e. observation-specific prediction mistakes) in our sample is equal to zero
• This is why we use the sum of squared residuals as our criterion to find our estimated parameters
2. Σ_{i=1}^{n} xij ûi = 0 ∀ j = 1, 2, . . . , k
• The sum of the product of the residual and each variable (e.g. the numerical value for X1 ) across all observations in our sample is zero
3. The point (Ȳ , X̄1 , . . . , X̄k ) is on the regression line
The “Partialling Out” Interpretation of Multiple Regression
• We interpret each β̂j (for the slope coefficients) as the effect of the independent variable on the outcome while holding all the other X’s constant
⟹ That is, putting in each Xj “controls” for this factor when interpreting the β̂’s for the other RHS variables
• There is a neat mathematical way to see this using the simple regression model:
1. Run a Multiple Regression Model:
Y = β0 + β1 X1 + β2 X2 + u ⟹ Ŷ = β̂0 + β̂1 X1 + β̂2 X2
where we care about β̂1
2. Take a different 2-Step Approach
(1) Estimate a simple regression model with only the independent variables:
X1 = δ0 + δ1 X2 + u
- Calculate the predicted dependent variable: X̃1 = δ̃0 + δ̃1 X2
- Define ũ = X1 − X̃1
- We interpret ũ as the part of X1 not correlated with X2
(2) Estimate a model with the dependent variable we care about, Y, and the residual from the first step: Y = β1 ũ + e, where e is the error term
- This 2nd regression has no intercept (but that is not really important)
- Calculate the predicted dependent variable: Ẏ = β̇1 ũ
Q: How does β̂1 compare to β̇1 ?
⟹ They are the same!
• Intuition:
– ũ is the part of X1 uncorrelated with X2
– Using the part of X1 uncorrelated with X2 in a simple regression is the same thing as controlling for the effect of X2 in a multiple regression
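A minimal sketch of this partialling-out result (the Frisch-Waugh-Lovell theorem) on simulated data; all numbers below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
n = 500
X2 = rng.normal(0, 1, n)
X1 = 0.6 * X2 + rng.normal(0, 1, n)            # X1 correlated with X2
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(0, 1, n)

def ols(X, y):
    # Least-squares coefficients for the design matrix X
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_full = ols(np.column_stack([ones, X1, X2]), Y)   # multiple regression
d = ols(np.column_stack([ones, X2]), X1)           # step (1): X1 on X2
u_tilde = X1 - (d[0] + d[1] * X2)                  # part of X1 uncorrelated with X2
b_dot = ols(u_tilde.reshape(-1, 1), Y)             # step (2): Y on u_tilde, no intercept
print(b_full[1], b_dot[0])                         # the two slope estimates coincide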
1.2 Statistical Tests

1.2.1 Testing Hypotheses About a Single Population Parameter: The t Test
• It is important to remember that the βj are unknown features of the population; we will never know βj with certainty
• The best we can do is hypothesize about the value of βj and then test the hypothesis using statistical inference
• Since we are testing using a sample (and must estimate the error variance), we use the t distribution
- Recall that the t distribution converges to the Normal distribution as the sample size gets larger (once n ≈ 100, the distributions are very close)
• So, instead of:

(β̂j − βj ) / SD(β̂j ) ∼ N (0, 1)

we use:

(β̂j − βj ) / SE(β̂j ) ∼ t_{n−(k+1)}

where:
– Standard Deviation for β̂j : SD(β̂j ) = √ Var[β̂j ]
– Standard Error for β̂j : SE(β̂j ) = √ V̂ar[β̂j ]
∗ We use the Standard Error because we need to estimate Var[β̂j ] using our estimate for σ²
Steps in Testing Our Parameters
1. Establish the Null Hypothesis, H0 : βj = 0
• Most, but not all, of our H0 will test if βj = 0
Example
• Consider the Model: Y = β0 + β1 X1 + β2 X2 + u
Y ≡ Global Temperature Change Since 1750
X1 ≡ % Individuals Working as Pirates in World
X2 ≡ Carbon Dioxide (and other GHG) emissions
• The null hypothesis is H0 : β1 = 0
– In words: Controlling for the level of emissions, the correlation between
pirates and global temperature is zero
2. Define the t-statistic (or t ratio) for a null hypothesis of βj = 0 as:

t_β̂j = β̂j / SE(β̂j )
• The t-stat measures how many estimated standard deviations β̂j is away from zero
• If the t-stat is “large”, then we reject H0
• We define how “large” the t-stat must be for rejection of H0 based on our
selection of a critical value
• The critical value c defines the level for t_β̂j at which we reject the null hypothesis (given a specific significance level)
• Typical significance level is 5%: If we select a 5% significance level, then we are
willing to mistakenly reject H0 when it is true 5% of the time
• The probability value answers the question: given the observed value of the t-statistic, what is the smallest significance level at which H0 would be rejected?
3. Roughly speaking, we reject the Null in favor of the Alternative Hypothesis (that βj ≠ 0) when |t_β̂j | > 2:
→ Draw a roughly normal pdf for β̂j with shaded cutoff at 2
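A hedged sketch of these t-test mechanics in Python, using statsmodels on simulated data (the sample size, coefficients, and variable names are assumptions for illustration):

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n, k = 120, 2
X = rng.normal(size=(n, k))
Y = 0.5 + 0.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)  # true beta on X1 is 0

res = sm.OLS(Y, sm.add_constant(X)).fit()
t_stat = res.params[1] / res.bse[1]        # t = beta_hat / SE(beta_hat) under H0: beta = 0
df = n - (k + 1)
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided probability value
print(t_stat, p_value)                     # matches res.tvalues[1], res.pvalues[1]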
1.2.2 Testing Hypotheses About Multiple Parameters Simultaneously: F Test
Goodness of Fit Statistic, R²
• Define the Following:
– Total Sum of Squares: SST ≡ Σ_{i=1}^{n} (Yi − Ȳ )²
– Explained Sum of Squares: SSE ≡ Σ_{i=1}^{n} (Ŷi − Ȳ )²
– Residual Sum of Squares: SSR ≡ Σ_{i=1}^{n} ûi²
• Define R²: R² = SSE/SST = 1 − SSR/SST
• We interpret R² as the amount of variation in the dependent variable in our sample that our model can explain
• R² mechanically increases with the number of explanatory variables X included in the regression model provided that each new X variable added is correlated with Y
• It is not necessarily bad to have a low R² value
• R² is the key to running F -tests
• Adjusted R² is a slightly different formula that takes into account the number of explanatory variables: Adjusted R² = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]
Define F-statistic:

F ≡ [(SSRr − SSRu )/q] / [SSRu /(n − (k + 1))]

where F ∼ F_{q, n−(k+1)} and

SSRr ≡ Σ_{i=1}^{n} ûi² from the Restricted Model
SSRu ≡ Σ_{i=1}^{n} ûi² from the Unrestricted Model
q : number of restrictions in H0 (i.e. numerator degrees of freedom)
n − (k + 1) : degrees of freedom in the Unrestricted Model
Notes:
• (SSRr − SSRu ) ≥ 0 is always true
• Reject the Null if the F statistic is “large”, i.e. F > c
• Choose the significance level and look in the F table to find c (the default in Stata
is 5% significance level)
• If we reject H0 , we say that the variables in H0 are jointly statistically significant at
the given significance level
• If we fail to reject H0 , we say that the variables are jointly insignificant
• It is possible for variables to be collectively statistically significant, but individually
insignificant
F-test Example
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + u
H0 : β3 = 0 AND β4 = 0 AND β5 = 0
H1 : H0 not true.
1. Estimate Unrestricted Model (i.e. the model without applying the Null Hypothesis)
• Calculate SSRu
2. Estimate Restricted Model (i.e. the model after applying the Null Hypothesis)
• In this example, Restricted Model: Y = β0 + β1 X1 + β2 X2 + u
• Calculate SSRr
3. Calculate F using the formula
• In this example: q = 3 and n − (k + 1) = n − 6
4. Compare F to c in an F-table to make a judgment about the Null
• In practice, statistical software will calculate F and provide a probability value
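A minimal sketch of these four steps on simulated data (the model and coefficient values are illustrative assumptions):

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n, k = 200, 5
X = rng.normal(size=(n, k))
Y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)  # beta3 = beta4 = beta5 = 0

unres = sm.OLS(Y, sm.add_constant(X)).fit()         # unrestricted: all five X's
restr = sm.OLS(Y, sm.add_constant(X[:, :2])).fit()  # restricted: H0 drops X3, X4, X5

q, df = 3, n - (k + 1)
F = ((restr.ssr - unres.ssr) / q) / (unres.ssr / df)
p_value = stats.f.sf(F, q, df)
print(F, p_value)   # with true zero betas we usually fail to reject H0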
1.2.3 Economic Significance vs. Statistical Significance
• Statistical significance is determined by t_β̂j (also the F -stat)
• Economic significance is determined by the size of β̂j
• Possible to have a statistically significant variable that is not economically important
– Most likely to happen with datasets with a very large sample size
Example (Wooldridge, p135)
The question of interest is what company-level factors are correlated with a higher participation rate in a company retirement savings plan. The sample size is 1,534.
Y ≡ Participation Rate in Company Savings Plan
X1 ≡ Match Rate, X2 ≡ Age of Plan, X3 ≡ Firm Size
Results (replacing each βj with its estimate, SE(β̂j ) in parentheses):

Ŷ = 80.29 + 5.44 X1 + 0.269 X2 − 0.00013 X3
    (0.78)   (0.52)    (0.045)     (0.00004)
Interpretation
1. All variables statistically significant at the 5% level (e.g. using the “rule of 2”)
2. If size of company increases by 10,000 then this is associated with 1.3 percentage
point decrease in the participation rate in company savings plan
• A very large change in the size of a company is associated with a small change
in participation in the savings plan
• This is not an economically important finding
1.3 Changing Units of Measurement

The units of measurement we use for our variables can, not surprisingly, affect how we interpret the coefficient estimates β̂0 and β̂1 . Consider the following cases:
CASE 1: Scaling the units of measurement for the Y variable
• Suppose we run a simple regression of the hours of sleep an MSU student receives on a school night (Y) and the credit hours the student is enrolled in for the semester (X):

HrsSleep = β0 + β1 (CreditHrs) + u

– We estimate that β̂0 = 10 and β̂1 = −0.25
• What would β̂0 , β̂1 be if we ran the following regression instead?

MinSleep = β0 + β1 (CreditHrs) + u

– That is, we changed the units of the data for the dependent variable
– New β̂0 = 10 · 60 = 600 ; New β̂1 = −0.25 · 60 = −15
– Intuition: Both β̂0 and β̂1 are measured in terms of Y-units; if we change the dependent variable units then the estimators get scaled by the same amount
CASE 2: Scaling the units of measurement for the X variable
Suppose we run the new regression:

HrsSleep = β0 + β1 (CreditMin) + u

• Let “Credit Minutes” simply be Credit Hours times 60
Q: What is the new β̂1 ?
⟹ New β̂1 = Old β̂1 /60 = −0.25/60
Recall: ΔŶ = β̂1 · ΔX
- So, the product of β̂1 and ΔX gives us our estimate of ΔŶ
- If we don’t change the units of Y , then the change in units of X must be off-set by the change in scale of β̂1
Q: What is the new β̂0 ?
⟹ New β̂0 = Old β̂0 ; β̂0 doesn’t change!
Why? Because β0 is measured in units of Y , which didn’t change

1.4 Log and Level Models
Model           Dep Var   Ind Var   Interpretation
(1) Level-Level   Y         X        ΔY = β1 · ΔX
(2) Level-Log     Y         log(X)   ΔY = (β1 /100) · %ΔX
(3) Log-Level     log(Y )   X        %ΔY = (100 β1 ) · ΔX
(4) Log-Log       log(Y )   log(X)   %ΔY = β1 · %ΔX
• It is very common in economics to have the Dependent Variable be in (natural)
logarithmic form
• Terminology
– Economists often use log and ln interchangeably
– Δ is shorthand for a change
• Why use log transformations of the data?
– The most traditional reason is that a variable that is not normally distributed
in levels may be normally distributed after a log transformation
∗ Thus, technically speaking, the OLS normality assumption will be satisfied; but the normality assumption can be relaxed when we are using a regression model to describe relationships in the sample (which is typically the case), rather than trying to predict individual data points
– The most relevant reasons focus on practicality
∗ Log transformation can make additive and linear models make more sense: a multiplicative model on the original scale corresponds to an additive model on the log scale
∗ Difficult to interpret level effects; easier to interpret percent changes
∗ Log transformation can lead the regression model to better explain the variation in the data (as measured by R²)
• One challenge in using a log transformation is when the data include zeros
– If the dependent variable includes zeros then we could use model (2)
∗ Could also use a limited dependent variable model such as Logit, Probit, or Poisson
– Use a different transformation of the data: ln(original data + small number), where the “small number” could be 1 if the data are in the millions, or e.g. 10⁻⁷ if the data are on a smaller scale
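A small sketch of this shifted-log transformation; the data values and shift constants below are illustrative assumptions:

import numpy as np

sales = np.array([0.0, 120.0, 3400.0, 0.0, 56000.0])  # large-scale data with zeros
log_sales = np.log(sales + 1)                         # shift by 1 before logging

rates = np.array([0.0, 0.002, 0.015])                 # small-scale data with zeros
log_rates = np.log(rates + 1e-7)                      # use a much smaller shift
print(log_sales, log_rates)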
1.5 Failure of Model Assumptions
In this class we will be focusing on selection bias
• Selection bias occurs when those observations in the sample that receive the “treatment” are different on observable characteristics (e.g. other X variables) than those observations that don’t receive treatment
• If there is selection bias, we can think of the observations as selecting into treatment
based on other characteristics
• The estimated β̂ parameter on the RHS treatment variable will be biased, e.g. E[β̂] ≠ β, when there is selection into treatment
1.5.1 Assumption 2 Fails
If Assumption 2 fails so that the estimation sample is non-random, then you will often have
to worry about selection bias
An Example
• You are interested in the effect of (potentially) performance enhancing drugs on academic performance
• Your causal question of interest is what effect taking a prescription-level dose of Ritalin shortly before the test has on a student’s SAT score
• Population Model: SATscore = β0 + β1 1(TakesRitalin) + Controls + u
Notation:
- 1() is an indicator function where the variable = 1 if what is inside the
parentheses is true and = 0 otherwise
- “Controls” is just shorthand for the other control variables that affect educational performance (along with their parameters)
• 2 Possible Approaches
1. Exactly control for all factors that affect education: Not Possible!
2. Use an experiment or experimental-type setting where we can think of Ritalin
as randomly assigned
• “Ideal Experiment”
1. Sign in for the SATs and be given one of two pills (Ritalin or Placebo) that you must swallow in front of an SAT test administrator
2. This type of experiment is not possible for ethical and legal reasons
• The usual alternative approach
1. Get survey data on high school students who took the SAT, where the survey asks about Ritalin use and a bunch of other stuff
2. Control for all information in the survey
3. Cross fingers and hope for no selection bias, so that after including the control variables, Ritalin use is as good as randomly assigned
Diagnostics for Selection Bias
• Ultimately, absent a truly random experiment, it is very difficult to know for sure that your model estimates do not suffer from selection bias
• However, there are two relatively simple things you can do as a researcher to get a
sense of the potential selection bias
1. Compare whether the values of the non-treatment RHS variables are similar based on whether the observation receives treatment (see the sketch after this list)
– It is important to compare RHS variables that are predetermined and/or not affected by whether the observation received treatment
– The most common way to conduct this comparison is to calculate the mean of each RHS variable for both groups (treated and non-treated) in the sample and to provide evidence for whether the means are similar using e.g. a t-test
– Another way to make the comparison is using figures that show the entire distribution of the values (not just the mean) of each RHS variable separately for both groups
– If the treated and non-treated groups are similar along the observable variables: the argument is stronger that there are no unobserved variables that differ between the groups and that would lead to selection bias
– If the treated and non-treated groups are not similar along one or more observable variables: there is greater concern that the coefficient of interest will be biased
2. Estimate different models that range from a parsimonious specification to one that includes all of the available control variables
– What we would like to see is that the estimated coefficient on the parameter of interest is mostly “stable” across the model specifications
– If adding additional control variables does not have much of an effect on the estimated coefficient: this increases our confidence that the unobserved factors that are not controlled for in the model are not leading to selection bias
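A minimal sketch of diagnostic (1) on simulated data (the treatment indicator and covariate are illustrative assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
treated = rng.integers(0, 2, 300).astype(bool)   # hypothetical treatment indicator
age = rng.normal(17, 1, 300)                     # a predetermined RHS variable

m1, m0 = age[treated].mean(), age[~treated].mean()
t_stat, p_value = stats.ttest_ind(age[treated], age[~treated])
print(f"treated mean {m1:.2f}, control mean {m0:.2f}, p = {p_value:.3f}")
# A large p-value is consistent with balance on this observable;
# repeat the comparison for each predetermined RHS variable.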
1.5.2 Assumption 3 Fails

1. If Assumption 3 fails then the model is characterized by perfect multicollinearity
2. This is really just a technical assumption that is necessary to mechanically run OLS
3. Recall that there are two parts to this assumption
(a) No variable is constant
(b) We can’t write one variable as a linear combination of others
An Example
Suppose you are interested in the role that money has in getting politicians elected. You estimate the following county-level regression model using the 2016 presidential 2-way (Democrat and Republican) vote share results.
The population model: Y = β0 + β1 X1 + β2 X2 + β3 X3 + u
Y : Votes for President Trump
X1 : Dollars spent by President (then candidate) Trump
X2 : Dollars spent by Secretary Clinton
X3 : Total dollars spent by the two candidates
Q: What is the problem with this model?
– The problem is that X3 is a linear combination of X1 and X2
– We are unable to run this model because the parameters are undefined
Q: How could we adjust the model?
1. We could simply drop one of the variables, but this wouldn’t allow us to separately estimate each of the three factors
2. Redefine X3 ≡ (TotalSpent)²; now X3 is not a linear combination of X1 and X2
Failure of Assumption 4: E[u|X1 , . . . , Xk ] ≠ 0
Assumption 4 can fail if there is:
1. An omitted variable
2. Measurement error in the independent variable
Omitted Variable Bias
• Suppose the true population model is:

(1) Y = β0 + β1 X1 + β2 X2 + u

But instead we use the model:

(2) Y = β0 + β1 X1 + u.

Estimating Model (1): Ŷ = β̂0 + β̂1 X1 + β̂2 X2
Estimating Model (2): Ỹ = β̃0 + β̃1 X1
Q: When will β̂1 = β̃1 ?
• The relationship is determined by the following equation:

β̃1 = β̂1 + β̂2 · δ̃1

• The β̃1 from the incorrect model (2) equals the sum of
(i) β̂1 : the real effect of X1
(ii) Bias Term: β̂2 · δ̃1
• The bias term accounts for the effect of β̂2 that is being incorrectly picked up in β̃1 in the simple regression model (2)
• To get the bias term, run the following simple regression with no intercept:

X2 = δ1 X1 + u

And the predicted dependent variable: X̃2 = δ̃1 X1
• Interpretation of the bias term:
- β̂2 is the effect in the true model of X2 on Y
- δ̃1 is the correlation between X1 and X2
• In other words, the bias term depends on:
(a) The importance of the omitted variable in explaining the outcome
(b) The relationship between the two X’s
Q: If the importance of the omitted variable increases, what happens to the bias of β̃1 ?
⟹ Bias increases
Q: If the correlation between X1 and X2 increases, what happens to the bias of β̃1 ?
⟹ Bias increases
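A minimal sketch of the omitted variable bias identity on simulated data (all coefficient values are illustrative assumptions; the auxiliary regression here includes an intercept so that the identity holds exactly in the sample):

import numpy as np

rng = np.random.default_rng(5)
n = 1000
X1 = rng.normal(size=n)
X2 = 0.7 * X1 + rng.normal(size=n)                # X1 and X2 correlated
Y = 1.0 + 2.0 * X1 + 3.0 * X2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_true = ols(np.column_stack([ones, X1, X2]), Y)  # model (1): beta1_hat, beta2_hat
b_short = ols(np.column_stack([ones, X1]), Y)     # model (2): beta1_tilde
d = ols(np.column_stack([ones, X1]), X2)          # auxiliary: X2 on X1
bias = b_true[2] * d[1]                           # beta2_hat * delta1_tilde
print(b_short[1], b_true[1] + bias)               # identical, by the identity above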
1.6 Variance of OLS Estimator

1.6.1 The variance of our estimated slope parameter

Var[β̂j ] = σ² / [ Σ_{i=1}^{n} (Xij − X̄j )² (1 − Rj²) ]

• σ² comes from the assumption of homoskedasticity
• The Σ_{i=1}^{n} (Xij − X̄j )² term is essentially the Var[Xj ]
• The (1 − Rj²) term captures the correlation between the jth variable and all the other independent variables:
(i) Rj² is the R² from the regression

Xj = β0 + β1 X1 + β2 X2 + · · · + β_{k−1} X_{k−1} + u

(ii) Rj² is the proportion of the total variation in Xj explained by all the other X’s
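A minimal sketch that computes the pieces of this variance formula on simulated data and checks the implied standard error against statsmodels (data and coefficients are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 400
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
Y = 1.0 + 1.0 * X1 + 1.0 * X2 + rng.normal(size=n)

res = sm.OLS(Y, sm.add_constant(np.column_stack([X1, X2]))).fit()

aux = sm.OLS(X1, sm.add_constant(X2)).fit()   # auxiliary regression for R_1^2
R2_1 = aux.rsquared

sigma2_hat = res.ssr / (n - 3)                # k + 1 = 3 estimated parameters
var_b1 = sigma2_hat / (np.sum((X1 - X1.mean()) ** 2) * (1 - R2_1))
print(np.sqrt(var_b1), res.bse[1])            # the two standard errors match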
1.6.2 We would like Var[β̂j ] to be low. When will Var[β̂j ] be low?

1. Smaller σ² leads to a smaller Var[β̂j ]
Intuition: σ² measures the (conditional) variance of the error term. The error term represents those factors correlated with the dependent variable not in the model. If σ² is low, then this implies that factors outside of the model have a limited influence on the dependent variable. Thus, we would expect our model estimates to be more precise.
Role of researcher: We can usually make σ² smaller by adding more variables to the regression so that fewer important variables are in the error term. For example, this is often why control variables are included in models even when the model is for a randomized experiment.
2. Larger Σ_{i=1}^{n} (Xij − X̄j )² leads to smaller Var[β̂j ]
Intuition: OLS fits a regression “line” across the range of the X values. If the data do not span the range of X values then we are, by definition, relying more on extrapolation. Whenever there is extrapolation we would expect the precision of our estimates to decrease. A larger Var[Xj ] implies that there is less need for extrapolation.
Role of researcher: We can (typically) increase the Var[Xj ] by increasing the sample size, or by choosing a same-sized sample with larger Var[Xj ].
3. Larger (1 − Rj²) implies smaller Var[β̂j ]
• We would like Rj² to be smaller
• Rj² is smaller if we exclude some X’s
• If we exclude X’s that are important control variables then this might lead to omitted variable bias
• We can exclude X’s that are closely correlated with another RHS variable
1.6.3 We usually have to estimate σ²

V̂ar[β̂j ] = σ̂² / [ Σ_{i=1}^{n} (Xij − X̄j )² (1 − Rj²) ]

We defined ûi ≡ Yi − Ŷi , so we can write:

Yi = Ŷi + ûi ⟺ Yi = β̂0 + β̂1 Xi + ûi
⟺ β0 + β1 Xi + ui = β̂0 + β̂1 Xi + ûi
⟺ ûi = ui − (β̂0 − β0 ) − (β̂1 − β1 )Xi

This shows us that ûi ≠ ui . However, taking expectations:

E[ûi ] = E[ui ] − E[β̂0 − β0 ] − E[(β̂1 − β1 )Xi ] = E[ui ]

• If we knew ui , a sample estimator for E[u²] would be Σ_{i=1}^{n} ui² / n.
• What if we plug in ûi in place of ui ? Σ_{i=1}^{n} ûi² / n
• It turns out that this is a biased estimator of σ² because there are restrictions on the values ûi can take given that

Σ_{i=1}^{n} ûi = 0 and Σ_{i=1}^{n} Xi ûi = 0

• These two restrictions imply two fewer degrees of freedom (if we know all but two of the ûi we can compute the other two)
• Unbiased estimator:

σ̂² = s² = [1/(n − 2)] Σ_{i=1}^{n} ûi² = (1/(n − 2)) · SSR

1.7 Quadratics and Interaction Terms

1.7.1 Models With Interaction Terms
The partial effect (correlation after controlling for other variables) of one variable on the dependent variable can depend on the magnitude of another variable.
Example (Wooldridge, p197)
How much do various attributes of a house affect the price of the house?

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + u

where
Y ≡ Price of House
X1 ≡ Square Feet
X2 ≡ Bedrooms
X3 ≡ Sq. Feet * Bedrooms
X4 ≡ Bathrooms

• X3 is the interaction variable
• Interpretation (Level-Level Model): ΔPrice = [β2 + (β3 · sqft)] · ΔBedrooms
• Technically, β2 is the effect of bedrooms when sqft = 0. This is impossible!
- We need to remember that (like our intercept term) sometimes a literal interpretation at Xj = 0 does not make sense
- The good news is that there will never be a house with 0 square feet. So, we should not have to worry about this interpretation for this model.
• Statistical Significance and Testing Coefficients
- In the above example, if we test the coefficients separately (H0 : β2 = 0 and H0 : β3 = 0), we could get a case where we fail to reject both H0
- If we care about the overall effect of the number of bedrooms on the housing price, then we would want to do an F -test with H0 : β2 = 0, β3 = 0
- Sometimes our “coefficient of interest” is on the interaction term (i.e. our research hypothesis is best answered by this coefficient); then a t-test is appropriate
1.7.2 Regression Models with Quadratics

• Quadratic functions are used to capture decreasing or increasing marginal effects
• Consider the following model:

Y = β0 + β1 X + β2 X² + u

We can approximate how a change in X affects Y as:

ΔŶ ≈ (β̂1 + 2 β̂2 X) · ΔX ⟺ ΔŶ /ΔX ≈ β̂1 + 2 β̂2 X.

• It is possible that the two terms have different signs
Example (Wooldridge, p197)
How much does work experience affect your salary?

ŵage = 3.73 + 0.298 Exper − 0.0061 Exper²
       (0.35)  (0.041)       (0.0009)

So, Δŵage = (0.298 − 2 · 0.0061 · Exper) ΔExper

Numerical Calculations:
- Going from 0 → 1 year of experience: Δŵage = $0.298 or 29.8 cents
- Going from 10 → 11 years of experience: Δŵage = 0.298 − 2 · 0.0061 · 10 ⟹ Δŵage = $0.176 or 17.6 cents
- At some point (i.e. years of experience), an increase in years of experience is predicted to decrease wage. We can find this turning point, provided the β̂’s have different signs, as: 0.298/(2 · 0.0061) = 24.4 years of experience.
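A tiny sketch of these calculations using the coefficient values quoted above:

b1, b2 = 0.298, -0.0061   # coefficient values from the example

def marginal_effect(exper):
    # Approximate change in predicted wage per extra year of experience
    return b1 + 2 * b2 * exper

print(marginal_effect(0))    # 0.298  -> 29.8 cents going from 0 to 1 year
print(marginal_effect(10))   # 0.176  -> 17.6 cents going from 10 to 11 years
print(-b1 / (2 * b2))        # about 24.4 years: the predicted turning point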
1.8 Indicator Variables
• Can use when we have qualitative information (gender, race, happiness, etc.)
• Can use to make the model more flexible and less parametric
1.8.1 Indicators to Represent Qualitative Information
Example
How does gender (here defined as binary: male and female) correlate with the wage paid?
(1) Consider the following simple model for a worker’s wage:

wage = β0 + δ0 female + β1 education + u

- The model assumes that the wage a worker is paid depends on two factors: education (measured in years) and whether you are male/female
- female is an indicator variable that = 1 if female and = 0 if not female (i.e. male)
- We could think of this equation in terms of expectations
Assume: E[u|female, educ] = 0 (i.e. Assumption 4). Then:

δ0 = E[wage|female = 1, educ] − E[wage|female = 0, educ]
   = E[wage|female, educ] − E[wage|male, educ]

- Since we condition on education in each expectation, the difference is just the effect due to gender
- Intuitively, including an indicator variable allows for wage to be (sadly) higher for males than females at every level of education
[Draw picture with 2 regression lines for male and female]
(2) Now consider a model for wage that interacts the female indicator:

wage = β0 + δ0 female + β1 education + δ2 (education ∗ female) + u

- Model differs from (1) only by the new interaction variable: education ∗ female
- This model allows for the correlation of wage and education to depend on whether the worker is male/female
[Draw picture with 2 regression lines for male and female]
1.8.2 Indicator Variables for a Model with Multiple Categories
• Suppose your data are categorical data that involve several possible responses
• The two most basic ways to model these data are to include the data as a single variable, or to estimate a model with multiple indicator variables representing the different possible responses
Example: (Survey) asks 2 questions
(1) How happy were you with your last product by this company? 1 = Extremely unhappy, 2 = unhappy, 3 = ok, 4 = happy, 5 = very happy.
(2) What is the percent chance that you will purchase a product by this company in the
future: 0.00 to 1.00.
Model 1: Single Categorical Variable Model
• We could write this model as:

Y = β0 + β1 X1 + u

Y : The likelihood that you purchase the product again
X1 : Variable for how happy the customer was (values 1, 2, . . . , 5)
• Interpretation: β̂1 is the effect of a one-unit increase on the happiness scale on the predicted likelihood of buying the product again in the future
• β̂1 assumes a constant marginal effect
- An increase on the happiness scale from 1 to 2 is the same as from 2 to 3, etc.
- This is an implicit restriction on the model
Model 2: Multiple Categorical Variable Model
• Define the following variables:
X1 = 1 if very unhappy , = 0 if not very unhappy
X2 = 1 if unhappy , = 0 if not unhappy
X3 = 1 if happy , = 0 if not happy
X4 = 1 if very happy , = 0 if not very happy
• The new model becomes:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + u

• This model does not restrict the relationship between the estimated coefficients, and thus does not force a constant relationship between the likelihood of purchase and product happiness
Q: What is the interpretation of β̂1 ?
β̂1 is the correlation between purchasing a future product and being “very unhappy” with the last purchase, relative to the customers who were OK with their last purchase
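A minimal sketch of Model 2's indicator construction on simulated data (the category codes, sample size, and outcome equation are illustrative assumptions; the "ok" category is the omitted baseline):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
happy = rng.integers(1, 6, 500)   # survey codes 1..5
prob_buy = 0.2 + 0.1 * (happy - 1) / 4 + rng.normal(0, 0.05, 500)

levels = [1, 2, 4, 5]             # one indicator per category; 3 ("ok") is omitted
X = np.column_stack([(happy == lv).astype(float) for lv in levels])
res = sm.OLS(prob_buy, sm.add_constant(X)).fit()
print(res.params)                 # each slope is relative to the "ok" reference group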
1.9 Fixed Effects
• Panel data allows for the use of fixed effects in a regression model
– Recall that panel data implies that there are repeated observations at different times for the same underlying unit
– In other words, there is a time dimension (e.g. daily, monthly, yearly, etc.) and a unit dimension (e.g. person, city, country, etc.)
• Write the regression equation for a panel with 2 years of data for each unit:

Yit = β0 + δ0 X1t + β1 X2it + ai + uit

– i indexes the unit dimension, t indexes the time dimension
– X1t is an indicator variable = 1 if the observation is in the 2nd year
– X2it is a variable that varies between individuals and over time
– ai + uit is the composite error term with two parts
– ai captures variables not in the model that vary only between units, but not over time (e.g. birthplace)
– uit captures variables not in the model that vary between units and over time (e.g. college GPA for current college students)
• Model is called the unobserved effects model or fixed effects model
• In the composite error, ai is the fixed effect (or unobserved heterogeneity) and uit is the time-varying error (or idiosyncratic error)
1.9.1 Unit and Time Fixed Effects

• The key advantage of fixed effects models is that a researcher can control for unobserved factors that could otherwise lead to bias in the estimation of the model parameters
• The way that the math works is as follows:
– Rewrite the above model (for simplicity we exclude the time indicator variable):

(1) Yit = β1 X1it + ai + uit ,  t = 1, 2, . . . , T

– Take the time average for each unit i:

(2) Ȳi = β1 X̄1i + ai + ūi ,  where e.g. Ȳi = (1/T ) Σ_{t=1}^{T} Yit

– Subtract (2) from (1):

(3) Yit − Ȳi = β1 (X1it − X̄1i ) + (uit − ūi ),  t = 1, 2, . . . , T
    or Ẏit = β1 Ẋ1it + u̇it

where the notation Ẏit just means time-demeaned
- Estimate (pooled) OLS on equation (3)
Notes:
1. Modern statistical programs such as Stata and R will do all of the math for you
2. The key OLS assumption for unbiased coefficient estimates is now: E[u̇it |Ẋit ] = 0
- The key advantage of including unit fixed effects is that the number of omitted factors that could potentially bias our estimates is essentially cut in half
- Only unit-level factors omitted from the model that vary over time for the same units can lead to bias
- We can effectively control for all of the fixed unit-level factors without having to gather the data or even know which factors are important!
3. Of course, E[u̇it |Ẋit ] = 0 could still fail if omitted time-varying factors are correlated with the X’s
4. The correlations we want to measure in the model must involve X’s that vary over time; otherwise these factors will be differenced away (and not estimated)
5. Researchers sometimes refer to the fixed effect model as using (only) within-unit variation to estimate the question of interest
6. If one or more of the Xit are measured with error, then a fixed effect model could make measurement error worse (and measurement error makes the bias in the β̂’s worse)
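A minimal sketch of the time-demeaning (within) transformation on a simulated balanced panel (the data-generating values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(8)
n_units, T = 100, 2
i = np.repeat(np.arange(n_units), T)            # unit index for each row
a = rng.normal(0, 2, n_units)[i]                # fixed effect, constant over t
X = 0.5 * a + rng.normal(size=n_units * T)      # X correlated with a_i
Y = 1.5 * X + a + rng.normal(size=n_units * T)

def demean_by_unit(v):
    means = np.bincount(i, weights=v) / T       # unit-specific time averages
    return v - means[i]

Yd, Xd = demean_by_unit(Y), demean_by_unit(X)
beta_fe = np.sum(Xd * Yd) / np.sum(Xd ** 2)     # pooled OLS on demeaned data
print(beta_fe)                                  # near 1.5 even though a_i is omitted

A plain pooled OLS of Y on X here would be biased because X is correlated with the omitted a_i; the demeaning removes a_i entirely.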
1.9.2 Applying Fixed Effects to Other Data Structures

1. We can apply fixed effect estimation techniques to many types of data structures to eliminate unobserved group or “cluster” fixed effects
2. Can include more than one type of “cluster” fixed effect in the same model
3. Key requirements:
(1) Need more than one cluster in the dataset (e.g. if the cluster is a school, then need observations from more than one school)
(2) Need more than one observation within each different cluster in the dataset (e.g. need to have data on more than one student in the same school)
Example
What is the effect of attending Headstart on child academic outcomes?
• Econometric model: Ysft = β1 X1s + β2 X2sft + αf + γt + usft
– There are now three subscripts to keep track of the structure of the data: s for student, f for family, t for year
– Ysft : test score at the end of 1st grade for student s in family f during year t
– X1s : whether the student attended Headstart prior to entering 1st grade
– X2sft : control variable that varies by s, f , and t
– αf : family fixed effect
– γt : year fixed effect
• Using family fixed effects controls for unobserved and unchanging variables at the family level (e.g. characteristics about the parents that are fixed over the time period, characteristics about the home environment that are fixed over the time period)
• It is important to remember that we can only include family fixed effects in this model if there is more than one sibling from each family in the dataset
1.10 First Difference Estimation
• Motivation: We may be worried that the correlation between the composite error term and the X’s is not equal to zero, i.e. E[(ai + uit )|Xit ] ≠ 0
• This would lead to bias in our estimated coefficients
• Let the regression be: Yit = β0 + β1 X1it + ai + uit
• Looking just at Period 1: Yi1 = β0 + β1 X1i1 + ai + ui1
• Looking just at Period 2: Yi2 = β0 + β1 X1i2 + ai + ui2
• Subtracting Period 1 from Period 2 for each observation i, we get:

Yi2 − Yi1 = β1 (X1i2 − X1i1 ) + (ui2 − ui1 ) ⟺ ΔYi = β1 ΔX1i + Δui .

We call β̂1 our “First-Difference Estimator” for the correlation between Y and X1
• Key: The unobserved ai has been “differenced away”, so we no longer need to worry about these omitted factors biasing our coefficient estimates
Notes:
1. First Differencing is most commonly used in analyses using time series data (i.e. datasets with many observations at different time intervals on few underlying units)
2. We must have variation over time in the X’s; otherwise these factors will be differenced away. Dummy variables for race, gender, birth place, etc. drop out of the first-differenced model
3. Sometimes when we difference the two years, there might not be much variation left in the variables, which leads to larger SEs. This can be a major drawback of First Differencing.
4. When organizing panel data so that you can first difference, stack the rows by unit and then by time:

Observation        Data
Person 1, t = 1     ·
Person 1, t = 2     ·
Person 2, t = 1     ·
Person 2, t = 2     ·
...                 ...
5. As with the fixed effect model, if one or more variables are measured with error, then first differencing can make measurement error worse (and measurement error makes the bias in the β̂’s worse)
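A minimal sketch of first differencing with T = 2 on simulated data (all values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(9)
n = 200
a = rng.normal(0, 2, n)                          # unobserved unit effect a_i
X1 = rng.normal(size=n) + 0.5 * a                # period-1 X, correlated with a_i
X2 = rng.normal(size=n) + 0.5 * a                # period-2 X, correlated with a_i
Y1 = 1.0 + 2.0 * X1 + a + rng.normal(size=n)     # period 1
Y2 = 1.0 + 2.0 * X2 + a + rng.normal(size=n)     # period 2

dY, dX = Y2 - Y1, X2 - X1                        # first differences
beta_fd = np.sum(dX * dY) / np.sum(dX ** 2)      # OLS on the differenced data
print(beta_fd)                                   # near 2.0; a_i has been differenced away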
Choosing Between Fixed Effects and First Differencing
• When T = 2, First Differencing and Fixed Effects are the same
• When T ≥ 3, the FE estimator will have smaller standard errors, provided there is no serial correlation
– Serial Correlation occurs when the error terms are correlated for the same unit across time
– If there is Serial Correlation, then the usual formula for the standard errors is incorrect and hypothesis tests based on it will be misleading
• If n is relatively small and T is relatively large so that the dataset is a time series dataset, then we will want to First Difference (for example: n = 10, T = 30, and therefore N = 300)
1.11 Published Research Papers
1. Undergraduate Econometrics Instruction: Through Our Classes, Darkly (Joshua D. Angrist and Jorn-Steffen Pischke, Journal of Economic Perspectives, 2017).
2. Design-Based Research in Empirical Microeconomics (David Card, American Economic Review, 2022).
3. Estimating Safety by the Empirical Bayes Method: A Tutorial (Ezra Hauer, Douglas
W. Harwood, Forrest M. Council, and Michael S. Griffith, Transportation Research
Record, 2002).
4. Criminal Deterrence When There Are Offsetting Risks: Traffic Cameras, Vehicular Accidents, and Public Safety (Justin Gallagher and Paul J. Fisher, American Economic Journal: Economic Policy, 2020).
5. Will Studying Economics Make You Rich? A Regression Discontinuity Analysis of the
Returns to College Major (Zachary Bleemer and Aashish Mehta, American Economic
Journal: Applied Economics, 2022).
6. The Righteous and Reasonable Ambition to Become a Landholder: Land and Racial
Inequality in the Postbellum South (Melinda C. Miller, Review of Economics and
Statistics, 2019).
7. What Drives Racial and Ethnic Differences in High Cost Mortgages? The Role of High Risk Lenders (Patrick Bayer, Fernando Ferreira, and Stephen L. Ross, The Review of Financial Studies, 2017). [Note: Focus on Sections 1-3 and 6.]
8. The Impact of Jury Race in Criminal Trials (Shamena Anwar, Patrick Bayer, and
Randi Hjalmarsson, Quarterly Journal of Economics, 2012).
9. Intended and Unintended Consequences of Youth Bicycle Helmet Laws (Christopher
Carpenter and Mark Stehr, Journal of Law and Economics, 2011).