Chapter 10
Identification and Data Assessment
© 2019 McGraw-Hill Education. All rights reserved. Authorized only for instructor use in the classroom. No reproduction or distribution without the prior written consent of McGraw-Hill Education
Learning Objectives
1. Explain what it means for a variable’s effect to be identified in a model
2. Explain extrapolation and interpolation and how each inherently suffers
from an identification problem
3. Distinguish between functional form assumptions and enhanced data
coverage as remedies for identification problems stemming from
extrapolation and interpolation
4. Differentiate between endogeneity and types of multicollinearity as
identification problems due to variable co-movement
5. Articulate remedies for identification problems and inference challenges
due to variable co-movement
6. Solve for the direction of bias in cases of variable co-movement
Assessing Data via Identification
• The table below shows a subsample of rocking chair data
• Your goal is to estimate the average treatment effect of price on
sales. On average, when price increases by $1, what is the effect on
sales of rocking chairs?
Assessing Data via Identification
• A parameter (e.g., β) is identified within a given model if it can be
estimated with any level of precision given a large enough sample
from the population
• Suppose we assume the data-generating process as:
Salesi = α + βPricei + Ui
• Within this model, we are interested in accurately estimating β.
• A parameter is identified if, for a given confidence level K (< 100%)
and a given “length” L, we can build a confidence interval that
contains β with length less than L and confidence level K, given a
large enough sample of data
Assessing Data via Identification
Identification Example
• Define p as the probability of rolling a 3 on any single roll of a
die.
• Define X to be the number of 3s observed on a single roll of the die
(X = 1 for a roll of 3 and X = 0 for any other number), so E[X] = p
• It can be shown that Var[X] = p(1 – p). Using this framework, the
parameter p is identified
• We can estimate p as precisely as we want given enough data on
the roll of the die (given enough rolls of the die)
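A minimal simulation sketch (not from the text) of this point in Python; the die probabilities and sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 1 / 6  # probability of rolling a 3 on a fair die

for n_rolls in (50, 5_000, 500_000):
    rolls = rng.integers(1, 7, size=n_rolls)      # simulate n_rolls fair-die rolls
    x = (rolls == 3).astype(float)                # X = 1 if the roll shows a 3
    p_hat = x.mean()                              # sample mean estimates p
    se = np.sqrt(p_hat * (1 - p_hat) / n_rolls)   # CLT-based standard error
    print(f"N={n_rolls:>7}: p_hat={p_hat:.4f}, approx. 95% CI half-width={1.96 * se:.4f}")
```

As N grows, the confidence interval can be made as short as desired, which is what it means for p to be identified.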
Assessing Data via Identification
• The fact that p is identified follows directly from the central limit
theorem
• Suppose the die is rolled N times. Define x1 as the observed value
of X for the first roll, x2 for the second, and so on.
• Then define X̄N = (1/N)(x1 + x2 + ⋯ + xN) = (1/N) Σ xi, the sample
mean of X, or equivalently, the proportion of the N rolls that showed
a 3
• Given these definitions, the central limit theorem states that
X̄N ~ N(p, p(1 − p)/N) as N gets large
Assessing Data via Identification
Figure: Distribution of the sample mean of X for N = 50 and N = 5,000
Extrapolation and Interpolation
Note how the variables Sales and Price move together in the price
range of $210 to $225 and in the price range of $275 to $300.
Extrapolation and Interpolation
• Suppose we want to know how Sales move with Prices in other
price ranges
• Interpolation involves drawing conclusions where there are
“gaps” in the data
• A data gap is any place where data are missing for a variable
over an interval of values, but data are not missing for
at least some values on both ends of the interval
• Extrapolation involves drawing conclusions beyond the extent
of the data
Identification Problems
• Must be considered when engaging in interpolation and/or
extrapolation
• The determining factor is whether the gap(s) in, or limited extent of,
the data are due to random limitations of the sample or to
limitations of the population
• If it is the former, there may be no identification problem
• If it is the latter, then there is an identification problem that
must be addressed
Identification Problems
• Attempt to draw f(.)
and g(.) without any
mathematical formulas
We are attempting to interpolate (fill in the data gap) and attempting
to extrapolate (extend beyond the data’s range).
Identification Problems
• When interpolation or extrapolation is used to fill in gaps or the
limited extent of the data in the sample, but not in the population,
there is not an identification problem
• When interpolation or extrapolation is used to fill in gaps or the
limited extent of the data in the population, there is an
identification problem
• No matter how much data we collect from the population, it
will not help us draw conclusions about what is happening
in the unobserved range(s)
Remedies
• Suppose you want to engage in interpolation and/or
extrapolation when there exists an identification problem
• For a general model of the data-generating process, where no
assumptions are made about the determining function, sampling
more data from the population will not solve the problem
• There are two key approaches toward solving this type of
identification problem:
1. Changes in the population
2. A functional form assumption
Remedies: An Example
Changing the population to alleviate an identification problem
• A new singer has been promoting her music by selling physical
copies of her music at various high schools.
• She charges the same price to everyone and finds that seniors
buy the most often, freshmen the least, and sophomores and
juniors are in between
• This tells her that her sales appear to be increasing with the age
of her customers
• She would like to extrapolate this relationship beyond just high
school-aged kids
• Using only data from high schools, she has an identification
problem
Remedies
The figure illustrates possible
ways to extrapolate past age
18, but there are no data to
sort through the options.
A clear remedy for this identification problem would be to try selling
her music at colleges and collecting data on her sales performance
among this group.
This simple expansion of the population would alleviate the
identification problem.
Remedies
Imposing a functional form assumption to alleviate an
identification problem
• Standard practice is to assume a functional form of the
determining function that applies for all relevant price levels
• Assume a data-generating process with a linear functional form
for the determining function: Salesi = α + βPricei + Ui
• This assumption not only imposes a linear shape on the relationship
between Sales and Price, but also dictates how to interpolate
and/or extrapolate
Regression Line for Rocking Chair Sales and
Price Data
❑ Here, we are estimating α and β using only data with Price in the
ranges ($210, $225) and ($275, $300).
❑ We are applying these estimated values across many other price
levels.
❑ We are using these values to interpolate between $225 and $275
and to extrapolate all the way to $350.
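A minimal sketch (not from the text) of this approach in Python; the simulated Sales data and coefficient values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate rocking-chair data observed only in two price ranges (assumed values)
prices = np.concatenate([rng.uniform(210, 225, 40), rng.uniform(275, 300, 40)])
sales = 500 - 1.2 * prices + rng.normal(0, 10, size=prices.size)  # assumed linear DGP

# Estimate alpha and beta by OLS using only the observed price ranges
X = np.column_stack([np.ones_like(prices), prices])
alpha_hat, beta_hat = np.linalg.lstsq(X, sales, rcond=None)[0]

# The linear functional form dictates how to interpolate ($250) and extrapolate ($350)
for p in (250, 350):
    print(f"Predicted sales at price ${p}: {alpha_hat + beta_hat * p:.1f}")
```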
Variable Co-Movement
• Another circumstance in which identification problems
typically arise is when there is variable co-movement in the
population
• We use the broader term “co-movement” rather than
correlation, since simple correlation alone does not encompass
all the ways variables may move together in a population that
result in identification problems
Variable Co-Movement
• Three types of variable co-movement:
1. Perfect multicollinearity
2. Imperfect multicollinearity
3. Endogeneity
Variable Co-Movement
• Consider the following data-generating process:
Yi = α + β1X1i +…+ βKXKi + Ui
• Use regression analysis to estimate α, 𝛽1 , 𝛽2,…, 𝛽𝐾
• We have assumed a functional form, so as long as there is
some variation in 𝑋1 , 𝑋2,…, 𝑋𝐾 there will not be identification
problems stemming from voids in the data
• There may still be an identification problem when there is
co-movement among the Xs and/or co-movement between
one or more of the Xs and U
Variable Co-Movement
• Perfect multicollinearity is a condition in which two or more
independent variables have an exact linear relationship
• If we can write 𝑋1𝑖 = 𝑐 + 𝑑𝑋2𝑖 for all i, there is perfect multicollinearity
• Perfect multicollinearity in our model is equivalent to being
able to express 𝑑1 𝑋1𝑖 + 𝑑2 𝑋2𝑖 + ⋯ + 𝑑𝐾 𝑋𝐾𝑖 = 𝑐 for all i in the
population
• Perfect multicollinearity implies a special type of correlation
among two or more independent variables
Variable Co-Movement
• Imperfect multicollinearity is a condition in which two or more
independent variables have nearly an exact linear relationship
• When this condition exists for a data-generating process, we
cannot express 𝑑1 𝑋1𝑖 + 𝑑2 𝑋2𝑖 + ⋯ + 𝑑𝐾 𝑋𝐾𝑖 = 𝑐 for all i in the
population
• Imperfect multicollinearity is equivalent to there being at least
one semi-partial correlation that is “high” – nearly equal to 1
• It is common to characterize a correlation above 0.8 as high
Variable Co-Movement
Endogeneity, in the context of identification problems, involves
co-movement between one or more independent variables and the
error term in a data-generating process
Identification Problems
• Perfect multicollinearity always leads to an identification problem
in regression analysis
• As an example, suppose we believe that Sales of rocking chairs
depend not only on Price, but also on Distance from the designer’s
location
• We assume the data-generating process: Salesi = α + β1Pricei +
β2Distancei + Ui
• The population from which we are drawing suffers from perfect
multicollinearity, creating an identification problem, particularly for
β1 and β2
Perfect Multicollinearity
• The presence of perfect multicollinearity is clear, since we can write
one independent variable as a linear function of another for every
element in the population: Pricei = 200 + 0.04 × Distancei
• The identification problem comes from the fact that we cannot
separately estimate β1 and β2 – the marginal effects of Price and
Distance on Sales
• Substituting, the data-generating process becomes:
Salesi = α + β1(200 + 0.04 × Distancei) + β2Distancei + Ui
Salesi = (α + 200β1) + (0.04β1 + β2)Distancei + Ui
so only the composite coefficient (0.04β1 + β2) on Distance can be estimated
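A minimal sketch (not from the text), using simulated data with the exact linear relationship above, showing why β1 and β2 cannot be separately estimated; the numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

distance = rng.uniform(0, 3_000, size=200)
price = 200 + 0.04 * distance  # exact linear relationship -> perfect multicollinearity
sales = 300 - 0.8 * price - 0.01 * distance + rng.normal(0, 5, size=200)  # assumed DGP

X = np.column_stack([np.ones_like(price), price, distance])

# Only 2 of the 3 columns are linearly independent, so beta_1 and beta_2
# are not separately identified.
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
```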
Perfect Multicollinearity
Three ways to detect perfect multicollinearity
1. A known linear relationship among two or more independent
variables
2. Recognize the misuse of dummy variables (see the sketch below)
3. Let the data reveal it
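A minimal sketch (not from the text) of the second and third routes: including a dummy for every category alongside an intercept (the dummy-variable trap) creates perfect multicollinearity, and the data reveal it through a rank-deficient design matrix. The categories are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Three regions, with a dummy for every region PLUS an intercept (dummy-variable trap)
region = rng.integers(0, 3, size=100)
dummies = np.eye(3)[region]                 # one column per region
X = np.column_stack([np.ones(100), dummies])

# The dummy columns sum to the intercept column, so the matrix is rank deficient
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
```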
Imperfect Multicollinearity
• Imperfect multicollinearity does not cause an identification
problem, but it can create challenges for inference
• Imperfect multicollinearity can generate inflated p-values and
confidence intervals, making it difficult to make any strong
inductive arguments about population parameters
• Because there is not an identification problem, these
challenges go away with enough data
Imperfect Multicollinearity: An Example
• To illustrate imperfect multicollinearity, suppose Price has a
near-perfect linear relationship with Distance:
Pricei = 200 + 0.04 × Distancei + Vi,
where Vi contains other factors such as local fuel costs, etc.
• A customer at a Distance of 2,000 miles might have a value for
V of 3 and so face a Price of 200 + 0.04 × 2,000 + 3 = $283
• A customer at a Distance of 400 miles might have a value for V
of ‒2 and so face a Price of 200 + 0.04 × 400 ‒ 2 = $214
• Price and Distance have imperfect multicollinearity
Imperfect Multicollinearity
• Assume the following data-generating process:
Salesi = α + β1Pricei + β2Distancei + Ui
• There is not perfect multicollinearity, so we can get estimates of
all the parameters when regressing Sales on Price and Distance
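A minimal simulation sketch (not from the text) of this setting, assuming the statsmodels package is available; the coefficient values and noise levels are illustrative assumptions. It shows that the estimates are obtainable but noisy in small samples and tighten as the sample grows:

```python
import numpy as np
import statsmodels.api as sm  # assumed available; any OLS routine would do

rng = np.random.default_rng(3)

def fit(n):
    distance = rng.uniform(0, 3_000, size=n)
    price = 200 + 0.04 * distance + rng.normal(0, 2, size=n)   # near-exact linear relation
    sales = 300 - 0.8 * price - 0.01 * distance + rng.normal(0, 5, size=n)  # assumed DGP
    X = sm.add_constant(np.column_stack([price, distance]))
    res = sm.OLS(sales, X).fit()
    return res.params[1], res.bse[1]   # estimate and std. error for the Price coefficient

for n in (100, 10_000):
    b, se = fit(n)
    print(f"N={n:>6}: beta_price_hat={b:.3f}, std. error={se:.3f}")
```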
Imperfect Multicollinearity
• Ways to check whether there is imperfect multicollinearity, and
thus the possibility that this condition is inflating p-values and
confidence intervals:
1. Calculate semi-partial correlations among independent
variables and check whether they are close to 1
2. Variance inflation factor (VIF)
Variance Inflation Factor (VIF)
• The variance inflation factor (VIF) for an independent variable—say,
X1—is equal to 1 / (1 − R²_X1), where R²_X1 is the R-squared from
regressing that independent variable (X1) on all of the other
independent variables (X2, …, XK) for a given determining
function
• A higher VIF for a given variable implies more noise (less
certainty) in its coefficient estimator
• VIF also tells us how much uncertainty this co-movement in the
Xs is injecting into our estimators
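A minimal sketch (not from the text) of computing the VIF for Price by hand with NumPy, continuing the simulated Price/Distance example above (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(4)

distance = rng.uniform(0, 3_000, size=1_000)
price = 200 + 0.04 * distance + rng.normal(0, 2, size=1_000)  # near-linear in Distance

# Regress Price on the other independent variables (here just Distance, plus an intercept)
Z = np.column_stack([np.ones_like(distance), distance])
coef = np.linalg.lstsq(Z, price, rcond=None)[0]
resid = price - Z @ coef
r_squared = 1 - resid.var() / price.var()

vif = 1 / (1 - r_squared)
print(f"R-squared = {r_squared:.4f}, VIF for Price = {vif:.1f}")
```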
Endogeneity as an Identification Problem
• Endogeneity can lead to estimators that are not consistent
• Assume the following data-generating process:
Yi = α + β1X1i +…+ βKXKi + Ui
and there is a non-zero correlation between X1 and U
• This correlation means the estimate of β1 from a regression of Y on
X1, …, XK need not be consistent
• This inconsistency of the estimate of β1 is what makes endogeneity
an identification problem
Example of Inconsistent Estimator
The figure shows the estimate of β1 approaching a number c ≠ β1 as
the sample gets large.
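A minimal simulation sketch (not from the text) of this behavior: X1 is correlated with the error term through an omitted factor, and the OLS estimate settles on a value different from the true β1 as N grows. All numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
beta1_true = 2.0

def ols_estimate(n):
    omitted = rng.normal(size=n)                 # unobserved factor
    x1 = 0.7 * omitted + rng.normal(size=n)      # X1 co-moves with the omitted factor...
    u = 1.5 * omitted + rng.normal(size=n)       # ...which also sits in the error term
    y = 1.0 + beta1_true * x1 + u
    X = np.column_stack([np.ones(n), x1])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

for n in (100, 10_000, 1_000_000):
    print(f"N={n:>9}: beta1_hat = {ols_estimate(n):.3f}  (true beta1 = {beta1_true})")
```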
The Effects of Variable Co-Movement on
Identification
• For the data-generating process Yi = α + β1X1i +…+ βKXKi + Ui : If there
exists an exact linear relationship between at least two of the
independent variables (Xs), defined as perfect multicollinearity, then
there is an identification problem
• In contrast, if there is no exact linear relationship among the Xs, it is
always possible to distinguish the effects of the independent
variables on the outcome (Y) with any level of precision with
sufficient data, even if some Xs exhibit imperfect multicollinearity
• If there is correlation between any independent variable and the
error term, defined as endogeneity, then there is an identification
problem, no matter whether the correlation is via an exact linear
relationship or not
Remedies for Identification Problems
For perfect multicollinearity
• As long as our goal is to estimate the treatment effect and we
have no particular interest in distinguishing the effects of
controls, dropping one of the control variables contributing to
perfect multicollinearity is an effective remedy
• The only viable remedy when the treatment contributes to a
perfect multicollinearity problem is to change the population
from which you are sampling
Remedies for Identification Problems
For imperfect multicollinearity
• If estimates are noisy and VIF calculations suggest imperfect
multicollinearity, the simple solution is to gather more data
• If the imperfect multicollinearity involves only controls and
there is no interest in estimating the effects of the controls per
se, then collecting more data will not necessarily be
worthwhile
Remedies for Identification Problems
For endogeneity
• The only viable remedy is to change the population from which
you are sampling
• It does not matter whether the endogeneity involves the
treatment or not
• Options include collecting controls, finding proxy variable(s),
finding instrument(s), and/or transforming cross-sectional
data into a panel
Identification Damage Control: Signing the
Bias
• Suppose we have assumed the following data-generating
process: Yi = α + β1X1i +…+ βKXKi + Ui
• Let X1 be the treatment and X2, … , XK be controls
• Suppose there is an omitted variable, XK+1, that affects Y
(and so is part of U) and is correlated with X1
• The data-generating process can then be written as:
Yi = α + β1X1i +…+ βKXKi + βK+1XK+1i + Vi
Identification Damage Control: Signing the
Bias
• Let X̂K+1 = γ̂ + 𝛿̂1X1i + … + 𝛿̂KXKi be the estimated regression
equation we get if we were to regress XK+1 on X1, …, XK
• Within this framework, define βK+1 × 𝛿1 as the omitted variable
bias
• Omitted variable bias is the product of the effect of the
omitted variable on the outcome (βK+1) and the (semi-partial)
correlation between the omitted variable and the treatment
(𝛿1)
Identification Damage Control: Signing the
Bias
• Since we do not observe the omitted variable, we cannot
estimate either of the components of omitted variable bias
• We often can use theory to guide us with regard to the sign of
each component.
• The basic relationship is: sign(βK+1 × 𝛿1) = sign(βK+1) × sign(𝛿1)
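A minimal simulation sketch (not from the text) checking the sign rule: with βK+1 > 0 and 𝛿1 > 0 the omitted variable bias is positive, so the estimated treatment effect overstates the true one. All values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta1, beta_omitted, delta1 = 1.0, 2.0, 0.5   # both beta_K+1 and delta_1 positive

x1 = rng.normal(size=n)                        # treatment
x_omitted = delta1 * x1 + rng.normal(size=n)   # omitted variable co-moves with treatment
y = 3.0 + beta1 * x1 + beta_omitted * x_omitted + rng.normal(size=n)

# Regression that omits x_omitted
X = np.column_stack([np.ones(n), x1])
beta1_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"true beta1 = {beta1}, estimate = {beta1_hat:.3f}, "
      f"bias approx = beta_omitted * delta1 = {beta_omitted * delta1}")
```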
Identification Damage Control: Signing the
Bias
• The four possibilities for the sign of the omitted variable bias are
shown in the table below:

sign(βK+1)   sign(𝛿1)   sign of omitted variable bias (βK+1 × 𝛿1)
positive     positive   positive (upward bias)
positive     negative   negative (downward bias)
negative     positive   negative (downward bias)
negative     negative   positive (upward bias)
Reflection and Discussion Forum Week 7 A
Assigned Readings:
Chapter 10. Identification and Data Assessment
Initial Postings: Read and reflect on the assigned readings for the week. Then post what you thought was the most important concept(s), method(s), term(s), and/or any
other thing that you felt was worthy of your understanding in each assigned textbook chapter. Your initial post should be based upon the assigned reading for the week, so
the textbook should be a source listed in your reference section and cited within the body of the text. Other sources are not required but feel free to use them if they aid in
your discussion.
Also, provide a graduate-level response to each of the following questions:
1. In Chapter 10 the focus of the material is identifying and assessing data. One of the chief concerns of identifying and assessing data is extrapolation and
interpolation. Please explain both of these concepts and give a reason why these scenarios would occur.
Please address each component of the discussion board. Also, cite examples according to APA standards.
[Your post must be substantive and demonstrate insight gained from the course material. Postings must be in the student's own words - do not provide quotes!]
[Your initial post should be at least 450+ words and in APA format (including Times New Roman with font size 12 and double spaced). Post the actual body of your paper
in the discussion thread then attach a Word version of the paper for APA review]
Submitting the Initial Posting: Your initial posting should be completed by Thursday, 11:59 p.m. EST.
Response to Other Student Postings: Respond substantively to the posts of at least two peers by Friday, 11:59 p.m. EST. A peer response such as “I agree with her,” or
“I liked what he said about that” or similar comments are not considered substantive and will not be counted for course credit.
[Continue the discussion through Sunday, 11:59 p.m. EST by highlighting differences between your postings and your colleagues' postings. Provide additional insights or
alternative perspectives]
Evaluation of posts and responses: Your initial posts and peer responses will be evaluated on the basis of the kind of critical thinking and engagement displayed. The
grading rubric evaluates the content based on seven areas:
Content Knowledge & Structure, Critical Thinking, Clarity & Effective Communication, Integration of Knowledge & Articles, Presentation, Writing Mechanics, and Response
to Other Students.