PUAD 630
ANALYTICAL TECHNIQUES IN PUBLIC ADMINISTRATION
FERZANA HAVEWALA
REGRESSION ANALYSIS:
ESTIMATING RELATIONSHIPS
Introduction
Regression analysis is the study of relationships between variables.
There are two potential objectives of regression analysis:
to understand how the world operates
to make predictions.
Two basic types of data are analyzed:
Cross-sectional data are usually data gathered from approximately the same period of time
from a population.
Time series data involve one or more variables that are observed at several, usually equally
spaced, points in time.
Time series variables are usually related to their own past values—a property called
autocorrelation—which adds complications to the analysis.
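To make the idea of autocorrelation concrete, here is a minimal Python sketch that computes the lag-1 autocorrelation of a series, that is, how strongly each value is related to the value one period earlier. The series `trend` and the function name `lag1_autocorrelation` are made up for illustration and do not come from the course data.

```python
def lag1_autocorrelation(series):
    """Correlation between a series and itself shifted by one period."""
    n = len(series)
    mean = sum(series) / n
    # Sum of products of each deviation with the previous period's deviation.
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    # Total squared deviation of the series around its mean.
    den = sum((value - mean) ** 2 for value in series)
    return num / den

# Hypothetical monthly values with a steady upward trend.
trend = [10, 12, 13, 15, 16, 18, 19, 21]
print(round(lag1_autocorrelation(trend), 3))
```

A value well above 0 suggests that neighboring observations move together, which is the complication mentioned above.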
PUAD 630: Analytical Techniques in Public Administration
Potential Uses of Regression Analysis
Regression analysis can help answer questions similar to:
How do wages of employees depend on years of experience, years of education, and
gender?
How does the current price of a stock depend on its own past values, as well as the
current and past values of a market index?
How does a company’s current sales level depend on its current and past advertising
levels, the advertising levels of its competitors, the company’s own past sales levels,
and the general level of the market?
How does the total cost of producing a batch of items depend on the total quantity
of items that have been produced?
How does the selling price of a house depend on such factors as the appraised value
of the house, the square footage of the house, the number of bedrooms in the
house, and perhaps others?
Regression Analysis Terms
In every regression study, there is a single variable that we are trying to explain or
predict, called the dependent variable.
It is also called the response variable or the target variable.
To help explain or predict the dependent variable, we use one or more explanatory
variables.
They are also called independent or predictor variables.
If there is a single explanatory variable, the analysis is called simple regression.
If there are several explanatory variables, it is called multiple regression.
Regression can be linear (straight-line relationships) or nonlinear (curved relationships).
Many nonlinear relationships can be linearized mathematically.
Scatterplots: Graphing Relationships
Drawing scatterplots is a good way to begin regression analysis.
A scatterplot is a graphical plot of two variables, an X and a Y.
If there is any relationship between the two variables, it is usually apparent from the
scatterplot.
Example 1 (a). Drugstore Sales
Objective: To use a scatterplot to examine the relationship between promotional
expenditures and sales at Pharmex.
Data:
Pharmex has collected data from 50 randomly selected metropolitan regions.
There are two variables:
Pharmex’s promotional expenditures as a percentage of those of the leading competitor
(“Promote”)
Pharmex’s sales as a percentage of those of the leading competitor (“Sales”).
Example 1(a). Drugstore Sales: Scatterplots to estimate relationships
Example 2. Explaining Overhead Costs at Bendrix
Objective: To use scatterplots to examine the relationships among overhead, machine
hours, and production runs at Bendrix.
Data:
The data file contains monthly observations for 3 years on three variables:
overhead costs
machine hours
number of production runs
Each observation (row) corresponds to a single month.
Example 2(a). Overhead Costs: Scatterplots to estimate relationships
Linear versus Nonlinear Relationships
Scatterplots are useful for detecting relationships that may not be obvious otherwise.
The typical relationship you hope to see is a straight-line, or linear, relationship.
This doesn’t mean that all points lie on a straight line, but that the points tend to cluster
around a straight line.
The scatterplot here illustrates a nonlinear relationship.
Outliers
Scatterplots are especially useful for identifying outliers—observations that fall
outside of the general pattern of the rest of the observations.
If an outlier is clearly not a member of the population of interest, then it is probably best to
delete it from the analysis.
If it isn’t clear whether outliers are members of the relevant population, run the regression
analysis with them and again without them.
If the results are practically the same in both cases, then it is probably best to report the
results with the outliers included.
Otherwise, you can report both sets of results with a verbal explanation of the outliers.
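The with-and-without comparison described above can be sketched in a few lines of Python. This is only an illustration: the data are made up, the last point plays the role of a suspected outlier, and the helper name `fit_line` is an assumption rather than anything from StatTools.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: the last observation falls far outside the general pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]

with_outlier = fit_line(xs, ys)
without_outlier = fit_line(xs[:-1], ys[:-1])
print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

Here the two fits differ substantially, so under the guidance above both sets of results would be reported with an explanation of the outlier.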
Outliers (figure): scatterplot with an outlying point labeled "CEO"
Unequal Variance
Occasionally, the variance of the dependent variable depends on the value of the
explanatory variable.
The figure below illustrates an example of this.
There is a clear upward relationship, but the variability of amount spent increases as salary
increases—which is evident from the fan shape.
This unequal variance violates one of the assumptions of linear regression analysis, but there
are ways to deal with it.
No Relationship
A scatterplot can indicate that there is no relationship between a pair of variables: the points form a shapeless swarm.
Correlations: Indicators of Linear Relationships
Correlations are numerical summary measures that indicate the strength of linear
relationships between pairs of variables.
A correlation between a pair of variables is a single number that summarizes the information
in a scatterplot.
It measures the strength of linear relationships only.
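The usual notation for a correlation between variables X and Y is r_XY.
Correlation formula:
$$r_{XY} = \frac{\mathrm{Cov}(X, Y)}{s_X s_Y}, \qquad \mathrm{Cov}(X, Y) = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
where s_X and s_Y are the standard deviations of X and Y.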
The numerator of the equation is also a measure of association between X and Y,
called the covariance between X and Y.
The magnitude of a covariance is difficult to interpret because it depends on the units of
measurement.
Correlations: Indicators of Linear Relationships
By looking at the sign of the covariance or correlation—plus or minus—you can tell
whether the two variables are positively or negatively related.
Unlike covariances, correlations are completely unaffected by the units of
measurement.
A correlation equal to 0 or near 0 indicates practically no linear relationship.
A correlation with magnitude close to 1 indicates a strong linear relationship.
A correlation equal to -1 (negative correlation) or
+1 (positive correlation) occurs only when the linear relationship between the two variables
is perfect.
Be careful when interpreting correlations—they are relevant descriptors only for linear
relationships.
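The claim that covariance depends on units while correlation does not can be checked with a short Python sketch. The salary and spending figures below are invented for illustration, and the function names are assumptions; the calculation uses the usual sample covariance (dividing by n - 1).

```python
import math

def covariance(xs, ys):
    """Sample covariance: average product of deviations, using n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(covariance(xs, xs))
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)

# Hypothetical data: the same salaries expressed in dollars and in thousands.
salary_dollars = [30000, 42000, 51000, 60000, 75000]
salary_thousands = [s / 1000 for s in salary_dollars]
amount_spent = [1200, 1500, 2100, 2000, 2900]

print(covariance(salary_dollars, amount_spent))    # changes with the units
print(covariance(salary_thousands, amount_spent))  # 1000 times smaller
print(correlation(salary_dollars, amount_spent))   # same in either unit
print(correlation(salary_thousands, amount_spent))
```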
Remember . . .
The greater the strength of the relationship between two variables (the higher the
absolute value of the correlation coefficient), the more accurate the predictive
relationship.
Why?
The more two variables share in common (shared variance), the more you know about one
variable from the other.
Simple Linear Regression
Scatterplots and correlations indicate linear relationships and the strengths of these
relationships, but they do not quantify them.
Simple linear regression quantifies the relationship where there is a single explanatory
variable.
A straight line is fitted through the scatterplot of the dependent variable Y versus the
explanatory variable X.
Regression Line
Reflects our best guess as to what score on the Y variable would be predicted by a score on the X variable.
The line best fits these data because it minimizes the sum of squared vertical distances between the individual data points and the regression line.
The distance between each individual data point and the regression line is the error in prediction.
If the correlation were perfect, all of the data points would fall exactly on a straight line (not necessarily at a 45-degree angle), and the regression line would pass through each point.
The Simple Linear Regression Model
The equation that describes how y is related to x and an error term
Simple Linear Regression Model:
y = β0 + β1x + ε
Parameters: β0 and β1
The parameter values are usually not known and must be estimated using sample data
Sample statistics (denoted b0 and b1) are computed as estimates of the population parameters
β0 and β1
Random variable: Error term, ε
The error term accounts for the variability in y that cannot be explained by the linear
relationship between x and y
Estimated Regression Equation: The equation obtained by substituting the values of the
sample statistics b0 and b1 for β0 and β1 in the regression equation
Estimated simple linear regression equation: ŷ = b0 + b1x
ŷ = Point estimator of E(y|x)
b0 = Estimated y-intercept
b1 = Estimated slope
The graph of the estimated simple linear regression equation is called the estimated
regression line
The Estimation Process in Simple Linear Regression
Possible Regression Lines in Simple Linear Regression
The regression line in Panel A shows that the mean value of y is related positively to
x, with larger values of E(y|x) associated with larger values of x.
In Panel B, the mean value of y is related negatively to x, with smaller values of
E(y|x) associated with larger values of x.
In Panel C, the mean value of y is not related to x; that is, E(y|x) is the same for
every value of x.
Least Squares Method
The least squares method is a procedure for using sample data to find the estimated regression equation, that is, to determine the values of b0 and b1.
Interpretation of b0 and b1:
The slope b1 is the estimated change in the mean of the dependent variable y that is
associated with a one unit increase in the independent variable x
The y-intercept b0 is the estimated value of the dependent variable y when the independent
variable x is equal to 0
Least Squares Estimation
Fundamental Equation for Regression: Observed Value = Fitted Value + Residual
The residual is the difference between the actual and fitted values of the dependent
variable.
The best-fitting line through the points of a scatterplot is the line with the smallest
sum of squared residuals.
This is called the least squares line.
It is the line quoted in regression outputs.
The least squares line is specified completely by its slope and intercept.
Equation for Slope in Simple Linear Regression:
$$b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = r_{XY}\,\frac{s_Y}{s_X}$$
Equation for Intercept in Simple Linear Regression:
$$b_0 = \bar{Y} - b_1 \bar{X}$$
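As a minimal sketch, the slope and intercept equations can be applied by hand in Python. The data points are made up, and the function names are illustrative assumptions; the second function computes the same slope through the correlation form, r_XY times s_Y divided by s_X.

```python
import math

def least_squares_line(xs, ys):
    """Return (b0, b1) using the sum-of-products form of the slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_from_correlation(xs, ys):
    """Same slope computed as r_XY * s_Y / s_X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    return r * s_y / s_x

# Made-up observations, roughly following Y = 2X.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
print(b0, b1)
print(slope_from_correlation(xs, ys))
```

Both routes should give the same slope up to rounding, which is a quick check on a hand calculation.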
Least Squares Estimation
When fitting a straight line through a scatterplot, choose the line that makes the
vertical distance from the points to the line as small as possible.
A fitted value is the predicted value of the dependent variable.
Graphically, it is the height of the line above a given explanatory value.
Example 1 (b). Drugstore Sales
Objective: Find the least squares line for sales as a function of promotional expenses
at Pharmex.
Solution:
Select Regression from the StatTools Regression and Classification dropdown.
Use Sales as the dependent variable and Promote as the explanatory variable.
Example 1 (b). Drugstore Sales: Simple Regression
Example 1 (b). Drugstore Sales
The slope = 0.7623
• The sales index tends to increase by about 0.76 for each one-unit increase in the promotional expenses index.
• If two regions are compared, where the second region spends one unit more than the first region, the predicted sales index for the second region is 0.76 larger than the sales index for the first region.
The intercept = 25.1264
• The predicted sales index for a region that does zero promotions is about 25.13.
• However, no region in the sample has anywhere near a zero promotional value.
• Therefore, in a situation like this, where the range of observed values for the explanatory variable does not include zero, it is best to think of the intercept term as simply an “anchor” for the least squares line that enables predictions of Y values for the range of observed X values.
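The interpretation above can be checked by plugging values into the estimated equation, Predicted Sales = 25.1264 + 0.7623 × Promote. The sketch below is illustrative only: the Promote values 95, 96, and 100 are hypothetical and are assumed to lie within the observed range, which these slides do not list.

```python
# Estimates reported for the Pharmex regression.
INTERCEPT = 25.1264
SLOPE = 0.7623

def predicted_sales(promote):
    """Predicted sales index for a given promotional expenses index."""
    return INTERCEPT + SLOPE * promote

# Hypothetical regions that differ by one unit of Promote.
first_region = predicted_sales(95)
second_region = predicted_sales(96)
print(second_region - first_region)  # the slope, up to floating-point rounding
print(predicted_sales(100))
```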
Example 2 (b). Overhead Costs
Objective: To regress overhead expenses at Bendrix against the two potential
explanatory variables.
Solution:
The Bendrix manufacturing data set has two potential explanatory variables, Machine
Hours and Production Runs.
First regress Overhead against Machine Hours as the single explanatory variable.
Then regress Overhead against Production Runs as the single explanatory variable.
Example 2 (b). Overhead Costs: Simple Regression
Standard Error of Estimate
The magnitude of the residuals provides a good indication of how useful the regression line is for predicting Y values from X values.
Because there are numerous residuals, it is useful to summarize them with a single numerical measure.
This measure is called the standard error of estimate and is denoted s_e.
It is essentially the standard deviation of the residuals, and is given by this equation, where e_i is the residual for observation i and n is the number of observations:
$$s_e = \sqrt{\frac{\sum_i e_i^2}{n - 2}}$$
The usual empirical rules for standard deviation can be applied to the standard error
of estimate.
In general, the standard error of estimate indicates the level of accuracy of
predictions made from the regression equation.
The smaller it is, the more accurate predictions tend to be.
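A minimal Python sketch of this summary measure follows. It assumes the standard definition, the square root of the sum of squared residuals divided by n - k - 1 (n - 2 in simple regression, where k = 1). The data are made up, and `b0` and `b1` are taken as the already-estimated least squares coefficients for those points.

```python
import math

def standard_error_of_estimate(ys, fitted, num_explanatory=1):
    """Square root of SSE / (n - k - 1); with k = 1 the divisor is n - 2."""
    residuals = [y - f for y, f in zip(ys, fitted)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(ys) - num_explanatory - 1))

# Made-up data; b0 and b1 are the least squares estimates for these points.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99

fitted = [b0 + b1 * x for x in xs]
s_e = standard_error_of_estimate(ys, fitted)
print(round(s_e, 4))
```

Applying the empirical rules mentioned above, roughly two-thirds of the residuals would be expected to fall within one `s_e` of zero.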
How Good Is Our Prediction?
Error of estimate: How much each data point differs from its predicted value
Standard error of estimate: A measure of how much, on average, the data points differ from their predicted values; essentially the standard deviation of all of the error scores
The higher the correlation between two variables (and the better the prediction), the
lower the error will be.
The Percentage of Variation Explained: R-Square
R² is an important measure of the goodness of fit of the least squares line.
It is the percentage of variation of the dependent variable explained by the regression.
It always ranges between 0 and 1.
The better the linear fit is, the closer R² is to 1.
Formula for R²:
$$R^2 = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2}$$
where e_i is the residual for observation i.
In simple linear regression, R² is the square of the correlation between the dependent variable and the explanatory variable.
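The link between R² and the correlation can be verified numerically. The sketch below assumes the usual definition of R² as one minus the ratio of unexplained to total variation; the data are made up, and `b0` and `b1` are taken as the least squares estimates for those points.

```python
import math

def r_squared(ys, fitted):
    """1 - (sum of squared residuals) / (total variation of Y around its mean)."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

def correlation(xs, ys):
    """Correlation between X and Y from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data; b0 and b1 are the least squares estimates for these points.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99

r2 = r_squared(ys, [b0 + b1 * x for x in xs])
print(r2)
print(correlation(xs, ys) ** 2)  # agrees with r2 in simple regression
```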
Multiple Regression
To obtain improved fits in regression, several explanatory variables could be included
in the regression equation. This is the realm of multiple regression.
Graphically, you are no longer fitting a line to a set of points. If there are two explanatory
variables, you are fitting a plane to the data in three-dimensional space.
The regression equation is still estimated by the least squares method, but it is not practical
to do this by hand.
There is a slope term for each explanatory variable in the equation, but the interpretation of
these terms is different.
The standard error of estimate and R² summary measures are almost exactly as in simple regression.
Many types of explanatory variables can be included in the regression equation.
Interpretation of Regression Coefficients
If Y is the dependent variable, and X1 through Xk are the explanatory variables, then a typical multiple regression equation has the form shown below, where b0 is the Y-intercept, and b1 through bk are the slopes.
General Multiple Regression Equation:
Predicted Y = b0 + b1X1 + b2X2 + … + bkXk
Collectively, the bs in the equation are called the regression coefficients.
Each slope coefficient is the expected change in Y when this particular X increases
by one unit and the other Xs in the equation remain constant.
This means that the estimates of the bs depend on which other Xs are included in the
regression equation.
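To illustrate why a slope estimate depends on which other Xs are in the equation, here is a small pure-Python sketch that solves the least squares normal equations directly. It is an assumption-laden illustration, not how StatTools works internally: the data are generated exactly from Y = 5 + 2·X1 + 3·X2, and the function names are made up.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def multiple_regression(rows, ys):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data generated exactly from Y = 5 + 2*X1 + 3*X2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [5 + 2 * x1 + 3 * x2 for x1, x2 in rows]

both = multiple_regression(rows, ys)
x1_only = multiple_regression([(x1,) for x1, _ in rows], ys)
print("X1 and X2:", both)     # slope on X1 is about 2
print("X1 alone: ", x1_only)  # slope on X1 is larger, since X1 also proxies for X2
```

Because X1 and X2 move together in this made-up data, dropping X2 pushes part of its effect into the X1 slope.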
The BIG Rules . . .
When using multiple predictors, keep in mind . . .
Your independent variables (X1, X2, X3, etc.) should be related to the dependent
variable (Y). They should have something in common.
Independent variables should not be related to each other; they should be
uncorrelated so that they provide a unique contribution to the variance in the
outcome of interest.
Example 2(c). Overhead Costs
Objective: To estimate the equation for overhead costs at Bendrix as a function of both machine
hours and production runs.
Solution:
Select Regression from the StatTools Regression and Classification dropdown list. Then
choose the Multiple option and specify the single dependent (D) variable and the two independent (I) variables.
Example 2(c). Overhead Costs: Multiple Regression
Interpretation of Standard Error of Estimate and R-Square
The multiple regression output is very similar to simple regression output.
The standard error of estimate is essentially the standard deviation of residuals, but it is now given by the equation below, where e_i is the residual for observation i, n is the number of observations, and k is the number of explanatory variables:
$$s_e = \sqrt{\frac{\sum_i e_i^2}{n - k - 1}}$$
The R² value is again the percentage of variation of the dependent variable explained by the combined set of explanatory variables, but it has a serious drawback: It can never decrease, and it almost always increases, when extra explanatory variables are added to an equation, whether or not they belong there.
Adjusted R² is an alternative measure that adjusts R² for the number of explanatory variables in the equation.
It is used primarily to monitor whether extra explanatory variables really belong in the
equation.
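The monitoring role of adjusted R² can be shown with a short Python sketch. It assumes the standard definition, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), which these slides do not spell out. The R² values below are hypothetical and not taken from the Bendrix output.

```python
def adjusted_r_squared(r_squared, n, k):
    """Penalize R-square for the number of explanatory variables k."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical case: 36 observations; a third variable barely raises R-square.
n = 36
two_vars = adjusted_r_squared(0.866, n, k=2)
three_vars = adjusted_r_squared(0.867, n, k=3)

print(round(two_vars, 4))
print(round(three_vars, 4))  # lower, even though plain R-square went up
```

When the adjusted value falls after adding a variable, that is a signal the extra variable may not belong in the equation.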
Modeling Possibilities
Several types of explanatory variables can be included in regression equations:
Dummy variables
Interaction variables
Nonlinear transformations
There are many alternative approaches to modeling the relationship between a
dependent variable and potential explanatory variables.
In many applications, these techniques produce much better fits than you could obtain
without them.
Dummy Variables
Some potential explanatory variables are categorical and cannot be measured on a
quantitative scale.
However, these categorical variables are often related to the dependent variable, so they need
to be included in the regression equation.
The trick is to use dummy variables.
A dummy variable is a variable with possible values of 0 and 1.
It is also called a 0-1 variable or an indicator variable.
It equals 1 if the observation is in a particular category, and 0 if it is not.
Categorical variables are used in two situations:
When there are only two categories (example: gender)
When there are more than two categories (example: quarters)
In this case, multiple dummy variables must be created.
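A minimal sketch of building 0-1 variables from a categorical column is shown below. The function name `make_dummies`, the `is_` prefix, and the sample quarters are illustrative assumptions. One category is left out as the baseline, so a four-category variable yields three dummies.

```python
def make_dummies(values, reference):
    """Create one 0-1 column per category, skipping the reference category."""
    categories = sorted(set(values) - {reference})
    return {
        f"is_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }

# Hypothetical quarterly labels for six observations.
quarters = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"]
dummies = make_dummies(quarters, reference="Q1")

for name, column in dummies.items():
    print(name, column)
```

An observation in the reference quarter has 0 in every dummy column, so coefficients on the dummies are read relative to that quarter.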
Example 3. Bank Salaries
Objective: To analyze whether the bank discriminates against females in terms of
salary.
Data:
The data set includes the following variables for each of the 208 employees:
Education (categorical)
Grade (categorical)
Years1 (numerical; years with this bank)
Years2 (numerical; years of previous work experience)
Age (numerical)
Gender (categorical with two values)
PCJob (categorical, yes/no; whether the employee's current job is primarily computer-related)
Salary (numerical)
Example 3(a). Bank Salaries
Create dummy variables for the various categorical variables, using IF functions or
the StatTools Dummy procedure.
Then we can run a regression analysis with Salary as the dependent variable, using
any combination of numerical and dummy explanatory variables.
Don’t use any of the original categories (such as Education) that the dummies are based on.
Always use one fewer dummy than the number of categories for any categorical variable.
The omitted dummy then corresponds to the reference category.
The interpretation of any dummy variable coefficient is relative to this reference category.
When there are only two categories (e.g., the gender variable), name the variable with the
category (e.g., Female) that corresponds to the 1’s. In this case the other category (e.g.,
Male) automatically becomes the reference category.
Example 3(a)&(b). Bank Salaries: Multiple regression with numerical and categorical explanatory variables
Interaction Variables
When you include only a dummy variable in a regression equation, you are allowing
the intercepts of the two lines to differ, but you are forcing the lines to be parallel.
To be more realistic, you might want to allow them to have different slopes.
You can do this by including an interaction variable.
An interaction variable is the product of two explanatory variables.
Include an interaction variable in a regression equation if you believe the effect of one
explanatory variable on Y depends on the value of another explanatory variable.
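A short derivation shows why the product term lets the slopes differ. The symbols here are illustrative: X is a numerical variable, D is a 0-1 dummy, and b3 is the coefficient of the interaction variable X·D.

```latex
\begin{aligned}
\text{Predicted } Y &= b_0 + b_1 X + b_2 D + b_3 (X \cdot D) \\
D = 0:\quad \text{Predicted } Y &= b_0 + b_1 X \\
D = 1:\quad \text{Predicted } Y &= (b_0 + b_2) + (b_1 + b_3) X
\end{aligned}
```

The dummy coefficient b2 shifts the intercept, and the interaction coefficient b3 shifts the slope. With b3 = 0 the two lines are parallel.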
Example 3(c). Bank Salaries
Objective: To use multiple regression with an interaction variable to see whether the
effect of years of experience on salary is different across the two genders.
Solution: StatTools will create the interaction variables implicitly.
Check the Include Derived Variables box at the bottom of the Regression dialog box and then click the resulting Add button.
Check the Years1 and Gender variables and select the Interaction with Category Variable option.
Example 3(c). Bank Salaries: Multiple regression with interaction variable
Example 3(c). Bank Salaries
Regression Equations:
Males: Salary = 30430 + (1528) Years1
Females: Salary = (30430 + 4098) + (1528 - 1248) Years1 = 34528 + (280) Years1
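The two equations above can be combined into one prediction function. This sketch assumes the fitted equation has the single-equation form Salary = 30430 + 1528·Years1 + 4098·Female - 1248·(Years1 × Female), which is what the male and female equations imply. The Years1 value of 10 is a hypothetical input.

```python
# Coefficients read off the two regression equations.
INTERCEPT = 30430
YEARS1_COEF = 1528
FEMALE_COEF = 4098        # shift in the intercept for females
INTERACTION_COEF = -1248  # shift in the Years1 slope for females

def predicted_salary(years1, female):
    """Predicted salary; female is 1 for females and 0 for males."""
    return (INTERCEPT
            + YEARS1_COEF * years1
            + FEMALE_COEF * female
            + INTERACTION_COEF * years1 * female)

# Hypothetical employees with 10 years at the bank.
print(predicted_salary(10, female=0))  # 30430 + 1528 * 10
print(predicted_salary(10, female=1))  # 34528 + 280 * 10
```

Under these equations the female line starts higher but rises more slowly. The two lines cross where 4098 = 1248 × Years1, at roughly 3.3 years.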
Nonlinear Transformations
The general linear regression equation has the form:
Predicted Y = b0 + b1X1 + b2X2 + … + bkXk
It is linear in the sense that the right side of the equation is a constant plus a sum of
products of constants and variables.
The variables can be transformations of original variables.
Nonlinear transformations of variables are often used because of curvature detected in
scatterplots.
You can transform the dependent variable Y or any of the explanatory variables, the Xs. You
can also do both.
Typical nonlinear transformations include: the natural logarithm, the square root, the
reciprocal, and the square.
Example 4(a). Electricity
Objective: To see whether the cost of supplying electricity is a nonlinear function of
demand, and if it is, what form the nonlinearity takes.
Data: The data set lists the number of units of electricity produced (Units) and the
total cost of producing these (Cost) for a 36-month period.
Solution:
First generate a scatterplot of Cost versus Units.
Next, run a simple regression of Cost on Units.
Example 4(b). Electricity
Create a new variable Units2, the square of Units, in the data set and then use multiple regression to
estimate the equation for Cost with both explanatory variables, Units and Units2.
Use the Trendline option in Excel to superimpose a quadratic curve on the scatterplot.
Example 4(b). Electricity: Quadratic transformation
Example 4(c). Electricity
Next, try a logarithmic fit by creating a new variable, NaturalLog(Units), and then
regressing Cost against this variable.
Example 4(c). Electricity: Logarithmic transformation
Example 4(c). Electricity
Logarithmic transformations of variables are used widely in regression analysis
because they have a meaningful interpretation.
Suppose that Units increases by 1% (e.g., from 600 to 606). Then the expected Cost will
increase by a fixed amount, approximately the coefficient of Log(Units) multiplied by 0.01:
0.01(16654) = $166.54.
Every 1% increase in Units is accompanied by an expected increase in Cost of about $166.54.
Note that for larger values of Units, a 1% increase represents a larger absolute increase (e.g.,
from 700 to 707 instead of from 600 to 606). But each such 1% increase entails the same
increase in Cost. This is another way of describing the decreasing marginal cost property.
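The 1% rule can be checked numerically. The sketch below uses only the reported coefficient of Log(Units), 16654. The intercept is not needed because it cancels when two predictions are subtracted. The function name is an illustrative assumption.

```python
import math

LOG_UNITS_COEF = 16654  # coefficient of Log(Units) reported above

def expected_cost_change(units_before, units_after):
    """Change in predicted Cost when Units moves between two levels."""
    return LOG_UNITS_COEF * (math.log(units_after) - math.log(units_before))

change_from_600 = expected_cost_change(600, 606)  # a 1% increase
change_from_700 = expected_cost_change(700, 707)  # also a 1% increase
rule_of_thumb = 0.01 * LOG_UNITS_COEF

print(round(change_from_600, 2))
print(round(change_from_700, 2))
print(round(rule_of_thumb, 2))
```

The two exact changes agree with each other, and the 0.01 multiplier is a close approximation because ln(1.01) is about 0.00995.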