4/15/2015
Thinking about
correlations and
regression
M. L. Zajicek-Farber, MSW, PhD
CUA-NCSSS
Shahan #112
Spring 2015 – Updated 4.15.2015
Working with a linear relationship between two variables
(test scores and level of dollar reward per week).
What direction is implied by this relationship?
Notice the positive direction between the variables – positive correlation!
[Figure: scatterplot of Stats Test Score (0 to 100) against reward in dollars per week of studying ($0 to $80)]
What direction of the correlation is
implied by these results between
these two variables?
[Figure: scatterplot of Stats Test Score (0 to 100) against reward in dollars per week of studying ($0 to $80)]
Notice the curvilinear pattern between the variables – not a linear correlation!
What direction of the correlation is
implied by these results between
these two variables?
[Figure: scatterplot of Stats Test Score (0 to 100) against reward in dollars per week of studying ($0 to $80)]
Notice the negative (inverse) direction between the variables – negative correlation!
What direction of the correlation is
implied by these results between
these two variables?
[Figure: scatterplot of Stats Test Score (0 to 100) against reward in dollars per week of studying ($0 to $80)]
Notice that there is no relationship between the variables – no correlation!
Pearson Correlation coefficient r
A linear relationship between 2 variables:
Pearson correlation is denoted by the statistic “r”
Both variables have to be numeric, that is, measured at a
continuous (interval or ratio) level of measurement
Data should be gathered from a probability-based (random)
sample that is representative of a population.
Both variables should be normally distributed!
If these conditions cannot be met (for
representative sample or normal distribution):
use the Spearman Rank Order correlation coefficient (rho).
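The difference between the two coefficients can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample data are made up for this example and are not part of the course materials. Spearman's rho is computed here as Pearson's r applied to the ranks of the scores.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r: covariation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rho: Pearson r computed on the ranks instead of the raw scores."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y rises with x, but not along a straight line.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(pearson_r(x, y), spearman_rho(x, y))
```

With these made-up scores the rank-based coefficient is perfect while Pearson's r is slightly below 1, because the relationship is consistently increasing but not linear.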
Different types of correlation coefficients

Level of Variable Measurement        | Type of Correlation             | Example
Variable X       | Variable Y        |                                 |
Nominal          | Nominal           | Phi coefficient                 | Voting preference and gender
Nominal-Ordinal  | Ordinal           | Rank biserial coefficient       | Social class (low, medium, high) and rank standing in high school
Nominal          | Interval or Ratio | Point biserial                  | Type of support from family member (mother, father) and GPA in school
Ordinal (rank)   | Ordinal (rank)    | Spearman rank coefficient       | Height in percentile rank and weight in percentile rank
Interval/Ratio   | Interval/Ratio    | Pearson correlation coefficient | Number of problems solved and age in years
Correlation in Bivariate Statistics
Univariate & Bivariate Statistics for Numeric Data
◦ Frequency distribution, mean, mode, range, standard deviation
◦ Numeric correlation between two variables
Correlation examines:
◦ the linear pattern of relationship between one variable (x) and
another variable (y) – a linear correlation between two continuous variables
◦ whether the relative position of scores on one variable corresponds to the
relative position of scores on another variable along a “straight” line on a graph
◦ A correlation can be shown graphically by a straight line representing
the relationship between the two variables
Warning:
◦ Correlation does not show proof of causality!
◦ Correlation cannot assume x causes y!
Correlation: How much do scores correlate linearly?
IV = “resilience scores”
DV = “self-esteem scores”
[Diagram: Resilience and Self-esteem, linked by their correlation (or covariation of scores)]
Pearson’s Correlation Coefficient
“r” indicates…
◦ strength of relationship (strong, weak,
or none)
◦ direction of relationship
positive (direct) – variables move in same
direction
negative (inverse) – variables move in opposite
directions
r ranges in value from –1.0 to +1.0
-1.0 = Strong Negative
0.0 = No Relationship
+1.0 = Strong Positive
Direction of correlation
Positive (direct)
The scores of both variables move in the same direction (both are
high or both are low).
Negative (indirect, or inverse)
The scores of one variable increase while the scores of the other
variable decrease (when one is of high value the other is of low value).
What direction is the correlation?
A shorter length of hospitalization was related to a smaller
number of symptoms in patients with acute heart disease.
_ positive _ negative
As clients’ levels of anxiety increased, their level of problem
solving decreased.
_ positive _ negative
As clients’ satisfaction decreased, the number of their
complaints regarding dining services increased.
_ positive _negative
The older the client’s age level, the greater the participation
in socialization activities.
_ positive _negative
Pearson correlation r
Tests “linearity” of association between 2 variables
r ranges from -1 to 0 to +1
Note: the sign + or – indicates the direction
Strength of correlation r:
“0” means no correlation
.01 to .39 = “weak” correlation
.40 to .69 = “moderate” relationship
.70 to .90 = “strong” correlation
1.00 or -1.00 = “identical/perfect” relationship
Memorize the moderate value of r = .40 to .69
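The cut-offs on this slide can be written as a small lookup, shown here as a sketch rather than a standard routine. The function name is made up for illustration. One assumption is labeled in the code: the slide does not label values between .91 and .99, so the sketch treats them as “strong.”

```python
def describe_strength(r):
    """Label the magnitude of r using the cut-offs from the slide.

    The sign is ignored because it only indicates direction.
    """
    size = abs(r)
    if size > 1:
        raise ValueError("r must lie between -1 and +1")
    if size == 0:
        return "none"
    if size == 1:
        return "perfect"
    if size < 0.40:
        return "weak"
    if size < 0.70:
        return "moderate"
    # Assumption: .91 to .99 is not labeled on the slide; treated as "strong" here.
    return "strong"

for r in (0.69, -0.15, -0.35, 0.55, -0.78):
    print(r, describe_strength(r))
```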
What is the strength of the
correlation r:
r = .69
r = -.15
r = -.35
r = .55
r = -.78
1. Which is the weakest correlation? (-.15)
2. Which is the strongest correlation? (-.78)
3. Which correlations are positive and
moderate in strength? (.69, and .55)
4. Which correlations are weak and
negative? (-.15, and -.35)
Hypothesis Testing:
1. State the Null hypothesis Ho
2. State the research hypothesis (1-tail or 2-tail)
3. Find r(critical) at p = .05 from the Table (when the p-value is not
provided for the r-value!)
4. Use N - 2 = df
5. Compare r(critical) with the size (absolute value) of r(obtained):
6. If r(critical) in the table is < r(obtained), then
Reject the Null Ho
7. If r(critical) in the table is > r(obtained), then
Do not reject the Null Ho
8. Draw conclusions about the correlation
** When the p-value for statistical significance is provided, base
your interpretation on whether p < .05 for evaluating whether
the r-value is statistically significant!
Considering r_critical: Using a Table
Imagine that a study found a correlation of r = .69
between “test score GPA” and the “amount of
dollar reward” per week given out by 10 randomly
selected parents to improve youth GPA.
r(obtained) = .69 and N = 10
r(critical): figure out df = N - 2 [df = 10 - 2 = 8]
Go into Salkind Table B4 for critical values of r
r(critical) (df = 8) = .6319 at p = .05 (2-tailed)
r(critical) (df = 8) = .5494 at p = .05 (1-tailed)
Note: Some tables for r(critical) give you N instead of
df; such a table has already been adjusted.
Decision for r = .69 ?
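For readers curious where the table values come from, the sketch below recovers them from critical t-values using the identity r = t / sqrt(t² + df). This is a side illustration, not part of the lecture's procedure. The t-values 2.306 (2-tailed) and 1.860 (1-tailed) are the usual tabled values for df = 8 at the .05 level; the function name is made up.

```python
from math import sqrt

def r_critical_from_t(t_critical, df):
    """Convert a critical t-value into the matching critical r for the same df."""
    return t_critical / sqrt(t_critical ** 2 + df)

# Critical t-values for df = 8 at alpha = .05 (standard t-table values).
T_CRITICAL_DF8 = {"2-tailed": 2.306, "1-tailed": 1.860}

for tail, t_value in T_CRITICAL_DF8.items():
    print(tail, r_critical_from_t(t_value, df=8))
```

The two results agree with the tabled .6319 and .5494 to about three decimal places.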
This table provides N with correct values.
[Table images: critical values of r for 1-tailed and 2-tailed tests; one table is indexed by N, the other by df]
Let’s figure out the result for r =.69.
r (calculated) = 0.69
r (critical from table) at 2 tail = 0.6319 at df =8
r (critical from table) at 1 tail = 0.5494 at df =8
Compare r (critical) with r(calculated)
For 2-Tail Ho: 0.6319 < 0.69
For 1-Tail Ho: 0.5494 < 0.69
Make conclusions:
Since r(calculated) is bigger than the table value r(critical),
reject the Null (Ho) hypothesis and conclude that the calculated r of 0.69
is a statistically significant result for both the 2-tail and 1-tail
research hypotheses at p < 0.05.
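The comparison above can be expressed as a one-line decision rule. The sketch below is illustrative only; the function name is made up, and the critical values are the ones quoted from the table for df = 8.

```python
def decide(r_obtained, r_critical):
    """Reject Ho when the size of the obtained r exceeds the critical r."""
    if abs(r_obtained) > r_critical:
        return "reject Ho"
    return "do not reject Ho"

print("2-tail:", decide(0.69, 0.6319))
print("1-tail:", decide(0.69, 0.5494))
```

Taking the absolute value means a negative r is judged by its size, since the sign only indicates direction.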
Note: On statistical significance!
If the analysis gives you r-value
(correlation result) with a p-value such as
r = 0.35 at p = 0.012 for correlation between
“resilience” and “self-esteem” - then –
Use the p-value to decide whether the
correlation is statistically significant.
That is, here p = 0.012, which is less than
0.05, and therefore you can conclude that
the r of 0.35 is statistically significant at p < 0.05!
Note: On statistical significance!
If the analysis gives you r-value
(correlation result) without a p-value such
as
r = 0.69 for correlation between “GPA scores”
and “no. of rewards” - then –
Use the table for determining the critical r-value
at 2-tail or 1-tail at p = 0.05, and
Then – compare r(calculated) against
r(critical) and if “calculated value r” > “table
critical value r” – then Reject Ho, and
conclude that the calculated r is statistically
significant at p < 0.05.
Once you have a significant r-result
(correlation coefficient), then that
result must be interpreted covering
several specific areas:
• Direction of the correlation r coefficient
• Interpretation of the direction in a normal English
language meaning
• Magnitude of r
• Coefficient of determination or computation of r-square
• Explanation of shared variance (or r-square x 100%)
For a statistically significant (p < .05)
correlation
Indicate:
Direction of the relationship (positive or negative)
Interpret the direction in normal English language
Indicate the strength or magnitude of the
correlation coefficient r
Calculate the Coefficient of Determination
(r-square) and multiply it by 100%.
Interpret the coefficient of determination as the
percent of shared variance that is explained in the
significant correlation of the two variables.
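The reporting steps listed above can be sketched as a small helper that returns the direction, the coefficient of determination, and the percent of shared variance for a given r. The function and key names are made up for this illustration.

```python
def interpret(r):
    """Summarize a significant r: direction, r-square, and percent of shared variance."""
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    r_squared = r * r
    return {
        "direction": direction,
        "r_squared": round(r_squared, 4),
        "shared_variance_pct": round(r_squared * 100, 1),
    }

print(interpret(0.69))
print(interpret(-0.405))
```

Squaring removes the sign, so a negative correlation still yields a positive percent of shared variance.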
Interpret r = 0.69 at p < 0.05
Direction: Positive
Meaning: Higher students’ test scores are
correlated with higher values of students’
rewards! Or, lower students’ test scores are
correlated with lower students’ rewards.
Magnitude of r: the value of 0.69 reflects a
“moderate” magnitude of the strength of this
correlation.
Coefficient of Determination ……continued
on the next slide
Note: r2 Linear = .69 x .69 = .476
r2 is the Coefficient of Determination!
It captures the amount of shared variance in the
covariation of the scores of the X and Y variables.
Multiply r2 by 100% in order to get the percent
of variance in a correlation coefficient explained
by the coefficient of determination!
The coefficient of determination is 47.6%; that is, the
correlation between “GPA” and “Number of
rewards” explains close to 48% of shared
variance.
More Information on r-square r2
In a regular bivariate correlation, r2 represents just
the percent of shared variance in the correlation of
the two variables.
However, when we designate one of the variables
as IV and the other as DV, then r2 actually
explains the percent of variance in the DV that can
be specifically accounted for by the variance of the IV.
This concept of IV variance explaining DV
variance becomes important in a “prediction”
or “statistical regression analysis.”
Notice: Correlation and prediction and
causation are separate concepts!
Correlation refers to bi-variate correlation
between 2 variables.
Prediction refers to “regression analysis” –
which is an extension of correlation analysis
with 2 or more variables for the specific
purpose of explaining which IV variables
predict the DV variable.
Causation looks to a design of the study –
and whether or not the study has (or has
not) some form of experimental design!
Correlation and Causality !!!
Correlation measures the strength and direction
of the relationship between two variables!
In NO way does correlation imply any
causality between two variables!
However, the correlation that uses prediction
analysis called “regression” analysis allows
for stronger inference about the relationship
between variables when one is designated
as the X (independent variable) and the other
as the Y (dependent variable).
So, let’s look at another study:
A study examined a correlation
between job satisfaction and
burnout of 218 randomly selected
employees in an organization.
This study found a correlation:
r = -0.405 at p < 0.001
The data are presented in a Table
on the next slide!
A study examined a correlation between job satisfaction and
burnout of 218 randomly selected employees in an
organization. The data are presented in a Table below:

Descriptive Statistics
                           | Mean    | Std. Deviation | N
Degree of Job Satisfaction | 66.1606 | 8.82534        | 218
Degree of Burnout          | 21.5688 | 6.30466        | 218

Correlations
                                       | Job Satisfaction | Burnout
Job Satisfaction | Pearson Correlation | 1                | r = -.405**
                 | Sig. (2-tailed)     |                  | p = .000
                 | N                   | 218              | 218
Burnout          | Pearson Correlation | -.405**          | 1
                 | Sig. (2-tailed)     | .000             |
                 | N                   | 218              | 218
**. Correlation is significant at the 0.01 level (2-tailed).
Note: Sig = .000 is p < .001 !
What is r for “satisfaction” with “burnout”?
r = - .405 ** at p < .001
Note: ** indicates that this correlation is
“statistically significant”
1) What is the direction of this correlation? Negative.
2) Translate the direction into “normal English” language to
provide normal meaning: more satisfaction goes with less burnout.
3) What is the strength (magnitude) of this correlation?
Moderate strength.
4) What is the coefficient of determination? r2 = (-.405) x (-.405) =
0.164025; 0.164025 x 100% = 16.40%
5) Explain the meaning of the coefficient of determination:
This significant correlation explains about 16% of the
variance that is shared between the scores of these two variables.
Limitations of Correlation
• linearity:
– can’t describe non-linear relationships
– e.g., relation between anxiety & performance
• truncation of range:
– underestimate strength of relationship if you can’t see
full range of x value
• no proof of causation
– third variable problem:
• could be 3rd variable causing change in both
variables
• directionality: can’t be sure which way causality “flows”
Correlation and Prediction
• Regression: Correlation with Prediction
– predicting DV (y) based on IV (x)
– e.g., predicting….
• “burnout” of employees (Y) variable
• based on “job satisfaction” (X) variable
• Is there a statistically significant correlation between
the variables? and
• Does the linear relationship between the variables
predict the dependent variable (Y) from the
independent variable (X)?
• Regression is typically expressed as MRA (multiple
regression analysis)
• MRA usually has several IVs and 1 DV:
Consider:
If age, job satisfaction and turnover are
designated as three independent variables (IVs),
and
job burnout is designated as the dependent (DV)
- outcome variable, then
(a) is there a statistically significant correlation
between these three variables and burnout, and
(b) Do variables of age, job satisfaction, and
turnover scores predict the “burnout” of
agency employees?
(c) What percent of burnout can be predicted
from this model ?
(d) Which variable is the strongest predictor of
“burnout?”
Prediction Equation
Y’ = a + b1X1 + b2X2 + b3X3 + … + bnXn
Y’ is the predicted score of the Y
(dependent variable – “burnout”) based on
the known value of Xs (independent
variables)
b is the raw (unstandardized) partial
regression coefficient derived from the
analysis, or the slope, or direction of the
line (based on the variable relationship
between Xs and Y)
X is the score of the variable being used as
the predictor (independent variable)
a is the value of “burnout” when every X equals 0
(for example, when “job satisfaction” = 0) …also called the
“intercept” or “constant”
MRA – Research Question and Hypotheses
RQ: Do variables of age, job satisfaction, and turnover in
the agency predict the burnout of agency employees?
Null Ho: Variables of age, job satisfaction, and turnover in
the agency will not significantly predict the burnout of agency
employees.
2-Tail Ha: Variables of age, job satisfaction, and
turnover in the agency will significantly predict the
burnout of agency employees.
1-Tail Ha1: Age will significantly positively predict
burnout of employees;
1-Tail Ha2: Job satisfaction will significantly negatively
predict burnout of employees;
1-Tail Ha3: Turnover will significantly positively predict
burnout of employees.
Note:
When we do regression analysis,
then we do not use correlation
critical tables to interpret the
results……rather……we will always
be given the p-value associated
with the regression results!
So, let’s look at analytical results on
the next slide
Results of SPSS MRA analysis
Note:
The Pearson Correlation r between
“Burnout” and all other variables
Note:
r-values
Note:
p-values
Note:
Sample Size
Identifying correlation from the
previous slide
Age: r = -.077 at p = .130 (Not Significant!)
Satisfaction: r = -.405 at p < .001 (Significant)
Turnover: r = .577 at p < .001 ( Significant)
Now, let’s look at the analytical output for
MRA analysis on the next slide:
Additional Results of MRA analysis
Note: Adjusted R2 for the entire model with all independent variables.
Here: The three independent variables explained 35.8% of variance in “burnout.”
Note: Each variable contributed a different percentage % to the
explanation of the Adj. R-Square in this model.
Here: Age contributed less than 1%, while satisfaction explained
15.8% and turnover around 20%.
Note: The Sig. F Change tells you what happens to the model as you
add variables to the equation in predicting “burnout.”
Here: Age did not produce a sig. model, but job satisfaction and
turnover variables did!
Additional Results of MRA analysis
Read: Model 3
Note: The p-value for each independent variable.
Here: Age p = .197, Satisfaction p = .002, Turnover p < .001
Note: B = b = raw unstandardized partial regression coefficient for each
independent variable that is used for building the prediction equation
Note: Beta is used to point to the “strongest” coefficient
Equation for Predicting Burnout:
Y’ = 15.644 + (.049)(Age_score) + (-.138)(Satisfaction_score) + (.704)(Turnover_score)
Note: Constant = a = 15.644 = Intercept, the value of predicted Y’
(burnout) when all other variables equal zero.
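The prediction equation can be sketched as a short function. The intercept and b-coefficients are the ones reported for Model 3; the function name and the example scores passed to it are hypothetical and are not taken from the study data.

```python
INTERCEPT = 15.644  # constant a from the Model 3 output
COEFFICIENTS = {    # unstandardized b for each predictor
    "age": 0.049,
    "satisfaction": -0.138,
    "turnover": 0.704,
}

def predict_burnout(age, satisfaction, turnover):
    """Y' = a + b1*X1 + b2*X2 + b3*X3 using the reported coefficients."""
    scores = {"age": age, "satisfaction": satisfaction, "turnover": turnover}
    return INTERCEPT + sum(COEFFICIENTS[name] * scores[name] for name in COEFFICIENTS)

# Hypothetical employee: age 40, satisfaction score 66, turnover score 10.
print(predict_burnout(age=40, satisfaction=66, turnover=10))
```

Setting all three scores to zero returns the intercept, which is what “constant” means in the output.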
Another way to present MRA table with results:
Table X.
Prediction of Burnout based on Age, Job
Satisfaction and Turnover of Employees (N=215)

Predictor Variable | Simple r | Adjusted R2 | ∆ R2 Change | Regression Coefficient b | Beta  | Sig.
Age                | -.077    | .001        | .006        | .049                     | .073  | .197
Job Satisfaction   | -.405**  | .156        | .158        | -.138                    | -.196 | .002
Turnover           | .577**   | .358        | .203        | .704                     | .509  | .001
Constant = 15.644
Adjusted R2 = 35.8%

Notice the p-values of these predictor variables.
You should be able to realize that “age” was not a significant predictor of
“burnout” whereas “job satisfaction” and “turnover” both statistically significantly
predicted “burnout” because their p-values were p < .05!
So what do these results tell us?
Multiple Regression Analysis (MRA) found that employees’
“burnout” can be significantly predicted from a three-variable
model (age, job satisfaction, and turnover) – look at slide
40 and you should see we built the model step by step,
including one variable at a time.
Notice on slide 40: the F-change showed you how each
model p-value changed: age F change p = .260 (NS), and
then when we added job satisfaction and turnover
variables – the F-change p-values for Sig. changed to .000
– telling us that p < 0.001!
This three-variable model explained 35.8% of variance
in burnout (Adjusted R-Square = .358) – look at slide 40
and 42!
In this model, only job satisfaction and turnover were
predictive of burnout (p < .05), whereas age did not
significantly predict burnout (p= .197).
Continue with results:
Job satisfaction explained 15.8% and turnover
explained 20.3% of the total variance in burnout. –
Look at slide 40 under “adjusted R-square
Change” column!
However, turnover was the strongest predictor
(Beta= .509) when compared with job satisfaction
(Beta = -.196). – Look at slide 41 under the Beta
column for “standardized coefficients” and just simply
look for the highest value Beta regardless of the sign.
Notice: the sign just tells you the direction of the
Predictor and the Outcome variable!
The relationship between burnout and turnover is
such that a higher score of turnover predicted a
higher score of burnout.
The relationship between burnout and job satisfaction
is negative such that a higher score in job satisfaction
predicted a lower score in burnout.
Summarized results from the study:
The researcher investigated whether 3 predictor
variables (age, job satisfaction, and turnover)
predicted the level of burnout in 215 agency
employees, who were randomly selected from all
agency employees for a survey study.
The MRA analysis found that age did
not significantly predict burnout, whereas both job
satisfaction and turnover did significantly predict the
burnout of the employees. The analytical model with
three variables explained around 36% of variance in burnout. The
best significant predictor was the variable turnover,
which explained around 20% of variance in burnout,
while job satisfaction explained 16% of variance in
burnout.
These findings can be generalized to
agency employees because the study used a
probability sample.
Let’s look at this study:
RQ:
Do gender, level of resilience, and income
predict “life satisfaction”?
Ha (2-tail):
Gender, level of resilience, and income will
predict “life satisfaction.”
Ho: Gender, level of resilience, and income
will not predict “life satisfaction.”
Another way to present MRA table with results:
Table X.
Prediction of Life Satisfaction based on Gender, Resilience,
and Income in Randomly Selected Adults (N= 500)

Predictor Variable | Simple r | Adjusted R2 | ∆ R2 Change | Regression Coefficient b | Beta | Sig.
Gender             | .351*    | .15         | .151        | .333                     | .323 | .034
Resilience         | .421**   | .32         | .175        | .454                     | .544 | .001
Income             | .011     | .32         | .001        | .014                     | .050 | .255
* p < 0.05, ** p < 0.001
Constant = 20.22
1. Which simple correlations are statistically significant ?
2. Which variable predictors in this model significantly predicted “life satisfaction”?
3. How much percent % variance did these predictors explain in “life satisfaction”?
4. Of the significant predictors how much % of variance did each predictor explain?
5. Which significant predictor is the “best” predictor of life satisfaction?
Gender coding used in Table X: Male = 0, Female = 1
1. Which simple correlations are statistically significant ?
(a) Notice that Gender r = .351 with a p < 0.05 - which means “gender” is
statistically significant, and weakly and positively correlated with Life
Satisfaction (DV). And, females (with the higher gender code) have higher life
satisfaction! Males (with the lower gender code) have lower life satisfaction!
(b) Resilience (r = .421) has a significant (p < 0.001), positive, moderate correlation
with life satisfaction.
(c) Income has no significant simple correlation with life satisfaction (p > 0.05).
2. Which variable predictors in this model significantly predicted “life
satisfaction”? …look at the “Sig. column – and use the p-value in this column:
(a) Gender (p < 0.05) and Resilience ( p < 0.001), but not Income (p > 0.05).
3. What percent of variance in “life satisfaction” did these predictors
explain? Look at the Adjusted R-Square column:
.32 x 100% = 32.00% - which means this model with three
predictors explained 32% of variance in life satisfaction!
4. Of the significant predictors how much % of variance did each predictor
explain? …….Look at the R-square change column:
Gender contributed 15.1% and Resilience contributed 17.5%, while Income had
almost no contribution with less than 1%
5. Which significant predictor is the “best” predictor of life satisfaction? Here,
look at the Beta column and pick the largest value among the significant variables:
we can see that Resilience has the best Beta (.544) compared to Gender
with Beta = .323.
Overall, this study shows that gender and resilience were significant predictors of life
satisfaction. Females had more life satisfaction than males. Income had no significant
contribution to the prediction of life satisfaction. The model explained 32% of variance
in life satisfaction. And, Resilience was a stronger predictor of life satisfaction
than gender. The results were generalizable from these participants to the targeted
population of adults (originally in the sampling frame list).
SSS 590 – Online – Week 7 – Learning About Chi-Square Association – Dr. Farber 2015
Notes on Chi-Square Association
• The chi-square [χ²] test is a test of association between or among frequencies of the
variables.
• Both the independent and dependent variables have to be of nominal or ordinal level
of measurement. If one of the variables is of a higher level of measurement (i.e.,
interval or ratio), then the data have to be grouped in order to create categories of data
(that are at ordinal level of measurement).
• The null hypothesis addresses the association between variables by asserting that the
differences were created by random sampling error. Note: the relationship that is
being investigated is an “association between frequencies” of the levels in
variables and NOT a correlation!
• Therefore, as in using other statistical tests, the null hypothesis is rejected when the
probability of the chi-square [χ²] result is p < .05.
• Use degrees of freedom (df) in order to find the critical value of chi-square from
the table: df = (rows – 1) x (columns – 1).
• Remember that in order to conclude that the obtained calculated chi-square result is
statistically significant, the calculated/obtained value of the chi-square result has to be
bigger than the critical chi-square value in the table, at p = .05 (either at 2-tail or 1-tail
hypothesis that is selected during the hypothesis testing process).
• Note: When you have the p-value available with the chi-square result [or any other
statistical test, for that matter], then the p-value is first judged on whether it is
statistically significant [p […]
(a) For 1-Tailed: Obtained (7.24) > Critical (2.71) = Reject the Null of No association
(b) For 2-Tailed: Obtained (7.24) > Critical (3.84) = Reject the Null of No association
Notice that our obtained result for 7.24 is still significant at 1-tailed critical chi-square at p = .005!
•
Draw a conclusion:
The results of this study show that there is a statistically significant association between
student gender and level of comfort in learning research (χ²(1, N = 75) = 7.24, p < .05).
Based on observed proportions, males are more likely to report “high” level of comfort
(71%) when compared to females, who are more likely to report a “low” level of comfort
(62%) in learning research.
Example 2:
A study examined behaviors of 120 college students. The study was interested in
testing an association [relationship] between students’ self-reported behaviors and class
participation. The students’ self-reported behaviors were measured by their [“YES/NO”]
responses that occurred in the past 30 days. The class participation was measured by
instructors’ verification [“YES/NO”] of whether the student cut class more than two
times in the past 30 days.
Examine the findings in Table 1 and answer the questions below:
Table 1
Number and Percentage of Students Answering YES to Classroom Behaviors
Based on Class Participation (N=120)

                         | Cutting Class [YES] | Not Cutting Class [NO] |
                         | n = 68 (57%)        | n = 52 (43%)           |
Behavior (YES)           | n    | %            | n    | %               | Chi-square
Got Drunk                | 59   | 87%          | 24   | 46%             | 22.79**
Sped                     | 63   | 93%          | 39   | 75%             | 7.19*
Broke the law            | 35   | 51%          | 10   | 19%             | 13.07**
Told a significant lie   | 14   | 21%          | 8    | 15%             | 0.53
Was feeling depressed    | 7    | 10%          | 5    | 10%             | 0.02
Read a book for pleasure | 25   | 37%          | 15   | 29%             | 0.83
Visited family           | 40   | 59%          | 48   | 92%             | ?????
* p < .05, ** p < .002
Another way to represent rows and columns is to create a cross-tabulation table:

Behavior:          | Class Participation                        |
Getting Drunk      | Cutting Class YES   | Cutting Class NO     | Totals
                   | (n=68)              | (n=52)               |
YES    count       | 59 (87%)            | 24 (46%)             | 83 (69%)
       row %       | (71%)               | (29%)                | 100%
NO     count       | 9 (13%)             | 28 (54%)             | 37 (31%)
       row %       | (24%)               | (76%)                | 100%
Totals count       | 68 (100%)           | 52 (100%)            | 120 (100%)
       row %       | 57%                 | 43%                  | 100%
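The chi-square value reported for “Got Drunk” in Table 1 can be reproduced from the four cell counts. The sketch below uses the usual expected-count formula (row total x column total / N) and sums (observed – expected)² / expected over the cells; the function name is made up for illustration.

```python
def chi_square(table):
    """Pearson chi-square and df for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: got drunk YES / NO; columns: cutting class YES / NO.
statistic, df = chi_square([[59, 24], [9, 28]])
print(round(statistic, 2), df)
```

The same function can be applied to any other row of Table 1 once its YES and NO counts are filled into a cross-tabulation.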
Q1.
What percent of the sample tends to “cut class”? 68 / 120 = 56.67%, rounded to 57%
Q2.
What percent of the sample “gets drunk”? 83 / 120 = 69.17%
Q3.
What percent of those who “Cut Class” in the past month did not get drunk?
9 / 68 = 13.24%
Q4.
What percent of those who “Did Not Cut Class” in the past month did not get
drunk?
28 / 52 = 53.85% …or rounded to 54%
Q5.
Who is more likely to “get drunk” in the past month, those who “cut class” or those
“who do not cut class”? Indicate the percentages that you are comparing:
Start out with the “getting drunk” row….71% cut class and 29% did not!
Q6.
Based on the chi-square result in Table 1, is there a statistically significant
association between these two variables (getting drunk and class cutting)?
Yes: chi-square = 22.79 at p < .002, which means the p-level < .05 for determining
statistical significance, and in turn that means the chi-square result is statistically
significant. And, that also means that there is a significant association between getting
drunk and cutting class categories!
And, then, we would need to look at the proportions or percentages in the table in order to
see how the association (or dependency between categories) plays out!
Q7.
State the null hypothesis between “Getting drunk” and “Class Cutting”:
For example: There will be no statistically significant association between “getting drunk”
and “cutting class” levels.
Q8.
Utilizing the table on the previous page, complete the cross-tabulation
table for “Visiting Family” and “Class Participation”
Behavior
Class Participation
Totals
Visiting Family
YES
NO
Farber NCSSS/CUA 2015
Class Cutting
Not Class Cutting
YES (n=68)
NO (n=52)
40
48
Totals
7
SSS 590 – Online – Week 7 – Learning About Chi-Square Association – Dr. Farber 2015
68 (100%)
Q9.
52(100%)
8
120
State a possible one-tailed research hypothesis between “Visiting family” and “Class
Participation” :
Q10. Examining results in Table 1, calculate whether there is a statistically significant
association between “Visiting Family” and “Class Cutting”? Explain:
For this example – you will have to compute the chi-square value from the table here!
Chi-Square Critical Table:
[Table not reproduced here. Its headings give the alpha set for a 1-tail directional
hypothesis and the alpha set for a 2-tail non-directional hypothesis.]
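Critical chi-square values for the two degrees of freedom used in the examples in this handout (df = 1 and df = 2) can be cross-checked without a printed table. The sketch below is an illustration only, not the table from the handout. It relies on two standard facts: a chi-square with 1 df is a squared standard normal, and with 2 df the upper-tail area is exp(-x/2). Here alpha means the upper-tail area of the chi-square distribution, and the function name `chi2_critical` is hypothetical.

```python
import math
from statistics import NormalDist


def chi2_critical(alpha, df):
    """Critical chi-square value for upper-tail area alpha (df = 1 or 2 only)."""
    if df == 1:
        # A chi-square with 1 df is a squared standard normal, so the upper
        # alpha tail corresponds to a z cutoff with alpha/2 in each tail.
        z = NormalDist().inv_cdf(1 - alpha / 2)
        return z * z
    if df == 2:
        # With 2 df the upper-tail area is exp(-x/2), so x = -2 * ln(alpha).
        return -2.0 * math.log(alpha)
    raise ValueError("this sketch only covers df = 1 and df = 2")


for df in (1, 2):
    for alpha in (0.05, 0.01):
        print(f"df={df}, alpha={alpha}: critical value = {chi2_critical(alpha, df):.3f}")
```

A calculated chi-square larger than the critical value for its df is significant at that alpha.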
Using SPSS Output
Example #1: A study is investigating whether there is a statistically significant association between the
type of residence of the care recipient [RESIDE] and whether or not the caregiver uses respite care
services [RESPITE].
RESPITE * RESIDE Crosstabulation

                                  RESIDE
RESPITE                           Resides with    Resides      Total
                                  caregiver       elsewhere
NO       Count                    141             91           232
         % within RESIDE          72.7%           85.8%        77.3%
YES      Count                    53              15           68
         % within RESIDE          27.3%           14.2%        22.7%
Total    Count                    194             106          300
         % within RESIDE          100.0%          100.0%       100.0%
Chi-Square Tests

                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              6.781a    1    .009
Continuity Correction b         6.051     1    .014
Likelihood Ratio                7.171     1    .007
Fisher's Exact Test                                          .009         .006
Linear-by-Linear Association    6.759     1    .009
N of Valid Cases                300
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 24.03.
b. Computed only for a 2x2 table
• Here we have the results of the Chi-Square statistic.
• Notice that here we have several different Chi-Square test statistical results:
• Pearson Chi-Square is the most frequently reported result!
• However, technically, because this is a 2 x 2 table, the “Continuity Correction”
(or Yates Correction Factor noted in your textbook) version of Chi-Square is provided.
Its value is 6.051 and the p level equals .014.
• Do we have a statistically significant association here? Yes: p < .05!
• Another thing to notice in this table is the statement that “0 cells have expected count less
than 5.” If there were cells with an expected count of less than 5, we would have to use the
“Fisher’s Exact Test” statistic instead of Chi-Square.
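The Pearson and continuity-corrected values in the SPSS output can be reproduced by hand from the four cell counts. The following is a minimal sketch under stated assumptions, not SPSS's own routine: the function name `chi_square_2x2` is hypothetical, and the observed counts are taken from the RESPITE * RESIDE crosstabulation (rows NO/YES, columns resides with caregiver/resides elsewhere).

```python
def chi_square_2x2(table, yates=False):
    """Chi-square for a 2x2 table of observed counts [[a, b], [c, d]].

    Returns (chi-square value, smallest expected count).
    With yates=True, 0.5 is subtracted from each |observed - expected|
    before squaring (the continuity correction).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    expected_counts = []
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            expected_counts.append(expected)
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            chi2 += diff * diff / expected
    return chi2, min(expected_counts)


# Observed counts from Example #1: rows = RESPITE (NO, YES),
# columns = RESIDE (with caregiver, elsewhere).
observed = [[141, 91], [53, 15]]
pearson, min_expected = chi_square_2x2(observed)
corrected, _ = chi_square_2x2(observed, yates=True)

print(f"Pearson chi-square:       {pearson:.3f}")
print(f"Continuity correction:    {corrected:.3f}")
print(f"Minimum expected count:   {min_expected:.2f}")
```

The minimum expected count is the quantity to check against 5 before relying on chi-square instead of Fisher's Exact Test.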
Now, let’s look at the magnitude of the obtained chi-square result:
Symmetric Measures

                                    Value     Approx. Sig.
Nominal by Nominal    Phi           -.150     .009
                      Cramer's V    .150      .009
N of Valid Cases                    300

• This final table tells us the strength of the association using either Phi (for 2 x 2
tables) or Cramer’s V (for larger tables).
• Therefore, we would report the Phi coefficient value for this computer run (you ignore
the minus sign).
• If the χ² is not significant, you don’t report these findings or the percentages in the
cells!
• < .39 = weak association
• .40 - .69 = moderate association
• .70+ = strong association
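For a 2 x 2 table, the size of Phi follows directly from the Pearson chi-square and the sample size: Phi = sqrt(chi-square / N). The sketch below is an illustration with hypothetical function names. It applies that formula to the Example #1 output and then labels the result using the cutoffs listed above; treating values between .39 and .40 as weak is an assumption, since the listed ranges leave that gap open.

```python
import math


def phi_coefficient(chi2, n):
    """Size of Phi for a 2x2 table: sqrt(chi-square / N). The sign is ignored."""
    return math.sqrt(chi2 / n)


def strength_label(value):
    """Label an association using the handout's cutoffs.

    ASSUMPTION: values between .39 and .40 are treated as weak.
    """
    size = abs(value)
    if size >= 0.70:
        return "strong"
    if size >= 0.40:
        return "moderate"
    return "weak"


# Pearson chi-square and N from the Example #1 output.
phi = phi_coefficient(6.781, 300)
print(f"Phi = {phi:.3f} ({strength_label(phi)} association)")
```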
INTERPRETATION: “There is a statistically significant association between the residence of
the care recipient and whether or not the caregiver has used respite services [χ²(1, N = 300)
= 6.05, p = .014]. Specifically, a higher percentage of caregivers who had the care recipient
living with them used respite services compared to caregivers of care recipients who lived
elsewhere (27.3% vs. 14.2%, respectively). This is a weak relationship based on a Phi
coefficient of .15.”
NOTE:
For a 2 x 2 cross-tab, you report the percentages in the cells that are most pertinent to the
focus of your study (focusing on the DV or “use of respite services” in this case), reading across
categories of the IV.
• Here, we would want to report the column percentages (not counts) of caregivers who
had used respite services (“YES” responses).
• Also, you generally mention the IV first, followed by the DV, especially if you have
stated a specific hypothesis concerning the association between the 2 variables. This is
why we phrased the first two sentences to mention the care recipient’s living situation
first, followed by the use of respite services second, rather than the other way around.
Example #2: Chi-Square
This study investigates whether there is an association between the caregiver’s gender [SEX] and
the level of hours of caregiving provided to the care recipient per week [CAREHRS].
CAREHRS * SEX Crosstabulation

                                          SEX
CAREHRS                                   MALE       FEMALE     Total
Low        Count                          12         84         96
           % within SEX RECODED           35.3%      31.6%      32.0%
Moderate   Count                          13         88         101
           % within SEX RECODED           38.2%      33.1%      33.7%
High       Count                          9          94         103
           % within SEX RECODED           26.5%      35.3%      34.3%
Total      Count                          34         266        300
           % within SEX RECODED           100.0%     100.0%     100.0%
Chi-Square Tests

                                Value     df   Asymp. Sig.
                                               (2-sided)
Pearson Chi-Square              1.058a    2    .589
Likelihood Ratio                1.098     2    .578
Linear-by-Linear Association    .718      1    .397
N of Valid Cases                300
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.88.
What can we conclude about this association and why?
You should be seeing that there is no statistically significant association between gender
and care hours: the Pearson chi-square is 1.058 with p = .589, which is greater than .05!
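The same hand calculation extends to tables larger than 2 x 2. The sketch below is illustrative only; the function name `chi_square` is hypothetical. It computes the Pearson chi-square and degrees of freedom for the CAREHRS * SEX counts. The p-value line uses the closed form exp(-chi2 / 2), which holds only when df = 2, as it does for this 3 x 2 table.

```python
import math


def chi_square(table):
    """Pearson chi-square and df for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Observed counts from Example #2: rows = CAREHRS (Low, Moderate, High),
# columns = SEX (MALE, FEMALE).
observed = [[12, 84], [13, 88], [9, 94]]
chi2, df = chi_square(observed)

# Upper-tail p-value; this closed form is valid only for df == 2.
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
print("significant at .05" if p_value < 0.05 else "not significant at .05")
```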
Example #3: Chi-Square
This study investigates whether there is an association between the level of caregiving burden
[Stress] (recoded as “low,” “moderate,” or “high”) and levels of depression [DEPRESSION]
(recoded as “low” or “high”).
DEPRESSION * BURDEN Crosstabulation

                                          Stress
DEPRESSION                                Low        Moderate   High       Total
LOW      Count                            79         56         35         170
         % within Burden Levels           78.2%      57.7%      34.3%      56.7%
HIGH     Count                            22         41         67         130
         % within Burden Levels           21.8%      42.3%      65.7%      43.3%
Total    Count                            101        97         102        300
         % within Burden Levels           100.0%     100.0%     100.0%     100.0%
Chi-Square Tests

                                Value      df   Asymp. Sig.
                                                (2-sided)
Pearson Chi-Square              39.903a    2    .000
Likelihood Ratio                41.330     2    .000
Linear-by-Linear Association    39.713     1    .000
N of Valid Cases                300
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 42.03.
Symmetric Measures

                                    Value     Approx. Sig.
Nominal by Nominal    Phi           .365      .000
                      Cramer's V    .365      .000
N of Valid Cases                    300
What do these results tell us and why?
Yes, these results point to statistical significance (Pearson chi-square = 39.903, p < .001),
and the effect size is modest (use Cramer’s V, here .365). Then look at the proportions:
you should see that “low depression” is more associated with “low stress” and “high
depression” is dependent on “high stress”, and so on.
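For tables larger than 2 x 2, Cramer's V rescales the chi-square by the sample size and the smaller table dimension: V = sqrt(chi-square / (N x min(rows - 1, columns - 1))). The sketch below is a minimal illustration with hypothetical names. It applies that formula to the Example #3 output and recomputes the column percentages of high depression that the interpretation refers to.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (N * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Pearson chi-square and N from the Example #3 output (2 rows x 3 columns).
v = cramers_v(39.903, 300, n_rows=2, n_cols=3)

# Counts per stress level as (low depression, high depression).
counts = {"Low": (79, 22), "Moderate": (56, 41), "High": (35, 67)}
pct_high_depression = {
    level: 100.0 * high / (low + high) for level, (low, high) in counts.items()
}

print(f"Cramer's V = {v:.3f}")
for level, pct in pct_high_depression.items():
    print(f"{level} stress: {pct:.1f}% high depression")
```

Because the smaller dimension of this table is 2, V equals Phi here, which is why SPSS prints the same value for both.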
SSS 590: WEEK 7, ASSIGNMENT 4
Names:
THIS ASSIGNMENT IS WORTH 100 POINTS. IT IS COMPRISED OF EIGHT ITEMS,
ALL OF WHICH HAVE MULTIPLE PARTS. PLEASE TYPE YOUR RESPONSES
DIRECTLY INTO THIS DOCUMENT AND USE A BLUE–COLORED FONT.
1. Below are the results of a chi–square test which was conducted to determine if there
was an association between meeting the diagnostic criteria for Depression (DEP) and
meeting the diagnostic criteria for General Anxiety Disorder (GAD) among adults
receiving services from Agency XYZ.
(2 points for each sub–item; 10 points total)
GAD*DEP CROSSTABULATION

                                       Meets Diagnostic
                                       Criteria for DEP
Meets Diagnostic                       No         Yes        Total
Criteria for GAD
No       Count                         22         18         40
         % within GAD                  55.0%      45.0%      100.0%
Yes      Count                         8          57         65
         % within GAD                  12.3%      87.7%      100.0%
Total    Count                         30         75         105
         % within GAD                  28.6%      71.4%      100%
CHI–SQUARE TESTS

                                Value      df   Asymp. Sig.
                                                (2–sided)
Pearson Chi–Square              22.115a    1    .000
Likelihood Ratio                22.094     1    .000
Linear–by–Linear Association    21.904     1    .000
N of Valid Cases                105
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 11.43.
SYMMETRIC MEASURES

                                    Value     Approx. Sig.
Nominal by Nominal    Phi           .459      .000
                      Cramer's V    .459      .000
N of Valid Cases                    105
a. Present the Ha (2–tailed) for this study.
b. Present the Ho for this study.
c. Interpret the results of the chi–square test and specify if Ho should be rejected or retained.
d. If applicable, interpret the crosstabulations (re: differences among the groups).
e. If applicable, interpret the magnitude of the association.
2. Agency RST tracked the number of hours employees exercised last week (mean=35,
SD=4.8). Show all your calculations for each item below. No credit will be given
without this information.
(2 points for each sub–item; 10 points total)
a. How many hours did an employee with a z–score of –2.1 exercise? Round your answer
to the nearest whole number (re: no decimals).
b. What percentage of employees exercised between 27 and 44 hours? Round your answer
to the nearest whole number (re: no decimals).
c. How many hours did an employee who scored 2.5 SD below the mean exercise? Round
your answer to the nearest whole number (re: no decimals).
d. What percentage of employees exercised between 33 and 37 hours? Round your answer
to the nearest whole number (re: no decimals).
e. What is the associated percentile for an employee who exercised for 40 hours? Round
your answer to the nearest whole number (re: no decimals).
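The conversions that item 2 relies on can be sketched in Python as follows. This is an illustration under stated assumptions, not an answer key. It assumes exercise hours are normally distributed, which the item does not state explicitly; the function names are hypothetical; and values read from a printed z table may differ slightly in the decimals.

```python
from statistics import NormalDist

MEAN_HOURS, SD_HOURS = 35, 4.8  # mean and SD given in item 2


def raw_from_z(z):
    """Convert a z-score back to raw hours: x = mean + z * SD."""
    return MEAN_HOURS + z * SD_HOURS


def z_from_raw(x):
    """Convert raw hours to a z-score: z = (x - mean) / SD."""
    return (x - MEAN_HOURS) / SD_HOURS


def percent_between(low, high):
    """Percentage of cases between two raw scores.

    ASSUMPTION: hours follow a normal distribution.
    """
    dist = NormalDist(MEAN_HOURS, SD_HOURS)
    return 100.0 * (dist.cdf(high) - dist.cdf(low))


hours_at_z = raw_from_z(-2.1)
share_27_to_44 = percent_between(27, 44)
print(f"z = -2.1 corresponds to {hours_at_z:.2f} hours")
print(f"z-scores for 27 and 44 hours: {z_from_raw(27):.3f}, {z_from_raw(44):.3f}")
print(f"Share between 27 and 44 hours: {share_27_to_44:.1f}%")
```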
3. Below are the results of a chi–square test which was conducted to determine if there was
an association between educational attainment (EDU) and meeting the diagnostic criteria
for Alcohol Use Disorder (AUD) among adults receiving services from Agency XYZ.
(2 points for each sub–item; 10 points total)
AUD*EDU CROSSTABULATION

                                       Highest Level of EDU Completed
Meets Diagnostic                       High          Some        College
Criteria for AUD                       School/GED    College     Graduate    Total
No       Count                         8             9           18          35
         % within AUD                  22.9%         25.7%       51.4%       100.0%
Yes      Count                         27            26          17          70
         % within AUD                  38.6%         37.1%       24.3%       100.0%
Total    Count                         35            35          35          105
         % within AUD                  33.3%         33.3%       33.3%       100%
CHI–SQUARE TESTS

                                Value     df   Asymp. Sig.
                                               (2–sided)
Pearson Chi–Square              7.800a    2    .020
Likelihood Ratio                7.645     2    .022
Linear–by–Linear Association    6.367     1    .012
N of Valid Cases                105
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 11.67.
SYMMETRIC MEASURES

                                    Value     Approx. Sig.
Nominal by Nominal    Phi           .273      .020
                      Cramer's V    .273      .020
N of Valid Cases                    105
a. Present the Ha (2–tailed) for this study.
b. Present the Ho for this study.
c. Interpret the results of the chi–square test and specify if Ho should be rejected or retained.
d. If applicable, interpret the crosstabulations (re: differences among the groups). Note,
more than one interpretation may be required.
e. If applicable, interpret the magnitude of the association.
4. The mean level of new flu cases per week in Star County is 22 (SD=1.5). However,
among a sample of 144 residents who live in Mira City, which is located in Star County,
the mean level of new flu cases per week is 26.
(Points for each sub-item are provided below; 10 points total)
a. Present the Ha (2–tailed) for this study.
(2 points)
b. Present the Ho for this study.
(2 points)
c. Calculate the z–test. Show all your calculations. No credit will be given without this
information.
(3 points)
d. Draw conclusions based upon comparing your calculated z with the critical z of +/–1.96.
Indicate if the null hypothesis should be retained or rejected. Explain whether the mean
number of flu cases per week in Star County and Mira City are significantly different.
(3 points)
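The one-sample z-test that item 4 calls for divides the difference between the sample mean and the population mean by the standard error, SD / sqrt(n). The sketch below is an illustration with a hypothetical function name. It treats SD = 1.5 as the population standard deviation, which is an assumption about how the item is meant to be read.

```python
import math


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """z = (sample mean - population mean) / (population SD / sqrt(n))."""
    standard_error = pop_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error


# Values from item 4. ASSUMPTION: SD = 1.5 is the population SD.
z = one_sample_z(sample_mean=26, pop_mean=22, pop_sd=1.5, n=144)

# Two-tailed decision rule against the critical z of +/-1.96.
reject_null = abs(z) > 1.96
print(f"z = {z:.2f}; reject the null hypothesis: {reject_null}")
```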
5. Below are the results of a Multiple Regression Analysis (MRA) that was conducted to
determine which of the following variables could be used to predict the number of
cancelled therapy sessions among youth in a residential treatment center: age, sex, having
earned off–campus privileges, treatment group, the number of serious behavioral
incidents (SBI), and the quality of the therapist–client relationship (Quality). Overall, the
model explained 30.3% of the variance in the number of cancelled therapy sessions.
(Points for each sub-item are provided below; 20 points total)
COEFFICIENTS

                  Unstandardized            Standardized
                  Coefficients              Coefficients
Model             B          Std. Error     Beta             t          Sig.
(Constant)        4.379      .882                            4.967      .000
Age               –.001      .071           –.001            –.013      .990
Sex a             –1.328     .292           –.369            –4.547     .000
Privileges b      .432       .413           .120             1.045      .298
Group c           –.549      .359           –.153            –1.527     .130
SBI               .070       .034           .238             2.075      .040
Quality           –.202      .071           –.268            –2.838     .005
a. Sex (0=Male, 1=Female)
b. Earned Off–Campus Privileges (0=No, 1=Yes)
c. Treatment Group (0=Routine Treatment, 1=New Treatment)
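One way to read the B column is as a prediction equation: predicted cancelled sessions = constant + the sum of each B times the value of its predictor. The sketch below is an illustration only. The function name is hypothetical, and the example youth's values are made up for demonstration and do not come from the study.

```python
# Unstandardized coefficients (B) from the COEFFICIENTS table.
INTERCEPT = 4.379
COEFFICIENTS = {
    "Age": -0.001,
    "Sex": -1.328,        # 0=Male, 1=Female
    "Privileges": 0.432,  # 0=No, 1=Yes
    "Group": -0.549,      # 0=Routine Treatment, 1=New Treatment
    "SBI": 0.070,
    "Quality": -0.202,
}


def predict_cancelled_sessions(youth):
    """Predicted number of cancelled sessions for one set of predictor values."""
    return INTERCEPT + sum(b * youth[name] for name, b in COEFFICIENTS.items())


# HYPOTHETICAL youth, made up for illustration only.
example_youth = {
    "Age": 15, "Sex": 1, "Privileges": 0, "Group": 1, "SBI": 10, "Quality": 5,
}
predicted = predict_cancelled_sessions(example_youth)
print(f"Predicted cancelled sessions: {predicted:.3f}")
```

A negative B lowers the predicted count as the predictor increases, and a positive B raises it, holding the other predictors constant.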
a. Present the Ha (2–tailed) for this study.
(2 points)
b. Present the Ho for this study.
(2 points)
c. Identify which variables did not significantly predict the number of cancelled therapy
sessions among youth in a residential treatment center.
(2 points)
d. Identify which variables significantly predict the number of cancelled therapy sessions
among youth in a residential treatment center. Which variable was the strongest
predictor?
(2 points)
e. For each of the variables that significantly predict the number of cancelled therapy
sessions among youth in a residential treatment center, specify and interpret the direction
of its relationship with the outcome variable.
(12 points)
6. For each item below, address the following: (a) specify and interpret the direction of the
relationship between the variables; (b) indicate the magnitude/strength of the
relationship between the variables; and (c) calculate the shared variance between the
variables. With regard to (c), show all your calculations and report your answer to one
decimal place. No credit will be given without this information.
(4 points for each sub–item; 20 points total)
a. There is a significant correlation between infants’ lengths of hospitalization in the NICU
and their parents’ levels of anxiety (r = 0.74, p < 0.05).
b. There is a significant correlation between parents’ attendance at PTA meetings and their
children’s year in school (r = –0.41, p < 0.05).
c. There is a significant correlation between family composition (0=Non–Single Parents,
1=Single Parents) and levels of family–related stress (r = 0.53, p < 0.05).
d. There is a significant correlation between levels of leadership and organizational
performance rates (r = 0.37, p < 0.05).
e. There is a significant correlation between parent (0=Fathers, 1=Mothers) and levels of
perceived control (r = –0.20, p < 0.05).
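Shared variance is the squared correlation expressed as a percentage, r² x 100, and the sign of r gives the direction of the relationship. The sketch below is a minimal illustration with hypothetical function names, applied to the r values listed in item 6.

```python
def shared_variance_pct(r):
    """Shared variance as a percentage: r squared times 100, to one decimal."""
    return round(100.0 * r * r, 1)


def direction(r):
    """Direction of the relationship implied by the sign of r."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


# r values from items 6a to 6e.
for label, r in [("a", 0.74), ("b", -0.41), ("c", 0.53), ("d", 0.37), ("e", -0.20)]:
    print(f"{label}. r = {r:+.2f}: {direction(r)}, shared variance = {shared_variance_pct(r)}%")
```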
7. A study was conducted among adolescents with recent cancer diagnoses who
participated in a six week intervention that was designed to enhance their coping skills.
Prior to their enrollment, participants’ levels of resilience were measured via a 30 item,
5–point Likert–scale. Higher scores are associated with higher levels of resilience. At
pre–test, resilience scores were normally distributed (mean=85, SD=2.5).
(2 points for each sub–item; 10 points total)
a. What is the median?
b. What is the mode?
c. Within what range of two scores around the mean did 68% of participants score? Show
all your calculations. No credit will be given without this information.
d. Within what range of two scores around the mean did 95% of participants score? Show
all your calculations. No credit will be given without this information.
e. What is the associated score for a participant whose score was 3SD below the mean?
Show all your calculations. No credit will be given without this information.
8. Upon conclusion of the above study, participants’ levels of resilience were measured
once again (mean=90, SD=5.5).
a. Within what range of two scores below the mean did 34% of participants score? Show
all your calculations. No credit will be given without this information.
b. What is the associated score for a participant whose score was 3SD below the mean?
Show all your calculations. No credit will be given without this information.
c. What range of scores falls within 2 SD below and above the mean? Show all your
calculations. No credit will be given without this information.
d. What range of scores falls between 1 SD and 3 SD above the mean? Show all your
calculations. No credit will be given without this information.
e. In which distribution of resilience scores (pre–test or post–test) are scores more
heterogeneous? Provide a brief rationale for your response.