User's Guide: Which Comes First - The
Dataset or the Research Question?
Researchers often face a decision: whether to first develop a research question and then find the
best dataset to answer it, or to first select a high-value dataset and then develop the research
question. For most researchers, the best approach is often a hybrid of these two options.
The core of high-quality research is a cogent and important research question. For this reason, it
is usually a mistake to simply choose a dataset and then flip through the codebooks until one
finds an interesting variable. On the other hand, datasets rarely have all of the data that one
wants, so trying to fit a pre-conceived notion into existing data can be an exercise in frustration.
Moreover, this approach often results in subpar research when the investigator doesn’t respect
the limitations and strengths of the data with which she is working.
Thus, the best approach is often to develop a broadly-defined area of inquiry, and then identify a
handful of datasets that are well-suited to that focus. Then, one can carefully evaluate the
structure and content of the data to look for unique ways to translate that area of inquiry into a
specific question that is both important and well-suited for that dataset. For example, a researcher
might find that a dataset contains a unique series of questions that provide a novel framework for
studying her area of interest. Or, the dataset may have a unique structure – such as the collection
of longitudinal data, or national representativeness, or linking of patient survey data with
biomarkers – that offers a fresh and exciting way to evaluate a research topic.
Finally, the accessibility, ease of use, and local experience working with a dataset are of critical
importance. For example, Medicare claims data is a tremendous resource for research, but is very
challenging to use. If one’s mentor has used this database and one has access to local data
analysts with extensive experience using Medicare data, that’s great. If not, proceed with caution
unless you have an abundance of time, money, and patience. For this reason, the best datasets for
a junior investigator are often those where (1) there is local experience using the dataset that can
be put to use, and/or (2) the dataset is relatively easy to access, learn, and use.
CHAPTER 4

RESEARCH DESIGN, VALIDITY, AND BEST AVAILABLE EVIDENCE

The best available evidence on public health programs and policies
comes from high-quality studies. A study’s quality is dependent upon
the strength of its methodology, including its research design, outcome measures, settings, participants, interventions, data collection strategies, and statistical techniques.
This chapter explores commonly used research designs and discusses
how they affect a study’s internal validity, external validity, and quality.
Subsequent chapters discuss other components of study methodology.
CHAPTER OBJECTIVES
After reading this chapter, you will be able to
•• Describe the characteristics of commonly used research designs in
   studies to define and meet public health program needs, including
   - Randomized controlled trials with concurrent, parallel, or wait-list
     control groups and factorial designs
   - Quasi-experimental designs with concurrent or parallel control
     groups
108– ●–EVIDENCE-BASED PUBLIC HEALTH PRACTICE
   - Time-series designs, including pretest-posttest and interrupted
     time-series designs
   - Observational designs, including cohorts, case controls, and
     cross-sectional surveys
•• Describe the methods that statisticians, epidemiologists, and other
   health researchers use to ensure that experimental and control groups
   are equivalent before they participate in research; these include
   blinding, random allocation, matching, propensity score analysis, and
   analysis of covariance
•• Describe the threats to internal and external validity that can result
   from a study’s research design
•• Read a research article that evaluates program effectiveness, describe
   the main objective, explain how participants were assessed for
   inclusion in and exclusion from the study, and describe the research
   design
•• When given a table of data, write up the results comparing the
   experimental and control groups
•• When given an excerpt from a study report or article, list the
   variables or covariates that the researchers controlled for
   statistically in order to prevent them from confounding the results
•• Describe how the choice and implementation of a study’s research
   design affects its quality
RESEARCH METHODS AND RESEARCH DESIGN
A study’s methods include its research design; outcome measures; criteria
for including settings, participants, and interventions; sampling strategies
and techniques for reducing bias between participants and interventions;
and data collection and statistical strategies.
Most researchers agree that the “best” way of demonstrating program
effectiveness is through a well-designed and implemented randomized
controlled trial (RCT).
The Randomized Controlled Trial: Going for the Gold
An RCT is an experimental study in which eligible individuals or groups
of individuals (e.g., schools, communities) are assigned at random to receive
one of several programs or interventions. The group in an experiment that
receives the specified program is the experimental group. The control
group is another group assigned to the experiment, but not for the purpose
of being exposed to the program. The performance of the control group usually serves as a standard against which to measure the effect of the program
on the experimental group. The control program may be typical practice
(usual care), an alternative practice, or a placebo (a treatment or program
believed to be inert or innocuous). Random assignment, or random allocation, means that people end up in the experimental or control group by
chance rather than by choice.
Randomized controlled trials are sometimes called true experiments
because, at their best, they can demonstrate causality. That means that, in
theory at least, the researcher can assume that if participants in an RCT
achieve desirable outcomes, the program caused them.
True experiments are often contrasted with quasi-experiments and
observational studies. A quasi-experimental design is one in which the
control group is predetermined (without random assignment) to be comparable to the program group in critical ways, such as being in the same school
or eligible for the same services. In observational designs, the researcher
does not intervene. He or she studies the effects of already existing programs
on individuals and groups (e.g., a retrospective design, historical analysis, or
summative evaluation of programs such as Head Start or the Welfare to Work
Program of the U.S. government). Observational designs are sometimes
called descriptive.
True and quasi-experimental designs aim to link programs to outcomes,
while observational studies are used to illuminate the need for programs,
learn about their implementation, and clarify the findings of current evaluations by applying lessons learned from previous research.
The RCT is considered the gold standard of research designs because it
is the only one that can be counted on to rule out inherent participant characteristics that may affect the program’s outcomes. Put another way, if participants are assigned to experimental and control programs randomly, then
the two groups will probably be alike in all important ways before they participate. If they are different afterward, the difference can reasonably be
linked to the program.
Suppose the evaluators of a health literacy program in the workplace
hope to improve writing skills. They recruit volunteers to participate in a
6-week writing program and compare their writing skills to those of other
workers who are, on average, the same age and have similar educational
backgrounds and writing skills. Suppose also that, after the volunteers complete the 6-week program, the evaluators compare the two groups’ writing
and find that the experimental group performed much better. Can the evaluators claim that the literacy program is effective? Possibly. But the nature of
the design is such that you cannot really tell if some other factors that the
evaluators did not measure are responsible for the apparent program success. The volunteers may have done better because they were more motivated to achieve (that is why they volunteered), had more home-based social
support, and so on.
A better way to evaluate the workplace literacy program is to (a) randomly assign all eligible workers (e.g., those who score below a certain level
on a writing test) either to the experimental program or to a comparable (but not
new) control program and then (b) compare changes in writing skills over
time. With random assignment, all the important factors (e.g., motivation,
home support) are likely to be equally distributed between the two groups.
Then, if the scores are significantly different in favor of the experimental
group, the evaluators will be on firmer ground in concluding that the program is effective (see Table 4.1).
Table 4.1  An Effective Literacy Program: Hypothetical Example

Assignment                        Before the Program    After the Program
--------------------------------  --------------------  -------------------------
Randomly assigned to the          Relatively weak       Significantly improved
experimental literacy program     writing skills        writing skills

Randomly assigned to a standard   Relatively weak       No change: writing skills
and comparable literacy program   writing skills        are still relatively weak

CONCLUSION: The experimental program effectively improved writing skills
when compared to a standard and comparable program.
In sum, RCTs are quantitative, comparative, controlled experiments in
which investigators study two or more programs, interventions, or practices
in a series of eligible individuals who receive them in random order.
Here are two commonly used randomized controlled designs:
1. Concurrent controls in which two (or more) groups are randomly
constituted, and they are studied at the same time (concurrently).
Concurrent controls are sometimes called parallel controls.
2. Wait-list controls in which one group receives the program first and
others are put on a waiting list; if the program appears to be effective,
participants on the waiting list receive it. Participants are randomly
assigned to the experimental and wait-list groups.
Concurrent or Parallel Controls. Here is how evaluation researchers
design randomized controlled trials with concurrent groups (see Figure 4.1):
1. First the researcher appraises the eligibility of the potential participants.
•• Some people are excluded because they did not meet the inclusion criteria or they did meet the exclusion criteria.
•• Some eligible people decide not to participate. They change their
mind, become ill, or are too busy.
2. The remaining potential participants are enrolled in the evaluation study.
3. These participants are randomly assigned to the experiment or to an
alternative (the control).
4. Participants in the experimental and control groups are pretested,
that is, compared at baseline (before program participation) when
possible. They are always compared (posttested) after participation.
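As a sketch, the first three steps might look like the following in code. The participant records, criteria functions, and consent field are hypothetical illustrations, not drawn from any study in this chapter:

```python
import random

def run_concurrent_rct(candidates, meets_inclusion, meets_exclusion):
    """Sketch of steps 1-3: appraise eligibility, enroll, randomize."""
    # Step 1: exclude those who fail inclusion or meet exclusion criteria
    eligible = [p for p in candidates
                if meets_inclusion(p) and not meets_exclusion(p)]
    # Step 2: enroll the eligible people who agree to participate
    enrolled = [p for p in eligible if p.get("consents", True)]
    # Step 3: random assignment to experimental or control group
    random.shuffle(enrolled)
    midpoint = len(enrolled) // 2
    experimental, control = enrolled[:midpoint], enrolled[midpoint:]
    # Step 4 would pretest both groups at baseline (when possible)
    # and always posttest them after program participation.
    return experimental, control
```

The key point the code makes concrete is the ordering: eligibility screening and enrollment happen before randomization, so refusals and exclusions cannot bias which group a participant lands in.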
Figure 4.1  Randomized Controlled Trial With Concurrent Controls

[Flowchart: potential participants are assessed for eligibility; those who
do not meet the inclusion criteria, meet the exclusion criteria, or refuse
to participate are excluded; eligible participants are enrolled and then
randomized to the experimental group or the control group.]
Example 4.1 illustrates three randomized controlled trials with concurrent
controls.
Example 4.1
Three Randomized Controlled Trials With Concurrent Controls
1. Evaluating Home Visitation by Nurses to Prevent Child Maltreatment in Families Referred to
Child Protection Agencies (MacMillan et al., 2005)
Objective. Recurrence of child maltreatment is a major problem, yet little is known about
approaches to reduce this risk in families referred to child protection agencies. Since home
visitation by nurses for disadvantaged first-time mothers has proven effective in the prevention of
child abuse and neglect, the researchers investigated whether this approach might reduce the
recurrence of maltreatment.
Assessment for Eligibility. Families were eligible if they met the following criteria: (1) the
index child was younger than 13 years, (2) the reported episode of physical abuse or neglect
occurred within the previous 3 months, (3) the child identified as physically abused or
neglected was still living with his or her family or was to be returned home within 30 days of
the incident, and (4) families were able to speak English. Families in which the abuse was
committed by a foster parent, or in [which] the reported incident included sexual abuse, were
not eligible.
Evaluation Research Design. The evaluators randomly assigned 163 families to control or
intervention groups. Control families received standard services arranged by the agency. These
included routine follow-up by caseworkers whose focus was on assessment of risk of recidivism,
provision of education about parenting, and arrangement of referrals to community-based parent
education programs and other services. The intervention group of families received the same
standard care plus home visitation by a public-health nurse every week for 6 months, then every
2 weeks for 6 months, then monthly for 12 months.
Findings. At 3-year follow-up, recurrence of child physical abuse did not differ between groups.
However, hospital records showed significantly higher recurrence of either physical abuse or
neglect in the intervention group than in the control group.
2. Evaluating Therapy for Depressed Elderly People: Comparing a Holistic Approach to Medication
Alone (Nickel et al., 2005)
Objective. To find out whether recovering the ability to function socially takes a different course
with integrative, holistic treatment than it does with medication alone.
Assessment for Eligibility. To be included, participants had to be female; aged 65–75; living at
home; and disturbed by symptoms such as sadness, lack of drive, and reclusion. Grounds for
exclusion were the need for personal assistance in any of four key activities of daily living:
bathing, dressing, walking inside the house, and transferring from a chair; significant cognitive
impairment with no available proxy; diagnosis of a terminal illness, psychosis, or bipolar
disorder; the current use of antidepressants or psychotherapy; and plans to change
residence within the next four months.
Findings. Both forms of therapy afforded a relatively rapid reduction of depressive
symptoms. However, the integrative treatment not only led to a quicker reduction in
depression but was also the only one that led to a significant improvement in the
ability to function socially.
3. Evaluating a Health Care Program to Get Adolescents to Exercise (Patrick et al., 2006)
Objective. Many adolescents do not meet national guidelines for participation in regular,
moderate, or vigorous physical activity; for limitations on sedentary behaviors; or for dietary
intake of fruits and vegetables, fiber, or total dietary fat. This study evaluated a health care–based
intervention to improve these behaviors.
Assessment for Eligibility. Adolescents between the ages of 11 and 15 years were recruited
through their primary care providers. A total of 45 primary care providers from 6 private
clinic sites in San Diego County, California, agreed to participate in the study. A
representative group of healthy adolescents seeing primary care providers was sought by
contacting parents of adolescents who were already scheduled for a well-child visit and by
outreach to families with adolescents. Adolescents were excluded if they had health
conditions that would limit their ability to comply with physical activity or diet
recommendations.
Evaluation Research Design. After baseline measures but before seeing the provider, participants
were randomized to either the Patient-Centered Assessment and Counseling for Exercise +
Nutrition (PACE+) program or to a sun protection control condition.
Findings. Compared with adolescents in the sun protection control group, girls and boys in the
diet and physical activity program significantly reduced sedentary behaviors. Boys reported more
active days per week. No program effects were seen with percentage of calories from fat
consumed or minutes of physical activities per week. The percentage of adolescents meeting
recommended health guidelines was significantly improved for girls for consumption of saturated
fat and for boys’ participation in days per week of physical activity. No between-group
differences were seen in body mass index.
Wait-List Control: Do It Sequentially. With a wait-list control design, both
groups are assessed for eligibility, but one is randomly assigned to be given
the program now (experimental group) and the other is put on a waiting list
(control group). After the experimental group completes the program, both
groups are assessed a second time. Then the control group receives the program and both groups are assessed again (see Figure 4.2).
Figure 4.2  Randomized Controlled Trial Using a Wait-List Control

[Flowchart: potential participants are assessed for eligibility; those who
do not meet the inclusion criteria, meet the exclusion criteria, or refuse
to participate are excluded; eligible participants are enrolled and then
randomized either to the experimental group, which gets the program now,
or to the control group, which is put on a waiting list.]
Here is how this design is used:
1. Compare Group 1 (experimental group) and Group 2 (control
group) at baseline (the pretest). If random assignment has worked,
the two groups should not differ from one another.
2. Give Group 1 the program.
3. Assess the outcomes for Groups 1 and 2 at the end of the program. If
the program is working, expect to see a difference in outcomes favoring the experimental group.
4. Give the program to Group 2.
5. Assess the outcomes a second time. If the program is working, Group 2
should catch up to Group 1, and both should have improved in their
outcomes (see Figure 4.3).
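The expected pattern in the five steps can be illustrated with hypothetical outcome scores. The numbers below are invented for illustration and do not come from any study in this chapter:

```python
# Hypothetical outcome scores for a wait-list control design.
# Group 1 (experimental) receives the program between baseline and Time 1;
# Group 2 (the wait-list control) receives it between Time 1 and Time 2.
scores = {
    "experimental": {"baseline": 5, "time1": 20, "time2": 21},
    "wait_list":    {"baseline": 5, "time1": 6,  "time2": 20},
}

def gain(group, start, end):
    """Change in outcome score for a group between two assessments."""
    return scores[group][end] - scores[group][start]

# At Time 1, the difference favors the experimental group (step 3);
# by Time 2, the wait-list group has caught up (step 5).
```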
Figure 4.3  Evaluating Effectiveness With a Wait-List Control:
The Wait-List Group (Control Group) Catches Up

[Line chart: outcome scores at baseline, Time 1, and Time 2 for the
experimental and control groups. Both groups start with similar baseline
scores; the experimental group's scores rise by Time 1, and the wait-list
(control) group catches up by Time 2.]
Example 4.2 has three illustrative wait-list control evaluation designs.
Example 4.2 Three RCTs With Wait-List Controls
1. Evaluating a Methadone Maintenance Treatment in an Australian Prison System (Dolan et al., 2003)
Objective. To determine whether methadone maintenance treatment reduced heroin use, syringe
sharing, and HIV or hepatitis C incidence among prisoners.
Assessment for Eligibility. Male inmates were eligible to participate if they (1) were assessed as
suitable for methadone maintenance by a detailed interview with medical staff who confirmed
they had a heroin problem; (2) were serving prison sentences longer than four months at the time
of interview; and (3) were able to provide signed informed consent.
Evaluation Research Design. All eligible prisoners seeking drug treatment were randomized to
methadone or a wait-list control group and followed up after four months.
Findings. Heroin use was significantly lower among treated than control subjects at follow-up.
Treated subjects reported lower levels of drug injection and syringe sharing at follow-up. There
was no difference in HIV or hepatitis C incidence.
2. Evaluating Two Brief Treatments for Sleep Problems in Young Learning Disabled Children:
A Randomized Controlled Trial (Montgomery, Stores, & Wiggs, 2004)
Objective. To investigate the efficacy of a media-based, brief behavioral treatment of sleep
problems in children with learning disabilities.
Assessment for Eligibility. The study included children aged 2–8 years with any form of severe
learning disability, confirmed by a general practitioner. Severe sleep problems were defined
according to standardized criteria as follows: (1) night waking occurring three or more times a week
for more than a few minutes and the child disturbing the parents or going into their room or bed
and/or (2) settling problems occurring three or more times a week with the child taking more than
one hour to settle and disturbing the parents during this time. These problems needed to have been
present for at least three months and not be explicable in terms of a physical problem such as pain.
Evaluation Research Design. The parents of severely learning disabled children took part in a
randomized controlled trial with a wait-list control group. Face-to-face delivered treatment was
compared to usual care, and a booklet-delivered treatment was compared to usual care.
Findings. Both forms of treatment (face-to-face and booklet) were almost equally effective
compared with the controls. Two thirds of children who were taking over 30 minutes to settle five
or more times per week and waking at night for over 30 minutes four or more times per week
improved on average to having such settling or night waking problems for only a few minutes or
only once or twice per week. These improvements were maintained after six months.
3. Evaluating a Mental Health Intervention for Schoolchildren Exposed to Violence: A Randomized
Controlled Trial (Stein et al., 2003)
Objective. To evaluate the effectiveness of a collaboratively designed school-based intervention
for reducing children’s symptoms of posttraumatic stress disorder (PTSD) and depression that has
resulted from exposure to violence.
Assessment for Eligibility. Sixth-grade students at two large middle schools in Los Angeles
who reported exposure to violence and had clinical levels of symptoms of PTSD using
standard measures.
Evaluation Research Design. Students were randomly assigned to a ten-session standardized
cognitive-behavioral therapy (the Cognitive-Behavioral Intervention for Trauma in Schools) early
intervention group or to a wait-list delayed intervention comparison group conducted by trained
school mental health clinicians.
Findings. Compared with the wait-list delayed intervention group (no intervention), after
three months of intervention, students who were randomly assigned to the early intervention
group had significantly lower scores on symptoms of PTSD, depression, and psychosocial
dysfunction. At six months, after both groups had received the intervention,
the two groups no longer differed significantly on symptoms of PTSD and
depression.
A wait-list control design (sometimes called switching replications or
delayed treatment design) has the advantage of allowing the evaluator to
compare experimental and control group performance on the same
program. It is sometimes difficult to find or implement an alternative control program that is equal to or better than the new program. Also, the
new program may be designed to fill a gap in the availability of programs,
and no comparable program may actually be available at all. Finally,
the nature of the design means that everyone receives the program, and,
in some circumstances, this may be an incentive for everyone to participate fully.
Wait-list control designs are particularly practical when programs are
repeated at regular intervals, as they are in schools with a semester system.
For example, students can be randomly assigned to Group 1 or Group 2, with
Group 1 participating in the first semester. Group 2 can then participate in
the second semester. The design is especially efficient in settings that can
wait for results.
Wait-list control designs also rely on the experimental group's improvement
leveling off by the time it completes the program. If improvement in the
experimental group continues while the control group is receiving the
program, then the effects of the program on the control group may appear
less spectacular than they actually were. To avoid this confusion, some
investigators advocate waiting for improvement in the experimental group
to level off (a "wash out" period) and timing the implementation of the
control program accordingly. However, the amount of time needed for the
effect to wash out is usually unknown in advance.
Factorial Designs
Factorial designs enable researchers to evaluate the effects of varying the
features of an intervention or practice to see which combination works best.
In Example 4.3 the investigators are concerned with finding out if the
response rate to Web-based surveys can be improved by notifying prospective
responders in advance by e-mail and/or pleading with them to respond. The
investigators design a study to solve the response-rate problem using a
two-by-two (2 × 2) factorial design in which participants are either
notified about the survey in advance by e-mail or not prenotified, and
either pleaded with to respond or not pleaded with. The factors (they are
also independent variables) are pleading (Factor 1) and notifying (Factor 2).
Each factor has two levels: plead versus don't plead and notify in advance
versus don't notify in advance.

                                          Factor 1: Pleading Status
                                          Plead           Don't Plead
Factor 2:       Notify in Advance
Notification
Status          Don't Notify in Advance
In a 2 × 2 design, there are four study groups: (1) prenotification e-mail
and pleading invitation e-mail, (2) prenotification e-mail and nonpleading
invitation, (3) no prenotification e-mail and pleading invitation, (4) no prenotification and nonpleading invitation. In the diagram above, the empty cells
are placeholders for the number of people in each category (e.g., the number
in the groups plead × notify in advance compared to the number in plead
× don’t notify in advance).
With this design, the researchers can study main effects (plead versus
don’t plead) or interactive effects (prenotification and pleading). The outcome in this study is always the response rate. If research participants are
assigned to groups randomly, the study is a randomized controlled trial.
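A sketch of how the four cells arise from crossing the two factors; the variable names are illustrative, not from the study:

```python
from itertools import product

# The two factors and their levels, as in the text's 2 x 2 example
factors = {
    "pleading": ["plead", "don't plead"],
    "notification": ["notify in advance", "don't notify in advance"],
}

# Crossing every level of one factor with every level of the other
# yields the study groups: four cells for a 2 x 2 design
groups = list(product(factors["pleading"], factors["notification"]))
```

Adding a level to either factor multiplies the number of cells; crossing two levels with three, for instance, yields the six groups of a 2 × 3 design.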
Example 4.3 Factorial Design (Felix, Burchett, & Edwards, 2011)
Improving Response Rate to Web Surveys
Objectives. To evaluate the effectiveness of pre-notification and pleading invitations in Web surveys
by embedding a randomized controlled trial (RCT) in a Web-based survey.
Study Design and Setting. E-mail addresses of 569 authors of published maternal health research were
randomized in a 2×2 factorial trial of a pre-notification vs. no pre-notification e-mail and a pleading
vs. a non-pleading invitation e-mail. The primary outcome was completed response rate, and the
secondary outcome was submitted response rate (which included complete and partial responses).
Results. Pleading invitations resulted in 5.0% more completed questionnaires, although this
difference did not reach statistical significance [odds ratio (OR) 1.23; 95% confidence interval (CI):
0.86, 1.74; P = 0.25]. Pre-notification did not increase the completion rate (OR 1.04; 95% CI 0.73,
1.48; P = 0.83). Response was higher among authors who had published in 2006 or later (OR 2.07;
95% CI: 1.43, 2.98; P = 0.001). There was some evidence that pre-notification was more effective in
increasing submissions from authors with recent publications (P = 0.04).
Conclusion. The use of a “pleading” tone to e-mail invitations may increase response to a Web-based
survey. Authors of recently published research are more likely to respond to a Web-based survey.
Factorial designs may include many factors and many levels. It is the
number of levels that determines the name of the design. For instance, in a
study of psychotherapy versus behavior modification in outpatient, inpatient,
and day treatment settings, there are two factors (treatment and setting),
with one factor having two levels (psychotherapy versus behavior modification) and one having three levels (inpatient, day treatment, and outpatient).
This design is a 2 × 3 factorial design.
Doing It Randomly
Randomization is considered to be the primary method of ensuring
that participating study groups are probably alike at baseline, that is, before
they participate in a program. The idea behind randomization is that if
chance—which is what random means—dictates the allocation of programs,
all important factors will be equally distributed between and among experimental and control groups. No single factor will dominate any of the groups,
possibly influencing program outcomes. That is, each group will be as smart,
as motivated, as knowledgeable, as self-efficacious, and so on as the other to
begin with. As a result, any differences between or among groups that are
observed later, after program participation, can reasonably be assigned to the
program rather than to the differences that were there at the beginning. In
researchers’ terms, randomized controlled trials result in unbiased estimates
of a program’s or treatment’s effects.
How does random assignment work? Table 4.2 describes a commonly
used method and some considerations.
Table 4.2 Random Assignment
1. An algorithm or set of rules is applied to a list of random numbers, which is usually generated
by computer (although printed tables of random numbers are sometimes used in small studies). For
instance, if the research design includes an experimental group and a control group, and an
equal probability of being assigned to each, then the algorithm could specify using the random
number 1 for assignment to the experimental group and 2 for assignment to the control group—
or vice versa. (Other numbers are ignored.)
2. As each eligible person enters the study, he or she is assigned one of the numbers (1 or 2).
3. The random assignment procedure should be designed so that members of the research team
who have contact with study participants cannot influence the allocation process. For instance,
random assignments to experimental or control groups can be placed in advance in a set of
sealed envelopes by someone who will not be involved in opening them. Each envelope
should be numbered (so that all can be accounted for by the end of the study). As a participant
comes through the system, his or her name is recorded, the envelope is opened, and the
assignment (1 or 2) is recorded next to the person’s name.
4. It is crucial that researchers prevent interference with randomization. Who would tamper with
assignment? Sometimes members of the research team may feel pressure to ensure that the most
“needy” people receive the experimental program. One method of avoiding this is to ensure that
tamper-proof procedures are in place. If the research team uses envelopes, they should ensure
the envelopes are opaque (so no one can see through them) and sealed. In large studies,
randomization is done off site.
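A computer-generated allocation list like the one described in step 1 might be sketched as follows. The function name and seed handling are illustrative; in practice the seed and the list would be generated and held off-site, concealed from recruiters:

```python
import random

def make_allocation_list(n_participants, seed):
    """Pre-generate a concealed allocation sequence: 1 = experimental
    group, 2 = control group, each with equal probability."""
    rng = random.Random(seed)  # seed held by someone outside recruitment
    return [rng.choice([1, 2]) for _ in range(n_participants)]

# Like numbered, sealed, opaque envelopes: as each eligible person
# enters the study, record the name and reveal the next assignment
# in sequence, so staff in contact with participants cannot influence it.
allocations = make_allocation_list(200, seed=42)
```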
Variations on how to conduct the random allocation of participants and
programs certainly exist. As described in the checklist below, look for adherence to certain principles regardless of the specifics of the method reported
in a particular evaluation study.
What Evidence-Based Public Health Practice Should Watch For: A Checklist

✓ Study team members who have contact with participants were not
  part of the allocation process. Randomization can be done off-site.
✓ Assignment was not readily available to evaluation team members.
✓ A table of random numbers or a computer-generated list of random
  numbers was used.
Random Clusters
In some situations, it may be preferable for researchers to randomly
assign clusters of individuals (e.g., families, communities) rather than
Chapter 4. Research Design, Validity, and Best Available Evidence– ●–121
individuals to the experimental or control groups. In fact, randomization by
cluster may be the only feasible method of conducting an evaluation in many
settings. Research that uses clusters to randomize is variously known as field
trials, community-based trials, or cluster randomized trials.
Compared with individually randomized trials, cluster randomized trials
are more complex to design, require more participants to obtain equivalent statistical power, and demand more complex analysis. This is because observations on individuals in the same cluster (e.g., children in a classroom) tend
to be interrelated by potentially confounding (confusing) variables. For
example, students in a classroom are about the same age, may have the same
ability, and will have similar experiences. Consequently, the actual sample size
is less (one classroom) than the total number of individual participants
(25 students). The whole is less than the sum of its parts!
Example 4.4 contains an example of random assignment by cluster. In
this example, the cluster comprises colleges. Please note that data on the
outcome (cessation of smokeless tobacco use in the previous 30 days) were
collected from individual students, but randomization was done by college—
not by student. Is this OK? The answer depends on how the study deals with
the potential problems caused by randomizing with one unit (colleges) and
analyzing data from another (students).
Example 4.4 A College-Based Smokeless Tobacco Program for College Athletes (Walsh
et al., 1999)
Objective. The purpose of this study was to determine the effectiveness of a college-based smokeless
tobacco cessation intervention that targeted college athletes. Effectiveness was defined as reported
cessation of smokeless tobacco use in the previous 30 days.
Assessment for Eligibility. Current users of smokeless tobacco (use more than once per month
and within the past month) were eligible for the study. A total of 16 colleges with an average of
23 smokeless tobacco users in each were selected from lists of all publicly supported California
universities and community colleges. Half the colleges were selected to be urban and half to be
rural; all had varsity football and baseball teams. One-year prevalence of cessation among smokeless
tobacco users was determined by self-report of abstinence for the previous 30 days.
Evaluation Research Design. The occurrence of smokeless tobacco use was calculated for each
athlete using information from a questionnaire given to them at baseline. Colleges were then
matched by pairs so that the level of smoking was approximately the same in each of the individual
colleges paired. One college from each pair was randomized to receive the program, while the other
college in the pair received no program.
Findings. In both groups, 314 students provided complete data on cessation. Cessation frequencies
were 35% in the program colleges and 16% in the control colleges. The program effect increased
with level of smokeless tobacco use.
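The matched-pair cluster randomization in Example 4.4 can be sketched as follows (Python; the function and college names are hypothetical). Clusters are first matched into pairs on a baseline variable; a coin flip then decides which member of each pair receives the program:

```python
import random

def randomize_matched_pairs(pairs, seed=None):
    """Given pairs of clusters matched on a baseline variable,
    randomly assign one member of each pair to the program
    and the other to control."""
    rng = random.Random(seed)
    allocation = {}
    for first, second in pairs:
        if rng.random() < 0.5:
            program, control = first, second
        else:
            program, control = second, first
        allocation[program] = "program"
        allocation[control] = "control"
    return allocation

# Hypothetical colleges already matched on baseline smokeless tobacco use
pairs = [("college_A", "college_B"), ("college_C", "college_D")]
allocation = randomize_matched_pairs(pairs, seed=7)
```

Outcome data would still be collected from individual athletes, so the analysis must account for the clustering, as the chapter discusses.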
122– ●–EVIDENCE-BASED PUBLIC HEALTH PRACTICE
ENSURING BASELINE EQUIVALENCE: WHAT EVIDENCE-BASED PUBLIC HEALTH PRACTICE SHOULD WATCH FOR
When reviewing articles, be certain that researchers provide information as
to whether baseline characteristics are balanced among clusters and individuals. The evaluators in Example 4.4 sought to achieve balance (i.e., equivalence) among universities (the clusters) by including only public universities
and junior colleges in California. They aimed for equivalence in smoking
levels among students within each university by pairing up universities in
terms of their students’ smoking levels and randomly assigning pair members to the experimental or the control group.
In addition to descriptive information on methods used to ensure equivalence, look for proof that the process worked and that, after it was over, the
groups were indeed equivalent:
Of 273 children with asthma in this cohort, 42.1% were female, 41.7%
were African-American, and the average age was 8.2 years. The baseline
characteristics for Program and non-Program groups were quite similar
in terms of demographics, enrollment, and asthma comorbidity.
Compared with the Program group, the non-Program group had a significantly higher percentage of females and “other race” children, but
significantly less Managed Care Organization enrollment and less allergy
comorbidity.
Despite all efforts, chance may dictate that the two groups differ on
important variables at baseline. Bad luck! Statistical methods may be used
to “correct” for these differences, but it is usually better to anticipate the
problem.
Improving on Chance
Small to moderate-sized RCTs can gain power to detect a difference
between experimental and control programs (assuming one is actually present) if special randomization procedures are used to balance the numbers of
participants in each (blocked randomization) and in the distribution of
baseline variables that might influence the outcomes (stratified blocked
randomization).
Why are special procedures necessary if random assignment is supposed to take care of the number of people in each group or the proportion of people in each with certain characteristics? The answer is that, by
chance, one group may end up being larger than the other or differing in
age, gender, and so on. Good news: This happens less frequently in large
studies. Bad news: The problem of unequal distribution of variables
becomes even more complicated when groups or clusters of people (e.g.,
schools, families) rather than individuals are assigned. In this case, the
evaluator has little control over the individuals within each cluster, and the
number of clusters (over which he or she does have control) is usually
relatively small (e.g., five schools, 10 clinics). Some form of constraint such
as stratification is almost always recommended in RCTs in which allocation
is done by cluster.
Two commonly used methods for ensuring equal group sizes and balanced variables are blocked randomization and stratified blocked randomization as described in Table 4.3.
Table 4.3 Enhancing Chance: Blocked and Stratified Blocked Randomization

Blocking, or Balancing the Number of Participants in Each Group. Randomization is done in blocks of predetermined size. For example, if the block's size is 6, randomization proceeds normally within each block until three people have been randomized to one group, after which participants are automatically assigned to the other group until the block of 6 is completed. This means that in a study of 30 participants, 15 will be assigned to each group, and in a study of 33, the disproportion can be no greater than 18:15.

Stratifying, or Balancing Important Predictor (Independent) Variables. Stratification means dividing participants into segments. For example, participants can be divided into differing age groups (the strata), genders, or educational levels. In a study of a program to improve knowledge of how to prevent infection from HIV/AIDS, having access to reliable transportation to attend education classes is a strong predictor of outcome. It is probably a good idea to have similar numbers of people who have transportation (determined at baseline) assigned to each group. This can be done by dividing the study sample at baseline into participants with or without transportation (stratification by access to transportation) and then carrying out a blocked randomization procedure within each of these two strata.
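The two procedures in Table 4.3 can be sketched in code. The fragment below (Python; the function names are hypothetical, and the block size of 6 follows the table's example) shuffles each block so that exactly half of its slots go to each group, then applies the same procedure separately within each stratum:

```python
import random

def blocked_randomization(n, block_size=6, seed=None):
    """Assign n participants in shuffled blocks containing equal
    numbers of 'experimental' and 'control' slots."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n:
        block = ["experimental", "control"] * (block_size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n]

def stratified_blocked_randomization(strata_sizes, block_size=6, seed=None):
    """Run blocked randomization separately within each stratum
    (e.g., participants with and without reliable transportation)."""
    rng = random.Random(seed)
    return {
        stratum: blocked_randomization(size, block_size, rng.random())
        for stratum, size in strata_sizes.items()
    }

groups = blocked_randomization(30, block_size=6, seed=1)   # 15 per group
strata = stratified_blocked_randomization(
    {"has_transport": 12, "no_transport": 12}, seed=2)
```

With 33 participants and blocks of 6, only the final, incomplete block can be unbalanced, so the split can be no worse than 18:15, as the table states.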
Example 4.5 illustrates how these techniques have been applied in program evaluations.
Example 4.5 Enhancing Chance by Using Special Randomization Procedures
1. We randomly allocated families to control or intervention groups using a computer program
sequence generated by our statistician, blocked after every eight allocations. We aimed to do
secondary analyses within the intervention group, albeit with modest power, on the basis of the
number of nurse visits. Therefore, to increase the numbers in the intervention group, toward the
end of recruitment, we randomly allocated families using a 5-to-3 ratio (5 intervention families to
3 controls). Randomization was stratified by the age of the index child—i.e., younger than 4 years
and 4 to 12 years—since evidence exists indicating that preschool children are at increased risk
for recurrence of physical abuse and neglect. Group assignment was placed in numbered
sequential sealed envelopes.
2. Group allocation was based on block randomization. A sequential list of case numbers was
matched to group allocations in blocks of ten by randomly drawing five cards labeled “control”
and five cards labeled “treatment” from an envelope. This procedure was repeated for each block
of ten sequential case numbers. The list of case numbers and group allocation was held by a
researcher not involved in recruiting or interviewing inmates. The trial nurses responsible for
assessing, recruiting, and interviewing inmates had no access to these lists. Once an inmate had
been recruited and interviewed, the study nurse contacted the Central Randomization System via
a mobile telephone to ascertain the inmate’s group allocation.
3. After the baseline period, patients meeting the inclusion criteria were randomly stratified by
center (block size 12 not known to trial centers) in a 2:1:1 ratio (experimental program, an
alternative program, waiting list) using a centralized telephone randomization procedure (random
list generated with the Sample Software, version 8.0, Bronx, New York).
Blinding
In some randomized studies, the participants and investigators do not
know which participants are in the experimental and control groups: This is
the double-blind experiment. When participants do not know, but investigators do, this is a single-blind (blinded) trial. Participants, people responsible for
program implementation or assessing program outcomes, and statistical
analysts are all candidates for being blinded.
Experts in clinical trials maintain that blinding is as important as randomization in ensuring valid study results. Randomization, they say, eliminates
confounding variables, or confounders, before the program is implemented—at baseline—but it cannot do away with confounding variables that
occur as the study progresses. A confounding variable is an extraneous variable in a statistical or research model that affects the dependent variables but
has either not been considered or not been controlled for. For example, age,
educational level, and motivation may be confounders in a study that involves
adherence to a complicated intervention.
Confounders can lead to a false conclusion that the dependent variables
are in a causal relationship with the independent or predictor variables. For
instance, suppose research shows that drinking coffee (independent or predictor variable) is associated with heart attacks (dependent variable). One
possibility is that drinking coffee causes heart attacks. Another is that having
heart attacks causes people to drink more coffee. A third explanation is that
some other confounding factor, such as smoking, is responsible for heart
attacks and is also associated with drinking coffee.
Confounding during the course of a study can occur if participants get
extra attention or the control group catches on to the experiment. The extra
attention or changes in the control group’s perceptions may alter the outcomes of a study. One method of mitigating and understanding the biases
that may arise in unblinded studies is to standardize all program activities and
to monitor the extent to which the program has been implemented as planned.
Pay special attention to the biases that may have occurred in randomized
controlled studies without blinding. Expect the evaluator to report on how the
program’s implementation was monitored and the extent to which any deviations from standard program procedures may have affected the outcomes.
Example 4.6 contains examples of blinding used in RCTs.
Example 4.6 Blinding in RCTs
1. The Researcher Is Blinded
Seventy-five opaque envelopes were produced for the initial randomization and lodged with an
independent staff member. Each contained a slip of paper with the word conventional, booklet, or
control (25 each). The randomization was performed by this staff member selecting an envelope for
each participant immediately after the initial assessment meeting with parents. For the re-randomization
of the control crossover group, this process was repeated with a second batch of 26 envelopes, half
each with the word conventional or booklet. The researcher conducting the study was therefore blind to
the nature of the treatment allocated until after the posttreatment assessment. Following that point, both
participant and researcher were aware of the treatment group to which they had been randomized.
2. Patients and Researchers Are Blinded
This study was a randomized, multicenter trial comparing acupuncture, sham acupuncture, and a
no-acupuncture waiting-list condition. The additional no-acupuncture waiting list control was
included because sham acupuncture cannot be substituted for a physiologically inert placebo.
Patients in the acupuncture groups were blinded to which treatment they received. Analysis of
headache diaries was performed by two blinded evaluators. The study duration per patient was
28 weeks: 4 weeks before randomization, the baseline; 8 weeks of treatment; and 16 weeks of
follow-up. Patients allocated to the waiting list received true acupuncture after 12 weeks and were
also followed up for 24 weeks after randomization (to investigate whether changes were similar to
those in patients receiving immediate acupuncture).
RCTs are generally expensive, time-consuming, and tend to address very
specific research questions. They should probably be saved for relatively
mature programs and practices, that is, those that previous research suggests
are likely to be effective. Previous research includes large pilot studies and
other randomized trials.
Despite an important advantage over other types of research designs
(specifically, their ability to establish that Program A is likely to have caused
Outcome A), RCTs’ requirement for control sometimes gives them a bad
name. Some researchers and practitioners express concern over the fairness
of excluding certain groups and individuals from participation or from receiving the experimental program. Others question the idea of regarding humans
as fit subjects for an experiment and of obtaining good information from
quantitative or statistically oriented research. To some extent, these are personal or ethical concerns and are not inherent weaknesses of the RCT design
itself. Nevertheless, it is certainly reasonable to expect researchers to explain
their choice of design in ethical as well as methodological terms. In Example 4.7,
the evaluators of a cohort study of a program that provides respite care for
homeless patients defend their choice of design.
Example 4.7 Statement by the Evaluators of a Study of Respite Care for Homeless
Patients Regarding the Ethics of Their RCT (Buchanan, Doblin, Sai, &
Garcia, 2006)
Finally, a randomized control trial is needed. Although the available demographic data, clinical
variables, and baseline utilization data were similar in our respite care and usual care study groups,
it is possible that unmeasured variables, including differential rates of substance use or psychiatric
illness, may have confounded our results. Some might argue that a randomized trial would be
unethical, given the obvious humanitarian virtues of respite care. But a randomized trial would be no
less ethical than the current status quo in the United States, where respite care is available only to
some, not all, homeless people. Now is the time for such a trial, given the results of the present study,
the financial distress of many U.S. hospitals, and the unmet needs of our country’s homeless people.
The evaluators understand that some people believe that if you have an
intervention or program that is perceived to be humanitarian (e.g., respite
care), you should not conduct an experiment in which some people are necessarily denied services. The investigators counter by arguing that some homeless people do not have access to respite care anyway. The evaluators also point
out that evaluation research that takes the form of a randomized trial can help
clarify the results of their study by providing information on unmeasured factors such as differential rates of substance use or psychiatric illness.
Quasi-Experimental Research Designs
Quasi-experimental research is characterized by nonrandomized assignment to groups or by conducting a series of measures over time on one or
more groups.
Nonrandomized Controlled Trials: Concurrent Controls
Nonrandomized controlled trials are a type of quasi-experimental
design. In fact, quasi-experiment is often synonymous with nonrandomized controlled trial and is defined as a design in which one group receives the
program and one does not, the assignment of participants to groups is not
controlled by the researcher, and assignment is not random.
Quasi-experimental, nonrandomized controlled trials rely on participants who volunteer to join the study, are geographically close to the study
site, or conveniently turn up (e.g., at a clinic or a school) while the study is
being conducted. As a result, people or groups in a quasi-experiment may
self-select, and the evaluation findings may not be unbiased because they are
dependent upon participant choice rather than chance.
Quasi-experimental researchers use a variety of methods to ensure that
the participating groups are as similar to one another as possible (equivalent)
at baseline or before “treatment.” Among the strategies used to ensure
equivalence is matching. Example 4.4 showed the matching approach
applied to the randomized trial of a smokeless tobacco program for college
athletes.
Matching requires selecting pairs of participants or clusters of individuals
who are comparable to one another on important confounding variables. For
example, suppose a researcher was interested in comparing the acuity of
vision among smokers and nonsmokers. One method of helping to ensure
that the two groups are balanced on important confounders requires that, for
every smoker, there is a nonsmoker of the same age, sex, and medical history.
Matching can effectively prevent confounding by important factors such
as age and sex for individuals. The strategy’s implementation can be relatively
expensive, however, because finding a match for each study participant is
sometimes difficult and often time-consuming.
Another technique for allocating participants to study groups in quasi-experiments is to assign each potential participant a number and use
an alternating sequence in which every other individual (1, 3, 5, etc.) is
assigned to the experimental group and the alternate participants (2, 4, 6,
etc.) are assigned to the control. A different option is to assign groups in
order of appearance; for example, patients who attend the clinic on Monday,
Wednesday, and Friday are in the experimental group, and those attending on
Tuesday, Thursday, and Saturday are assigned to the control. To prevent certain types of patients (e.g., those who can only come on a certain day) from
automatically being in one or the other of the groups, the procedure for
assignment can be reversed after some number of days or weeks.
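A minimal sketch of this alternating procedure, including the periodic reversal just described (Python; the function name and reversal interval are hypothetical):

```python
def alternating_assignment(n, reverse_after=None):
    """Systematic (nonrandom) allocation: odd-numbered entrants go to the
    experimental group and even-numbered entrants to the control group,
    with the rule optionally reversed after every `reverse_after` entrants."""
    groups = []
    for i in range(1, n + 1):
        odd_is_experimental = True
        if reverse_after is not None and ((i - 1) // reverse_after) % 2 == 1:
            odd_is_experimental = False
        is_odd = (i % 2 == 1)
        groups.append("experimental" if is_odd == odd_is_experimental
                      else "control")
    return groups
```

Because entrants can anticipate the sequence, this scheme is not equivalent to randomization; it merely illustrates the allocation rules the text describes.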
Illustrations of nonrandomized, quasi-experimental designs with concurrent groups are given in Example 4.8.
Example 4.8 Quasi-Experimental Design: Concurrent Groups
1. Reducing Injuries Among Teen Agricultural Workers (Reed & Kidd, 2004)
Objective. To test an agricultural safety curriculum [Agricultural Disability Awareness and Risk
Education (AgDARE)] for use in high school agriculture classes.
Assessment for Eligibility. A total of 21 schools (1,138 agriculture students) from Kentucky, Iowa,
and Mississippi participated in the program.
Research Design. Schools in each state were grouped geographically to improve homogeneity in
agricultural commodities and production techniques and then assigned randomly to either one of
two intervention groups (A or B) or the control group. Fourteen schools were assigned to the
intervention arms, and seven schools were assigned to the control group.
Findings. Students who participated in AgDARE scored significantly higher in farm safety attitude
and intent to change work behavior than the control group. School and public health nurses,
working together with agriculture teachers, may make an effective team in reducing injuries
among teen agricultural workers.
2. Contraceptive Practices Among Rural Vietnamese Men (Ha, Jayasuriya, & Owen, 2005)
Objective. To test a social-cognitive intervention to influence contraceptive practices among men
living in rural communes in Vietnam.
Assessment for Eligibility. There were 651 married men from 12 villages in two rural communes
(An Hong and Quoc Tuan) in the An Hai district of Hai Phong province in Vietnam. Interviewers
visited each household in the selected villages and sought all married men aged 19–45 years
who had lived with their wives in the same house during the three months prior to the study. The
inclusion criteria were as follows: the wife was currently not pregnant, the couple did not plan to
have a child in the next six months, they currently did not use condoms consistently for family
planning, and the wives currently did not use the pill consistently for family planning.
Evaluation Research Design. Villages were chosen as the primary unit for intervention. From each
of the two communes, three villages were chosen for intervention and three as controls. The
intervention villages were separated from control villages by a distance of 2–3 km. Participants in
both study groups were assessed, using interviewer-based questionnaires, prior to (baseline) and
following the intervention (posttest).
Findings. There were 651 eligible married men in the 12 villages chosen. A significant positive
movement in men’s stage of readiness for IUD use by their wife occurred in the intervention
group. There were no significant changes in the control group. Compared to the control
group, the intervention group showed higher pros, lower cons, and higher self-efficacy for
IUD use by their wife as a contraceptive method. Interventions based on social-cognitive
theory can increase men’s involvement in IUD use in rural Vietnam and should assist in
reducing future rates of unwanted pregnancy.
Strong quasi-experimental designs have many desirable features. They
can provide information about programs when it is inappropriate or too
late to randomize participants. Another desirable characteristic of quasi-experiments is that, when compared to RCTs, their settings and participants
may more accurately reflect the messiness of the real world. An RCT requires
strict control over the environment, and to get that control the evaluator has
to be extremely stringent with respect to the research question being posed
and who is included and excluded from study participation. As a result, RCT
findings may apply to a relatively small population in constrained settings.
Nonrandomized designs are sometimes chosen over randomized ones in
the mistaken belief that they are more ethical than randomized trials. The
idea behind the ethical challenge is that, if the evaluation researcher suspects
that Program A is better than Program B, then how (in ethical terms) can he
or she allocate Program B to innocent participants? In fact, evaluations are
only ethical if they are designed well enough to have a strong likelihood of
producing an accurate answer about program effectiveness. There are cases
in which programs that were presumed effective turned out not to be so after
all. We have to assume that the evaluator has no evidence that Program A is
better than Program B to start with because, if he or she had proof, then the
evaluation would be unnecessary.
Some researchers and practitioners also think that quasi-experiments are
less costly than RCTs, but this has never been proven. Poor studies, whether
RCTs or quasi-experiments, are costly when they result in misleading or
incorrect information, which may delay or even prevent participants from
getting needed services or education.
Good quasi-experiments are difficult to plan and implement and require
the highest level of research expertise. Many borrow techniques from RCTs,
including blinding. Many others use sophisticated statistical methods to
enhance confidence in the findings.
The most serious potential flaw in quasi-experimental designs without
random assignment is that the groups in the experimental and control
groups may differ from one another at baseline so that the program cannot
have a fair trial. Therefore, in evaluating quasi-experiments, it is absolutely
crucial to find confirmation (usually done statistically) that either no difference in groups existed to begin with or the appropriate statistical methods
were used to control for the differences.
Time-Series Designs
Time-series designs are longitudinal studies that enable the
researcher to monitor change from one time to the next. They are sometimes
called repeated measures analyses. Debate exists over whether time-series
designs are research or analytic designs.
In a simple self-controlled design (also called pretest-posttest
design), each participant is measured on some important program variable
and serves as his or her own control. Participants are usually measured twice
(at baseline and after program participation), but they may be measured
multiple times afterward as well (see Example 4.9).
Example 4.9 Pilot Test of a Cognitive-Behavioral Program for Women With Multiple
Sclerosis (Sinclair & Scroggie, 2005)
Objective. The purpose of this quasi-experimental study was to evaluate the effectiveness of a
cognitive-behavioral intervention for women with multiple sclerosis (MS).
Assessment for Eligibility. Thirty-seven adult women with MS participated in a group-based program
titled “Beyond MS,” which was led by master’s-prepared psychiatric nurses.
Research Design. Perceived health competence, coping behaviors, psychological well-being, quality
of life, and fatigue were measured at four time periods: 5 weeks before the beginning of the
intervention, immediately before the program, at the end of the 5-week program, and at a 6-month
follow-up.
Findings. There were significant improvements in the participants’ perceived health competence,
indices of adaptive and maladaptive coping, and most measures of psychological well-being from
pretest to posttest. The positive changes brought about by this relatively brief program were
maintained during the 6-month follow-up period.
Pretest-posttest designs have many disadvantages from an evidence-based practice perspective. Participants may become excited about taking
part in an experiment, and this excitement may help motivate performance;
without a comparison group, you cannot control for the excitement. Also,
between the pretest and the posttest, participants may mature physically,
emotionally, and intellectually, affecting the program’s outcomes. Finally, self-controlled evaluations may be affected by historical events, including changes
in program administration and policy.
Because of their limitations, self-controlled time-series designs are
not considered experimental designs (some researchers call them pre-experimental rather than quasi-experimental), and they are only appropriate for pilot studies or preliminary feasibility studies. Pretest-posttest designs
are not useful for evidence of effectiveness, and they are not meant to be.
Historical Controls
Some researchers make up for the lack of a readily available control
group by using a historical control. With traditional historical controls,
investigators compare outcomes among participants who receive a new program with outcomes among a previous group of participants who received
the standard program. An illustration of the use of historical controls is given
in Example 4.10.
Time-series designs can also be improved by adding more measurements
for a single group of participants before and after the program (in a single
time-series design) and adding a control (in a multiple time-series design).
Example 4.10 Historical Controls: Use and Impact of an eHealth System by
Low-Income Women With Breast Cancer (Gustafson et al., 2005)
Objective. To examine the feasibility of reaching underserved women with breast cancer and
determine how they use the system and what impact it had on them.
Assessment for Eligibility. Participants included women recently diagnosed with breast cancer whose
income was at or below 250% of the poverty level and were living in rural Wisconsin (n = 144; all
Caucasian) or Detroit (n = 85; all African American).
Evaluation Research Design. Historical Control: A comparison group of patients (n = 51) with similar
demographics was drawn from a separate recently completed randomized clinical trial.
Findings. When all low-income women from this study are combined and compared with a low-income control group from another study, the Comprehensive Health Enhancement Support System
(CHESS) group was superior to that control group in 4 of 8 outcome variables at both statistically
and practically significant levels (social support, negative emotions, participation in health care, and
information competence). We conclude that an eHealth system like CHESS will have a positive
impact on low-income women with breast cancer.
Interrupted or Single Time-Series Designs
The interrupted or single time-series design without a control group
(hence, the “single”) involves repeated measurement of a variable (e.g.,
reported crime) over time, encompassing periods both before and after
implementation of a program. The goal is to evaluate whether the program
has interrupted or changed a pattern established before the program’s
implementation. For instance, an evaluation using an interrupted time-series design may collect quarterly arrest rates for drug-related offenses in a
given community for 2 years before and 2 years following the implementation of a drug enforcement task force. The data analysis would focus on
changes in patterns before and after the introduction of the program. In a
multiple time-series design, multiple interrupted observations are collected
before and after a program is launched. The “multiple” means that the observations are collected in two or more groups.
Time-series designs are complex research designs requiring many observations of outcomes and, in the case of multiple time-series designs, the
participation of many individuals and even communities. Their complex
analysis has led some researchers to take the position that they are really data
analytic strategies.
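The "changes in patterns" such an analysis looks for are commonly estimated with segmented regression, which fits a level and trend before and after the interruption. The sketch below (Python with NumPy; the quarterly series and coefficient values are invented for illustration) shows the standard design matrix for the drug-enforcement example above:

```python
import numpy as np

# 8 quarters before and 8 after a hypothetical program launch
t = np.arange(16, dtype=float)
after = (t >= 8).astype(float)            # 1 once the program starts
t_after = np.where(t >= 8, t - 8, 0.0)    # time elapsed since the start

# Invented arrest-rate series with a drop in level and a change in trend
y = 10.0 + 0.5 * t - 3.0 * after - 0.2 * t_after

# Columns: baseline level, pre-program trend, level change, trend change
X = np.column_stack([np.ones_like(t), t, after, t_after])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef recovers [10.0, 0.5, -3.0, -0.2] for this noise-free series
```

A real analysis would also need to address autocorrelation between successive quarters, which is part of why some researchers regard these designs as data analytic strategies.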
Observational Designs
In observational designs, researchers conduct studies with existing
groups of people or use existing databases. They do not intervene, which is
to say, they do not introduce programs. Among the observational designs
that are used in evaluation research are cohorts, case controls, and cross-sectional surveys.
Cohort Designs
A cohort is a group of people who have something in common and who
remain part of a study group over an extended period of time. In public
health research, cohort studies are used to describe and predict the risk factors for a disease and the disease’s cause, incidence, natural history, and
prognosis. They tend to be extremely large studies.
Cohort studies may be prospective or retrospective. With a prospective design, the direction of inquiry is forward in time; with a retrospective
design, the direction is backward in time.
Chapter 4. Research Design, Validity, and Best Available Evidence– ●–133
Example 4.11 contains abstracts of two cohort studies. The first is an
abstract of the National Treatment Improvement Evaluation Survey, a longitudinal study (a prospective study that takes place over several years) of a
national sample of substance abuse treatment programs that had received
federal treatment improvement demonstration grants in 1990–1991 (the
cohort). Treatment programs and their clients across 16 states completed
highly structured lay-administered interviews between July 1993 and
November 1995. Administrative interviews elicited information from senior
program administrators that focused on program finances and staff configuration, including the primary measure of interest: whether the program had staff designated as case managers.
The second abstract in Example 4.11 is of a study to examine detection
rates of depression in primary care. The investigators used data collected
from a prospective cohort study of 1,293 consecutive general practice attendees in the United Kingdom.
Example 4.11 Two Cohort Studies
1. Prospective Cohort Design: Case Managers as Facilitators of Medical and Psychosocial Service
Delivery in Addiction Treatment Programs (Friedmann, Hendrickson, Gerstein, & Zhang, 2004)
Objective. To examine whether having designated case management staff facilitates delivery of
comprehensive medical and psychosocial services in substance abuse treatment programs.
Assessment for Eligibility. Clients from long-term residential, outpatient, and methadone
treatment modalities.
Research Design. A prospective cohort study of 2,829 clients admitted to selected substance
abuse treatment programs.
Findings. Availability of designated case managers increased client-level receipt of only two of
nine services, and exerted no effect on service comprehensiveness compared to programs that
did not have designated case managers. These findings do not support the common practice of
designating case management staff as a means to facilitate comprehensive services delivery in
addiction treatment programs.
2. A Prospective Cohort Design: Recognition of Depression in Primary Care: Does it Affect Outcome?
The PREDICT-NL Study (Kamphuis et al., 2011)
Background. Detection rates of depression in primary care are < 50%. Studies showed similar
outcome after 12 months for recognized and unrecognized depression. Outcome beyond
12 months is less well studied.
Objective. We investigated recognition of depression in primary care and its relation to outcome
after 6, 12 and 39 months.
Methods. Data were used from a prospective cohort study of 1,293 consecutive general practice
attendees (PREDICT-NL), who were followed up after 6 (n = 1236), 12 (n = 1179) and 39
(n = 752) months. We measured the presence and severity of major depressive disorder (MDD)
according to DSM-IV criteria and Patient Health Questionnaire 9 (PHQ-9) and mental function
with Short Form 12 (SF-12). Recognition of depression was assessed using international
classification of primary care codes (P03 and P76) and Anatomical Therapeutic Chemical
(N06A) codes from the GP records (6 months before/after baseline).
Results. At baseline, 170 (13%) of the participants had MDD, of whom 36% were recognized by
their GP. The relative risk of being depressed after 39 months was 1.35 [95% confidence interval
(CI) 0.7–2.7] for participants with recognized depression compared to unrecognized depression. At
baseline, participants with recognized depression had more depressive symptoms (mean difference
PHQ-9 2.7, 95% CI 1.6–3.9) and worse mental function (mean difference mental component
summary −3.8, 95% CI −7.8 to 0.2) than unrecognized depressed participants. After 12 and
39 months, mean scores for both groups did not differ but were worse than those without depression.
Conclusions. A minority of patients with MDD is recognized in primary care. Those who were
unrecognized had comparable outcome after 12 and 39 months as participants with recognized
depression.
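Results such as the relative risk of 1.35 (95% CI 0.7–2.7) in the second abstract are computed from group event rates. The sketch below shows the standard large-sample calculation; the counts are hypothetical (not the study's actual data), chosen only to produce a similar estimate.

```python
import math

# A minimal sketch of a relative risk with a 95% confidence interval.
# a_events/a_total: exposed group (e.g., recognized depression);
# b_events/b_total: comparison group (e.g., unrecognized depression).
def relative_risk(a_events, a_total, b_events, b_total):
    rr = (a_events / a_total) / (b_events / b_total)
    # Standard large-sample standard error of log(RR)
    se = math.sqrt(1 / a_events - 1 / a_total + 1 / b_events - 1 / b_total)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Hypothetical counts chosen to give RR = 1.35
rr, lo, hi = relative_risk(12, 40, 20, 90)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

When the interval includes 1.0, as it does here and in the PREDICT-NL result, the data are compatible with no difference in risk between the groups.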
High-quality prospective or longitudinal studies are expensive to conduct, especially if the researcher is concerned with outcomes that are relatively
rare or hard to predict. Studying rare and unpredictable outcomes requires
large samples and numerous measures. Also, researchers who do prospective
cohort studies have to be on guard against loss of subjects over time, or attrition (also called loss to follow-up). For instance, longitudinal studies of
children are often beset by attrition because, over time, children lose interest,
move far away, change their names, or are otherwise unavailable. If a large
number of people drop out of a study, the sample that remains may be very
different from the one that was originally enrolled. The remaining sample may
be more motivated or less mobile than those who left, for example, and these
factors may be related in unpredictable ways to any observed outcomes.
When reviewing prospective cohort studies, make sure that the researchers address how they handled loss to follow-up or attrition. Ask these questions: How large a problem was attrition? Were losses to follow-up handled in
the analysis? Were the study’s findings affected by the losses?
Because of the difficulties and expense of implementing prospective
cohort designs, many cohort designs reported in the literature tend to be
retrospective. Retrospective cohort designs use existing databases to identify
cohorts; they may do an analysis of the data that already exist in the database
or collect new data. A sample retrospective cohort design that identifies the
cohort and collects new data is illustrated in Example 4.12.
Example 4.12 Retrospective Cohort Design: Tall Stature in Adolescence and Depression
in Later Life (Bruinsma et al., 2006)
Objective. To examine the long-term psychosocial outcomes for women assessed or treated during
adolescence for tall stature.
Assessment for Eligibility. Women assessed or treated for tall stature identified from the records of
Australian pediatricians were eligible to participate.
Research Design. Retrospective cohort study in which women treated for tall stature were traced
using electoral rolls and telephone listings. Once found, the women were contacted by mail and
invited to complete a postal questionnaire and computer assisted telephone interview. Psychosocial
outcomes were measured using the depression, mania, and eating disorders modules of the
Composite International Diagnostic Interview (CIDI), the SF-36, and an index of social support.
Findings. There was no significant difference between treated and untreated women in the prevalence
of 12 month or lifetime major depression, eating disorders, or scores on the SF-36 mental health
summary scale or the index of social support. However, compared with the findings of population-based studies, the prevalence of major depression in both treated and untreated tall girls was high.
Retrospective cohort designs have the same strengths as prospective
designs. They can establish that a predictor variable (e.g., being in a treatment program) precedes an outcome (e.g., depression). Also, because data
are collected before the outcomes being assessed are known with certainty,
the measurement of variables that might predict the outcome (e.g., being in
a program) cannot be biased by prior knowledge of which people are likely
to develop a problem (e.g., depression).
Case-Control Designs
Case-control designs are generally retrospective. They are used to
explain why a phenomenon currently exists by comparing the histories of
two different groups, one of which is involved in the phenomenon. For
example, a case-control design might be used to help understand the social,
demographic, and attitudinal variables that distinguish people who, at the
present time, have been identified with frequent headaches from those who
do not currently have frequent headaches. The researchers in a case-control
study like this want to know which factors (e.g., dietary habits, social
arrangements, education, income, quality of life) distinguish one group
from the other.
The cases in case-control designs are individuals who have been chosen
on the basis of some characteristic or outcome (e.g., frequent headaches).
The controls are individuals without the characteristic or outcome. The histories of cases and controls are analyzed and compared in an attempt to
uncover one or more characteristics that are present in the cases and not in
the controls.
How can researchers avoid having one group decidedly different from
the other (e.g., healthier, smarter)? Some methods include randomly selecting the controls, using several controls, and carefully matching controls and
cases on important variables.
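Matching controls to cases on important variables, as described above, can be sketched as follows. The records are hypothetical, and the greedy nearest-age match shown here is only one simple strategy among many.

```python
# A minimal sketch of individual 1:1 matching in a case-control design.
# Each case is paired with the unused control of the same sex whose age
# is closest (a simple greedy match on hypothetical records).

cases = [("F", 34), ("M", 51), ("F", 47)]
controls = [("F", 30), ("M", 52), ("F", 46), ("M", 40), ("F", 35)]

used = set()
pairs = []
for sex, age in cases:
    # Candidates: controls of the same sex that have not yet been matched
    candidates = [i for i, (s, a) in enumerate(controls) if s == sex and i not in used]
    best = min(candidates, key=lambda i: abs(controls[i][1] - age))
    used.add(best)
    pairs.append(((sex, age), controls[best]))

for case, control in pairs:
    print(case, "matched to", control)
```

Matching on sex and age, as here, removes those two variables as explanations for case-control differences, but any unmatched characteristic can still confound the comparison.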
Example 4.13 uses a sophisticated sampling strategy to examine the role of alcohol use in boating deaths.
Example 4.13 Alcohol Use and Risk of Dying While Boating (Smith et al., 2001)
Objective. To determine the association of alcohol use with passengers’ and operators’ estimated
relative risk of dying while boating.
Assessment for Eligibility. A study of recreational boating deaths among persons aged 18 years or
older from 1990–1998 in Maryland and North Carolina (n = 221) provided the cases, which were
compared with control interviews obtained from a multistage probability sample of boaters in each
state from 1997–1999 (n = 3,943).
In this study, a complex random sampling scheme was employed to
minimize bias among control subjects and maximize their comparability with
cases (e.g., deaths took place in the same location).
Epidemiologists often use case-control designs to provide insight into
the causes and consequences of disease and other health problems.
Reviewers of these studies should be on the lookout for certain methodological problems, however. First, cases and controls are often chosen from
two separate populations. Because of this, systematic differences (e.g., motivation, cultural beliefs) may exist between or among the groups that are
difficult to anticipate, measure, or control, and these differences may influence the study’s results.
Another potential problem with case-control designs is that the data
often come from people’s recall of events, such as asking women to discuss
the history of their physical activity or asking boaters about their drinking
habits. Memory is often unreliable, so a study that depends on it may yield misleading information.
Cross-Sectional Designs
Cross-sectional designs result in a portrait of one or many groups at
one period of time. They are sometimes called descriptive or pre-experimental
designs. Following are three illustrative uses of cross-sectional designs.
The most common use of cross-sectional designs is to describe the study
sample. The tabular description of results is sometimes called Table 1
because it is often the first table in a study report or article. Example 4.14
shows an example Table 1.
Example 4.14 Sociodemographic Characteristics, Substance Abuse History, and History
of Violence: Low-Income Women Seeking Emergency Care in the Bronx,
NY, 2001–2003 (El-Bassel, Gilbert, Vinocur, Chang, & Wu, 2011)
                                            Total      Participants Not     Participants
                                            (N = 241)  Meeting PTSD         Meeting PTSD
                                                       Criteria (n = 169)   Criteria (n = 72)
Sociodemographic characteristics
  Age, y, mean (SD)                         33 (10)    33 (10)              33 (10)
  Race/ethnicity, no. (%)
    Latina                                  119 (49)   81 (48)              38 (53)
    African American                        105 (44)   75 (44)              30 (42)
    Other                                   17 (7)     13 (8)               4 (6)
  High school diploma, no. (%)              127 (53)   93 (55)              34 (47)
  Employed in past 6 mo., no. (%)           111 (46)   86 (51)              25* (35)
  Homeless in past 6 mo., no. (%)           38 (16)    23 (14)              15 (21)
Substance abuse in past 6 mo., no. (%)
  Heavy episode drinking                    57 (24)    30 (18)              27** (38)
  Illicit drug use                          104 (43)   61 (36)              43** (60)
History of violence, no. (%)
  Childhood sexual abuse (before age 16 y)  99 (41)    50 (30)              49** (68)
  Lifetime sexual IPV                       167 (69)   107 (63)             60** (83)
  Lifetime physical or injurious IPV        165 (68)   103 (61)             62** (86)

NOTE: IPV = intimate partner violence; PTSD = posttraumatic stress disorder.
*P < 0.05; **P < 0.01
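Each cell in a Table 1 like Example 4.14 is simply a count and a percentage computed within a column's group. A minimal sketch, using a handful of hypothetical participant records:

```python
# A minimal sketch of how a "Table 1" cell like "119 (49)" is computed,
# using hypothetical participant records.

records = [
    {"group": "PTSD", "latina": True},
    {"group": "no_PTSD", "latina": True},
    {"group": "no_PTSD", "latina": False},
    {"group": "PTSD", "latina": False},
    {"group": "no_PTSD", "latina": True},
]

def cell(rows, predicate):
    """Format a table cell as 'count (percent)' within the given rows."""
    n = sum(1 for r in rows if predicate(r))
    return f"{n} ({round(100 * n / len(rows))})"

total = cell(records, lambda r: r["latina"])
by_group = {
    g: cell([r for r in records if r["group"] == g], lambda r: r["latina"])
    for g in ("no_PTSD", "PTSD")
}
print("Latina, no. (%):", total, by_group)
```

Note that the percentages are taken within each column's denominator (N = 241, n = 169, n = 72 in Example 4.14), not over the whole sample.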
The major limitation of cross-sectional studies is that, on their own and
without follow-up, they provide no information on causality; they only provide
information on events at a single, fixed point in time. For example, suppose a
researcher finds that girls have less knowledge of current events than do boys.
The researcher cannot conclude that being female somehow causes less
knowledge of current events. The researcher can only be sure that, in this survey undertaken at this particular time, girls had less knowledge than boys did.
To illustrate this point further, suppose you are doing a literature review
on community-based exercise programs. You are specifically interested in
learning about the relationship between age and exercise. Does exercise
decrease with age? In your search of the literature, you find the report presented in Example 4.15.
Example 4.15 A Report of a Cross-Sectional Survey of Exercise Habits
In March of this year, Researcher A surveyed a sample of 1,500 people between the ages of 30 and
70 to find out about their exercise habits. One of the questions he asked participants was, “How
much do you exercise on a typical day?” Researcher A divided his sample into two groups: People
45 years of age and younger and people 46 years and older. Researcher A’s data analysis revealed
that the amount of daily exercise reported by the two groups differed with the younger group
reporting 15 minutes more exercise on a typical day.
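Researcher A's analysis amounts to comparing the mean of self-reported exercise in two age strata. A minimal sketch with hypothetical survey rows:

```python
# A minimal sketch of Researcher A's comparison, using hypothetical
# (age, minutes of exercise on a typical day) survey rows.

survey = [(32, 40), (44, 35), (45, 45), (50, 20), (61, 25), (68, 15)]

younger = [m for age, m in survey if age <= 45]
older   = [m for age, m in survey if age >= 46]

diff = sum(younger) / len(younger) - sum(older) / len(older)
print(f"younger group reports {diff:.0f} more minutes per day")
# A difference like this is a snapshot of this sample at one time; on its
# own it cannot show that exercise declines with age.
```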
Based on this summary, does amount of exercise decline with age? The
answer is that you cannot get the answer from Researcher A’s report. The
decline seen in a cross-sectional study like this one can actually represent a
decline in exercise with increasing age, or it may reflect the oddities of this
particular sample. The younger people in this study may be especially sports
minded, while the older people may be particularly averse to exercise. As a
reviewer, you need to figure out which of the two explanations is better. One
way you can do this is to search the literature to find out which conclusions
are supported by other studies. Does the literature generally sustain the idea
that amount of exercise always declines with age? After all, in some communities the amount of exercise done by older people may actually increase
because, with retirement or part-time work, older adults may have more time
to exercise than do younger people.
Observational Designs and Controlled Trials:
Compare and Contrast
Observational data can be useful adjuncts to randomized controlled trials
and quasi-experiments. They can assist the researcher in determining
whether effectiveness under controlled conditions translates into effective
treatment in routine settings. Also, some problems simply do not lend themselves to a randomized controlled trial. For instance, when researchers studied the effects of cigarette smoking on health, it was impossible to randomly assign some people to smoke while assigning others to abstain.
The only possible design was an observational one, albeit one that involved
decades of observing hundreds of thousands of people all over the world.
The case for observational studies over RCTs is suggested in a study
reported in the British Medical Journal:
The investigators in the study aimed to determine whether “parachutes
are effective in preventing major trauma related to gravitational challenge.” To find out, they reviewed all the randomized controlled trials
they could find in Medline [PubMed], Web of Science, EMBASE, and the
Cochrane Library databases. They also reviewed appropriate Internet
sites and citation lists. To be included, a study had to discuss the effects
of using a parachute during free fall. The effects were defined as death
or major trauma, defined as an injury severity score > 15.
Despite their diligence and scientific approach to the review, the
investigators were not able to find any randomized controlled trials of
the effectiveness of parachute intervention. They concluded that as with
many interventions intended to prevent ill health, the effectiveness of
parachutes has not been subjected to rigorous evaluation by using randomized controlled trials. The investigators point out that this is a serious problem for hard-line advocates of evidence-based medicine who
are adamantly opposed to the adoption of interventions evaluated by
using only observational data. To resolve the problem, the investigators
recommend that the most radical protagonists of evidence-based medicine organize and participate in a double blind, randomized, placebo
controlled, crossover trial of the parachute. They further conclude that
individuals who insist that all interventions need to be validated by a
randomized controlled trial need to come down to earth with a bump.
THE BOTTOM LINE: INTERNAL AND EXTERNAL VALIDITY
Internal validity refers specifically to whether an experimental program
makes a difference and whether there is sufficient evidence to support the
claim. A study has internal validity when you can confidently say that Program
A causes Outcome A. A study has external validity if it is generalizable
because its results are applicable to other programs, populations, and settings.
Internal Validity Is Threatened
Just as the best-laid plans of mice and men (and women) often go awry,
evaluation research, no matter how well planned, loses something in the
execution. Randomization may not produce equivalent study groups, for
example, or people in one study group may drop out more often than will
people in the other. Factors such as less-than-perfect randomization and
attrition can threaten or compromise an evaluation’s validity. There are at
least eight common threats to internal validity.
1. Selection of participants. This threat occurs when biases result
from the selection or creation of groups that are not equivalent.
Either the random assignment did not work or attempts to match
groups or control for baseline confounders were ineffective. As a result, the groups may differ in important ways: one may be more affected by a given policy, more mature, or more affected by differences in the administration and content of the baseline and postprogram measures. Selection can interact with history, maturation, and instrumentation.
2. History. Unanticipated events occur while the evaluation is in progress, and this history jeopardizes internal validity. A change in policy
or a historical event may affect participants’ behavior while they are
in the program. For instance, the effects of a school-based program
to encourage healthier eating may be affected by a healthy eating
campaign on a popular children’s television show.
3. Maturation. Processes (e.g., physical and emotional growth) inevitably occur within participants as a function of time, threatening validity. Children in a 3-year school-based physical education program
mature physically, for example.
4. Testing. This threat can occur because taking one test has an effect
on the scores of a subsequent test. For instance, after a 3-week program, participants are given a test. They recall their answers on the
pretest, and this influences their responses to the second test. The
influence may be positive (they learn from the test) or negative (they
recall incorrect answers).
5. Instrumentation. Changes in a measuring instrument or changes in
observers or scorers cause an effect that can diminish validity. For
example, Researcher A makes slight changes between the questions
asked at baseline and those asked after the conclusion of the program. Or Researcher B administers the baseline measures, but
Researcher A administers the posttest measures.
6. Statistical regression. This effect operates when participants are
selected on the basis of extreme scores and regress or go back toward
the mean (e.g., average score) of that variable. Only people at great
risk are included in the program, for example. Some of them inevitably regress to the mean or average score. Regression to the mean is a
statistical artifact (i.e., due to some factor or factors outside of the
study).
7. Attrition (dropout) or loss to follow-up. This threat to internal
validity is the differential loss of participants from one or more groups
on a nonrandom basis. For instance, participants in one group drop
out more frequently than do participants in the others or are lost to
follow-up. The resulting two groups, which had similar characteristics
at baseline, no longer do.
8. Expectancy. A bias is caused by the expectations of the evaluator,
the participants, or both. Participants in the experimental group
expect special treatment, for example, while the evaluator expects to
give it to them (and sometimes does). Blinding is one method of
dealing with expectancy. A second is to ensure that a standardized
process is used in delivering the program.
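The statistical regression threat (threat 6 above) can be demonstrated with a short simulation. Each hypothetical participant has a stable true level plus independent measurement noise; selecting the top scorers on the first test guarantees that their retest average falls back toward the overall mean even though nothing was done to them.

```python
import random

# A minimal sketch of regression to the mean, assuming each person's score
# is a stable true level plus independent noise at each measurement.

random.seed(1)
true_levels = [random.gauss(50, 10) for _ in range(10_000)]
test1 = [t + random.gauss(0, 10) for t in true_levels]
test2 = [t + random.gauss(0, 10) for t in true_levels]

# Select the "extreme" group: the top 10% on the first test only.
cutoff = sorted(test1)[-1000]
selected = [i for i, s in enumerate(test1) if s >= cutoff]

m1 = sum(test1[i] for i in selected) / len(selected)
m2 = sum(test2[i] for i in selected) / len(selected)
print(f"selected group: first test {m1:.1f}, retest {m2:.1f}")
# The retest mean falls back toward the population mean of 50 with no
# intervention at all, which is why programs enrolling only extreme
# scorers can appear to "work" spuriously.
```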
External Validity Is Threatened
Threats to external validity are most often the consequence of the
way in which participants or respondents are selected and assigned. For
example, respondents in an experimental situation may answer questions
atypically because they know they are in a special experiment; this is called
the Hawthorne effect. External validity is also threatened whenever
respondents are tested, surveyed, or observed. They may become alert to the
kinds of behaviors that are expected or favored. There are at least four relatively common sources of external invalidity.
1. Interaction effects of selection biases and the experimental
treatment. This threat to external validity occurs when an intervention or program and the participants are a unique mixture, one that
may not be found elsewhere. The threat is most apparent when
groups are not randomly constituted. Suppose a large company volunteers to participate in an experimental program to improve the
quality of employees’ leisure time activities. The characteristics of the
company (some of which, like leadership and priorities, are related to
the fact that it volunteered for the experiment) may interact with the
program so that the two together are unique; the particular blend of
company and program can limit the applicability of the findings.
2. Reactive effects of testing. These biases occur when a baseline
measure interacts with the program, resulting in an effect that will not
generalize. For example, two groups of students participate in an ethics program evaluation. Group 1 is given a test before watching a film,
but Group 2 just watches the film. Group 1 performs better on a posttest because the pretest sensitizes them to the program’s content,
and they pay more attention to the film’s content.
3. Reactive effects of experimental arrangements or the
Hawthorne effect. This threat to external validity can occur because
participants know that they are participating in an experiment. This
threat is caused when people behave uncharacteristically because
they are aware that their circumstances are different. (They are being
observed by cameras in the classroom, for instance, or they have
been chosen for an experiment.)
4. Multiple program interference. This threat results when participants are in other complementary activities or programs that interact.
For example, participants in an experimental mathematics program
are also taking physics class. Both teach differential calculus.
External validity is dependent upon internal validity. Research findings
cannot be generalized to other populations and settings unless we first know
whether these findings are due to the program or to other factors.
Randomized controlled trials with double blinding have the greatest
chance of being internally valid—assuming that their data collection and
analysis are also valid. As soon as the researcher begins to deviate from the
strict rules of an RCT, threats to internal validity begin to appear. Example 4.16
illustrates a sample of the threats to internal and external validity found in
evaluation reports.
Example 4.16 Threats to Internal and External Validity: Reducing Confidence in the
Evidence of the Effectiveness of Four Programs
1. Evaluating a Health Care Program to Get Adolescents to Exercise (Patrick et al., 2006)
An additional concern in interpreting results is the potential impact on our findings of
measurement reactivity in which self-reported behavior is influenced by the measurement process
itself. Repeated assessments of the target behaviors as well as extensive surveys on thoughts and
actions used to change behaviors (not described in this article) could have motivated and even
instructed adolescents in both conditions to change behaviors, and control participants reported
improvements in several diet and physical activity behaviors [reactive effects of testing].
Measurement effects have been demonstrated in studies promoting physical activity through
primary care settings, and this also may occur with diet assessment.
2. Evaluating a Mental Health Intervention for Schoolchildren Exposed to Violence: A Randomized
Controlled Trial (Stein et al., 2003)
The CBITS [Cognitive Behavioral Intervention for Trauma in Schools] intervention was not compared
with a control condition such as general supportive therapy, but rather with a wait-list delayed
intervention. As a consequence, none of the informants (students, parents, or teachers) were
blinded to the treatment condition. It is possible that the lack of blinding [expectancy] may have
contaminated either the intervention or assessments. School staff and parents may have provided
more attention and support to students who were eligible for the program while they were on a
waiting list; alternatively, respondents may have been more likely to report improvement in
symptoms for those students whom they knew had received the intervention.
3. HIV-Risk-Reduction Intervention Among Low-Income Latina Women (Peragallo et al., 2005)
Individuals lost to follow-up (n = 112) differed from those who received at least one session of
the intervention (n = 292) [attrition, generalizability] with respect to age (younger), ethnicity
(Puerto Rican), years in the United States (slightly more years in United States), education
(completed 1 more year), marital status (less likely to be married), insurance source (more likely
to have insurance), and acculturation (more non-Hispanic acculturation) [selection].
4. Tall Stature in Adolescence and Depression in Later Life (Bruinsma et al., 2006)
Another possibility is that the assessment or treatment procedures predisposed women to
depression either because it medicalized the issue of their height or because of the intrusiveness
of the assessment and treatment [reactive effects of testing and of experimental arrangements]
and its effect on adolescent girls. In this study, there was evidence that women who reported a
negative experience of assessment or treatment procedures were significantly more likely to have
a history of depression than women who did not, which is consistent with other studies.
A high-quality research article will always describe threats to its validity,
sometimes called limitations, in the discussion or conclusions section.
The following checklist consists of questions to ask when evaluating a
study’s internal and external validity.
What Evidence-Based Public Health Practice Should Watch For: A
Checklist for Evaluating a Study’s Internal and External Validity
✓ If the research has two or more groups, is information given on the number of people in each group who were eligible to participate?
✓ If the research has two or more groups, is information given on the number in each group who agreed to participate?
✓ If the research has two or more groups, is information given on the number in each group who were assigned to groups?
✓ If the research has two or more groups, is information given on the number in each group who completed all of the program’s activities?
✓ Were reasons given for refusal to participate among participants (including personnel)?
✓ Were reasons given for not completing all program or data collection activities?
✓ Did any historical or political event occur during the course of the study that may have affected its findings?
✓ In long-term studies, was information given on the potential effects on outcomes of physical, intellectual, and emotional changes among participants?
✓ Was information provided on concurrently running programs that might have influenced the outcomes?
✓ Was there reason to believe that taking a preprogram measurement affected participants’ performance on a postprogram measurement? This problem might arise in evaluations of programs that take a few weeks or require only a few sessions.
✓ Was there reason to believe that changes in measures or observers may have affected the outcomes?
✓ Did the researchers provide information on whether observers or people administering the measures (e.g., tests, surveys) were trained and monitored for quality?
✓ If participants were chosen because of special needs, did the researchers discuss how they dealt with regression toward the mean?
✓ Did the researchers provide information on how staff ensured that the program was delivered in a standardized manner?
✓ Were participants or researchers blinded to the intervention? If not, did the researchers provide information on how the outcomes were affected?
THE PROBLEM OF INCOMPARABLE PARTICIPANTS:
STATISTICAL METHODS TO THE RESCUE
Randomization is designed to reduce disparities between experimental and
control groups by balancing them with respect to all characteristics (e.g.,
participants’ age, sex, or motivation) that might affect a study’s outcome.
With effective randomization, the only difference between study groups is
whether or not they are assigned to receive an experimental program. The
idea is that, if discrepancies in outcomes are subsequently found by statistical
comparisons (e.g., the experimental group improves significantly), they can
be attributed to the fact that some people received the experiment while
others did not.
In observational and nonrandomized studies, the researcher cannot
assume that the groups are balanced before they receive (or do not receive)
a program or intervention. In observational studies, for example, measured
participant characteristics are obtained before, during, and after program
participation, and it is often difficult to determine exactly which characteristics are baseline variables. Also, there frequently are characteristics that are unmeasured, inadequately measured, or simply unknown. But if the
participants differ, how can an evaluator who finds a difference
between experimental and control outcomes separate the effects of the
intervention from the effects of differences among study participants? One answer is to adjust for potential confounders during the data analysis phase
using statistical methods such as analysis of covariance and propensity score
methods.
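One common propensity score method is inverse-probability weighting: estimate each participant's probability of receiving the program from baseline characteristics, then weight each group so both resemble the full sample. The sketch below is not from the chapter; it uses simulated, hypothetical data in which older people are more likely to enroll and age also raises the outcome, so the naive comparison is confounded.

```python
import numpy as np

# Simulated observational data (all numbers hypothetical).
# True program effect = 5.0; age confounds the naive comparison.
rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(14, 22, n)
enroll_prob = 1 / (1 + np.exp(-(age - 18)))   # older -> likelier to enroll
group = rng.binomial(1, enroll_prob)
outcome = 2.0 * age + 5.0 * group + rng.normal(0, 1, n)

# Naive comparison: biased upward by the age difference between groups.
naive = outcome[group == 1].mean() - outcome[group == 0].mean()

# Step 1: estimate propensity scores (probability of enrollment given age)
# with a logistic regression fit by Newton-Raphson.
X = np.column_stack([np.ones(n), age])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (group - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hessian, gradient)
ps = 1 / (1 + np.exp(-X @ beta))

# Step 2: inverse-probability weighting removes the age imbalance.
w = np.where(group == 1, 1 / ps, 1 / (1 - ps))
treated = np.sum(w * group * outcome) / np.sum(w * group)
untreated = np.sum(w * (1 - group) * outcome) / np.sum(w * (1 - group))
adjusted = treated - untreated   # close to the true effect of 5.0
```

Here the naive difference in means overstates the program effect, while the weighted estimate recovers it; real applications, of course, require that all important confounders be measured and modeled.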
146– ●–EVIDENCE-BASED PUBLIC HEALTH PRACTICE
Analysis of Covariance
Analysis of covariance (ANCOVA) is a statistical procedure that results in estimates of intervention or program effects adjusted for participants’ background (and potentially confounding) characteristics, or covariates (e.g., age, gender, educational background, severity of illness, type of illness, motivation). The covariates are included explicitly in a statistical model.
Analysis of covariance adjusts for a confounder, say age, by assuming (statistically) that all participants are affected by it in the same way. That is, ANCOVA can answer this question: If you balance the ages of the participants in the experimental and control groups so that age has no influence on one group versus the other, how do the experimental and control groups compare? The ANCOVA removes age as a possible confounder at baseline.
The choice of covariates to include in the analysis comes from the literature, preliminary analysis of study data, and expert opinion on which characteristics of participants might influence their willingness to participate in and
benefit from study inclusion.
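In practice, ANCOVA amounts to fitting a linear model of the outcome on a program indicator plus the covariates, and reading off the program coefficient. This sketch is not from the chapter; it uses simulated, hypothetical data with age as the lone covariate and a true program effect of 5.0.

```python
import numpy as np

# Simulated data (all numbers hypothetical): age raises the outcome and
# is imbalanced between groups, so the unadjusted comparison is confounded.
rng = np.random.default_rng(0)
n = 300
age = rng.uniform(14, 22, n)
# Nonrandom assignment: older people are more likely to join the program.
program = (age + rng.normal(0, 2, n) > 18).astype(float)
outcome = 2.0 * age + 5.0 * program + rng.normal(0, 1, n)  # true effect 5.0

# Unadjusted difference in group means mixes the program effect with age.
unadjusted = outcome[program == 1].mean() - outcome[program == 0].mean()

# ANCOVA: include age as a covariate in the linear model
# outcome ~ intercept + program + age, then read off the program
# coefficient, i.e., the effect of the program with age held constant.
X = np.column_stack([np.ones(n), program, age])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted = coef[1]
```

With additional covariates, each one simply becomes another column of the design matrix; the adjusted program effect is always the coefficient on the program indicator.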
Example 4.17 illustrates the use of ANCOVA in a study protocol or plan
to improve work task performance in young adults with Down syndrome.
Example 4.17 Excerpt From Study Protocol of a Randomised Controlled Trial to
Investigate if a Community-Based Strength Training Programme Improves
Work Task Performance in Young Adults With Down Syndrome
Aim. The aim of this study is to investigate if a student-led community-based progressive resistance
training programme can improve these outcomes in adolescents and young adults with Down
syndrome.
Methods. A randomised controlled trial will compare progressive resistance training with a control
group undertaking a social programme. Seventy adolescents and young adults with Down syndrome
aged 14–22 years and mild to moderate intellectual disability will be randomly allocated to the
intervention or control group using a concealed...