Introduction and Overview
In: Confidence Intervals
By: Michael Smithson
Pub. Date: 2011
Access Date: May 6, 2019
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9780761924999
Online ISBN: 9781412983761
DOI: https://dx.doi.org/10.4135/9781412983761
Print pages: 2-3
© 2003 SAGE Publications, Inc. All Rights Reserved.
Introduction and Overview
This monograph surveys methods for constructing confidence intervals, which estimate and represent
statistical uncertainty or imprecision associated with estimates of population parameters from sample data. A
typical example of a confidence interval statement is a pollster's claim that she or he is 95% confident that
the true percentage vote for a political candidate lies somewhere between 38% and 44%, on the basis of
a sample survey from the voting population. Pollsters often refer to the gap between 38% and 44% as the
“margin of error.” In statistical terms, the interval from 38% to 44% is a 95% confidence interval, and 95% is
the confidence level. The pollster's claim actually means that she or he has a procedure for constructing an
interval that, under repeated random sampling in identical conditions, would contain the true percentage of
the vote 95% of the time. We will examine this more technical meaning in the next chapter.
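To sketch where such an interval can come from, suppose (purely for illustration) a simple random sample of 1,000 respondents, 41% of whom favour the candidate. The familiar normal-approximation formula for a proportion gives

0.41 ± 1.96 × √(0.41 × 0.59 / 1,000) ≈ 0.41 ± 0.03,

that is, an interval running from roughly 38% to 44%. The reasoning behind this kind of calculation, and the conditions under which it is justified, are developed in Chapters 2 and 3.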
This interval conveys a lot of information concisely. Not only does it tell us approximately how large the vote
is, but it also enables anyone so disposed to evaluate the plausibility of various hypothetical percentages. If
the previous election yielded a 39% vote for this candidate, for instance, then it is not beyond the bounds
of plausibility (at the 95% confidence level) that the candidate's popularity has remained the same. This is
because 39% is contained in the interval from 38% to 44% and therefore is a plausible value for the true
percentage vote. That said, we also cannot rule out the possibilities of an increase by as much as 5% or a
decline by as much as 1%.
The confidence interval also enables us to assess the capacity of the poll to resolve competing predictions
or hypotheses about the candidate's popularity. We can rule out, for instance, a hypothesis that the true
percentage is 50%, but we cannot rule out hypothetical values of the percentage vote that fall within the
38%-44% interval. If, for example, the candidate needs to gain a clear majority vote to take office, then this
poll is able to rule that out as implausible if the election were held on the same day as the poll (assuming that
a 95% confidence level is acceptable to all concerned). If, on the other hand, the candidate needs only a 4%
increase to take office, then the confidence interval indicates that this is a plausible possibility. In fact, as we
will see in Chapter 2, a confidence interval contains all the hypothetical values that cannot be ruled out (or
rejected). Viewed in that sense, it is much more informative than the usual significance test.
This monograph refers to a fairly wide variety of statistical techniques, but many of these should be familiar to
readers who have completed an undergraduate introductory statistics unit for social science students. Where
less familiar techniques are covered, readers may skip those parts without sacrificing their understanding of
the fundamental concepts. In fact, Chapters 2–4 and 7 cover most of the fundamentals. Chapter 2 introduces
the basis of the confidence interval framework, beginning with the concepts of a sampling distribution and
a limiting distribution. Criteria for “best” confidence intervals are discussed, along with the trade-off between
confidence and precision (or decisiveness). The strengths and weaknesses of confidence intervals are
presented, particularly in comparison with significance tests.
Chapter 3 covers “central” confidence intervals, for which the same standardized distribution may be used
regardless of the hypothetical value of the population parameter. Many of these will be familiar to some
readers because they are based on the t, normal, chi-square, and F distributions. This chapter also introduces
the transformation principle, whereby a confidence interval for a parameter may be used to construct an
interval for any monotonic transformation of that parameter. Finally, there is a brief discussion of the effect
that sampling design has on variability and therefore on confidence intervals.
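To give a flavour of the transformation principle just mentioned (a sketch rather than one of the book's own examples): if [L, U] is a 95% confidence interval for the logarithm of an odds ratio, then, because exponentiation is monotonic, [exp(L), exp(U)] is a 95% confidence interval for the odds ratio itself.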
Chapter 4 introduces “noncentral” confidence intervals, based on distributions whose shape changes with
the value of the parameter being estimated. Widely applicable examples are the noncentral t, F, and
chi-square distributions. Confidence intervals for the noncentrality parameters associated with these distributions may be converted into confidence intervals for several popular effect-size measures such as multiple R² and Cohen's d.
Chapters 5 and 6 provide extended examples of the applications of confidence intervals. Chapter 5 covers
a range of applications in ANOVA and linear regression, with examples from research in several disciplines.
Chapter 6 deals with topics in categorical data analysis, starting with univariate and bivariate techniques and
proceeding to multi-way frequency analysis and logistic regression.
Chapter 7 elucidates the relationship between the confidence interval and significance testing frameworks,
particularly regarding power. The use of confidence intervals in designing studies is discussed, including the
distinctions arising between considerations of confidence interval width and power. Chapter 8 provides some
concluding remarks and brief mentions of several topics related to confidence intervals but not dealt with in
this monograph, namely measurement error, complex sample designs, and meta-analysis.
I have received useful advice from many colleagues and students on drafts of this monograph. I am especially
indebted to John Beale, Geoff Cumming, Chris Dracup, John Maindonald, Craig McGarty, Jeff Ward, and the
students in the ACSPRI Summer School 2001 Confidence Interval Workshop for detailed and valuable ideas,
data, criticism, and error detection. Of course, I am solely responsible for any remaining errors or flaws in this
work.
http://dx.doi.org/10.4135/9781412983761.n1
Statistical Research Designs for Causal
Inference
In: Designing Research in the Social Sciences
By: Martino Maggetti, Fabrizio Gilardi & Claudio M. Radaelli
Pub. Date: 2015
Access Date: May 6, 2019
Publishing Company: SAGE Publications Ltd
City: London
Print ISBN: 9781849205016
Online ISBN: 9781473957664
DOI: https://dx.doi.org/10.4135/9781473957664
Print pages: 69-92
© 2013 SAGE Publications Ltd All Rights Reserved.
Statistical Research Designs for Causal Inference
Introduction
In Chapter 3 we discussed the different ways in which the social sciences conceptualize causation and
argued that there is no single way in which causal relationships can be defined and analysed empirically. In
this chapter, we focus on a specific set of approaches to constructing research designs for causal analysis,
namely, one based on the potential-outcomes framework developed in statistics. As discussed in Chapter
3, this perspective is both probabilistic and counterfactual. It is probabilistic because it does not assume
that the presence of a given cause leads invariably to a given effect, and it is counterfactual because it
involves the comparison of actual configurations with hypothetical alternatives that are not observed in reality.
In essence, this approach underscores the necessity to rely on comparable groups in order to achieve valid
causal inferences. An important implication is that the design of a study is of paramount importance. The
way in which the data are produced is the critical step of the research; the actual data analysis, although obviously important, plays a secondary role. However, a convincing design requires research questions to be broken down into manageable pieces. Thus, the big trade-off in this perspective is between reliable inferences
(that is, conclusions based on empirical evidence) on very specific causal relationships on the one hand, and
their broader context and complexity (and, possibly, theoretical relevance) on the other hand.
The chapter first distinguishes between two general perspectives on causality, namely, one that places
the causes of effects in the foreground, and another that is more interested in the effects of causes. We
then introduce the potential-outcomes framework before discussing several research designs for causal
inference, notably various types of experiments and quasi-experiments. This is followed by a discussion of
the implications for research design, and the conclusion summarizes the main points.
Causes of Effects and Effects of Causes
To understand the specificities of statistical research designs for causal inference, it is useful to consider
a general difference between quantitative and qualitative approaches to causal analysis. While quantitative
approaches typically focus on the ‘effects of causes’, qualitative approaches usually examine the ‘causes
of effects’ (Mahoney and Goertz, 2006). An equivalent distinction is that between ‘forward’ and ‘reverse’
causal inference: forward causal inference asks ‘What might happen if we do X?’ while reverse causal
inference asks ‘What causes Y?’ (Gelman, 2011). The difference between the two approaches overlaps in
part with that characterizing ‘variable-oriented research’ on the one hand and ‘case-oriented research’ on
the other (Ragin, 1987: 34–68; see also Chapter 3, this volume). Obviously, both are legitimate and fruitful
perspectives in the social sciences, each with its own trade-offs. Moreover, it would be wrong to draw a sharp
distinction between qualitative and quantitative research. As we will see throughout this chapter, although
statistical research designs for causal inference necessarily rely on quantitative techniques (otherwise they
would not be ‘statistical’), qualitative information and substantive knowledge are an important precondition for
meaningful analyses and are often an integral component of experiments and quasi-experiments.
For instance, consider the case of women's quotas in parliamentary elections. Figure 4.1 compares the
percentage of women in parliament in 69 countries with and 84 countries without quotas (Tripp and Kang,
2008). Each dot represents a country, and Finland, Sweden, France and the Netherlands are highlighted.
Horizontal lines represent the average percentage of women in parliament in each group. From an effects-of-causes perspective, we would investigate the consequences of quotas on female representation. That
is, the starting point is the presumed cause (quotas), and the aim is to measure its causal connection
with the presumed effect (for example, the percentage of women in parliament). The fact that, on average,
countries with quotas have more women in parliament than those without quotas suggests that quotas might
be conducive to better female representation. On the other hand, from a causes-of-effects perspective we
would begin with the outcome and trace our way back to the possible causes. For instance, we could ask
why two relatively similar countries such as Finland and the Netherlands have similar shares of women in
parliament (about 37 per cent), although only the Netherlands has gender quotas. We could also ask why,
in Sweden, there are almost four times as many women in parliament as in France (45.3 per cent compared
to 12.2 per cent), given that both countries have introduced quotas. The first perspective would be likely to
produce a single estimate of the causal effect, while the second would probably give an extensive account of
the numerous factors influencing female representation and explain the cases holistically, that is, in all their
complexity. However, significant qualitative knowledge is also required in the former, both for constructing an
appropriate research design and for interpreting the finding correctly.
Figure 4.1 Percentage of women in parliament in 69 countries with and 84 countries without quotas.
Each dot represents a country. Horizontal lines represent the average percentage of women in
parliament in each group
Statistical research designs embrace the effects-of-causes approach. As Gelman (2011) argues, ‘What causes
Y?’ is often the question that motivates the analysis in the first place. However, attempting to answer the
question directly leads inevitably to a proliferation of hypotheses, most of which are actually likely to have
some validity. Thus, the risk is that the analysis becomes intractable. This is the problem of overdetermination,
or the fact that there are always a myriad of factors contributing in some way to a specific outcome. As we
discuss in Chapter 6, there are methods that allow us to address this issue from a case-oriented perspective,
that is, within a causes-of-effects approach. However, statistical designs reframe the question in terms of
the effects of causes. They break the question down, identify a particularly interesting factor, and ask what
consequences it has on the outcome of interest. An implication of this strategy is that multiple analyses
are needed to uncover complex causal paths, because each analysis can examine only one path at a time. Or,
as Gelman (2011) puts it, in this perspective we are trying to learn about a specific causal path within a
more complex causal structure, but not about the causal structure itself. Thus, statistical designs prioritize
the reliability of very specific causal estimates at the expense of the broader context in which they operate
and possibly even of the connection with the original (theoretical and/or empirical) problem, which must be
redefined in order to make it fit within the strict requirements of the analytical design.
The Potential-Outcomes Framework
The potential-outcomes framework, also known as the counterfactual model, presupposes a dichotomous
treatment (Di), such as (to continue our example from the previous section) the presence or absence of
women's quotas. If Di = 1, then country i has quotas for the representation of women in parliament, while if Di
= 0, then it does not. Further, the framework assumes that there are two potential outcomes for each unit i,
Y1i and Y0i. The outcomes are associated with the two possible values of the treatment. In our example, Y1i
is the percentage of women in parliament in country i in the presence of quotas, while Y0i is that percentage
if the same country i does not have quotas. Formally, we can represent this idea as follows:

Yi = Y1i if Di = 1; Yi = Y0i if Di = 0
Notice that both outcomes refer to the same unit. But, of course, it is impossible that, in our example, the same
country both does and does not have quotas. This is why the two outcomes are called ‘potential’; only one is
realized and can be observed, while the other is its logical counterpart, which exists only in the realm of ideas.
However, conceptually, both are necessary for the definition of a causal effect. If we were able to observe,
for the same country i, the percentage of women both with and without quotas, then we could compute the
causal effect for that country simply as the difference between the two outcomes:

causal effect for country i = Y1i − Y0i
On this basis, and always on the assumption that both outcomes can be observed (which, in fact, is not
possible), we can define two other quantities. The first is the average treatment effect (ATE), which, as the
name indicates, is the average effect of the treatment for all units (for instance, the average effect of quotas
in all countries):

ATE = E[Y1i − Y0i]
That is, the ATE is defined as the average difference between the two potential outcomes in all countries.
The second quantity is the average treatment effect on the treated (ATT), that is, the effect of the treatment
averaged only over units that actually receive the treatment (for instance, the average effect of quotas in
countries with quotas):

ATT = E[Y1i − Y0i | Di = 1]
That is, we make the same computation as for the ATE, but only for the subset of countries with quotas (those
for which Di = 1). Countries without quotas (Di = 0) are disregarded.
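As a minimal illustration of these definitions, the following sketch simulates both potential outcomes for a set of hypothetical countries and computes the ATE and the ATT directly. The numbers and variable names are invented for illustration and do not come from Tripp and Kang (2008).

```python
# Minimal simulation sketch of the ATE and ATT definitions above.
# y1 and y0 are the two potential outcomes; 'quota' plays the role of Di.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y0 = rng.normal(15, 5, n)           # % of women in parliament without quotas
y1 = y0 + rng.normal(3, 2, n)       # % of women in parliament with quotas
# Countries with a higher baseline are more likely to adopt quotas (self-selection).
quota = rng.binomial(1, 1 / (1 + np.exp(-(y0 - 15) / 5)))

ate = (y1 - y0).mean()              # average effect over all countries
att = (y1 - y0)[quota == 1].mean()  # average effect among countries with quotas
print(ate, att)
# In real data only y1 for quota countries and y0 for the others is observable,
# which is why these quantities cannot be computed directly.
```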
These definitions rely on a critical assumption, namely, the so-called stable unit treatment value assumption
(SUTVA) (Morgan and Winship, 2007: 37–40). This has two components. First, the treatment must be the
same for all units. While the effect of the treatment can vary across units (if it did not, we would not need to
compute averages for the ATE and ATT), the treatment itself must be equivalent in all units. In our example,
this assumption is in fact violated because there are several types of quotas, namely, compulsory or voluntary
party quotas, reserved lists, and women-only lists (Tripp and Kang, 2008: 347). By collapsing them into a simple
‘quotas versus no quotas’ dichotomy, we assume that each of these instruments has the same consequences
for female representation, which is unlikely to be the case. However, this assumption is necessary in the
potential-outcomes framework. Second, the outcomes in one unit must be independent of the treatment
status in other units. In other words, the percentage of women in a given country must be unrelated to whether
or not other countries have quotas. This assumption should be met in our example, but it is easy to imagine
other situations in which it does not hold, for instance when the treatment has network effects or other types
of externalities. The interdependencies discussed in Chapter 7 are good cases in point.
As noted above, these definitions of treatment effects are purely theoretical. In reality, we cannot observe
the same unit both with and without the treatment. This is known as the ‘fundamental problem of causal
inference’ (Holland, 1986), and it is what makes causal inference so difficult in practice. The nature of the
problem is summarized in Table 4.1. In reality we can observe two outcomes, namely, in our example, the
percentage of women in parliament in the presence of quotas given that there are actually quotas, and
the percentage in the absence of quotas given that there are actually no quotas. However, to compute the
quantities defined above, we would also need the two corresponding counterfactual outcomes, namely, the percentage of women in parliament in the absence of quotas in countries that actually have quotas, and the percentage in the presence of quotas in countries that actually do not have quotas. To illustrate more intuitively, take
the case of France. Because this country has women's quotas, we are here in the top-left corner of Table
4.1. To compute the causal effect of quotas in France, we should take the difference between the observed
percentage of women in parliament (12.2 per cent) and the value that we would observe if France had no
quotas, that is, the value of the top-right corner of Table 4.1. The same logic applies to countries that have no
quotas, namely, those in the bottom-right corner, which would need to be compared with their counterfactuals
in the bottom-left corner.
Table 4.1 The fundamental problem of causal inference (based on Morgan and Winship, 2007: 35)

                                        Outcome with quotas (Y1i)    Outcome without quotas (Y0i)
Countries with quotas (Di = 1)          Observable                   Counterfactual
Countries without quotas (Di = 0)       Counterfactual               Observable
What if we compute the difference between the two quantities we can actually observe? As we have seen
in Figure 4.1, countries with quotas have, on average, more women in parliament (19.2 per cent) than those
without (13.2 per cent). It turns out that this observed difference in averages is equal to the ATT (one of our
quantities of interest), plus a selection bias (Angrist and Pischke, 2009: 14). In our example, the selection
bias corresponds to the average difference between the percentage of women in parliament without quotas
in countries that actually have quotas (a counterfactual) and the percentage without quotas in countries
that actually do not have them (which is observable). The former group includes countries such as France,
Germany, and Sweden, while the latter includes countries such as Ghana, Syria, and Vietnam. In fact, Table
4.2 shows that the two groups differ systematically in a number of ways. Countries with quotas tend to
be wealthier, more democratic, and more likely to have a proportional system of electoral representation.
Although the difference is only borderline significant, women in countries with quotas also tend to be more
educated. All these factors are likely to be associated with a higher share of women in parliament even in
the absence of quotas. This is what ‘selection bias’ means in this context. Countries are not assigned quotas
randomly; they self-select into this policy. Therefore, countries with and countries without quotas differ in a
number of ways, and the two groups are not readily comparable.
Table 4.2 Countries with and countries without quotas are quite different (calculations based on Tripp
and Kang, 2008)
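In the notation used above, the decomposition of the observed difference into the ATT plus a selection bias (Angrist and Pischke, 2009: 14) can be written out as follows, with E[·] denoting an average over countries:

E[Yi | Di = 1] − E[Yi | Di = 0]
= E[Y1i | Di = 1] − E[Y0i | Di = 0]
= (E[Y1i | Di = 1] − E[Y0i | Di = 1]) + (E[Y0i | Di = 1] − E[Y0i | Di = 0])
= ATT + selection bias

The selection bias term compares the counterfactual no-quota outcome of quota countries with the observed outcome of countries without quotas, exactly the comparison described above.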
In sum, within the potential-outcomes framework, causal effects are clearly defined but cannot be directly
computed in practice because the required counterfactuals are unobservable. However, researchers can rely
on several methods to estimate them. We turn to these in the next section.
Methods
Regression
Regression analysis is:
[a]n extension of correlation analysis, which makes predictions about the value of a dependent
variable using data about one or more independent variables. A key parameter estimated in a
regression analysis is the magnitude of change in the dependent variable associated with a unit
change in an independent variable. This parameter is referred to as the slope or the regression
coefficient. (Brady and Collier, 2004: 303)
In most quantitative studies, the default research design applies this technique to observational data, that
is, information that was not generated by a process controlled by the researcher. The data set used by
Tripp and Kang (2008) is a typical example. By contrast, experimental data are those produced under the
supervision of the researcher. To continue with our example, a bivariate regression (that is, including just one
explanatory variable) of the share of women in parliament on quotas indicates that countries with quotas have
on average about 6 per cent more women in parliament than countries without quotas, and that the difference
is statistically highly significant.1 This difference corresponds exactly to what is shown in Figure 4.1. An
obvious problem with this analysis is that it fails to control for the differences that exist across countries
beyond the presence of quotas, such as those shown in Table 4.2. In other words, the bivariate regression
neglects the selection bias problem. A multivariate regression (that is, including several explanatory variables)
can mitigate it, to a certain extent. If we include the variables listed in Table 4.2, quotas remain significantly
associated with female representation, but the size of the effect is reduced by half in comparison with the
bivariate regression. That is, with per capita gross domestic product (GDP), women's education, democracy,
and the type of electoral system controlled for, countries with quotas have on average only about 3.2 per cent
more women in parliament than countries without quotas.2 The inclusion of control variables is also known
as ‘covariate adjustment’, which means that the analysis adjusts the estimate of the causal effect for those
covariates (that is, variables) that can be taken into account.
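The contrast between the bivariate and the covariate-adjusted regression can be sketched with simulated data as below; the variable names (quota, gdp_pc, women_share) are hypothetical stand-ins, not the actual Tripp and Kang (2008) variables.

```python
# Sketch of covariate adjustment on simulated data (not the actual country data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 150
gdp_pc = rng.normal(10, 2, n)                               # hypothetical covariate
quota = rng.binomial(1, 1 / (1 + np.exp(-(gdp_pc - 10))))   # richer countries self-select
women_share = 10 + 3 * quota + 1.5 * gdp_pc + rng.normal(0, 4, n)
df = pd.DataFrame({"women_share": women_share, "quota": quota, "gdp_pc": gdp_pc})

bivariate = smf.ols("women_share ~ quota", data=df).fit()           # ignores selection
adjusted = smf.ols("women_share ~ quota + gdp_pc", data=df).fit()   # covariate adjustment
print(bivariate.params["quota"], adjusted.params["quota"])
# The bivariate coefficient overstates the true effect (3); adjustment moves it closer.
```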
Under some conditions, regression can yield unbiased estimates of causal effects (Morgan and Winship,
2007: 136–42). These conditions, however, are quite restrictive and generally unlikely to be met in practice.
First, there must be no omitted variables in the analysis. That is, in our example, all factors influencing
the percentage of women in parliament besides quotas must be measured and included in the regression.
Obviously, no analysis can ever fulfil this requirement perfectly, which means that only rarely can the causal
estimates produced by regression analysis be credibly considered unbiased.
Second, the functional relationship between the control variables and the outcome must be fully and correctly
specified. This means, for instance, that any non-linearities in the relationship between, say, per capita GDP
and women's representation (for instance, the correlation is stronger at lower levels of per capita GDP),
as well as any interactions (for instance, the correlation between women's education and representation
depends on the level of per capita GDP) must be explicitly and correctly modelled. This quickly becomes
intractable with even just a handful of variables, a problem that is known as the ‘curse of dimensionality’.
This requirement stems from the fact that, in most practical situations, the treatment and control groups are
quite different; in other words, the covariates are not balanced between them. In fact, this is the case in our
example, as shown in Table 4.2. Therefore, the analysis needs to make assumptions in order to extrapolate
the comparison between countries with and without quotas for specific combinations of control variables.
The problem can be alleviated by a method called ‘matching’ (Ho et al., 2007), which attempts to make the
treated and control groups more similar by removing ‘incomparable’ cases. One can, for instance, compute
the probability that a unit receives the treatment (the ‘propensity score’) and then find, for each treated unit,
an untreated unit with a very similar propensity score. If this procedure is successful (which depends on
the characteristics of the data set), then a better balance between the two groups is achieved (that is, they
are more comparable) and the analysis becomes less dependent on the specific assumptions made by the
regression model. However, matching improves comparability only with respect to variables that can actually
be observed. Thus, the first condition (no omitted variables) remains a big problem.
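A rough sketch of this matching logic on simulated data is shown below. A real application would rely on a dedicated matching package and check covariate balance before and after matching, but the two core steps, estimating a propensity score and pairing each treated unit with the nearest untreated unit, are the same.

```python
# Rough propensity-score matching sketch on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
gdp_pc = rng.normal(10, 2, n)                                # hypothetical covariate
quota = rng.binomial(1, 1 / (1 + np.exp(-(gdp_pc - 10))))    # self-selection on gdp_pc
women_share = 10 + 3 * quota + 1.5 * gdp_pc + rng.normal(0, 4, n)

# Step 1: estimate the propensity score (probability of having quotas).
pscore = LogisticRegression().fit(gdp_pc.reshape(-1, 1), quota).predict_proba(
    gdp_pc.reshape(-1, 1))[:, 1]

# Step 2: pair each quota country with the non-quota country whose score is closest.
treated = np.where(quota == 1)[0]
control = np.where(quota == 0)[0]
nearest = control[np.abs(pscore[treated][:, None] - pscore[control][None, :]).argmin(axis=1)]

att_estimate = (women_share[treated] - women_share[nearest]).mean()
print(att_estimate)   # typically closer to the true effect (3) than the raw difference in means
```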
Experiments
As we have seen, two main practical problems arise when the potential-outcomes approach is implemented
empirically. First, selection bias is ubiquitous, which means that the comparability of the treatment and control
groups is usually limited. Second, while regression can in principle solve this problem, omitted variables and
the ‘curse of dimensionality’ will in most cases lead to biased estimates of causal effects. The appeal of
the experimental approach is that it is much more effective in ensuring that treated and control units are in
fact comparable. This occurs through ‘randomization’, that is, random assignment of treatment to the units.
Specifically, what defines experiments is that randomization is undertaken by researchers themselves. With
randomization, systematic differences between the two groups can occur only by chance and, if the number
of units is sufficiently large, with a very low probability. Moreover, the procedure works for both observable and
unobservable characteristics, such that omitted variables are no longer a problem. Because randomization is
so powerful, the data can in principle be analysed with simple techniques, and the difference in means for the
outcome between treatment and control groups (or, equivalently, the coefficient of a bivariate regression) can
be interpreted as the ATE as well as the ATT. A common problem is that units are not selected randomly from
the population, such that it is not possible to generalize the estimates straightforwardly beyond the sample.
However, the estimates are still valid for the units that were part of the experiment. It should be emphasized
that, of course, randomization is not perfect and there are several ways in which it can go wrong. For instance,
it is possible that not all the units that are assigned to the treatment are actually treated or, conversely, that
some control units become exposed to it (‘non-compliance’); it is also possible that, for one reason or another,
outcomes cannot be observed for some units (‘attrition’) (Gerber and Green, 2008). However, experiments
have an unparalleled capacity to uncover causal relationships and are widely considered the ‘gold standard’
in this respect.
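The following compact simulation sketch (invented numbers again) shows why randomization makes the simple difference in means interpretable as the average treatment effect.

```python
# Sketch: under random assignment, the difference in observed means recovers the ATE.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
y0 = rng.normal(15, 5, n)               # potential outcome without the treatment
y1 = y0 + rng.normal(3, 2, n)           # potential outcome with the treatment (true ATE = 3)
d = rng.binomial(1, 0.5, n)             # treatment assigned by coin flip

observed = np.where(d == 1, y1, y0)     # only one potential outcome is ever observed
diff_in_means = observed[d == 1].mean() - observed[d == 0].mean()
print((y1 - y0).mean(), diff_in_means)  # both close to 3, because the groups are comparable
```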
In our women's quotas example, an experiment would imply that quotas are attributed to countries randomly.
As a consequence, and in contrast to what we have seen in Table 4.2, the groups of countries with and without
quotas would be very similar, if not exactly identical, in all characteristics that could potentially affect women's
representation, including those that cannot be observed. Therefore, the average difference in the percentages
of women in parliament between the two groups could in principle be interpreted as the causal effect of
quotas. The example shows the advantages of the experimental approach, but also an obvious drawback
in the social sciences. In many, if not most, cases, randomization cannot be implemented for a number of
practical and ethical reasons. For instance, imposing a dictatorship on a random subset of democracies
(to see the consequences on economic growth, for example) is impossible in practice and, even if it were
feasible, would be unethical. Given these problems, it is not surprising that experiments are not the first
method that comes to mind when one thinks of social science research. At the same time, in recent years
they have been used with increasing frequency and success and have become a mainstream tool for social
scientists (Druckman et al., 2006). We can distinguish among three broad types, namely, laboratory, survey
and field experiments, which we discuss in the following subsections.
Laboratory Experiments
Laboratory experiments are ‘experiments where the subjects are recruited to a common location, the
experiment is largely conducted at that location, and the researcher controls almost all aspects in that
location, except for subjects’ behavior’ (Morton and Williams, 2008: 346; emphasis in original). They are what
first comes to mind when we hear the word ‘experiment,’ namely, a relatively small group of people, not
necessarily representative of the broader population (for example, students), following precise instructions to
perform a set of abstract tasks.
Despite their stylized nature, laboratory experiments can help to uncover important causal relationships.
For example, Correll (2004) was interested in how cultural beliefs about gender differences in ability affect
career choices through the self-assessment of performance. If it is commonly accepted in society that, say,
men are better than women at mathematics, then the theory is that, at equal levels of objective skills, men
will evaluate their competence more highly than women do. Consequently, men will be more inclined than
women to pursue a career in a field where maths is important, thus reproducing existing gender imbalances.
To estimate the causal effect of cultural frames, Correll (2004) set up an experiment in which about 80
undergraduate students were asked to perform a test purportedly designed to develop a new examination
for graduate school admission. The test had no right or wrong answers (but was perceived as credible) and
all subjects were given the same score, that is, the same objective assessment of their skills. By contrast,
their cultural expectations (that is, the treatment) were manipulated by assigning subjects randomly to two
groups. The treated group was told that males tend to perform better at the task, while the control group was
informed that there are usually no gender differences in this context. After completing the test and receiving
the (fake) scores, subjects were asked to provide a self-assessment of their performance and to answer
questions about how likely they would be to pursue a career requiring high levels of the skills that were
purportedly tested. In line with the theoretical expectations, the analysis showed that, under the treatment
condition, females’ self-assessment was lower than males’, and that males’ assessment under the treatment
was higher than under the control condition. Further, these biased self-assessments were related to potential
career plans.
A second example is Dunning and Harrison (2010), which studied how cross-cutting cleavages moderate
the political saliency of ethnicity. The theory is that ethnic differences play a more important role in politics
if citizens speaking a given language, for instance, belong to a different religion and are poorer than
those speaking other languages. If, however, the different cleavages (linguistic, religious, economic) are
not superposed in this way, then it is expected that language is less relevant as a determinant of political
behaviour. Dunning and Harrison (2010) studied this argument in the case of Mali, a highly ethnically diverse
country, by focusing on ‘cousinage’, which is a form of identity and social bonds connected with groups of
patronymics (surnames) but distinct from ethnicity. The 824 subjects of the experiments, recruited in Mali's
capital city, were shown videotaped political speeches by a purported political independent considering being
a candidate for deputy in the National Assembly. Subjects were asked to evaluate the candidate on a number
of dimensions. The treatment was the politician's last name, which subjects could readily associate with
both ethnicity and cousinage ties. This set-up yielded four combinations of subjects’ and politician's ethnicity
and cousinage, namely, same ethnicity/cousins, same ethnicity/not cousins, different ethnicity/cousins, and
different ethnicity/not cousins. Additionally, in the control group the politician's name was not given. In line with
theoretical expectations, the candidate was evaluated best by the subjects when they shared both ethnicity
and cousinage and worst in the opposite scenario. Additionally, cousinage compensated for ethnicity: the
candidate was evaluated similarly when subjects and candidate were from the same ethnic group but without
cousinage ties and when they were from a different ethnic group but with cousinage ties.
In order to produce valid results, laboratory experiments must consider an extensive list of potential problems,
such as the nature of experimental manipulations, location, artificiality, subjects’ selection and motivation,
and ethical concerns (for a thorough discussion, see Morton and Williams, 2010). Furthermore, they are
vulnerable to the objection that, while their internal validity may be strong (that is, their results are valid within
the context of the experiment), their conclusions cannot be generalized to the ‘real world’. We return to this
point in the conclusion.
Survey Experiments
Survey experiments randomly assign the respondents of a survey to control and treatment conditions through
the manipulation of the form or placement of questions (Gaines et al., 2007: 3–4). Because many survey
experiments use samples that are representative of the population, they promise to achieve both internal and
external validity, the first through randomization, and the second through representativeness (Barabas and
Jerit, 2010: 226). These potential qualities, in combination with increasingly easy and cheap access to survey
resources, have made survey experiments more popular among social scientists in recent years.
For example, Hainmueller and Hiscox (2010) examined attitudes towards immigration. They asked whether,
as predicted by the labour market competition model, people tend to oppose immigrants with a skills level
similar to their own, who would be perceived as a more direct threat in the competition for jobs. The
experiment was embedded in a survey completed by 1,601 respondents in the United States, who were
randomly divided into two groups. Those in the treatment group were asked whether they agreed that the
USA should accept more highly skilled immigrants from other countries. The question asked in the control
group was identical except that ‘highly skilled’ was replaced with ‘low-skilled’. The authors were able to
confirm that randomization worked well because the distributions of respondents’ characteristics in the two
groups were statistically indistinguishable. The main finding of the analysis is that, contrary to theory, both
low-skilled and highly skilled respondents prefer highly skilled immigrants, which suggests that non-economic
concerns are very important to explaining attitudes towards immigration.
Another example is Linos (2011), who studied cross-national interdependencies (one of the topics of Chapter
7) in the field of family policy with an experiment in which 1,291 Americans were asked whether they agreed
that the United States should increase taxes to finance paid maternity leave. Respondents were assigned
randomly either to a control group, in which the question was formulated neutrally, or to one of four treatment
groups. In the first and second treatment groups, respondents were informed that the proposed policy was
already in place in Canada or in most Western countries, respectively. In the third, respondents learned
that the policy was recommended by the United Nations. Finally, in the fourth the policy was endorsed
by ‘American family policy experts’. The results show that, while in the control group only 20 per cent of
respondents supported increasing taxes to pay for maternity leave, the share jumped to about 40 per cent
in the treatment groups referring to Canada or other Western countries. Interestingly, the effect of foreign
models was comparable to that of American experts, while that of the UN was even slightly higher. Thus,
foreign experiences seemed to play a significant role in shaping public opinion on family policy, which could
be an important channel through which policies spread cross-nationally.
Researchers employing survey experiments face a distinct set of issues (Gaines et al., 2007; Barabas
and Jerit, 2010). The treatment can be problematic in several ways. It is typically administered as a
single exposure to an artificially intense stimulus, while in reality people may be exposed to it to varying
degrees, at several points in time, and in combination with other factors. Moreover, exposure to the real-world version of the treatment prior to the survey can bias the results. Also, survey experiments usually
measure the immediate effects of the treatment, but it would be important to know how long they last. In
short, even if the sample is representative, external validity can be compromised if the treatment itself lacks
representativeness.
Field Experiments
Field experiments ‘are experiments where the researcher's intervention takes place in an environment where
the researcher has only limited control beyond the intervention conducted’ (Morton and Williams, 2008: 346).
The central characteristic of experiments (randomized treatment assignment) is preserved but takes place in
the ‘real world’, which complicates its implementation in various ways. Field experiments are well established
particularly in the study of political behaviour and the political economy of development, but they have also
caught on in other sub-fields. Because of the logistical requirements, which often involve prolonged stays in
the area where the experiments take place and contacts with a large number of local actors, researchers gain
detailed knowledge of their cases, comparable to that of typical qualitative fieldwork. Thus, the qualitative-quantitative distinction is not very meaningful here.
For instance, Olken (2010) studied a classic question of democratic theory, namely, the comparative
advantages of direct democracy and representation. The field experiment randomized the political process
through which infrastructure projects were selected in 49 Indonesian villages. About 53 per cent of the
villages were randomly assigned to a direct democratic process in which all adults eligible to vote in national
elections could express their preference. In the remaining villages the standard process was followed. Project selection took place in small meetings that were open to the public but, in fact, attended by a limited number of members of the local elite (such as government officials and representatives of various groups). On
average, about 20 times as many people participated in the referenda as in the meetings. The randomization
produced treatment and control groups that were statistically indistinguishable with respect to both village
characteristics (such as ethnic and religious fragmentation, distance to subdistrict capital, population) and
individual characteristics (education, gender, age, occupation). The results of the experiment showed that
the same projects were selected under both decision-making processes, which suggests that representation
does not lead to outcomes that are biased in favour of the elite's preferences. However, villagers were
significantly more satisfied with the decisions when they were taken through referenda. Thus, it seems that
the main effect of direct democracy is to increase the legitimacy of decisions, but not necessarily to shift their
content closer to the population's preferences.
Another field experiment attempted to uncover the effects of political advertising on voters’ preferences
by randomizing radio and television advertisements, for a total value of about $2 million, during the 2006
re-election campaign of Texas governor Rick Perry (Gerber et al., 2011). The study randomized both the
starting date and the volume of advertisements across 20 media markets in Texas, but not stations or
programmes. The outcome, that is, voters’ evaluation of the candidate, was measured using large daily polls.
Results showed a strong short-term effect of the advertisements. The maximum advertising volume was
associated with an increase of almost five percentage points in the candidate's vote share during the week in
which the advertisements were aired. However, this effect vanished as soon as a week afterwards. Thus, the
results suggest that political advertising does make a difference, but this difference evaporates quite quickly.
In addition to problems common to all experiments (such as external validity), field experiments present
some specific challenges (Humphreys and Weinstein, 2009: 373–6). Given that many interesting variables
cannot be randomized because of practical constraints, only a relatively small subset of questions can be
investigated with this method. A possible solution is to focus on smaller units (for example, municipalities
instead of countries), but this will reduce the external validity of the analysis. Because field experiments
take place in real time and in real settings, many factors are not under the control of researchers and can
therefore contaminate the findings. A common problem is spillovers, or the fact that intervention in one unit
may affect outcomes in other units. As discussed above, this violates the SUTVA assumption of the potential-outcomes framework. The logistics of field experiments also constrains their size and reduces the precision
of the estimates, which is a problem especially if the effects are small. Finally, because they operate in real
contexts, field experiments also raise certain ethical concerns.
Quasi-Experiments
Quasi-experiments are observational studies (that is, they use data that were not generated by a process
controlled by the researcher) in which, thanks to circumstances outside the researcher's control, random
treatment assignment is approximated to a certain extent. That is, although the assignment of units to
treatment or to control status is not determined by the researchers but by naturally occurring social and
political processes, some features of the procedures make it credible to assume that it is ‘as if at random’. As
Dunning (2008) argues, the plausibility of this assumption is variable and the burden of proof must be on the
researcher. Thus, it is useful to situate quasi-experiments on a continuum with standard observational studies
at one end and classical randomized experiments at the other. Making the case convincingly usually requires
detailed knowledge of the context of the quasi-experiment. Moreover, the data are seldom readily available.
Their acquisition often necessitates archival work or other procedures typically associated with qualitative
studies. This demonstrates again that the distinction between quantitative and qualitative approaches is not
very relevant.
Quasi-experiments can take different forms. We discuss three: natural experiments, discontinuity designs,
and instrumental variables.
Natural Experiments
In natural experiments, the ‘as if at random’ component comes from some social, economic, and/or political
process that separates two groups cleanly on a theoretically relevant dimension. That is, although the quasi-randomization occurs without the researcher's intervention, it produces well-defined treatment and control
groups.
For instance, Hyde (2007) studied the effects of international election monitoring on electoral fraud with
data from the 2003 presidential election in Armenia, using polling stations as units of analysis. The outcome
variable was the share of votes of incumbent President Kocharian, who was widely believed to have
orchestrated extensive fraud operations. Poll stations in the treatment group were those visited by
international observers, while those in the control group were not inspected by the monitors. To measure
the treatment status of poll stations, Hyde (2007) relied on the list of assigned polling stations produced
by the organization in charge of monitoring the elections, the Office for Democratic Institutions and Human
Rights of the Organization for Security and Co-operation in Europe (OSCE/ODIHR). The validity of the natural
experiment rests upon the assumption that international observers were assigned to polling stations in a
way that approximates random assignment, and Hyde (2007) discussed in detail why this assumption was
plausible in this case. The OSCE/ODIHR staff completed the lists arbitrarily, only on the basis of logistical
considerations and with no knowledge of the socio-economic and political characteristics of the polling
stations. The analysis showed that the incumbent president received significantly more votes (between 2
and 4 per cent) in stations that were not monitored than in those that were visited by observers, which
suggests that this control mechanism has an impact on the extent of electoral fraud.
In another study, Bhavnani (2009) exploited an actual randomization, albeit one which he did not design, to
investigate the long-term effects of quotas on female representation, that is, their consequences after they
are withdrawn. A policy initiative in India reserved a certain number of seats for women in local elections; the reserved seats were chosen randomly for one legislature. The goal of this selection procedure was not to allow an
evaluation of the policy (though this was a welcome side product), but rather to make it as fair as possible by
ensuring that men would be excluded from certain seats only temporarily, and without biases towards specific
seats. Reserved and unreserved seats were statistically indistinguishable on many relevant dimensions,
which suggests that the randomization is likely to have worked. The analysis of elections in 1997 and 2002
showed that quotas had an effect on female representation not only during the election in which they were
enforced, which must be true if the policy is implemented properly, but also in the next election, after they
were no longer in force. A comparison of districts that were open both in 1997 and in 2002 with those that
were reserved in 1997 but open again in 2002 shows that the percentage of female winners was significantly
higher in the latter districts (21.6 per cent compared to 3.7 per cent). This indicates that the effects of quotas
extend beyond their duration, possibly by introducing new female candidates into politics and by changing the
perceptions of voters and parties.
Natural experiments are appealing because they feature randomization in a real-world setting without the
direct involvement of the researcher. However, because researchers have no control over them, and because
good natural experiments are rare, they often originate in the availability of a convenient configuration instead
of in a previously defined research question. In this sense, they tend to be method-driven rather than problem-driven. Nonetheless, this is not necessarily problematic, and the examples that we have just seen prove that
natural experiments can be used to investigate important questions.
Discontinuity Designs
Similar to natural experiments, discontinuity designs exploit sources of quasi-randomization originating in
social and political processes. In contrast to natural experiments, they rely on sharp jumps, or ‘discontinuities’,
in a continuous variable. The cut-off point determines whether a unit is exposed to the treatment or not,
the idea being that treatment assignment is ‘as if at random’ for units on either side of it. Elections are a
typical example of such discontinuities because it is quite reasonable to assume that, in narrow elections,
the outcome is due in large part to chance. While candidates who win by a landslide are likely to be very
different from those who receive only a handful of votes, candidates on either side of the election threshold
are probably similar in many respects.
Using these ideas, Eggers and Hainmueller (2009) compared the wealth at death of narrow winners and
losers in British national elections and found that successful Conservative Party candidates died with about
£546,000, compared with about £298,000 for candidates from the same party who were not elected. By
contrast, the difference was much smaller for Labour Party candidates, suggesting that the material benefits
of serving in Parliament differ across political parties. Gerber and Hopkins (2011) also relied on the random
component of elections, but to examine the effects of partisanship on public policy at the local level. The
comparison of 134 elections in 59 large American cities revealed that in most policy areas changes in
public spending were very similar regardless of whether a Republican or a Democrat narrowly won. The
one exception was policing expenditures, which were higher under successful Republican candidates. These
findings suggest that partisan effects are small at the local level.
Lalive and Zweimüller (2009) exploited a different type of discontinuity, namely, the date at which a longer
period of parental leave entered into force in Austria, to estimate the effects of this policy on mothers’ further
childbearing and careers. Mothers giving birth after 30 June 1990 were able to benefit from paid leave of 2
years instead of 1 year under the policy in force until that date. Because of this sharp cut-off, the duration of
the parental leave can be considered to be randomly assigned to mothers giving birth shortly before or after
30 June. Indeed, the two groups were indistinguishable on many observed socio-economic characteristics
such as age and work history and profile. The comparison of the two groups showed that longer parental
leave causes women to have more additional children. It also reduces their employment and earnings, but
only in the short term.
Sharp cut-offs, such as those found in elections and other settings, generally offer quite convincing sources
of quasi-randomization, even though researchers should carefully check whether actors are aware of the
discontinuity and exploit it, as in the case of income tax thresholds (Green et al., 2009: 401). However, it is
important to note that the causal effects estimated with this method apply only at the threshold and cannot
be extrapolated to all units. Because, usually, only relatively few observations are sufficiently close to the
threshold, the results produced by regression discontinuity designs apply to a specific subsample, which
limits their external validity. Moreover, there are trade-offs but no clear guidelines regarding the width of the
window around the threshold (Green et al., 2009). A larger window (and, thus, more observations) makes estimates more precise but potentially biased by unobserved factors, while a smaller window reduces the bias but also the number of observations and, thus, the precision of the estimates.
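A stylized sketch of a discontinuity comparison (simulated data, hypothetical cut-off at zero) illustrates both the basic design and the window-width trade-off just described: the wider window uses more observations but absorbs bias from the underlying trend, while the narrower window is less biased but noisier.

```python
# Stylized regression-discontinuity sketch; the true jump at the cut-off is 4.
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
margin = rng.uniform(-0.5, 0.5, n)                  # running variable, e.g. a vote margin
treated = (margin > 0).astype(int)                  # narrow winners receive the 'treatment'
outcome = 10 + 4 * treated + 20 * margin + rng.normal(0, 3, n)

for h in (0.20, 0.05):                              # wider vs. narrower window around the cut-off
    window = np.abs(margin) < h
    estimate = (outcome[window & (treated == 1)].mean()
                - outcome[window & (treated == 0)].mean())
    print(h, round(float(estimate), 2), int(window.sum()))
# The wide window gives a precise but upward-biased estimate; the narrow one is
# closer to 4 but based on far fewer observations.
```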
Instrumental Variables
Instrumental variables are factors that can be used to replace treatment variables for which the ‘as if at
random’ assumption does not hold (Sovey and Green, 2010). They have to meet three crucial assumptions.
The first is relatively innocuous and states that the instrument and the treatment are correlated, after relevant
covariates are controlled for. The second and third assumptions are usually much more problematic. The
‘exclusion restriction’ means that the instrument affects outcomes exclusively through its correlation with the
treatment, that is, it has no direct effect on the outcomes, while the ‘ignorability assumption’ requires that the
instrument is ‘as if at random’. Thus, good instruments are those produced by some sort of quasi-experiment.
Concretely, the estimation proceeds in two stages. In the first, the treatment variable is regressed on the
instrument and the results are used to compute expected values for the treatment. In the second stage, these
values replace the treatment in the main regression.
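A minimal sketch of these two stages on simulated data (not any of the studies discussed below) is given here; in practice one would use a dedicated two-stage least squares routine, since the standard errors from a manually run second stage are not correct.

```python
# Sketch of the two-stage procedure described above, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2_000
confounder = rng.normal(size=n)                 # unobserved, affects treatment and outcome
instrument = rng.normal(size=n)                 # 'as if at random'
treatment = 0.8 * instrument + confounder + rng.normal(size=n)
outcome = 2.0 * treatment + 3.0 * confounder + rng.normal(size=n)   # true effect = 2

# Stage 1: regress the treatment on the instrument and keep the fitted values.
stage1 = sm.OLS(treatment, sm.add_constant(instrument)).fit()
# Stage 2: replace the treatment with its fitted values in the main regression.
stage2 = sm.OLS(outcome, sm.add_constant(stage1.fittedvalues)).fit()

naive = sm.OLS(outcome, sm.add_constant(treatment)).fit()
print(naive.params[1], stage2.params[1])        # biased estimate vs. roughly 2
```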
In a famous study, Acemoglu et al. (2001) addressed the effects of institutions on economic development.
A simple regression of development on institutions is likely to be inappropriate (even with many control
variables), for two reasons. First, the causal relationship can arguably go both ways: better institutions cause
higher economic development, but higher economic development can also cause better institutions. Second,
similar to the example of women's quotas discussed above, it is likely that countries with different degrees
of economic development are different on many other dimensions as well. To circumvent these problems,
Acemoglu et al. (2001) employed mortality rates of European settlers (proxied by those of soldiers, bishops,
and sailors) as an instrument for current institutions. The argument is that European powers set up different
types of institutions depending on their ability to settle. If a region was hospitable, then European-style
institutions were constructed with an emphasis on property rights and checks against government power,
while if it was not hospitable, colonizers set up ‘extractive states’ for the purpose of transferring as many
resources as possible from the colony. The analysis shows a strong association between current institutions,
instrumented by settler mortality, and economic development, which corroborates the argument that a causal
relationship is at play rather than a mere correlation. An important caveat is the plausibility of the exclusion
restriction, that is, the possibility that the effect of settler mortality on economic development could work
through something other than institutions. For instance, the mortality rates of colonizers could be related to
current diseases, which may have had an impact on development. In this case, institutions would not be
part of the causal chain. However, the authors argue convincingly that the causes of European deaths in
the colonies (mainly malaria and yellow fever) were not likely to be connected with economic development
because the indigenous populations had developed immunities against these diseases.
In another application, election-day rainfall was used as an instrument for turnout to estimate its effects on electoral outcomes in the United States (Hansford and Gomez, 2010). Indeed, many studies have suggested
that higher turnout is beneficial to leftist parties (or Democrats in the United States), but the problem is
that many factors are likely to influence both the decision to vote and the vote itself at the same time. By
contrast, the weather on election day is likely to affect the choice to go to the polling booth, but not the
preference expressed in the vote.3 Moreover, rainfall on a specific day can probably be considered an ‘as if
at random’ event. The analysis was able to confirm that higher turnout does indeed cause a higher vote share
for Democratic candidates.
Finally, in a study already discussed in Chapter 3, Kern and Hainmueller (2009) studied the effects of
West German television on public support for the East German communist regime, using a survey of East
German teenagers. The survey included information for both the dependent (regime support) and treatment
(exposure to West German television) variables. Because it is highly likely that people who watch a lot of
West German programmes have different predispositions towards the communist regime in the first place,
the treatment cannot be considered ‘as if at random’. However, while West German television reception was
generally possible in East Germany, it was blocked in some regions (especially near Dresden) because of
their topography. As long as living in Dresden per se was not directly related to regime support and that
region was generally comparable with the rest of the country, living in Dresden can be used as an instrument
for television exposure. The analysis showed that, quite counter-intuitively, West German television caused
greater support for the East German regime, possibly because East German citizens consumed it primarily
for entertainment and not as a source of information.
Like the other approaches, instrumental variables come with their own set of problems (Sovey and Green,
2010). In fact, the list of potential issues is even longer because, in addition to the need to find a suitable
‘quasi-experiment’, the instrument must fit within the model that is used in the estimation in a very specific
way. Also, the results must be interpreted carefully because the causal effect estimates apply to a particular
subset of units and are known as ‘local average treatment effects’. In sum, if the right conditions are fulfilled,
instrumental variables are a valuable tool, but in practice their application is quite tricky.
Lessons for Research Design
If we take the statistical approach to causal inference seriously, the consequences for research design are
wide-ranging. The main lesson is that the design is the most important part of the research because it is at
this stage that the possibility of credibly identifying causal effects can be influenced. In fact, in the ideal-typical
case of a ‘perfect’ research design, that is, an experiment that is designed and implemented flawlessly, the
analysis stage becomes almost trivial because it suffices to compare mean outcomes in the treatment and
control groups. The sophistication of the methods used in the analysis must increase with imperfections in
the research design in order to correct them ex post.
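A minimal sketch of this point, using simulated data with a hypothetical randomized treatment, is the following: once the treatment is genuinely randomized, the causal effect can be estimated by a simple difference in group means.

```python
# With a properly randomized treatment, the analysis reduces to comparing means.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
treated = rng.integers(0, 2, n)                        # randomized treatment
outcome = 10 + 2.0 * treated + rng.normal(0, 5, n)     # true effect = 2.0

y1, y0 = outcome[treated == 1], outcome[treated == 0]
effect = y1.mean() - y0.mean()
se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
print(f"estimated effect = {effect:.2f} (SE {se:.2f})")
```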
To illustrate, consider again the example of women's quotas and female representation in parliament (Tripp
and Kang, 2008). The research design adopted by the authors, which is typical of cross-national quantitative
studies, was simply to collect data on as many countries as possible for the dependent variable (percentage
of women in parliament), treatment variable (quotas), and control variables (countries’ background
characteristics). The design stage ends here and the analysis begins; to produce credible causal estimates, the analysis needs to fix the basic problem that countries with and countries without quotas are not really
comparable. As discussed above, standard regression tools and newer matching methods can help, but only
up to a point. The fundamental problem is that they can adjust for the factors that we do observe, but not for
those that we do not, which are virtually always an issue. Thus, ex post fixes are bound to be imperfect.
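This limitation can be illustrated with a small simulation (a hypothetical sketch, not Tripp and Kang's data or model): when a confounder is left unobserved, neither the bivariate nor the multivariate regression recovers the true effect.

```python
# Regression can adjust for observed confounders but not for unobserved ones.
# All data and variable names here are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5_000
observed = rng.normal(size=n)        # a measured background characteristic
unobserved = rng.normal(size=n)      # a characteristic we cannot measure
quota = ((observed + unobserved + rng.normal(size=n)) > 0).astype(float)
pct_women = 13 + 3.0 * quota + 2.0 * observed + 2.0 * unobserved + rng.normal(size=n)
# true effect of quotas = 3.0

bivariate = sm.OLS(pct_women, sm.add_constant(quota)).fit()
multivariate = sm.OLS(pct_women,
                      sm.add_constant(np.column_stack([quota, observed]))).fit()
print("bivariate estimate   :", round(bivariate.params[1], 2))
print("multivariate estimate:", round(multivariate.params[1], 2))
# Controlling for `observed` narrows but does not close the gap to 3.0,
# because the unobserved confounder is still omitted.
```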
By contrast, the statistical approach to causal inference aims to fix things ex ante by constructing or finding
suitable treatment and control groups in advance of the analysis. As we have seen, this goal can be achieved
with different means. First, we can design our own experiments in the lab or in the field, or base them on
surveys. That is, the treatment can be randomized by the researcher in an artificial setting, in the real world,
or via the questions asked in a survey. Second, we can try to find constellations in which randomization is
approximated without the direct intervention of the researcher. Natural experiments, discontinuity designs,
and suitable instrumental variables are three options. In all these cases, the most traction for causal
inferences is gained through the way the comparison between treatment and control groups is configured,
not through the specific techniques used to analyse the data. The key benefit is that, if randomization is
implemented properly or is approximated sufficiently in a real-world setting, it produces groups that are
comparable not only for their observed but also for their unobserved characteristics. This is a major advantage
for the validity of causal inferences.
Thus, the quality of the research design is of the essence. The exacting requirement of a plausible ‘as if
at random’ assumption implies that downloading prepackaged data sets and letting the computer do the
counting is not enough, no matter how sophisticated the techniques. More creative solutions are required, and
few will involve broad cross-national comparisons, for the simple reason that broad international comparisons
are likely to be, well, incomparable. In fact, none of the examples discussed in this chapter compared
countries. Instead, they focused on specific within-country variations and used original data, often assembled
with great effort. Unfortunately, there are no clear guidelines for identifying promising comparisons. The
criteria that the research design must meet are clear, but discovering the right configuration in practice is an
art more than a science.
We emphasize that, in many ways, statistical research designs for causal inference transcend the usual
qualitative-quantitative distinctions. Obviously, they have strong quantitative components because they rely
on statistical techniques to estimate causal effects. However, they also require significant qualitative work and
substantive knowledge to identify the most promising cases, to collect hard-to-access data through archival
work or other qualitative procedures, and generally to construct a meaningful study. In some cases, such
as field experiments, researchers are actually involved in fieldwork comparable to that of many traditional
qualitative studies. Thus, these research designs do not fit well within a simple quantitative-qualitative
typology. The limits of such distinctions are a general theme of this book.
As with all approaches, statistical research designs for causal inference must face trade-offs. The most
important trade-off is that between validity and relevance. A common criticism of this approach is that it
leads to a focus on small, tractable questions at the expense of big problems that are harder to study. It is
undeniable that research in this tradition prioritizes internal over external validity. At the same time, the former
is arguably a prerequisite for the latter. In other words, it does not make much sense to generalize findings
that are not credible. Moreover, as Angrist and Pischke (2010) argue, external validity, or generalization,
remains an important goal that can be achieved through the cumulation of well-designed but necessarily
narrow studies. Finally, the examples discussed in this chapter studied problems such as the political
salience of ethnicity, attitudes towards immigration, the consequences of direct democracy in comparison to
representation, and foreign influences on support for autocratic rule. These are all ‘big’ questions and, even though no single study provided a definitive answer, each supplied convincing evidence on the causal effects in a specific setting. Other studies should try to replicate them in other contexts. If they
are successful, then the external validity and generalizability of the findings will be strengthened.
Conclusion
Figure 4.2 summarizes the main points of this chapter. We can classify statistical research designs for causal
inference along two dimensions. First, is the treatment assigned randomly, and, if so, how? Second, to what
extent are the treated and control units comparable?
Figure 4.2 A classification of statistical research designs for causal inference. Matching and
regression are in parentheses because, strictly speaking, they are estimation techniques and not
research designs
In the standard regression approach, whether or not it is supplemented by matching, there is no randomization and,
typically, self-selection into the treatment. For instance, the same variables that explain why countries adopt
women's quotas (the treatment) are likely to influence female representation in parliament (the outcome). The
problem is bigger if these variables are not included in the analysis (bivariate regression) than if they are
(multivariate regression), and matching can mitigate the problem further. However, there is no way around the
fact that the adjustment can be made only for those variables that can be observed, but not for those that are
unobserved. Therefore, the comparability of the treatment and control groups (countries with and countries
without quotas) and, consequently, the validity of causal inferences will be relatively limited.
By contrast, in experiments the treatment is randomized by researchers themselves and, in principle, the
treatment and control units will be highly comparable. Experiments can take place in the lab, in the field, and
within surveys. Quasi-experiments can credibly make the assumption that the treatment is assigned ‘as if at
random’ because of a particular process occurring in the real world, without the researcher's intervention. The
comparability of the treatment and control groups will in principle be quite high, significantly better than in the
standard regression approach, but somewhat worse than in experiments. The validity of the causal inferences
will vary accordingly.
In this context, an important trade-off is that between complexity or realism of the research question and
reliability of the causal estimates. To achieve the latter, statistical research designs narrow down complex
theoretical and/or empirical questions to smaller, tractable questions. These research designs can produce
valid estimates of causal relationships, but many different analyses are necessary to give the full picture of
a complex phenomenon. By contrast, other research designs discussed in this book put the emphasis on a
holistic view of causal processes, but at the cost of validity.
To conclude, the statistical approach emphasizes the importance of research design for valid causal
inferences. The primary concern is the construction of comparable treatment and control groups. This will be
difficult with standard cross-national data sets. Instead, researchers should produce their own experiments or
look for configurations in the real world that can approximate them, which requires considerable qualitative
knowledge and not just the mastery of quantitative techniques.
Checklist
• The key for causal inference is the construction or identification of appropriate treatment and control
groups.
• Random assignment of treatment to the units (‘randomization’) is the gold standard for causal
inference because it is the best way to make sure that the treatment and control groups are
comparable.
• We speak of experiments when researchers themselves undertake the randomization. We can
distinguish between laboratory, survey and field experiments.
• We speak of quasi-experiments when randomization is approximated due to circumstances outside
the researchers’ control. Natural experiments and discontinuity designs belong to this category.
• A successful experiment or quasi-experiment requires not just the application of quantitative
techniques, but also significant qualitative knowledge.
Questions
1 Read closely five articles making causal arguments in your field of study. To what extent do
they correspond to a ‘causes-of-effects’ or ‘effects-of-causes’ perspective?
2 For each of the five articles, reframe the causal claims using the potential-outcomes
framework and construct the equivalent of Table 4.1.
3 Read five articles making causal arguments using standard regression methods. To what
extent can the findings actually be interpreted causally?
4 Think of a specific research question. What would be the ideal experiment to test the causal
argument? Now try to develop a research design that can approximate it as much as possible
in practice.
5 Read closely five of the articles cited as examples in this chapter (or other articles of your
choice) and assess them with respect to the trade-off between the validity of the causal
inference and the relevance or importance of the findings.
1. % women = 13.18 (1.03) + 6.02 (1.53) × quotas. OLS estimates, standard errors in parentheses.
2. % women = −1.67 (5.68) + 3.2 (1.55) × quotas + 6.02 × electoral system + 0.11 (1.16) × democracy + 0.11 (0.14) × women's education + 1.18 (0.59) × GDP/cap (log). OLS estimates, standard errors in parentheses.
3. But recall the Italian expression ‘Piove, governo ladro’ (‘It's raining, thieving government’), which jokingly blames the government even for bad weather.
Further Reading
Angrist, J.D. and Pischke, J. (2009) Mostly Harmless Econometrics: An Empiricist Companion. Princeton,
NJ: Princeton University Press. A relatively non-technical introductory text written by economists.
Angrist, J.D. and Pischke, J. (2010) The credibility revolution in empirical economics: how better research
design is taking the con out of econometrics. Journal of Economic Perspectives, 24 (2): 3–30. A non-technical
summary of the book by the same authors. http://dx.doi.org/10.1257/jep.24.2.3
Morgan, S.L. and Winship, C. (2007) Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge: Cambridge University Press. A relatively technical introductory text written by sociologists. http://dx.doi.org/10.1017/CBO9780511804564
Morton, R.B. and Williams, K.C. (2010) Experimental Political Science and the Study of Causality: From Nature to the Lab. Cambridge: Cambridge University Press. A relatively technical introductory text written by political scientists. http://dx.doi.org/10.1017/CBO9780511762888
http://dx.doi.org/10.4135/9781473957664.n4