The Effectiveness of Psychotherapy
The Consumer Reports Study
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Martin E. P. Seligman
University of Pennsylvania
Consumer Reports (1995, November) published an article
which concluded that patients benefited very substantially
from psychotherapy, that long-term treatment did considerably better than short-term treatment, and that psychotherapy alone did not differ in effectiveness from medication
plus psychotherapy. Furthermore, no specific modality of
psychotherapy did better than any other for any disorder;
psychologists, psychiatrists, and social workers did not
differ in their effectiveness as treaters; and all did better
than marriage counselors and long-term family doctoring.
Patients whose length of therapy or choice of therapist was
limited by insurance or managed care did worse. The methodological virtues and drawbacks of this large-scale survey are examined and contrasted with the more traditional
efficacy study, in which patients are randomized into a
manualized, fixed duration treatment or into control groups.
I conclude that the Consumer Reports survey complements
the efficacy method, and that the best features of these two
methods can be combined into a more ideal method that
will best provide empirical validation of psychotherapy.
How do we find out whether psychotherapy works?
To answer this, two methods have arisen: the efficacy study and the effectiveness study. An efficacy
study is the more popular method. It contrasts some kind of
therapy to a comparison group under well-controlled conditions. But there is much more to an efficacy study than just a
control group, and such studies have become a high-paradigm endeavor with sophisticated methodology. In the ideal
efficacy study, all of the following niceties are found:
1. The patients are randomly assigned to treatment and
control conditions.
2. The controls are rigorous: Not only are patients
included who receive no treatment at all, but placebos containing potentially therapeutic ingredients credible to both
the patient and the therapist are used in order to control for
such influences as rapport, expectation of gain, and sympathetic attention (dubbed nonspecifics).
3. The treatments are manualized, with highly detailed
scripting of therapy made explicit. Fidelity to the manual
is assessed using videotaped sessions, and wayward
implementers are corrected.
4. Patients are seen for a fixed number of sessions.
5. The target outcomes are well operationalized (e.g.,
December 1995 • American Psychologist
Copyright 1995 by the American Psychological Association, Inc.
[Figure: mean percentage of respondents (N = 2,738) who reported that treatment "made things a
lot better" with respect to four domains: enjoying life more, personal growth and
insight, self-esteem and confidence, and alleviating low moods. Those treated by
psychiatrists, psychologists, social workers, marriage counselors, and physicians
are segregated by treatment for more than six months versus treatment for less than
six months.]
comes with higher credibility than studies that issue from
drug houses, from either APA, from consensus conferences
of the National Institute of Mental Health, or even from the
halls of academe.
In summary, the main methodological virtue of the CR
study is its realism: It assessed the effectiveness of psychotherapy as it is actually performed in the field with the population that actually seeks it, and it is the most extensive, carefully done study to do this. This virtue is akin to the virtues of
naturalistic studies using sophisticated correlational methods, in contrast to well-controlled, experimental studies. But
because it is not a well-controlled, experimental study like an
efficacy study, the CR study has a number of serious methodological flaws. Let us examine each of these flaws and ask to
what extent they compromise the CR conclusions.
Consumer Reports Study: Methodological
Flaws and Rebuttals
Sampling. Is there a bias such that those respondents who succeed in treatment selectively return their
questionnaires? CR, not surprisingly, has gone to considerable lengths to find out if its readers' surveys have sampling
bias. The annual questionnaires are lengthy and can run to
100 questions or more. Moreover, the respondents not only
devote a good deal of their own time to filling these out but
also pay their own postage and are not compensated. So the
return rate is rather low absolutely, although the 13% return
rate for this survey was normal for the annual questionnaire.
But it is still possible that respondents might differ systematically from the readership as a whole. For the mental
health survey (and for their annual questionnaires generally), CR conducted a "validation survey," in which postage
was paid and the respondent was compensated. This resulted in a return rate of 38%, as opposed to the 13%
uncompensated return rate, and there were no differences
between data from the two samples.
The possibility of two other kinds of sampling bias,
however, is notable, particularly with respect to the remarkably good results for AA. First, since AA encourages lifetime membership, a preponderance of successes—rather
than dropouts—would be more likely in the three-year time
slice (e.g., "Have you had help in the last three years?").
Second, AA failures are often completely dysfunctional and
thus much less likely to be reading CR and filling out extensive readers' surveys than, say, psychotherapy failures
who were unsuccessfully treated for anxiety.
A similar kind of sampling bias, to a lesser degree,
cannot be overlooked for other kinds of treatment failures.
At any rate, it is quite possible that there was a large
oversampling of successful AA cases and a smaller
oversampling of successful treatment for problems other
than alcoholism.
Could the benefits of long-term treatment be an artifact
of sampling bias? Suppose that people who are doing well in
treatment selectively remain in treatment, and people who
are doing poorly drop out earlier. In other words, the early
dropouts are mostly people who fail to improve, but later
dropouts are mostly people whose problem resolves. CR
disconfirmed this possibility empirically: Respondents reported not only when they left treatment but why, including
leaving because their problem was resolved. The dropout
rates due to the resolution of the problem were uniform
across duration of treatment (less than one month = 60%; 1-2 months = 66%; 3-6 months = 67%; 7-11 months = 67%; 1-2 years = 67%; over two years = 68%).
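These reported rates can be tabulated to make the uniformity argument concrete. A minimal sketch, assuming nothing beyond the percentages quoted above (the variable names and the spread statistic are mine, added only for illustration, not part of CR's analysis):

```python
# Dropout-due-to-resolution rates by treatment duration, as reported by CR.
resolution_dropout = {
    "under 1 month": 0.60,
    "1-2 months": 0.66,
    "3-6 months": 0.67,
    "7-11 months": 0.67,
    "1-2 years": 0.67,
    "over 2 years": 0.68,
}

# If early dropouts were mostly treatment failures, this rate should rise
# steeply with duration; instead it is nearly flat after the first month.
rates = list(resolution_dropout.values())
spread = max(rates) - min(rates)
print(f"spread across durations: {spread:.2f}")  # 0.08
```

The near-zero spread is what licenses the conclusion that differential dropout does not explain the long-term advantage.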
A more sweeping limit on generalizability comes from
the fact that the entire sample chose their treatment. To one
degree or another, each person believed that psychotherapy
and/or drugs would help him or her. To one degree or
another, each person acknowledged that he or she had a
problem and believed that the particular mental health professional seen and the particular modality of treatment chosen would help them. One cannot argue compellingly from
this survey that treatment by a mental health professional
would prove as helpful to troubled people who deny their
problems and who do not believe in and do not choose
treatment.
No control groups. The overall improvement rates
were strikingly high across the entire spectrum of treatments and disorders in the CR study. The vast majority of
people who were feeling very poor or fairly poor when they entered therapy made "substantial" (now feeling fairly good or very good) or "some" (now feeling so-so) gains. Perhaps
the best news for patients was that those with severe problems got, on average, much better. While this may be a
ceiling effect, it is a ceiling effect with teeth. It means that if
you have a patient with a severe disorder now, the chances
are quite good that he or she will be much better within three
years. But methodologically, such high rates of improvement
are a yellow flag, cautioning us that global improvement over
time alone, rather than with treatment or medication, may be
the underlying mechanism.
More generally, because there are no control groups,
the CR study cannot tell us directly whether talking to sympathetic friends or merely letting time pass would have produced just as much improvement as treatment by a mental
health professional. The CR survey, unfortunately, did not
ask those who just talked to friends and clergy to fill out
detailed questionnaires about the results.
This is a serious objection, but there are internal controls which perform many of the functions of control groups.
First, marriage counselors do significantly worse than psychologists, psychiatrists, and social workers, in spite of no
significant differences in kind of problem, severity of problem, or duration of treatment. Marriage counselors control
for many of the nonspecifics, such as therapeutic alliance,
rapport, and attention, as well as for passage of time. Second,
there is a dose-response curve, with more therapy yielding
more improvement. The first point in the dose-response
curve approximates no treatment: people who have less than
one month of treatment have on average an improvement
score of 201, whereas people who have over two years of
treatment have an average score of 241. Third, psychotherapy does just as well as psychotherapy plus drugs for all
disorders, and these drugs have so long a history of outperforming placebo controls that one can infer that psychotherapy likely would have outperformed such controls had
they been run. Fourth, family doctors do significantly worse
than mental health professionals when treatment continues
beyond six months. An objection might be made that since
total length of time in treatment—rather than total amount of
contact—is the covariate, comparing family doctors who do
not see their patients weekly with mental health professionals—who see their patients once a week or more—is not fair.
It is, of course, possible that if family doctors saw their
patients as frequently as psychologists do, the two groups
would do equally well. It was notable, however, that there
were a significant number of complaints about family doctors: 22% of respondents said their doctor had not "provided
emotional support"; 15% said their doctor "seemed uncomfortable discussing emotional issues"; and 18% said their
doctor was "too busy to spend time talking to me." At any
rate, the CR survey shows that long-term family doctoring
for emotional problems—as it is actually performed in the
field—is inferior to long-term treatment by a mental health
professional as it is actually performed in the field.
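The dose-response point above, for instance, rests on a simple comparison of endpoint scores. A toy restatement (the two improvement scores are the article's; treating them as curve endpoints and taking their difference is my illustration, not CR's computation):

```python
# Mean improvement scores at the two extremes of treatment duration,
# as reported in the CR survey.
improvement = {"under 1 month": 201, "over 2 years": 241}

# The first point approximates no treatment, so the gap serves as a
# rough internal stand-in for a treated-versus-untreated comparison.
gain = improvement["over 2 years"] - improvement["under 1 month"]
print(f"long-term advantage: {gain} points")  # long-term advantage: 40 points
```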
It is also relevant that the patients attributed their improvement to treatment and not time (determined by responses to "How much do you feel that treatment helped
you in the following areas?"), and I conclude that the benefits of treatment are very unlikely to be caused by the mere
passage of time. But I also conclude that the CR study could
be improved by control groups whose members are not
treated by mental health professionals, matched for severity
and kind of problem (but beware of the fact that random
assignment will not occur). This would allow the Bayesian
inference that psychotherapy works better than talking to
friends, seeing an astrologer, or going to church to be made
more confidently.
Self-report. CR's mental health survey data, as for
cars and appliances, are self-reported. Improvement, diagnosis, insurance coverage, even kind of therapist are not verified by external check. Patients can be wrong about any of
these, and this is an undeniable flaw.
But two things can be said in response. First, the noise
self-reports introduce—inaccuracy about improvement, incorrectness about the nature of their problem, even inaccuracy about what kind of a therapist they saw—may be random rather than systematic, and therefore would not necessarily bias the study toward the results found. Self-report, in
principle, can be either rosier or more dire than the report of
an external observer. Since most respondents are probably
more emotionally invested in psychotherapy than in their
automobiles, however, it will take further research to determine whether the noise introduced by self-report about
therapy is random or systematic.
Second, the most important potential inaccuracy produced by self-report is inaccuracy about respondents' own
emotional state before and after treatment, and inaccuracy in
ratings of improvement in the specific problem, in productivity at work, and in human relationships. This is, however, an
ever-present inaccuracy even with an experienced diagnostician, and the correlations between self-report and diagnosis
are usually quite high (not surprising, given the common
method variance). Such self-reports are the blood and guts
of a clinical diagnosis. But multiple observers are always a
virtue, and diagnosis by a third party would improve the
survey method noticeably.
Blindness. The CR survey is not double-blind, or
even single-blind. The respondent rates his or her own emotional state, and knows what treatment he or she had. So it is
possible that respondents exaggerate the virtues or vices of
their treatment to comply with or to overthrow their hypotheses about what CR wants to find. I find this far-fetched: If
nonblindness compromised readers' surveys, CR would have
long ago ceased publishing them, since the readers' evaluations of other products and services are always nonblind.
CR validates its data for goods and services in two ways:
against manufacturers' polls and for consistency over time.
Using both methods, CR has been unable to detect systematic distortions in its nonblind surveys of goods and services.
Inadequate outcome measures. CR's indexes of improvement were molar. Responses like "made things a lot better" to the question "How much did therapy help you
with the specific problems that led you to therapy?" tap into
gross processes. More molecular assessment of improvement, for example, "How often have you cried in the last two
weeks?" or "How many ounces of alcohol did you have
yesterday?" would increase the validity of the method. Such
detail would, of course, make the survey more cumbersome.
A variant of this objection is that the outcome measures were insensitive. This objection looms large in light of
the failure to find that any modality of therapy did better than
any other modality of therapy for any disorder. Perhaps if
more detailed, disorder-specific measures were used, the
dodo bird hypothesis would have been disconfirmed.
A third variant of this objection is that the outcome
measures were poorly normed. Questions like "How satisfied
were you with this therapist's treatment of your problem?
Completely satisfied, very satisfied, fairly well satisfied,
somewhat dissatisfied, very dissatisfied, completely dissatisfied," and "How would you describe your overall emotional state? Very poor: I barely managed to deal with things; fairly poor: Life was usually pretty tough for me; so-so: I had my ups and downs; quite good: I had no serious complaints; very good: Life was much the way I liked it to be" are seat-of-the-pants items which depend almost entirely on face validity, rather than on several generations of norming. So the
conclusion that 90% of those people who started off very
poor or fairly poor wound up in the very good, fairly good,
or so-so categories does not guarantee that they had returned to normality in any strong psychometric sense. The
addition of extensively normed questionnaires like the Beck
Depression Inventory would strengthen the survey method
(and make it more cumbersome).
Retrospective. The CR respondents reported retrospectively on their emotional states. While a one-time survey is highly cost-effective, it is necessarily retrospective.
Retrospective reports are less valid than concurrent observation, although an exception is worth noting: waiting for the
rosy afterglow of a newly completed therapy to dissipate, as
the CR study does, may make for a more sober evaluation. The
retrospective method does not allow for longitudinal observation of the same individuals for improvement across time.
Thus the benefits of long-term psychotherapy are inferred
by comparing different individuals' improvements crosssectionally. A prospective study would allow comparison of
the same individuals' improvements over time.
Retrospective observation is a flaw, but it may introduce random rather than systematic noise in the study of
psychotherapy effectiveness. The distortions introduced by
retrospection could go either in the rosier or more dire direction, but only further research will tell us if the distortions of
retrospection are random or systematic.
It is noteworthy that Consumer Reports generally uses
two methods. One is the laboratory test, in which, for example, a car is crashed into a wall at five miles per hour, and
damage to the bumper is measured. The other is the reader's
survey. These two methods parallel the efficacy study and
the effectiveness study, respectively, in many ways. If retrospection was a fatal flaw, CR would have given up the
reader's survey method long ago, since reliability of used
cars and satisfaction with airlines, physicians, and insurance
companies depends on retrospection. Regardless, the survey method could be markedly improved by being longitudinal, in the same way as an efficacy study. Self-report and
diagnosis both could be done before and after therapy, and a
thorough follow-up carried out as well. But retrospective
reports of emotional states will always be with us, since even
in a prospective study that begins with a diagnostic interDecember 1995 • American Psychologist
view, the patient retrospectively reports on his or her (presumably) less troubled emotional state before the diagnosis.
Therapy junkies. Perhaps the important finding
that long-term therapy does so much better than short-term
therapy is an artifact of therapy "junkies," individuals so
committed to therapy as a way of life that they bias the
results in this direction. This is possible, but it is not an
artifact. Those people who spend a long time in therapy may
well be "true believers." Indeed, the long-term patients are
distinct: They have more severe problems initially, are more
likely to have an emotional disorder, are more likely to get
medications, are more likely to see a psychiatrist, and are
more likely to have psychodynamic treatment than the rest of
the sample. Regardless, they are probably representative of
the population served by long-term therapy. This population
reports robust improvement with long-term treatment in the
specific problem that got them into therapy, as well as in
growth, insight, confidence, productivity at work, interpersonal relations, and enjoyment of life.
Perhaps people who had two or more years of therapy
are likely still to be in therapy and thus unduly loyal to their
therapist. They might then be more likely to distort in a rosy
direction. This seems unlikely, since a comparison of people
who had over two years of treatment and then ended therapy
showed the same high improvement scores as those with
over two years of treatment who were still in therapy (242 and
245, respectively).
Nonrandom assignment. The possibility of such
biases could be reduced by random assignment of patients
to treatment, but this would undermine the central virtue of
the CR study—reporting on the effectiveness of psychotherapy as it is actually done in the field with those patients
who actually seek it. In fact, the lack of random assignment
may turn out to be the crucial ingredient in the validity of the
CR method and a major flaw of the efficacy method. Many
(but assuredly not all) of the problems that bring consumers
into therapy have elements of what was called "wanhope" in
the Middle Ages and is now called "demoralization." Choice
and control by a patient, in and of itself, counteracts wanhope
(Seligman, 1991).
Random assignment of patients to a modality or to a
particular therapist not only undercuts the remoralizing effects of treatment but also undercuts the nonrandom decisions of therapists in choice of modality for a particular
patient. Consider, for example, the finding that drugs plus
psychotherapy did no better than psychotherapy alone for
any disorder (schizophrenia and bipolar depression were too
rare for analysis in this sample). The most obvious interpretation is that drugs are useless and do nothing over and
above psychotherapy. But the lack of random assignment
should prevent us from leaping to that conclusion. Assume,
for the moment, that therapists are canny about who needs
drugs plus psychotherapy and who can do well with psychotherapy alone. The therapists assign those patients accordingly so appropriate patients get appropriate treatment.
This is just the same logic as a self-correcting trajectory of
treatment, in which techniques and modalities are modified
with the patient's progress. This means that drugs plus
psychotherapy may actually have done pretty well after all—
but only in a cannily selected subset of people.
The upshot of this is that random assignment, the
prettiest of the methodological niceties in efficacy studies,
may turn out to be worse than useless for the investigation
of the actual treatment of mental illness in the field. It is worth
mulling over what the results of an efficacy or effectiveness
study might be if half the patients with a particular disorder
were randomly assigned and were compared with half the
patients not randomly assigned. Appropriately assigning
individuals to the right treatment, the right drug, and the
right sequence of techniques, along with individuals' choosing a therapist and a treatment they believe in, may be crucial
to getting better.
The Ideal Study
The CR study, then, is to be taken seriously—not only for its
results and its credible source, but for its method. It is large-scale; it samples treatment as it is actually delivered in the
field; it samples without obvious bias those who seek out
treatment; it measures multiple outcomes including specific
improvement and more global gains such as growth, insight,
productivity, mood, enjoyment of life, and interpersonal relations; it is statistically stringent and finds clinically meaningful results. Furthermore, it is highly cost-effective.
Its major advantage over the efficacy method for studying the effectiveness of psychotherapy and medications is
that it captures how and to whom treatment is actually delivered and toward what end. At the very least, the CR study
and its underlying survey method provide a powerful addition to what we know about the effectiveness of psychotherapy and a pioneering way of finding out more.
The study is not without flaws, the chief one being the
limited meaning of its answer to the question "Can psychotherapy help?" This question has three possible kinds of
answers. The first is that psychotherapy does better than
something else, such as talking to friends, going to church,
or doing nothing at all. Because it lacks comparison groups,
the CR study only answers this question indirectly. The
second possible answer is that psychotherapy returns people
to normality or, more liberally, to within, say, two standard
deviations of the average. The CR study, lacking an untroubled group and lacking measures of how people were
before they became troubled, does not answer this question.
The third answer is "Do people have fewer symptoms and a
better life after therapy than they did before?" This is the
question that the CR study answers with a clear "yes."
The CR study can be improved upon, allowing it to
speak to all three senses of "psychotherapy works." These
improvements would combine several of the best features of
efficacy studies with the realism of the survey method. First,
the survey could be done prospectively: A large sample of
those who seek treatment could be given an assessment
battery before and after treatment, while still preserving
progress-contingent treatment duration, self-correction, multiple problems, and self-selection of treatment. Second, the
assessment battery could include well-normed questionnaires
as well as detailed, behavioral information in addition to more
global improvement information, thus increasing its sensitivity and allowing it to answer the return-to-normal question.
Third, blind diagnostic workups could be included, adding
multiple perspectives to self-report.
At any rate, Consumer Reports has provided empirical
validation of the effectiveness of psychotherapy. Prospective and diagnostically sophisticated surveys, combined with
the well-normed and detailed assessment used in efficacy
studies, would bolster this pioneering study. They would be
expensive, but, in my opinion, very much worth doing.
REFERENCES
Consumer Reports. (1994). Annual questionnaire.
Consumer Reports. (1995, November). Mental health: Does therapy
help? pp. 734-739.
Howard, K., Kopta, S., Krause, M., & Orlinsky, D. (1986). The dose-effect relationship in psychotherapy. American Psychologist, 41, 159-164.
Howard, K., Orlinsky, D., & Lueger, R. (1994). Clinically relevant
outcome research in individual psychotherapy. British Journal of
Psychiatry, 165, 4-8.
Lipsey, M., & Wilson, D. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.
Luborsky, L., Singer, B., & Luborsky, L. (1975). Comparative studies
of psychotherapies. Archives of General Psychiatry, 32, 995-1008.
Muñoz, R., Hollon, S., McGrath, E., Rehm, L., & VandenBos, G.
(1994). On the AHCPR guidelines: Further considerations for
practitioners. American Psychologist, 49, 42-61.
Seligman, M. (1991). Learned optimism. New York: Knopf.
Seligman, M. (1994). What you can change & what you can't. New
York: Knopf.
Shapiro, D., & Shapiro, D. (1982). Meta-analysis of comparative
therapy outcome studies: A replication and refinement. Psychological Bulletin, 92, 581-604.
Smith, M., Glass, G., & Miller, T. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.