Psychological Assessments and Measurements

Description

Bornstein, R. F. (2011). Toward a process-focused model of test score validity: Improving psychological assessment in science and practice. Psychological Assessment, 23(2), 532–544.

Then consider a construct (psychological variable) of interest in your specialty area, and a test (you can use the Buros website to select one) designed to assess that specific construct.

Compare and contrast the PF and traditional models of validity.
Describe why Bornstein maintains that it is important to use the PF model of validity.
Using the 4 steps outlined by Bornstein, how would you assess validity for your chosen construct and the test you selected to assess it?

Bornstein 2011.pdf


Psychological Assessment 2011, Vol. 23, No. 2, 532–544
© 2011 American Psychological Association 1040-3590/11/$12.00 DOI: 10.1037/a0022402

Toward a Process-Focused Model of Test Score Validity: Improving Psychological Assessment in Science and Practice

Robert F. Bornstein
Adelphi University

Although definitions of validity have evolved considerably since L. J. Cronbach and P. E. Meehl’s classic (1955) review, contemporary validity research continues to emphasize correlational analyses assessing predictor–criterion relationships, with most outcome criteria being self-reports. The present article describes an alternative way of operationalizing validity—the process-focused (PF) model. The PF model conceptualizes validity as the degree to which respondents can be shown to engage in a predictable set of psychological processes during testing, with those processes dictated a priori by the nature of the instrument(s) used and the context in which testing takes place. In contrast to the traditional approach wherein correlational methods are used to quantify the relationship between test score and criterion, the PF model uses experimental methods to manipulate variables that moderate test score–criterion relationships, enabling researchers to draw more definitive conclusions regarding the impact of underlying psychological processes on test scores. By complementing outcome-based validity assessment with a process-driven approach, researchers will not only improve psychology’s assessment procedures but also enhance their understanding of test bias and test score misuse by illuminating the intra- and interpersonal factors that lead to differential performance (and differential prediction) in different groups.

Keywords: validity, construct validity, psychological assessment, psychometrics, test bias

If there is a single challenge that characterizes all of psychology’s diverse subfields, that challenge is assessment. Psychologists measure things. These “things” take many forms, including observable behaviors and hidden mental states; dyadic interactions and intergroup dynamics; changes in traits, symptoms, skills, and abilities over time; and a broad array of neurophysiological and neurochemical processes (along with their associated behaviors and mental activities). Given psychologists’ near universal reliance on assessment, it is not surprising that researchers have devoted considerable time and effort to maximizing test score validity—to ensuring that researchers’ assessment tools measure what we think they do.

Maximizing test score validity is not merely an academic exercise, but one that goes to the heart of psychological science and practice, with widespread social implications. The availability of measures that yield scores with strong validity evidence enables psychologists to enhance the scientific rigor of their research, make accurate decisions in applied settings, and use and interpret test results fairly, in unbiased ways. To be sure, the existence of assessment tools that yield well-validated scores does not ensure scientific rigor and accurate, unbiased decision making, but the absence of such tools guarantees that neither of these things will occur. Thus, continued efforts to increase our understanding of test score validity and improve our validation methods will benefit the science and practice of psychology in myriad ways.

The present article contributes to that effort by describing an approach to validity—the process-focused (PF) model—that differs markedly from the traditional perspective. In contrast to the traditional approach wherein the heart of validity lies in outcome—in the relation of predictor scores to some criterion measure—the PF model conceptualizes validity as the degree to which respondents can be shown to engage in a predictable set of psychological processes during assessment, with those processes dictated a priori by the nature of the instrument(s) used, and context in which testing takes place. The PF model differs from traditional validity assessment not only with respect to how validity is conceptualized but also with respect to empirical emphasis: In contrast to the traditional approach wherein correlational methods are used to quantify the relationship between test score and criterion, the PF model uses experimental methods to manipulate variables that moderate test score–criterion relationships, enabling researchers to draw more definitive conclusions regarding the impact of underlying processes (e.g., autobiographical memory search in response to a self-report questionnaire item) and moderating variables (e.g., motivation, mood state) on test scores. In short, the PF model shifts the emphasis of validity theory and research from outcome to process, and from correlation to experimentation. By complementing traditional validity assessment with a process-driven approach, we will not only improve psychology’s assessment procedures but also enhance researchers’ understanding of test bias and test score misuse by illuminating the underlying intra- and interpersonal dynamics that lead to differential performance (and differential prediction) in different groups.

I begin by reviewing the traditional concept of validity and its limitations, and the evolution of validity theory and research during the past several decades. I then outline an alternative process-focused model of validity. I discuss how data from process-focused and outcome-focused studies may be combined and conclude by elaborating research, practice, and social policy implications of an integrated perspective.

This material is based on work supported by National Science Foundation Grant 0822010. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. I thank Joseph Masling, Carolyn Morf, and Kathleen Slaney for their helpful comments on earlier versions of this paper and Julia Biars and Alexandra Rosen for help in collecting and coding studies included in Table 2. Correspondence concerning this article should be addressed to Robert F. Bornstein, Derner Institute of Advanced Psychological Studies, Adelphi University, 212 Blodgett Hall, Garden City, NY 11530. E-mail: bornstein@adelphi.edu

The Traditional Conceptualization of Test Score Validity

Although psychology has a long history of using standardized assessment instruments to quantify aptitude, attitude, achievement, personality, and psychopathology, contemporary validity theory and research began in earnest in the mid-20th century, with the publication of Cronbach and Meehl’s (1955) “Construct Validity in Psychological Tests,” and 4 years later, Campbell and Fiske’s (1959) “Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix.” Most theoretical analyses and empirical studies of validity during the past 50 years have taken as their starting point ideas and assumptions outlined in these two seminal papers.1

Following the logic of Cronbach and Meehl (1955) and Campbell and Fiske (1959), validity has traditionally been operationalized as a statistic, the validity coefficient (usually expressed as r), that reflects the magnitude of the relationship between a predictor (the test score) and some criterion (an outcome measure). A wide variety of criteria are predicted by psychological tests, some overt and readily observable, others hidden and only detectable indirectly.
When an observable criterion (e.g., suicide attempts) is assessed, the validity coefficient is said to be an index of criterion validity; when an unobservable construct (e.g., suicidal ideation) is assessed, the validity coefficient is an index of construct validity. Construct validity is in turn divided into convergent validity (the degree to which a test score is associated with some theoretically related variable), and discriminant validity (the degree to which a test score is unrelated—or minimally related—to a theoretically unrelated variable). Criterion validity can be operationalized in terms of concurrent validity (when the test score is used to assess some outcome in the here-and-now), and predictive validity (when the test score is used to predict some outcome in the future), although as psychometricians have noted, the point in time at which concurrent validity morphs into predictive validity is difficult to specify, and varies as a function of the criterion being assessed and purpose of the assessment.

There are also a number of validity-related variables that are not indices of validity in its strictest sense, but are nonetheless germane in the present context. For example, researchers often seek to quantify internal validity using analyses of individual test items and response patterns to derive estimates of internal consistency and factor structure. Though these coefficients do not address issues regarding the degree to which test scores predict external variables or outcomes, when internal reliability data conform to a priori expectations regarding item interrelationships, factors, and clusters, these data indirectly support the construct validity of scores derived from the measure.
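As a rough illustration of the validity coefficient described above, the following sketch computes Pearson's r between a test score and two comparison measures. The scores are invented for illustration only (they are not data from the article), and the function and variable names are hypothetical. Under these assumptions, a high r with the theoretically related measure would be read as convergent evidence, and a near-zero r with the unrelated measure as discriminant evidence.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six respondents (illustration only, not real data).
test_scores = [12, 15, 9, 20, 17, 11]        # the test being validated
related_measure = [30, 34, 25, 41, 38, 27]   # theoretically related variable
unrelated_measure = [5, 5, 6, 6, 4, 4]       # theoretically unrelated variable

convergent_r = pearson_r(test_scores, related_measure)
discriminant_r = pearson_r(test_scores, unrelated_measure)

print(f"convergent validity coefficient:   r = {convergent_r:.2f}")
print(f"discriminant validity coefficient: r = {discriminant_r:.2f}")
```

Both coefficients here are purely observational indices of association, which is the property of traditional validity evidence that the article goes on to question.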
Test score reliability— especially retest reliability—also has implications for validity: When a test is designed to quantify a construct that is presumed to be stable over time (e.g., manual dexterity, narcissism), inadequate retest reliability is prima facie evidence of a problem with test score validity. Most psychometricians agree that face validity—test “obviousness”—is not a true index of validity.2 It is important to note that regardless of whether one is considering criterion, construct, convergent, or discriminant validity, traditional validity coefficients all have two things in common. First, they are all indices of strength of association, and represented as correlations of one kind or another. Second, they are all assessed observationally. Even when test scores and outcome measures are obtained in laboratory settings under highly controlled conditions, they invariably reflect the pairing of data collected during one period of observation (the testing) with data collected during a second period of observation (the comparison measure). Although traditional correlational methods continue to be widely used in validity research, in recent years psychometricians have increasingly used confirmatory factor analysis (CFA)—a variant of structural equation model (SEM)—to examine hypothesized causal relations among variables and draw inferences regarding underlying process links (see Hershberger, 2003, for a historical review). SEM is particularly useful in enabling researchers to delineate latent variables—variables not assessed directly by the tests administered, but which emerge from a well-specified model when predicted patterns of variable intercorrelations are obtained (see Ullman, 2006). 
In certain instances, these latent variables represent underlying, unobservable psychological processes; to obtain more definitive evidence regarding underlying process links, latent variables that emerge from SEM analyses may then be explored via experimental studies wherein key parameters are manipulated (Bollen & Davis, 2009; Hagemann & Meyerhoff, 2008; Schumacker & Lomax, 2004). As Schumacker and Lomax (2004) noted, “In structural equation modeling, the amount of influence rather than cause-and-effect relationship is assumed and interpreted by direct, indirect, and total effects among variables . . . . Model testing involves the use of manipulative variables, which, when changed, affect the model outcome values, and whose effects can hence be assessed.” (p. 56)3

1 Other seminal early papers on construct validity were MacCorquodale and Meehl’s (1948) discussion of the epistemological challenges involved in validating scores derived from measures of hypothetical constructs and Loevinger’s (1957) discussion of construct validation as one component of psychologists’ broader efforts to develop and refine theoretical concepts.

2 In this context, it is important to distinguish the narrow use of the terms internal and external validity, as these terms apply to test scores from the more general use of these terms by Campbell and Stanley (1963), who discussed various threats to the integrity of psychological assessments and experimental designs (see Slaney & Maraun, 2008).

Refinements of the Traditional View

SEM and CFA have had a profound influence on validity research in recent years (see Hershberger, 2003; Tomarken & Waller, 2005; Ullman, 2006). Beyond these innovative statistical techniques, three substantive conceptual refinements of the traditional view of validity have emerged; each extends this view in an important way.

Construct Representation

Embretson (1983) distinguished the long-standing goal of construct validity, the weaving of a “nomological net” of relationships between test score and an array of theoretically related variables (which she termed “nomothetic span”) from a complementary goal that she termed “construct representation”: efforts to identify the theoretical mechanisms that underlie item responses. Drawing primarily from research on cognitive modeling, Embretson (1983, 1994; Embretson & Gorin, 2001) advocated the use of direct observation of testees, path analysis, posttest interview data, and other external indices to illuminate the processes in which people engage while completing psychological tests. Since Embretson’s introduction of the concept of construct representation, numerous researchers have used these techniques to deconstruct the cognitive processes that occur when respondents engage items from various measures of aptitude, intelligence, and mental ability, adding expert ratings of item content and causal modeling techniques as additional methods for evaluating construct representation (see Cramer, Waldrop, van der Maas, & Borsboom, 2010; Kane, 2001; Mislevy, 2007).

Attribute Variation

Noting that measures of test score–criterion association provide limited information regarding the degree to which a test actually measures the variable it purports to assess, Borsboom, Mellenbergh, and van Heerden (2004) outlined an attribute variation approach, arguing that rigorous validity assessment requires demonstrating that changes in an attribute can be linked directly to changes in scores on a test designed to measure that attribute. Consistent with Embretson’s (1983) construct representation view, Borsboom et al. (2004) suggested that “somewhere in the chain of events that occurs between item administration and item response, the measured attribute must play a causal role in determining what value the measurement’s outcomes will take” (p. 1062).
Emphasizing naturally occurring variations in traits and abilities rather than direct manipulation of underlying variables, Borsboom et al. cited latent class analyses that detect Piagetian developmental shifts in children’s reasoning over time (e.g., Jansen & Van der Maas, 1997, 2002) as exemplars of the attribute variation approach (see also Strauss & Smith, 2009, for examples of the attribute variation approach in clinical assessment).

Consequential Validity

Originally described by Cronbach (1971), and elaborated extensively by Messick (1989, 1994, 1995), the concept of consequential validity represents a very different perspective on test score validation. According to this view, validity lies not only in the degree to which a test score is capable of predicting some theoretically related outcome but also in the degree to which that test score is actually used (and interpreted) in such a way as to yield valid data (see also Kane, 1992, for a related discussion). Thus, evidential (research-based) validity can be distinguished from consequential (impact-based) validity, the former representing a test’s potential to provide accurate, useful, and unbiased information, and the latter representing the degree to which the test truly does yield accurate, useful, and unbiased assessment data in vivo.

Inherent in the consequential validity framework is the assumption that an evidentially valid test score can provide consequentially valid data in certain contexts, but consequentially invalid data in others, depending on how the test score is interpreted. For example, intelligence test scores may be interpreted differently in two different schools, and psychopathology scores may be interpreted differently in two different clinics; in both situations, the test score in question might well yield consequentially valid information in one setting and consequentially invalid information in the other.

3 Tomarken and Waller (2005) provided a particularly thorough and balanced review of the advantages and limitations of SEM, noting that although SEM is a powerful method for testing parameters of models that incorporate various combinations of observed and latent variables, it cannot provide definitive support for an hypothesized set of variable interrelations in the absence of external confirming evidence. Tomarken and Waller concluded that SEM “is arguably the most broadly applicable statistical procedure currently available”, but went on to note that “SEM is not, however, a statistical magic bullet. It cannot be used to prove that a model is correct and it cannot compensate for a poorly designed study” (p. 56).

Validity Assessment in Theory and in Practice

As Jonson and Plake (1998) noted, a major trend in validity assessment since the mid-1950s has been a shift from conceptualizing validity in terms of discrete and separable subtypes (e.g., concurrent, predictive) to a more integrative approach wherein validity is conceptualized as a unitary concept (see Messick, 1995, for a detailed discussion of this issue). The notion that the overarching concept of construct validity can incorporate a broad spectrum of evidence was implied in Cronbach and Meehl’s (1955) seminal paper, suggested more directly by Loevinger (1957), made explicit by Cronbach (1971) and Messick (1989), and is now the most widely accepted framework for unifying and integrating various aspects of test score validity (see Shepard, 1993; Slaney & Maraun, 2008).

Influenced by Messick’s (1989, 1994, 1995) seminal conceptual analyses of test score validity (see also Cronbach, 1971), the most recent edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) describes validity as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests . . . . The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed test score interpretations” (p. 9). Consistent with most contemporary psychological assessment texts (e.g., McIntire & Miller, 2000), the 1999 Standards still enumerates distinct types of validity evidence (see Table 1 for a summary of these categories), but also argues that distinctions among various types of validity evidence are less sharp than earlier frameworks had suggested and that multiple forms of converging evidence should be used to establish the validity of test scores within a particular context.

To the degree that psychologists have used the broad-based validation strategies described in the 1999 edition of the Standards, one would expect that during the past decade, psychology has moved beyond the traditional correlational-observational conceptualization of validity to a more integrative view. But have theoretical shifts in researchers’ conceptualization of validity altered the practice of validity assessment in vivo? A review of present validity practices suggests that they have not.

Since publication of the 1999 Standards, there have been two major reviews of researchers’ operationalization and assessment of test score validity. In the first, Hogan and Agnello (2004) surveyed 696 research reports from the American Psychological Association’s Directory of Unpublished Experimental Mental Measures (Goldman & Mitchel, 2003), identifying the types of validity evidence reported for each measure. They found that for 87% of measures, the only validity evidence reported involved correlations between test scores and scores on other self-report scales.
Only 5% of measures had been validated using behavioral outcome criteria (e.g., work performance, academic course grades). No entries reported validity evidence wherein experimental procedures were used to manipulate participants’ response processes. Summarizing the implications of their findings, Hogan and Agnello (2004) concluded that despite recent definitional shifts, “the vast majority [of studies] reported correlations with other variables . . . . and little use was made of the numerous other types of validation approaches” (p. 802).

Table 1
Types of Validity Evidence in the 1999 Standards

| Type of validity evidence | Typical validation procedure |
| --- | --- |
| Evidence based on test content | Logical analysis and expert ratings of item content, item-construct fit, item relevance, universe sampling, and criterion contamination |
| Evidence based on response processes | Interview-based and observational analyses of participants’ responses to items or tasks; comparison of process differences across groups; studies of observer/interviewer decision processes |
| Evidence based on internal structure | Factor analyses, cluster analyses, item analyses, differential item functioning studies |
| Evidence based on relations to other variables | Concurrent and predictive validity, convergent and discriminant validity, validity generalization, criterion group differences, studies examining impact of interventions/manipulations on test scores, longitudinal studies |
| Evidence based on consequences of testing | Studies of expected/obtained benefits of testing; studies of unintended negative consequences |

Note. A complete description of types of validity evidence and validation strategies is included in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Discussions of these validity categories are found in Goodwin and Leech (2003), Jonson and Plake (1998), and Messick (1995).

Similar results were obtained by Cizek, Rosenberg, and Koons (2008), who surveyed sources of validity evidence for all 283 tests reviewed in the 16th edition of the Mental Measurements Yearbook (MMY; Spies & Plake, 2005). Cizek et al. found that the vast majority of reports relied exclusively on correlational methods to evaluate test score validity, with only five of 283 entries (1.8%) assessing participant response processes. The proportion of MMY entries reporting process data was highest for developmental tests (5.9%), followed by behavioral measures (4.0%), achievement tests (3.7%), and tests of cognitive skills (1.5%). In every other test category (i.e., attitude, motor skill, personnel, personality, psychopathology, social, and vocational), the number of process-based validity studies was zero.

A review of recent empirical literature examining test score validity confirms the findings of Hogan and Agnello (2004) and Cizek et al. (2008). Table 2 summarizes these data, analyzing the measures and methods used by validity researchers in the five journals that have published the greatest number of validity studies during 2006–2008.4 As Table 2 shows, despite shifts in the theoretical conceptualization of validity (e.g., Messick, 1989) and the recommendations of the 1999 Standards, the procedures used by assessment researchers to evaluate the validity of test scores remain largely unchanged. The vast majority of validity studies published in leading journals used correlational methods (91%), relying exclusively on self-report outcome measures (79%). Only 9% of studies in the five leading validity journals used experimental procedures.
To obtain a more complete picture of the methods used in correlational validity investigations in Table 2, the 442 studies in this category were classified into two subgroups: (a) studies in which SEM and CFA were used to examine patterns of variable interrelations and (b) studies that simply assessed the magnitude of association between predictor and criterion. Analysis revealed that 104 of 442 studies (24%) used CFA and/or SEM; the remaining 338 studies (76%) reported predictor-criterion correlational analyses. Proportions of studies in which SEM/CFA was used ranged from a low of 15% (Journal of Personality Assessment) to a high of 36% (Educational and Psychological Measurement), with Psychological Assessment (21%), Assessment (22%), and Journal of Psychoeducational Assessment (24%) falling between these extremes.

4 These journals were identified via a PsycINFO search conducted in May 2009 using the keywords Validity, Construct Validity, Criterion Validity, and Validation. Interrater reliability in coding validity article characteristics was determined using procedures analogous to those of Cizek et al. (2008): All articles were coded by the author, and a second rater unaware of these initial codings independently recoded a sample of 100 articles (approximately 20% of the total). Agreement in coding was 98% for study design/method (correlational vs. experimental), and 93% for outcome/criterion (self-report vs. alternative measure).
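The subgroup percentages reported above can be checked with a few lines of arithmetic. The sketch below uses only counts stated in the text and in Table 2 (486 validity articles, 442 correlational studies, 104 using SEM/CFA); the variable and function names are hypothetical, and rounding to the nearest whole percent is an assumption about how the published figures were produced.

```python
# Counts taken from the text and Table 2.
total_articles = 486         # all validity articles, 2006-2008
correlational_studies = 442  # studies using correlational designs
sem_cfa_studies = 104        # correlational studies using CFA and/or SEM

# The remainder reported simple predictor-criterion correlations.
simple_correlation_studies = correlational_studies - sem_cfa_studies


def percent(part, whole):
    """Share of `whole` represented by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


correlational_pct = percent(correlational_studies, total_articles)
sem_cfa_pct = percent(sem_cfa_studies, correlational_studies)
simple_pct = percent(simple_correlation_studies, correlational_studies)

print(f"correlational designs: {correlational_pct}% of {total_articles} articles")
print(f"SEM/CFA: {sem_cfa_studies} of {correlational_studies} = {sem_cfa_pct}%")
print(f"simple correlations: {simple_correlation_studies} of "
      f"{correlational_studies} = {simple_pct}%")
```

Under that rounding assumption the results agree with the reported 91%, 24%, and 76%.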

Table 2
Validity Assessment Strategies, 2006–2008

| Journal | Number of validity articles | Proportion of studies using correlational designs | Proportion of studies using experimental designs | Proportion of studies using self-report outcome measures | Proportion of studies using alternative outcome measures |
| --- | --- | --- | --- | --- | --- |
| Assessment | 93 | 94% | 6% | 78% | 22% |
| Educational & Psychological Measurement | 93 | 92% | 8% | 91% | 9% |
| Journal of Personality Assessment | 131 | 89% | 11% | 75% | 25% |
| Journal of Psychoeducational Assessment | 49 | 94% | 6% | 71% | 29% |
| Psychological Assessment | 120 | 91% | 9% | 76% | 24% |
| Overall | 486 | 91% | 9% | 79% | 21% |

Note. Validity articles include all articles that reported data regarding the construct validity of scores derived from a psychological test (even if these data were not explicitly identified as validity evidence by the authors). Only articles reporting original data were coded; literature reviews, meta-analyses, comments, and case studies were excluded. Studies using experimental designs used a manipulation with two or more conditions to contrast test responses in different groups (e.g., contrasting instructions, different pretest primes, treatment intervention vs. control/no treatment prior to test administration). Alternative outcome measures were any outcome measures not based on participant self-report (e.g., recording of participant behavior in laboratory or field, physiological measures, reports by knowledgeable informants, behaviors coded from archival/chart records).

Thus, the patterns in Table 2 confirm what earlier reviews had suggested: There is a substantial disconnect between the idealized descriptions of test score validity offered by psychometricians and the everyday practices of validity researchers in the laboratory, clinic, classroom, and field. Conceptual shifts notwithstanding, validity assessment in psychology remains much as it was 50 years ago.
One cannot ascertain from these data why the test score validation process has not changed appreciably in response to theoretical shifts. Continued reliance on traditional methods might be due in part to the fact that the definition of validity in the 1999 Standards—although potentially useful—is somewhat vague, without clear guidelines regarding operationalization, implementation, and integration of different forms of validity evidence. Slow progress in this area might also be due in part to the fact that most discussions of construct representation and the methods used to assess it have focused exclusively on tests of cognitive ability, not extending these principles to measures of personality, psychopathology, attitudes, interests, motives, and values (see Embretson, 1998; Embretson & Gorin, 2001; Kane, 2001; Mislevy, 2007). Our continued reliance on traditional validation methods might also reflect a more generalized reluctance among assessment researchers to move beyond well-established methods that—although flawed—are widely accepted in the scientific community. Whatever the cause, recent theoretical refinements have had minimal impact on the validation efforts of psychometricians, and the difficulties that have long characterized the measurement of validity remain. Validity continues to be conceptualized primarily in terms of observation and correlation, and operationalized as a validity coefficient (or set of validity coefficients). Most psychological test scores yield validity coefficients in the small to moderate range (Meyer et al., 2001), predicting a modest amount of variance in outcome. Not only do psychologists rely primarily on self-report outcome measures to validate scores derived from psychological tests (Bornstein, 2001), they frequently discuss validity evidence based on self-reports as if this evidence reflected actual behavior (Bornstein, 2003). 
As numerous researchers have noted, most studies using traditional validity assessment procedures fail to examine differential test score validities in different contexts and settings (Mischel, 1984; Mischel, Shoda, & Mendoza-Denton, 2002), in different groups of respondents (Young, 2001), and as a function of the stated or implied purpose of the test (Steele & Aronson, 1995). Psychology’s reliance on correlational methods to quantify validity would be problematic in the best of circumstances, but is especially so given the nature of our discipline: As Meehl (1978) observed, in the social sciences, everything correlates with everything, at least to some degree. Meehl’s wry observation echoes an earlier conclusion by Guilford (1946) that summarized in stark terms the fundamental problem with the traditional correlational-observational conceptualization of validity: When validity is equated with magnitude of predictor–criterion association, a test score is, by definition, valid for anything with which it correlates.

Psychology can do better. By using experimental manipulations to alter respondents’ psychological processes during testing and assessing the impact of these manipulations on test scores, strong conclusions can be drawn regarding whether or not a test score is actually measuring what it is thought to measure. When these data are combined with traditional predictor–criterion association results, validity assessment will become more rigorous, the utility of psychology’s measurement procedures will be enhanced, and test bias can be minimized.

A PF Model of Validity

In the PF model, validity is conceptualized as the degree to which respondents can be shown to engage in a predictable set of psychological processes during testing; once these processes are identified, experimental manipulations are introduced to alter these processes and determine whether the manipulations affect test scores in meaningful ways.
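To make the contrast with correlational evidence concrete, the sketch below compares test scores from two experimental conditions, for example a control group and a group tested after a manipulation meant to alter the response process. The scores are invented for illustration, the names are hypothetical, and Cohen's d with a pooled standard deviation is only one possible way to summarize the effect; the article does not prescribe a particular statistic.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical test scores (illustration only, not real data).
control_scores = [14, 16, 15, 13, 17, 15]      # standard administration
manipulated_scores = [18, 20, 19, 17, 21, 19]  # after the manipulation

mean_shift = mean(manipulated_scores) - mean(control_scores)
effect_size = cohens_d(manipulated_scores, control_scores)

print(f"mean shift in test score: {mean_shift:.1f} points")
print(f"standardized effect (Cohen's d): {effect_size:.2f}")
```

In a process-focused study the interest lies in whether the direction and size of such a shift match what the hypothesized process predicts, not in the size of a predictor-criterion correlation.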
The PF framework reverses the usual procedure for dealing with extraneous variables that alter psychological test scores: Rather than regarding them as problematic, the PF model conceptualizes variables that are seen as confounds in traditional validity assessment (e.g., self-presentation effects) as opportunities for manipulation, exploration, and focused analysis, that is, as windows on underlying processes that would otherwise remain hidden.

Instrument-Based Processes

To a substantial degree, the psychological activities in which people engage when responding to psychological tests are determined by the nature of the instruments themselves: by the types of questions asked and the tasks and activities required of the respondent. Table 3 uses a process-based framework to classify the array of assessment tools used by psychologists today, grouping these instruments into six categories based on the mental activities and behaviors involved in responding to these tests (see also Bornstein, 2007, for a detailed discussion of this issue).5 As Table 3 shows, self-attribution tests (which are usually described as objective or self-report tests; e.g., the NEO Personality Inventory; Costa & McCrae, 1985) typically take the form of questionnaires wherein people are asked to acknowledge whether or not each of a series of descriptive statements is true of them, or rate the degree to which these statements describe them accurately. Stimulus attribution tests require people to interpret ambiguous stimuli, and here the fundamental task is to attribute meaning to a stimulus that can be interpreted in multiple ways; this attribution process occurs in much the same way as the attributions that each of us make dozens of times each day as we navigate the ambiguities of the social world (e.g., when we attempt to interpret our friend’s failure to greet us as we pass on the street; see Kawada, Oettingen, Gollwitzer, & Bargh, 2004).
Performance-based tests include the Bender (1938) Visual-Motor Gestalt Test; the Implicit Association Test (Nosek, Greenwald, & Banaji, 2005); occupational screening tools that require behavior samples as part of the assessment; and various intelligence, mental state, and neuropsychological measures. Within the PF framework, performance-based tests are distinguished from stimulus-attribution tests because different processes are involved: Whereas performance-based tests require the respondent to perform structured behavioral tasks (e.g., copy figures from cards, assemble jigsaw puzzles), with performance evaluated according to predefined scoring criteria, respondents’ scores on stimulus-attribution tests like the Rorschach Inkblot Method (Rorschach, 1921) and Thematic Apperception Test (Murray, 1943) are derived from open-ended descriptions and elaborations of test stimuli. In the PF framework, constructive tests are also distinguished from stimulus-attribution tests, because constructive tests require respondents to create (literally, to “construct”) novel products (e.g., drawings, written descriptions) with minimal guidance from the examiner and no test stimulus physically present (e.g., Machover’s, 1949, Draw-a-Person [DAP] test). In contrast to stimulus-attribution tests, which require respondents to describe stimuli whose essential properties were determined a priori, in constructive tests the “stimulus” exists only in the mind of the respondent (e.g., a self-schema or parental image). Continuing through Table 3, observational measures (as are often used to quantify behavior in hospitals, classrooms, shopping malls, and other settings; e.g., Baltes, 1996; Sproull, 1981), may be distinguished from informant-report tests (wherein data are derived from knowledgeable informants’ descriptions or ratings; e.g., Achenbach, Howell, Quay, & Conners, 1991).
Though in both cases, judgments are made by an individual other than the person being evaluated, different processes are involved in generating these judgments, with observational measures based on direct observation and immediate recording of behavior, and informant-report tests based on informants’ retrospective, memory-derived conclusions regarding characteristics of the target person (see Meyer et al., 2001, for a discussion of self- vs. informant-derived psychological test data).

Context-Based Influences

In contrast to instrument-based processes, which are inherent in the measure itself, context-based influences are situational factors that alter test responses by influencing respondents’ motivations and goals, or by modifying aspects of respondents’ cognitive or emotional processes during testing. Context-based influences not only reflect external variables that affect test performance (historically conceptualized as confounds to be minimized) but also represent potential manipulations that, when used to alter instrument-based processes in theoretically meaningful ways, provide unique information regarding the mental operations that occur during testing. Context-based influences may be divided into four categories.
Assessment setting. Assessment setting effects are shaped not only by the physical milieu in which testing occurs (e.g., corporate office, psychiatric hospital, research laboratory) but also by respondents’ perceptions of, and beliefs regarding, this milieu (see Butcher, 2002, and Rosenthal, 2003, for examples). Thus, a student who had a pleasant experience in an earlier learning disability (LD) assessment is likely to approach testing more openly (less defensively) than one whose past LD assessment experiences were negative. A person voluntarily seeking admission to a psychiatric unit will respond to self-attribution test items in an intake packet quite differently than a person who has been brought to the unit involuntarily.
Someone completing psychological tests as part of their induction into the military is likely to approach testing very differently if they have been drafted than if they volunteered for service.
Instructional set. Studies have demonstrated that the way an instrument is labeled and described influences the psychological processes that occur during testing. For example, Steele and Aronson (1995) found that African American (but not Caucasian) college students perform more poorly on Scholastic Aptitude Test (SAT) items when these items are identified as indices of intelligence than when the same items are identified as indices of problem-solving ability; presumably the increased anxiety experienced by African American students who are concerned that their performance might confirm a preexisting racial stereotype diminishes attentional capacity and temporarily impairs certain cognitive skills. Using a very different paradigm and set of outcome measures, Bornstein, Rossner, Hill, and Stepanian (1994) found that college students’ self-attributed interpersonal dependency scores increased when testing was preceded by a positive description of dependency-related traits and behaviors, but decreased when testing was preceded by a negative description of dependency. These same students’ stimulus-attribution-based dependency scores (i.e., scores on Masling, Rabie, & Blondheim’s, 1967, Rorschach Oral Dependency [ROD] scale) were unaffected by instructional set.6

5 Portions of this section are adapted from Bornstein (2007, pp. 203–204).

Table 3
A Process-Based Framework for Classifying Psychological Tests

Test category | Key characteristics | Representative tests
Self-attribution | Test scores reflect the degree to which the person attributes various traits, feelings, thoughts, motives, behaviors, attitudes, or experiences to him- or herself. | NEO Personality Inventory; Strong Vocational Interest Blank; Beck Depression Inventory
Stimulus-attribution | Person attributes meaning to an ambiguous stimulus, with attributions determined in part by stimulus characteristics and in part by the person’s cognitive style, motives, emotions, and need states. | Rorschach Inkblot Method; Thematic Apperception Test
Performance based | Test scores are derived from person’s unrehearsed performance on one or more structured tasks designed to tap on-line behavior and responding. | Wechsler Adult Intelligence Scale; Bender Visual-Motor Gestalt Test
Constructive | Generation of test responses requires person to create or construct a novel image or written description within parameters defined by the tester. | Draw-a-Person Test; Qualitative and Structural Dimensions of Object Relations
Observational | Test scores are derived from observers’ ratings of person’s behavior exhibited in vivo, or in a controlled setting. | Spot Sampling; Behavior Trace Analysis
Informant report | Test scores are based on knowledgeable informants’ ratings or judgments of a person’s characteristic patterns of behavior and responding. | SWAP-200; Informant-Report version of the NEO Personality Inventory

Note. Adapted from Table 1 in “Toward a Process-Based Framework for Classifying Personality Tests: Comment on Meyer and Kurtz (2006)” by R. F. Bornstein, 2007, Journal of Personality Assessment, 89, pp. 202–207. Copyright 2007 by Taylor & Francis. Reprinted with permission.

Affect state. A respondent’s emotional state (e.g., elated, depressed, anxious) affects test responses in at least two ways. First, emotional reactions (especially strong emotional reactions) take up cognitive capacity, making it more difficult for the person to focus attention on the task at hand or divide their attention between competing tasks (Arnell, Killman, & Fijavz, 2007).
In this way, emotional reactions alter performance on measures of intelligence, aptitude, achievement, neuropsychological functioning, and mental state. Second, moods and other affect states have biasing effects, priming affect-consistent nodes in associative networks and thereby increasing the likelihood that certain associates and not others will enter working memory (Hänze & Meyer, 1998; Robinson & Clore, 2002). Studies have also shown that people are more likely to retrieve mood-congruent than mood-incongruent episodic memories, though these effects are stronger when free-recall procedures are used than when highly structured (e.g., questionnaire) measures are used (McFarland & Buehler, 1998; Zemack-Ruger, Bettman, & Fitzsimons, 2007). Thus, mood-priming effects are particularly salient when stimulus-attribution tests and constructive tests are administered.
Examiner effects. As Masling (1966, 2002) and others have shown, examiner characteristics and behaviors alter psychological test responses in predictable ways (see Butcher, 2002, for reviews of studies in this area). For example, testers who interact with respondents in a distant or an authoritarian manner elicit self-attribution and stimulus-attribution test responses that are more guarded and defensive than those elicited by testers who treat respondents more warmly during the evaluation. When examiners create rapport with respondents prior to administering intelligence test items, respondents tend to produce higher intelligence scores than are obtained when testing is not preceded by rapport building. Similar findings emerge in performance-based occupational screens. Garb (1998) provided an extensive review of the literature documenting the impact of clinician expectancy effects on the outcome of psychological assessments.
Although some of these biasing effects stem from clinicians’ misperceptions of respondent behavior based on characteristics of the person being evaluated (e.g., gender, age, physical attractiveness), these effects also stem from the manner in which the examiner interacts with the examinee prior to and during testing, which may alter the examinees’ cognitive processes, emotional states, and motives (Allen, Montgomery, Tubman, Frazier, & Escovar, 2003).

Implementing the PF Model

Table 4 summarizes in broad terms the four steps involved in test score validation using the PF model. As Table 4 shows, the first step in process-focused test score validation involves specifying the underlying processes that should occur as individuals respond to test stimuli (e.g., retrospective memory search, associative priming) and identifying context variables (e.g., affect state, instructional set) that potentially alter these processes. Next, process–outcome links are operationalized and tested empirically (Step 2), and the results of these assessments are evaluated (Step 3). Finally, process-focused test score validity data are contextualized by enumerating limiting conditions (e.g., flaws in experimental design) that might have influenced the results and evaluating the generalizability and ecological validity of PF data by assessing the degree to which similar patterns are obtained in different populations and settings. This latter task entails conducting replications of the initial investigation in different contexts, using new participant samples. Thus, Steps 1–3 will occur whenever a process-focused validity study is conducted; Step 4 represents a long-term goal that requires additional studies.

6 Although the RIM has been the topic of considerable controversy in recent years, much of this debate has centered on the utility of Exner’s (1991) comprehensive system. Even vocal critics of the RIM acknowledge the psychometric soundness of the ROD scale and the strong validity evidence in support of the measure. As Hunsley and Bailey (1999) noted, “One excellent example of a scale that does have scientific support . . . is the Rorschach Oral Dependency scale. The history of research on this scale may serve as a useful guide for future attempts to validate [other] Rorschach scales” (p. 271).

Table 4
A Process-Focused Model of Validity

1) Deconstruct assessment instrument(s)
   a) Specify underlying psychological processes
   b) Identify context variables that alter these processes
2) Operationalize and evaluate process–outcome links
   a) Turn process-altering variables into manipulations
   b) Delineate hypothesized outcomes
   c) Experimental design
3) Interpret outcome
   a) Process-based validity results
   b) Limiting conditions
4) Evaluate generalizability/ecological validity
   a) Population
   b) Context and setting

Research examining the process-focused validity of scores derived from self-attribution and stimulus attribution measures of interpersonal dependency illustrates one way in which the PF model may be implemented. As Bornstein (2002) noted, the Interpersonal Dependency Inventory (IDI; Hirschfeld et al., 1977) and ROD scale (Masling et al., 1967) are both widely used, and both yield well-validated (from an outcome perspective) scores that have been shown to predict a broad array of dependency-related behaviors (e.g., suggestibility, help seeking, compliance, interpersonal yielding) in laboratory and field settings (see Bornstein, 1999, for a meta-analysis of behaviorally referenced validity evidence for these two measures).
Although scores derived from the IDI and ROD scale show good concurrent and predictive validity, IDI and ROD scale scores correlate modestly with each other, typically in the range of .2 to .3, raising questions regarding the degree to which the two measures are tapping similar constructs (see McClelland, Koestner, & Weinberger, 1989, for parallel findings regarding the intercorrelation of self-attribution and stimulus attribution need for achievement scores). Using the logic of the PF model, Bornstein et al. (1994) and Bornstein, Bowers, and Bonner (1996a) assessed the process validity of IDI and ROD scale scores in a series of experiments wherein manipulations were used to alter one set of processes but not the other, and assess the differential impact of these manipulations on self- versus stimulus attribution dependency scores. In the first investigation, Bornstein et al. (1994) deliberately altered participants’ self-presentation goals by introducing an instructional manipulation immediately prior to testing. Bornstein et al. administered the IDI and ROD scale to a mixed-sex sample of college students under three different conditions. One third of the participants completed the tests in a negative set condition; prior to completing the IDI and ROD scale, these participants were told that both were measures of interpersonal dependency and that the study was part of a program of research examining the negative aspects of dependent personality traits (following which several negative consequences of dependency were described). One third of the participants completed the two measures in a positive set condition; these participants were told that the study was part of a program of research examining the positive, adaptive aspects of dependency (following which several positive features of dependency were described).
The remaining participants completed the measures under standard conditions, wherein no mention was made of the purpose of either scale or the fact that they assess dependency. Bornstein et al. (1994) found that relative to the control condition, participants’ IDI scores increased significantly in the positive set condition and decreased significantly in the negative set condition; ROD scores were unaffected by instructional set. In a follow-up investigation, Bornstein et al. (1996a) used a retest design to examine the impact of induced mood state on IDI and ROD scores, having college students complete the two measures under standard conditions, then calling participants back for a second testing 6 weeks later and asking them to write essays regarding traumatic events, joyful events, or neutral events to induce a corresponding mood immediately prior to testing. On the basis of previous findings regarding the impact of mood on the priming of nodes in associative networks (e.g., Rholes, Riskind, & Lane, 1987), Bornstein et al. hypothesized that induction of a negative mood state would produce a significant increase in dependent imagery (e.g., increases in associations related to passivity, helplessness, frailty, and vulnerability), leading to increases in ROD scale scores. Because the impact of mood on response to questionnaire items is comparatively modest (Hirschfeld, Klerman, Clayton, & Keller, 1983), Bornstein et al. hypothesized that IDI scores would not increase significantly in the negative mood condition. The expected patterns were obtained: Induction of a negative mood led to a significant increase in ROD (but not in IDI) scores (see Bornstein, 2002, for descriptions of other PF studies involving measures of interpersonal dependency).
Thus, the PF model proved useful in illuminating the processes that underlie self-attribution and stimulus-attribution dependency scores, and in helping explain the modest intercorrelations between scores on two widely used measures of the same construct (see also McClelland et al., 1989, for a discussion of this issue).
Similar logic can be used in other domains as well. For example, one might examine the psychological processes involved in responding to self-attribution narcissism test items by manipulating respondents’ self-focus (e.g., inducing self-focus vs. external/field focus using a mirror manipulation; see George & Stopa, 2008) prior to testing. By introducing a cognitive load as participants complete a brief neurological test or dementia screen, evidence regarding the process-focused validity of these measures can be examined. To ascertain whether state anxiety can indeed be inferred from DAP test data (Briccetti, 1994), an anxiety-inducing manipulation (vs. no manipulation) can be implemented. Finally, given the psychological processes that occur during observational ratings and informant reports, one might expect that providing false feedback regarding an individual prior to obtaining observer and/or informant judgments regarding that person would alter these judgments in predictable ways. Because observational ratings occur in the here-and-now whereas informant reports are retrospective (and therefore more susceptible to retrieval-based memory distortion), one would hypothesize that false feedback should alter informant reports more strongly than observational ratings.

Implications of the PF Model: Research, Psychometrics, Practice, and Social Policy

Table 5 contrasts the traditional and PF models in five areas: evidence, research strategy, validity coefficient generalizability, test development goals, and challenges.
In addition to highlighting operational differences between the two perspectives, Table 5 illustrates how the PF model shifts psychologists’ understanding of the generalizability of validity data (from concordance of validity coefficients across groups to documentation of similar underlying processes in different groups), and the strategies involved in test development (from finding optimal criterion measures and maximizing test score–criterion relationships to finding optimal manipulations and maximizing the impact of these manipulations on underlying process).
Although the PF model yields unique information that the traditional outcome-focused approach cannot provide, neither method alone yields a truly comprehensive picture of test score validity. When both approaches are used, psychologists can derive two separate validity coefficients, both of which may be represented as standard effect sizes (e.g., r or d): an outcome effect size (the traditional estimate of predictor–criterion association), and a process effect size (a numerical index of the degree to which a theoretically relevant manipulation altered test score in line with a priori predictions). Moreover, just as one may conceptualize the outcome effect size as a single predictor–criterion correlation or as the sum total (or average) of an array of interrelated predictor–criterion correlations (Rosenthal, 1991), one may conceptualize the process effect size with respect to a single experimental manipulation, or an array of converging manipulations that would all be expected to have similar effects on a given test score. Note that when the two frameworks are integrated in this way, a given measure can potentially fall into one of four categories:
1. Adequate outcome and process validity. This is the best possible result, reflecting a situation wherein a test score predicts what one hopes it does, and the psychological processes in which respondents engage while completing the measure are in line with expectations.
2. Adequate outcome but not process validity. In this situation, the test score appears to be assessing what one hopes, but it is not clear why, because respondents’ reactions during testing differ from what was expected.
3. Adequate process but not outcome validity. Here, the measure seems to be tapping the expected underlying psychological processes, but test scores do not relate to external, theoretically related indices as anticipated.
4. Inadequate outcome and process validity. Neither process nor outcome is as expected, and it might be time to move on to a new test.
As these four scenarios illustrate, a key advantage of combining outcome and process validity data is that these data not only point to potential limitations in a measure but also suggest specific interventions to correct these limitations. Scenario 2 suggests devoting greater attention to process than outcome issues, and determining whether the problematic process results reflect difficulties in the test itself or the manipulation used to evaluate it. Scenario 3 suggests devoting greater attention to outcome than to process issues, and considering whether the disappointing predictor–criterion relationships stem from flaws in the test or in the outcome measures used to validate scores derived from it.
Thus, the PF model represents both an affirmation of and challenge to the 1999 Standards’ conceptualization of validity as a unitary concept, with validity broadly defined as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (American Educational Research Association et al., 1999, p. 9).
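The four categories listed above can be expressed as a simple decision rule. The sketch below is illustrative only: the article does not say how large an effect size must be to count as "adequate," so the cutoff ADEQUATE_R, the ValidityEvidence record, and the classify function are all hypothetical names and values introduced here. Both effect sizes are treated as correlations signed so that positive values fall in the predicted direction.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "adequate"; the article gives no numerical criterion.
ADEQUATE_R = 0.30


@dataclass
class ValidityEvidence:
    """Two effect sizes for one test score, both expressed as r.

    Each value is signed so that a positive number means the effect
    ran in the direction predicted a priori.
    """
    outcome_r: float  # predictor-criterion association
    process_r: float  # impact of a theoretically relevant manipulation


def classify(evidence: ValidityEvidence, threshold: float = ADEQUATE_R) -> int:
    """Return the category number (1 to 4) from the list in the text."""
    outcome_ok = evidence.outcome_r >= threshold
    process_ok = evidence.process_r >= threshold
    if outcome_ok and process_ok:
        return 1  # adequate outcome and process validity
    if outcome_ok:
        return 2  # adequate outcome but not process validity
    if process_ok:
        return 3  # adequate process but not outcome validity
    return 4      # inadequate outcome and process validity


# Invented values, for illustration only.
example = ValidityEvidence(outcome_r=0.35, process_r=0.10)
category = classify(example)  # outcome adequate, process not: category 2
```

A single cutoff applied to both coefficients is a simplification; in practice the two thresholds would probably be set separately and justified for the construct and manipulation in question.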
In support of this view, the PF model suggests that a complete picture of test score validity can only be obtained by integrating divergent sources of validity data obtained via different methods and procedures. In contrast to the unitary concept view, however, the PF model argues that distinctions among certain types of validity evidence (in this case, process validity and outcome validity) remain useful and should be retained. Process and outcome validity evidence for a given test score should be considered both separately and in combination so that convergences and divergences among different forms of validity data can be scrutinized. The notion of validity as a truly unitary concept, though admirable, is premature.

Research and Psychometric Implications

Although implementing the PF model involves shifting the focus of validity research from correlation to experimentation, process-focused validity studies need not be limited to true experiments wherein underlying processes are manipulated directly, but may also involve quasiexperiments wherein preexisting groups are selected on the basis of presumed process differences. For example, following up on their initial investigations, Bornstein, Bowers, and Bonner (1996b) found significant positive correlations between IDI scores and Bem’s (1974) Sex Role Inventory (BSRI) femininity scores, and significant negative correlations between IDI scores and BSRI masculinity scores.
Stimulus attribution dependency scores were unrelated to gender role orientation in participants of either gender. These patterns suggest that gender differences in self-reported dependency are due, at least in part, to women’s and men’s efforts to present themselves in gender-consistent ways on psychological tests. Not surprisingly, a meta-analytic synthesis of gender differences in self-attribution and stimulus attribution dependency scores found that women scored significantly higher than men on every self-attribution dependency scale, but not on any stimulus attribution dependency measure (Bornstein, 1995).

Table 5
Contrasting the Traditional and Process-Focused Models

Domain | Traditional model | Process-focused model
Key evidence | Degree to which test score correlates with theoretically related variable | Degree to which test score is altered via manipulation of theoretically related process
Research method | Assessment of predictor–criterion correlation | Assessment of impact of experimental manipulation
Validity coefficient generalizability | Concordance of validity coefficients across groups | Documentation of similar processes across groups
Test development goals | Maximize test score–outcome correlation | Demonstrate impact of theoretically related process
Test development challenges | Finding optimal criterion measure(s) | Finding optimal manipulation(s)

The PF model has implications for test development, in that potential test stimuli must not only be constructed to maximize the association between test score and theoretically related variables, but also with respect to underlying process issues. In other words, test items should be constructed to engage those psychological processes (e.g., retrospection, spontaneous association) that reflect the construct being assessed and the method being used to assess it.
Thus, in addition to performing preliminary factor- and cluster-analytic studies when refining psychological test items, and evaluating the degree to which potential test items are associated with external indices, psychometricians should assess the impact of relevant process manipulations on responses to each test item.
One might reasonably argue that requiring psychometricians to conduct experimental process-focused validity studies prior to the publication of psychological tests puts an undue burden on test publishers, slowing the test development process and delaying the introduction of new measures that could potentially benefit patients, clinicians, and psychologists in various applied settings (e.g., organizational, forensic). There is no ideal solution to this dilemma, and the best approach may be one that seeks a middle ground: Just as a substantial (but not necessarily comprehensive) body of psychometric data should be obtained before a new psychological test is used in vivo, a substantial (but not necessarily definitive) body of process-focused experimental evidence should be collected prior to publication of a new test. Moreover, just as researchers continue to collect psychometric data postpublication so the strengths and limitations of a test can be better understood and the measure revised and improved, researchers should continue to collect process-focused data postpublication so the underlying processes engaged by the measure are brought into sharper focus. These data can also be used to refine and improve the test.
As Bornstein et al.’s (1994, 1996a) findings illustrated, inherent in the PF model is a new conceptualization of test score divergence: When two measures of a construct engage different underlying processes, one can deliberately dissociate these processes, using manipulations that alter one set of processes but not the other.
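The dissociation logic just described can be illustrated numerically by computing a process effect size separately for each measure. The sketch below uses Cohen's d with a pooled standard deviation, which is one common way to express a standardized mean difference; the article says only that process effect sizes may be expressed as r or d. All scores are invented for illustration and are not data from any study, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(manipulated, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(manipulated), len(control)
    s1, s2 = stdev(manipulated), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(manipulated) - mean(control)) / pooled_sd


# Invented scores: a self-attribution measure and a stimulus-attribution
# measure of the same construct, each given under a control condition and
# under a manipulation expected to alter self-presentation only.
self_attribution_control = [40, 42, 38, 41, 39]
self_attribution_manipulated = [46, 48, 44, 47, 45]
stimulus_attribution_control = [5, 6, 4, 5, 6]
stimulus_attribution_manipulated = [5, 6, 5, 4, 6]

d_self = cohens_d(self_attribution_manipulated, self_attribution_control)
d_stimulus = cohens_d(stimulus_attribution_manipulated, stimulus_attribution_control)

# A large d for one measure alongside a near-zero d for the other is the
# pattern a dissociating manipulation is expected to produce.
print(f"self-attribution d = {d_self:.2f}")
print(f"stimulus-attribution d = {d_stimulus:.2f}")
```

With real data, one would also want confidence intervals around each d and a direct test of the difference between them, since a significant effect on one measure and a nonsignificant effect on the other does not by itself establish that the two effects differ.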
Thus, in addition to shifting the emphasis from outcome to process, and from correlation to experimentation, the PF model calls our attention to the importance of meaningful test score discontinuity (see also Meyer, 1996, 1997, and Meyer et al., 2001, for discussions of this issue). Just as convergent validity evidence must be accompanied by discriminant validity evidence to yield a complete picture of test score validity using the traditional outcome-based approach, manipulations that cause two measures of a given construct to converge more strongly must be complemented by manipulations that cause scores on these measures to diverge more sharply when the PF model is used. Note that manipulations that cause two test scores to converge should also increase the convergence of these two test scores with a common theoretically related external criterion, whereas manipulations that cause two test scores to diverge should lead to greater divergence in the magnitude of test score–criterion links. This is another means through which process-focused and outcome-based validity data may be integrated.

Practice and Social Policy Implications

Principle 9.05 of the American Psychological Association’s (2002) Ethical Principles and Code of Conduct states that “Psychologists who develop tests and other assessment techniques use appropriate psychometric procedures and current scientific or professional knowledge for test design, standardization, validation, reduction or elimination of bias, and recommendations for use” (p. 14). The PF model’s framework for conceptualizing test score divergence has implications for understanding the sources of group differences in performance; ultimately the PF model may enhance psychologists’ ability to develop measures that generalize more effectively across gender, age, race, and ethnicity, thereby reducing test bias and test score misuse.
Although public controversy regarding test bias has tended to emphasize group differences in outcome (e.g., ethnic and racial differences in SAT scores), psychometricians have increasingly focused on differential predictor–criterion relationships as a key index of bias (e.g., situations wherein scores on a personnel selection screen predict occupational success more effectively in members of one group than another). The PF model provides a framework for evaluating the degree to which group differences in predictive validity may be rooted in underlying process: When a test score predicts an outcome more effectively in one group than in another, this differential outcome validity is likely to be rooted, at least in part, in intergroup process differences, and these can be detected by introducing manipulations designed to alter the processes in question. Thus, differential predictive validities of SAT scores in African American and Caucasian students should increase when manipulations designed to increase stereotype threat are used, and decrease when threat-reducing manipulations are used. Gender differences in self-attributed dependency should increase when test instructions are written to focus respondents’ attention on gender role issues, and decrease when instructions that deemphasize gender are used (see Major & O’Brien, 2005, for a discussion of contextual cues that moderate self-schema- and self-presentation-related psychological processes in various groups).
Two practice and policy implications follow, one having to do with addressing concerns regarding test bias prior to publication, the other with remedying flaws in existing tests.
With respect to the former, psychologists developing measures that have historically tended to yield problematic group differences should deliberately evaluate the degree to which similar underlying processes are engaged in different groups, using standard PF manipulations (e.g., changes in test labels, induction of a negative mood or state anxiety) during the early stages of item development. When process differences are identified, these can be addressed before the test is used in vivo (e.g., by altering item content, revising test instructions, or evaluating the impact of varying item formats on performance).

With respect to the latter issue, the PF model suggests an alternative definition of test bias: empirically demonstrable differences in the psychological processes engaged by different groups of respondents. With this in mind, educators, policymakers, and mental health professionals who seek to document test bias (or the absence of bias) can use a process-focused framework alongside the traditional outcome-based approach. Once process-based sources of bias are identified in research settings, strategies for reducing these sources of bias in vivo may be implemented. In forensic contexts, demonstrable group differences in process, when coupled with differences in predictor–criterion relationships, represent compelling evidence that an assessment procedure does not yield comparable outcomes in different groups.⁷

Conclusion: Toward an Integrated, Integrative Perspective on Test Score Validity

The goals of psychology have evolved during the past several decades, and so must the goals of validity research. Historically the relationship between experimentation and test score validity has been largely unidirectional, as researchers sought instruments with well-validated scores to enhance the rigor of their experiments.
The PF model turns this unidirectional relationship into a bidirectional one: Just as one cannot conduct a rigorous experiment without valid test scores, one cannot validate test scores rigorously unless one uses experimental procedures as part of the overall validation strategy. Unlike traditional outcome-based validity assessment, the PF model explicitly links psychological testing to other areas of psychology (e.g., cognitive, social, developmental). Many of the manipulations used in PF studies to date have drawn upon ideas and findings from psychology’s various subfields, including research on memory, mood, self-presentation, implicit motivation, gender role socialization, and other areas. In this respect, the PF model not only enhances psychologists’ understanding of test score validity, but may also help connect psychology’s disparate subfields, contributing to the unification of a discipline that has fractionated considerably in recent years.

⁷ The PF model has pedagogical implications as well, teaching students the value of experimental methods and the ways in which experimental data enrich correlational results. Because the PF framework links assessment to other areas of psychology (e.g., cognitive, social), it helps deepen students’ perception of psychological science as a unified discipline. Moreover, the phenomenological emphasis of the PF framework (increased attention to the mental processes and subjective experience of the respondent) enables students to grasp the complexities of psychological assessment in ways that the traditional approach cannot.

References

Achenbach, T. M., Howell, C. T., Quay, H. C., & Conners, C. K. (1991). National survey of problems and competencies among four- to sixteen-year-olds: Parents’ reports for normative and clinical samples. Monographs of the Society for Research in Child Development, 56(3, Serial No. 225).
Allen, A., Montgomery, M., Tubman, J., Frazier, L., & Escovar, L. (2003). The effects of assessment feedback on rapport-building and self-enhancement process. Journal of Mental Health Counseling, 25, 165–182.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. Washington, DC: Author.
Arnell, K. M., Killman, K. V., & Fijavz, D. (2007). Blinded by emotion: Target misses follow attention capture by arousing distractors in RSVP. Emotion, 7, 465–477. doi:10.1037/1528-3542.7.3.465
Baltes, M. M. (1996). The many faces of dependency in old age. Cambridge, England: Cambridge University Press.
Bem, S. L. (1974). The measurement of psychological androgyny. Journal of Consulting and Clinical Psychology, 42, 155–162. doi:10.1037/h0036215
Bender, L. (1938). A visual-motor gestalt test and its clinical use. New York, NY: American Orthopsychiatric Association.
Bollen, K. A., & Davis, W. R. (2009). Causal indicator models: Identification, estimation, and testing. Structural Equation Modeling, 16, 498–522. doi:10.1080/10705510903008253
Bornstein, R. F. (1995). Sex differences in objective and projective dependency tests: A meta-analytic review. Assessment, 2, 319–331. doi:10.1177/1073191195002004003
Bornstein, R. F. (1999). Criterion validity of objective and projective dependency tests: A meta-analytic assessment of behavioral prediction. Psychological Assessment, 11, 48–57. doi:10.1037/1040-3590.11.1.48
Bornstein, R. F. (2001). Has psychology become the science of questionnaires? A survey of research outcome measures at the close of the 20th century. The General Psychologist, 36, 36–40.
Bornstein, R. F. (2002). A process dissociation approach to objective-projective test score interrelationships. Journal of Personality Assessment, 78, 47–68. doi:10.1207/S15327752JPA7801_04
Bornstein, R. F. (2003). Behaviorally referenced experimentation and symptom validation: A paradigm for 21st century personality disorder research. Journal of Personality Disorders, 17, 1–18.
Bornstein, R. F. (2007). Toward a process-based framework for classifying personality tests: Comment on Meyer and Kurtz (2006). Journal of Personality Assessment, 89, 202–207.
Bornstein, R. F., Bowers, K. S., & Bonner, S. (1996a). Effects of induced mood states on objective and projective dependency scores. Journal of Personality Assessment, 67, 324–340. doi:10.1207/s15327752jpa6702_8
Bornstein, R. F., Bowers, K. S., & Bonner, S. (1996b). Relationships of objective and projective dependency scores to sex role orientation in college students. Journal of Personality Assessment, 66, 555–568. doi:10.1207/s15327752jpa6603_6
Bornstein, R. F., Rossner, S. C., Hill, E. L., & Stepanian, M. L. (1994). Face validity and fakability of objective and projective measures of dependency. Journal of Personality Assessment, 63, 363–386. doi:10.1207/s15327752jpa6302_14
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. doi:10.1037/0033-295X.111.4.1061
Briccetti, K. A. (1994). Emotional indicators of deaf children on the Draw-a-Person test. American Annals of the Deaf, 139, 500–505.
Butcher, J. N. (Ed.). (2002). Clinical personality assessment: Practical approaches (2nd ed.). New York, NY: Oxford University Press.
Campbell, D. T., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.
Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68, 397–412. doi:10.1177/0013164407310130
Costa, P. T., & McCrae, R. R. (1985). NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.
Cramer, A. O. J., Waldorp, L. J., van der Maas, H. L. J., & Borsboom, D. (2010). Comorbidity: A network perspective. Behavioral and Brain Sciences, 33, 137–150. doi:10.1017/S0140525X09991567
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957
Embretson (Whitely), S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197. doi:10.1037/0033-2909.93.1.179
Embretson, S. E. (1994). Applications of cognitive design systems to test development. In C. R. Reynolds (Ed.), Cognitive assessment: A multidisciplinary perspective (pp. 107–135). New York, NY: Plenum Press.
Embretson, S. (1998). A cognitive design system approach for generating valid tests: Approaches to abstract reasoning. Psychological Methods, 3, 300–396.
Embretson, S., & Gorin, J. (2001). Improving construct validity with cognitive psychology principles. Journal of Educational Measurement, 38, 343–368.
Exner, J. E. (1991). The Rorschach: A comprehensive system, Volume 2 (2nd ed.). New York, NY: Wiley.
Garb, H. N. (1998). Studying the clinician: Judgment research and psychological assessment. Washington, DC: American Psychological Association. doi:10.1037/10299-000
George, G., & Stopa, L. (2008). Private and public self-awareness in social anxiety. Journal of Behavior Therapy and Experimental Psychiatry, 39, 57–72. doi:10.1016/j.jbtep.2006.09.004
Goldman, B. A., & Mitchel, D. F. (2003). Directory of unpublished experimental mental measures (Vol. 8). Washington, DC: American Psychological Association.
Goodwin, L. D., & Leech, N. L. (2003). The meaning of validity in the new Standards for Educational and Psychological Testing: Implications for measurement courses. Measurement and Evaluation in Counseling and Development, 36, 181–192.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–438.
Hagemann, D., & Meyerhoff, D. (2008). A simplified estimation of latent state-trait parameters. Structural Equation Modeling, 15, 627–650. doi:10.1080/10705510802339049
Hänze, M., & Meyer, H. A. (1998). Mood influences on automatic and controlled semantic priming. American Journal of Psychology, 111, 265–278. doi:10.2307/1423489
Hershberger, S. L. (2003). The growth of structural equation modeling: 1994–2001. Structural Equation Modeling, 10, 35–46. doi:10.1207/S15328007SEM1001_2
Hirschfeld, R. M. A., Klerman, G. L., Clayton, P. J., & Keller, M. B. (1983). Personality and depression: Empirical findings. Archives of General Psychiatry, 40, 993–998.
Hirschfeld, R. M. A., Klerman, G. L., Gough, H. G., Barrett, J., Korchin, S. J., & Chodoff, P. (1977). A measure of interpersonal dependency. Journal of Personality Assessment, 41, 610–618. doi:10.1207/s15327752jpa4106_6
Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64, 802–812. doi:10.1177/0013164404264120
Hunsley, J., & Bailey, J. M. (1999). The clinical utility of the Rorschach: Unfulfilled promises and an uncertain future. Psychological Assessment, 11, 266–277. doi:10.1037/1040-3590.11.3.266
Jansen, B. R. J., & Van der Maas, H. L. J. (1997). Statistical tests of the rule assessment methodology by latent class analysis. Developmental Review, 17, 321–357.
Jansen, B. R. J., & Van der Maas, H. L. J. (2002). The development of children’s rule use on the balance scale task. Journal of Experimental Child Psychology, 81, 383–416. doi:10.1006/jecp.2002.2664
Jonson, J. L., & Plake, B. S. (1998). A historical comparison of validity standards and validity practices. Educational and Psychological Measurement, 58, 736–753. doi:10.1177/0013164498058005002
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535. doi:10.1037/0033-2909.112.3.527
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342. doi:10.1111/j.1745-3984.2001.tb01130.x
Kawada, C. L. K., Oettingen, G., Gollwitzer, P. M., & Bargh, J. A. (2004). The projection of implicit and explicit goals. Journal of Personality and Social Psychology, 86, 545–559. doi:10.1037/0022-3514.86.4.545
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694.
MacCorquodale, K., & Meehl, P. E. (1948). On a distinction between hypothetical constructs and intervening variables. Psychological Review, 55, 95–107.
Machover, K. (1949). Personality projection in the drawing of the human figure. Springfield, IL: Charles C Thomas. doi:10.1037/11147-000
Major, B., & O’Brien, L. T. (2005). The social psychology of stigma. Annual Review of Psychology, 56, 393–421. doi:10.1146/annurev.psych.56.091103.070137
Masling, J. M. (1966). Role-related behavior of the subject and psychologist and its effect upon psychological data. In D. Levine (Ed.), Nebraska symposium on motivation (pp. 67–104). Lincoln: University of Nebraska Press.
Masling, J. (2002). Speak, memory, or goodbye, Columbus. Journal of Personality Assessment, 78, 4–30. doi:10.1207/S15327752JPA7801_02
Masling, J., Rabie, L., & Blondheim, S. H. (1967). Obesity, level of aspiration, and Rorschach and TAT measures of oral dependence. Journal of Consulting Psychology, 31, 233–239. doi:10.1037/h0020999
McClelland, D. C., Koestner, R., & Weinberger, J. (1989). How do self-attributed and implicit motives differ? Psychological Review, 96, 690–702. doi:10.1037/0033-295X.96.4.690
McFarland, C., & Buehler, R. (1998). The impact of negative affect on autobiographical memory: The role of self-focused attention to moods. Journal of Personality and Social Psychology, 75, 1424–1440. doi:10.1037/0022-3514.75.6.1424
McIntire, S. A., & Miller, L. A. (2000). Foundations of psychological testing. Boston, MA: McGraw-Hill.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. doi:10.1037/0022-006X.46.4.806
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as inquiry into score meaning. American Psychologist, 50, 741–749. doi:10.1037/0003-066X.50.9.741
Meyer, G. J. (1996). The Rorschach and MMPI: Toward a more scientific understanding of cross-method assessment. Journal of Personality Assessment, 67, 558–578. doi:10.1207/s15327752jpa6703_11
Meyer, G. J. (1997). On the integration of personality assessment methods: The Rorschach and MMPI. Journal of Personality Assessment, 68, 297–330. doi:10.1207/s15327752jpa6802_5
Meyer, G. J., Finn, S. E., Eyde, L. D., Kay, G. G., Moreland, K. L., Dies, R. R., . . . Read, G. M. (2001). Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist, 56, 128–165. doi:10.1037/0003-066X.56.2.128
Mischel, W. (1984). Convergences and divergences in the search for consistency. American Psychologist, 39, 351–364. doi:10.1037/0003-066X.39.4.351
Mischel, W., Shoda, Y., & Mendoza-Denton, R. (2002). Situation-behavior profiles as a locus of consistency in personality. Current Directions in Psychological Science, 11, 50–54. doi:10.1111/1467-8721.00166
Mislevy, R. J. (2007). Validity by design. Educational Researcher, 36, 463–469. doi:10.3102/0013189X07311660
Murray, H. A. (1943). Thematic Apperception Test manual. Cambridge, MA: Harvard University Press.
Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2005). Understanding and using the Implicit Association Test: Method variables and construct validity. Personality and Social Psychology Bulletin, 31, 166–180. doi:10.1177/0146167204271418
Rholes, W. S., Riskind, J. H., & Lane, J. W. (1987). Emotional states and memory biases: Effects of cognitive priming and mood. Journal of Personality and Social Psychology, 52, 91–99. doi:10.1037/0022-3514.52.1.91
Robinson, M. D., & Clore, G. L. (2002). Episodic and semantic knowledge in emotional self-report: Evidence for two judgment processes. Journal of Personality and Social Psychology, 83, 198–215. doi:10.1037/0022-3514.83.1.198
Rorschach, H. (1921). Psychodiagnostik. Bern, Switzerland: Bircher.
Rosenthal, R. (1991). Meta-analytic procedures for social research (2nd ed.). Thousand Oaks, CA: Sage.
Rosenthal, R. (2003). Covert communication in laboratories, classrooms, and the truly real world. Current Directions in Psychological Science, 12, 151–154. doi:10.1111/1467-8721.t01-1-01250
Schumacker, R. E., & Lomax, R. G. (2004). A beginner’s guide to structural equation modeling (2nd ed.). Mahwah, NJ: Erlbaum.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.
Slaney, K. L., & Maraun, M. D. (2008). A proposed framework for conducting data-based test analysis. Psychological Methods, 13, 376–390. doi:10.1037/a0014269
Spies, R. A., & Plake, B. S. (Eds.). (2005). The sixteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.
Sproull, L. S. (1981). Managing education programs: A micro-behavioral analysis. Human Organization, 40, 113–122.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797–811. doi:10.1037/0022-3514.69.5.797
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and methodology. Annual Review of Clinical Psychology, 5, 1–25.
Tomarken, A. J., & Waller, N. G. (2005). Structural equation modeling: Strengths, limitations, and misconceptions. Annual Review of Clinical Psychology, 1, 31–65. doi:10.1146/annurev.clinpsy.1.102803.144239
Ullman, J. B. (2006). Structural equation modeling: Reviewing the basics and moving forward. Journal of Personality Assessment, 87, 35–50. doi:10.1207/s15327752jpa8701_03
Young, J. W. (2001). Differential validity, differential prediction, and college admissions testing: A comprehensive review and analysis. New York, NY: The College Board.
Zemack-Rugar, Y., Bettman, J. R., & Fitzsimons, G. J. (2007). Effects of nonconsciously priming emotional concepts on behavior. Journal of Personality and Social Psychology, 93, 927–939. doi:10.1037/0022-3514.93.6.927

Received July 1, 2010
Revision received November 12, 2010
Accepted November 15, 2010