Cross Cultural Research

Anonymous
Asked: Mar 6th, 2018
$20

Question description

Cross-cultural research is a method of study that psychologists use to compare data and behaviors of people from differing cultures, rather than a single culture. In cross-cultural research, you need to ensure that there is equivalence throughout the study, as well as a lack of bias in your measures, associations, and conclusions. Equivalence is the evidence that your research uses the same techniques and measures to test the same phenomenon across cultures, and this equivalence helps your research to be considered valid and reliable. In addition to equivalence, you must be aware of the potential for personal bias in any cross-cultural research you conduct.

A bias is a prejudicial predisposition that can prevent impartial thinking. In cross-cultural research, a bias can appear in various forms, such as the Barnum statement (a one-size-fits-all description) or the self-fulfilling prophecy (your assumptions about others can cause them to meet those expectations) (Matsumoto & Juang, 2008; Shiraev & Levy, 2010).

For this Discussion, analyze the theoretical, methodological, and ethical issues in the attached research study, a cross-cultural study of the prevalence of late-life depression in low- and middle-income countries.

With these thoughts in mind: Post a brief summary of the research study selected, including the topic and conclusions of the study. Then explain any possible theoretical, methodological, and ethical issues involved in the study. Finally, share your thoughts about how, as a scholar-practitioner, you might address one or more of these issues.

APA Format. 3-4 Paragraphs. In-text Citations to support your writing.

Child Development, July/August 2007, Volume 78, Number 4, Pages 1255-1264

Lost in Translation: Methodological Considerations in Cross-Cultural Research

Elizabeth D. Peña
The University of Texas at Austin

In cross-cultural child development research there is often a need to translate instruments and instructions to languages other than English. Typically, the translation process focuses on ensuring linguistic equivalence. However, establishment of linguistic equivalence through translation techniques is often not sufficient to guard against validity threats. In addition to linguistic equivalence, functional equivalence, cultural equivalence, and metric equivalence are factors that need to be considered when research methods are translated to other languages. This article first examines cross-cultural threats to validity in research. Next, each of the preceding factors is illustrated with examples from the literature. Finally, suggestions for incorporating each factor into research studies of child development are given.

In the study of child development, cross-cultural (and intracultural) studies of knowledge acquisition are important for both theoretical and practical reasons. First, cross-cultural methods allow researchers to test, modify, and extend current theories of development (Devescovi & D’Amico, 2005; Katzir, Shaul, Breznitz, & Wolf, 2004; Slobin, 1985). Such research provides insights about the interaction among universal and specific factors in development in the context of social and linguistic variation. Second, the non-English-speaking child population in the United States has increased significantly. According to the U.S. census (Shin & Bruno, 2003), in 1990, 14% of the school-age population spoke a language other than English at home, but the proportion had increased to 18% by 2000. There is a practical need to learn about developmental trajectories for children who speak languages other than English or who speak English as a second language.
Demographic shifts in the child population due to immigration are paralleled in other countries as well: England (Coleman & Rowthorn, 2004), Italy (Livi Bacci, 2004), Australia (Shah & Long, 2003), and Sweden, Norway, and Denmark (Kemnitz, 2003). Both theoretical and practical needs drive the heightened interest in cross-cultural research.

Cross-cultural research often necessitates translation of methods (i.e., instruments and instructions to participants) from English to other languages. In the United States, instruments (such as cognitive tests, language tests, social behavior scales, and school adjustment scales) are typically standardized in English with mainstream American children but they are unlikely to be standardized for other language groups. In addition, question sets, procedures, and coding schemes are developed to address a particular research question. These too are typically developed for mainstream English-speaking children based on what is known about their development. Translation of these instruments and procedures presents particular methodological challenges that can threaten the validity of results. Researchers in child development thus need to be conscious of the pitfalls in translating methods developed for one population and language community to another.

The work for this article was initiated while the author was a Fellow at the Center for Advanced Study in the Behavioral Sciences, Stanford, CA. Correspondence concerning this article should be addressed to Elizabeth Peña, Department of Communication Sciences and Disorders, University of Texas at Austin, One University Station A1100, Austin, TX 78712. Electronic mail may be sent to lizp@mail.utexas.edu.

Objectivity is a hallmark in research methodology. However, Greenfield (1994) points out that when psychologists, for example, study development within their own culture, they use their own implicit knowledge of the culture (often unacknowledged) when doing research.
This insider’s perspective, as practiced by members of a discipline, often becomes the basis for norms (Rogler, 1999), setting the standard for what is studied and how it is studied (Zuckerman, 1988). However, methodological norms developed within and for a given population cannot necessarily be transported without adaptation for the target population.

Development of instrumentation and elicitation procedures appropriate for a question under study is fundamental in research design. Accordingly, detailed methods allow readers to determine the validity and reliability of reported results. When research involves populations that do not speak the majority language (e.g., English in the United States), particular attention to development of instrumentation and procedures is needed to ensure validity and reliability. For instance, the Publication Manual of the American Psychological Association (2001) has a section on reduction of language bias. Guidelines state that the language used in the procedure of a study needs to be specified. Furthermore, the method used to translate test instruments to a language other than English must be detailed. But mere translation of elicitation procedures and instrumentation is not sufficient to guard against potential cultural bias and therefore validity threats.

© 2007 by the Society for Research in Child Development, Inc. All rights reserved. 0009-3920/2007/7804-0014

Challenges to Validity in Translation

Bias is a distinct threat to validity in translation of methods in cross-cultural research. The literature on bias in test development provides a useful framework for discussion of linguistically appropriate research methods. An important principle for such a discussion is the notion of fairness in test development. Fairness is evaluated in the context of the goals or function of the test instrument.
Definitions of fairness include equal treatment in context and purpose of testing, and comparable opportunity to demonstrate abilities on the construct the test is intended to measure (Standards for Educational and Psychological Testing, 1999). An important methodological goal therefore is to ensure equivalence at the level of context and opportunity when one is designing cross-cultural research studies of child development. Equivalence may be at the level of stimuli for the purpose of exploring similarity and variation in response, or equivalence may be at the level of outcome for the purpose of understanding differences in circumstances that bring about specific developmental results.

The principles of equal treatment and comparable opportunity can be applied to development of cross-cultural methods. Instructions and tests used across languages need to be equivalent to provide equal opportunity to demonstrate the skill under study. However, translation may not by itself ensure equal opportunity for the participants to demonstrate their abilities. Instructions to participants and the content of the instrument(s) used to gather data are potential sources of bias.

In addition to linguistic equivalence, the notion of equivalence can be interpreted and applied in several additional ways: functional equivalence, cultural equivalence, and metric equivalence (Arnold & Matus, 2000; Bracken & Barona, 1991; Erkut, Alarcón, Coll, Tropp, & García, 1999; Geisinger, 1994; Rogers, Gierl, Tardif, Lin, & Rinaldi, 2003; Rogler, 1999; Sechrest, Fay, & Hafeez Zaidi, 1972; Sireci & Berberoglu, 2000; Standards for Educational and Psychological Testing, 1999; Valencia & Rankin, 1985). The type of equivalence identified as necessary depends on a study’s goals and involves consideration of stimuli and outcomes.
Here, principles drawn from disciplines that have a long tradition of cross-cultural research, such as anthropology and sociology, as well as applied fields, such as clinical psychology and nursing, guide development of a framework appropriate for research in child development.

Linguistic equivalence typically refers to translating instructions and instruments, and checking the translation with methods such as back-translation (translation from the first language to the second, and then back to the first by a second person; Standards for Educational and Psychological Testing, 1999) or expert review (Hambleton, 2001). Functional equivalence means that the instructions and instrument will elicit the same target behavior (Greenfield et al., 2006). Cultural equivalence considers how respondents will interpret a given direction or test item and develops items that tap the same cultural meaning for each cultural linguistic group (Alonso et al., 1998). Metric equivalence has to do with the difficulty of the specific item expressed in two distinct languages (Azen et al., 1999; Kim, Han, & Phillips, 2003; Muñiz, Hambleton, & Xing, 2001) and is essential, for example, for development of ability tests. Together, these types of equivalence provide a way to examine potential methodological bias. Examples from the literature demonstrate the challenges to achievement in each of these kinds of equivalence.

Linguistic Equivalence

Direct translation usually satisfies the standard for ensuring linguistic equivalence. Researchers employ two main types of techniques when translating instruments and instructions. In translation and back-translation (Arnold & Matus, 2000; Beck, Bernal, & Froman, 2003; Brislin, 1986; Hambleton, 2001; Rogers et al., 2003) a translator first translates the instrument or instructions from the source language to the target language. A second translator then independently translates the target version back to the source language.
The original and back-translated versions are compared to identify differences, which are then resolved. This procedure is akin to the game of "telephone" but with a cross-linguistic twist. Another technique is to have a native language speaker review the translation to ensure its accuracy. The main goal for linguistic equivalence is to make certain that the words and linguistic meaning used in the instruments and instructions are the same for both versions (Grisay, 2003; Sireci & Berberoglu, 2000).

A problem with linguistic equivalence is that even if words are the same across two sets of methods there are potential differences that may result in different patterns of responses. That is, the same stimuli may result in different outcomes. These different patterns of response may be due to differences in cultural interpretation, familiarity, or frequency of occurrence. If these differences are themselves the research questions, linguistic equivalence is sufficient and appropriate. If, however, the purpose of a translated instrument is to make a judgment of a developmental status, linguistic equivalence without consideration of functional, cultural, and metric equivalence may introduce bias.

For example, when the Preschool Language Scale-3 (Zimmerman, Steiner, & Pond, 1992), a test of linguistic and conceptual development, was initially translated to Spanish (Zimmerman, Steiner, & Pond, 1993) all the test items were retained in the same order. An item analysis by Restrepo and Silverman (2001) demonstrated that although the items were linguistically equivalent, item difficulty at each age level was not. Not all concepts and linguistic forms are learned at the same point in development in all languages; thus, these may be easier or harder cross-linguistically. For example, understanding object functions was more difficult in English than in Spanish, but prepositions were more difficult in Spanish than in English.
Use of such an instrument in a comparative study of linguistic or cognitive development could be misleading because it possibly under- or over-estimates developmental status. The most recent Spanish version (Zimmerman, Steiner, & Pond, 2002) used many of the same (English) items, introduced new items based on milestones of Spanish language development (e.g., gender agreement), and based item order on Spanish difficulty to yield psychometrically equivalent tests (discussed further later). Such tests are more appropriate for evaluation of development because they compare an individual child against a linguistically and culturally appropriate standard.

Another example that illustrates validity threats in translation concerns the ways target behaviors are elicited. The adaptation of the Peabody Picture Vocabulary Test-Revised (PPVT-R; Dunn & Dunn, 1981) to the Spanish Test de Vocabulario en Imágenes Peabody (TVIP; Dunn, Padilla, Lugo, & Dunn, 1986) provides such a case. In Spanish, as in English, this test is a single-word recognition task. Children hear a word (usually a noun) spoken by the examiner and select one of four pictures that best depicts the given word. In both versions the words are presented without an article (e.g., "dog" not "the dog"). In English, the article carries little linguistic information and nouns frequently occur without articles. In Spanish, however, nouns are typically accompanied by the article, which marks gender and number (la: feminine singular, las: feminine plural, el: masculine singular, los: masculine plural). The test manual of the TVIP, therefore, instructs examiners not to include the article when saying the word because it may provide children with additional cues that will enable them to select the correct word on the basis of the gender and number information (for an experimental example of how children use grammatical gender in Spanish, see Lew-Williams & Fernald, in press).
Spanish-speaking examiners often object to this instruction, calling it "unnatural Spanish." Omitting the article could result in a functional difference unintentionally affecting test performance because Spanish-speaking children do not typically hear nouns without their articles. An alternative way to have constructed the test, allowing the more typical use of the article + noun, would have been to control for gender and number. That is, each of the four alternatives (the target and foils) could have been of the same gender and number so as not to provide an extra cue yet letting children hear the target word in its familiar context.

These two examples show that establishing linguistic equivalence through the established methods of expert translation and back-translation in instrumentation and instructions or elicitation of targets is not always sufficient for development of methods in studies of groups who do not speak English. Although the examples target test instruments specifically, they illustrate the threats that are inherent in using translated methods in cross-cultural studies of development. Instructions or elicitation procedures must also be scrutinized to ensure equal opportunity to demonstrate the target ability. Examination of functional, cultural, and metric equivalence may be needed to guard against validity threats.

Functional Equivalence

Rogler (1999) argued that preservation of the language used in the original language version, or linguistic equivalence, is a potential source of cultural insensitivity if the translation yields functional differences. In other words, translation from one language to another can result in incongruity in meaning, threatening content validity of a study’s methods. Functional equivalence addresses some of these threats by ensuring that the instrument and elicitation method allow examination of the same construct.
This aspect of translation is often overlooked in favor of achieving uniformity in instrumentation and procedures.

One translation method with the purpose of equalizing concepts or function over linguistic equivalence is referred to as "decentering" (Sechrest et al., 1972). Decentering is often used by professional translators to obtain equivalence in meaning and salience with respect to the respondent, in combination with the translation/back-translation approach. This procedure may yield an instrument with translated items that have shifted away from the source instrument’s wording to represent the concept in a linguistically familiar way in the target language.

Another translation method is a "dual-focus" approach (Erkut et al., 1999). This method uses a research team drawn from both of the cultural and linguistic groups under study. The instrumentation and instructions to be used in the research study are developed simultaneously in the two languages so that methods that are linguistically appropriate for each of the target groups focus on equality in clarity (rather than linguistic equivalence) for each. Thus, instruments are parallel with respect to the behavior or concepts tested but with different stimuli.

The following example demonstrates how functional equivalence in instrumentation and procedures can be obtained by use of a combination of translation, back-translation, decentering, and dual-focus procedures (see Bedore, Peña, García, & Cortez, 2005). The Bilingual English Spanish Assessment (BESA; Peña, Gutierrez-Clellen, Iglesias, Goldstein, & Bedore, 2007) is designed to identify language impairment in Latino children between the ages of 4 years 0 months and 6 years 11 months in the United States. The BESA contains two language versions (English and Spanish) targeting four domains: semantics, morpho-syntax, pragmatics, and phonology.
Development of the characteristic properties items from the semantics subtest of the BESA provides an illustration of functional equivalence. These items were designed to elicit descriptions of common objects (e.g., school bus, truck, spoon, and fork) from children ages 4, 5, and 6. Of interest here are the question frames used to elicit description in each of the two languages. Observation of a bilingual preschool classroom teacher-student interaction was the beginning point for development of question frames and indicated that different types of questions were used in the two languages. During item development, the different question types identified in the classroom were piloted with a small number of children and included Spanish and English versions (using both dual-focus and back-translation techniques) of "tell me about . . .," "describe . . .," "tell me . . .," and "tell me three things about . . .," among others. The pilot data indicated that "tell me three things about . . ." in English and "describe . . ." in Spanish yielded functionally equivalent language performance (resulting in decentered elicitation frames). The different question forms are appropriate for each language and have the effect of eliciting similar target behavior in each of the two languages.

These question frames were incorporated into the BESA and tested with a larger number of children (Bedore et al., 2005). Statistical analyses indicated similar performance in each language for both monolingual and bilingual children. Thus elicitation frames for each language in the final version are functionally equivalent and linguistically different, yet both elicit linguistically similar responses. Attention to functional equivalence allows children to demonstrate their knowledge in an elicitation context that is familiar in each language (Fagundes, Haynes, Haak, & Moran, 1998; Peña, 2001). This type of equivalence levels the cross-cultural playing field.
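
As a rough illustration of the piloting logic described above (try several question frames in each language, then keep the cross-language pair that elicits the most similar performance), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the scores are invented, the Spanish frames are labeled by their English glosses, and "smallest difference in mean score" is only one simple way to operationalize functional equivalence. It is not the procedure or the analysis reported for the BESA.

```python
from itertools import product
from statistics import mean

# Hypothetical pilot data: descriptive features produced per child response.
# These numbers are invented for illustration only.
english_pilot = {
    "tell me about": [2, 2, 2],
    "describe": [1, 1, 1],
    "tell me three things about": [3, 4, 4],
}
# Spanish frames are keyed by their English gloss (an assumption of this sketch).
spanish_pilot = {
    "tell me about": [3, 2, 3],
    "describe": [4, 3, 4],
    "tell me three things about": [5, 5, 4],
}

def most_similar_frames(pilot_a, pilot_b):
    """Return the (frame_a, frame_b) pair whose mean scores differ least."""
    return min(
        product(pilot_a, pilot_b),
        key=lambda pair: abs(mean(pilot_a[pair[0]]) - mean(pilot_b[pair[1]])),
    )

best_english, best_spanish = most_similar_frames(english_pilot, spanish_pilot)
print(best_english, "/", best_spanish)
```

With these invented numbers the selected pair uses different wording in each language, which is the point of decentered elicitation frames: the frames are matched on the behavior they elicit rather than on their wording. A real study would also need enough children and an appropriate statistical test before treating two frames as equivalent.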
Cultural Equivalence

In their study, van der Veer, Ommundsen, Hak, and Larsen (2003) pointed out that items may have different salience for different cultural and linguistic groups, even if the items meet the criteria for linguistic and functional equivalence. Disparities in salience may be due to the distinct cultural and historical ways in which concepts are interpreted by respondents. Cultural equivalence with respect to respondents’ interpretations and responses to given items needs to be explored when one is developing methods and procedures.

The notion of cultural equivalence is related to that of functional equivalence (Arnold & Matus, 2000; Geisinger, 1994; Muñiz et al., 2001; Sechrest et al., 1972). Cultural equivalence focuses more centrally on the way members of different cultural and linguistic groups view or interpret the underlying meaning of an item. Cultural interpretations may affect the ways individuals respond to instructions and research instruments, including standardized and nonstandardized tests (Canino & Guarnaccia, 1997; Hendrickson, 2003). Culturally determined definitions of developmental abilities such as knowledge (Zambrano & Greenfield, 2004), creativity (Baldwin, 2001), and language (McCollum & Chen, 2001; Posada, Carbonell, Alzate, & Plata, 2004; Suizzo, 2004; Vigil, 2002) may also affect the ways children and families from linguistically and culturally diverse backgrounds report information.

Zambrano and Greenfield (2004) hypothesized that "different ethnic groups have their own implicit, informal theories of knowledge and that these ethno-theories form the assumptions on which the explicit formal theories are based" (p. 251). That is, Western theories of intelligence rest on cultural-specific assumptions (see also Greenfield, 1997).
Zambrano and Greenfield illustrated that the concept of knowledge has core understandings, albeit with overlapping meanings in American and in several Maya (Tzotzil-speaking) communities. The American English word know refers to facts, theories, and practice or "know-how." The Tzotzil word na translates to the English know but in addition to facts, theory, and practice, it also implies habitual practice that indicates mastery and is part of the person’s character.

This critical distinction of habitual practice or mastery is similar to an example reported by Peña and Jackson (2000), in which a Mexican immigrant child about 2-1/2 years old was referred for a speech-language evaluation because the developmental milestone of first words was reported by the mother to have been reached at 24 months of age, a significant delay. In the initial interview, the Spanish-speaking speech-language pathologist asked the mother for examples of the child’s first words. All the examples were of word combinations, rather than single words, clearly within the age-expected range. In this example, the mother’s ethnotheory of "learning first words" was talking to communicate using word combinations. Earlier use of single words did not count because they were not evidence of mastery of talking. Simply asking when the child first began to talk did not produce a response that had the same cultural meaning it might have for a mainstream, English-speaking, American mother.

A study by Gartstein, Slobodskaya, and Kinsht (2003) provides another illustration of potential culturally biased responses. In this study of infant temperament, Russian and American mothers were asked to complete the Infant Behavior Questionnaire-Revised (IBQ-R; Gartstein & Rothbart, 2003), which consists of 14 scales (for example, activity level, smile and laughter, soothability, sadness, and vocal reactivity). The materials were translated to Russian with a translation and back-translation technique.
Results comparing Russian and American infants confirmed expected cultural differences. Specifically, U.S. mothers reported "higher levels of smiling and laughter, high and low intensity pleasure, perceptual sensitivity and vocal reactivity" (p. 322). On the other hand, Russian mothers reported higher distress to limitations. The authors of this study acknowledged that the emphasis on development of certain behaviors with respect to expression of emotions may be directly linked to the types of emotions that their children then demonstrated. However, there may be another, related factor at work: that mothers interpret their babies’ temperamental characteristics on the basis of their cultural expectations.

Data on bilingual Russian-English personal narratives lend credibility to this argument. Marian and Kaushanskaya (2004) compared autobiographical memories of bilingual Russian-English speakers in each of their two languages to explore the notion of self-construal. They found that narratives retrieved in Russian included fewer personal pronouns and more group pronouns (an indicator of collectivism) than those retrieved in English (indicating individualism). Furthermore, emotional intensity of the narratives had different patterns in each of the two languages. Memories encoded in Russian were less positive than those encoded in English.

A way to test the possibility of cultural bias and to disentangle actual infant behaviors from a mother’s interpretation of those behaviors would be to conduct a study of item salience, similar to that described by van der Veer et al. (2003). Applied to the study under discussion, a subset of Russian and American mothers would rate the temperament of a set of control infants via videotape using the IBQ-R. Comparisons would then be made for the two samples. If there were systematic differences in ratings by nationality, they would point to cultural differences in how mothers interpret the same behavior.
These ratings could then be used to calibrate the responses from the larger sample to examine actual differences in infant behaviors independently of cultural bias. Furthermore, these comparisons could be used to shed light on the nature of such cultural differences.

Metric Equivalence

The final aspect of equivalence discussed here is that of metric equivalence. Metric equivalence refers to equivalence in item or question difficulty. This type of equivalence is particularly important when one is developing instruments in more than one language or adapting an instrument from one language to another. Review of the methods used for adapting and developing vocabulary assessments from English to other languages exemplifies the methodological challenges.

An early example of adaptation of an English test to another language is illustrated by the adaptation of the PPVT-R (Dunn & Dunn, 1981) to a Spanish version, the TVIP (Dunn et al., 1986). The original English corpus was based on English-language dictionaries, and selections were based on item difficulty to sample across a broad age range. These items were translated to Spanish, and similar field testing was conducted in Spain to select items on the basis of item difficulty. These items compose the TVIP. After standardization in Spain, standardization was completed in Mexico and Puerto Rico with monolingual Spanish speakers. Although the TVIP was deliberately not normed with bilingual U.S. populations, Dunn et al. (1986) provided comparative information drawn from pilot studies of bilingual children. The Mexican and Puerto Rican children performed below the mean on the same item set compared with the Spanish children. In addition, the U.S. bilingual children performed about 1 SD below their monolingual Spanish-speaking Mexican and Puerto Rican counterparts on this test.
This adaptation has been criticized on several methodological and psychometric bases (Berliner, 1988; Cummins, 1988; Mercer, 1988; Prewitt Diaz, 1988; Trueba, 1988; Willig, 1988). An important lesson for the current purpose is that dialects and usages vary across and within languages. First, beginning solely with an English corpus is not appropriate because words may have different frequencies and different uses in the two languages. Second, using the same words for Spanish, Mexican, and Puerto Rican Spanish tests may have resulted in differences in performance because vocabulary use and frequency are different for these three populations. Specifically, names for things differ across dialects of Spanish (e.g., pantallas, aretes, and pendientes refer to "earrings" in Puerto Rican, Mexican, and Castilian Spanish, respectively) as they do for dialects of English (e.g., torch and flashlight name the same thing in British vs. American English).

Developing parallel vocabulary measures based on word frequency rather than translation may provide a superior way to develop psychometrically parallel instruments. Tamayo (1987) developed an English word list and two Spanish word lists, one matched to English on the basis of translation and the other matched on the basis of item frequency. The English and Spanish lists were administered to two groups of 80 eighth-grade students, one group fluent in English and the other in Spanish. Children responded by providing a definition of each target word. The two groups were further matched by gender, age, and academic achievement. Comparison of the children’s performance indicated that the English and Spanish versions matched by frequency yielded comparable performance by the two groups, whereas those matched by translation resulted in group differences.
These results imply that metric equivalence as applied to instrument development should consider lexical frequency in each of the target languages if other psychometric data are not available.

The procedures developed by Fenson, Bates, and colleagues for the adaptation of the Bates-MacArthur Communication Developmental Inventories (CDI; Fenson et al., 1993) illustrate how several methods were used in combination for development of this instrument for various languages including Italian (Caselli et al., 1995), Mexican Spanish (Jackson-Maldonado, Thal, Marchman, Bates, & Gutierrez-Clellen, 1993), Cuban Spanish (Pearson & Fernandez, 1994), Mandarin (Tardif, Gelman, & Xu, 1999), Finnish (Lyytinen, Poikkeus, Leiwo, Ahonen, & Lyytinen, 1996), Canadian French (Poulin-Dubois, Graham, & Sippola, 1995), and Hebrew (Maitel, Dromi, Sagi, & Bornstein, 2000). For each adaptation, careful attention was paid to typology of the target language, word frequencies, and word class (Fenson et al., 1994). For example, the Italian version differs from the American English version on the specific content words included, but both versions use an approximately equal number of words from different categories (e.g., animal, food, and clothing items have the same number of items but consist of different words; Caselli et al., 1995; Caselli, Casadio, & Bates, 1999). The grammatical function word categories reflect the structural differences of the two languages in the types of words and number of each type. Specifically, the Italian version includes adverbials but the English version does not; the English version includes 27 prepositions and the Italian version includes 17. A section was added in the Italian version to examine verb conjugation and noun declension. For Mexican Spanish, modifications were also made to render the instrument linguistically and culturally relevant (Jackson-Maldonado et al., 1993).
As in the Italian version, lexical categories were added to reflect verb conjugation; gender in articles, pronouns, and adjectives; and number in articles and pronouns. As in the examples for Italian, the Spanish content reflected culturally appropriate vocabulary and routines. For example, tortillitas and ojitos replaced pat-a-cake and peek-a-boo. Within-language adaptations of the CDI were also made for specific target populations. For example, Hamilton, Plunkett, and Schafer (2000) adapted the CDI for a British population. Examination of their most recent word list (referred to as the Oxford 1998 CDI) indicates several changes from the American version. For example, the Oxford 1998 CDI includes pushchair, brick, biscuit, sweets, and nappy, whereas the American version includes stroller, block, cookie, candy, and diaper. Furthermore, analysis of word frequencies indicated that the items on the Oxford 1998 CDI occurred more frequently in British English than in American English, providing evidence of validity for the target population. In sum, for these adaptations, the authors used several methods in combination, including deriving items from studies of natural language samples, available corpora from language experiments, vocabulary lists from other scales in the target language, and vocabulary lists from CDIs already adapted to other languages. Furthermore, they asked informants from the target population to review the words and to identify irrelevant words, as well as to add relevant words that had not been included. The use of these procedures has resulted in instruments that are appropriate for the populations for which they are intended and that have a high degree of reliability. Cross-cultural comparisons using these instruments are likely to yield valid results. 
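The frequency check described for the Oxford 1998 CDI can be illustrated with a minimal Python sketch. The frequency values below are invented for illustration and are not real corpus counts; `choose_items` is a hypothetical helper, not part of any CDI toolkit. The word pairs are the British and American labels named in the text.

```python
# Hypothetical per-million word frequencies; illustrative values only,
# not taken from any real British or American corpus.
BRITISH_FREQ = {"pushchair": 14, "stroller": 1, "biscuit": 52,
                "cookie": 6, "nappy": 33, "diaper": 2}
AMERICAN_FREQ = {"pushchair": 0, "stroller": 12, "biscuit": 9,
                 "cookie": 47, "nappy": 1, "diaper": 30}

# Competing labels for the same referent (pairs named in the text).
CANDIDATES = [("pushchair", "stroller"), ("biscuit", "cookie"), ("nappy", "diaper")]


def choose_items(candidates, freq):
    """For each referent, keep the label that is most frequent in the target corpus."""
    return [max(labels, key=lambda word: freq.get(word, 0)) for labels in candidates]


british_items = choose_items(CANDIDATES, BRITISH_FREQ)
american_items = choose_items(CANDIDATES, AMERICAN_FREQ)
print(british_items)
print(american_items)
```

In practice the frequencies would come from corpora of child-directed or child-produced speech in the target dialect, and informant review would still be needed, as the text notes.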
Similarly, in development of the tasks for the BESA Semantics Test, Peña, Bedore, and Rappazzo (2003) compared Spanish- and English-speaking children’s performance on six tasks (analogies, characteristic property, categorization, functions, linguistic concepts, and similarities and differences). Together, the tasks had 86 items in each language: 12 items per task and 26 items for categorization. Items were developed using a combination of dual focus, where half of the items were developed for each language independently; translation, where each item was translated to the other language (half to English, half to Spanish); and decentering, where each item was further adapted so that the question context was different but relevant for each group. The contextualizing aspect was conducted in part so that children would not recognize the question if tested in both languages (see Peña et al., 2003, for more details). Analyses revealed significant Language × Task interactions. Some tasks were relatively easier in English than in Spanish (e.g., similarities and differences) but other tasks were easier in Spanish than in English (e.g., functions). For the second stage of data collection, representation of items was based primarily on item difficulty, while attempting to represent all the task types. Thus, the item configuration for each language is different. Finally, items are arranged by difficulty for the target language. This configuration and arrangement results in psychometrically parallel tests that are not linguistically equivalent but that validly assess semantic performance in each language. Preliminary data analysis comparing children with and without language impairment in Spanish and English on these same tasks indicates differential performance by task and (test) language. That is, certain tasks function better to differentiate language impairment in Spanish, whereas other tasks are better suited for English. 
For example, more functions items discriminated between children with and without language impairment in Spanish, but more similarities-and-differences items discriminated these children in English. In sum, psychometric adaptation may mean that the resulting instruments will have different items and different arrangement of items.

Guidelines for Translation of Instruments for Cross-Cultural Research

There is no question that cross-cultural research is beneficial and desirable from both theoretical and practical perspectives. Theoretically, cross-cultural research allows researchers to test and extend theories of development. A cross-cultural approach can help to identify universals in development and to discover variation attributable to linguistic and cultural differences. Most of the examples described here are based on tests of language. Nonetheless, the principles apply to other domains of developmental research that require translation from English to another language. Knowledge of how development unfolds in different linguistic and cultural contexts informs application of the state of the art to a broader population of children. Consideration of additional aspects of equivalence, such as functional, cultural, and metric equivalence, can reduce potential methodological bias. For translation of instructions to participants and of instrumentation, first consider whether bias will be introduced to the study. Techniques such as translation and back-translation result in well-translated instructions and instruments, but they may not provide equally relevant methods for the populations under study. Decentering, that is, adapting the question or item so that it is culturally familiar, can also be used in conjunction with translation. When instruments and instructions are adapted from well-established protocols into other languages, assuring functional equivalence is essential if the study’s focus is on whether children are able to perform a given task. 
Functional equivalence can be evaluated by interviewing informants, conducting literature reviews, and examining available corpora. The CHILDES (MacWhinney, 2000) and SALT (Miller & Iglesias, 2005) reference databases provide rich sources of raw data that can be analyzed to develop functionally equivalent instructions and instrumentation. Examination of word frequencies and typical question forms, for example, can be used to guide selection of target vocabulary or development of question frames. Instructions, questions, and tasks will not necessarily have the same degree of cultural relevance for all participant groups. In particular, when tasks are set up in the same way for different cultural/linguistic groups, differences in outcomes may be influenced by differences in expectations or interpretation rather than by differences in the trait under study. The potential for cultural mismatch highlights the need to consider cultural equivalence in developing methods for cross-cultural research. Debriefing participants during the pilot stage of an adapted method may also help researchers understand how participants might interpret instructions or questions and can also be used to explore the nature of cultural differences. Asking for examples or explanations is another way to understand research participants’ response patterns. These added steps will allow for the examination of potential culturally driven responses to help disentangle true from ostensible differences. Psychometric equivalence is particularly critical for development of instrumentation if comparisons between groups will be made that focus on judgments of ability. Items may not be equally difficult across languages even if the target concept or question occurs in both languages. Some types of items may be rendered more or less complex when translated; words selected in the translation may have different frequencies of occurrence, which in turn influence difficulty. 
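One way to see whether translated items are equally difficult is to compute, for each language version, the proportion of participants who answer each item correctly and then compare the two. The sketch below assumes scored responses coded 1 (correct) or 0 (incorrect). The data, the function names, and the 0.3 gap threshold are all hypothetical choices made for illustration, not values from the studies cited here.

```python
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Proportion of participants answering each item correctly.

    `responses` holds one row per participant and one 0/1 score per item.
    """
    n_participants = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_participants
            for j in range(n_items)]


def flag_items(p_a: List[float], p_b: List[float], max_gap: float = 0.3) -> List[int]:
    """Indices of items whose difficulty differs by more than `max_gap` (arbitrary cutoff)."""
    return [j for j, (a, b) in enumerate(zip(p_a, p_b)) if abs(a - b) > max_gap]


# Invented scores for three items and four children per language version.
english = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]]
spanish = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [1, 0, 0]]

p_english = item_difficulty(english)
p_spanish = item_difficulty(spanish)
flagged = flag_items(p_english, p_spanish)
print(p_english, p_spanish, flagged)
```

Flagged items are candidates for review or replacement; a raw difficulty gap alone does not show whether the difference reflects the translation or a true group difference.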
These considerations are especially important in translation of instruments that test linguistic and cognitive ability, but they are also important in academic domains such as math (Abedi & Lord, 2001; Towse & Saxton, 1998). For development of psychometric instruments, item difficulty needs to be taken into account. The conventional way to determine item difficulty is to calculate the percentage of participants (e.g., at a given age) who respond correctly to an item. Another way to index difficulty is to examine frequency of occurrence in the target language (Tamayo, 1987, 1990) or to examine age of acquisition of target words or concepts (Bates et al., 2003). In summary, four features that need to be considered when one conducts research across cultural/linguistic groups include: linguistic equivalence, functional equivalence, cultural equivalence, and metric equivalence. Attention to these methodological features is critical for establishing a study’s validity. Their consideration and application when needed will reduce validity threats in cross-cultural research.

References

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14, 219 – 234. Alonso, J., Black, C., Norregaard, J.-C., Dunn, E., Andersen, T. F., Espallargues, M., et al. (1998). Cross-cultural differences in the reporting of global functional capacity: An example in cataract patients. Medical Care, 36, 868 – 878. Arnold, B. R., & Matus, Y. E. (2000). Test translation and cultural equivalence methodologies for use with diverse populations. In I. Cuellar & F. A. Paniagua (Eds.), Handbook of multicultural mental health (pp. 121 – 136). San Diego, CA: Academic Press. Azen, S. P., Palmer, J. M., Carlson, M., Mandel, D., Cherry, B. J., Fanchiang, S.-P., et al. (1999). Psychometric properties of a Chinese translation of the SF – 36 Health Survey Questionnaire in the well elderly study. Journal of Aging & Health, 11, 240 – 251. Baldwin, A. Y. (2001). 
Understanding the challenge of creativity among African Americans. Journal of Secondary Gifted Education, 12(3), 121 – 125. Bates, E., D’Amico, S., Jacobsen, T., Székely, A., Andonova, E., Devescovi, A., et al. (2003). Timed picture naming in seven languages. Psychonomic Bulletin & Review, 10, 344 – 380. Beck, C. T., Bernal, H., & Froman, R. D. (2003). Methods to document semantic equivalence of a translated scale. Research in Nursing & Health, 26, 64 – 73. Bedore, L. M., Peña, E. D., García, M., & Cortez, C. (2005). Conceptual versus monolingual scoring: When does it make a difference? Language, Speech & Hearing Services in the Schools, 36, 188 – 200. Berliner, D. C. (1988). Meta-comments: A discussion of critiques of L.M. Dunn’s monograph bilingual Hispanic children on the U.S. Mainland. Hispanic Journal of Behavioral Sciences, 10, 273 – 299. Bracken, B. A., & Barona, A. (1991). State of the art procedures for translating, validating and using psychoeducational tests in cross-cultural assessment. School Psychology International, 12, 119 – 132. Brislin, R. (1986). Back-translation methods: The wording and translation of research instruments. In W. J. Lonner & J. W. Berry (Eds.), Field methods in cross-cultural research (pp. 137 – 164). Beverly Hills, CA: Sage. Canino, G., & Guarnaccia, P. (1997). Methodological challenges in the assessment of Hispanic children and adolescents. Applied Developmental Science, 1, 124 – 134. Caselli, M. C., Bates, E., Casadio, P., Fenson, J., Fenson, L., Sanderl, L., et al. (1995). A cross-linguistic study of early lexical development. Cognitive Development, 10, 159 – 199. Caselli, M. C., Casadio, P., & Bates, E. (1999). A comparison of the transition from first words to grammar in English and Italian. Journal of Child Language, 26, 69 – 111. Coleman, D., & Rowthorn, R. (2004). The economic effects of immigration into the United Kingdom. Population and Development Review, 30, 579 – 624. Cummins, J. (1988). 
"Teachers are not miracle workers": Lloyd Dunn’s call for Hispanic activism. Hispanic Journal of Behavioral Sciences, 10, 263 – 272. Devescovi, A., & D’Amico, S. (2005). The competition model: Crosslinguistic studies of online processing. In M. Tomasello & D. I. Slobin (Eds.), Beyond nature – nurture: Essays in honor of Elizabeth Bates (pp. 165 – 191). Mahwah, NJ: Erlbaum. Dunn, L., & Dunn, L. (1981). Peabody Picture Vocabulary Test – Revised. Circle Pines, MN: American Guidance Service. Dunn, L., Padilla, R., Lugo, S., & Dunn, L. (1986). Test de Vocabulario en Imagenes Peabody. Circle Pines, MN: American Guidance Service. Erkut, S., Alarcón, O., Coll, C. G., Tropp, L. R., & García, H. A. V. (1999). The dual-focus approach to creating bilingual measures. Journal of Cross-Cultural Psychology, 30, 206 – 218. Fagundes, D., Haynes, W., Haak, N., & Moran, M. (1998). Task variability effects on the language test performance of southern lower socioeconomic class African American and Caucasian five-year-olds. Language, Speech & Hearing Services in the Schools, 29, 148 – 157. Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., & Pethick, S. J. (1994). Variability in early communicative development. Monographs of the Society for Research in Child Development, 59(5, Serial No. 242). Fenson, L., Dale, P. S., Reznick, J. S., Thal, D., Bates, E., Hartung, J. P., et al. (1993). MacArthur Communicative Development Inventories: User’s guide and technical manual. Baltimore: Brookes. Gartstein, M. A., & Rothbart, M. K. (2003). Studying infant temperament via the Revised Infant Behavior Questionnaire. Infant Behavior & Development, 26, 64 – 86. Gartstein, M. A., Slobodskaya, H. R., & Kinsht, I. A. (2003). Cross-cultural differences in temperament in the first year of life: United States of America (U.S.) and Russia. International Journal of Behavioral Development, 27, 316 – 328. Geisinger, K. F. (1994). 
Cross-cultural normative assessment: Translation and adaptation issues influencing the normative interpretation of assessment instruments. Psychological Assessment, 6, 304 – 312. Greenfield, P. M. (1994). Independence and interdependence as developmental scripts: Implications for theory, research, and practice. In P. M. Greenfield & R. R. Cocking (Eds.), Cross-cultural roots of minority child development (pp. 1 – 37). Hillsdale, NJ: Erlbaum. Greenfield, P. M. (1997). You can’t take it with you: Why ability assessments don’t cross cultures. American Psychologist, 52, 1115 – 1124. Greenfield, P. M., Trumbull, E., Keller, H., Rothstein-Fisch, C., Suzuki, L., & Quiroz, B. (2006). Cultural conceptions of learning and development. In P. A. Alexander, P. R. Pintrich, & P. H. Winne (Eds.), Handbook of educational psychology (2nd ed., pp. 675 – 694). Mahwah, NJ: Erlbaum. Grisay, A. (2003). Translation procedures in OECD/PISA 2000 international assessment. Language Testing, 20, 225 – 240. Hambleton, R. K. (2001). The next generation of the ITC test translation and adaptation guidelines. European Journal of Psychological Assessment, 17, 164 – 172. Hamilton, A., Plunkett, K., & Schafer, G. (2000). Infant vocabulary development assessed with a British Communicative Development Inventory. Journal of Child Language, 27, 689 – 705. Hendrickson, S. G. (2003). Beyond translation . . . Cultural fit. Western Journal of Nursing Research, 25, 593 – 608. Jackson-Maldonado, D., Thal, D., Marchman, V. A., Bates, E., & Gutierrez-Clellen, V. (1993). Early lexical development in Spanish-speaking infants and toddlers. Journal of Child Language, 20, 523 – 549. Katzir, T., Shaul, S., Breznitz, Z., & Wolf, M. (2004). The universal and the unique in dyslexia: A cross-linguistic investigation of reading and reading fluency in Hebrew- and English-speaking children with reading disorders. Reading & Writing, 17, 739 – 768. Kemnitz, A. (2003). Immigration, unemployment and pensions. 
Scandinavian Journal of Economics, 105, 31 – 48. Kim, M., Han, H.-R., & Phillips, L. (2003). Metric equivalence assessment in cross-cultural research: Using an example of the Center for Epidemiological Studies – Depression Scale. Journal of Nursing Measurement, 11, 5 – 18. Lew-Williams, C., & Fernald, A. (2007). Young children learning Spanish make rapid use of grammatical gender in spoken word recognition. Psychological Science, 18, 193 – 198. Livi Bacci, M. (2004). The population of the developed countries: Decreasing returns? Review of Economic Conditions in Italy, January-April (1), 27 – 50. Lyytinen, P., Poikkeus, A. M., Leiwo, M., Ahonen, T., & Lyytinen, H. (1996). Parents as informants of their child’s vocal and early language development. Early Child Development and Care, 126, 15 – 25. MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. Mahwah, NJ: Erlbaum. Maitel, S. L., Dromi, E., Sagi, A., & Bornstein, M. H. (2000). The Hebrew Communicative Development Inventory: Language specific properties and cross-linguistic generalizations. Journal of Child Language, 27, 43 – 67. Marian, V., & Kaushanskaya, M. (2004). Self-construal and emotion in bicultural bilinguals. Journal of Memory & Language, 51, 190 – 201. McCollum, J. A., & Chen, Y.-J. (2001). Maternal roles and social competence: Parent – infant interactions in two cultures. Early Child Development and Care, 166, 119 – 133. Mercer, J. (1988). Ethnic differences in IQ scores: What do they mean? (A response to Lloyd Dunn). Hispanic Journal of Behavioral Sciences, 10, 199 – 218. Miller, J., & Iglesias, A. (2005). Systematic Analysis of Language Transcripts – SALT V9. University of Wisconsin, Language Analysis Laboratory, Waisman Center. Muñiz, J., Hambleton, R. K., & Xing, D. (2001). Small sample studies to detect flaws in item translations. International Journal of Testing, 1, 115 – 135. Pearson, B. Z., & Fernandez, S. C. (1994). 
Patterns of interaction in the lexical growth in two languages of bilingual infants and toddlers. Language Learning, 44, 617 – 653. Peña, E. D. (2001). Assessment of semantic knowledge: Use of feedback and clinical interviewing. Seminars in Speech and Language, 22, 51 – 63. Peña, E. D., Bedore, L. M., & Rappazzo, C. (2003). Comparison of Spanish, English, and bilingual children’s performance across semantic tasks. Language, Speech & Hearing Services in the Schools, 34, 5 – 16. Peña, E. D., Gutierrez-Clellen, V. F., Iglesias, A., Goldstein, B., & Bedore, L. M. (2007). Bilingual English Spanish assessment. Manuscript in preparation. Peña, E. D., & Jackson, J. (2000). The social and cultural bases of communication. In R. Gillam, T. Marquardt, & F. Martin (Eds.), Communication sciences & disorders: From science to clinical practice (pp. 63 – 84). San Diego, CA: Singular. Posada, G., Carbonell, O. A., Alzate, G., & Plata, S. J. (2004). Through Colombian lenses: Ethnographic and conventional analyses of maternal care and their associations with secure base behavior. Developmental Psychology, 40, 508 – 518. Poulin-Dubois, D., Graham, S., & Sippola, L. (1995). Early lexical development: The contribution of parental labeling and infants’ categorization abilities. Journal of Child Language, 22, 325 – 343. Prewitt Diaz, J. O. (1988). Assessment of Puerto Rican children in bilingual education programs in the United States: A critique of Lloyd M. Dunn’s monograph. Hispanic Journal of Behavioral Sciences, 10, 237 – 252. Publication Manual of the American Psychological Association (5th ed.). (2001). Washington, DC: American Psychological Association. Restrepo, M. A., & Silverman, S. W. (2001). Validity of the Spanish Preschool Language Scale – 3 for use with bilingual children. American Journal of Speech-Language Pathology, 10, 382 – 393. Rogers, W. T., Gierl, M. J., Tardif, C., Lin, J., & Rinaldi, C. (2003). 
Differential validity and utility of successive and simultaneous approaches to the development of equivalent achievement tests in French and English. Alberta Journal of Educational Research, 49, 290 – 304. Rogler, L. H. (1999). Methodological sources of cultural insensitivity in mental health research. American Psychologist, 54, 424 – 433. Sechrest, L., Fay, T. L., & Hafeez Zaidi, S. M. (1972). Problems of translation in cross-cultural research. Journal of Cross-Cultural Psychology, 3, 41 – 56. Shah, C., & Long, M. (2003). Employment changes and job openings for new entrants in nursing and caring occupations in Australia. Australian Journal of Labour Economics, 6(3), 453 – 471. Shin, H. B., & Bruno, R. (2003, October). Language use and English speaking ability: 2000. Census 2000 brief. U.S. Census Bureau. Retrieved June 4, 2007, from http://www.census.gov/prod/2003pubs/c2kbr-29.pdf. Sireci, S. G., & Berberoglu, G. (2000). Using bilingual respondents to evaluate translated-adapted items. Applied Measurement in Education, 13, 229 – 248. Slobin, D. I. (1985). Crosslinguistic evidence for the language-making capacity. In D. I. Slobin (Ed.), The crosslinguistic study of language acquisition: Vol. 2. Theoretical issues (pp. 1157 – 1256). Hillsdale, NJ: Erlbaum. Standards for educational and psychological testing. (1999). Washington, DC: American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Suizzo, M.-A. (2004). French and American mothers’ childrearing beliefs: Stimulating, responding, and longterm goals. Journal of Cross-Cultural Psychology, 35, 606 – 626. Tamayo, J. (1987). Frequency of use as a measure of word difficulty in bilingual vocabulary test construction and translation. Educational & Psychological Measurement, 47, 893 – 902. Tamayo, J. (1990). A validated translation into Spanish of the WISC – R vocabulary subtest words. Educational & Psychological Measurement, 50, 915 – 921. 
Tardif, T., Gelman, S., & Xu, F. (1999). Putting the ‘‘noun bias’’ in context: A comparison of English and Mandarin. Child Development, 70, 620 – 635. Towse, J., & Saxton, M. (1998). Mathematics across national boundaries: Cultural and linguistic perspectives on numerical competence. In C. Donlan (Ed.), The development of mathematical skills (pp. 129 – 150). Hove, UK: Psychology Press/Taylor & Francis. Trueba, H. T. (1988). Comments on L.M. Dunn’s bilingual Hispanic children on the U.S. mainland: A review of research on their cognitive, linguistic, and scholastic development. Hispanic Journal of Behavioral Sciences, 10, 253 – 262. Valencia, R., & Rankin, R. J. (1985). Evidence of content bias on the McCarthy scales with Mexican American children: Implications for test translation and nonbiased assessment. Journal of Educational Psychology, 77, 197 – 207. van der Veer, K., Ommundsen, R., Hak, T., & Larsen, K. S. (2003). Meaning shift of items in different language versions. A cross-national validation study of the Illegal Aliens Scale. Quality & Quantity: International Journal of Methodology, 37, 193 – 206. van der Veer, K., Ommundsen, R., Larsen, K. S., Van Le, H., Krumov, K., Pernice, R. E., et al. (2004). Structure of attitudes toward illegal immigration: Development of cross-national cumulative scales. Psychological Reports, 94, 897 – 906. Vigil, D. C. (2002). Cultural variations in attention regulation: A comparative analysis of British and Chinese populations. International Journal of Language & Communication Disorders, 37, 433 – 458. Willig, A. C. (1988). A case of blaming the victim: The Dunn monograph on bilingual Hispanic children on the U.S. Mainland. Hispanic Journal of Behavioral Sciences, 10, 219 – 236. Zambrano, I., & Greenfield, P. M. (2004). Ethnoepistemologies at home and at school. In R. J. Sternberg & E. L. Grigorenko (Eds.), Culture and competence (pp. 251 – 272). Washington, DC: American Psychological Association. 
Zimmerman, I., Steiner, V., & Pond, R. (1992). Preschool Language Scale – 3. San Antonio, TX: The Psychological Corporation. Zimmerman, I., Steiner, V., & Pond, R. (1993). Preschool Language Scale – 3 Spanish Edition. San Antonio, TX: The Psychological Corporation. Zimmerman, I., Steiner, V., & Pond, R. (2002). Preschool Language Scale – 4 Spanish Edition. San Antonio, TX: The Psychological Corporation. Zuckerman, H. (1988). The sociology of science. In N. J. Smelser (Ed.), Handbook of sociology (pp. 511 – 574). Thousand Oaks: Sage.
ATTITUDES AND SOCIAL COGNITION

What Happens If We Compare Chopsticks With Forks? The Impact of Making Inappropriate Comparisons in Cross-Cultural Research

Fang Fang Chen
University of Delaware

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

It is a common practice to export instruments developed in one culture to another. Little is known about the consequences of making inappropriate comparisons in cross-cultural research. Several studies were conducted to fill this gap. Study 1 examined the impact of lacking factor loading invariance on regression slope comparisons. When factor loadings of a predictor are higher in the reference group (e.g., United States), for which the scale was developed, than in the focal group (e.g., China), into which the scale was imported, the predictive relationship (e.g., self-esteem predicting life satisfaction) is artificially stronger in the reference group but weaker in the focal group, creating a bogus interaction effect of predictor by group (e.g., self-esteem by culture); the opposite pattern is found when the reference group has higher loadings in an outcome variable. Studies 2 and 3 examined the impact of lacking loading and intercept (i.e., point of origin) invariance on factor means, respectively. When the reference group has higher loadings or intercepts, the mean is overestimated in that group but underestimated in the focal group, resulting in a pseudo group difference.

Keywords: cross-cultural comparison, measurement invariance, construct equivalence, bias in regression slopes and means, self-esteem

Culture affects people in a variety of basic psychological domains, including self-concept, attribution and reasoning, interpersonal communication, negotiation, intergroup relations, and psychological well-being (for review, see Brewer & Chen, 2007; Fiske, Kitayama, Markus, & Nisbett, 1998; Lehman, Chiu, & Schaller, 2004; Markus & Kitayama, 1991; Oyserman, Coon, & Kemmelmeier, 2002). Suppose we were interested in studying self-esteem and life satisfaction in the People’s Republic of China and the United States. We may wish to test the mean differences between the two cultural groups on the two constructs and, further, to examine whether the relationship of self-esteem to life satisfaction is stronger in one culture than in the other. Could we simply use scales developed in one culture, such as Rosenberg’s self-esteem scale (Rosenberg, 1965), in both cultural groups and then compare the results? To make valid comparisons across different cultural or ethnic groups, we must address an important question: Are we comparing the same constructs across different groups?

What Is Measurement Invariance and Why Is It Important in Cross-Cultural Research?

When we compare scale scores, such as self-esteem, across different groups, we make a critical assumption that the scale measures the same construct in all of the groups. If that assumption is true, comparisons and analyses of those scores are valid, and subsequent interpretations are meaningful. However, if that assumption does not hold, such comparisons do not produce meaningful results. This is the general issue of measurement invariance. Measurement invariance is the equivalence of a measured construct in two or more groups, such as people from different cultures. It assures that the same constructs are being assessed in each group. Measurement invariance is an important issue if a researcher wishes to make group comparisons (e.g., Byrne & Watkins, 2003; Reise, Widaman, & Pugh, 1993; Riordan & Vanderberg, 1994; Van de Vijver & Leung, 1997; Widaman & Reise, 1997). 
Meaningful comparisons of statistics, such as means and regression coefficients, can only be made if the measures are comparable across different groups. Cross-cultural researchers have long recognized the importance of ensuring construct comparability in different cultural or ethnic groups (Berry, 1969; Irvine & Carroll, 1980; Poortinga, 1989; Van de Vijver & Leung, 1997). However, it is the development of measurement invariance tests (Jöreskog, 1971; Meredith, 1993; Millsap & Everson, 1993; Sörbom, 1978; Widaman & Reise, 1997) and the recent development of advanced statistical tools that have made it possible to perform rigorous tests of measurement invariance.

Author note: I would like to express appreciation to Donna Coffman, Larry Cohen, Samuel Gaertner, Kimberly Juliano, Shanhong Luo, Beth Morling, Kristopher Preacher, Robert Simons, Stephen West, and Zugui Zhang for their thoughtful comments. Special thanks go to Lyle Jones and Roger Millsap for their insights on scale development and measurement invariance. I am also grateful to the Quantitative Forum in the Psychology Department at the University of North Carolina at Chapel Hill for fruitful discussion at the early stage of this work. Correspondence concerning this article should be addressed to Fang Fang Chen, Department of Psychology, University of Delaware, Wolf Hall, Newark, DE 19716. E-mail: xiyu@psych.udel.edu

Journal of Personality and Social Psychology, 2008, Vol. 95, No. 5, 1005–1018. Copyright 2008 by the American Psychological Association. 0022-3514/08/$12.00 DOI: 10.1037/a0013193

Measurement invariance can be tested when a scale is composed of multiple items or subscales. 
With continuous variables, the most frequently used technique for testing measurement invariance is multiple-group confirmatory factor analysis (CFA; F. F. Chen, 2007; F. F. Chen, Sousa, & West, 2005; F. F. Chen & West, 2008; Meredith, 1993; Millsap & Everson, 1993; Widaman & Reise, 1997). In factor analytic terms, the items serve as indicators of the common factor (i.e., the construct that the items are intended to measure) in a CFA model. The basic idea of applying multiple-group CFA to test measurement invariance is to examine the interrelations between the indicators (i.e., items or subscales) and the factors that the indicators are supposed to measure. Multiple-group CFA can be used to test the equivalence of the factor structure (i.e., number of factors), factor loadings (i.e., unit of a scale), intercepts (i.e., origin of a scale), residual variance (i.e., precision of a scale), and other aspects of a construct across different groups in a series of hierarchical models. The most basic level of measurement invariance is known as configural invariance (Horn, McArdle, & Mason, 1983) or factor-form invariance (Cheung & Rensvold, 2000). It tests whether similar, but not identical, factors are measured in the groups (Widaman & Reise, 1997). The same item must be associated with the same latent factor in each group, but the factor loadings can differ across groups. The second level of invariance is factor loading or metric invariance. Factor loadings represent the strength of the relationships between each factor and its associated items (Bollen, 1989; Jöreskog & Sörbom, 1999). Factor loadings can be conceptualized as the slopes of regression lines, that is, the weights obtained by regressing the item responses on the underlying latent factors. When factor loadings are equal, the unit of the measurement is identical, and thus predictive relationships can be compared across groups. The third level of invariance is intercept or scalar invariance. 
It tests whether an item has the same point of origin across different groups. When invariance is achieved at both the factor loading and intercept levels, scores from different groups have the same unit of measurement (i.e., factor loading) as well as the same origin (i.e., intercept), and thus factor means can be compared across groups. Otherwise, it is not certain whether group differences on factor means are attributable to valid cultural differences or to measurement artifacts. The fourth level is the invariance of residual variance. It tests the equivalence of the precision of a scale (see Footnote 1). Measurement invariance can be used to test the invariance of a scale (i.e., an omnibus test in which all items are tested simultaneously) as well as the invariance of individual items (i.e., planned contrast in which one or more items are tested). When items meet the standards of measurement invariance, they are considered invariant; otherwise, they are defined as non-invariant, lacking invariance, or having measurement bias. It is possible that some of the items in a given scale are invariant, whereas others are not. For detailed procedures on testing measurement invariance and criteria on evaluating measurement invariance, see F. F. Chen (2007); F. F. Chen, Sousa, and West (2005); and Widaman and Reise (1997).

What Factors Can Cause Lack of Measurement Invariance?

When scale scores are compared across different cultural groups, a variety of sources can affect the equivalence of the construct. Lack of configural invariance (i.e., the number of factors that underlies a construct is different) is most likely to occur when a construct is simply imported from one cultural setting to another, because a construct can be more differentiated in one culture than in another. For example, the concept of individuation (Maslach, Stapp, & Santee, 1985) is best represented by two factors in China, whereas it is unidimensional in the United States (Kwan, Bond, Boucher, Maslach, & Gan, 2002). 
Similarly, filial piety is also a more elaborated concept in China than in the United States (Hsieh, 1967). Lack of loading invariance (i.e., unit of a scale) is likely to arise from multiple causes. First, it can happen when a scale is imported from one culture, such as the United States, to another, such as China, but the definitions and meanings of that concept do not fully overlap across different cultures. As a result, the item content is more appropriate for one culture than for the other. For example, for North Americans, self-esteem mainly stems from having unique personal attributes and individual achievements. In contrast, for people from Eastern cultures, the self is deeply connected with family, friends, groups, etc., and thus the sense of “we” and interdependence with others may be the most important source of self-esteem. Consequently, items that tap the Western view of self-esteem, such as “I am a person of worth,” and “I feel that I have a number of good qualities,” may not be good indicators of self-esteem in an Eastern context. The association between Chinese participants’ self-esteem and endorsement of Western items (i.e., factor loadings) may be weaker than for American participants. Second, lack of loading invariance can come from inappropriate translation. When items are translated from one language to another, their meanings can change, particularly for idiomatic expressions. For example, items like “I feel blue” as a measure of depression would make Chinese participants feel that this item is out of the blue. The American participants would thus respond to the content of the item, whereas the Chinese participants would give inconsistent answers. As a result, the strength of the relationship (i.e., factor loading) between the items and the depression construct would be weaker for the Chinese participants than for their American counterparts. 
Third, response sets, particularly the tendency to use or avoid extreme responses, can result in lack of loading invariance. For example, evidence suggests that U.S. participants have an inclination to use the extreme ends of a response scale, whereas Chinese participants are more likely to use the middle points (C. Chen, Lee, & Stevenson, 1995; Hui & Triandis, 1985), resulting in a restricted range of responses among the Chinese participants. Accordingly, factor loadings differ across the two groups.

Footnote 1. Residual variance consists of unique variance to that item and random error. It is assumed that the expected value of the random error is zero.

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Several factors can affect the origin of a scale, that is, the intercept of the scale. First, social desirability, a tendency to follow the social norms, can lead participants in one group to consistently give higher or lower ratings than those in other groups (Hui & Triandis, 1985). For example, for the item “How happy were you in the past week?” the true happy state might be 3 on a 5-point scale for participants from both the United States and China. However, the American may respond with 4 because of the need to preserve positive self-esteem (e.g., Heine, Lehman, Markus, & Kitayama, 1999). Second, when a group is preoccupied with its own defects or deficiencies, it may convey a stronger desire for these values or traits. For example, survey ratings indicate that some minority parents and students value the importance of education more than do their European and Asian counterparts. However, behavioral observations, such as the amount of time that students stay in school and study, tell a different story (cf. Peng, Nisbett, & Wong, 1997).
Third, people from different cultural groups may use different reference frameworks in making judgments about themselves. For example, current trait or attitude measures of individualism and collectivism often fail to reveal the expected cultural differences. However, when participants from Japan and Canada were asked to compare themselves with either Canadians or Japanese, the expected cultural differences were enhanced when the cross-reference group was used (Heine, Lehman, Peng, & Greenholtz, 2002). Under all three scenarios, the origin of a scale would be different. A 3 in Culture A may be equal to a 4 in Culture B, resulting in lack of intercept invariance.

Given the comparative nature of the studies, it is quite a challenging task to achieve measurement invariance in cross-cultural research, particularly when we simply apply instruments developed in one culture to other cultural contexts. However, this is a common practice in applied research. To what extent are these scales invariant cross-culturally, and how confident are we about the conclusions drawn from these studies? To address these issues, a literature review was conducted on the instruments used in cross-cultural studies.

Are the Instruments Comparable Cross Culturally? Analysis of the Current Practice

The following key words and 30 other similar words were used to search articles published from 1993 (see Footnote 2) to 2006 in the PsycINFO database: “cross-cultural invariance,” “factor invariance,” “measurement invariance.” One hundred thirty comparisons (see Footnote 3) met the following selection criteria: (a) the instrument was originally developed in North America, (b) Caucasian Americans/Canadians were used as the reference group, (c) the article was published in a peer-reviewed journal, and (d) factor loadings of each cultural or ethnic group were reported or obtained upon request. Analyses were performed to examine the pattern and severity of factor-loading differences across the cultural or ethnic comparisons.
The analysis results, such as effect size, pattern of non-invariance, and sample size, were used as the basis for conducting subsequent simulated studies, in which bias in regression slopes and means resulting from lacking measurement invariance was examined. Following the convention of Holland and Thayer (1988), the mainstream cultural group is defined as the reference group (e.g., United States), and the other ethnic minority or cultural groups are defined as focal groups (e.g., China). To clarify the nature of loading differences, two patterns of non-invariance are defined in this review: (a) When all the non-invariant loadings are higher in the reference group than in the focal group, it is classified as a uniform pattern of non-invariance; (b) when some of the non-invariant loadings are higher in the reference group and some are higher in the focal group, it is classified as a mixed pattern of non-invariance. In both cases, the magnitude of the loading difference is the numerical difference between the loadings for a given item across two groups.

Among the 130 cross-cultural and cross-ethnic comparisons, 9 lacked configural invariance, which means that the number of factors that underlie the items was different across groups. These cases are excluded from further analysis, because it is not meaningful to compare factor loadings when configural invariance is not achieved. In the remaining 122 comparisons, 97 were based on standardized factor loadings, and 25 were based on unstandardized factor loadings. Further analyses are based on the standardized factor loadings, because unstandardized factor loadings are subject to scaling, which prevents direct comparisons across studies. For 74 of the 97 standardized comparisons (74.2%), the average loading was higher in the reference group than in the focal group, and the average loading difference was .13 (SD = .08).
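The two patterns defined above can be written as a small classification rule. The following is a minimal sketch rather than the author's procedure: the function name `classify_pattern`, the cutoff `tol` for deciding when a loading difference counts as non-invariant, and the example loadings are all hypothetical. The review does not state a numerical threshold, and the case in which every non-invariant loading is higher in the focal group is not named in the text, so the sketch returns "unclassified" for it.

```python
from typing import Sequence


def classify_pattern(ref: Sequence[float], focal: Sequence[float], tol: float = 0.05) -> str:
    """Classify loading differences between a reference and a focal group.

    `tol` is an illustrative cutoff (an assumption, not a value from the
    review): differences with absolute value <= tol are treated as invariant.
    """
    if len(ref) != len(focal):
        raise ValueError("ref and focal must have one loading per item")
    # Loading difference = reference loading minus focal loading, item by item.
    diffs = [r - f for r, f in zip(ref, focal)]
    noninvariant = [d for d in diffs if abs(d) > tol]
    if not noninvariant:
        return "invariant"
    if all(d > 0 for d in noninvariant):
        return "uniform"  # every non-invariant loading is higher in the reference group
    if any(d > 0 for d in noninvariant) and any(d < 0 for d in noninvariant):
        return "mixed"  # some higher in the reference group, some in the focal group
    return "unclassified"  # all higher in the focal group; not named in the text


# Hypothetical standardized loadings for a four-item scale.
print(classify_pattern([0.80, 0.75, 0.70, 0.65], [0.60, 0.62, 0.70, 0.50]))  # uniform
print(classify_pattern([0.80, 0.55, 0.70, 0.65], [0.60, 0.75, 0.70, 0.50]))  # mixed
```

In practice, whether a loading is non-invariant would be decided by a statistical test rather than a fixed cutoff; the cutoff here only keeps the illustration short.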
Although the magnitude of the average loading difference between the groups appears small, its impact may not be trivial. Findings further indicate that 14 of the 97 comparisons (14.4%) had all loadings higher in the reference group (e.g., United States) than in the focal group (e.g., China), showing a uniform pattern of non-invariance. However, it was more common that only a proportion of the items, rather than all items, had higher loadings in the reference group than in the focal group: 26 of the comparisons (26.8%) had at least 90% of the loadings higher in the reference group, 48 of the comparisons (49.5%) had at least 75% of the loadings higher in the reference group, 81 of the comparisons (83.5%) had at least 50% of the loadings higher in the reference group, and 94 of the comparisons (96.9%) had at least 30% of the loadings higher in the reference group. It is interesting that 7 of the 97 comparisons (7.2%) had about half of the loadings higher in the reference group and the other half higher in the focal group, showing a mixed pattern of non-invariance. Given these findings, it is important to examine bias in group comparisons resulting from a proportion of non-invariant items, in addition to bias associated with the condition in which all loadings are higher in one group than in the other. It is also meaningful to investigate bias associated with the pattern of non-invariance, that is, whether the non-invariant loadings are uniformly higher in one group or the pattern is mixed. Although no studies have systematically examined the pattern of factor loadings across different cultural and ethnic groups, the findings from this review are consistent with the literature on reliabilities. For example, in compensatory education research, test scores obtained from the disadvantaged minority groups often have lower reliability, compared with those of the advantaged group (Campbell & Boruch, 1975). 
Reviews of self-reported measures on values indicate that higher reliability was more often reported in the American samples than in other cultural groups (Peng et al., 1997). The lower reliability in the focal groups is a reasonable indication of lower factor loadings and is thus a sign of measurement bias (see Footnote 4).

Footnote 2. In 1993, Meredith published the most influential article on measurement invariance.

Footnote 3. The search ended in August of 2006. Thirty-two more articles met the first three criteria, but the factor loadings for each cultural/ethnic group were not available, and these articles were thus excluded from the analysis.

Given that reliabilities are more routinely reported than factor loadings in published articles, a second search was conducted. To limit the scope of the search, Rosenberg’s (1965) self-esteem scale was chosen, as it is perhaps one of the most widely used scales cross-culturally. Using key words “culture and Rosenberg self-esteem,” and “cross-cultural and Rosenberg self-esteem,” a search was performed on the PsycINFO database, and it was limited to articles published from 1995 to 2006. Seventy-five comparisons met the following criteria: (a) Rosenberg’s self-esteem measure was used cross-culturally or cross-ethnically, (b) Caucasian Americans/Canadians were used as the reference group, (c) the article was published in a peer-reviewed journal, and (d) reliability of the scale for each cultural or ethnic group was reported or obtained upon request. In 59 of the 75 comparisons (78.7%), reliability of the Rosenberg self-esteem scale was higher in the Caucasian Americans or Canadians than in the other cultural or ethnic group(s), and the average difference in reliability was .07 (U.S./Canada: M = .87, SD = .02 vs. non-U.S./Canada: M = .80, SD = .08). This pattern is particularly true when comparing North Americans with Asians, because in 18 of the 21 comparisons (85.7%), scores of North Americans had higher reliability than the scores of Asians, and the difference in reliability was .09 (U.S./Canada: M = .87, SD = .02 vs. non-U.S./Canada: M = .78, SD = .05). This analysis also indicates that, consistent with the literature, North Americans have higher self-esteem than other cultural or ethnic groups (Cohen’s d = .31), and this difference is moderately large between North Americans and Asians (Cohen’s d = .59). However, it is possible that the lower reliability in the focal groups is, at least in part, responsible for the commonly reported cultural and ethnic difference in self-esteem.

What Happens When Instruments Are Not Comparable Cross-Culturally? The Present Simulation Studies

When we compare diverse groups on the basis of instruments that do not have the same psychometric properties, we may discover erroneous “group differences” that are in fact artifacts of measurement, or we may miss true group differences that have been masked by these artifacts. As a result, a harmful education program may be regarded as beneficial to the students, or an effective health intervention program may be considered of no use to depressive patients. Although measurement invariance has been increasingly tested in cross-cultural comparisons (e.g., Byrne & Campbell, 1999; Little, 1997; Rhee, Uleman, & Lee, 1996; Steenkamp & Baumgartner, 1998), it is still usually assumed, rather than tested. The author’s review of articles in the Journal of Personality and Social Psychology from 1985 to 2005 indicates that although 48 articles involved cross-cultural comparisons of attitudes, values, personality, and other self-reported surveys, only 8 studies (less than 17%) tested measurement invariance across different cultural groups, with the remainder using a sum score or mean score.
The sum-score approach takes the total score of the items in a scale, and similarly, the mean score takes the average of the items. Both approaches assume that the measures under study are invariant across different groups. In addition, it is not uncommon to pool participants from different cultural or ethnic groups for evaluation, a procedure that assumes measurement invariance as well. However, as discovered in the author’s review, this assumption does not hold in many applications.

To explore the consequences of making comparisons based on non-invariant measures on the conclusions drawn from a study, Millsap and Kwok (2004) conducted an important series of simulation studies. Given that school admission committees or employers often select students or employees from different ethnic or cultural backgrounds, Millsap and Kwok examined selection bias based on a criterion that is only partially invariant. Selection bias was defined by the accuracy of classifying people according to two standards: a factor score for each group and a composite score in which the group difference in factor loadings was ignored (see Footnote 5). Four categories were created: (a) true positive, should be selected on the basis of the factor score and was selected on the basis of the composite score; (b) true negative, should not be selected on the basis of the factor score and was not selected on the basis of the composite score; (c) false positive, should not be selected but was selected; and (d) false negative, should be selected but was not selected. It was found that even small group differences in factor structure could have substantial influence on selection accuracy, particularly for sensitivity, which is the number of individuals who were selected on the basis of both their factor score and their composite score divided by the number of individuals who were selected solely on the basis of their factor score.
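The four selection categories and the sensitivity ratio described above can be sketched in a few lines. This is an illustration under stated assumptions, not Millsap and Kwok's simulation code: the function names `selection_table` and `sensitivity` and the six example decisions are hypothetical, and each person is represented only by two yes/no selection decisions (one from the factor score, one from the composite score).

```python
from typing import Dict, Sequence


def selection_table(by_factor: Sequence[bool], by_composite: Sequence[bool]) -> Dict[str, int]:
    """Count the four categories; the factor-score decision is treated as the standard."""
    if len(by_factor) != len(by_composite):
        raise ValueError("both sequences must describe the same people")
    counts = {"true_positive": 0, "true_negative": 0, "false_positive": 0, "false_negative": 0}
    for should_select, was_selected in zip(by_factor, by_composite):
        if should_select and was_selected:
            counts["true_positive"] += 1
        elif not should_select and not was_selected:
            counts["true_negative"] += 1
        elif was_selected:
            counts["false_positive"] += 1  # selected, but should not have been
        else:
            counts["false_negative"] += 1  # should have been selected, but was not
    return counts


def sensitivity(counts: Dict[str, int]) -> float:
    """Selected under both standards, divided by all selected under the factor score."""
    should_be_selected = counts["true_positive"] + counts["false_negative"]
    if should_be_selected == 0:
        raise ValueError("nobody is selected on the basis of the factor score")
    return counts["true_positive"] / should_be_selected


# Hypothetical decisions for six applicants.
factor_decisions = [True, True, True, True, False, False]
composite_decisions = [True, True, False, False, True, False]
table = selection_table(factor_decisions, composite_decisions)
print(table)
print(sensitivity(table))  # 0.5
```

With two of the four deserving applicants missed by the composite score, sensitivity is 2 / 4 = 0.5.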
For example, when the proportion of non-invariance varied from 0% (control condition) to 75%, sensitivity could drop from 64.2% to 22.1%.

No studies have examined the bias that lack of measurement invariance may introduce to commonly used statistics, such as means and regression slopes, in group comparisons. For example, suppose we were interested in asking whether self-esteem would predict life satisfaction to the same degree for Chinese as for Caucasian students. In what direction and to what extent would the predictive relationship (i.e., the beta weight or regression slope) be affected by lack of invariance in the self-esteem measure? How would the relationship be biased if the outcome measure, life satisfaction, also lacks invariance? In what direction would the group means for self-esteem and life satisfaction be biased? To address these issues, three simulation studies were conducted to fill this gap.

Overview of Present Simulation Studies

Many researchers have discussed the importance of testing measurement invariance and are well aware that lack of invariance can lead to possible bias in conclusions (e.g., Widaman & Reise, 1997). However, this is the first investigation that examines both the direction and degree of bias resulting from various forms of non-invariance in cross-cultural research. This information could be vital to researchers when interpreting findings based on non-invariant measures, because it can warn readers by specifying the direction and degree of bias in each cultural or ethnic group, given that the requirements for measurement invariance are often difficult to meet in applied research. Second, this is also the first study in which the simulation conditions are based on the empirical findings in the cross-cultural literature, and it therefore maximizes the external validity of the study. Third, this investigation is particularly relevant to the cross-cultural study of personality and social psychological phenomena.

Footnote 4. Reliability also depends on the variability of the scores.

Footnote 5. A composite score is formed by taking the sum of all items.

There are three major goals in the present investigation: (a) to examine bias in regression slopes (beta weights) when factor loadings are not invariant, as factor loading invariance is a prerequisite for regression slope comparisons (e.g., When using self-esteem to predict subjective well-being, how would the predictive relationship be affected if the factor loadings of self-esteem were different across groups?); (b) to explore bias in means when factor loadings are not invariant, because factor loading invariance is also a prerequisite for proper mean comparisons (e.g., How would group means be biased when factor loadings of self-esteem differ?); (c) to investigate bias in means when intercepts (i.e., point of origin) are not invariant, as intercept invariance is a prerequisite for mean comparisons, in addition to factor loading invariance (Widaman & Reise, 1997; e.g., When one group has higher intercepts in self-esteem than the other group, in what direction would the means be biased in each group?). Given the computational complexity and intensity, the Mplus software program (Muthén & Muthén, 1998) was used to conduct the simulation.
Study 1: Lack of Loading Invariance and Bias in Regression Slopes

As discussed earlier, lack of invariance in factor loadings can come from insufficient overlap in meaning of a construct between cultural groups, inappropriate content of the items, translation problems, the tendency to use or avoid extreme responses on a response scale, differential responses to positively versus negatively worded items, and other sources. Study 1 was conducted to examine predictive bias between two constructs when a predictor or an outcome measure lacks invariance in factor loadings. This would allow us to examine bias in a predictive relationship, such as using self-esteem to predict life satisfaction across groups. When bias is found, one may discover a bogus interaction effect of culture by predictor. For example, self-esteem may be found to be a stronger predictor of life satisfaction for Caucasians than for Chinese, when in fact the relationship is the same for both groups.

Design

To systematically examine bias in regression slopes when the predictor or criterion lacks loading invariance and to maximize the external validity simultaneously, 4 (Proportion of Non-invariance: 87.5%, 75%, 50%, and 25%) × 2 (Pattern of Invariance: uniform vs. mixed) × 2 (Ratio of Sample Size: 1 vs. 1, 4 vs. 1; total N = 300) experimental conditions were generated (see Appendix for detailed model parameters and additional justification for parameter selections). The proportion of non-invariance conditions correspond approximately to the findings in the author’s literature review: 26 of the 97 comparisons had at least 90% of the loadings higher in the reference group, 48 of the comparisons had at least 75% of the loadings higher in the reference group, 81 of them had at least 50% of the loadings higher in the reference group, and 94 of them had at least 30% of the loadings higher in the reference group.
The proportion of non-invariance was varied to serve two purposes: (a) to maximize the external validity of the study (as found in the author’s literature review, in many of the applications, only a proportion of the items, rather than all the items, in a scale are non-invariant); (b) to explore whether the degree of bias corresponds monotonically to the degree of non-invariance (i.e., to examine whether a greater degree of non-invariance in factor loadings leads to a greater degree of bias in regression slopes, which is particularly important when the power of testing measurement is considered). This issue is addressed further in the discussion.

In the uniform pattern of non-invariance condition, all non-invariant loadings were set higher in the reference group (e.g., United States) than in the focal group (e.g., China). In the mixed pattern of non-invariance condition, about half of the items were set higher in the reference group, whereas the other half were set higher in the focal group. This condition was designed to match the finding in the review as well, given that 7 of the 97 comparisons showed this pattern of non-invariance. The ratio of sample size (1 vs. 1 and 4 vs. 1) also reflects the findings in the review, because among 36.4% of the comparisons, the ratio of sample size was less than 1.5, and the average ratio of sample size was 4.67 across all comparisons (see Footnote 6). Finally, given that in applied research, both the predictor and outcome variable may lack invariance, such a condition was also examined. For simplicity, the degree and direction of bias were equivalent in both variables, and only the uniform condition was considered.

The expected mean and covariance structures were generated in Version 3.01 of Mplus (Muthén & Muthén, 1998), and maximum likelihood estimation was used to estimate models. First, a population matrix was generated, corresponding to the parameterization of a target two-group model.
In the target model, the factor loadings were different between the groups (except for the marker variable, which was set equal across the groups; see Footnote 7); all other parameters (i.e., factor variance and covariance, and residual variances) were set equal across the groups. Second, a configural invariance model was fit to the generated population matrix, in which the pattern of the factor loadings was the same (i.e., the same item loaded on the same factor[s]), whereas all loadings were freely estimated in both groups. Third, a factor loading invariance model was fit to the population matrix, in which all the loadings were equated across the groups. Regression slopes obtained from the loading invariance model were compared with the true values in the configural invariance model to determine the direction and degree of bias in the regression slopes.

Footnote 6. The average ratio of sample size was 1.34 and 8.01 for standardized and unstandardized comparisons, respectively. The average of these two numbers was 4.67. An outlier of 47.15 was excluded from the analysis.

Footnote 7. A model is identified only if there is a unique numerical solution for each of the parameters (Ullman, 2001). One common approach is to set one of the factor loadings to 1 for each factor, and that item is called the marker variable.

Results

Tables 1 and 2 present bias in regression slopes when the predictor lacks loading invariance and when the criterion lacks loading invariance, respectively. Table 3 displays the results when both the predictor and the criterion violate loading invariance. Relative bias was calculated by subtracting the true regression slope from the slope estimated in the loading invariance model and then dividing the difference by the true regression slope. A positive value indicates that the slope was overestimated, and a negative value indicates that the slope was underestimated.

Table 1
When a Predictor Is Lack of Factor Loading Invariance: Bias in Regression Slopes (Study 1)

Lack of loading invariance is uniform
Non-invariance    Slope             Relative bias       Slope
proportion (%)    Ref.     Focal    Ref.      Focal     diff.
N = 150 vs. 150
87.5              .676     .945     -9.87%    26.00%    -.269
75.0              .683     .898     -8.93%    19.73%    -.215
50.0              .702     .827     -6.40%    10.27%    -.125
25.0              .725     .780     -3.33%    4.00%     -.055
N = 240 vs. 60
87.5              .725     1.022    -3.33%    36.27%    -.297
75.0              .708     .940     -5.60%    25.33%    -.232
50.0              .731     .877     -2.53%    16.93%    -.146
25.0              .739     .804     -1.47%    7.20%     -.065

Lack of loading invariance is mixed/balanced
Non-invariance    Slope             Relative bias       Slope
proportion (%)    Ref.     Focal    Ref.      Focal     diff.
N = 150 vs. 150
87.5              .738     .763     -1.60%    1.73%     -.025
75.0              .731     .773     -2.53%    3.07%     -.042
50.0              .738     .763     -1.60%    1.73%     -.025
25.0              .744     .756     -.80%     .80%      -.012
N = 240 vs. 60
87.5              .750     .750     .00%      .00%      .000
75.0              .734     .754     -2.13%    .53%      -.020
50.0              .740     .753     -1.33%    .40%      -.013
25.0              .745     .751     -.67%     .13%      -.006

Note. Ref. = reference group; Focal = focal group; diff. = difference. The true slope is .75 for both groups. A positive value in the Relative bias column indicates that the slope is overestimated, and a negative value indicates that the slope is underestimated.

Predictor or criterion lack of loading invariance: uniform. When the predictor, such as self-esteem, lacked loading invariance, the regression slope was underestimated in the reference group (e.g., United States) but overestimated in the focal group (e.g., China). For example, in the case of self-esteem predicting life satisfaction, when self-esteem is a better measure for Americans than for Chinese, the predictive relationship is weaker for Americans than for Chinese, even when the true relationship (as specified in the simulation) is the same for both groups. As a result, an artificial interaction effect of Culture × Self-Esteem is created.
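The relative bias values in Table 1 can be reproduced with a one-line calculation. The sketch below is illustrative only: the function name `relative_bias` is hypothetical, and the formula (estimate minus true value, divided by true value) is the one consistent with the table's sign convention, in which a positive value means the slope is overestimated. The two estimates are the first uniform-pattern row of Table 1 (N = 150 vs. 150, 87.5% non-invariance), and .75 is the true slope given in the table note.

```python
def relative_bias(estimate: float, true_value: float) -> float:
    """Return (estimate - true_value) / true_value; positive means overestimated."""
    if true_value == 0:
        raise ValueError("relative bias is undefined when the true value is zero")
    return (estimate - true_value) / true_value


TRUE_SLOPE = 0.75  # true slope for both groups, from the Table 1 note

# Slopes estimated under the loading invariance model (Table 1, first row).
for label, estimated_slope in [("reference", 0.676), ("focal", 0.945)]:
    print(f"{label}: {relative_bias(estimated_slope, TRUE_SLOPE):+.2%}")
# reference: -9.87%
# focal: +26.00%
```

The same function applies to the factor means in Study 2 by passing the true mean in place of the true slope.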
The degree of bias (i.e., the extent to which the slope is overestimated or underestimated, or the artificially created group difference in the slope) is affected by the proportion of non-invariant items, group membership, and ratio of sample size (i.e., sample size of the reference group vs. focal group). That is, when the proportion of non-invariant items increases, bias increases; bias is bigger in the focal group than in the reference group, especially when the proportion of non-invariance is large. When sample size increases in the reference group relative to the focal group, bias decreases in that group but increases in the focal group.

Table 2
When a Criterion Is Lack of Factor Loading Invariance: Bias in Regression Slopes (Study 1)

Lack of loading invariance is uniform
Non-invariance    Slope             Relative bias       Slope
proportion (%)    Ref.     Focal    Ref.      Focal     diff.
N = 150 vs. 150
87.5              .832     .598     10.93%    -20.27%   .234
75.0              .824     .629     9.87%     -16.13%   .195
50.0              .802     .682     6.93%     -9.07%    .120
25.0              .776     .722     3.47%     -3.73%    .054
N = 240 vs. 60
87.5              .776     .552     3.47%     -26.40%   .224
75.0              .775     .583     3.33%     -22.27%   .192
50.0              .769     .644     2.53%     -14.13%   .125
25.0              .761     .701     1.47%     -6.53%    .060

Lack of loading invariance is mixed
Non-invariance    Slope             Relative bias       Slope
proportion (%)    Ref.     Focal    Ref.      Focal     diff.
N = 150 vs. 150
87.5              .763     .739     1.73%     -1.47%    .024
75.0              .771     .730     2.80%     -2.67%    .041
50.0              .763     .738     1.73%     -1.60%    .025
25.0              .750     .750     .00%      .00%      .000
N = 240 vs. 60
87.5              .752     .750     .27%      .00%      .002
75.0              .768     .746     2.40%     -.53%     .022
50.0              .761     .748     1.47%     -.27%     .013
25.0              .743     .752     -.93%     .27%      -.009

Note. Ref. = reference group; Focal = focal group; diff. = difference. The true slope is .75 for both groups. A positive value in the Relative bias column indicates that the slope is overestimated, and a negative value indicates that the slope is underestimated.

Table 3
When Both the Predictor and Outcome Variable Are Lack of Factor Loading Invariance: Bias in Regression Slopes (Study 1)

Lack of loading invariance is uniform
Non-invariance    Slope             Relative bias       Slope
proportion (%)    Ref.     Focal    Ref.      Focal     diff.
N = 150 vs. 150
87.5              0.750    0.752    0.00%     0.27%     0.002
75.0              0.750    0.753    0.00%     0.40%     0.003
50.0              0.750    0.752    0.00%     0.27%     0.002
25.0              0.750    0.751    0.00%     0.13%     0.001
N = 240 vs. 60
87.5              0.750    0.753    0.00%     0.40%     0.003
75.0              0.750    0.753    0.00%     0.40%     0.003
50.0              0.750    0.753    0.00%     0.40%     0.003
25.0              0.750    0.752    0.00%     0.27%     0.002

Lack of loading invariance is mixed/balanced
Non-invariance    Slope             Relative bias       Slope
proportion (%)    Ref.     Focal    Ref.      Focal     diff.
N = 150 vs. 150
87.5              0.752    0.751    0.27%     0.13%     -0.001
75.0              0.752    0.751    0.27%     0.13%     -0.001
50.0              0.750    0.750    0.00%     0.00%     0.000
25.0              0.750    0.750    0.00%     0.00%     0.000
N = 240 vs. 60
87.5              0.751    0.752    0.13%     0.27%     0.001
75.0              0.751    0.752    0.13%     0.27%     0.001
50.0              0.748    0.759    -0.27%    1.20%     0.011
25.0              0.750    0.750    0.00%     0.00%     0.000

Note. Ref. = reference group; Focal = focal group; diff. = difference.

When the criterion lacked loading invariance, the opposite pattern was found, that is, the regression slope was overestimated in the reference group (e.g., United States) but underestimated in the focal group (e.g., China). Given the same example, when life satisfaction is a more appropriate instrument for Americans than for Chinese, the regression slope is larger for Americans than for Chinese, even when the predictive relationship is the same for both groups. Consequently, lack of invariance in life satisfaction creates a pseudo interaction effect of Culture × Self-Esteem. As in the case when the predictor is non-invariant, degree of bias is affected by the proportion of non-invariance, group membership, and ratio of sample size. That is, when the proportion of non-invariant items increases, bias increases; bias is bigger in the focal group than in the reference group, especially when the proportion of non-invariant items is large.
When the reference group has a larger sample size, bias decreases in that group but increases in the focal group.

Predictor or criterion lack of loading invariance: mixed. When the pattern of non-invariant items in the predictor was mixed, bias in the regression slope was reduced in both groups. Similarly, when the pattern of lack of loading invariance in the criterion was mixed, bias in the regression slope was also reduced in both groups. Thus, when some of the loadings are higher in the reference group and some are higher in the focal group, artificially created group difference in the predictive relationship is reduced because bias associated with the reference group and bias associated with the focal group tend to cancel each other out. However, reduced bias in regression slopes does not imply that the measures are invariant.

Both predictor and criterion lack of loading invariance: uniform. When both the predictor, such as self-esteem, and the outcome variable, such as life satisfaction, lacked loading invariance, and when the direction and degree of non-invariance were comparable in both groups, bias was reduced. However, this result does not imply that using non-invariant measures simultaneously in the predictor and the criterion is the solution to lack of measurement invariance. Instead, it suggests that when lack of invariance occurs in both the predictor and the outcome variable, statistical bias associated with the non-invariant predictor and bias associated with the non-invariant outcome variable tend to cancel each other out.

Summary. The results of Study 1 indicate that lack of factor-loading invariance could lead to substantial bias in regression slopes. The direction of bias depends on whether a predictor or criterion lacks invariance. When the reference group had higher loadings in the predictor, the regression slope was underestimated in the reference group but overestimated in the focal group.
When the reference group had higher loadings in the criterion, the opposite pattern was found. Under both conditions, a bogus interaction effect was produced. However, when some of the loadings were higher in the reference group and some were higher in the focal group, bias in the regression slopes was reduced. When lack of loading invariance occurred in both the predictor and outcome variable, bias was also reduced. However, the construct validity of the scales is still in question, as they may measure different concepts in different cultures.

Study 2: Lack of Loading Invariance and Bias in Means

The goal of Study 2 was to explore bias in means when factor loadings are not invariant, given that loading invariance is a prerequisite for mean comparisons. The experimental conditions were the same as in Study 1, except that the tested model was a one-factor measurement model with no predictor or criterion involved. Intercepts and residual variances were set equal across the groups in the target model. Model fitting procedures were also similar to those in Study 1, except that in Step 2, both factor loadings and intercepts were equated. Relative bias was calculated by subtracting the true factor mean from the mean obtained from the invariance model and then dividing the difference by the true factor mean. A positive value indicates that the mean was overestimated, and a negative value indicates that the mean was underestimated.

Results

Bias in factor means resulting from lack of loading invariance is presented in Table 4. When the reference group (e.g., United States) had higher loadings, the factor mean was overestimated in the reference group but underestimated in the focal group (e.g., China). As a result, an artificial group difference was created. The degree of bias was affected by the proportion of non-invariance, ratio of sample size, and pattern of non-invariance.
That is, when lack of loading invariance was uniform, as the proportion of non-invariant items increased, bias increased; the degree of bias was larger in the focal group than in the reference group. When sample size increased in the reference group relative to the focal group, bias decreased in that group but increased in the focal group. In contrast, when lack of loading invariance was mixed, bias in the factor mean was minimized in both groups. As discussed earlier, lack of bias in the means does not imply the construct is equivalent across groups.

Table 4
When Loadings Lack Invariance: Bias in Factor Means (Study 2)

Lack of loading invariance is uniform
Non-invariance     Mean     Mean     Rel. bias   Rel. bias   Mean
proportion (%)     Ref.     Focal    Ref.        Focal       diff.
N = 150 vs. 150
87.5               4.336    3.556    8.40%       -11.10%     .780
75.0               4.331    3.564    8.28%       -10.90%     .767
50.0               4.302    3.619    7.55%       -9.52%      .683
25.0               4.110    3.888    2.75%       -2.80%      .222
N = 240 vs. 60
87.5               4.086    3.013    2.15%       -24.68%     1.073
75.0               4.086    3.077    2.15%       -23.08%     1.009
50.0               4.084    3.314    2.10%       -17.15%     .770
25.0               4.031    3.872    .77%        -3.20%      .159

Lack of loading invariance is mixed
Non-invariance     Mean     Mean     Rel. bias   Rel. bias   Mean
proportion (%)     Ref.     Focal    Ref.        Focal       diff.
N = 150 vs. 150
87.5               4.100    3.895    2.50%       -2.63%      .205
75.0               3.914    4.084    -2.15%      2.10%       -.170
50.0               3.976    4.024    -.60%       .60%        -.048
25.0               3.992    4.008    -.20%       .20%        -.016
N = 240 vs. 60
87.5               4.059    3.703    1.48%       -7.43%      .356
75.0               4.033    3.865    .83%        -3.37%      .168
50.0               4.013    3.949    .32%        -1.28%      .064
25.0               4.005    3.982    .12%        -.45%       .023

Note. Ref. = reference group; Focal = focal group; diff. = difference. The true mean is 4.00 for both groups.

Study 3: Lack of Intercept Invariance and Bias in Means

Study 3 was conducted to investigate the impact of lack of intercept (i.e., point of origin) invariance on factor means, given that intercept invariance is the prerequisite for factor mean comparisons.
A 4 (Proportion of Non-invariance: 100%, 75%, 50%, 25%) × 2 (Pattern of Invariance: uniform vs. mixed) × 2 (Ratio of Sample Size: 1 vs. 1, 4 vs. 1; total N = 300) design was created. Factor loadings and residual variances were set equal across the groups in the target model (see Appendix for detailed model parameters). As in Studies 1 and 2, Mplus was used to generate the mean and covariance structure, and model-fitting procedures were similar to those in previous studies.

Results

Lack of intercept (i.e., point of origin) invariance can lead to appreciable bias in factor means (see Table 5). The direction of bias depends on the direction of intercept differences. When the reference group (e.g., United States) has higher intercepts than the focal group (e.g., China), that is, when a U.S. 4 is equal to a Chinese 3, the factor mean is overestimated in the reference group but underestimated in the focal group (see Footnote 8). The degree of bias depends on the degree of non-invariance and ratio of sample size. The larger the degree of non-invariance, the larger the bias is in both groups. Consistent with the findings in Studies 1 and 2, when the reference group had a larger sample size, bias became smaller in that group but larger in the focal group; when the pattern of intercept non-invariance was mixed, that is, when some of the intercepts were higher in the reference group, whereas others were higher in the focal group, bias in the means was substantially reduced in both groups. Once again, the reduced bias does not indicate that the measures are invariant.

Discussion

To make valid comparisons across different cultural or ethnic groups, we must ensure that we are not comparing chopsticks with forks. Given that researchers often import measures developed for one cultural group to other populations, the issue of measurement invariance becomes a serious challenge.
Findings from Study 1 indicate that lack of factor loading invariance can produce artificial interaction effects in predictive relationships. Results of Studies 2 and 3 demonstrate that lack of loading and intercept (i.e., point of origin) invariance can lead to bogus cultural differences in means.

Comparison of the Current Investigations With Millsap and Kwok's (2004) Studies

Unlike the current studies, Millsap and Kwok (2004) did not examine bias in regression slopes and means due to lack of invariance in factor loadings or intercepts. However, there is some comparability between the two independent investigations. Millsap and Kwok studied selection accuracy by comparing selection rate based on the distribution of a sum score (i.e., pooled score from two different groups) and selection rate based on the distribution of the latent factor score for each group. They also studied sensitivity (i.e., the number of individuals who were selected on the basis of both the pooled sum score and their latent mean score over the number of individuals who were selected solely on the basis of their latent mean score). It was found that both the selection rate and sensitivity were artificially increased in the reference group (e.g., United States) but decreased in the focal group (e.g., China) when the reference group had higher loadings and intercepts than the focal group. The results of the success ratio (i.e., the number of individuals who were selected on the basis of the pooled sum score and their latent mean over the number of individuals who were selected solely on the basis of the pooled sum score) also favor the reference group. These findings are consistent with the results from the current studies, in which the means were overestimated in the reference group but underestimated in the focal group when factor loadings or intercepts favor the reference group. Also as found in the present studies, as the proportion of non-invariant items increased, the degree of bias increased accordingly. In addition, when the reference group had a larger sample size, bias decreased in that group but increased for the focal group, a result obtained in the current study as well. These similar patterns of findings across the two investigations provide support for the validity of the current studies.

Footnote 8. This pattern of results holds as long as the marker variable is invariant.

Table 5
When Intercepts Lack Invariance: Bias in Factor Means (Study 3)

Lack of intercept invariance is uniform
Non-invariance     Mean     Mean     Rel. bias   Rel. bias   Mean
proportion (%)     Ref.     Focal    Ref.        Focal       diff.
N = 150 vs. 150
87.5               5.370    4.640    7.40%       -7.20%      .730
75.0               5.350    4.650    7.00%       -7.00%      .700
50.0               5.280    4.720    5.60%       -5.60%      .560
25.0               5.120    4.880    2.40%       -2.40%      .240
N = 240 vs. 60
87.5               5.100    4.210    2.00%       -15.80%     .890
75.0               5.100    4.280    2.00%       -14.40%     .820
50.0               5.090    4.520    1.80%       -9.60%      .570
25.0               5.030    4.870    .60%        -2.60%      .160

Lack of intercept invariance is mixed
Non-invariance     Mean     Mean     Rel. bias   Rel. bias   Mean
proportion (%)     Ref.     Focal    Ref.        Focal       diff.
N = 150 vs. 150
87.5               5.270    4.740    5.40%       -5.20%      .530
75.0               5.000    5.000    .00%        .00%        .000
50.0               5.000    5.000    .00%        .00%        .000
25.0               5.000    5.000    .00%        .00%        .000
N = 240 vs. 60
87.5               5.090    4.500    1.80%       -10.00%     .590
75.0               5.000    5.000    .00%        .00%        .000
50.0               5.000    5.000    .00%        .00%        .000
25.0               5.000    5.000    .00%        .00%        .000

Note. Ref. = reference group; Focal = focal group; diff. = difference. The true mean is 5.00 for both groups.

Implications in Cross-Cultural Research

Given the high incidence of violating measurement invariance in cross-cultural studies, these findings cast serious doubt on the conclusions drawn from past cross-cultural research. For example, a robust cross-cultural finding is that North Americans have higher self-esteem than East Asians (e.g., Oishi & Sullivan, 2005).
However, in light of the findings from the current simulation studies and the author's review in this article on cross-cultural differences in self-esteem reliability, the discovered cultural difference in self-esteem, at least in part, is due to lower reliability, an indication of lower factor loadings, in the self-esteem scale (Rosenberg, 1965) for East Asians. In addition, East Asians' value of modesty toward one's personal attributes (Markus & Kitayama, 1991) could have contributed to this cultural difference. This is because the self-effacing tendency results in lower intercepts in item ratings, which in turn lead to lower means.

Most of the current self-esteem measures focus on the inner aspect of self-esteem or feelings of self-competence, which might be more relevant to North Americans. For East Asians, the social aspect of self-esteem (i.e., being accepted and valued by other people) might be more important. Future research should develop scales that measure self-esteem in a culturally appropriate manner, such as by including both the inner and social aspects of self-esteem.

The present literature review on existing measures encompasses a wide range of topics, including personality, depression, stress reaction, social competence, cognitive ability, emotional intelligence, life satisfaction, organizational commitment, affect, self-concept, self-esteem, anxiety, and attachment. When these scales are used as predictors, the predictive relationship is likely to be underestimated in the reference group (e.g., United States) but overestimated in the focal group (e.g., China), and the opposite is likely to happen when these scales are used as outcome measures. Perhaps the most routine use of these scales is the comparison of means across different cultural groups. Most likely, the means are artificially inflated in the reference group but deflated in the focal group, given the lower loadings in the latter group.
Particularly, for measures related to self-concept, self-esteem, and satisfaction with life, the means are likely to be underestimated for East Asians but overestimated for North Americans, given that both the loadings and intercepts (resulting from conceptual differences and the modesty tendency) are likely to be lower for East Asians. For other measures, the direction of bias in the means is difficult to predict, given the uncertainty in intercept differences.

As discussed earlier, measurement invariance is still assumed, rather than tested, in many applications. When we fail to examine measurement invariance, we may uncover spurious "cultural differences" that are in fact artifacts of measurements, or we may fail to reveal true cultural differences that have been masked by measurement artifacts, which could be discovered had we used an invariant instrument. Results of the present studies also suggest that we are more likely to draw erroneous conclusions for the focal group (e.g., Asian Americans) than for the reference group (e.g., European Americans) when comparing different ethnic groups, given that the focal group often has a much smaller sample size than the reference group. If erroneous conclusions were used to guide school admission, medical diagnosis, personnel selection and promotion, clinical trials, or health and education prevention programs, serious consequences could occur. Healthy people can be falsely diagnosed and sick ones overlooked. Results of these studies highlight the importance of testing measurement invariance in cross-cultural comparisons and the significance of understanding the consequences of lack of invariance.
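To make the mechanism concrete, the following is a minimal sketch, not the article's Mplus simulation. It assumes a linear one-factor model in which the expected mean of item i is its intercept plus its loading times the latent mean, and it compares unweighted composite scores rather than factor means estimated under equality constraints. All names and parameter values (eight items, unit loadings, a 0.5 intercept shift, a latent mean of 5.0) are hypothetical choices for illustration. Relative bias is computed as (estimate - true value) / true value, so that positive values mean overestimation.

```python
# Illustrative sketch only: every parameter value here is hypothetical
# and is NOT taken from the article's Appendix.

def expected_item_means(intercepts, loadings, latent_mean):
    """Expected item means under a linear one-factor model: tau_i + lambda_i * kappa."""
    return [tau + lam * latent_mean for tau, lam in zip(intercepts, loadings)]


def composite_mean(item_means):
    """Unweighted mean of the item means (a simple scale score)."""
    return sum(item_means) / len(item_means)


def relative_bias(estimate, true_value):
    """(estimate - true) / true; positive means overestimated."""
    return (estimate - true_value) / true_value


N_ITEMS = 8
LATENT_MEAN = 5.0                  # identical in both groups by construction
LOADINGS = [1.0] * N_ITEMS         # invariant loadings
REF_INTERCEPTS = [0.0] * N_ITEMS   # reference-group intercepts
SHIFT = 0.5                        # focal intercepts are lower by this amount


def focal_intercepts(n_noninvariant):
    """Lower the intercepts of the first n_noninvariant items in the focal group."""
    return [-SHIFT if i < n_noninvariant else 0.0 for i in range(N_ITEMS)]


ref_score = composite_mean(
    expected_item_means(REF_INTERCEPTS, LOADINGS, LATENT_MEAN)
)

results = {}
for n_bad in (2, 4, 6, 7):  # 25%, 50%, 75%, 87.5% of eight items
    focal_score = composite_mean(
        expected_item_means(focal_intercepts(n_bad), LOADINGS, LATENT_MEAN)
    )
    gap = ref_score - focal_score
    bias = relative_bias(focal_score, LATENT_MEAN)
    results[n_bad] = (gap, bias)
    print(f"{n_bad}/{N_ITEMS} non-invariant items: "
          f"spurious gap = {gap:.4f}, focal relative bias = {bias:.2%}")
```

Under these assumptions the two groups never differ on the latent variable, yet the observed gap grows with the share of non-invariant items, which is the kind of spurious "cultural difference" described above.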
Implications in Testing Measurement Invariance

The present investigation has important implications for testing measurement invariance. It suggests that we may need a more dynamic approach to evaluating measurement invariance. In other words, measurement invariance should be tested within the context of its impact on the statistics that a researcher is comparing. The conventional wisdom (e.g., Widaman & Reise, 1997) is that we should test measurement invariance as a first step in group comparisons. When measurement invariance is achieved at the appropriate level, we then move to the next step, which is making group comparisons. When measurement invariance is not achieved, we should avoid making group comparisons until an invariant measure is available.

However, when results from this investigation are interpreted with findings from a series of recent simulation studies (F. F. Chen, 2007), the picture is much more complex: The relation between the probability of detecting non-invariance and the degree of bias in group comparisons resulting from non-invariance is not congruent. Counterintuitively, when both the degree of non-invariance and its corresponding bias in statistics are the highest, the probability of revealing non-invariance is the lowest (F. F. Chen, 2007); when the degree of non-invariance and associated bias are only moderate, the probability of detecting non-invariance is the highest. In addition, bias is larger when lack of invariance is uniform, rather than mixed; however, the likelihood of detecting lack of invariance is smaller when lack of invariance is uniform, rather than mixed. These findings indicate that meeting the standards of measurement invariance does not guard against bias in group comparisons. On the other hand, the discovery of lack of invariance may not result in statistical bias in group comparisons, depending on the pattern of non-invariance.
Nevertheless, lack of statistical bias in group comparisons does not imply that the constructs are comparable at the conceptual level. The reduced bias in regression slopes and means due to a mixed pattern of non-invariance also has implications for comparing constructs that are composed of common aspects (i.e., shared by different cultural or ethnic groups), as well as unique components (i.e., specific to each group). These constructs will not meet the standards of measurement invariance, as culture specific items are unique to each culture. However, the results of the present investigation suggest that if these culturally unique items are balanced across groups, it is possible to make unbiased comparisons. Conceptually, however, it is still arguable whether a construct is comparable when culturally unique components are involved. Implications of these studies go beyond cross-cultural research. Measurement invariance is an important issue whenever heterogeneous groups are involved. The groups can be gender, age in longitudinal research, or treatment and control groups in experimental and prevention studies. For example, Smith and Reise (1999) conducted a study to examine gender differences in neuroticism using the Revised NEO Personality Inventory Neuroticism scale (Costa & McCrae, 1992). It was found that several items related to being sensitive to interpersonal stress tended to inflate women’s scores, whereas several items related to tension and worry tended to inflate men’s scores. Similarly, in longitudinal studies, the meaning of a construct may change over time. For example, the way people display racism is more subtle in the 21st century than in the 1960s. An instrument developed to measure explicit racism in the 1960s may not be able to capture the more subtle and implicit nature of the construct today. In experimental studies, when a treatment is introduced, it has the potential to change the meaning of the constructs under study. 
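The cancellation under a mixed pattern can be illustrated with a small sketch. It is a simplified composite-score calculation under an assumed linear one-factor model, not the article's simulation design; the item count, loadings, intercept shifts, and all identifiers are hypothetical. It contrasts a uniform pattern (four focal intercepts lowered) with a mixed pattern (two lowered, two raised) and also counts how many items differ between groups in each case.

```python
# Illustrative sketch only: hypothetical parameters, not the article's design.

def expected_composite(intercepts, loadings, latent_mean):
    """Mean of the expected item means tau_i + lambda_i * kappa."""
    item_means = [tau + lam * latent_mean for tau, lam in zip(intercepts, loadings)]
    return sum(item_means) / len(item_means)


LATENT_MEAN = 5.0            # identical in both groups
LOADINGS = [1.0] * 8         # invariant loadings
REFERENCE = [0.0] * 8        # reference-group intercepts

# Uniform: every non-invariant intercept is lower in the focal group.
FOCAL_UNIFORM = [-0.5, -0.5, -0.5, -0.5, 0.0, 0.0, 0.0, 0.0]
# Mixed: two intercepts are lower and two are higher in the focal group.
FOCAL_MIXED = [-0.5, -0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]


def gap_and_flagged(focal):
    """Return (reference-minus-focal composite gap, number of items that differ)."""
    gap = (expected_composite(REFERENCE, LOADINGS, LATENT_MEAN)
           - expected_composite(focal, LOADINGS, LATENT_MEAN))
    flagged = sum(1 for r, f in zip(REFERENCE, focal) if r != f)
    return gap, flagged


uniform_gap, uniform_flagged = gap_and_flagged(FOCAL_UNIFORM)
mixed_gap, mixed_flagged = gap_and_flagged(FOCAL_MIXED)

print(f"uniform: gap = {uniform_gap:.3f}, non-invariant items = {uniform_flagged}")
print(f"mixed:   gap = {mixed_gap:.3f}, non-invariant items = {mixed_flagged}")
```

In this toy setup the mixed pattern yields no gap in the composite even though the same number of items is non-invariant as in the uniform case, which mirrors the caution that an unbiased comparison does not establish conceptual equivalence.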
Recommendations: When Invariance Fails

This investigation systematically examined the direction and degree of bias under varying conditions of non-invariance. The results can be particularly useful for substantive researchers in deciding whether a comparison should be made in the face of lack of measurement invariance. As discussed earlier, the goal of testing measurement invariance is to ensure that group comparisons are valid. However, it is a challenging task to achieve measurement invariance in cross-cultural research. A variety of factors, such as translation, inappropriate item coverage, different response format and style, and social desirability, can affect the psychometric properties of instruments when different cultural or ethnic minority groups are compared (e.g., Van de Vijver & Leung, 2000).

What should a researcher do when invariance fails? On the basis of current simulations, readers may be tempted to make the following inference: If we allow the non-invariant factor loadings (and/or intercepts) to vary across groups (i.e., if we do not impose measurement invariance under the condition of non-invariance), bias in statistics (e.g., regression slopes or means) will not occur, and thus, it is appropriate to make group comparisons. However, there are two issues associated with this line of reasoning. First, when a construct does not meet the standards of measurement invariance, it implies that, conceptually, the construct conveys different meanings in different groups. Second, lack of invariance can introduce bias in statistics indirectly, even when measurement invariance is not imposed. If the construct had been measured appropriately, the regression slopes (and/or means) would be different.

Dealing with non-invariant scales has become one of the unresolved questions in measurement invariance research (Millsap, 2005). As Millsap and Kwok (2004) point out, four typical approaches have been suggested in practice.
The first option is to eliminate the non-invariant items, which results in many different versions of a scale for different groups (Cheung & Rensvold, 1998). It can also lead to incomplete coverage of the construct. The second choice is to keep all non-invariant items in the scale, and thus, the sum/mean score contains both invariant and non-invariant items. The assumption of this approach is that the non-invariant items may introduce little bias in group comparisons. As discovered in the present study and Millsap and Kwok's (2004) work, this is an assumption about which we cannot be confident. As found in the present literature review, it is common to have a proportion of the items invariant and another proportion non-invariant. As a third option, researchers have proposed a partial measurement invariance model (Byrne, Shavelson, & Muthen, 1989) to address this issue. This approach constrains the invariant items to be equal across the groups while allowing the non-invariant items to be different, and it seems less likely to introduce statistical bias, compared with the mean/sum score methods, as the non-invariant items are not forced to be invariant. However, as discussed earlier, some critical questions are still not addressed: What would the regression slopes and means be had the construct been measured properly in different cultures? Under what conditions should one employ a partial invariance model? As the proportion of non-invariant items increases, confidence decreases about the validity of this approach. Even when only a small proportion of the items are different, the following conceptual questions remain: Why are those items different? Is it due to specific samples or due to the scale?
How could those aspects of the construct be measured differently? What are the implications for rethinking the construct? It is important to take one step further to examine the non-invariant items, as well as the conceptualization of the construct. The fourth option is to avoid making direct group comparisons. Other researchers have suggested that it seems reasonable to statistically adjust for bias introduced by non-invariant items (Cheung & Rensvold, 1998). However, there are currently no sound methods for achieving this goal. Finally, lack of measurement invariance in a one-factor model may indicate more factors or more complex loading patterns. Once additional factors or different factor loading patterns are allowed, measurement invariance can be achieved (McArdle & Cattell, 1994; Meredith, 1993).

This article recommends a different approach. When measurement invariance is not achieved at an appropriate level, a researcher may still wish to draw some useful conclusions with regard to cross-cultural comparisons after spending a tremendous amount of time, effort, and resources. It is possible that the consequences of lack of invariance on the research questions are limited. To help researchers decide when it is appropriate to make group comparisons when facing lack of invariance, the following steps are proposed:

(1) Testing measurement invariance for each construct independently and examining the pattern of lack of invariance. The goal is to understand the degree and direction of lack of invariance. As demonstrated in the present investigation, when lack of invariance is uniform, bias in regression slopes or means tends to occur; however, when lack of invariance is mixed, bias tends to be reduced.
(2) Imposing corresponding invariance constraints on invariant items (e.g., factor loading invariance for comparing regression slopes and loading and intercept invariance for comparing means).
(3) Imposing corresponding invariance constraints on non-invariant items, as well as invariant items.
(4) Comparing groups on statistics under study, such as regression coefficients and means, with and without imposing corresponding invariance constraints on non-invariant items (i.e., comparing results from Step 2 and Step 3 to determine the discrepancy between the statistics under study). The purpose is to understand the impact of non-invariance on these statistics.

If the differences are small, it may be justifiable to make group comparisons. However, future research should examine the effect size of these group differences as well as their practical implications. The reader is warned again that trivial differences in statistics do not imply that the construct is conceptually equivalent.

Limitations

A major assumption in the present investigation is that all variables are continuous and are normally distributed. When the response scales are discrete categories, alternative estimation methods, such as weighted least squares (Bollen, 1989), or alternative frameworks, such as item response theory (Embretson & Reise, 2000), should be used. One direction in future research is to systematically examine bias under various levels of invariance for categorical variables. This project also focused on only two-group comparisons, and future research should expand the scope to more groups, as in many applications, several different cultural groups can be involved at the same time.

Multiple group confirmatory factor analysis provides a method of testing construct equivalence (i.e., whether the same construct is measured across cultural groups). Construct equivalence, however, cannot always be tested statistically (Cheung & Rensvold, 2000), particularly when a construct has a wider scope in one culture than in another. For example, filial piety is a more highly elaborated construct in China than it is in the West (Hsieh, 1967).
Measuring this kind of construct may require more items in one culture than it does in the other. Therefore, a particular set of items may be conceptually adequate for assessing a construct in one culture, be inadequate in another culture, and yet display measurement equivalence when tested across both cultures. To avoid this type of bias, both common and culturally specific features should be included in the measurement. When this is the case, one would expect that the common features are invariant, whereas the specific features are not invariant across cultural groups.

Conclusion

Lack of measurement invariance can have a significant impact on the conclusions drawn from group comparisons. When measurement invariance is not achieved, one may discover "spurious" group differences that are in fact artifacts of measurements, or one may miss true group differences that have been masked by measurement artifacts. The implications of these simulation studies go beyond cross-cultural research. They can have an impact in much broader situations whenever heterogeneous groups are compared. These groups include ethnicity, gender, age, measurement occasion in longitudinal research, and treatment/control groups in experimental and prevention studies.

References

Berry, J. W. (1969). On cross-cultural comparability. International Journal of Psychology, 2, 119–128.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Brewer, M. B., & Chen, Y.-R. (2007). Where (who) are collectives in collectivism? Toward conceptual clarification of individualism and collectivism. Psychological Review, 114, 133–151.
Byrne, B. M., & Campbell, T. L. (1999). Cross-cultural comparisons and the presumption of equivalent measurement and theoretical structure: A look beneath the surface. Journal of Cross-Cultural Psychology, 30, 555–574.
Byrne, B., Shavelson, R., & Muthen, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.
Byrne, B. M., & Watkins, D. (2003). The issue of measurement invariance revisited. Journal of Cross-Cultural Psychology, 34, 155–175.
Campbell, D. T., & Boruch, R. F. (1975). Making the case for randomized assignment to treatments by considering the alternatives: Six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiment: Some critical issues in assessing social programs (pp. 195–296). New York: Academic Press.
Chen, C., Lee, S. Y., & Stevenson, H. W. (1995). Response style and cross-cultural comparisons of rating scales among East Asian and North American students. Psychological Science, 6, 170–175.
Chen, F. F. (2007). Sensitivity of goodness of fit indices to lack of measurement invariance. Structural Equation Modeling, 14, 464–504.
Chen, F. F., Sousa, K. H., & West, S. G. (2005). Testing measurement invariance of second-order factor models. Structural Equation Modeling, 12, 471–492.
Chen, F. F., & West, S. G. (2008). Measuring individualism and collectivism: The importance of considering different components, reference groups, and measurement invariance. Journal of Research in Personality, 42, 259–294.
Cheung, G. W., & Rensvold, R. B. (1998). Cross-cultural comparisons using non-invariant measurement items. Applied Behavioral Science Review, 6, 93–110.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equation modeling. Journal of Cross-Cultural Research, 31, 187–212.
Costa, P. T., & McCrae, R. R. (1992). Professional manual: Revised NEO Personality Inventory (NEO PI–R) and NEO five factor inventory (NEO–FFI). Odessa, FL: Psychological Assessment Resources.
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 71–75.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fiske, A. P., Kitayama, S., Markus, H. R., & Nisbett, R. E. (1998). The cultural matrix of social psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (4th ed., pp. 915–981). Boston: McGraw-Hill.
Heine, S. J., Lehman, D. R., Markus, H. R., & Kitayama, S. (1999). Is there a universal need for positive self-regard? Psychological Review, 106, 766–794.
Heine, S. J., Lehman, D. R., Peng, K., & Greenholtz, J. (2002). What's wrong with cross-cultural comparisons of subjective Likert scales? The reference group effect. Journal of Personality and Social Psychology, 82, 903–918.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.
Horn, J. L., McArdle, J. J., & Mason, R. (1983). When is invariance not invariant: A practical scientist's look at the ethereal concept of factor invariance. Southern Psychologist, 4, 179–188.
Hsieh, Y.-W. (1967). Filial piety and Chinese society. In C. A. Moore (Ed.), The Chinese mind: Essentials of Chinese philosophy and culture (pp. 165–187). Honolulu: University of Hawaii Press.
Hui, C. H., & Triandis, H. C. (1985). Measurement in cross-cultural psychology: A review and comparison of strategies. Journal of Cross-Cultural Psychology, 16, 131–152.
Irvine, S. H., & Carroll, W. K. (1980). Testing and assessment across cultures: Issues in methodology and theory. In H. C. Triandis & J. W. Berry (Eds.), Handbook of cross-cultural psychology (Vol. 2, pp. 181–244). Newton, MA: Allyn & Bacon.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G., & Sörbom, D. (1999). LISREL 8: User's reference guide (2nd ed.). Chicago: Scientific Software International.
Kwan, V. S. Y., Bond, M. H., Boucher, H. C., Maslach, C., & Gan, Y. (2002). The construct of individuation: More complex in collectivist than in individualist cultures. Personality and Social Psychology Bulletin, 28, 300–310.
Lehman, D., Chiu, C., & Schaller, M. (2004). Psychology and culture. Annual Review of Psychology, 55, 689–717.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76.
Markus, H. R., & Kitayama, S. (1991). Culture and self: Implications for cognition, emotion, and motivation. Psychological Review, 98, 224–253.
Maslach, C., Stapp, J., & Santee, R. T. (1985). Individuation: Conceptual analysis and assessment. Journal of Personality and Social Psychology, 49, 729–738.
McArdle, J. J., & Cattell, R. B. (1994). Structural equation models of factorial invariance in parallel proportional profiles and oblique confactor problems. Multivariate Behavioral Research, 29, 63–113.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Millsap, R. E. (2005). Four unresolved problems in studies of factorial invariance. In A. Maydeu-Olivares & J. McArdle (Eds.), Contemporary psychometrics: A festschrift for Roderick P. McDonald: Multivariate applications book series (pp. 153–171). Mahwah, NJ: Erlbaum.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factorial invariance on selection of multiple populations. Psychological Methods, 9, 93–115.
Moorman, R. H., & Podsakoff, P. M. (1992). A meta-analytic review and empirical test of the potential confounding effects of social desirability response sets in organizational behavior research. Journal of Occupational and Organizational Psychology, 65, 131–149.
Muthén, L. K., & Muthén, B. O. (1998). Mplus user's guide. Los Angeles: Muthén & Muthén.
Oishi, S., & Sullivan, H. W. (2005). The mediating role of parental expectations in culture and well-being. Journal of Personality, 73, 1267–1294.
Oyserman, D., Coon, H. M., & Kemmelmeier, M. (2002). Rethinking individualism and collectivism: Evaluation of theoretical assumptions and meta-analyses. Psychological Bulletin, 128, 3–72.
Peng, K., Nisbett, R. E., & Wong, N. Y. (1997). Validity problems comparing values across cultures and possible solutions. Psychological Methods, 2, 329–344.
Poortinga, Y. H. (1989). Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology, 24, 737–756.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.
Rhee, E., Uleman, J. S., & Lee, H. K. (1996). Variations in collectivism and individualism by ingroup and culture: Confirmatory factor analysis. Journal of Personality and Social Psychology, 71, 1037–1054.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employees of different cultures interpret work-related measures in an equivalent manner? Journal of Management, 20, 643–671.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Smith, L. L., & Reise, S. P. (1999). Gender differences on negative affectivity: An IRT study of differential item functioning on the Multidimensional Personality Questionnaire Stress Reaction Scale. Journal of Personality and Social Psychology, 75, 1350–1362.
Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika, 43, 381–396.
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
Ullman, J. B. (2001). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell (Eds.), Using multivariate statistics (pp. 653–771). Boston: Allyn & Bacon.
Van de Vijver, F. J. R., & Leung, K. (1997). Methods and data analysis for comparative research. In J. Berry, Y. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (Vol. 1, pp. 259–300). Boston: Allyn & Bacon.
Van de Vijver, F., & Leung, K. (2000). Methodological issues in psychological research on culture. Journal of Cross-Cultural Psychology, 31, 33–51.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.

Table A1
Model Parameters for Study 1
Unstandardized factor loadings Uniform Non-invariance proportion (%) Mixed Reference group 87.5 75.0 50.0 25.0 1 1 1 1 .9 .9 .9 .9 Variance Residual Covariance .8 .8 .6 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 Focal group .9 .9 .9 .9 1 1 1 1 .6 .6 .6 .6 .6 .9 .9 .9 .6 .6 .6 .9 .6 .6 .9 .6 .6 .6 .6 .9 .6 .6 .9 .9 Reference group .6 .6 .6 .9 1 1 1 1 Focal group .75 .9 .75 .9 .75 .9 .75 .9 .75 .9 .75 .9 .75 .9 .9 .9 .9 .75 .9 .75 .9 .9 .9 .9 .9 .9 .75 .9 1 1 1 1 .9 .9 .9 .9 .6 .9 .9 .9 .9 .6 .9 .9 .6 .9 .9 .9 .9 .6 .6 .9 .6 .9 .9 .9 .9 .6 .6 .6 Other parameters .8 .8 .6 .8 .8 .6 .8 .8 .6

Table A2 Model Parameters for Study 2
Unstandardized factor loadings Uniform Non-invariance proportion (%) Mixed Reference group 87.5 75.0 50.0 25.0 1 1 1 1 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 .9 Variance Residual Covariance Intercepts Latent mean .8 .6 .6 01111111 4 .9 .9 .9 .9 Focal group 1 1 1 1 .6 .6 .6 .6 .6 .9 .9 .9 .6 .6 .6 .9 .6 .6 .9 .6 .6 .6 .6 .9 .6 .6 .9 .9 Reference group .6 .6 .6 .9 Other parameters .8 .6 .6 01111111 4 1 1 1 1 .75 .9 .75 .9 .75 .9 .75 .9 .75 .9 .75 .9 .75 .9 .9 .9 .9 .75 .9 .75 .9 .9 .9 .9 .9 .9 .75 .9 .8 .6 .6 01111111 4 Focal group 1 1 1 1 .9 .9 .9 .9 .6 .9 .9 .9 .9 .6 .9 .9 .6 .9 .9 .9 .9 .6 .6 .9 .6 .9 .9 .9 .8 .6 .6 01111111 4 .9 .6 .6 .6

Table A3 Model Parameters for Study 3
Intercepts Uniform
Non-invariance proportion (%) Mixed Reference group 87.5 75.0 50.0 25.0 0 0 0 0 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 Loadings Variance Residual Covariance Mean 1 .9 .9 .9 .9 .9 .9 .9 .8 .6 .6 5 1.5 1.5 1.5 1.5 Focal group 1.5 1.5 1.5 1.5 0 0 0 0 Reference group .5 .5 .5 .5 .5 .5 .5 .5 1.5 .5 .5 .5 .5 .5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 1.5 .5 1.5 1.5 1.5 .5 1 .9 .9 .9 .9 .9 .9 .9 .8 .6 .6 5 0 0 0 0 1.5 1.5 1.5 1.5 Focal group .5 1.5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 1.5 1.5 1.5 1.5 .5 1.5 Other parameters 1 .9 .9 .9 .9 .9 .9 .9 .8 .6 .6 5 0 0 0 0 .5 1.5 .5 1.5 .5 1.5 .5 1.5 1.5 .5 1.5 .5 1.5 .5 .5 1.5 .5 1.5 .5 1.5 .5 1.5 1.5 1.5 1.5 1.5 1.5 .5 1 .9 .9 .9 .9 .9 .9 .9 .8 .6 .6 5

Note. The standardized loading for each item was set to .67 for the reference group, which corresponds to the average standardized loading of .67 in the U.S./Caucasian group in the literature review noted earlier. Also corresponding to the findings in the literature, the standardized loading difference was set to .15 (vs. an average difference of .13 in the literature) when the loadings were higher in the reference group than in the focal group and was set to .07 (vs. an average difference of .07 in the literature) when the loadings were higher in the focal group than in the reference group. These values were transformed into a difference of .3 (.9 vs. .6) and .15 (.9 vs. .75) in the unstandardized loadings. There were two reasons for relying on standardized loadings when setting the parameters in the model: First, standardized loadings are not subject to scaling, and second, about three times more comparisons were based on standardized than on raw loadings (94 vs. 25) in the literature review, and thus, the results are less subject to sampling variation.

Received September 5, 2005
Revision received May 9, 2008
Accepted May 20, 2008
Journal of Affective Disorders 190 (2016) 362–368

Research report

A comparative cross-cultural study of the prevalence of late life depression in low and middle income countries

M. Guerra a,b,c, A.M. Prina b, C.P. Ferri d, D. Acosta e, S. Gallardo a,*, Y. Huang f, K.S. Jacob g, I.Z. Jimenez-Velazquez h, J.J. Llibre Rodriguez i, Z. Liu f, A. Salas j, A.L. Sosa k, J.D. Williams m, R. Uwakwe l, M. Prince b

a Institute of Memory, Depression and Disease Risk, Avda Constructores 1230, Lima 12, Peru
b Centre for Global Mental Health, Health Service and Population Research Department, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
c Peruvian University, Cayetano, Heredia, Lima, Peru
d Federal University of Sao Paulo, UNIFESP, Sao Paulo, Brazil
e National University Pedro Henriquez Urena
f Peking University, China
g Christian Medical College, Vellore, India
h School of Medicine, University of Puerto Rico, San Juan, Puerto Rico
i Medical University of Havana
j Central University of Venezuela, Caracas, Venezuela
k National Autonomous University of Mexico
l Nnamdi Azikiwe University
m Department of Community Health, Voluntary Health Services, Chennai, India

Article info
Article history: Received 6 July 2015. Received in revised form 19 August 2015. Accepted 5 September 2015. Available online 23 October 2015.

Abstract
Background: Current estimates of the prevalence of depression in later life mostly arise from studies carried out in Europe, North America and Asia. In this study we aimed to measure the prevalence of depression using a standardised method in a number of low and middle income countries (LMIC).
Methods: A one-phase cross-sectional survey involving over 17,000 participants aged 65 years and over living in urban and rural catchment areas in 13 sites from 9 countries (Cuba, Dominican Republic, Puerto Rico, Mexico, Venezuela, Peru, China, India and Nigeria). Depression was assessed and compared using ICD-10 and EURO-D criteria.
Results: Depression prevalence varied across sites according to diagnostic criteria. The lowest prevalence was observed for ICD-10 depressive episode (0.3% to 13.8%). When using the EURO-D depression scale, the prevalence was higher and ranged from 1.0% to 38.6%. The crude prevalence was particularly high in the Dominican Republic and in rural India. ICD-10 depression was also associated with increased age and being female.
Limitations: Generalisability of findings outside of catchment areas is difficult to assess.
Conclusions: Late life depression is burdensome, and common in LMIC. However, its prevalence varies from culture to culture; its diagnosis poses a significant challenge and requires proper recognition of its expression.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Keywords: Depression; Prevalence; ICD-10; EURO-D; Older-age

* Corresponding author. E-mail address: mariella.guerra.1066@gmail.com (M. Guerra).

1. Introduction

Depression, a prevalent and extremely disabling psychiatric condition in later life (Beekman et al., 1999; Blazer, 2003), has not been studied sufficiently in low and middle income countries (LMIC), where a demographic transition, with an increasing number of older people, is rapidly occurring (Christensen et al., 2009). In high-income countries, the prevalence of late-life depression has been extensively studied (Beekman et al., 1999; Djernes, 2006), with considerable variation reported across studies and the operational criteria being a main influence.
To our knowledge at least 21 studies were conducted in LMIC from 1990 until 2011, using different criteria. Most of the studies were carried out in China (Chen et al., 1999, 2004, 2005; Meng and Tang, 2000; Pan et al., 2008; Wu and Zhang, 1989) or Latin America (Zunzunegui et al., 2009; Alvarado et al., 2007; Costa, 2007; Blay and Marinho, 2007; García-Peña et al., 2008; Carvalhais et al., 2008; Tintle et al., 2011; Guerra et al., 2009; Barcelos-Ferreira, 2010). The majority of Latin American studies had a small sample size, used depression symptom scales, and reported a relatively higher prevalence of depression compared to studies from mainland China. One of the biggest multicentre studies (SABE) was conducted in six Latin American capital cities using the Geriatric Depression Scale, and reported a depression prevalence ranging from 16.5% to 30.1% in women and from 11.8% to 19.6% in men (Alvarado et al., 2007), results that are broadly consistent with estimates from two cross-national comparisons of late-life depression in Europe: SHARE (Castro-Costa et al., 2007) and EURODEP. In the 10/66 population-based study conducted in Peru, Mexico and Venezuela, the prevalence varied depending on the diagnostic criteria used, being higher for GMS–AGECAT (between 30.0% and 35.9%) and the EURO-D depression scale (cutpoint 3/4) (between 26.1% and 31.2%). We now extend the evidence on the prevalence of late-life depression to include a wider range of settings in Latin America, Nigeria and Asia.

http://dx.doi.org/10.1016/j.jad.2015.09.004

2. Methods

2.1. Setting, design and procedures

The 10/66 Dementia Research Group population-based studies were all conducted according to the same standardised protocol. The full 10/66 study protocol has been published elsewhere (Prince et al., 2007). A one-phase cross-sectional population-based survey was conducted of all those over 65 years old from defined catchment areas. Surveys were carried out in thirteen sites in nine countries (Cuba, Dominican Republic, Puerto Rico, Peru, Mexico, Venezuela, China, India and Nigeria). Surveys in Peru, Mexico, China and India included both urban and rural catchment areas, the Nigerian catchment area was predominantly rural, and in the other countries participants were recruited only from urban catchment areas. All assessments were carefully translated and adapted into the relevant local languages. Acceptability and conceptual equivalence were assessed and reviewed by local informants. Interviews were carried out in participants' own homes and lasted on average two to three hours. Interviewers were fully trained on the 10/66 protocol by the local principal investigator (PI) and the local study coordinator (SC). The study protocol and the consent procedures, including the witnessed consent procedure, were approved by the King's College London research ethics committee and by local ethics committees in all participating countries. Funding for each group of countries was obtained at different times; these baseline surveys were therefore conducted over a six-year period (2003–2009).

3. Measurements

3.1. Depression

Depression was determined according to EURO-D and ICD-10 criteria, both generated from the same semi-structured clinical interview, the Geriatric Mental State (GMS), which is supported by the computerised diagnostic algorithm AGECAT (Automated Geriatric Examination for Computer Assisted Taxonomy) (Copeland et al., 1976). For all criteria, period prevalence was determined with respect to the past month.

3.1.1. Depression of clinical significance

The EURO-D (Prince et al., 1999) is a symptom scale that covers 12 symptom domains: depressed mood, pessimism, suicidality, guilt, sleep, interest, irritability, appetite, fatigue, concentration, enjoyment and tearfulness. Each item is scored 0 (symptom not present) or 1 (symptom present), and item scores are summed to produce a scale with a minimum score of zero and a maximum of 12. The EURO-D scale had moderately high internal consistency in the EURODEP study (Prince et al., 1999), and was reported to have good construct validity in our 10/66 sample (Brailean et al., 2015). For this study, we determined the optimal cutpoint in each site (as either 4 or 5), as described in the EURO-D validation paper that we have recently published (Guerra et al., 2015). In summary, the optimal cutpoint and its sensitivity and specificity were, respectively: Cuba (cutpoint: 5, sensitivity at cutpoint: 97.2%, specificity: 87.7%), Dominican Republic (5, 93.5%, 84%), Puerto Rico (5, 97.9%, 91.6%), urban China (6, 100%, 97.8%), rural China (5, 85.7%, 99.6%), urban India (5, 97.4%, 74.1%), rural India (4, 91.3%, 69.5%) and Nigeria (5, 100%, 79.3%).

3.1.2. Diagnostic criteria for depression

ICD-10 diagnoses were derived using a computerised algorithm applied to the GMS. For ICD-10, F32 Depressive episode, specified as mild, moderate or severe, was used.

3.2. Socio-demographic status and other health conditions

Age was established during the interview from the participant, official ID documentation and informant report; in the case of discrepancy, an event calendar was used. We also obtained information on: gender and marital status (single, married/cohabiting, widowed, divorced/separated); education (none, did not complete primary, completed primary, secondary, tertiary); social support (living alone versus living with others; frequency of contact with relatives and friends); occupational attainment (professional, clerical or trade, skilled or semi-skilled manual worker); amount and sources of income; number of assets; and food insecurity. Other health conditions were self-reported (e.g. angina, stroke, COPD), diagnosed (e.g. dementia, using the 10/66 dementia algorithm; Prince et al., 2003), or determined according to specific criteria (e.g. hypertension).

4. Statistical analysis

We used the 10/66 data archive (release 3.0) and STATA (version 11 or 13) for all analyses. The prevalence of depression, accompanied by robust 95% confidence intervals (CIs), was estimated in Cuba, Dominican Republic, China, India, Nigeria and Puerto Rico. Direct standardisation estimates (for age, sex and education), using the whole sample as the standard population, were also reported for all the sites, including the sites where we previously published non-standardised estimates (Peru, Mexico and Venezuela) (Guerra et al., 2009). In each setting, we report the prevalence of depression with 95% confidence intervals, by age and sex, for both ICD-10 depressive episode and EURO-D depression (4/5 cut-point). Forest plots from a random-effects meta-analysis were generated using the metaprop command in STATA for both ICD-10 and EURO-D estimates, and reported with their pooled estimates. To explore the effect of age and gender on prevalent ICD-10 depression, we used Poisson regressions to calculate mutually adjusted prevalence ratios (PRs). We then used a fixed-effect meta-analysis to pool the PRs across sites, also reporting the Higgins I² statistic to highlight the heterogeneity across sites.

Table 1. Socio-demographic characteristics of the sample.
Characteristic | Cuba (n = 2944) | Dominican Republic (n = 2011) | Puerto Rico (n = 1918) | China urban (n = 1160) | China rural (n = 1002) | India urban (n = 1003) | India rural (n = 999) | Nigeria (n = 914)
Age (years)
Mean age | 74.8 | 75.2 | 76.1 | 73.9 | 72.4 | 71.2 | 72.5 | 72.6
65–69 | 760 (25.8) | 533 (26.5) | 406 (21.1) | 316 (27.2) | 383 (38.2) | 415 (41.4) | 331 (33.1) | 386 (42.2)
70–74 | 789 (26.8) | 520 (25.8) | 439 (22.8) | 362 (31.2) | 296 (29.5) | 318 (31.7) | 350 (35.0) | 222 (24.2)
75–79 | 639 (21.7) | 397 (19.7) | 456 (23.7) | 254 (21.9) | 202 (20.1) | 144 (14.3) | 177 (17.7) | 121 (13.2)
80+ | 749 (25.5) | 561 (27.9) | 618 (32.1) | 228 (19.6) | 121 (12.0) | 124 (12.3) | 141 (14.1) | 185 (20.2)
Missing values | 7 | 0 | 2 | 0 | 0 | 2 | 0 | 0
Gender
Female | 1913 (64.9) | 1325 (65.9) | 1289 (67.2) | 661 (56.9) | 556 (55.4) | 571 (57.6) | 545 (54.5) | 539 (58.9)
Missing values | 0 | 2 | 4 | 0 | 0 | 15 | 0 | 0
Marital status
Never married | 275 (9.3) | 139 (6.9) | 118 (6.1) | 3 (0.2) | 22 (2.2) | 21 (2.1) | 5 (0.5) | 41 (4.8)
Currently married | 1271 (43.2) | 586 (29.3) | 931 (48.5) | 829 (71.4) | 585 (58.3) | 523 (52.2) | 481 (48.1) | 581 (68.6)
Widowed | 928 (31.6) | 806 (40.3) | 640 (33.3) | 326 (28.1) | 394 (39.3) | 426 (42.5) | 497 (49.7) | 225 (26.5)
Separated/divorced | 462 (15.7) | 465 (23.3) | 228 (11.8) | 2 (0.1) | 1 (0.1) | 32 (3.1) | 16 (1.6) | 0 (0.0)
Missing values | 8 | 15 | 4 | 0 | 0 | 3 | 0 | 67
Education level
None | 75 (2.5) | 392 (19.6) | 70 (3.6) | 232 (20.0) | 579 (57.7) | 428 (42.6) | 660 (66.0) | 543 (59.4)
Minimal | 655 (22.3) | 1022 (51.3) | 376 (19.5) | 153 (13.1) | 114 (11.3) | 234 (23.3) | 195 (19.5) | 135 (14.7)
Primary | 979 (33.3) | 370 (18.5) | 395 (20.5) | 303 (26.1) | 259 (25.8) | 212 (21.1) | 116 (11.6) | 126 (13.7)
Secondary | 728 (24.8) | 135 (6.7) | 686 (35.7) | 335 (28.8) | 45 (4.4) | 87 (8.6) | 26 (2.6) | 20 (2.1)
Tertiary | 499 (17.0) | 73 (3.6) | 388 (20.2) | 137 (11.8) | 5 (0.5) | 42 (4.1) | 2 (0.2) | 18 (1.9)
Missing values | 8 | 19 | 0 | 0 | 0 | 2 | 0 | 0
Living arrangements
Alone | 261 (8.8) | 254 (12.6) | 472 (23.5) | 54 (4.6) | 49 (4.8) | 44 (4.3) | 120 (12.0) | No data
With spouse only | 445 (15.2) | 135 (6.7) | 666 (33.2) | 415 (35.7) | 194 (19.3) | 108 (10.7) | 140 (14.0) |
With adult children | 1422 (48.3) | 963 (47.8) | 548 (27.3) | 446 (38.4) | 679 (67.7) | 719 (71.5) | 625 (62.5) |
Any other | 816 (27.7) | 659 (32.7) | 323 (16.1) | 245 (21.2) | 80 (7.9) | 134 (13.3) | 114 (11.4) |
Missing values | 7 | 0 | 0 | 10 | 11 | 2 | 0 |
Past depression | 944 (32.2) | 357 (17.8) | 428 (22.3) | 19 (1.6) | 12 (1.2) | 24 (2.4) | 22 (2.2) | 14 (1.7)
Missing values | 11 | 3 | 0 | 0 | 0 | 1 | 0 | 0

The prevalence of 'sub-syndromal depression' was also reported. This was defined as those not meeting criteria for ICD-10 depressive episode, but scoring above the optimal cut-point on the EURO-D scale.

5. Results

5.1. General characteristics

Overall, 17,852 interviews were completed. Response proportions ranged from 72% (urban India) to 98% (rural India). General characteristics of the respondents in each country are summarised in Table 1. Women predominated over men in all sites, with nearly two-thirds of participants being women in Latin American sites, and just over a half in China, India and Nigeria. Higher levels of education were registered in Latin America and in urban areas in comparison to rural areas. Participants in rural locations also reported fewer household assets, more food insecurity, and lower personal income, compared to those living in urban locations. Between 1.2% (rural China) and 34.9% (urban Peru) reported a past history of depression.

Table 2. Prevalence of depression (%) in each site, according to ICD-10 depressive episode criterion, stratified by age and sex.
Site | Sex | 65–69 | 70–74 | 75–79 | 80+ | All ages | Crude prevalence (site, both sexes)
Cuba | Men | 1.1 (0.0–2.3) | 2.4 (0.6–4.2) | 3.0 (0.8–5.3) | 4.3 (1.7–6.9) | 2.6 (1.6–3.6) | 4.9 (4.1–5.7)
Cuba | Women | 6.8 (4.5–9.0) | 4.8 (2.9–6.7) | 8.1 (5.4–10.7) | 5.2 (3.3–7.2) | 6.1 (5.0–7.2) |
Dominican Republic | Men | 8.5 (4.5–12.5) | 6.7 (3.1–10.2) | 15.9 (9.6–22.2) | 15.4 (9.9–20.9) | 11.1 (8.8–13.5) | 13.8 (12.3–15.3)
Dominican Republic | Women | 13.4 (10.3–17.6) | 13.9 (10.0–17.7) | 16.2 (11.8–20.7) | 16.8 (13.1–20.6) | 15.2 (13.3–17.2) |
Puerto Rico | Men | 0.4 (0.0–7.5) | No cases | 1.3 (0.5–3.1) | 0.9 (0.4–2.2) | 1.2 (0.4–2.1) | 2.3 (1.7–3.0)
Puerto Rico | Women | 2.6 (0.8–4.5) | 1.4 (0.0–2.7) | 2.0 (0.4–3.6) | 4.2 (2.3–6.2) | 2.8 (1.9–3.7) |
China urban | Men | No cases | No cases | No cases | No cases | No cases | 0.3 (0.0–0.6)
China urban | Women | No cases | 0.5 (0.0–1.5) | No cases | 1.7 (0.0–4.0) | 0.5 (0.0–1.0) |
China rural | Men | 0.5 (0.0–1.5) | 1.5 (0.0–3.7) | 1.3 (0.0–3.9) | No cases | 0.9 (0.0–1.8) | 0.7 (0.2–1.2)
China rural | Women | 0.5 (0.0–1.6) | 0.6 (0.0–1.8) | No cases | 1.3 (0.0–3.9) | 0.5 (0.0–1.1) |
India urban | Men | 4.0 (1.1–7.0) | 2.4 (0.0–5.1) | 5.9 (0.1–11.8) | 7.7 (0.2–15.2) | 4.3 (2.4–6.2) | 3.9 (2.7–5.1)
India urban | Women | 4.6 (1.9–7.3) | 4.8 (1.7–7.8) | 1.3 (0.0–3.9) | No cases | 3.7 (2.1–5.2) |
India rural | Men | 12.2 (6.7–17.7) | 14.9 (8.2–20.6) | 12.5 (5.5–19.5) | 12.3 (4.6–20.1) | 13.2 (10.1–16.3) | 12.6 (10.5–14.7)
India rural | Women | 10.9 (6.5–15.4) | 14.8 (9.–19.8) | 10.1 (3.7–16.5) | 10.3 (2.9–17.7) | 12.1 (9.4–14.8) |
Nigeria | Men | No cases | 1.3 (0.0–3.9) | No cases | 1.1 (0.0–3.1) | 0.5 (0.0–1.3) | 0.5 (0.1–1.0)
Nigeria | Women | 0.8 (0.0–1.9) | 0.7 (0.0–2.0) | No cases | No cases | 0.6 (0.0–1.2) |

Table 3. Prevalence of depression (%) in each site, according to EURO-D criterion (cutpoint 4/5), stratified by age and sex.
Site | Sex | 65–69 | 70–74 | 75–79 | 80+ | All ages | Crude prevalence (site, both sexes)
Cuba | Men | 9.5 (6.0–13.0) | 8.9 (5.6–12.2) | 10.9 (6.8–14.9) | 14.2 (9.7–18.7) | 9.5 (7.7–11.3) | 16.5 (15.1–17.9)
Cuba | Women | 21.8 (18.1–25.4) | 18.1 (14.7–21.5) | 22.0 (17.9–26.1) | 25.4 (21.6–29.1) | 20.3 (18.4–22.1) |
Dominican Republic | Men | 17.6 (12.1–23.0) | 15.4 (10.3–20.5) | 25.0 (17.5–32.5) | 24.9 (18.3–31.4) | 19.6 (16.6–22.5) | 26.8 (24.8–28.8)
Dominican Republic | Women | 28.8 (23.9–33.6) | 27.2 (22.2–32.1) | 31.7 (26.1–37.3) | 36.7 (31.9–41.6) | 30.6 (28.1–33.2) |
Puerto Rico | Men | 14.2 (7.4–30.9) | 7.4 (3.1–11.6) | 8.9 (4.4–13.5) | 13.8 (9.2–18.5) | 6.3 (4.4–8.2) | 10.6 (9.2–12.0)
Puerto Rico | Women | 7.1 (12.8–21.5) | 11.7 (7.9–15.4) | 13.3 (9.5–17.2) | 23.4 (19.3–27.6) | 12.6 (10.8–14.5) |
China urban | Men | 2.7 (0.0–5.7) | 3.5 (0.9–6.0) | 3.4 (0.0–6.8) | 11.0 (5.0–16.9) | 1.9 (0.7–3.1) | 2.5 (1.6–3.4)
China urban | Women | 4.4 (1.6–7.3) | 3.1 (0.4–5.8) | 5.1 (1.4–8.8) | 12.6 (6.6–18.7) | 3.0 (1.7–4.3) |
China rural | Men | 3.1 (0.6–5.6) | 3.8 (0–7.1) | 7.8 (1.7–13.9) | 8.7 (0.2–17.2) | 1.4 (0.3–2.5) | 1.0 (0.3–1.7)
China rural | Women | 1.6 (0.0–3.4) | 3.0 (0.4–5.7) | 3.2 (0.0–6.3) | 6.7 (0.9–12.4) | 0.7 (0.0–1.5) |
India urban | Men | 17.3 (11.6–23.0) | 18.3 (11.4–25.1) | 29.9 (18.6–41.1) | 32.7 (19.5–45.9) | 20.9 (17.0–24.8) | 28.6 (25.7–31.5)
India urban | Women | 35.7 (29.5–41.9) | 36.5 (29.5–43.5) | 36.0 (24.9–47.1) | 24.2 (13.6–34.9) | 34.6 (30.6–38.5) |
India rural | Men | 36.7 (28.6–44.8) | 38.3 (30.5–46.1) | 39.8 (29.3–50.2) | 42.5 (30.9–54.1) | 36.7 (32.2–41.2) | 38.6 (35.3–41.9)
India rural | Women | 36.5 (29.6–43.3) | 46.9 (39.9–53.9) | 46.1 (35.3–56.8) | 50.0 (37.8–62.2) | 40.2 (35.9–44.4) |
Nigeria | Men | 16.9 (10.5–23.3) | 23.7 (13.9–33.5) | 14.7 (6.1–23.3) | 23.2 (14.5–31.8) | 18.8 (14.9–22.7) | 21.1 (18.8–23.5)
Nigeria | Women | 15.6 (11.1–20.1) | 26.0 (18.8–33.2) | 23.1 (11.2–34.9) | 43.3 (32.9–53.8) | 22.7 (19.5–26.0) |

Fig. 1. Prevalence of depression (%) using different operational criteria, standardised by age, gender and education.

5.2. Prevalence of depression

The largest source of variation in the prevalence of depression was the criterion used for assessment. The prevalence of ICD-10 depressive episode varied between 0.3% and 13.8% by location (Table 2), whereas the prevalence of EURO-D depression ranged between 1.0% and 38.6% (Table 3). However, for each of these criteria, there was also substantial heterogeneity in prevalence among sites (supplementary Fig. 1). The meta-analysed pooled estimate was 4.7% (95% CI: 3.1–6.3) for ICD-10 depression and 18.2% (95% CI: 12.3–24.0) for EURO-D depression. Direct standardisation had some effect on the estimates, as shown in Fig. 1, which reports the prevalence for both criteria using direct standardisation for age, gender and education. The prevalence in the Dominican Republic, with all diagnostic criteria, was high with respect to that observed in other Latin American sites. The prevalence was exceptionally low in urban and rural China with all criteria. In all sites, with the exception of rural Peru, rural China and both Indian sites, the prevalence of depression was higher in women than in men.
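The pooling step described above can be illustrated with a minimal Python sketch of DerSimonian-Laird random-effects pooling. It is not the authors' analysis: they used the metaprop command in STATA across all sites, whereas this sketch uses only the eight crude ICD-10 prevalences listed in Table 2, recovers each variance from the reported 95% CI under a normal approximation (an assumption), and uses hypothetical function names. Its result should therefore not be expected to match the published pooled estimate.

```python
import math

# Crude ICD-10 prevalence (%) with 95% CI bounds, taken from Table 2.
# Peru, Mexico and Venezuela are not listed there, so they are omitted.
SITES = {
    "Cuba": (4.9, 4.1, 5.7),
    "Dominican Republic": (13.8, 12.3, 15.3),
    "Puerto Rico": (2.3, 1.7, 3.0),
    "China (urban)": (0.3, 0.0, 0.6),
    "China (rural)": (0.7, 0.2, 1.2),
    "India (urban)": (3.9, 2.7, 5.1),
    "India (rural)": (12.6, 10.5, 14.7),
    "Nigeria": (0.5, 0.1, 1.0),
}


def variance_from_ci(lower, upper):
    """Approximate the sampling variance from a 95% CI (normal approximation)."""
    return ((upper - lower) / (2 * 1.96)) ** 2


def pool_random_effects(estimates, variances):
    """DerSimonian-Laird random-effects pooling of site-level estimates."""
    k = len(estimates)
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0  # between-site variance
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0  # Higgins I² (%)
    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


estimates = [p for p, _, _ in SITES.values()]
variances = [variance_from_ci(lo, hi) for _, lo, hi in SITES.values()]
result = pool_random_effects(estimates, variances)
print(f"pooled prevalence: {result['pooled']:.1f}% "
      f"(95% CI {result['ci'][0]:.1f} to {result['ci'][1]:.1f}), "
      f"I2 = {result['i2']:.0f}%")
```

Because the site estimates differ far more than their confidence intervals allow for, the I² value from this sketch is high, which is the same heterogeneity the text describes.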
In Latin America, the prevalence of ICD-10 depression increased with age in men, but not in women, whereas an increasing trend in EURO-D prevalence was seen across both genders and sites. When we adjusted for both age and gender and pooled our estimates across sites, we found that men and younger individuals had lower PRs of ICD-10 depression (pooled estimates: 0.62, 95% CI: 0.53–0.71, I² = 0.0% and 1.07, 95% CI: 1.02–1.12, I² = 45.2%, respectively). Given the higher prevalence of EURO-D depression compared with ICD-10 depressive episode, we explored the concept of sub-syndromal depression (EURO-D depression not confirmed as a depressive episode by ICD-10). The prevalence of sub-syndromal depression varied across sites, with rural China having the lowest (0.4%) and rural India the highest (25.3%) (Table 4).

Table 4. Prevalence of sub-syndromal depression (EURO-D depression not confirmed by ICD-10).
Centre | Crude prevalence (95% CI)
Cuba | 11.4 (10.3–12.7)
Dominican Republic | 13.7 (12.2–15.3)
Puerto Rico | 7.8 (6.7–9.1)
Mexico (urban) | 15.0 (13.0–17.4)
Mexico (rural) | 12.2 (10.3–14.4)
Peru (urban) | 14.0 (12.2–15.9)
Peru (rural) | 12.5 (10.0–15.5)
China (urban) | 2.2 (1.5–3.2)
China (rural) | 0.4 (0.2–1.1)
India (urban) | 24.8 (22.1–27.6)
India (rural) | 25.3 (22.6–28.2)
Nigeria | 20.4 (18.1–22.8)

5.3. Depression clinical aspects

Overall, 35.3% of ICD-10 depression cases were mild, 51.9% were moderate, and 12.7% severe. The proportion of current ICD-10 depressive episode cases with a past history of depression varied between 25.6% and 71.8%, with rural India constituting a low outlier with only 2/126 cases (1.6%) reporting a past history of depression. In general, a past history was more frequently reported in urban than rural sites.
In Latin American sites, where a past history of depression was relatively frequently reported, around 20% to 60% of these individuals reported having previously been treated by a doctor, with higher proportions in Cuba, Puerto Rico and Venezuela than in the Dominican Republic, Mexico and Peru. The median age at first onset of depression exceeded 60 years in most sites.

6. Discussion

In this study, we reported a wide variation of estimates according to the depression criterion that we used. Across all sites, the prevalence of ICD-10 depressive episode was lower than that of EURO-D depression (a score of 5 or higher on the EURO-D scale). However, for each of these criteria, there was also substantial variation in prevalence among sites. It is therefore important to compare results between studies, where possible, on the basis of the same or similar criteria. On this basis, our results suggested a higher prevalence of late-life depression, in at least some sites in Latin America and in urban India, than is typically recorded in studies in high-income countries. Conversely, the prevalence in China was very low.

6.1. Strengths and weaknesses of the study

To our knowledge this is the first large-scale community-based depression-prevalence study conducted in LMIC that, with the same methodology, has evaluated a large number of older persons in nine LMIC located in three continents, using rigorous research diagnostic criteria such as the ICD-10 and the EURO-D. Unlike studies in high-income countries (HIC), an important advantage of our study is the relatively high response rate, at least 80% in all sites, and exceeding 90% in several sites. Rather than a comprehensive clinical diagnostic interview, depression was determined according to two different criteria (ICD-10; EURO-D). While the findings of this study may be to some extent generalisable to other similar urban or rural sites, they may not be generalisable to the whole city or country where the study was conducted.
Comparison of findings with studies that systematically sampled whole cities, or conducted national surveys, may be particularly difficult.

6.2. Depression prevalence

Other than the relatively high prevalence of ICD-10 depressive episode in the Dominican Republic and rural India, and the low estimates in China and Nigeria, our findings are broadly consistent with those reported in high-income countries. A review by Djernes (2006) reported an ICD-10 prevalence of 3.3% in Australia and 7.7% in Denmark. More recently, a Brazilian community-based survey of older adults (Costa et al., 2007) reported an unusually high prevalence of ICD-10 depressive episode (19.2%). However, it is difficult to compare our findings with this study, since their sample size was small (n = 413), people with dementia were excluded, the age range was 75 years and older, and a two-phase design (symptom scale and semi-structured SCAN interview) was used. The prevalence of EURO-D depression was generally six times higher than that of ICD-10 depressive episode. These ratios are consistent with earlier reviews and studies regarding the ratio of depression identified with such screening scales, as compared to clinical diagnoses (Castro-Costa et al., 2007; Prince et al., 2004). A large community study, carried out in ten European countries in persons aged 55 and above using the EURO-D measure, reported prevalence rates between 19% and 33% (Castro-Costa et al., 2007). Our results are congruent with these, even though methods differ between studies and there is much more variability in prevalence among sites in our study, mainly arising from the low prevalence in China.
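The gap between screening-scale and diagnostic prevalence can be made concrete with a short worked example. The Python sketch below takes the Cuban EURO-D cutpoint performance reported in the Measurements section (sensitivity 97.2%, specificity 87.7%) and the crude ICD-10 prevalence for Cuba (4.9%). It assumes, for illustration only, that ICD-10 depressive episode is the reference standard against which those values were measured; the function names are hypothetical.

```python
def expected_screen_positive_rate(sensitivity, specificity, prevalence):
    """Share of the population expected to score at or above the cutpoint."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos + false_pos


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a screen-positive person meets the diagnostic criteria."""
    true_pos = sensitivity * prevalence
    return true_pos / expected_screen_positive_rate(
        sensitivity, specificity, prevalence
    )


# Cuba: values reported in the study, expressed as proportions.
sens, spec, prev = 0.972, 0.877, 0.049

rate = expected_screen_positive_rate(sens, spec, prev)
ppv = positive_predictive_value(sens, spec, prev)
print(f"expected screen-positive rate: {rate:.3f}")  # 0.165
print(f"positive predictive value:     {ppv:.3f}")   # 0.289
```

Under these assumptions roughly 16.5% of the sample would screen positive, close to the crude EURO-D prevalence reported for Cuba in Table 3, while fewer than three in ten screen-positive individuals would meet ICD-10 criteria. Most of the excess comes from the false-positive term, which grows as diagnostic prevalence falls.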
Unlike rigid criteria-based instruments (ICD-10 and DSM-IV), identification as a probable case of depression using EURO-D depends only on the overall load of reported symptoms, rather than requiring the presence of particular symptoms and combinations of symptoms, and is without regard to their duration, persistence or pervasiveness. As such, it is important to recognise that not all of these individuals would be considered to be 'cases for treatment', since current evidence-based recommendations are exclusively for those with moderate to severe case-level depression (Patel, 2009). The discrepancy in prevalence between the two approaches is explained by the less than perfect specificity of the EURO-D, which, given the low prevalence of DSM-IV and ICD-10 depression in population settings, results in a low positive predictive value. The disparity is striking, particularly in Nigeria, where very few if any clinical diagnoses were recorded, but there was a relatively high prevalence of most depression symptoms, and a high prevalence of EURO-D depression. The generally much higher prevalence of EURO-D depression raises the question "what constitutes a case?". This issue was discussed in an earlier review of late-life depression in which the disparity between prevalence according to clinical diagnostic criteria (1.8%) and using symptom scales and other less restrictive criteria (13.5%) was first highlighted (Beekman et al., 1999). Although not all EURO-D cases may be 'cases for treatment', reliance upon clinical diagnoses may significantly underestimate the population burden of depression symptoms, much of which may arise from the larger number of individuals with less severe 'sub-syndromal' depression.

6.3. Variation of prevalence among sites

As can be appreciated from the above, the pattern of variation of prevalence among sites was generally similar for the two diagnostic criteria.
Estimates were generally high, and fairly consistent in Latin American sites, lower in urban India than in rural India (where prevalence was similar to that of the highest prevalence Latin American site, the Dominican Republic) and very low in the two Chinese sites. Nigeria was unusual in this respect, with a very low prevalence of ICD-10 depression, but a comparatively high prevalence of EURO-D depression, similar to that in Latin American sites. The low prevalence of depression in China might be partly explained by contextual factors, including the influence of culture on ascertainment of depression. In China the once popular and prevalent diagnosis of shenjing shuairuo, a neurasthenia-like syndrome comprising weakness, fatigue, concentration problems, headache and other somatic symptoms, seems in recent years to have been supplanted as the most common diagnosis in epidemiological surveys and clinical practice by depressive and anxiety disorders (Lee, 1999). This has led some to allege an inappropriate importation of western nosologies that do not match well with Chinese cultural idioms of expression of psychological distress (Lee, 1999). In this context, it is perhaps noteworthy that depression was not a common symptom in either urban or rural Chinese sites: sleep disturbance, fatigue and irritability were the three commonest symptoms in the urban site, and tearfulness, lack of concentration and loss of interest in the rural site. More work needs to be done to establish the validity of the GMS interview across cultures as a tool for generating ICD-10 and EURO-D diagnoses. None of the systematic reviews of studies we used as a guide to the prevalence of depression (Cole, 2003; Djernes, 2006) considered the effect of urban or rural residence.
In this study there was a trend towards a lower prevalence of late-life depression in rural than urban sites in Latin America, with the opposite trend seen in India. Findings elsewhere in the literature are inconsistent. Some community cross-sectional studies reported a higher prevalence among urban residents (Carpiniello and Rudas, 1989; Chiu et al., 2005; Gureje et al., 2007), associated with a higher prevalence of chronic medical conditions and functional impairment, and a lack of, or poor, social support. Others did not find any association (St John et al., 2006).

7. Conclusion

Overall our findings are congruent with those previously reported in the literature and, given the pattern of findings, we can conclude that late-life depression prevalence varied depending on the criterion used for assessment. The wide variation in prevalence among sites needs to be evaluated. More work needs to be done to understand adequately the expression of depression in different cultures. This must be the focus of further analysis. Prospective longitudinal studies are needed in order to clarify aetiological factors and to disentangle those factors that influence prevalence through increasing the duration of depressive episodes (maintenance of depression) from those that increase the incidence (onset) of depression. Given the high burden of this condition, prioritisation of recognition and treatment of depression in older adults should be on the agenda of policy-makers across the world. This goes together with the urgent need to strengthen primary care settings, to develop locally appropriate support services as an important component of ensuring social protection and, finally, to develop primary and secondary prevention strategies using evidence from appropriate studies.
Acknowledgements

The 10/66 Dementia Research Group population based surveys were supported by the Wellcome Trust (UK) (GR066133); the World Health Organization; the US Alzheimer’s Association (IIRG – 04–1286); and the Fondo Nacional de Ciencia Y Tecnologia, Consejo de Desarrollo Cientifico Y Humanistico, Universidad Central de Venezuela (Venezuela). Matthew Prina is funded by the Medical Research Council [Grant number MR/K021907/1]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.jad.2015.09.004.

References

Alvarado, B.E., Zunzunegui, M.V., Beland, F., Sicotte, M., Tellechea, L., 2007. Social and gender inequalities in depressive symptoms among urban older adults of Latin America and the Caribbean. J. Gerontol. B Psychol. Sci. Soc. Sci. 62B, S226–S236.
Beekman, A.T., Copeland, J.R., Prince, M.J., 1999. Review of community prevalence of depression in later life. Br. J. Psychiatry 174, 307–311.
Blay, S.L., Marinho, V., 2007. Depression in the elderly. [Portuguese]. Rev. Bras. Med. 64, 150–155.
Blazer, D., 2003. Depression in late life: review and commentary. J. Gerontol. A Biol. Sci. Med. Sci. 58, 249–265.
Brailean, A., Guerra, M., Chua, K.C., Prince, M., Prina, M.A., 2015. A multiple indicators multiple causes model of late-life depression in Latin American countries. J. Affect. Disord. 184, 129–136.
Carpiniello, B., Rudas, C.M., 1989. Depression among elderly people. A psychosocial study of urban and rural populations. Acta Psychiatr. Scand. 80, 445–450.
Carvalhais, S.M.M., Peixoto, L.-C.M., Firmo, S.V., Castro-Costa, J.O.A., Uchoa E, E., 2008. The influence of socioeconomic conditions on the prevalence of depressive symptoms and its covariates in an elderly population with slight income differences: the Bambuí Health and Aging Study (BHAS). Int. J. Soc. Psychiatry 54, 447–456.
Castro-Costa, E., Dewey, M., Stewart, R., Banerjee, S., Huppert, F., Mendonca-Lima, C., Bula, C., Reischies, F., Wancata, J., Ritchie, K., Tsolaki, M., Mateos, R., Prince, M., 2007. Prevalence of depressive symptoms and syndromes in later life in ten European countries: the SHARE study. Br. J. Psychiatry 191, 393–401.
Chen, R., Copeland, J.R., Wei, L., 1999. A meta-analysis of epidemiological studies in depression of older people in the People's Republic of China. Int. J. Geriatr. Psychiatry 14, 821–830.
Chen, R., Qin, H.Z., Xu, X., Copeland JRM, X., 2004. A community-based study of depression in older people in Hefei, China – the GMS-AGECAT prevalence, case validation and socio-economic correlates. Int. J. Geriatr. Psychiatry 19, 407–413.
Chen, R., Hu, W.L., Qin, Z., Copeland JRM., X., Hemingway, H., 2005. Depression in older people in rural China. Arch. Intern. Med. 165, 2019–2025.
Chiu, H., Huang Ch., C.C., Mau, L., 2005. Depressive symptoms, chronic medical conditions and functional status: a comparison of urban and rural elders in Taiwan. Int. J. Geriatr. Psychiatry 20, 635–644.
Christensen, K., Doblhammer, G., Rau, R., Vaupel, J.W., 2009. Ageing populations: the challenges ahead. Lancet 374, 1196–1208.
Cole MG, D.N., 2003. Risk factors for depression among elderly community subjects: a systematic review and meta-analysis. Am. J. Psychiatry 160, 1147–1156.
Copeland, J.R.M., Kelleher, M.J., Kellett, J.M., Gourlay, A.J., Gurland, B.J., Fleiss, J.L., Sharpe, L., 1976. A semi-structured clinical interview for the assessment of diagnosis and mental state in the elderly: the Geriatric Mental State Schedule. I. Development and reliability. Psychol. Med. 6, 439–449.
Costa, E., Barreto, S.M., Uchoa, E., Firmo, J.O.A., Lima-Costa, M.F., Prince, M., 2007. Prevalence of International Classification of Diseases, 10th Revision common mental disorders in the elderly in a Brazilian community: the Bambui Health Ageing Study. Am. J. Geriatr. Psychiatry 15, 17–27.
Djernes, J.K., 2006. Prevalence and predictors of depression in populations of elderly: a review. Acta Psychiatr. Scand. 113, 372–387.
García-Peña C, W.F., Sánchez-García, S., Júarez-Cedillo, T., Espinel-Bermudez, C., García-Gonzalez, J.J., Gallegos-Carrillo, K., Franco-Marina, F., Gallo, J.J., 2008. Depressive symptoms among older adults in Mexico City. J. Gen. Intern. Med. 23 (12), 1973–1980.
Guerra, M., Ferri, C., Llibre, J., Prina, A.M., Prince, M., 2015. Psychometric properties of EURO-D, a geriatric depression scale: a cross-cultural validation study. BMC Psychiatry 15, 12.
Guerra, M., Ferri, C.P., Sosa, A.L., Salas, A., Gaona, C., Gonzales, V., de la Torre, G.R., Prince, M., 2009. Late-life depression in Peru, Mexico and Venezuela: the 10/66 population-based study. Br. J. Psychiatry 195, 510–515.
Gureje, O., Kola, L., Afolabi, E., 2007. Epidemiology of major depressive disorder in elderly Nigerians in the Ibadan Study of Ageing: a community-based survey. Lancet 370, 957–964.
Lee, S., 1999. Diagnosis postponed: shenjing shuairuo and the transformation of psychiatry in post-Mao China. Cult. Med. Psychiatry 23, 349–380.
Meng, C., Tang, Z., 2000. Analysis and comparison urban and rural elderly depressive symptoms in Beijing. Chin. J. Gerontol. 20, 196–199.
Pan, A., Franco, O.H., Wan, Y., Yu, Z., Ye, X., Lin, X., 2008. Prevalence and geographic disparity of depressive symptoms among middle-aged and elderly in China. J. Affect. Disord. 105, 167–175.
Patel, V., T.G., 2009. Packages of care for mental, neurological, and substance use disorders in low- and middle-income countries. PLoS Med. 6, e1000160.
Prince, M., Acosta, D., Chiu, H., Scazufca, M., Varghese, M., 2003. Dementia diagnosis in developing countries: a cross-cultural validation study. Lancet 361, 909–917.
Prince, M., Ferri, C., Acosta, D., Albanese, E., Arizaga, R., Dewey, M., et al., 2007. The protocols for the 10/66 dementia research group population-based research programme. BMC Public Health 7, 165.
Prince, M., Acosta, D., Chiu, H., Copeland, J., Dewey, M., Scazufca, M., Varghese, M., 10/66 Dementia Research Group, 2004. Effects of education and culture on the validity of the Geriatric Mental State and its AGECAT algorithm. Br. J. Psychiatry 185, 429–436.
Prince, M.J., Reischies, F., Beekman, A.T.F., Fuhrer, C., Jonker, S.L., Kivela, B.A., Lawlor, A., Lobo, H., Magnusson, M., Fichter, H., van Oyen, H., Roelands, M., Skoog, I., Turrina, C., Copeland, J.R.M., 1999. Development of the EURO-D scale – a European Union initiative to compare symptoms of depression in 14 European centres. Br. J. Psychiatry 174, 330–338.
St John, P.D., Strain, B.A., 2006. Depressive symptoms among older adults in urban and rural areas. Int. J. Geriatr. Psychiatry, 1175–1180.
Tintle, N.B., Kostyushenko, B., Gutkovish, S., Bromet, Z., 2011. Depression and its correlates in older adults in Ukraine. Int. J. Geriatr. Psychiatry 26, 1292–1299.
Wu, W., Zhang, M.Y., 1989. Application of depression scale CES-D among the elderly people in the community, Shanghai. Arch. Psychiatry 7, 139–142.
Zunzunegui, M.V., Alvarado, B.E., Beland, F., Vissandjee, B., 2009. Explaining health differences between men and women in later life: a cross-city comparison in Latin America and the Caribbean. Soc. Sci. Med. 68, 235–242.
Journal of Affective Disorders 190 (2016) 362–368

Research report

A comparative cross-cultural study of the prevalence of late life depression in low and middle income countries

M. Guerra a,b,c, A.M. Prina b, C.P. Ferri d, D. Acosta e, S. Gallardo a,*, Y. Huang f, K.S. Jacob g, I.Z. Jimenez-Velazquez h, J.J. Llibre Rodriguez i, Z. Liu f, A. Salas j, A.L. Sosa k, J.D. Williams m, R. Uwakwe l, M. Prince b

a Institute of Memory, Depression and Disease Risk, Avda Constructores 1230, Lima 12, Peru
b Centre for Global Mental Health, Health Service and Population Research Department, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
c Peruvian University Cayetano Heredia, Lima, Peru
d Federal University of Sao Paulo, UNIFESP, Sao Paulo, Brazil
e National University Pedro Henriquez Urena
f Peking University, China
g Christian Medical College, Vellore, India
h School of Medicine, University of Puerto Rico, San Juan, Puerto Rico
i Medical University of Havana
j Central University of Venezuela, Caracas, Venezuela
k National Autonomous University of Mexico
l Nnamdi Azikiwe University
m Department of Community Health, Voluntary Health Services, Chennai, India

Article info

Article history: Received 6 July 2015. Received in revised form 19 August 2015. Accepted 5 September 2015. Available online 23 October 2015.

Abstract

Background: Current estimates of the prevalence of depression in later life mostly arise from studies carried out in Europe, North America and Asia. In this study we aimed to measure the prevalence of depression using a standardised method in a number of low and middle income countries (LMIC).
Methods: A one-phase cross-sectional survey involving over 17,000 participants aged 65 years and over living in urban and rural catchment areas in 13 sites from 9 countries (Cuba, Dominican Republic, Puerto Rico, Mexico, Venezuela, Peru, China, India and Nigeria). Depression was assessed and compared using ICD-10 and EURO-D criteria.

Results: Depression prevalence varied across sites according to diagnostic criteria. The lowest prevalence was observed for ICD-10 depressive episode (0.3% to 13.8%). When using the EURO-D depression scale, the prevalence was higher and ranged from 1.0% to 38.6%. The crude prevalence was particularly high in the Dominican Republic and in rural India. ICD-10 depression was also associated with increased age and being female.

Limitations: Generalisability of findings outside of catchment areas is difficult to assess.

Conclusions: Late life depression is burdensome, and common in LMIC. However, its prevalence varies from culture to culture; its diagnosis poses a significant challenge and requires proper recognition of its expression.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Keywords: Depression; Prevalence; ICD-10; EURO-D; Older-age

* Corresponding author. E-mail address: mariella.guerra.1066@gmail.com (M. Guerra).

1. Introduction

Depression, a prevalent and extremely disabling psychiatric condition in later life (Beekman et al., 1999; Blazer, 2003), has not been studied sufficiently in low and middle income countries (LMIC), where a demographic transition, with an increasing number of older people, is rapidly occurring (Christensen et al., 2009). In high-income countries, the prevalence of late-life depression has been extensively studied (Beekman et al., 1999; Djernes, 2006), with considerable variation reported across studies and the operational criteria being a main influence.
To our knowledge at least 21 studies have been conducted from 1990 until 2011 in LMIC using different criteria. Most of the studies were carried out in China (Chen et al., 1999, 2004, 2005; Meng and Tang, 2000; Pan et al., 2008; Wu and Zhang, 1989), or Latin America (Zunzunegui et al., 2009; Alvarado et al., 2007; Costa et al., 2007; Blay and Marinho, 2007; García-Peña et al., 2008; Carvalhais et al., 2008; Tintle et al., 2011; Guerra et al., 2009; Barcelos-Ferreira, 2010). The majority of Latin American studies had a small sample size, used depression symptom scales, and reported a relatively higher prevalence of depression than studies from Mainland China. One of the biggest multicentre studies (SABE) was conducted in six Latin American capital cities using the Geriatric Depression Scale, and reported a depression prevalence ranging from 16.5% to 30.1% in women and from 11.8% to 19.6% in men (Alvarado et al., 2007); these results are broadly consistent with estimates from two cross-national comparisons of late-life depression in Europe: SHARE (Castro-Costa et al., 2007) and EURODEP. In the 10/66 population-based study conducted in Peru, Mexico and Venezuela, the prevalence varied depending on the diagnostic criteria used, being higher for GMS–AGECAT (between 30.0% and 35.9%) and the EURO-D depression scale (cutpoint 3/4) (between 26.1% and 31.2%). We now extend the evidence on the prevalence of late-life depression to include a wider range of settings, in Latin America, Nigeria and Asia.

2. Methods

2.1. Setting, design and procedures

The 10/66 Dementia Research Group population-based studies were all conducted according to the same standardised protocol. The full 10/66 study protocol has been published elsewhere (Prince et al., 2007). A one-phase cross-sectional population-based survey was conducted of all those aged 65 years and over in defined catchment areas. Surveys were carried out in thirteen sites in nine countries (Cuba, Dominican Republic, Puerto Rico, Peru, Mexico, Venezuela, China, India and Nigeria). Surveys in Peru, Mexico, China and India included both urban and rural catchment areas, the Nigerian catchment area was predominantly rural, while in the other countries participants were recruited only from urban catchment areas. All assessments were carefully translated and adapted into the relevant local languages. Acceptability and conceptual equivalence were assessed and reviewed by local informants. Interviews were carried out in participants’ own homes and lasted on average two to three hours. Interviewers were fully trained on the 10/66 protocol by the local principal investigator (PI) and the local study coordinator (SC). The study protocol and the consent procedures, including the witnessed consent procedure, were approved by the King's College London research ethics committee and locally in all participating countries. Funding for each group of countries was obtained at different times; therefore these baseline surveys were conducted over a six year period (2003–2009).

3. Measurements

3.1. Depression

Depression was determined according to EURO-D and ICD-10 criteria, all generated from the same semi-structured clinical interview, the Geriatric Mental State (GMS), which is supported by the computerised diagnostic algorithm AGECAT (Automated Geriatric Examination for Computer Assisted Taxonomy) (Copeland et al., 1976). For all criteria, period prevalence was determined with respect to the last one month.

3.1.1. Depression of clinical significance

The EURO-D (Prince et al., 1999) is a symptom scale that covers 12 symptom domains: depressed mood, pessimism, suicidality, guilt, sleep, interest, irritability, appetite, fatigue, concentration, enjoyment and tearfulness. Each item is scored 0 (symptom not present) or 1 (symptom present), and item scores are summed to produce a scale with a minimum score of zero and a maximum of 12. The EURO-D scale had moderately high internal consistency in the EURODEP study (Prince et al., 1999), and was reported to have good construct validity in our 10/66 sample (Brailean et al., 2015). For this study, we determined the optimal cutpoint in each site (as either 4 or 5), as described in the EURO-D validation paper that we have recently published (Guerra et al., 2015). In summary, the optimal cutpoint, its sensitivity and its specificity were respectively: Cuba (cutpoint: 5, sensitivity at cutpoint: 97.2%, specificity: 87.7%), Dominican Republic (5, 93.5%, 84%), Puerto Rico (5, 97.9%, 91.6%), urban China (6, 100%, 97.8%), rural China (5, 85.7%, 99.6%), urban India (5, 97.4%, 74.1%), rural India (4, 91.3%, 69.5%) and Nigeria (5, 100%, 79.3%).

3.1.2. Diagnostic criteria for depression

ICD-10 diagnoses were derived using a computerised algorithm applied to the GMS. For ICD-10, F32 Depressive episode, specified as mild, moderate or severe, was used.

3.2. Socio-demographic status and other health conditions

Age was established during the interview from the participant, official ID documentation and informant report; in the case of discrepancy an event calendar was used. We also obtained information on: gender and marital status (single, married/cohabiting, widowed, divorced/separated); education (none, did not complete primary, completed primary, secondary, tertiary); social support (living alone versus living with others; frequency of contact with relatives and friends); occupational attainment (professional, clerical or trade, skilled or semi-skilled manual worker); amount and sources of income; number of assets; and food insecurity. Other health conditions were self-reported (e.g. angina, stroke, COPD), diagnosed (e.g. dementia, using the 10/66 dementia algorithm; Prince et al., 2003), or determined according to specific criteria (e.g. hypertension).

4. Statistical analysis

We used the 10/66 data archive (release 3.0) and STATA (version 11 or 13) for all analyses. The prevalence of depression, accompanied by robust 95% confidence intervals (CIs), was estimated in Cuba, Dominican Republic, China, India, Nigeria and Puerto Rico. Direct standardisation estimates (for age, sex and education), using the whole sample as the standard population, were also reported for all the sites, including the sites where we previously published non-standardised estimates (Peru, Mexico and Venezuela) (Guerra et al., 2009). In each setting, we report the prevalence of depression with 95% confidence intervals, by age and sex, for both ICD-10 depressive episode and EURO-D depression (4/5 cut-point). Forest plots from a random-effects meta-analysis were generated using the metaprop command in STATA for both ICD-10 and EURO-D estimates, and reported with their pooled estimates. To explore the association of age and gender with prevalent ICD-10 depression, we used Poisson regressions to calculate mutually adjusted prevalence ratios (PRs). We then used a fixed-effect meta-analysis to pool the PRs across sites, also reporting Higgins' I² to highlight the heterogeneity across sites.

Table 1. Socio-demographic characteristics of the sample.
Values are n (%) unless otherwise stated.

Age (years). Columns: mean age; 65–69; 70–74; 75–79; 80+; missing values.
Cuba (n = 2944): 74.8; 760 (25.8); 789 (26.8); 639 (21.7); 749 (25.5); 7
Dominican Republic (n = 2011): 75.2; 533 (26.5); 520 (25.8); 397 (19.7); 561 (27.9); 0
Puerto Rico (n = 1918): 76.1; 406 (21.1); 439 (22.8); 456 (23.7); 618 (32.1); 2
China urban (n = 1160): 73.9; 316 (27.2); 362 (31.2); 254 (21.9); 228 (19.6); 0
China rural (n = 1002): 72.4; 383 (38.2); 296 (29.5); 202 (20.1); 121 (12.0); 0
India urban (n = 1003): 71.2; 415 (41.4); 318 (31.7); 144 (14.3); 124 (12.3); 2
India rural (n = 999): 72.5; 331 (33.1); 350 (35.0); 177 (17.7); 141 (14.1); 0
Nigeria (n = 914): 72.6; 386 (42.2); 222 (24.2); 121 (13.2); 185 (20.2); 0

Gender. Columns: female; missing values.
Cuba: 1913 (64.9); 0
Dominican Republic: 1325 (65.9); 2
Puerto Rico: 1289 (67.2); 4
China urban: 661 (56.9); 0
China rural: 556 (55.4); 0
India urban: 571 (57.6); 15
India rural: 545 (54.5); 0
Nigeria: 539 (58.9); 0

Marital status. Columns: never married; currently married; widowed; separated/divorced; missing values.
Cuba: 275 (9.3); 1271 (43.2); 928 (31.6); 462 (15.7); 8
Dominican Republic: 139 (6.9); 586 (29.3); 806 (40.3); 465 (23.3); 15
Puerto Rico: 118 (6.1); 931 (48.5); 640 (33.3); 228 (11.8); 4
China urban: 3 (0.2); 829 (71.4); 326 (28.1); 2 (0.1); 0
China rural: 22 (2.2); 585 (58.3); 394 (39.3); 1 (0.1); 0
India urban: 21 (2.1); 523 (52.2); 426 (42.5); 32 (3.1); 3
India rural: 5 (0.5); 481 (48.1); 497 (49.7); 16 (1.6); 0
Nigeria: 41 (4.8); 581 (68.6); 225 (26.5); 0 (0.0); 67

Education level. Columns: none; minimal; primary; secondary; tertiary; missing values.
Cuba: 75 (2.5); 655 (22.3); 979 (33.3); 728 (24.8); 499 (17.0); 8
Dominican Republic: 392 (19.6); 1022 (51.3); 370 (18.5); 135 (6.7); 73 (3.6); 19
Puerto Rico: 70 (3.6); 376 (19.5); 395 (20.5); 686 (35.7); 388 (20.2); 0
China urban: 232 (20.0); 153 (13.1); 303 (26.1); 335 (28.8); 137 (11.8); 0
China rural: 579 (57.7); 114 (11.3); 259 (25.8); 45 (4.4); 5 (0.5); 0
India urban: 428 (42.6); 234 (23.3); 212 (21.1); 87 (8.6); 42 (4.1); 2
India rural: 660 (66.0); 195 (19.5); 116 (11.6); 26 (2.6); 2 (0.2); 0
Nigeria: 543 (59.4); 135 (14.7); 126 (13.7); 20 (2.1); 18 (1.9); 0

Living arrangements. Columns: alone; with spouse only; with adult children; any other; missing values.
Cuba: 261 (8.8); 445 (15.2); 1422 (48.3); 816 (27.7); 7
Dominican Republic: 254 (12.6); 135 (6.7); 963 (47.8); 659 (32.7); 0
Puerto Rico: 472 (23.5); 666 (33.2); 548 (27.3); 323 (16.1); 0
China urban: 54 (4.6); 415 (35.7); 446 (38.4); 245 (21.2); 10
China rural: 49 (4.8); 194 (19.3); 679 (67.7); 80 (7.9); 11
India urban: 44 (4.3); 108 (10.7); 719 (71.5); 134 (13.3); 2
India rural: 120 (12.0); 140 (14.0); 625 (62.5); 114 (11.4); 0
Nigeria: no data

Past depression. Columns: past depression; missing values.
Cuba: 944 (32.2); 11
Dominican Republic: 357 (17.8); 3
Puerto Rico: 428 (22.3); 0
China urban: 19 (1.6); 0
China rural: 12 (1.2); 0
India urban: 24 (2.4); 1
India rural: 22 (2.2); 0
Nigeria: 14 (1.7); 0

The prevalence of ‘sub-syndromal depression’ was also reported. This was defined as those not meeting criteria for ICD-10 depressive episode, but scoring above the optimal cut-point on the EURO-D scale.

5. Results

5.1. General characteristics

Overall, 17,852 interviews were completed. Response proportions ranged from 72% (urban India) to 98% (rural India). General characteristics of the respondents in each country are summarised in Table 1. Women predominated over men in all sites, with nearly two-thirds of participants being women in Latin American sites, and just over a half in China, India and Nigeria. Higher levels of education were registered in Latin America and in urban areas in comparison to rural areas. Participants in rural locations also reported fewer household assets, more food insecurity, and lower personal income, compared to those living in urban locations. Between 1.2% (rural China) and 34.9% (urban Peru) reported a past history of depression.

Table 2. Prevalence of depression (%) in each site, according to ICD-10 depressive episode criterion, stratified by age and sex.

Values are % (95% CI). Columns for men and women, by age group (years): 65–69; 70–74; 75–79; 80+; all ages. The crude prevalence for each site is given on its own row.

Cuba, men: 1.1 (0.0–2.3); 2.4 (0.6–4.2); 3.0 (0.8–5.3); 4.3 (1.7–6.9); 2.6 (1.6–3.6)
Cuba, women: 6.8 (4.5–9.0); 4.8 (2.9–6.7); 8.1 (5.4–10.7); 5.2 (3.3–7.2); 6.1 (5.0–7.2)
Cuba, crude prevalence: 4.9 (4.1–5.7)
Dominican Republic, men: 8.5 (4.5–12.5); 6.7 (3.1–10.2); 15.9 (9.6–22.2); 15.4 (9.9–20.9); 11.1 (8.8–13.5)
Dominican Republic, women: 13.4 (10.3–17.6); 13.9 (10.0–17.7); 16.2 (11.8–20.7); 16.8 (13.1–20.6); 15.2 (13.3–17.2)
Dominican Republic, crude prevalence: 13.8 (12.3–15.3)
Puerto Rico, men: 0.4 (0.0–7.5); no cases; 1.3 (0.5–3.1); 0.9 (0.4–2.2); 1.2 (0.4–2.1)
Puerto Rico, women: 2.6 (0.8–4.5); 1.4 (0.0–2.7); 2.0 (0.4–3.6); 4.2 (2.3–6.2); 2.8 (1.9–3.7)
Puerto Rico, crude prevalence: 2.3 (1.7–3.0)
China urban, men: no cases; no cases; no cases; no cases; no cases
China urban, women: no cases; 0.5 (0.0–1.5); no cases; 1.7 (0.0–4.0); 0.5 (0.0–1.0)
China urban, crude prevalence: 0.3 (0.0–0.6)
China rural, men: 0.5 (0.0–1.5); 1.5 (0.0–3.7); 1.3 (0.0–3.9); no cases; 0.9 (0.0–1.8)
China rural, women: 0.5 (0.0–1.6); 0.6 (0.0–1.8); no cases; 1.3 (0.0–3.9); 0.5 (0.0–1.1)
China rural, crude prevalence: 0.7 (0.2–1.2)
India urban, men: 4.0 (1.1–7.0); 2.4 (0.0–5.1); 5.9 (0.1–11.8); 7.7 (0.2–15.2); 4.3 (2.4–6.2)
India urban, women: 4.6 (1.9–7.3); 4.8 (1.7–7.8); 1.3 (0.0–3.9); no cases; 3.7 (2.1–5.2)
India urban, crude prevalence: 3.9 (2.7–5.1)
India rural, men: 12.2 (6.7–17.7); 14.9 (8.2–20.6); 12.5 (5.5–19.5); 12.3 (4.6–20.1); 13.2 (10.1–16.3)
India rural, women: 10.9 (6.5–15.4); 14.8 (9.–19.8); 10.1 (3.7–16.5); 10.3 (2.9–17.7); 12.1 (9.4–14.8)
India rural, crude prevalence: 12.6 (10.5–14.7)
Nigeria, men: no cases; 1.3 (0.0–3.9); no cases; 1.1 (0.0–3.1); 0.5 (0.0–1.3)
Nigeria, women: 0.8 (0.0–1.9); 0.7 (0.0–2.0); no cases; no cases; 0.6 (0.0–1.2)
Nigeria, crude prevalence: 0.5 (0.1–1.0)

Table 3. Prevalence of depression (%) in each site, according to EURO-D criterion (cutpoint 4/5), stratified by age and sex.

Values are % (95% CI). Columns for men and women, by age group (years): 65–69; 70–74; 75–79; 80+; all ages. The crude prevalence for each site is given on its own row.

Cuba, men: 9.5 (6.0–13.0); 8.9 (5.6–12.2); 10.9 (6.8–14.9); 14.2 (9.7–18.7); 9.5 (7.7–11.3)
Cuba, women: 21.8 (18.1–25.4); 18.1 (14.7–21.5); 22.0 (17.9–26.1); 25.4 (21.6–29.1); 20.3 (18.4–22.1)
Cuba, crude prevalence: 16.5 (15.1–17.9)
Dominican Republic, men: 17.6 (12.1–23.0); 15.4 (10.3–20.5); 25.0 (17.5–32.5); 24.9 (18.3–31.4); 19.6 (16.6–22.5)
Dominican Republic, women: 28.8 (23.9–33.6); 27.2 (22.2–32.1); 31.7 (26.1–37.3); 36.7 (31.9–41.6); 30.6 (28.1–33.2)
Dominican Republic, crude prevalence: 26.8 (24.8–28.8)
Puerto Rico, men: 14.2 (7.4–30.9); 7.4 (3.1–11.6); 8.9 (4.4–13.5); 13.8 (9.2–18.5); 6.3 (4.4–8.2)
Puerto Rico, women: 7.1 (12.8–21.5); 11.7 (7.9–15.4); 13.3 (9.5–17.2); 23.4 (19.3–27.6); 12.6 (10.8–14.5)
Puerto Rico, crude prevalence: 10.6 (9.2–12.0)
China urban, men: 2.7 (0.0–5.7); 3.5 (0.9–6.0); 3.4 (0.0–6.8); 11.0 (5.0–16.9); 1.9 (0.7–3.1)
China urban, women: 4.4 (1.6–7.3); 3.1 (0.4–5.8); 5.1 (1.4–8.8); 12.6 (6.6–18.7); 3.0 (1.7–4.3)
China urban, crude prevalence: 2.5 (1.6–3.4)
China rural, men: 3.1 (0.6–5.6); 3.8 (0–7.1); 7.8 (1.7–13.9); 8.7 (0.2–17.2); 1.4 (0.3–2.5)
China rural, women: 1.6 (0.0–3.4); 3.0 (0.4–5.7); 3.2 (0.0–6.3); 6.7 (0.9–12.4); 0.7 (0.0–1.5)
China rural, crude prevalence: 1.0 (0.3–1.7)
India urban, men: 17.3 (11.6–23.0); 18.3 (11.4–25.1); 29.9 (18.6–41.1); 32.7 (19.5–45.9); 20.9 (17.0–24.8)
India urban, women: 35.7 (29.5–41.9); 36.5 (29.5–43.5); 36.0 (24.9–47.1); 24.2 (13.6–34.9); 34.6 (30.6–38.5)
India urban, crude prevalence: 28.6 (25.7–31.5)
India rural, men: 36.7 (28.6–44.8); 38.3 (30.5–46.1); 39.8 (29.3–50.2); 42.5 (30.9–54.1); 36.7 (32.2–41.2)
India rural, women: 36.5 (29.6–43.3); 46.9 (39.9–53.9); 46.1 (35.3–56.8); 50.0 (37.8–62.2); 40.2 (35.9–44.4)
India rural, crude prevalence: 38.6 (35.3–41.9)
Nigeria, men: 16.9 (10.5–23.3); 23.7 (13.9–33.5); 14.7 (6.1–23.3); 23.2 (14.5–31.8); 18.8 (14.9–22.7)
Nigeria, women: 15.6 (11.1–20.1); 26.0 (18.8–33.2); 23.1 (11.2–34.9); 43.3 (32.9–53.8); 22.7 (19.5–26.0)
Nigeria, crude prevalence: 21.1 (18.8–23.5)

Fig. 1. Prevalence of depression (%) using different operational criteria, standardised by age, gender and education.

5.2. Prevalence of depression

The largest source of variation in the prevalence of depression was the criterion used for assessment. The prevalence of ICD-10 depressive episode varied between 0.3% and 13.8% by location (Table 2), whereas the prevalence of EURO-D depression ranged between 1.0% and 38.6% (Table 3). However, for each of these criteria, there was also substantial heterogeneity in prevalence among sites (supplementary fig. 1). The meta-analysed pooled estimate for ICD-10 depression was 4.7% (95% CI: 3.1–6.3) and for EURO-D depression 18.2% (95% CI: 12.3–24.0). Direct standardisation had some effect on the estimates, as shown in Fig. 1, which reports the prevalence for both criteria using direct standardisation for age, gender and education. The prevalence in the Dominican Republic, with all diagnostic criteria, was high with respect to that observed in other Latin American sites. The prevalence was exceptionally low in urban and rural China with all criteria. In all sites, with the exception of rural Peru, rural China and both Indian sites, the prevalence of depression was higher in women than in men.
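The two case definitions compared in this section can be made concrete with a short sketch of the EURO-D scoring rule and the sub-syndromal category. It is illustrative only: the function names are hypothetical, the ICD-10 result is taken as a given input (the GMS/AGECAT algorithm is not reproduced), and treating a score at or above the site cutpoint as a EURO-D case is an assumption based on the article's wording "a score of 5 or higher".

```python
# The 12 EURO-D symptom domains listed in the Measurements section.
EURO_D_ITEMS = (
    "depressed mood", "pessimism", "suicidality", "guilt", "sleep", "interest",
    "irritability", "appetite", "fatigue", "concentration", "enjoyment",
    "tearfulness",
)


def euro_d_score(symptoms):
    """Sum the item scores: 1 per symptom present, 0 otherwise (range 0 to 12)."""
    return sum(1 for item in EURO_D_ITEMS if symptoms.get(item, False))


def classify(symptoms, icd10_episode, cutpoint=5):
    """Assign one of three categories.

    Assumption: score >= cutpoint counts as EURO-D depression, and
    icd10_episode is supplied by a separate diagnostic algorithm.
    """
    if icd10_episode:
        return "ICD-10 depressive episode"
    if euro_d_score(symptoms) >= cutpoint:
        return "sub-syndromal depression"
    return "no depression"


# Hypothetical respondent reporting five symptoms but not meeting ICD-10 criteria.
example = {item: False for item in EURO_D_ITEMS}
for present in ("sleep", "fatigue", "irritability", "tearfulness", "interest"):
    example[present] = True

print(euro_d_score(example), classify(example, icd10_episode=False))
```

Passing a site-specific `cutpoint` (for example 4 for rural India) would mirror the per-site cutpoints described in the Methods, under the same threshold assumption.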
In Latin America, the prevalence of ICD-10 depression increased with age in men, but not in women, whereas an increasing trend in EURO-D prevalence was seen across both genders and sites. When we adjusted for both age and gender and pooled our estimates across sites, we found that men had a lower prevalence of ICD-10 depression than women (pooled PR: 0.62, 95% CI: 0.53–0.71, I² = 0.0%) and that prevalence rose with increasing age (pooled PR: 1.07, 95% CI: 1.02–1.12, I² = 45.2%). Given the higher prevalence of EURO-D depression compared with ICD-10 depressive episode, we explored the concept of sub-syndromal depression (EURO-D depression not confirmed as a depressive episode by the ICD-10). The prevalence of sub-syndromal depression varied across sites, with rural China having the lowest (0.4%) and rural India the highest (25.3%) (Table 4).

Table 4. Prevalence of sub-syndromal depression (EURO-D depression not confirmed by ICD-10).

Centre: crude prevalence, % (95% CI)
Cuba: 11.4 (10.3–12.7)
Dominican Republic: 13.7 (12.2–15.3)
Puerto Rico: 7.8 (6.7–9.1)
Mexico (urban): 15.0 (13.0–17.4)
Mexico (rural): 12.2 (10.3–14.4)
Peru (urban): 14.0 (12.2–15.9)
Peru (rural): 12.5 (10.0–15.5)
China (urban): 2.2 (1.5–3.2)
China (rural): 0.4 (0.2–1.1)
India (urban): 24.8 (22.1–27.6)
India (rural): 25.3 (22.6–28.2)
Nigeria: 20.4 (18.1–22.8)

5.3. Depression clinical aspects

Overall, 35.3% of ICD-10 depression cases were mild, 51.9% were moderate, and 12.7% severe. The proportion of current ICD-10 depressive episode cases with a past history of depression varied between 25.6% and 71.8%, with rural India constituting a low outlier with only 2/126 cases (1.6%) reporting a past history of depression. In general a past history was more frequently reported in urban than rural sites.
In Latin American sites, where a past history of depression was relatively frequently reported, around 20% to 60% of these individuals reported having previously been treated by a doctor, with higher proportions in Cuba, Puerto Rico, and Venezuela than in the Dominican Republic, Mexico and Peru. The median age for first onset of depression exceeded 60 years for most sites.

6. Discussion

In this study, we reported a wide variation of estimates according to the depression criterion that we used. Across all sites, the prevalence of ICD-10 depressive episode was lower than that of EURO-D depression (a score of 5 or higher on the EURO-D scale). However, for each of these criteria, there was also substantial variation in prevalence among sites. Therefore it is important to compare results between studies, where possible, based on the use of the same or similar criteria. On this basis, our results suggested a higher prevalence of late-life depression, in at least some sites in Latin America, and in urban India, than is typically recorded in studies in high income countries. Conversely, the prevalence in China was very low.

6.1. Strengths and weaknesses of the study

To our knowledge this is the first large-scale community-based depression-prevalence study conducted in LMIC that, with the same methodology, has evaluated a large number of older persons, in nine LMIC located in three continents, using rigorous research diagnostic criteria such as the ICD-10 and the EURO-D. Unlike studies in HIC, an important advantage of our study is the relatively high response rate, at least 80% in all sites, and exceeding 90% in several sites. Rather than a comprehensive clinical diagnostic interview, depression was determined according to two different criteria (ICD-10; EURO-D). While the findings of this study may be to some extent generalisable to other similar urban or rural sites, they may not be generalised to the whole city, or country where the study was conducted.
Comparison of findings with studies that systematically sampled whole cities, or conducted national surveys, may be particularly difficult.

6.2. Depression prevalence

Other than the relatively high prevalence of ICD-10 depressive episode in the Dominican Republic and rural India, and the low estimates for China and Nigeria, our findings are broadly consistent with those reported in high income countries. A review (Djernes, 2006) reported an ICD-10 prevalence of 3.3% in Australia and 7.7% in Denmark. More recently, a Brazilian community-based survey of older adults (Costa et al., 2007) reported an unusually high prevalence of ICD-10 depressive episode (19.2%). However, it is difficult to compare our findings with this study, since their sample size was small (n = 413), people with dementia were excluded, the age range was 75 years and older, and a two phase design (symptom scale and semi-structured SCAN interview) was used.

The prevalence of EURO-D depression was generally six times higher than that of ICD-10 depressive episode. These ratios are consistent with earlier reviews and studies regarding the ratio of depression identified with such screening scales, as compared to clinical diagnoses (Castro-Costa et al., 2007; Prince et al., 2004). A large community study, carried out in ten European countries in persons aged 55 and above, using the EURO-D measure, reported prevalence rates between 19% and 33% (Castro-Costa et al., 2007). Our results are congruent with these results even though the methods differ between studies and there is much more variability in prevalence among sites in our study, mainly arising from the low prevalence in China.
Unlike rigid criteria-based instruments (ICD-10 and DSM-IV), identification as a probable case of depression using EURO-D depends only on the overall load of reported symptoms, rather than requiring the presence of particular symptoms and combination of symptoms, and is without regard to their duration, persistence or pervasiveness. As such, it is important to recognise that not all of these individuals would be considered to be ‘cases for treatment’ since current evidence-based recommendations are exclusively for those with moderate to severe case level depression (Patel, 2009). The discrepancy in prevalence between the two approaches is explained by the less than perfect specificity of the EURO-D, which, given the low prevalence of DSM-IV and ICD-10 depression in population settings, results in a low positive predictive value. The disparity is striking, particularly in Nigeria, where very few if any clinical diagnoses were recorded, but there was a relatively high prevalence of most depression symptoms, and a high prevalence of EURO-D depression.

The generally much higher prevalence of EURO-D depression raises the question “what constitutes a case?”. This issue was discussed in an earlier review of late-life depression in which the disparity between prevalence according to clinical diagnostic criteria (1.8%) and using symptoms scales and other less restrictive criteria (13.5%) was first highlighted (Beekman et al., 1999). Although not all EURO-D cases may be ‘cases for treatment’, reliance upon clinical diagnoses may significantly underestimate the population burden of depression symptoms, much of which may arise from the larger number of individuals with less severe ‘sub-syndromal’ depression.

6.3. Variation of prevalence among sites

As can be appreciated from the above, the pattern of variation of prevalence among sites was generally similar for the two diagnostic criteria.
Estimates were generally high, and fairly consistent in Latin American sites, lower in urban India than in rural India (where prevalence was similar to that of the highest prevalence Latin American site, the Dominican Republic) and very low in the two Chinese sites. Nigeria was unusual in this respect, with a very low prevalence of ICD-10 depression, but a comparatively high prevalence of EURO-D depression, similar to that in Latin American sites.

The low prevalence of depression in China might be partly explained by contextual factors including the influence of culture on ascertainment of depression. In China the once popular and prevalent diagnosis of shenjing shuairuo, a neurasthenia-like syndrome comprising weakness, fatigue, concentration problems, headache and other somatic symptoms, seems in recent years to have been supplanted as the most common diagnosis in epidemiological surveys and clinical practice by depressive and anxiety disorders (Lee, 1999). This has led some to allege an inappropriate importation of western nosologies that do not match well with Chinese cultural idioms of expression of psychological distress (Lee, 1999). In this context, it is perhaps noteworthy that depression was not a common symptom in either urban or rural Chinese sites, and that sleep disturbance, fatigue and irritability were the three commonest symptoms in the urban site, and tearfulness, lack of concentration and loss of interest in the rural site. More work needs to be done to establish the validity of the GMS interview across cultures as a tool for generating ICD-10 and EURO-D diagnoses.

None of the systematic reviews of studies we used as a guide of the prevalence of depression (Cole, 2003; Djernes, 2006) considered the effect of urban or rural residence.
In this study there was a trend towards a lower prevalence of late-life depression in rural than urban sites in Latin America, with the opposite trend seen in India. Findings elsewhere in the literature are inconsistent. Some community cross-sectional studies reported a higher prevalence in urban residence (Carpiniello and Rudas, 1989; Chiu et al., 2005; Gureje et al., 2007), associated with a higher prevalence of chronic medical conditions and functional impairment, and lack of, or poor, social support. Others did not find any association (St John et al., 2006).

7. Conclusion

Overall, our findings are congruent with those previously reported in the literature and, given the pattern of findings, we can conclude that late-life depression prevalence varied depending on the criterion used for assessment. Wide variation in prevalence among sites needs to be evaluated. More work needs to be done to understand adequately the expression of depression in different cultures. This must be the focus of further analysis. Prospective longitudinal studies are needed in order to clarify aetiological factors and to disentangle those factors that influence prevalence through increasing the duration of depressive episodes (maintenance of depression) and those that increase the incidence (onset) of depression. Given the high burden of this condition, prioritisation of recognition and treatment of depression in older adults should be on the agenda of policy-makers across the world. This goes together with the urgent need to strengthen primary care settings, to develop locally appropriate support services as an important component of ensuring social protection, and finally to develop primary and secondary prevention strategies using evidence from appropriate studies.
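The Discussion attributes the gap between EURO-D and ICD-10 prevalence to imperfect specificity combined with low prevalence, which yields a low positive predictive value. A minimal sketch of that arithmetic follows. The function name and the sensitivity, specificity and prevalence figures are hypothetical and are not taken from the study; they only show how quickly the predictive value drops when the condition is rare.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of screen-positive people who truly have the condition (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    if true_positives + false_positives == 0:
        raise ValueError("no positive screens are possible with these inputs")
    return true_positives / (true_positives + false_positives)


# Hypothetical screening scale: 90% sensitivity, 85% specificity,
# applied where 5% of people meet the clinical criteria.
ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.90, specificity=0.85)
print(f"PPV: {ppv:.0%}")
```

Under these assumed figures, most screen-positive respondents would not meet the clinical criteria, which is the pattern the authors describe when they separate EURO-D cases from ‘cases for treatment’.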
Acknowledgements

The 10/66 Dementia Research Group population based surveys were supported by the Wellcome Trust (UK) (GR066133); the World Health Organization; the US Alzheimer’s Association (IIRG – 04–1286); and the Fondo Nacional de Ciencia Y Tecnologia, Consejo de Desarrollo Cientifico Y Humanistico, Universidad Central de Venezuela (Venezuela). Matthew Prina is funded by the Medical Research Council [Grant number MR/K021907/1]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.jad.2015.09.004.

References

Alvarado, B.E., Zunzunegui, M.V., Beland, F., Sicotte, M., Tellechea, L., 2007. Social and gender inequalities in depressive symptoms among urban older adults of Latin America and the Caribbean. J. Gerontol. B Psychol. Sci. Soc. Sci. 62B, S226–S236.
Beekman, A.T., Copeland, J.R., Prince, M.J., 1999. Review of community prevalence of depression in later life. Br. J. Psychiatry 174, 307–311.
Blay, S.L., Marinho, V., 2007. Depression in the elderly. [Portuguese]. Rev. Bras. Med. 64, 150–155.
Blazer, D., 2003. Depression in late life: review and commentary. J. Gerontol. A Biol. Sci. Med. Sci. 58, 249–265.
Brailean, A., Guerra, M., Chua, K.C., Prince, M., Prina, M.A., 2015. A multiple indicators multiple causes model of late-life depression in Latin American countries. J. Affect. Disord. 184, 129–136.
Carpiniello, B., Rudas, C.M., 1989. Depression among elderly people. A psychosocial study of urban and rural populations. Acta Psychiatr. Scand. 80, 445–450.
Carvalhais, S.M.M., Peixoto, L.-C.M., Firmo, S.V., Castro-Costa, J.O.A., Uchoa E, E., 2008. The influence of socioeconomic conditions on the prevalence of depressive symptoms and its covariates in an elderly population with slight income differences: the Bambuí Health and Aging Study (BHAS). Int. J. Soc. Psychiatry 54, 447–456.
Castro-Costa, E., Dewey, M., Stewart, R., Banerjee, S., Huppert, F., Mendonca-Lima, C., Bula, C., Reisches, F., Wancata, J., Ritchie, K., Tsolaki, M., Mateos, R., Prince, M., 2007. Prevalence of depressive symptoms and syndromes in later life in ten European countries: the SHARE study. Br. J. Psychiatry 191, 393–401.
Chen, R., Copeland, J.R., Wei, L., 1999. A meta-analysis of epidemiological studies in depression of older people in the People's Republic of China. Int. J. Geriatr. Psychiatry 14, 821–830.
Chen, R., Qin, H.Z., Xu, X., Copeland JRM, X., 2004. A community-based study of depression in older people in Hefei, China – the GMS-AGECAT prevalence, case validation and socio-economic correlates. Int. J. Geriatr. Psychiatry 19, 407–413.
Chen, R., Hu, W.L., Qin, Z., Copeland JRM., X., Hemingway, H., 2005. Depression in older people in rural China. Arch. Intern. Med. 165, 2019–2025.
Chiu, H., Huang Ch., C.C., Mau, L., 2005. Depressive symptoms, chronic medical conditions and functional status: a comparison of urban and rural elders in Taiwan. Int. J. Geriatr. Psychiatry 20, 635–644.
Christensen, K., Doblhammer, G., Rau, R., Vaupel, J.W., 2009. Ageing populations: the challenges ahead. Lancet 374, 1196–1208.
Cole MG, D.N., 2003. Risk factors for depression among elderly community subjects: a systematic review and meta-analysis. Am. J. Psychiatry 160, 1147–1156.
Copeland, J.R.M., Kelleher, M.J., Kellett, J.M., Gourlay, A.J., Gurland, B.J., Fleiss, J.L., Sharpe, L., 1976. A semi-structured clinical interview for the assessment of diagnosis and mental state in the elderly: the Geriatric Mental State Schedule. I. Development and reliability. Psychol. Med. 6, 439–449.
Costa, E., Barreto, S.M., Uchoa, E., Firmo, J.O.A., Lima-Costa, M.F., Prince, M., 2007. Prevalence of International Classification of Diseases, 10th Revision common mental disorders in the elderly in a Brazilian community: the Bambui Health Ageing Study. Am. J. Geriatr. Psychiatry 15, 17–27.
Djernes, J.K., 2006. Prevalence and predictors of depression in populations of elderly: a review. Acta Psychiatr. Scand. 113, 372–387.
García-Peña C, W.F., Sánchez-García, S., Júarez-Cedillo, T., Espinel-Bermudez, C., García-Gonzalez, J.J., Gallegos-Carrillo, K., Franco-Marina, F., Gallo, J.J., 2008. Depressive symptoms among older adults in Mexico City. J. Gen. Intern. Med. 23 (12), 1973–1980.
Guerra, M., Ferri, C., Llibre, J., Prina, A.M., Prince, M., 2015. Psychometric properties of EURO-D, a geriatric depression scale: a cross-cultural validation study. BMC Psychiatry 15, 12.
Guerra, M., Ferri, C.P., Sosa, A.L., Salas, A., Gaona, C., Gonzales, V., de la Torre, G.R., Prince, M., 2009. Late-life depression in Peru, Mexico and Venezuela: the 10/66 population-based study. Br. J. Psychiatry 195, 510–515.
Gureje, O., Kola, L., Afolabi, E., 2007. Epidemiology of major depressive disorder in elderly Nigerians in the Ibadan Study of Ageing. A community-based survey. Lancet 370, 957–964.
Lee, S., 1999. Diagnosis Postponed: Shenjing Shuairuo and the Transformation of Psychiatry in Post-Mao China. Cult. Med. Psychiatry 23, 349–380.
Meng, C., Tang, Z., 2000. Analysis and comparison urban and rural elderly depressive symptoms in Beijing. Chin. J. Gerontol. 20, 196–199.
Pan, A., Franco, O.H., Wan, Y., Yu, Z., Ye, X., Lin, X., 2008. Prevalence and geographic disparity of depressive symptoms among middle-aged and elderly in China. J. Affect. Disord. 105, 167–175.
Patel, V.,T.G., 2009. Packages of care for mental, neurological, and substance use disorders in low- and middle-income countries. PLoS Med. 6, e1000160.
Prince, M., Acosta, D., Chiu, H., Scazufca, M., Varghese, M., 2003. Dementia diagnosis in developing countries: a cross-cultural validation study. Lancet 361, 909–917.
Prince, M., Ferri, C., Acosta, D., Albanese, E., Arizaga, R., Dewey, M., et al., 2007. The protocols for the 10/66 dementia research group population-based research programme. BMC Public Health 7, 165.
Prince, M., Acosta, D., Chiu, H., Copeland, J., Dewey, M., Scazufca, M., Varghese, M., 10/66 Dementia Research Group, 2004. Effects of education and culture on the validity of the Geriatric Mental State and its AGECAT algorithm. Br. J. Psychiatry 185, 429–436.
Prince, M.J., Reischies, F., Beekman, A.T.F., Fuhrer, C., Jonker, S.L., Kivela, B.A., Lawlor, A., Lobo, H., Magnusson, M., Fichter, H., van Oyen, H., Roelands, M., Skoog, I., Turrina, C., Copeland, J.R.M., 1999. Development of the EURO-D scale – a European Union initiative to compare symptoms of depression in 14 European centres. Br. J. Psychiatry 174, 330–338.
St John, P.D., Strain, B.A., 2006. Depressive symptoms among older adults in urban and rural areas. Int. J. Geriatr. Psychiatry, 1175–1180.
Tintle, N.B., Kostyushenko, B., Gutkovish, S., Bromet, Z., 2011. Depression and its correlates in older adults in Ukraine. Int. J. Geriatr. Psychiatry 26, 1292–1299.
Wu, W., Zhang, M.Y., 1989. Application of depression scale CES-D among the elderly people in the community, Shanghai. Arch. Psychiatry 7, 139–142.
Zunzunegui, M.V., Alvarado, B.E., Beland, F., Vissandjee, B., 2009. Explaining health differences between men and women in later life: a cross-city comparison in Latin America and the Caribbean. Soc. Sci. Med. 68, 235–242.

Tutor Answer

smithwiliams
School: UT Austin

Attached.

Running head: CROSS CULTURAL RESEARCH

Cross-Cultural Research
Student’s name:
Institution:
Course:


Summary of a research study
The research is a comparative study of late-life depression in low- and middle-income
countries. The reported depression levels vary across these countries and depend on the
diagnostic criteria used. Economic, social, political, religious and other common factors contribute to
the variation. In conclusion, the researchers assert that culture and the ways of living of the
residents in the affected regions lead to the occu...

Review

Anonymous
Thanks, good work
