Author's personal copy
Innov High Educ (2015) 40:291–303
DOI 10.1007/s10755-014-9313-4
What’s in a Name: Exposing Gender Bias in Student
Ratings of Teaching
Lillian MacNell & Adam Driscoll & Andrea N. Hunt
Published online: 5 December 2014
© Springer Science+Business Media New York 2014
Abstract Student ratings of teaching play a significant role in career outcomes for higher
education instructors. Although instructor gender has been shown to play an important role in
influencing student ratings, the extent and nature of that role remain contested. While it is difficult
to separate gender from teaching practices in person, it is possible to disguise an instructor’s
gender identity online. In our experiment, assistant instructors in an online class each operated
under two different gender identities. Students rated the male identity significantly higher than
the female identity, regardless of the instructor’s actual gender, demonstrating gender bias.
Given the vital role that student ratings play in academic career trajectories, this finding
warrants considerable attention.
Keywords gender inequality · gender bias · student ratings of teaching · student evaluations of
instruction
Lillian MacNell is a doctoral candidate in Sociology at North Carolina State University. She received her
Master’s degree in Sociology at the University of Central Florida. Her research and teaching interests include
food access, food justice, and the environment.
Adam Driscoll is Assistant Professor of Sociology at the University of Wisconsin-La Crosse. He received his
Master’s degree in Sociology at East Carolina University and his Ph.D. in Sociology at North Carolina State
University. His research and teaching focus upon the environmental impacts of industrial agriculture and effective
online pedagogy.
Andrea N. Hunt has a Ph.D. in Sociology from North Carolina State University and is currently Assistant
Professor in Sociology and Family Studies at the University of North Alabama. Her research interests include
gender, race and ethnicity, mentoring in undergraduate research, engaging teaching practices, and the role of
academic advising in student retention.
L. MacNell (*)
Department of Sociology and Anthropology, 334 1911 Building, Campus Box 8107,
Raleigh, North Carolina 27695, USA
e-mail: loconne@ncsu.edu
A. Driscoll
University of Wisconsin-La Crosse, La Crosse, WI, USA
e-mail: adriscoll@uw.lax.edu
A. N. Hunt
University of North Alabama, Florence, AL, USA
e-mail: ahunt3@una.edu
Student ratings of teaching are often used as an indicator of the quality of an instructor’s
teaching and play an important role in tenure and promotion decisions (Abrami, d’Apollonia,
& Rosenfield, 2007; Benton & Cashin, 2014). Gender bias in these ratings constitutes an
important form of inequality facing women in academia that is often unaccounted for in such
decisions. Students perceive, evaluate, and treat female instructors quite differently than they
do male instructors (Basow, 1995; Centra & Gaubatz, 2000; Feldman, 1992; Young, Rush, &
Shaw, 2009). While a general consensus exists that gender plays a vital role in how students
perceive and interact with their instructors, there is conflicting evidence as to whether or not
this translates into a bias in student ratings due to variations in several mediating factors such
as teaching styles and subject material.
Prior studies of student ratings of instruction have been limited in their ability to test for the
existence of gender bias because it is difficult to separate the gender of an instructor from their
teaching practices in a face-to-face classroom. In online courses, however, students usually base
the categorization of their instructor’s gender on the instructor’s name and, if provided, photograph. It is possible for students to believe that their instructor is actually a man, based solely on a
name or photograph, when in reality she is a woman, or vice versa. Therefore, the online
environment affords researchers a unique opportunity to assign one instructor two different
gender identities in order to understand whether or not differences in student ratings are a result
of differences in teaching or simply based on unequal student expectations for male and female
instructors. Such experimentation allows researchers to control for potentially confounding
factors and therefore attribute observed differences solely to the variable of interest—in this
case, the perceived gender of the instructor (Morgan & Winship, 2007).
This study analyzed differences in student ratings of their instructors¹ from an online
course, independent of actual gender. The course professor randomly assigned students to
one of six discussion groups, two of which the professor taught directly. The other four were
taught by one of two assistant instructors—one male and one female. Each instructor was
responsible for grading the work of students in their group and interacting with those students
on course discussion boards. Each assistant instructor taught one of their groups under their
own identity and the second group under the other assistant instructor’s identity. Thus, of the
two groups who believed they had the female assistant instructor, one actually had the male.
Similarly, of the two groups who believed they had the male assistant instructor, one actually
had the female (see Table 1). At the end of the course, the professor asked students to rate their
instructor through the use of an online survey. This design created a controlled experiment that
allowed us to isolate the effects of the gender identity of the assistant instructors, independent
of their actual gender. If gender bias was present, then the students from the two groups who
believed they had a female assistant instructor should have given their instructor significantly
lower evaluations than the two groups who believed they had a male assistant instructor.
¹ To clarify the language we use throughout the paper, we refer to all three persons responsible for grading and
directly interacting with students as “instructors.” The course “professor” was the person responsible for course
design and content preparation, while the two “assistant instructors” worked under the professor’s direction to
manage and teach their respective discussion groups.
Table 1 Experimental Design.
Discussion Group    Instructor’s Perceived Gender    Instructor’s Actual Gender
Group A (n = 8)     Female                           Female
Group B (n = 12)    Female                           Male
Group C (n = 12)    Male                             Female
Group D (n = 11)    Male                             Male
Student Ratings of Teaching
Though far from perfect, student ratings of teaching provide valuable feedback about an
instructor’s teaching effectiveness (Svinicki & McKeachie, 2010). They may be reliably
interpreted as both a direct measure of student satisfaction with instruction and as an indirect
measure of student learning (Marsh, 2007; Murray, 2007). They also play an important role in the
selection of teaching award winners, institutional reviews of programs, and student course
selection (Benton & Cashin, 2014). More importantly to the careers of educators, these ratings
are “used by faculty committees and administrators to make decisions about merit increases,
promotion, and tenure” (Davis, 2009, p. 534). In particular, quantitative evaluations of instructors’ overall teaching effectiveness are frequently emphasized in personnel decisions (Centra &
Gaubatz, 2000). Given the widespread reliance on student ratings of teaching and their effect on
career advancement, any potential bias in those ratings is a matter of great consequence.
Gender Bias in Academia
Sociological studies of gender and gender inequality are careful to distinguish between sex (a
biological identity) and gender (a socially constructed category built around cultural expectations of male- and female-appropriate behavior). Gender is part of an ongoing performance
based on producing a configuration of behaviors that are seen by others as normative. West and
Zimmerman (1987) suggested that people engage in gendered behaviors not only to live up to
normative standards, but also to minimize the risk of accountability or gender assessment from
others. Thus, gender is a process that is accomplished at the interactional level and reinforced
through the organization of social institutions such as academia (Lorber, 1994). Gender then
contributes to a hierarchal system of power relations that is embedded within the interactional
and institutional levels of society and shapes gendered expectations and experiences in the
workplace (Risman, 2004).
An examination of gender bias in student ratings of teaching must be framed within the
broader context of the pervasive devaluation of women, relative to men, that occurs in
professional settings in the United States (Monroe, Ozyurt, Wrigley, & Alexander, 2008). In
general, Western culture accords men an automatic credibility or competence that it does not
extend to women (Johnson, 2006). Stereotypes that women are less logical, less confident, and
occupy lower positions still pervade our organizational structures (Acker, 1990). Conversely,
men are automatically assumed to have legitimate authority, while women must prove their
expertise to earn the same level of respect. This disparity has been well documented in the field
of academia, where men tend to be regarded as “professors” and women as “teachers” (Miller
& Chamberlin, 2000) and women face a disproportionate number of gender-based obstacles, relative
to men (Morris, 2011).
In experiments where researchers gave students identical articles to evaluate—half of which
bore a man’s name and half of which bore a woman’s—the students rated the research they
thought had been done by men more highly (Goldberg, 1968; Paludi & Strayer, 1985). In a
similar study, college students evaluated two hypothetical applicants for a faculty position and
tended to judge the male candidate as more qualified despite the fact that both applicants had
identical credentials (Burns-Glover & Veith, 1995). Additionally, a study of student
evaluations of instructors’ educational attainment revealed that students misattribute male
instructors’ education upward and female instructors’ education downward (Miller & Chamberlin, 2000). Overall, women in academia tend to be regarded as less capable and less
accomplished than men, regardless of their actual achievements and abilities.
Gender Role Expectations
Students often expect their male and female professors to behave in different ways or to
respectively exhibit certain “masculine” and “feminine” traits. Commonly held masculine, or
“effectiveness,” traits include professionalism and objectivity; feminine, or “interpersonal,”
traits include warmth and accessibility. Students hold their instructors accountable to these
gendered behaviors and are critical of instructors who violate these expectations (Bachen,
McLoughlin, & Garcia, 1999; Chamberlin & Hickey, 2001; Dalmia, Giedeman, Klein, &
Levenburg, 2005; Sprague & Massoni, 2005). Consequently, instructors who adhere to gendered expectations are viewed more favorably by their students (Andersen & Miller, 1997;
Bennett, 1982). When female instructors exhibit strong interpersonal traits, they are viewed
comparably to their male counterparts. When female instructors fail to meet these gendered
expectations, however, they are sanctioned, while male instructors who do not exhibit strong
interpersonal traits are not (Basow & Montgomery, 2005; Basow, Phelan, & Capotosto, 2006).
At the same time, students are less tolerant of female instructors whom they perceive as lacking
professionalism and objectivity than they are of male instructors who lack the same qualities
(Bennett, 1982). In general, “students’ perceptions and evaluations of female faculty are tied
more closely to their gender expectations than for male faculty” (Bachen et al., 1999, p. 196).
These different standards can place female instructors in a difficult “double-bind,” where
gendered expectations (that women be nurturing and supportive) conflict with the professional
expectations of a higher-education instructor (that they be authoritative and knowledgeable)
(Sandler, 1991; Statham, Richardson, & Cook, 1991). On the one hand, students expect female
instructors to embody gendered interpersonal traits by being more accessible and personable.
However, these same traits can cause students to view female instructors as less competent or
effective. On the other hand, female instructors who are authoritative and knowledgeable are
violating students’ gendered expectations, which can also result in student disapproval.
Therefore, female instructors are expected to be more open and accessible to students as well
as to maintain a high degree of professionalism and objectivity. Female instructors who fail to
meet these higher expectations are viewed as less effective teachers than men (Basow, 1995).
Male instructors, however, are rated more highly when they exhibit interpersonal characteristics in addition to the expected effectiveness characteristics (Andersen & Miller, 1997). In
other words, female instructors who fail to exhibit an ideal mix of traits are rated lower for not
meeting expectations, while male instructors are not held to such a standard. Consequently,
gendered expectations represent a greater burden for female than male instructors (Sandler,
1991; Sprague & Massoni, 2005). An important manifestation of that disparity is bias in
student ratings of instructors, where female instructors may receive lower ratings than males,
not because of differences in teaching but for failing to meet gendered expectations.
Methodological Concerns with Previous Studies of Gender Bias
Studies of gender bias in student ratings of instruction have presented complicated and
sometimes contradictory results. Sometimes men received significantly higher ratings (Basow
& Silberg, 1987; Sidanius & Crane, 1989), sometimes women (Bachen et al., 1999; Rowden
& Carlson, 1996), and sometimes neither (Centra & Gaubatz, 2000; Feldman, 1993). The
variety of results in these studies suggests that gender does play a role in students’ ratings of
their instructors, but that it is a complex and multifaceted one (Basow et al., 2006).
One reason why prior research on gender bias in student ratings of teaching has provided
such inconclusive results may lie in the research design of these previous studies. A large
portion of research on student ratings of teaching directly utilized those ratings for their data
(e.g. Basow, 1995; Bennett, 1982; Centra, 2007; Centra & Gaubatz, 2000; Marsh, 2001). This
strategy allows for the analysis of a large amount of data, but it does not control for differences
in actual teaching and therefore may fail to capture gender bias in student ratings. Studies that
compare student ratings of instructors explore whether or not there are differences—not
whether or not those differences are the result of gender bias (Feldman, 1993). For example,
a study of ratings may find that a female instructor received significantly lower scores than a
male peer, but it could not assess whether that indicates a true difference in teaching quality.
Perhaps she was not perceived as warm and engaging; failing to meet the gendered expectations of the students, she may have been rated more poorly than her male peer despite being an
equally effective instructor. Similarly, the lack of a gender disparity in student ratings of
instruction could actually obscure a gender bias if at a particular institution the female faculty
members were, on average, stronger instructors than the males, yet were being penalized by
the students due to bias (Feldman, 1993).
Additionally, a number of situational elements may serve to sway student ratings of male
versus female instructors, as male and female professors tend to occupy somewhat different
teaching situations. Men are overrepresented in the higher ranks of academic positions as well
as in STEM fields. They are also more likely to teach upper-level courses whereas women are
more likely to teach introductory courses (Simeone, 1987; Statham et al., 1991). Women are
also more likely than men to be employed in full-time non-tenure track positions as well as in
part-time positions (Curtis, 2011). These factors are highly relevant because instructor rank,
academic area, and class level of the course have all been found to directly impact student
ratings of instruction (Feldman, 1993; Liu, 2012). All of these factors serve to complicate the
relationship between instructor gender and student ratings of instruction and obfuscate the
conclusions that can be drawn from direct studies of such ratings. Studies of actual student
ratings of instruction may tell us more about women’s position in academia than about actual
gender bias in student ratings. In contrast, experimental studies allow the researcher to control
for both the quality and character of the teaching as well as the academic position of the
instructor, ensuring that any differences registered in student ratings indicate, as much as
possible, a bias rather than an actual difference in teaching (Feldman, 1993).
Research Question and Related Hypotheses
The fundamental question examined in this study is whether or not students rate their
instructors differently on the basis of what they perceive those instructors’ gender to be. We
expected that there would be no difference between the ratings for the actual male and female
instructors in the course as every attempt was made to minimize any differences in interaction
and teaching. However, we expected that student ratings of instructors would reflect the
different expectations for male and female instructors discussed above. Instructors whom
students perceived to be male would be afforded an automatic credibility on their competence
and professionalism. Furthermore, they would not be penalized for any perceived deficiency in
their interpersonal skills. Therefore, we expected that students would rate the instructors they
believed to be male more highly than ones they believed to be female, regardless of the
instructors’ actual gender.
The Study and Methodology
This study examined gender bias in student ratings of teaching by falsifying the gender of
assistant instructors in an online course and asking students to evaluate them along a number
of instructional criteria. By using a 2-by-2 experimental design (see Table 1), we were able to
compare student evaluations of a perceived gender while holding the instructor’s actual gender
(and any associated differences in teaching style) constant. Any observed differences in how
students rated one perceived gender versus the other must have therefore derived from bias on
the students’ part, given that the exact same two instructors (one of each gender) were being
evaluated in both cases.
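The logic of the 2-by-2 comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings are invented placeholder values (not the study's data), and the helper name `marginal_mean` is hypothetical. The sketch pools the four cells of the design two ways, once by perceived gender and once by actual gender, so that the two gaps can be read side by side.

```python
from statistics import mean

# Hypothetical ratings on a 1-5 scale, keyed by (perceived, actual) gender.
# These numbers are invented for illustration; they are NOT the study's data.
ratings = {
    ("female", "female"): [4, 3, 4],
    ("female", "male"): [3, 4, 3],
    ("male", "female"): [5, 4, 4],
    ("male", "male"): [4, 5, 4],
}

PERCEIVED, ACTUAL = 0, 1  # positions of each factor inside the cell key


def marginal_mean(cells, factor_index, level):
    """Pool every cell whose key matches `level` on one factor and average it."""
    pooled = [
        score
        for key, scores in cells.items()
        if key[factor_index] == level
        for score in scores
    ]
    return mean(pooled)


# Gap attributable to the gender students believed their instructor to be.
perceived_gap = (
    marginal_mean(ratings, PERCEIVED, "male")
    - marginal_mean(ratings, PERCEIVED, "female")
)

# Gap attributable to the instructor who actually did the teaching.
actual_gap = (
    marginal_mean(ratings, ACTUAL, "male")
    - marginal_mean(ratings, ACTUAL, "female")
)

print(f"perceived-gender gap: {perceived_gap:+.2f}")
print(f"actual-gender gap:    {actual_gap:+.2f}")
```

Because each assistant instructor appears once under each identity, a gap that shows up only in the perceived-gender comparison cannot be explained by who actually taught the group. A full analysis would also need a significance test, which this sketch omits.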
Subjects
Data were collected from an online introductory-level anthropology/sociology course offered
during a five-week summer session at a large (20,000+), public, 4-year university in North
Carolina. The University’s institutional review board had approved this study (IRB# 2640). The
course fulfilled one of the university’s general education requirements, and the students
represented a range of majors and grade levels. The majority of the participants were traditional
college-aged students with a median age of 21 years. The instructors taught the course entirely
through a learning management system and students’ only contact with their instructors was
either through e-mail or comments posted on the learning management system. The professor
delivered course content through assigned readings and written PowerPoint slideshow lectures.
The course was broken up into nine different content sections. For each section, students were
required to read the assigned material and make a series of posts on a structured discussion
board. The course had 72 students who were randomly divided into six discussion groups for
the entirety of the course. All discussion board activity took place within the assigned
discussion group. Each discussion group had one instructor responsible for moderating the
discussion boards and grading all assignments for that group. The course professor took two
groups and divided the remaining four between the two assistant instructors, each taking one
group under their own identity and a second under their fellow assistant instructor’s identity (see
Table 1). All instructors were aware of the study being conducted and cooperated fully.
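Random division of a class roster into equal discussion groups, as described above, might be sketched as follows. The paper does not say how the randomization was carried out, so the shuffle-and-deal approach, the function name `assign_to_groups`, the student labels, and the seed value are all assumptions made for illustration.

```python
import random


def assign_to_groups(student_ids, n_groups, seed=None):
    """Shuffle the roster, then deal students round-robin into n_groups lists."""
    rng = random.Random(seed)  # a seeded generator makes the assignment reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    # Slicing with a stride deals the shuffled roster out like cards,
    # so group sizes differ by at most one student.
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Hypothetical roster of 72 students, matching the class size reported above.
students = [f"student_{i:02d}" for i in range(1, 73)]
groups = assign_to_groups(students, n_groups=6, seed=2014)

for number, members in enumerate(groups, start=1):
    print(f"group {number}: {len(members)} students")
```

With 72 students and six groups, every group receives exactly 12 students, and each student appears in exactly one group.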
The section discussion boards were the primary source of interaction between students and
the course instructors and, as such, represented 30% of the students’ final grades. The
discussion boards were also an important part of student learning because they were the main
arena in which students could analyze and voice questions about course concepts and material.
The instructor assigned to each discussion group maintained an active presence on each
discussion board, offering comments and posing questions. The instructor also graded students’ posts and provided detailed feedback on where students had lost points. The two
assistant instructors for the four discussion groups employed a wide range of strategies so as
to maintain consistency in teaching style and grading. The two assistant instructors composed
personal introduction posts that indicated similar biographical information and background
credentials. They posted on the discussion boards and graded assignments at the same time of
day three days each week to ensure that no group received significantly faster or slower
feedback than others. The professor provided detailed grading rubrics for the discussion
boards, and the instructors coordinated their grading to ensure that these rubrics were applied
to students’ work equitably.²
² A one-way ANOVA test confirmed that there was no significant variation among all six groups’ discussion
board grades and overall grades for the course.
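For readers unfamiliar with the check mentioned in the footnote, the F statistic of a one-way ANOVA can be computed by hand as sketched below. The grades shown are invented placeholders for three small groups (not the course's six groups), the function name `one_way_anova_f` is hypothetical, and the sketch stops at the F statistic and its degrees of freedom because the Python standard library has no F distribution for a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numeric scores."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical discussion-board grades for three groups; NOT the study's data.
grades = [[85, 90, 88], [86, 89, 87], [84, 91, 90]]
f_stat, df_b, df_w = one_way_anova_f(grades)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

An F statistic well below 1, as in this toy example, means the group means differ less than the spread within groups would lead one to expect, which is the pattern consistent with "no significant variation" across groups.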
Toward the end of the course the professor sent students reminder e-mails requesting that
they complete an online evaluation of their instructor. These evaluations were explained as
serving the purpose of providing the professor with feedback about the instructors’ performance. The survey asked students to rate their instructor on various factors such as accessibility, effectiveness, and overall quality. Over 90% of the class completed the evaluation. For
the purpose of this study, we only analyzed data from the discussion groups assigned to the
assistant instructors, leaving us with 43 subjects.
Instrument
The instructor evaluation consisted of 15 closed-ended questions that asked students to rate their
instructors on a variety of measures using a five-point Likert scale (1 = Strongly disagree, 2 =
Disagree, 3 = Neither Agree nor Disagree, 4 = Agree, 5 = Strongly agree). The survey had six
questions designed to measure effectiveness traits (e.g. professionalism, knowledge, and
objectivity) and six questions designed to measure interpersonal traits (e.g. respect, enthusiasm, and warmth). In addition, there were two questions designed to measure communication
skills and one question that asked students to evaluate the instructor’s overall quality as a
teacher. We also asked students to indicate which discussion group they were in and to provide
basic demographic and academic background information including gender, age, year in
school, and number of credit hours currently being taken. All students fully completed the
evaluation, leaving us with no missing data.
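As a rough sketch of how five-point Likert responses like these can be scored, the Python below averages each respondent's items into a single index and computes Cronbach's alpha, a common internal-consistency statistic. The responses are invented placeholders (four respondents, three items), the function name `cronbach_alpha` is hypothetical, and the authors' own analysis was run in Stata, so this is an illustration of the technique rather than their procedure.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [
        variance([respondent[i] for respondent in responses])
        for i in range(n_items)
    ]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(respondent) for respondent in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers (1 = Strongly disagree ... 5 = Strongly agree);
# rows are respondents, columns are items. NOT the study's data.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]

# One index score per respondent: the mean of that respondent's item scores.
index_scores = [mean(respondent) for respondent in responses]
alpha = cronbach_alpha(responses)
print(f"index scores: {[round(s, 2) for s in index_scores]}")
print(f"Cronbach's alpha: {alpha:.3f}")
```

Alpha approaches 1 when respondents who rate one item highly tend to rate the others highly as well, which is the usual justification for combining items into a single index.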
We performed all analyses with the 13th version of the Stata statistical analysis program. We
used exploratory factor analysis to test how well the separate questions reflected a common
underlying dimension. Principal component factor analysis revealed that 12 of our items
characterized a single factor for which the individual factor loadings ranged from .7370 to
.9489, sufficiently high to justify merging them into a single index (Hair, Anderson, Tatham, &
Black, 1998). This indicates that those 12 questions on our survey were all measuring the same
latent variable, which we interpret to be a general evaluation of the instructor’s teaching. A
reliability test yielded a Cronbach’s alpha above .950 for the 12 questions. In order to confirm
the factor structure, we used structural equation modeling to test a single latent variable
indicated by our 12 separate questions. Our model was a strong fit to the data (N = 43, χ²(47)
= 59.18 (not significant), RMSEA = 0.078, CFI = 0.980, SRMR = 0.043) with all loadings
significant at the p