Introverted Elves & Conscientious Gnomes:
The Expression of Personality in World of Warcraft
Nick Yee1, Nicolas Ducheneaut1, Les Nelson1, Peter Likarish2
1 Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA
[nyee, nicolas, lnelson]@parc.com
ABSTRACT
Personality inference can be used for dynamic
personalization of content or system customization. In this
study, we examined whether and how personality is
expressed in Virtual Worlds (VWs). Survey data from
1,040 World of Warcraft players containing demographic
and personality variables was paired with their VW
behavioral metrics over a four-month period. Many
behavioral cues in VWs were found to be related to
personality. For example, Extraverts prefer group activities
over solo activities. We also found that these behavioral
indicators can be used to infer a player’s personality.
Author Keywords
Virtual worlds, online games, personality, Big 5, inference.
ACM Classification Keywords
H5.m. Information interfaces and presentation (e.g., HCI):
Miscellaneous.
General Terms
Human Factors
INTRODUCTION
Games can be character revealing. The father of one of the
authors once noted that he enjoys playing golf with his
business partners because it lets him see which of them
cheats on the golf course. The underlying implication, of
course, is that how someone behaves on a golf course says
something about how they may behave during a business
transaction. And online gamers who have developed
romantic relationships in virtual worlds [34] often say
something similar:
“The game WAS the reason we fell in love. Going through
all the adventures and quests together really built our
relationship. We found out how the other person is when
they are mad, tired, sad, happy, excited, annoyed, etc.”
[City of Villains, Female, Age 25]
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
CHI 2011, May 7–12, 2011, Vancouver, BC, Canada.
Copyright 2011 ACM 978-1-4503-0267-8/11/05....$10.00.
2 University of Iowa
Iowa City, IA
peter-likarish@uiowa.edu
The unique affordances of virtual worlds offer an
unparalleled platform for examining the intersections
between personality and behaviors in virtual environments.
On the other hand, unlike personality expression in physical
settings, online games allow, or even encourage, users to
behave in a manner inconsistent with their everyday
identities. Thus, in this study, we ask:
Is it true that a person’s personality can be inferred
from how they behave in a virtual world?
And if so, what specific virtual cues are highly
indicative of a person’s introversion or
conscientiousness (for example)?
Being able to infer a user’s personality from online cues has
direct relevance to HCI research, given the field’s longstanding interest in interface personalization and system
customization [e.g., 16, 24]. Indeed, knowing more about a
user’s personality could help design systems more
responsive to users’ needs in areas as diverse as e-commerce, social software, and recommender systems, to
name a few.
In this paper, we use data from the widely popular
massively multiplayer online game (MMOG), World of
Warcraft (WoW), to answer the two questions above. We
then use our results to discuss how personality data could
be used in the design of future online systems, being
mindful of some important limitations and potential pitfalls
also suggested by our research.
The Expression of Personality
Studies in personality psychology have repeatedly shown
that judgments of personality at zero acquaintance (i.e., by
strangers) are moderately accurate. More importantly, the
specific cues used to infer different personality traits are
consensually shared. In other words, personality is readily
expressed in specific cues in everyday life. This has been
shown to be true for brief face-to-face encounters [9, 15].
For example, in an earlier study involving video-taped face-to-face conversations [9], Extraverted individuals spoke
louder, with more enthusiasm and energy, and were more
expressive with gestures.
Other studies have researched personality inference by
examining an individual’s bedroom or office [12], or their
music collection [23]. For example, in the study of personal
spaces, Conscientious individuals had well-lit, neat, and
well-organized bedrooms. And individuals who scored high
on Openness to Experience had more varied books and
magazines.
This line of research has also extended to computer-mediated communication (CMC). In particular, studies have
shown that moderately accurate personality impressions can
be formed based on an individual’s personal website [18,
28], Facebook profile [3], email content [10], blog content
[32], and even an individual's email address—the thinnest
slice of CMC possible [2]. For example, in terms of
linguistic output on blogs, Agreeable individuals were more
likely to use the first person singular, words related to
family, and words related to positive emotions (e.g., happy,
joy). Conscientious individuals were more likely to use
words related to achievement.
These studies illustrate that we leave behind personality
traces in both the physical and digital spaces that we
inhabit. Given that the average online gamer spends over 20
hours a week in a virtual world [31, 33], it is not difficult to
imagine that some amount of personality traces could be
gleaned from logs of their virtual interactions as well.
Limits to Personality Expression?
On the other hand, there are also reasons to believe that
personality may not be readily expressed in virtual worlds.
First of all, previous studies have largely focused on
personality expression in everyday settings or linguistic
output online. It is unclear how or whether personality is
expressed via non-human bodies doing non-human things
in a fantasy world (e.g., gnomish priests resurrecting the
dead with magical light rays).
Related to this point, some scholars, like Turkle, have
suggested that VWs allow us to constantly reinvent
ourselves [27]. If the strongest interpretation of this notion
were true, it would imply that there might be a clean break
between personality and how a person behaves in a VW. In
other words, people could express or reinvent themselves
idiosyncratically in VWs, such that shared cues of
personality would not exist. And finally, there is also
evidence that users do alter their behaviors in online games.
For example, studies have shown that role-players tend to
be more imaginative and thus willing to experiment with
their online personas [5, 26]. And studies of online dating
[13] and online gaming [4] have shown that users in both
settings tend to idealize their online personas to some
degree. In particular, some studies have revealed that
tendencies to idealize self-representation online are
moderated by poor self-esteem [4, 7]. Thus, identity
experimentation and individual variations in that
experimentation may suppress stable personality expression
cues in virtual worlds.
The Collection of Personality Data
Previous studies of personality expression have tended to
rely on linguistic output or behavioral traces. These traces
are often artifacts of behaviors over time. For example, a
person who is low on Conscientiousness may often forget
to water their plant. A withered plant in a bedroom is an
example of a behavioral trace. Of course, as some
researchers have suggested [20], we should also study
actual behaviors as they occur. These researchers argue that
observations of individuals in their natural settings and
"humdrum lives" (pg. 862) may yield a better understanding
of the link between personality and behavior.
The problem is that the recording of behaviors in natural
settings and the subsequent coding are daunting tasks using
traditional tools. Shadowing and video recording
individuals is a laborious method that significantly
constrains sample size. Recent technology has begun to
offset the daunting nature of behavioral data collection,
however. For example, in a study of how personality is
related to everyday linguistic output [20], researchers used
an electronically activated recorder which was programmed
to record a participant’s acoustic space for 30 seconds every
12 minutes. A dictionary-based software tool was then used
to generate quantitative linguistic metrics of these
recordings.
Behavioral Data Collection in Virtual Worlds
Virtual Worlds (VWs) offer unique affordances for
studying the link between personality and behavior. For the
purposes of this paper, we define VWs as graphical
environments that enable geographically-distant individuals
to interact via avatars (i.e., digital representations of users).
It is also important to note that VWs are no longer
academic prototypes or niche cultures, but have become
mainstream interaction platforms. For example, WoW has
over 11 million active monthly subscribers [30], and the
Facebook game FarmVille has over 80 million active
monthly users.
VWs offer three unique features in terms of collection of
natural behavioral data. First, unlike the physical world
where it would be unfeasible to follow everyone around
with video cameras, VWs come inherently instrumented.
The computer systems running the VWs already track the
movement and behavior of every avatar to make
interactions possible (e.g., orienting avatars so that they can
look at each other). Second, these high-precision sensors
operate at all times. Thus, it is possible to generate not only
snapshot data, but longitudinal behavior profiles for every
user in a particular VW (e.g., see [8]). And finally, all these
observations can be performed unobtrusively, thereby
significantly reducing the observer effect [29]—participants
cannot react to the camera if the camera is invisible.
Indeed, a recent study has illustrated that there are
connections between personality and virtual behaviors in
the VW Second Life [35]. In that study, 76 students were
asked to participate in Second Life for six weeks while
“wearing” a scripted virtual tracking device that captured
some of their behaviors and linguistic output. The findings
revealed some interesting correlations. For example, high
Conscientiousness was positively correlated with
geographical movement.
There were several weaknesses in that study, however. First
and foremost, it is difficult to capture natural behavior by
assigning users to participate in a VW not of their own
choosing. Being able to observe actual users would likely
yield more reliable data. Second, only data from one VW
was collected. Given that much of SL resembles suburban
America [1], it would be helpful to gather data from
additional VWs (and in particular fantasy-based online
games) to see if the results generalize. Third, participants in
that study were only asked to spend six hours each week in
Second Life. On the other hand, we know that players of
other VWs spend on average 20 hours a week (without
being asked to) in games like WoW [33]. In other words,
the participant sample may not be representative of VW
users in either demographics or usage patterns. And finally,
many of the correlations found in that study did not align
with trait definitions of the personality variables used—e.g.,
virtual behaviors that correlated with Agreeableness were
not related in obvious ways to Agreeableness. Thus, a
replication in a different VW with existing users may help
clarify whether the results are an artifact of the nature of
Second Life or how people behave in VWs in general.
Research Questions
We focus on two research questions in this paper. While
previous studies have examined personality expression in
everyday settings, we were interested in examining whether
and how personality is expressed in online games. To
clarify and expand upon previous findings of personality
correlates in VWs, we focus on the online game World of
Warcraft in this paper with a sample of active players. Our
first research question is thus:
RQ1. What are the behavioral correlates of personality in an
online game?
If indeed personality is expressed in consistent cues in
VWs, a pertinent question is whether these cues can be used
specifically for personality inference. Thus, our second
research question is:
RQ2. How well can we infer someone’s personality from
only observing their virtual behaviors?
METHOD
Given our focus on the online game WoW, we will begin
by first briefly describing the game context to lay the
foundation for understanding the variables we use as
behavioral indicators.
World of Warcraft
WoW is currently one of the more popular online games
available commercially [30]. Unlike Second Life (SL) where
users create most of the in-world content (including
buildings, clothing, hair styles, and avatar bodies), content in
WoW is almost entirely created and designed by the
company running the game. And unlike the open sandbox
nature of SL, WoW uses a typical “leveling up” formula seen
in computer role-playing games. Specifically, players start at
level 1 and kill monsters to become higher level and acquire
better weapons and armor in order to kill bigger monsters.
Along the way, the game encourages players in different
ways to collaborate with other players. Users can also create
characters with different skill sets that complement each
other. For example, heavily-armored tank classes shield the
group from enemy attacks while lightly-armored damage
dealing DPS (damage per second) classes deal damage to
enemies and healing classes restore health lost in combat. In
short, WoW is a collaborative virtual environment [22].
WoW draws from an established lore from the Warcraft
franchise. Briefly, players must choose to belong to one of
two primary factions—the Alliance or the Horde. Each
faction has five distinct races, e.g., Night Elves or Trolls. A
variety of rules dictate where and when players may attack
and kill each other. Thus, a distinction is made between PvP
(player-vs-player) activities and PvE (player-vs-environment)
activities. PvP activities can range from one-to-one duels to
large 40 vs. 40 battlegrounds (BGs). And in general, it is a
player’s choice as to how much PvP activity they want to
engage in.
Players in WoW communicate via typed chat and might also
use VoIP tools to communicate via speech. The game also
provides a modest set of emotes (e.g., /hug). Players are also
able to specialize in crafting professions and convert
collected raw ingredients into finished goods, such as in
tailoring or cooking.
There is also a system of Achievements that keeps track of a
wide variety of combat and non-combat based objectives.
There are Achievements for zones explored, for dungeons
completed, for number of hugs given, and for cooking
proficiency. These Achievement scores provide a good sense
of how a player chooses to spend their time in WoW.
Thus, overall, WoW offers a wide and varied set of rich
behavioral cues to draw from. From class choice to amount
of PvP activity, from number of emotes used to amount of
world exploration, the game context offers a range of
measurable behaviors. This is also a point of differentiation
from SL. Due to the open nature of SL, most higher-level
conceptual behaviors are not defined in the environment and
it is up to individual users to define their creations. Thus,
there is no overarching set of metrics beyond fairly low level
behaviors, whereas in WoW, the game keeps track of many
behaviors and activities using a standardized lexicon.
The World of Warcraft Armory
Indeed, the standardized lexicon and data format inherently
lend themselves to automated data collection. Blizzard, the
developer of WoW, is unique in that they have provided
public access to much of their internally-collected data at a
website known as the Armory. In short, by searching for a
character’s name, anyone can view details about their past
activities, including how many hugs they have given, the
quality of their equipment, the class they prefer to play, etc.
More importantly, these metrics have been tracked since the
character was first created. With a few clicks, we can gather a
character profile that has cumulative data over many months
of game play. It bears emphasizing the tremendous social
science research opportunities that are made possible by this
publicly-available database of longitudinal behavioral
metrics. It is from the Armory that we gathered the
behavioral metrics for this study.
Participants
A total of 1,040 WoW players were included in the study. We recruited
participants from forums dedicated to WoW, publicity on
popular gaming sites (e.g., WoW.com), word-of-mouth on
social media like Twitter, and mailing lists from previous
studies of online gamers. We note that due to human subjects
regulations, minors were excluded from participating in the
study. Nevertheless, we were still able to gather data from a
very wide age range (18-65). The average age of our sample
was 27.03 (SD = 8.21). 26% of participants were women.
Procedure
Participants began by completing a web-based survey that
gathered their demographic and personality information.
Participants were also asked to list up to 6 WoW characters
they were actively playing. Once these characters were in our
database, an automated data collection system was activated.
The system launches a web scraper that gathers character
profiles (large XML files) from the WoW Armory. The
Armory updates itself once per day (in the early morning) if a
character has been active the previous day. Thus, our script
follows this schedule with a daily interval and collects any
updated profiles. For the results, we analyzed data from a
contiguous 4-month period in the spring and summer of
2010.
Personality Measures
In personality psychology, the Big-5 model is the gold
standard. The model measures five traits: Extraversion,
Agreeableness, Conscientiousness, Emotional Stability, and
Openness to Experience.
For comparability, we also used an inventory that measured
these 5 factors. A 20-item scale measuring the Big-Five
Factor structure was drawn from the International Personality
Item Pool [11]. Participants rated themselves on the
inventory items using a scale that ranged from 1 (Very
Inaccurate) to 5 (Very Accurate).
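As an illustration of how an inventory of this kind is commonly scored, the sketch below averages the 1 to 5 ratings for each factor and reverse-scores negatively keyed items. It assumes four items per factor (20 items over 5 factors). The item-to-factor mapping and the choice of reverse-keyed items in `KEY` are hypothetical placeholders, not the actual IPIP scoring key.

```python
from statistics import mean

FACTORS = ["E", "A", "C", "ES", "O"]

# Hypothetical scoring key (an assumption for illustration only):
# item i belongs to factor FACTORS[i % 5]; items 10-19 are reverse keyed.
KEY = {i: (FACTORS[i % 5], i >= 10) for i in range(20)}


def score_big5(ratings):
    """Return the mean 1-5 score per factor from 20 item ratings."""
    if len(ratings) != 20 or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("expected 20 ratings between 1 and 5")
    by_factor = {f: [] for f in FACTORS}
    for i, rating in enumerate(ratings):
        factor, reverse_keyed = KEY[i]
        # On a 1-5 scale, reverse scoring maps 1<->5 and 2<->4.
        by_factor[factor].append(6 - rating if reverse_keyed else rating)
    return {f: mean(values) for f, values in by_factor.items()}
```

Under this placeholder key, a participant who answers 3 on every item receives a score of 3 on each factor.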
Behavioral Measures in WoW
There are two main complexities we encountered when
dealing with the Armory data. First, Armory profiles consist
of hundreds of variables, oftentimes in a hierarchy. For
example, there is a system of Achievements in WoW that
tracks progress in a variety of defined goals, such as
Exploration Achievements and Dungeon Achievements.
Under Exploration Achievements, there is a category for
each continent. Under each continent, there is a listing for
each zone. To avoid being inundated by low-level variables
or including overlapping variables, we adopted an analytic
strategy of looking at or generating high level variables
where possible. This in turn produces more stable variables
that map to psychologically meaningful concepts. For
example, a notion of geographical exploration would seem to
be better tracked by the overall count of zones explored
rather than looking at any one particular zone.
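A minimal sketch of this roll-up, assuming the exploration hierarchy has already been parsed into a nested mapping of continent to zone to explored flag. The structure and the sample data are made up for illustration and are not the Armory's actual format.

```python
def count_explored_zones(exploration):
    """Collapse the continent -> zone hierarchy into one overall count."""
    return sum(
        1
        for zones in exploration.values()
        for explored in zones.values()
        if explored
    )


# Made-up example profile: continent -> zone -> explored?
profile = {
    "Eastern Kingdoms": {"Elwynn Forest": True, "Westfall": True, "Duskwood": False},
    "Kalimdor": {"Durotar": True, "The Barrens": False},
}

print(count_explored_zones(profile))  # 3
```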
A second complexity is that most players have multiple
active characters at the same time and it is not at first clear
how to combine metrics across characters to derive
participant-level aggregates. For example, a level 80
character can do much more damage than a level 60 character
(and the function is non-linear). Thus, there is no way to
easily combine damage done across characters. While these
metrics needed to be normalized, there wasn’t one single
variable they could all be normalized against.
We therefore adopted the following normalization and
variable generation strategies:
1) Static character attributes were normalized against total
number of characters. E.g., ratio of male characters = male
characters / total characters.
2) Variable character attributes were normalized against
overall time played. E.g., for combat roles, we calculated
how often each character was a tank/healer/DPS, and then
calculated a participant-level ratio for each of those roles. A
0.24 tank ratio meant that across all of a participant’s
characters, they spent 0.24 of their total playing time as a
tank.
3) Metrics that could be normalized against another variable
were normalized accordingly. E.g., the score of Exploration
Achievements could be divided by the score of All
Achievements to generate an Exploration ratio. This thus
filtered out the raw difference between someone with many
and someone with few achievements, and focused instead on
how they focus their game-play.
4) For metrics that could not be normalized and were highly
dependent on character level, we extracted the maximum.
E.g., it is very different having one character that has 80
vanity pets compared with having 4 characters with 20 vanity
pets each. In these cases, we found the maximum number of
vanity pets across a participant’s characters.
5) For metrics that could not be normalized and were not
dependent on character level, we calculated the sum. E.g.,
any level character can emote /hug as often as they’d like. In
these cases, we summed up the count of hugs across all of
their characters.
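The five strategies above can be sketched as follows. This is an illustrative reading rather than the code used in the study: the `Character` fields are hypothetical names, and the tank ratio assumes that role time is pooled across characters before dividing by total time played.

```python
from dataclasses import dataclass


@dataclass
class Character:
    # Hypothetical per-character fields, named for illustration only.
    is_male: bool
    hours_played: float
    tank_hours: float
    exploration_score: int
    achievement_score: int
    vanity_pets: int
    hugs: int


def participant_metrics(chars):
    """Aggregate character-level data into participant-level variables."""
    if not chars:
        raise ValueError("participant has no characters")
    total_hours = sum(c.hours_played for c in chars)
    total_ach = sum(c.achievement_score for c in chars)
    return {
        # 1) Static attribute, normalized against number of characters.
        "male_ratio": sum(c.is_male for c in chars) / len(chars),
        # 2) Variable attribute, normalized against overall time played.
        "tank_ratio": sum(c.tank_hours for c in chars) / total_hours if total_hours else 0.0,
        # 3) Metric normalized against another metric (based on sums).
        "exploration_ratio": sum(c.exploration_score for c in chars) / total_ach if total_ach else 0.0,
        # 4) Level-dependent metric that cannot be normalized: take the maximum.
        "max_vanity_pets": max(c.vanity_pets for c in chars),
        # 5) Level-independent metric that cannot be normalized: take the sum.
        "sum_hugs": sum(c.hugs for c in chars),
    }
```

For example, a participant with one character played 76 hours (24 of them as a tank) and a second character played 24 hours with no tanking would receive a tank ratio of 0.24.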
It is important to mention that we are not claiming to have
extracted all possible variables for analysis in this paper, but
rather, that we have extracted a meaningful and manageable
subset of higher-level variables that covers a wide range of
behaviors in WoW. A description of each derived variable,
along with its mean and standard deviation, is presented in
Table 1 below. Note that we excluded outliers that were more
than 2 standard deviations away from the mean when
deriving these metrics. For brevity we will only describe
high-level trends in the text, but for ease of reference, we will
include the table row index in round brackets after each
mentioned correlate.
# | Variable | Description | M (SD) | E | A | C | ES | O
1 | Ratio of Alliance Characters | = Alliance Chars / Total Chars | 0.53 (0.47) | 0.00 | 0.05 | 0.07 | 0.04 | 0.02
2 | Ratio of Opposite Gender Characters | = Opposite Gender Chars / Total Chars | 0.27 (0.36) | -0.07 | -0.14 | -0.03 | 0.07 | 0.00
3 | Total Character Count | Count of all active characters reported by participant | 2.79 (1.51) | -0.12 | 0.03 | 0.07 | 0.02 | 0.10
4 | Number of Days Played Since Start of Study | Count of unique active days since start of study | 65.47 (34.89) | -0.04 | 0.00 | 0.01 | -0.03 | -0.01
5 | Total Realm Count | Count of realms participant has active characters on | 1.11 (0.31) | -0.05 | 0.06 | 0.01 | -0.03 | 0.09
6 | Max of Guild Changes | Highest number of guild change events | .78 (1.05) | 0.07 | -0.01 | 0.00 | -0.05 | 0.03
7 | Sum of Kills | Includes both kills against computer monsters and other players | 162353.84 (108633.20) | -0.03 | -0.03 | 0.00 | -0.00 | -0.07
8 | Sum of Kills in BGs | Number of kills in battlegrounds | 2705.70 (3589.28) | -0.01 | -0.06 | 0.00 | 0.04 | 0.00
9 | Sum of PvP Kills | Number of all PvP-related kills | 10437.22 (12026.80) | -0.04 | -0.08 | 0.05 | 0.09 | -0.05
10 | Sum of Deaths | Total number of deaths from any cause | 1849.12 (1440.63) | 0.05 | -0.07 | 0.00 | 0.05 | -0.04
11 | Sum of Deaths in Raid Dungeons | Number of deaths in dungeons | 1018.84 (899.94) | 0.06 | -0.07 | 0.01 | 0.02 | -0.08
12 | Sum of Deaths from Falling | Number of deaths from falling from high places | 32.64 (69.10) | 0.02 | -0.05 | -0.07 | -0.02 | 0.00
13 | Sum of Hugs | Number of /hug emote | 38.57 (69.10) | -0.02 | 0.11 | 0.10 | -0.03 | 0.09
14 | Sum of LOLs | Number of /lol emote | 63.73 (147.57) | 0.01 | 0.01 | -0.03 | -0.02 | 0.05
15 | Sum of Cheers | Number of /cheer emote | 47.05 (90.40) | -0.09 | 0.13 | 0.07 | 0.04 | 0.13
16 | Sum of Waves | Number of /wave emote | 79.77 (140.21) | -0.06 | 0.10 | 0.09 | 0.08 | 0.14
17 | Max Number of Mounts | Mounts increase travel speed and are both functional and collectible | 32.08 (29.62) | -0.05 | 0.03 | 0.05 | -0.02 | 0.01
18 | Max Number of Vanity Pets | Vanity pets are small non-functional and largely decorative companions | 39.45 (31.80) | -0.07 | 0.07 | 0.08 | -0.05 | 0.07
19 | Ratio of Need Rolls | = Need Rolls / Total Rolls | 0.17 (0.11) | 0.10 | -0.14 | -0.08 | -0.06 | -0.09
20 | Max Equipment Score | Sum of all equipment item levels | 3867.90 (813.20) | 0.02 | -0.10 | -0.04 | 0.01 | -0.06
21 | Sum of Count of Respecs | Number of times player has changed skill specializations | 27.02 (28.06) | 0.03 | -0.09 | -0.02 | 0.03 | -0.05
22 | Max of Achievement Score | Total Achievement score | 413.06 (195.44) | -0.01 | -0.04 | 0.02 | 0.00 | -0.03
23 | Ratio of Quest Achievements | = Quest Achs / Total Achs (based on Sums) | .07 (.02) | -0.10 | 0.07 | 0.02 | 0.01 | -0.01
24 | Ratio of Exploration Achievements | = Exploration Achs / Total Achs (based on Sums) | .10 (.05) | -0.04 | 0.09 | 0.06 | 0.02 | 0.13
25 | Ratio of PvP Achievements | = PvP Achs / Total Achs (based on Sums) | .10 (.05) | 0.00 | -0.12 | -0.03 | 0.07 | -0.01
26 | Ratio of Dungeon Achievements | = Dungeons Achs / Total Achs (based on Sums) | .36 (.12) | 0.12 | -0.17 | -0.12 | 0.01 | -0.17
27 | Ratio of Profession Achievements | = Profession Achs / Total Achs (based on Sums) | .10 (.06) | -0.04 | 0.13 | 0.07 | -0.02 | 0.12
28 | Ratio of Reputation Achievements | = Reputation Achs / Total Achs (based on Sums) | .03 (.01) | -0.03 | -0.02 | -0.03 | -0.01 | -0.12
29 | Ratio of World Event Achievements | = World Achs / Total Achs (based on Sums) | .13 (.07) | -0.08 | 0.16 | 0.10 | -0.04 | 0.11
30 | Max of Cooking Achievements | Highest cooking score | 6.33 (4.85) | -0.07 | 0.07 | 0.07 | -0.01 | 0.05
31 | Max of Fishing Achievements | Highest fishing score | 7.26 (6.02) | -0.06 | 0.07 | 0.07 | 0.01 | 0.05
32 | Sum of End Game 10-man Raids Done | Total 10-man end-game raids completed | 16.78 (17.83) | 0.06 | -0.11 | -0.05 | 0.00 | -0.13
33 | Sum of End Game 25-man Raids Done | Total 25-man end-game raids completed | 18.13 (22.99) | 0.08 | -0.09 | -0.05 | 0.00 | -0.12
34 | Ratio of Healing Done | = Healing Done / Damage Done (based on Sums) | .32 (.46) | 0.00 | 0.00 | -0.03 | -0.02 | 0.01
35 | Sum of Arenas Played | Number of Arenas entered | 55.57 (155.31) | -0.01 | -0.09 | 0.01 | 0.06 | 0.01
36 | Sum of BGs Played | Number of BGs entered | 98.36 (147.11) | -0.07 | -0.07 | 0.02 | 0.05 | 0.04
37 | Sum of Duels Played | Number of Duels entered | 52.80 (94.73) | 0.11 | -0.07 | -0.04 | -0.05 | -0.03
38 | Ratio of Arena Wins | = Arena Wins / Arenas Entered | .33 (.18) | -0.10 | -0.12 | 0.03 | 0.08 | -0.01
39 | Ratio of BG Wins | = BG Wins / BGs Entered | .48 (.18) | -0.02 | -0.06 | -0.01 | -0.01 | 0.01
40 | Ratio of Duel Wins | = Duel Wins / Duels Entered | .46 (.21) | -0.06 | 0.02 | 0.07 | 0.04 | -0.01
41 | Sum of Flight Paths Taken | Flight paths are used to fly from one fixed location to another | 1424.42 (1117.06) | -0.08 | 0.07 | 0.05 | -0.02 | -0.01
42 | Sum of Hearths | Hearthstones allow a character to teleport to a pre-determined location | 454.08 (310.49) | 0.00 | 0.02 | -0.01 | -0.03 | -0.03
43 | Ratio of Melee DPS Role | Ratio of time spent in hand-to-hand DPS role (e.g., fury warriors, rogues) | .30 (.30) | -0.08 | -0.01 | 0.03 | 0.02 | 0.05
44 | Ratio of Ranged DPS Role | Ratio of time spent in ranged DPS role (e.g., hunters, mages) | .38 (.32) | 0.06 | 0.05 | 0.01 | -0.08 | 0.04
45 | Ratio of Healing Role | Ratio of time spent in healing role (e.g., holy priests, restoration druids) | .20 (.24) | 0.04 | -0.05 | -0.05 | -0.01 | -0.05
46 | Ratio of Tank Role | Ratio of time spent in tanking role (e.g., protection warrior, protection paladin) | .13 (.20) | -0.01 | -0.01 | -0.01 | 0.11 | -0.07
Table 1. Means, standard deviations, and correlation coefficients of VW behavioral measures. Correlation coefficients
in bold are p < .05.
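The outlier rule stated before Table 1 (values more than 2 standard deviations from the mean were excluded) can be sketched as below. This is a minimal single-pass illustration; whether the study used the sample or the population standard deviation is not stated, so the use of the sample standard deviation here is an assumption.

```python
from statistics import mean, stdev


def drop_outliers(values, k=2.0):
    """Keep only values within k standard deviations of the mean."""
    if len(values) < 2:
        return list(values)
    m = mean(values)
    sd = stdev(values)  # sample SD; an assumption, see the note above
    return [v for v in values if abs(v - m) <= k * sd]


# Made-up emote counts with one extreme value.
hugs = [10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
print(drop_outliers(hugs))  # the value 100 is removed
```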
RESULTS
To analyze how personality is expressed in VWs, we
examined the correlations between the virtual behaviors and
the personality factors. Given the increased risk of
experiment-wise error in large correlation tables with 46
variables against the five personality factors, we used an
analytic method developed by Sherman and Funder [25] to
address this specific issue. The method employs a Monte
Carlo simulation of repeatedly randomized data within each
participant. Thus, the method preserves the statistical
properties of the data gathered. The method creates 1,000
instances of these randomized data sets and tabulates the
number of observed significant correlations (at alpha of
.05). The probability of the actual number of significant
correlations is then calculated based on where it lies on the
distribution of these 1,000 randomizations. In other words,
this technique answers whether we found a significantly
higher number of significant correlations in our data set
than would be expected by chance alone. In our case, using
an alpha of .05, we had 83 observed significant correlations
where only 11.50 would be expected by chance based on
the simulations. According to this Monte Carlo method, the
probability of this number of observed correlations is p <
.001. This provides assurance that the observed
correlations, as a whole, are non-random.
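To make the logic of this check concrete, the sketch below implements a simplified randomization test in the same spirit. It is not the exact Sherman and Funder procedure: it shuffles which participant's trait scores are paired with which behavioral profile, counts the correlations that reach significance in each shuffled data set, and compares the observed count against that distribution. Significance is judged with the Fisher z approximation, which is an assumption made to keep the example dependency-free. Note that 46 variables x 5 traits x .05 = 11.5, which matches the number of significant correlations expected by chance in the simulations reported above.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def is_significant(r, n, z_crit=1.96):
    """Two-sided test at alpha = .05 using the Fisher z approximation."""
    r = max(min(r, 0.999999), -0.999999)
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit


def count_significant(behaviors, traits):
    """behaviors and traits are lists of columns (one value per participant)."""
    n = len(behaviors[0])
    return sum(is_significant(pearson(b, t), n) for b in behaviors for t in traits)


def randomization_test(behaviors, traits, n_sims=1000, seed=0):
    """Return (observed count, mean simulated count, p-value)."""
    rng = random.Random(seed)
    n = len(behaviors[0])
    observed = count_significant(behaviors, traits)
    sim_counts = []
    for _ in range(n_sims):
        order = list(range(n))
        rng.shuffle(order)
        # Apply one permutation to all trait columns so that correlations
        # among traits, and among behaviors, are left intact.
        shuffled = [[t[i] for i in order] for t in traits]
        sim_counts.append(count_significant(behaviors, shuffled))
    expected = sum(sim_counts) / n_sims
    p_value = (1 + sum(c >= observed for c in sim_counts)) / (n_sims + 1)
    return observed, expected, p_value
```

A small p-value indicates that the observed number of significant correlations is larger than what the shuffled data sets typically produce.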
We will now describe each of the Big 5 personality factors
and the virtual behaviors they were correlated with. We will
not discuss every significant correlation, but instead try to
find clusters of correlations that trace out the bigger picture.
Extraversion
According to the trait definition, individuals who score high
on Extraversion tend to be outgoing, gregarious, and
energetic, while those who score low on Extraversion tend
to be reserved, shy, and quiet.
In terms of behavioral indicators in VWs, individuals who
score high on Extraversion tend to prefer group activities.
They have a higher ratio of Dungeon Achievements (26),
which requires collaboration with other players. They have
also completed a higher number of end-game 25-man raid
dungeons (33). Their higher number of guild changes also
implies social promiscuity (6).
On the other hand, players who score low on Extraversion
prefer solo activities, such as questing (23), cooking (30),
and fishing (31). They are also more likely to have more
vanity pets (18), which are silent pet-like companions.
We also see that players who score low on Extraversion
have a preference and higher win ratios for some PvP
activities (36, 37, 38, & 40), but it is less obvious what the
connection is. The same is true for the higher ratio of
opposite gender characters (2) among those who score low
on Extraversion.
Agreeableness
According to the trait definition, individuals who score high
on Agreeableness tend to be friendly, caring, and
cooperative, while those who score low on Agreeableness
tend to be suspicious, antagonistic, and competitive.
In terms of behavioral indicators in VWs, individuals who
score high on Agreeableness give out more positive emotes
(13, 15, 16), i.e., hugs, cheers, and waves, and prefer non-combat activities such as exploration (24), crafting (27),
world events (29), cooking (30), and fishing (31).
On the other hand, players who score low on Agreeableness
prefer the more competitive and antagonistic aspects of
game-play. They enjoy killing other players (8 & 9). They
also have more deaths (10), focus more on getting better
equipment (20), and have engaged in more PvP activities
(25), including Arenas (35), BGs (36), and duels (37). Their
competitive edge also translates to a higher winning ratio in
Arenas (38) and BGs (39).
The negative correlation with ratio of need rolls (19) is also
telling. Valuable equipment drops from monsters are given
to players according to dice rolls. Players select to roll
based on “Need” or “Greed”, of which the former is given
higher priority. We found that players who are low on
Agreeableness often insist on being given higher priority
over others by rolling “Need”. While this is tolerated in
some cases, abusing Need rolls is often seen as anti-social
(there is even a specific epithet used by the community to
describe these players: ninja looters).
Conscientiousness
According to the trait definition, individuals who score high
on Conscientiousness are organized, self-disciplined, and
dutiful, while those who score low on Conscientiousness
are careless, spontaneous, and easy-going.
In terms of behavioral indicators in VWs, individuals who
score high on Conscientiousness seem to enjoy disciplined
collections in non-combat settings. This is reflected in
having a large number of vanity pets (18) which must be
collected one at a time, and having high cooking (30) and
fishing scores (31) which reflect self-discipline in collecting
unique recipes and visiting unique fishing locations (as well
as patiently staying put for significant amounts of time in
these locations, since fishing in the game is surprisingly
close to its real-world equivalent: catches can be few and
far between). The same is true for world event
achievements (29) which often require disciplined
collections of items and visiting a set of locations around
the world.
On the other hand, individuals who score low on
Conscientiousness seem to be more careless and are more
likely to die from falling from high places (12).
Emotional Stability
According to the trait definition, individuals who score high
on Emotional Stability are calm, secure, and confident,
while those who score low on Emotional Stability are
nervous, sensitive, and vulnerable.
While there were significant correlations between
behavioral metrics and this personality trait, these
correlations were more difficult to interpret as a whole.
Individuals who score low on Emotional Stability prefer
PvP related activities, including having a higher PvP
achievement score (25) and higher wins in the Arena (38).
Individuals who score higher on Emotional Stability are
more likely to have characters of the opposite gender (2).
It is worth noting that previous studies have also had
difficulty identifying meaningful behavioral correlates for
Emotional Stability [12, 17], so our findings here may
reflect an overall weaker behavioral expression of this trait.
Openness to Experience
According to the trait definition, individuals who score high
on Openness to Experience are abstract thinkers,
imaginative, and intellectually curious, while those who
score low on Openness to Experience are down-to-earth,
conventional, and traditional.
In terms of behavioral indicators in VWs, we see a cluster
of correlates that reflect exploration and curiosity. For
example, individuals who score higher on Openness have
more characters (3). They also have characters on more
realms (5), i.e., game servers or parallel worlds that each
character resides on. And they spend more of their playtime exploring the world (reflected by the higher
exploration achievement ratio, 24). They also spend more
time participating in non-combat activities, such as crafting
professions (27) and world events (29).
On the other hand, individuals who score low on Openness
prefer the more traditional, combat-oriented aspects of
game-play, spending more time in dungeons and raids (26,
32, & 33).
Personality Inference from Behavioral Metrics
To examine how well personality can be inferred from
virtual behavioral metrics alone, we conducted a series of
multiple regressions on each of the personality factors using
the respective ten highest behavioral correlates. We note
that this method is imperfect and creates a “double-dipping”
concern, but provides a rough sense of how well personality
can be inferred. The results are shown in Table 2.
All of the multiple regressions were significant at p < .05;
four were significant at p < .001. This suggests that virtual
behavioral metrics can be used to provide statistically
significant models of a player’s personality. According to
Cohen [6], an R of .30 is a medium effect size, while an R
of .10 is a small effect size. Thus, many of our regression
models had around medium effect sizes.
Variable     R     R2    Adj. R2  STE   F     p
Extrav.      0.30  0.09  0.07     0.93  4.73  < .001
Agreeable.   0.30  0.09  0.07     0.67  4.67  < .001
Conscient.   0.20  0.04  0.03     0.79  4.86  < .001
Emo. Sta.    0.21  0.04  0.02     0.79  2.13  0.03
Openness     0.26  0.07  0.06     0.75  4.93  < .001
Table 2. Multiple regressions on each of the personality
factors.
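The procedure described above (pick the ten behavioral metrics that correlate most strongly with a trait, then regress the trait on them) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis code: the function names, the metric names, and the generated data are all assumptions, and selecting predictors on the same data used for fitting reproduces the "double-dipping" concern noted in the text.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def top_k_regression(metrics, trait, k=10):
    """Regress `trait` on the k metrics with the largest absolute correlation.

    metrics: dict mapping a (hypothetical) metric name to one value per player
    trait:   one trait score per player
    Returns (selected metric names, R, R^2, adjusted R^2).
    """
    names = sorted(metrics, key=lambda m: abs(pearson_r(metrics[m], trait)),
                   reverse=True)[:k]
    n = len(trait)
    rows = [[1.0] + [metrics[m][i] for m in names] for i in range(n)]  # intercept first
    p = len(names) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, trait)) for i in range(p)]
    beta = solve(xtx, xty)  # ordinary least squares via the normal equations
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(trait) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(trait, pred))
    ss_tot = sum((y - mean_y) ** 2 for y in trait)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - len(names) - 1)
    return names, math.sqrt(max(r2, 0.0)), r2, adj_r2


# Synthetic example: 15 made-up metrics, only one of which relates to the trait.
random.seed(0)
n_players = 300
metrics = {f"metric_{j}": [random.gauss(0, 1) for _ in range(n_players)]
           for j in range(15)}
trait = [0.3 * metrics["metric_0"][i] + random.gauss(0, 1) for i in range(n_players)]
names, r, r2, adj_r2 = top_k_regression(metrics, trait, k=10)
print(names[0], round(r, 2), round(r2, 2), round(adj_r2, 2))
```

Because the adjusted R2 penalizes the number of predictors, it is never larger than R2; comparing the two gives a rough sense of how much of the fit comes from having ten predictors rather than from the predictors themselves.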
DISCUSSION
The availability of fine-grained virtual behavioral metrics in
the WoW Armory allowed us to gather longitudinal profiles
of actual VW users. While studies in the past have
examined links between personality and linguistic output
online (in emails or blogs), our study is the first to examine
the links between personality and virtual behavior in an
online game. Our findings reveal that our personalities are
expressed in VWs via consistent cues, and that most of
these cues reflect trait definitions of standard personality
factors. For example, players who score high on
Extraversion prefer group-oriented activities. And players
who score high on Agreeableness use more positive emotes
and prefer non-combat activities. More importantly, our
multiple regressions reveal that behavioral cues in VWs can
be used to infer an individual’s personality. These findings
suggest that while some degree of identity experimentation
is occurring in virtual worlds, basic personality is still being
readily expressed.
While an earlier study of personality expression in VWs
[35] had trouble finding trait-aligned behavioral correlates
in Second Life, we were able to find much more coherent
behavioral clusters that were consistent with personality
trait definitions in our study. Findings in the earlier study
may have been impacted by participants with no prior
experience with the VW. Also, it bears pointing out that the
WoW Armory allowed us to gather a set of more
conceptually meaningful variables. Due to constraints in the
scripting language and sandbox nature of Second Life, there
is no standardized set of high-level behavioral variables that
are shared. Thus the earlier study relied on lower-level
variables such as distance walked or ratio of time sitting
down, which may be less powerful in capturing personality
expression, as opposed to behaviors such as hugging
someone.
Knowing the specific behavioral correlates for personality
expression in virtual worlds is also important for several
reasons. First, it helps researchers triage the large number
of behavioral variables gathered in future studies, and helps
prioritize where to start looking. Second, it helps
psychologists understand whether certain personality traits
are more easily predicted via behavioral indicators. And
finally, comparing the findings across these studies will
help us understand whether these behavioral correlates are
consistent or idiosyncratic among different virtual worlds.
Implications for CHI
Personalized interfaces and system customization have long
been of interest to the HCI community [16, 24]. It is
reasonable to assume that information needs vary based on
a user’s personality – for instance, extroverts using an
online shopping website might be more interested in other
customers’ reviews, while introverts might prefer seeing
mostly technical data about the product instead. Our paper
points at the possibility of inferring users’ personalities
based on their activity traces (which need not come from
online games) and customizing their experience based on
the results.
Another possibility directly applicable to online games but
also other forms of social software would be to use inferred
personality information to assist in the formation of groups,
perhaps by recommending compatible partners based on the
task to be accomplished. For instance, groups requiring a
diversity of opinions might benefit from the inclusion of a
wide range of personality types [14]. In other contexts, a
more homogeneous mix could be beneficial. And it is worth
pointing out that we are not suggesting an automated
system that would kick some players out of groups because
they are low on Agreeableness. After all, the competitive
nature of these players can be an asset in PvP settings, and
an assertive nature can also be an asset for raid leaders. In a
related fashion, personality data could also be used in
recommender systems: recommendations from other users
with similar personality profiles could be given more or less
weight, depending on the user’s desire for more
homogeneity or diversity in the options they are presented
with [19].
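The recommender idea above, weighting other users' recommendations by how similar their personality profile is, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the similarity measure (an inverse Euclidean distance over Big 5 scores), the "homogeneity"/"diversity" switch, and the example profiles and ratings are hypothetical and do not come from the study or from [19].

```python
import math

TRAITS = ("extraversion", "agreeableness", "conscientiousness",
          "emotional_stability", "openness")


def similarity(a, b):
    """Similarity in (0, 1]: 1 / (1 + Euclidean distance) between two profiles."""
    dist = math.sqrt(sum((a[t] - b[t]) ** 2 for t in TRAITS))
    return 1.0 / (1.0 + dist)


def weighted_score(target, raters, prefer="homogeneity"):
    """Weighted mean rating of one item for the target user.

    raters: list of (profile, rating) pairs from other users.
    prefer: "homogeneity" up-weights raters similar to the target,
            "diversity" up-weights dissimilar raters.
    Returns None if every weight is zero.
    """
    total = weight_sum = 0.0
    for profile, rating in raters:
        s = similarity(target, profile)
        w = s if prefer == "homogeneity" else 1.0 - s
        total += w * rating
        weight_sum += w
    return total / weight_sum if weight_sum else None


# Hypothetical profiles on a 1-5 scale.
target = {t: 3.0 for t in TRAITS}
similar_user = {t: 3.0 for t in TRAITS}      # identical profile
dissimilar_user = {t: 1.0 for t in TRAITS}   # far from the target
raters = [(similar_user, 5.0), (dissimilar_user, 1.0)]

homogeneous = weighted_score(target, raters, prefer="homogeneity")
diverse = weighted_score(target, raters, prefer="diversity")
print(homogeneous, diverse)
```

With the "homogeneity" setting the score is pulled toward the rating given by the similar user, and with "diversity" toward the rating given by the dissimilar one; a real system would need a similarity measure validated against user outcomes.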
Limitations and Future Research Directions
There were several limitations to our study. First, we only
collected data from one VW. It is unclear whether the
behavioral cues we identified generalize to other similar
online games. Moreover, it is difficult to say how our
indicators translate to VWs that do not employ the dragon-slaying role-playing paradigm. Nevertheless, our findings
hint at potential metrics to collect and analyze in future
studies. For example, emotes (for Agreeableness) or
geographical movement (for Openness) have analogous
metrics across many types of VWs.
A related limitation to generalizability is that WoW users
are highly-engaged users who spend on average 20 weeks
producing behaviorally-rich metrics. This usage profile is
likely atypical of normal website or mobile app usage.
Whether the more typical casual engagement with websites
and mobile apps would allow personality inference is
certainly an avenue for future research.
Third, while the correlation coefficients appear to be quite
low (ranging from .06-.17), a similar large-scale study (i.e.,
>500 participants) of linguistic output among bloggers
yielded similar effect sizes [32]. Given the larger variances
in demographics (with an age range of 16-85 in our sample)
and unavoidable noise among natural setting samples, these
smaller effect sizes are probably not surprising in hindsight.
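To put correlations of this size in perspective, the sketch below converts a Pearson r into the share of variance it accounts for and into an approximate significance test. It is an illustration under stated assumptions: the sample size of 1,040 is the survey sample reported for this study, but the number of players with valid data for any given metric may differ, and the p-value uses a normal approximation to the t distribution that is only reasonable for large samples.

```python
import math


def shared_variance(r):
    """Proportion of variance shared by two variables (r squared)."""
    return r * r


def r_to_t(r, n):
    """t statistic for testing a Pearson r against zero with n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def two_sided_p_normal(t):
    """Two-sided p-value from a normal approximation (large n only)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


n = 1040  # assumed number of observations per correlation
for r in (0.06, 0.17):
    t = r_to_t(r, n)
    print(f"r={r:.2f}  shared variance={shared_variance(r):.4f}  "
          f"t={t:.2f}  approx. p={two_sided_p_normal(t):.2g}")
```

The point of the calculation is that statistical significance and practical effect size are separate questions: with a sample of this size, correlations at the upper end of the range are easily distinguishable from zero even though each one accounts for only a few percent of the variance.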
And finally, we relied on the set of variables that Blizzard
shares publicly via the Armory. It is possible that other
unshared variables, such as logged chat, may be even more
predictive of personality. Given the existing work on
linguistic predictors of personality [21, 32], it would be
interesting to be able to directly compare the predictive
power of linguistic and behavioral cues.
Overall, it is important to continue exploring how
personality is expressed across a range of VWs (using a
variety of metrics) to understand how generalizable these
findings are.
Ending Thoughts
VWs provide a novel research platform with unique
affordances and challenges. The automated longitudinal
data collection across a wide range of behaviors is
impossible to mirror using traditional data collection
techniques, and similar techniques could also be used to
study other social phenomena, such as the emergence of
group norms or leadership.
On the other hand, VWs come with unique challenges as
well. Above all, the ability to create tracking systems that
essentially shadow a user wherever they go in a VW raises
privacy concerns. In our study, the consent process spelled
out the data collection scripts to participants, but given that
VWs like WoW are a kind of pseudonymous public space,
data collection studies (without a survey component like
ours) largely fall into the exempt category for human
subjects Institutional Review Board (IRB) review. The gray
area arises due to the fact that the public space of WoW is
unlike any physical public space we know--with
microphones and video cameras that could follow every
user unobtrusively.
This becomes even more complicated when the game
developer makes public what would otherwise be private
data. Such is the case with the WoW Armory. After all,
before the WoW Armory, players could make the case that
they had a reasonable expectation of privacy in WoW (with
regard to IRB review). This expectation is no longer
reasonable with the release of the Armory. In short, VWs
create new research platforms, but at the same time, force
us to address our role as researchers in the face of such
powerful data collection tools.
It is easy to imagine that VWs allow us to become whatever
we want to be, but our findings show that our personalities
remain even when we don virtual bodies. These findings of
personality expression in VWs suggest that our first lives
still play an important role even when we are in Second
Life. And our personalities are readily expressed even when
we are Elves and Gnomes.
ACKNOWLEDGMENTS
This research is sponsored by the Air Force Research
Laboratory.
REFERENCES
1. Au, W. Linden Suburban Home Owners More Likely To
Treat Their Place As Extension of the Real Life Self,
Academic Suggests. New World Notes.
http://nwn.blogs.com/nwn/2010/04/linden-homesstudy.html
2. Back, M., Schmukle, S. and Egloff, B. How extraverted is
honey.bunny77@hotmail.de? Inferring personality from
e-mail addresses. Journal of Research in Personality, 42
(2008), 1116-1122.
3. Back, M., Stopfer, J., Vazire, S., Gaddis, S., Schmukle,
S., Egloff, B. and Gosling, S. Facebook profiles reflect
actual personality not self-idealization. Psychological
Science, 21 (2010), 372-374.
4. Bessiere, K., Seay, A. and Kiesler, S. The Ideal Elf:
Identity Exploration in World of Warcraft.
CyberPsychology and Behavior, 10 (2007), 530-535.
5. Carroll, J. and Carolin, P. Relationship between game
playing and personality. Psychological Reports, 64
(1989), 705-706.
6. Cohen, J. Statistical Power Analysis. Lawrence Erlbaum
Associates, 1988.
7. Ducheneaut, N., Wen, M., Yee, N. and Wadley, G. Body
and mind: a study of avatar personalization in three
virtual worlds. Proceedings of CHI, 1 (2009), 1151-1160.
8. Ducheneaut, N., Yee, N., Nickell, E. and Moore, R. The
life and death of online gaming communities: a look at
guilds in World of Warcraft. CHI 2007 Proceedings
(2007), 839-848.
9. Funder, D. and Sneed, C. Behavioral Manifestations of
Personality: An Ecological Approach to Judgmental
Accuracy. Journal of Personality and Social
Psychology, 64 (1993), 479-490.
10. Gill, A., Oberlander, J. and Austin, E. Rating E-mail
Personality at Zero Acquaintance. Personality and
Individual Differences, 40 (2006), 497-507.
11. Goldberg, L. A broad-bandwidth, public domain,
personality inventory measuring the lower-level facets of
several five-factor models. in Mervielde, I., Deary, I., De
Fruyt, F. and Ostendorf, F. eds. Personality Psychology
in Europe, Tilburg University Press, Tilburg, The
Netherlands, 1999, 7-28.
12. Gosling, S., Ko, S., Mannarelli, T. and Morris, M. A
Room with a cue: Judgments of personality based on
offices and bedrooms. Journal of Personality and Social
Psychology, 82 (2002), 379-398.
13. Hancock, J., Toma, C. and Ellison, N. The truth about
lying in online dating profiles. Proceedings of CHI 2007,
1 (2007), 449-452.
14. Harper, F., Frankowski, D., Drenner, S., Ren, Y.Q.,
Kiesler, S., Terveen, L., Kraut, R. and Riedl, J. Talk
Amongst Yourselves: Inviting Users to Participate in
Online Conversations. IUI 2007 (2007), 62-71.
15. Kenny, D., Horner, C., Kashy, D. and Chu, L.
Consensus at zero acquaintance: Replication, behavioral
cues, and stability. Journal of Personality and Social
Psychology, 62 (1992), 88-97.
16. Mackay, W. Triggers and Barriers to Customizing
Software. Proceedings of SIGCHI 1991, 1 (1991), 153-160.
17. Mairesse, F. and Walker, M. Automatic Recognition of
Personality in Conversation. Proceedings of the Human
Language Technology Conference, 1 (2006), 85-88.
18. Marcus, B., Machilek, F. and Schutz, A. Personality in
Cyberspace: Personal Web Sites as Media for
Personality Expressions and Impressions. Journal of
Personality and Social Psychology, 90 (2006), 1014-1031.
19. McNee, S., Riedl, J. and Konstan, J. Making
Recommendations Better: An Analytic Model for
Human-Recommender Interaction. CHI 2006 (2006),
1103-1108.
20. Mehl, M., Gosling, S. and Pennebaker, J. Personality in
Its Natural Habitat: Manifestations and Implicit Folk
Theories of Personality in Daily Life. Journal of
Personality and Social Psychology, 90 (2006), 862-877.
21. Mehl, M. and Pennebaker, J. The Sounds of Social Life:
A Psychometric Analysis of Students' Daily Social
Environment and Natural Conversations. Journal of
Personality and Social Psychology, 84 (2003), 857-870.
22. Nardi, B. and Harris, J. Strangers and Friends:
Collaborative Play in World of Warcraft. CSCW 2006
(2006), 149-158.
23. Rentfrow, P. and Gosling, S. Message in a Ballad: The
Role of Music Preferences in Interpersonal Perception.
Psychological Science, 17 (2006), 236-242.
24. Riecken, D. Personalized Views of Personalization.
Communications of the ACM, 43 (2000), 26-28.
25. Sherman, R. and Funder, D. Evaluating correlations in
studies of personality and behavior: Beyond the number
of significant findings to be expected by chance. Journal
of Research in Personality, 43 (2009), 1053-1063.
26. Simon, A. Emotional stability pertaining to the game of
Dungeons and Dragons. Psychology in the Schools, 24
(1987), 329-332.
27. Turkle, S. Life on the Screen: Identity in the Age of the
Internet. New York: Simon and Schuster, 1995.
28. Vazire, S. and Gosling, S. e-Perceptions: Personality
Impressions Based on Personal Websites. Journal of
Personality and Social Psychology, 87 (2004), 123-132.
29. Webb, E., Campbell, D., Schwartz, R. and Sechrest, L.
Unobtrusive measures: non-reactive research in the
social sciences. Rand McNally, Chicago, 1966.
30. White, P. MMOGData: Charts. 2009 (2008).
31. Williams, D., Yee, N. and Caplan, S. Who plays, how
much, and why? Debunking the stereotypical gamer
profile. Journal of Computer-Mediated Communication,
13 (2008), 993-1018.
32. Yarkoni, T. Personality in 100,000 Words: A large scale
analysis of personality and word use among bloggers.
Journal of Research in Personality (in press).
33. Yee, N. The demographics, motivations, and derived
experiences of users of massively multi-user online
graphical environments. Presence: Teleoperators and
Virtual Environments, 15 (2006), 309-329.
34. Yee, N. The "Impossible" Romance. The Daedalus
Project.
http://www.nickyee.com/daedalus/archives/001534.php
35. Yee, N., Harris, H., Jabon, M. and Bailenson, J. The
Expression of Personality in Virtual Worlds. Social
Psychological & Personality Science (in press).
Predicting Player Behavior in Tomb Raider: Underworld
Tobias Mahlmann, Anders Drachen, Julian Togelius, Alessandro Canossa and Georgios N. Yannakakis
Abstract—This paper presents the results of an explorative
study on predicting aspects of playing behavior for the major
commercial title Tomb Raider: Underworld (TRU). Various
supervised learning algorithms are trained on a large-scale
set of in-game player behavior data, to predict when a player
will stop playing the TRU game and, if the player completes
the game, how long will it take to do so. Results reveal that
linear regression models and other non-linear classification
techniques perform well on the tasks and that decision tree
learning induces small yet well-performing and informative
trees. Moderate performance is achieved from the prediction
models, which indicates the complexity of predicting player
behavior based on a constrained set of gameplay metrics and
the noise existent in the dataset examined, a generic problem
in large-scale data collection from millions of remote clients.
Keywords: Player modeling, supervised learning, classification, Tomb Raider: Underworld
I. INTRODUCTION
User-oriented testing is a crucial phase of modern game
development with the scope of iteratively enhancing the final
game product that will be published [1], [2], [3]. Usually a
carefully selected set of subjects, representative of the target
audience, as well as professional testers are involved in a
labor-intensive procedure testing the games and evaluating
the quality of the gaming experience [1], [3]. One of the key
components of user-oriented testing, both during production
and after game launch, is to evaluate whether people play the
game as intended and to investigate how gameplay and game
design impact the playing experience [1], [2]. The growing
focus on increasing player affordability in digital games [1]
(freedom, choice) emphasizes the need for the development
of reliable and effective user-testing procedures [2].
Being able to predict certain aspects of gameplay and
playing experiences defines a vital component of the user
testing procedure within game development [1], [3]. Prediction of playing patterns may rely on both qualitative and
quantitative approaches to user testing [2], [4]. This paper
examines the latter. Within the last five years, instrumentation
data derived from player-game interaction — or gameplay
metrics as they are referred to in game development —
has gained increasing attention in the game industry as a
source of detailed information about in-game player behavior
[2], comprising detailed numerical data extracted from the
interaction of the player with the game [5].
Authors are with the Center for Computer Games Research,
IT University of Copenhagen, Rued Langgaards Vej 7, DK-2300
Copenhagen S, Denmark (email: {tmah, drachen, juto, alec,
yannakakis}@itu.dk).
The application of machine learning and data mining
on such data, with datasets often in the terabyte scale,
and the inference of playing patterns from the data [6]
can provide an alternative quantitative approach to user and
playability testing and supplement traditional qualitative
approaches [3]. Notably, the application of gameplay
metrics permits much larger sample sizes to be used, and the
data can potentially be collected outside of the laboratory
environment. Furthermore, game metrics are highly detailed,
permitting tracking and logging of the second-by-second
behavior of players. Understanding patterns of game-playing
behavior, and more specifically gameplay aspects such as
where players encounter problems with progressing through a
game, permits re-engineering of the game design and ensures
the enhancement of playing experience.
In this paper, we explore the possibility of predicting
particular aspects of playing behavior in the commercial
game title Tomb Raider: Underworld1 (TRU) via supervised
learning. In particular we attempt to predict when a player
will stop playing and, if alternatively the player completes
the game, how long will it take the player to do so. The
generated predictors are trained on player metrical data of
the first two levels of the TRU game.
One of the perennial challenges of game design is to
ensure inclusiveness, i.e. that as many different types
or classes of players as possible are accommodated by the design. Being
able to predict when specific classes of players will stop
playing a game is of interest in game development because it
assists with locating problematic aspects of game design, i.e.
features that hinder different classes or types of players from
progressing through specific segments of the game and ultimately completing the game. The ability to predict completion
time for the players who do complete the game is of similar
interest. For example, if a particular type of player completes
a game very fast, there is a risk of disappointment with the
game product. Identifying the different types of completion
strategies and accounting for them in the game design is an
important element ensuring customer satisfaction.
Earlier work on TRU metrics data has focused on
the investigation of dissimilar playing patterns via self-organization in a moderate data set of 1365 players [6].
The experiments presented here are based on a large data
set derived from 10000 players. Data was collected via the
Square Enix Europe (SQE) Metrics Suite. The data collection
process is completely unobtrusive since data was gathered
directly via the Xbox Live!2 service, with subjects playing
TRU in their natural habitat.
1 http://www.tombraider.com
2 http://www.xboxlive.com
Several features that correspond to various key aspects of
playing behavior are extracted from the data, e.g. information about causes of player deaths. The specific features are
selected so that they incorporate knowledge about player
performance. A carefully selected set of various classification
algorithms is employed to 1) predict the number of levels
completed (i.e. the level number class) based on those
features of play and 2) predict the game playing time of
players that completed the TRU game. Our algorithms are
tested in two settings: 1) prediction based on playing
features of level 1 of the game only and 2) prediction based on playing
features of levels 1 and 2. Results showcase the effectiveness
of linear regression techniques as well as nonlinear classification approaches. It also appears that decision tree learning
achieves moderate performance but provides a full degree
of model expressiveness. Moreover, decision trees showcase
that a very small number of playing features is adequate for
achieving a moderate classification accuracy.
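The observation that very small trees over a handful of playing features can already classify players can be illustrated with a decision stump, i.e. a decision tree of depth one. The sketch below is not the learner used in this study; the feature names (level1_time_min, level1_deaths), the toy records, and the class labels are made-up assumptions chosen only to show the mechanics.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels):
    """Fit a depth-one decision tree by exhaustive threshold search.

    rows:   list of dicts mapping a feature name to a numeric value
    labels: one class label per row (e.g. number of levels completed)
    Returns (feature, threshold, label_if_le, label_if_gt, training accuracy),
    or None if no feature takes more than one distinct value.
    """
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[feature] <= thr]
            right = [y for row, y in zip(rows, labels) if row[feature] > thr]
            l_lab, r_lab = majority(left), majority(right)
            acc = (left.count(l_lab) + right.count(r_lab)) / len(labels)
            if best is None or acc > best[4]:
                best = (feature, thr, l_lab, r_lab, acc)
    return best


def predict(stump, row):
    """Predict the class label for one player record."""
    feature, thr, l_lab, r_lab, _ = stump
    return l_lab if row[feature] <= thr else r_lab


# Made-up level-1 features; the label is the number of levels completed.
rows = [
    {"level1_time_min": 35, "level1_deaths": 2},
    {"level1_time_min": 40, "level1_deaths": 3},
    {"level1_time_min": 38, "level1_deaths": 1},
    {"level1_time_min": 90, "level1_deaths": 9},
    {"level1_time_min": 85, "level1_deaths": 4},
    {"level1_time_min": 100, "level1_deaths": 12},
]
labels = [7, 7, 7, 2, 2, 2]

stump = fit_stump(rows, labels)
new_player = {"level1_time_min": 50, "level1_deaths": 5}
print(stump, predict(stump, new_player))
```

A full decision tree learner applies the same split search recursively and usually selects splits by an impurity measure such as information gain rather than raw training accuracy; the appeal in both cases is that the resulting rules can be read directly by a designer.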
The findings directly address the industrial need of automated processes that could assist towards identifying dissimilar playing patterns and predicting forthcoming player actions
and events. The main arguments supporting the commercial
applicability of results include the large-scale training dataset
consisting of 10000 players; the major commercial game
used and the available industrial system for logging the data.
II. GAME METRICS MINING
Viewing the mining of game data as a process towards
player modeling [7], [8] we can identify few studies in the
literature. Quantitative models of players have been built
to assist the learning of basic non-player character (NPC)
behaviors (e.g. moving, shooting) in Quake II [9], [10], [11].
In those studies self-organizing maps [10], Bayesian networks [11] and neural gas [9] approaches are employed for
clustering game-playing samples. Similarly, self-organizing
maps have been used for clustering the trails
(player waypoints) of users playing a simple level exploration
game [12]. Missura and Gärtner [13] investigate the use
of k-means for clustering player data and support vector
machines for predicting dynamic difficulty adjustment in a
simple shooter game; data is derived from a small sample of
17 players.
The vast majority of the aforementioned approaches concentrates on a few specific scenarios (e.g. imitate human
movement in a particular level of a game) while the game
environments investigated are simple test-bed games or simplified versions of commercial games. Moreover, the studies
focus on constructing models or predictors of playing behavior based on small-scale player-data collection experiments
held in laboratories. Doing so questions the scalability of the
obtained performance and leads to the simplification of the
learning task — which in turn acts in favor of the learning
approach.
Game data mining should consider large-scale data sets
(ideally live player data sets) if the study wishes to ensure
that the findings are representative and scalable. The existence of large-scale data, in turn, addresses the need for
efficient and robust algorithms able to classify (or cluster)
data successfully. Thawonmas et al. [14] used game metrics
from the Massively Multiplayer Online Game (MMOG)
Cabal Online to establish patterns of behavior among the
player base, trying to identify aberrant patterns indicative
of computer-controlled agents, i.e. game bots. The approach
followed in that paper is based on simple frequency analysis.
A similar approach was used to visualize the behavior of
online game players in [15]. Ducheneaut and Moore [16]
investigated interaction patterns between players in the Star
Wars Galaxies MMOG utilizing action frequencies to group
player behaviors. Conversely, Chen et al. [17] utilized the
spatial behavior of avatars to establish models of bot and
player behavior. None of the aforementioned studies moves
beyond relatively simple statistical methods.
To the best of our knowledge, the study most closely related to
this research is the work of Weber and Mateas, who mine game
metrical data for the prediction of player strategy in the real-time strategy game Starcraft [18]. Replays from over 5000
expert players were compared using various classification algorithms for recognizing the player’s strategy, and regression
algorithms for the task of predicting when specific unit or
building types will be produced. In [19] non-negative-matrix
factorization is applied to mine 1.6 million images on World
of Warcraft guilds. That study, however, considers online
player appearances rather than live data of playing behavior. Our earlier study utilized self-organization for
the identification of playing behavior clusters of 1365 TRU
players [6].
III. TOMB RAIDER: UNDERWORLD
The Tomb Raider franchise is one of the most established
in the digital games industry. The Tomb Raider games,
a combination of adventure games and 3D platformers,
have been published in different versions on all hardware
platforms, including mobile devices, and the current game
in the series, Tomb Raider: Underworld is the eighth to be
published.
The main protagonist of the games, whom the player
controls and interacts with the game world through, is Lara
Croft. She is designed as a combination between an action
heroine and Indiana Jones, who travels to exotic locations
and enters forgotten tombs and lairs, solving puzzles and
finding ancient treasures at the same time. The Tomb Raider
game environments have been 3D from the beginning, and
Tomb Raider: Underworld (TRU) is no exception. Tomb
Raider: Underworld is a 3D platform game and is played in
third-person perspective. The players are tasked with solving
various navigational puzzles and applying strategic thinking in
their navigational behavior (see Fig. 1).
The player faces different types of danger from the
game environment and computer-controlled agents operating
within it. Falling is an almost continuous risk in the game,
and the player also encounters different types of mobile NPC
enemies. The environment is also a danger, as it is filled
with traps, hazardous substances, fire, etc., which can kill
the player. The game consists of seven game levels plus a
prologue. Each game level is set to a specific theme, for
example Thailand or the Arctic Sea, subdivided into 71 map
units (MU) of varying size.
Fig. 1. A screenshot from Tomb Raider: Underworld level “Thailand”.
Image is copyright of Crystal Dynamics/Square Enix Europe (2009).
Because TRU was the first game that the Metrics Suite
collected data from, there were a number of data cleaning
issues such as the recording of negative values, missing
timestamps, etc., which made the data cleaning process
extensive. The 10000 player sample was also cleaned to
remove e.g. instances where players had completed the game
and then started playing the game again (approximately
1600 players did this). Additionally, instances where the
Metrics Suite had missing data reported for a player from
e.g. a specific game level or map unit or similar missing
intermediate location times (those that were reported as not
having spent any time in one or more locations that are part
of a level they have completed), where removed. Missing
data is discussed further in the last section of this paper.
B. Extracted features
IV. DATA COLLECTION
The gameplay metrics data were obtained from the Square
Enix Europe Metrics (SQE; the former EIDOS) Suite, which
contains data from a range of SQE-produced games. The
SQE Suite is an instrumentation/telemetry system developed
to capture and store game metrics. Gameplay metrics are
normally logged as event-based data, and each metric is associated with a range of descriptives (contextual information)
such as time stamps, user IDs, IP addresses, etc.
An important aspect of the system is that it delivers live
data, i.e. data from people playing these games in their
natural habitats. The data collection is completely unobtrusive, providing detailed, quantitative information about how
users play games free from any effects or bias imposed by
experimental approaches to research [6], [20].
A. Data Preprocessing
The SQE Suite holds data from more than 1.5 million
players of TRU. A sample was drawn covering all data
collected from a two month period (1st Dec 2008 - 31st Jan
2009), providing records from approximately 203000 players
(around 100 GB). The game was launched in November
2008, so the data represent a time period where the game was
recently released to the public. The data were imported into dual
Microsoft SQL Server databases. Such large amounts of data
require substantial computing power to analyze, and a
subsample of 10000 players was therefore extracted for
an initial study. The 10000 players provide a sample large
enough to form the basis for developing analysis methods,
while at the same time being manageable in terms of analysis
runtime. The only criterion applied to the selection was that
players in the sample must have completed the first level of
TRU.
In terms of preprocessing, the main challenge was to
transpose the data obtained from the Metrics Suite into
a format we could use to analyze the data. To identify
distinct players it was necessary to collect several messages
to reconstruct their progress. The data in the sample were
extracted in a series of tables, cleaned and transposed to a
single table.
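As an illustration of this transposition step, the following sketch collapses event-style messages into one row per player. The tuple layout and the metric names are assumptions made for the example; the actual Metrics Suite schema is not described here.

```python
from collections import defaultdict


def transpose_events(events):
    """Collapse event messages into one row (dict) per player.

    events is an iterable of (player_id, metric_name, value) tuples, a
    hypothetical layout. Repeated metrics for a player are summed.
    """
    table = defaultdict(lambda: defaultdict(float))
    for player_id, metric, value in events:
        table[player_id][metric] += value
    # Convert the nested defaultdicts to plain dicts for a stable result.
    return {pid: dict(row) for pid, row in table.items()}


events = [
    ("p1", "L1-deaths", 1),
    ("p1", "L1-deaths", 1),
    ("p1", "L1-Seatop-T", 120.5),
    ("p2", "L1-deaths", 1),
]
rows = transpose_events(events)
```

Each key of `rows` is a player and each value is that player's feature row, which is the single-table form needed for the analysis.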
Based on previous experience with a smaller sample of
data from TRU [6], we chose to focus on game metrics
that relate to the primary game mechanics and play features,
as these are the most descriptive of the way TRU is played
and how players can interact with the game system. TRU
is a 3D platformer, with navigation being a major part of
the gameplay, as is solving puzzles and fighting enemies.
The features used for the current analysis relate to the core
mechanics of the game. Eight categories of features were
extracted, at two scales of resolution: Map Unit or Game
Level, giving a total of 674 variables per player. Which
resolution scale to use for each feature was chosen depending
on the frequency of the specific variable, the distribution of
use among the sampled players, the relation to the core game
mechanics and its suitability for machine learning. Given
the above-mentioned rationale the following features were
extracted:
• Playing time: The time that each player spent playing
the game, T. A total of 8.06 years of playtime were
included in the dataset (including the game prologue),
with an average playing time of 7.06 hours — with
different levels/MUs of TRU taking different amounts
of time to complete due to their varying size and/or
puzzle difficulty. The total playing time per player varies
between 21 minutes and 58.64 hours. The average time
taken to complete the entire game was 10.23 hours.
• Total number of deaths: The total number of deaths
for each player, D. There are 961403 instances of
death registered, across all levels/MUs and death causes
(96.14 average per player, varying from 0 to 1343 death
events; σ{D} = 83). The death count depends on,
e.g., how much of TRU a player has played and
the skill of the player.
• Help-on-Demand: The number of times help was requested, H. A key feature of TRU is the focus on
navigational puzzle solving. A typical puzzle could be
a door which requires specific switches to be pressed
in order to open. Players need to solve the numerous
puzzles in order to progress through the game. In
order to avoid player frustration with the puzzles, a
native Help-on-Demand (HoD) system was added to
TRU, from which a hint or solution can be requested
in relation to puzzles. The sampled data indicate that
players generally either request both hints and answers
or no help at all for specific puzzles. Both hint and
answer requests were therefore aggregated into the H
value. A total of 329907 HoD requests are recorded
(32.99 average per player). This value is also highly dependent on
how much of the game a player has played, and on the
player's skill and playstyle.
• Causes of death: TRU features a variety of ways in
which players can die. The causes of death can be
grouped into three categories: death via enemies (which
can be subdivided into ranged- and melee-oriented
enemies), from falling or from environmental hazards.
Death events caused by game bugs, for example players
dying during cinematic encounters, were not included.
– Enemies (melee), Dm: the number of deaths caused
by melee enemies. These enemies include tigers and
panthers, which attack Lara Croft in close combat.
Dying from melee enemies comprises 3.03% of the
total number of deaths recorded.
– Enemies (ranged), Dr: the number of deaths caused
by NPC enemies who attack using ranged weapons,
e.g. mercenary snipers. Dying from ranged enemies
comprises 4.14% of the total number of deaths
recorded.
– Environment, De: the number of deaths caused by
environment-related causes such as the player
drowning, being consumed by fire, or being killed in
a trap, comprising 29.9% of the total number of
deaths across all players.
– Falling, Df: the number of deaths caused by
falling. This cause of death comprises 62.92%
of all death events, making it the dominating way to
die in TRU, as would be expected from the game
design.
These numbers vary from those reported in [6], reflecting the different properties of the underlying samples: in
that study a sample of 1365 players who completed the game was used, whereas the current sample comprises
10000 randomly selected players among those who
completed the first level. The effect of sampling is seen
in e.g. death from opponents only comprising 8.13% in
the current dataset, but 28.9% in the dataset of [6]. Enemies
have a high impact in levels 5 and 6, levels that not all
players in the current sample will have reached. Death
by environmental causes comprised 13.7% in the earlier
study, 29.9% in the current, which is likely again due
to the different properties of the two samples. Death
by falling is similar however: 57.2% reported in [6] vs.
62.92% in the current sample. Fig. 2 depicts the causes
of death in TRU.
• Adrenalin: The number of times the adrenalin feature
was used, A. This is an advanced gameplay feature of
TRU that permits the player to temporarily slow down
time while performing special attacks against enemies.
When activated, a cursor has to be moved to the head
area of the target, which will trigger a headshot event.
The players in the sample used the adrenalin feature
72593 times, i.e. 7.26 per player. The use of adrenalin
is highly varied between players: between 0 and 304
uses.
Fig. 2. Percentages of the four causes of death in Tomb Raider: Underworld
across all seven game levels. Values are averages of all players (out of the
10000 players) that completed the corresponding level.
• Rewards: The number of rewards collected, R. The
levels of Tomb Raider: Underworld are rife with ancient artifacts, shards and similar relics, which players
have the opportunity to collect during the playing of
the game. A total of 1120708 artifacts/shards were
located by the players in the game (112.08 average;
σ{R} = 86.9).
• Treasure: The number of treasures found, T. Most
levels in TRU contain one or a few major treasures,
which take particular exploration to locate. Thus, a high
treasure count is indicative of explorative behavior in
players. A total of 24927 treasures are located in the
dataset (average T = 2.49; σ{T} = 5.1).
• Setting changes: Players can change various parameters
of the TRU game. Among these, four directly impact
gameplay, and therefore are of interest to the current
analysis:
– Ammo adjustment, Sa : The number of times the
player adjusts how much ammunition Lara Croft
is able to carry. Changing this setting comprises
29.6% of the total amount of settings changes.
– Enemy hit points, Se : The number of times the
player changes the amount of hit points that
computer-controlled enemies have, either positively
or negatively. Changing this setting comprises
31.5% of the total amount of settings changes.
– Player hit points, Sp : The number of times the
player adjusts how many hit points Lara Croft has,
effectively making her harder vs. easier to kill.
Changing this setting comprises 19.5% of the total
amount of settings changes.
– Saving grab adjustment, Ss : The number of times
the player lowers the recovery time when performing platform jumps, increasing the time available to
gain a handhold. Changing this setting comprises
19.4% of the total amount of settings changes.
There were 15317 settings changes made (max 104,
1.53 average per player); however, only 1740 players changed settings (8.8 changes on average among them). Settings changes were vastly more
common in the first two levels (comprising 34.71% and
37.82% of the changes, respectively), as compared to
the later levels (8.02% for level 3, 10.89% for level 4,
4.89% for level 5, 1.21% for level 6, 2.47% for level
7). This pattern possibly reflects the players adjusting
the difficulty parameters of the game early on, until
they are satisfied, and then using the adjusted parameters
throughout the rest of the game.
V. METHODOLOGY
After cleaning the 10000 player sample as described
above, 6430 players remained. For these players, 30 features
were collected relating to the performance of the player
on level 1. These were the amount of time, T , spent in
19 different locations of the level (e.g. in the ship engine
room and on the surface of the sea), and 11 other features
relating to this level only: the number of deaths, the total
reward, the number of help requests, the adrenalin used, the
number of treasures found, and the number of deaths from each of the
four different causes (melee, ranged weapons, environment,
falling) as well as from unknown causes.
From this set, a second, smaller set consisting of 3517
players who also completed level 2 was selected. For this set,
25 additional features were computed related to gameplay
performance on level 2 following the principles of designing
the level 1 dataset: the time spent on 14 locations of level 2
plus the 11 gameplay features used in dataset 1. All features
are normalized to be in [0, 1] via a uniform distribution.
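The exact normalization procedure is not spelled out here. One common reading, used purely as an assumption in the sketch below, is min-max scaling of each feature column onto [0, 1]; the function name is illustrative.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly onto [0, 1].

    Assumption: min-max scaling; the paper does not state which
    normalization was applied. A constant column maps to all zeros.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Example: three hypothetical death counts for one feature column.
scaled = min_max_normalize([2, 4, 6])
```

With this scaling the smallest observed value maps to 0 and the largest to 1, so features measured in seconds and features measured in counts become comparable.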
The target output for both data sets is a number indicating
the last level completed by the player. We thereby assume
there is an unknown underlying function between features of
gameplay behavior on the first two levels and the last TRU
level that was completed that a classification algorithm will
be able to predict.
A third data set was created from the second data set,
containing only the 1732 players that finished the whole
game, and including the same features as the second data set.
This data set was used to try to predict the time taken to
play through the game, assuming that there is some function
between early playing behaviour and speed of completion.
To test the possibility of predicting both the TRU level the
player completed last, and the time taken to complete the game,
we apply various classification and prediction algorithms
using the WEKA machine learning software (version 3.6.2)
from the University of Waikato [21]. WEKA is a comprehensive software package that includes versions of all the
main prediction and classification algorithms from machine
learning, as well as standard algorithms for preprocessing
and unsupervised learning and regression techniques from
statistics. This version of WEKA contains 76 algorithms
applicable to classifying a nominal attribute (the final level
played) from a vector of real-valued numeric attributes (the
normalized location times, deaths etc. mentioned above) from
8 algorithm families. Somewhat fewer (34 algorithms) can
predict a real value (time taken to finish the game) from
a real-valued vector. This abundance of tools points to the
maturity of the machine learning field, but means that all
algorithms and all parameters cannot reasonably be tried on
any particular problem.
Given the experimental aim, our approach was to try at
least one algorithm from each of the families of algorithms
on each dataset, and to spend extra effort on those classification algorithms that were included in the recent list of the
most important algorithms in data mining: decision tree induction, backpropagation/multilayer perceptrons and simple
regression [22]. Variants of those algorithms were explored
and the space of parameters was searched manually. They
were also used as components for ensemble classifiers and
as subset evaluators for feature subset evaluation algorithms,
in order to achieve maximum classification performance. In
the following section, we only report the best and most interesting results we have obtained from this experimentation.
For all tested algorithms, the reported classification/prediction accuracy was achieved through 10-fold cross
validation.
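For readers unfamiliar with the procedure, 10-fold cross-validation can be sketched as follows. This is a generic illustration in plain Python, not WEKA's implementation; the function name, the shuffling seed and the round-robin fold assignment are choices made for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once and dealt round-robin into k folds; each
    fold serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

A model is trained on each `train` list and scored on the matching `test` list, and the reported accuracy is the average over the ten folds.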
VI. EXPERIMENTS
The first set of experiments aims to predict the last
level finished for each player, based either only on features from
level 1 or on features from levels 1 and 2 combined.
The second set of experiments aims to predict the total time
the player took to finish the game, based either on only level
1 features or on both level 1 and 2 features.
A. Last level completed
Before trying to predict which will be the last level a
player finishes, we need to establish the baseline accuracy:
what would an optimal predictor predict in the absence of
any attribute data? This number is equivalent to the number
of samples in the most common class (i.e. level completed)
divided by the total number of samples. As can be seen
from Table I, for the dataset containing all 6430 players that
finished level 1, the best guess, in the absence of further
information, is that the player only finishes level 1, leading
to a baseline prediction accuracy of 39.8%. For the 3571¹
players that also finished level 2, the best guess is that a
player finishes all the levels (last level finished is level 7),
yielding a baseline prediction accuracy of 50%.
¹ This number is lower than would be expected by subtracting the players
that only finished level 1 from the first dataset (6430 − 2561 = 3869) due
to extra cleaning that was performed to remove players with missing level
2 features.
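The majority-class baseline for the level 1 dataset can be checked directly from the counts in Table I. The sketch below is a minimal illustration; the function and variable names are ours.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)


# Number of players whose last completed level was 1..7 (Table I).
stopped_at = {1: 2561, 2: 376, 3: 1045, 4: 393, 5: 56, 6: 267, 7: 1732}
labels = [level for level, n in stopped_at.items() for _ in range(n)]

baseline = majority_baseline(labels)  # 2561 / 6430
```

Here `baseline` evaluates to 2561/6430, i.e. about 39.8%, which is the level 1 baseline listed in Table II.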
TABLE I
NUMBER OF PLAYERS (OUT OF THE 6430 FINISHING THE FIRST LEVEL)
THAT STOPPED PLAYING THE GAME ON EACH LEVEL.
Level           1     2    3     4    5   6    7
No. of Players  2561  376  1045  393  56  267  1732
TABLE II
BEST ACCURACY (%) OF SEVERAL CLASSIFICATION ALGORITHMS ON
PREDICTING FINAL LEVEL BASED ON FEATURES FROM ONLY LEVEL 1 OR
FROM LEVEL 1 AND 2, USING DEFAULT OR LIGHTLY MANUALLY TUNED
PARAMETERS. HIGHER VALUES ARE BETTER. NOTE THAT THIS IS JUST A
SUBSET OF ALL ALGORITHMS THAT WERE TESTED.
Algorithm                          Level 1  Levels 1 and 2
Logistic regression                48.3     77.3
MLP/Backpropagation                47.7     70.2
J48 (C4.5) decision tree (pruned)  48.7     77.4
REPTree decision tree (pruned)     48.5     77.2
Multinomial naive bayes            43.9     50.2
Bayes network                      46.7     65.1
SMO Support vector machine         45.9     70.0
Baseline                           39.8     45.3
As described above, a number of classification algorithms
were brought to bear on the problem of predicting last
finished level based on attributes from level 1 or from level
1 and 2. It was found to be easy to do substantially better
than baseline accuracy. The best accuracy on predicting final
level based on attributes from level 1 was 47.7% (baseline
39.8%), and based on attributes from both levels 1 and 2 it was
76.9% (baseline 50%).
The best results were found using logistic regression;
several algorithms were able to achieve similar accuracy, but
none could surpass this simple algorithm. The performance
of a selected few algorithms can be seen in Table II.
Most of the tested algorithms had similar levels of performance (with the exception of a few algorithms, especially
the Bayesian ones, which underperformed), and were able
to predict substantially better than the baseline. In particular,
when using features also from level 2, we were able to predict
the last level with a much better accuracy than the baseline
guess, suggesting that such predictors could be meaningfully
used both for analyzing game mechanics and adapting the
game online so as to keep the player playing. The difference
in the predictive strength of using level 1 and 2 data as
compared to only level 1 data is partly due to the increased
number of features used in the second case, and of course
to the fact that players who stopped playing before finishing
level 2 are not part of the second data set. But it is also
important to note that level 1 of TRU is designed as a form
of “training level”, with less varied hazards to the player.
The main hazard is falling, which is also evident from the
recorded causes of death for level 1 (see Fig. 2). Levels 2-7,
while showing substantial variation in theme and design, are
more homogeneous in that they are all varied in their navigational
challenges and in the other challenges the players encounter.
Apart from accuracy, another important advantage of some
machine learning algorithms is the transparency and the
expressiveness of the acquired model. The models are more
useful to a human game designer if they can be expressed in
a form which is easy to visualize and comprehend, so that
the consequences of changing particular design elements can
be easily grasped. Multi-layer perceptrons are particularly
limited from this perspective, and linear models with many
free variables are not so powerful either. However, decision trees
of the form constructed by the ID3 algorithm and its many
derivatives are excellent from this perspective, especially if
pruned to a small size.
The following extremely small decision tree is produced
by the REPTree algorithm constrained to tree depth 2, and
has a classification accuracy of 47.3% when trained on data
from level 1 only:
L1-Seatop-T < 10835.5
→ L1-R < 25.5 : 1
→ L1-R ≥ 25.5 : 7
L1-Seatop-T ≥ 10835.5 : 7
On the set of players who completed both levels 1 and 2,
the following tree has a classification accuracy of 76.7%:
L2-R < 18.5
→ L2-Flushtunnel-T < 9.858 : 2
→ L2-Flushtunnel-T ≥ 9.858 : 3
L2-R ≥ 18.5 : 7
The right arrow (→) symbol in the trees above
indicates a branch under the tree node directly above
the symbol. The number to the right of the colon represents
the predicted game level. The accuracy of these predictors is
quite impressive given how extremely simple they are. The
idea that it would be possible to guess which level a player
will finish on much better than baseline, based simply on how
long the player spends on the surface of the sea (L1-Seatop-T; in seconds) in the first level and her total reward
(L1-R) during the first level, would seem rather outrageous if
it were not supported by empirical evidence. The same goes
for the idea that we could predict the final level with quite
high accuracy based only on the amount of time spent in the
Flush Tunnel room (L2-Flushtunnel-T) and the total rewards
collected for level 2 (L2-R).
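To make the simplicity of the two trees concrete, they can be written out as plain conditionals. The thresholds and predicted levels are copied from the trees above; the function and argument names are illustrative only.

```python
def predict_from_level1(seatop_time, reward):
    """Depth-2 tree on level 1 features (L1-Seatop-T, L1-R).

    Returns the predicted last completed level.
    """
    if seatop_time < 10835.5:
        return 1 if reward < 25.5 else 7
    return 7


def predict_from_level2(reward, flushtunnel_time):
    """Depth-2 tree on level 2 features (L2-R, L2-Flushtunnel-T).

    Returns the predicted last completed level.
    """
    if reward < 18.5:
        return 2 if flushtunnel_time < 9.858 else 3
    return 7
```

For instance, `predict_from_level1(500, 10)` follows the first branch twice and predicts that the player stops at level 1, whereas any player with an L2-R of at least 18.5 is predicted to finish the game.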
What these two decision trees indicate is that the amount
of time players spent within a given area early in the game
and how well they perform is important for determining
if they continue playing the game. Time spent can be
indicative of problems with progressing through the game,
which can lead to frustration. According to these trees the
computer-controlled enemies of TRU do not appear to help
in predicting when players will stop playing the game.
The fact that only very little performance can be gained
from using all 30 (or 55) features rather than just 2 or
3, especially when those 2 or 3 features do not appear
to be much more important than other features, suggests
that there is a very high degree of inter-correlation among
those features. We, therefore, used the CFS feature subset
evaluator [23], which rates a set of features depending on
their correlation with the target class and the degree of
redundancy between the features, together with a greedy
search method (i.e. sequential forward feature selection)
which starts by adding the most significant feature and then
adds one feature at a time until the feature subset cannot
be improved. From all 55 features, this method selected
only four (L1-Seatop-T , L2-Norsehall-T , L2-R and L2-H)
confirming our assumption that the vast majority of features
are highly inter-correlated.
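The interplay of relevance and redundancy can be sketched with the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-target correlation and r_ff the mean feature-feature correlation of a subset of k features. The sketch below is not WEKA's implementation: it assumes absolute Pearson correlation as the correlation measure and a numeric target, and the toy data are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(features, target, subset):
    """CFS merit: rewards target correlation, penalizes redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    else:
        r_ff = 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_select(features, target):
    """Greedily add the feature that most improves the merit; stop when none does."""
    selected, best = [], 0.0
    remaining = set(features)
    while remaining:
        merit, name = max(
            (cfs_merit(features, target, selected + [f]), f) for f in sorted(remaining)
        )
        if merit <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected


# Invented toy data: "b" is nearly a copy of "a", "noise" is weakly related.
target = [1, 2, 3, 4, 5, 6]
features = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 6, 5],
    "noise": [3, 1, 2, 3, 1, 2],
}
chosen = forward_select(features, target)
```

On this toy data only "a" is selected: "b" correlates well with the target but is redundant with "a", so adding it lowers the merit. This is the same mechanism by which a handful of features can stand in for many inter-correlated ones.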
B. Completion time
The next set of experiments aims at predicting the time
taken to finish the game, based on the same features as above,
either from level 1 only or from both levels 1 and 2.
As in the previous set of experiments we tried standard
linear regression methods for the prediction of completion
time. The feature (of all features from levels 1 and 2)
that correlates most with completion time is L1-Seafloor-T
(positive correlation 0.35), and employing univariate linear
regression from this feature to completion time yields a
relative absolute error (RAE) of 92%. The RAE statistic is
computed as the average absolute difference between the predicted
and the target value divided by the average absolute difference between the
mean and the target value. Multivariate linear regression
manages to reduce this error to 88.2% when only using
features from level 1, and to 84.5% when using features from
both levels 1 and 2 (see Table III).
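A minimal sketch of the RAE statistic follows, using the common formulation in which summed absolute errors of the model are divided by the summed absolute errors of always predicting the mean. The function name and the example numbers are ours.

```python
def relative_absolute_error(predicted, actual):
    """RAE in percent: model error relative to predicting the mean.

    100% means no better than always guessing the mean of `actual`;
    lower is better.
    """
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for p, a in zip(predicted, actual))
    baseline_error = sum(abs(mean_actual - a) for a in actual)
    return 100.0 * model_error / baseline_error


# Invented completion times (hours) and two sets of predictions.
actual = [8, 10, 12]
rae_model = relative_absolute_error([9, 10, 11], actual)
rae_mean = relative_absolute_error([10, 10, 10], actual)
```

In this toy case the model halves the error of the mean predictor (RAE of 50%), while predicting the mean itself gives 100% by construction, which is why the baseline rows of such a comparison read 100.0.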
These linear methods were contrasted with a large number
of nonlinear methods for numeric prediction from machine
learning; selected results are shown in Table III. As can be
seen, some of the methods (SMO and REPTree combined
with bagging) outperformed the linear methods by a notable
margin. Attribute selection and ensemble classification were tried, as well as moderate parameter tuning, and
the results reported in the table reflect the best configuration
found for each algorithm (as above, results are only reported
for selected algorithms). As in the classification task,
surprisingly poor (sub-baseline) performance was noted from
an otherwise reliable algorithm, the MLP using backpropagation. This serves to underscore that experiments like these,
which do not perform a systematic search in parameter space,
can only show that a particular algorithm can work for some
type of problem, not that it cannot work.
The features that best predict the time taken to complete
the game are unsurprisingly the times taken to complete
various units of the first two levels. This can be seen both
from which features best correlate with the game completion
time, and which features best split the data set into binned
classes in the REPTree classifier.
To summarize, we can predict the completion time substantially better than just random guessing, and using features
from level 2 as well as from level 1 increases the
accuracy of our predictions; the best predictor found is the
support vector machine, achieving an RAE of 82.4%. Given
the results obtained, the underlying function appears to be
nonlinear.
TABLE III
RELATIVE ABSOLUTE ERROR (%) OF SEVERAL ALGORITHMS ON
PREDICTING GAME COMPLETION TIME BASED ON FEATURES FROM ONLY
LEVEL 1 OR FROM LEVEL 1 AND 2, USING DEFAULT OR LIGHTLY
MANUALLY TUNED PARAMETERS. LOWER VALUES ARE BETTER.
Algorithm                       Level 1  Levels 1 and 2
Simple linear regression        92.0     92.0
Multivariate linear regression  89.4     84.5
SMO Support vector machine      88.2     82.4
MLP/Backpropagation             107.2    111.5
REPTree decision tree (pruned)  92.5     91.8
Bagging REPTree (pruned)        85.2     83.5
M5Rules decision list           93.7     88.6
Gaussian processes              88.8     84.3
Baseline                        100.0    100.0
The question that remains to be answered is exactly
how useful these predictions are. Our best predictions still
have errors roughly 4/5 as high as just guessing the average value,
meaning that it is unlikely this information would really help
in e.g. guiding real-time game adaptation, although it does provide
useful feedback to guide game design. It might be possible to
predict outliers (extremely high or low completion times)
with higher accuracy, something we have not tried. But our
main conclusion regarding completion time is that prediction
algorithms are in need of more detailed gameplay metrics and
more extracted statistical features.
VII. DISCUSSION
Despite the strong indication that prediction of player
behavior based on quantitative measures of their early play
performance is possible and the indication that it may be
a few features of the behavior of the players that are the
most important predictors, the predictive power of the models
presented above is moderate.
We believe that one of the reasons for the moderate
performances achieved in this paper is the existence of data
noise both in terms of unreasonable outliers in unit times
(which could be generated due to different patches of the
game interlocking with the Metrics Suite) and in terms of
missing information on players for some game levels. Even
though we put substantial effort into removing noise from these
large-scale datasets, we cannot be entirely certain of the
degree of noise that still exists within those datasets. An
additional issue is the limited number of variables available
in the TRU dataset, which do correlate with the core of
the gameplay, but lack for example player movement paths.
Improving on these points is likely to improve the
predictive strength of the algorithms used here.
In the future we would like to have access to
less noisy data via improved logging systems that use even
more efficient server-client network communications. The
data obtained from Tomb Raider: Underworld was among
the first collected using the then newly developed SQE
Metrics Suite, which has since been further developed.
Data from the newer games, which contain more variables
compared to TRU, will form the focus of future research in
our attempt to test the generality of the approach followed
in this paper.
Future work will also focus on being able to predict when
a player stops the game at a finer granularity. Thus, we
would like to know not only at what level, but also at
which map unit/specific situation the player stops playing. On
that basis, supervised learning techniques could potentially
perform better if we instead attempt to predict the type of
situation in which the player is when she stops playing.
Future research will also investigate association mining,
combining clusters of player behavior with gameplay metrics
data to investigate if particular play-styles have an impact on
game completion and the underlying reasons for why players
stop playing before a game is completed. Finally, the 10000
player dataset used in the current study is only a fraction of
the main dataset containing data from 203000 players, which
in turn is a subset of the main SQE Metrics Suite database
which contains data from over 1.5 million players. Future
research will focus on testing clustering and classification
methodologies on those massive-scale datasets.
The causes preventing players from completing a game are
possibly game-specific and may relate to particular playing
styles [6], [4], although it is possible that there are principles
that apply across specific subsets of games or digital games in
general, for example a high difficulty (steep learning curve)
early in a game. Ideas about how to keep players engaged
are prevalent in the game industry, and increasingly backed
by behavioral and cognitive psychology as user research
is gaining importance in commercial game development;
however, there is very limited publicly available empirical
evidence, due to the general proprietary nature of such data.
Studies such as the one presented here form a first step
towards addressing this problem.
ACKNOWLEDGMENTS
This work would not be possible without the game development companies who are involved. The authors would
like to thank their colleagues at Crystal Dynamics and IO
Interactive (IOI) for continued assistance with access to
the Square Enix Europe Metrics Suite and discussion of
approaches, methods and results, including but certainly
not limited to: Thomas Hagen and the rest of the Square
Enix Europe Online Development Team, Janus Rau Sørensen
and the rest of the IOI User-Research Team, Tim Ward,
Kim Krogh, Noah Hughes, Jim Blackhurst, Markus Friedl,
Thomas Howalt, Anders Nielsen as well as the management
of both companies.
REFERENCES
[1] K. Isbister and N. Schaffer, Game Usability: Advancing the Player
Experience. Morgan Kaufmann, 2008.
[2] J. H. Kim, D. V. Gunn, E. Schuh, B. C. Phillips, R. J. Pagulayan, and
D. Wixon, “Tracking real-time user experience (true): A comprehensive instrumentation solution for complex systems,” in Proceedings of
CHI, Florence, Italy, 2008, pp. 443–451.
[3] R. J. Pagulayan, K. Keeker, D. Wixon, R. L. Romero, and T. Fuller,
The HCI handbook. Lawrence Erlbaum Associates, 2003, ch. User-centered design in games, pp. 883–906.
[4] A. Drachen and A. Canossa, “Towards Gameplay Analysis via Gameplay Metrics,” in Proceedings of the 13th MindTrek 2009. Tampere,
Finland: ACM-SIGCHI Publishers, September 2009.
[5] A. Tychsen and A. Canossa, “Defining personas in games using
metrics,” in Proceedings of Future Play 2008. Toronto, Canada:
ACM publishers, 2008, pp. 73–80.
[6] A. Drachen, A. Canossa, and G. N. Yannakakis, “Player Modeling
using Self-Organization in Tomb Raider: Underworld,” in Proceedings
of the IEEE Symposium on Computational Intelligence and Games.
Milan, Italy: IEEE, September 2009, pp. 1–8.
[7] R. Houlette, Player Modeling for Adaptive Games. AI Game Programming Wisdom II. Charles River Media, Inc, 2004, pp. 557–566.
[8] D. Charles and M. Black, “Dynamic player modelling: A framework
for player-centric digital games,” in Proceedings of the International
Conference on Computer Games: Artificial Intelligence, Design and
Education, 2004, pp. 29–35.
[9] C. Thurau, C. Bauckhage, and G. Sagerer, “Learning human-like
Movement Behavior for Computer Games,” in From Animals to Animats 8: Proceedings of the 8th International Conference on Simulation
of Adaptive Behavior (SAB-04), S. Schaal, A. Ijspeert, A. Billard,
S. Vijayakumar, J. Hallam, and J.-A. Meyer, Eds. Santa Monica,
LA, CA: The MIT Press, July 2004, pp. 315–323.
[10] ——, “Combining self organizing maps and multilayer perceptrons to
learn bot-behaviour for a commercial game,” in GAME-ON, 2003, pp.
119–123.
[11] C. Thurau, T. Paczian, and C. Bauckhage, “Is bayesian imitation
learning the route to believable gamebots?” International Journal of
Intelligent Systems Technologies and Applications, vol. 2, no. 2/3, pp.
284–295, 2007.
[12] R. Thawonmas, M. Kurashige, K. Iizuka, and M. Kantardzic, “Clustering of Online Game Users Based on Their Trails Using Self-organizing
Map,” in Proceedings of Entertainment Computing - ICEC 2006, 2006,
pp. 366–369.
[13] O. Missura and T. Gärtner, “Player modeling for intelligent difficulty
adjustment,” in Proceedings of the ECML–09 Workshop From Local
Patterns to Global Models (LeGo–09), J. F. Arno Knobbe, Ed., Bled,
Slovenia, September 2009.
[14] R. Thawonmas, Y. Kashifuji, and K.-T. Chen, “Detection of MMORPG
Bots Based on Behavior Analysis,” in Proceedings of the 2008
International Conference on Advances in Computer Entertainment
Technology (ACE). Yokohama, Japan: ACM, 2008, pp. 91–94.
[15] R. Thawonmas and K. Iizuka, “Visualization of online-game players
based on their action behaviors,” International Journal of Computer
Games Technology.
[16] N. Ducheneaut and R. J. Moore, “The Social Side of Gaming: A
study of interaction patterns in a Massively Multiplayer Online Game,”
in Proceedings of the 2004 ACM conference on Computer supported
cooperative work. Chicago, Illinois: ACM, 2004, pp. 360–369.
[17] K.-T. Chen, A. Liao, H.-K. K. Pao, and H.-H. Chu, “Game Bot
Detection Based on Avatar Trajectory,” in Proceedings of the 7th
International Conference on Entertainment Computing (ACE). ACM,
2008, pp. 94–105.
[18] B. Weber and M. Mateas, “A Data Mining Approach to Strategy
Prediction,” in IEEE Symposium on Computational Intelligence in
Games (CIG 2009), Milan, Italy, September 2009, pp. 140–147.
[19] C. Thurau, K. Kersting, and C. Bauckhage, “Convex non–negative
matrix factorization in the wild,” in Proceedings of the 9th IEEE International Conference on Data Mining (ICDM–09), W. W. H. Kargupta,
Ed., Miami, FL, USA, Dec. 6–9 2009.
[20] R. Rosenthal, “Covert communication in laboratories, classrooms, and
the truly real world,” Current Directions in Psychological Science,
vol. 12, no. 5, pp. 151–154, 2003.
[21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
Witten, “The WEKA Data Mining Software: An Update,” SIGKDD
Explorations, vol. 11, no. 1, 2009.
[22] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda,
G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach,
D. J. Hand, and D. Steinberg, “Top 10 algorithms in data mining,”
Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2007.
[23] M. A. Hall and L. A. Smith, “Practical feature subset selection
for machine learning,” in Australian Computer Science Conference.
Springer, 1998, pp. 181–191.
Player Modeling using Self-Organization in Tomb Raider:
Underworld
Anders Drachen, Alessandro Canossa and Georgios N. Yannakakis
Abstract—We present a study focused on constructing models
of players for the major commercial title Tomb Raider: Underworld (TRU). Emergent self-organizing maps are trained on
high-level playing behavior data obtained from 1365 players
who completed the TRU game. The unsupervised learning
approach utilized reveals four types of players which are
analyzed within the context of the game. The proposed approach automates, in part, the traditional user and play testing
procedures followed in the game industry since it can inform
game developers, in detail, if the players play the game as
intended by the game design. Subsequently, player models can
assist the tailoring of game mechanics in real-time for the needs
of the player type identified.
Keywords: Player modeling, unsupervised learning, emergent self-organizing maps, Tomb Raider: Underworld
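To make the idea of training a self-organizing map on behavior data concrete, the following is a minimal sketch of a plain rectangular SOM in Python. It is not the emergent SOM used in the study (emergent variants typically rely on far larger maps), and everything specific in it is an assumption for illustration: the function names, the 3x3 map size, the linear decay schedules, and the two synthetic, normalized features standing in for high-level playing behavior.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=30, lr0=0.5, seed=0):
    """Train a small rectangular SOM and return {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0  # initial neighborhood radius (assumed schedule)
    # One weight vector per map node, initialized uniformly at random in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # radius shrinks, floored at 0.5
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Gaussian neighborhood measured on the map grid, not in feature space.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


def make_demo_data(n_per_group=20, seed=1):
    """Synthetic stand-in data: two groups of 'players' with two hypothetical features."""
    rng = random.Random(seed)
    centers = [(0.1, 0.1), (0.9, 0.9)]
    return [
        [cx + rng.uniform(-0.05, 0.05), cy + rng.uniform(-0.05, 0.05)]
        for cx, cy in centers
        for _ in range(n_per_group)
    ]


demo_som = train_som(make_demo_data(), rows=3, cols=3)
# Map a representative point from each synthetic group to its node on the grid.
print(best_matching_unit(demo_som, [0.1, 0.1]), best_matching_unit(demo_som, [0.9, 0.9]))
```

In this sketch, players with similar feature vectors end up on the same or neighboring map nodes, which is the property that lets groups of players be read off a trained map. Identifying distinct player types from a real map would additionally require inspecting the distances between neighboring nodes, which is omitted here.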
I. INTRODUCTION
Being able to evaluate how people play a game is a
crucial component of the user-oriented testing process in the
game development industry. During the development phases,
games are iteratively improved and modified towards the final
gold master version, which is published. Representatives of
the target audience as well as internal professional testers
spend hundreds of hours testing the games and evaluating
the quality of the gaming experience [1]. Moreover, one
of the key components of user-oriented testing, both during
production and after game launch, is to evaluate whether
people play the game as intended — and if not, to find out
why there is a difference between the intended and actual
playing behavior, and whether this has an impact on their
playing experience [1], [2]. Given that nonlinear game design
(i.e., game design in which the player has multiple choices
about how to progress in the game) is becoming increasingly
popular, with massively multiplayer online games being a good
example of the increased popularity of nonlinear sandbox-type
games, the need for more reliable and detailed user testing is growing.
Within the last five years, instrumentation data — or game
metrics as they are referred to in game development — has
gained increasing attention in the game industry as a source
of detailed information about player behavior in computer
games [2]. Gameplay metrics are detailed numerical data
extracted from the interaction of the player with the game
using specialized monitoring software [3]. The application
of machine learning on such data and the inference of
AD and GNY are with the Center for Computer Games Research, IT
Univer...