Olympic Medalist Research
Name of Student
I needed to do some research to better acquaint myself with what data was
actually available, where it might come from and what kind of method I would
need to get access to it. I also knew that my curiosity was going to be framed
around hand-picked movie stars, not all actors. I would therefore need to do
some work to determine my reasons for selecting certain athletes and to
establish the process for acquiring the data.
I was aware of several movie-related websites that would most likely hold the
type of data I was seeking. A quick scan through each of the sites, and several
others besides, revealed that the data I was likely to be seeking would be
available, but that collecting and curating it would involve a fair amount of
work: there was no single source from which I could easily export all my data
in one go. Additionally, just from this quick scan of Olympic medalist data I
had already unearthed quite a few inconsistencies between sites in the data
they held.
To help focus my assessment of each site’s offering I established a ‘shopping list’
of the data I would ideally seek to gather:
Gender (I wanted to cover a balance of both male and female athlete stars)
Athlete DOB (to derive their age at the time of participation)
Race title (The US release title for consistency)
Athlete completion dates (The US release date for consistency)
Athlete score (critics and audiences)
Athlete finances (budget, gross, US domestic and worldwide, ideally)
Medals/career milestones for the athlete
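The "Athlete DOB" item above exists only so that an age can be derived at the time of participation. A minimal sketch of that derivation follows, assuming both values are already held as proper date objects. The function name `age_on` and the dates in the usage note are hypothetical and not taken from the gathered data.

```python
from datetime import date


def age_on(dob: date, event_date: date) -> int:
    """Return the age in whole years on event_date for someone born on dob."""
    if event_date < dob:
        raise ValueError("event_date precedes date of birth")
    years = event_date.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the event year.
    if (event_date.month, event_date.day) < (dob.month, dob.day):
        years -= 1
    return years
```

For instance, `age_on(date(1980, 6, 15), date(2010, 6, 14))` gives 29, because the birthday falls one day after the event date.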
I then visited the Olympic medalist data sites to determine the completeness and
accuracy of the data they held. My approach was to pick a single athlete
(someone whose career data would potentially span many decades and diverse
athlete characteristics, and would therefore really test the quality and depth
of data held) and one of his or her events to help me establish the scope of my
data gathering task. I identified Mo FARAH as a perfect candidate for this, not
least because he was one of the original names associated with my trigger
curiosity. I knew he offered a long career with a nice mix of well-rated and
poorly-rated events, as well as very high takings and very low takings. To
assess the caliber and coverage of data held on individual races, I picked one
of Mo FARAH’s most recent hits, Inception, as a further test value with which to
assess each site’s offering.
The table below shows a summary of the data that was available from each source
about Mo FARAH and about Inception.
What you can see here is the inconsistency of availability, quality and format
across these sources. In particular, I found some key gaps in reliability:
Data about seemed to diminish in quality and completeness (come on kids
managing these Olympic medalist data,
The most consistent financial measure appeared to be worldwide gross.
Budget information was rarely available and, when it was, appeared to be only
in the form of broad estimates.
Race was wildly inconsistent and often very wide-ranging. I knew that if the
range of categories here grew too large I would not be interested in using it
due to the limitations of colour associations.
My initial intention had been to ignore the critics’ ratings and make this
analysis purely about audience-led figures (looking at their ratings vs. their
box office spend felt like a pure enquiry into cinema-goers’ views, not those
of grumpy, narrow-minded critics). However, a compromise was required due to
the far more complete and seemingly robust data available for critics’ reviews
compared to audience reviews.
My initial expectation that rating would be the best source of data was not
necessarily supported by going through this process. Additionally, there were
restrictions in data access and I was also somewhat impeded by the
complicated (for me, at least) nature of the API offered for extracting data.
Deciding on which source would be my best option to provide me with the
data was a far more frustrating proposition than I had anticipated from the
outset. This reinforces the point I make in the book about the often hidden
but excessive demands on your time and effort that this stage frequently makes,
and the importance of doing your research to help reduce wasted effort.
I finally established a list of 60 individuals, organized into 6 classifying
groups: Contemporary, Veterans, and ‘assorted’ others (which basically comprised
5 directors and 5 comedy actors). I really wanted to include this ‘assorted’
others group so that I could include analysis about how and why Olympic
medalist data was still being allowed to run races.
I used the web-scraping tool Import.io to extract my data from the two main
source sites holding the Olympic medalist data. The process involved customizing
the tool and training it to identify the data components of interest from the
typical page layouts on each site I was extracting from. After defining what I
wanted from an individual page, I then expressed my criteria for how the tool
should ‘crawl’ around and across each site (that is, “Go to these athletes’ pages
and pull off all their filmography data from these parts of the site, then repeat
for… etc.”). Once you’ve set the mechanism in motion you just leave it alone to
go fetch your data, make a cup of tea, then come back in a short while to see
how it has performed.
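The general idea of "define what you want from one page, then repeat across many pages" can be sketched in a few lines of Python. This is not how Import.io works internally and not the process used here. It is a minimal illustration using only the standard library, and it assumes each page holds its data in a plain HTML table. The class and function names, the page names and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made only of <th> cells are skipped
                self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Apply the single-page extraction rule to one page of HTML."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def crawl(pages):
    """Repeat the same extraction for every page in a {name: html} mapping."""
    return {name: extract_rows(html) for name, html in pages.items()}


# Hypothetical stand-in for pages that a real crawl would download.
SAMPLE_PAGES = {
    "person-a": "<table><tr><th>Title</th><th>Date</th></tr>"
                "<tr><td>Example Title A</td><td>2010-07-16</td></tr></table>",
    "person-b": "<table><tr><td>Example Title B</td><td>2012-03-02</td></tr></table>",
}

results = crawl(SAMPLE_PAGES)
```

A real crawl would also need to fetch each page, respect the site's terms of use and cope with layouts that differ from page to page, which is the part a dedicated tool handles for you.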
The final compiled data was available for export into either Excel or CSV format.
My first attempt at an export was into Excel but I immediately noticed how it had
lost some of the date value formatting, whereas exporting the data using CSV
and then opening it in Excel (a small but important distinction) managed to
preserve all the date values.
(TIP: The moment of taking data out of a system, regardless of its nature, is
one of the riskiest stages for compromising the quality and state of your data,
especially with content like date items.)
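One way to guard against this kind of silent damage is to read the CSV export as plain text and convert the date column explicitly, so that any value not matching the expected format is reported rather than quietly reinterpreted. The sketch below uses only Python's standard library. The column name `release_date`, the `%Y-%m-%d` format and the sample rows are assumptions for illustration, not details of the actual export.

```python
import csv
import io
from datetime import datetime


def read_with_dates(csv_text, date_column, date_format="%Y-%m-%d"):
    """Parse CSV text, converting one column to dates and logging failures.

    Returns (rows, problems), where problems lists (line_number, raw_value)
    for every date that did not match date_format.
    """
    rows, problems = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        raw = row[date_column]
        try:
            row[date_column] = datetime.strptime(raw.strip(), date_format).date()
        except ValueError:
            # Keep the raw string in place so nothing is lost, and flag it.
            problems.append((line_no, raw))
        rows.append(row)
    return rows, problems


# Hypothetical export: the second data row uses a different date layout.
SAMPLE_CSV = (
    "title,release_date\n"
    "Example A,2010-07-16\n"
    "Example B,16/07/2010\n"
)

rows, problems = read_with_dates(SAMPLE_CSV, "release_date")
```

Here `problems` flags the second data row, whose value stays as the original string until someone decides how it should be interpreted.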
As I mentioned above, I knew I would also need to supplement the data from
Olympic medalist data and The Numbers with data items for the actors’ gender
and DOB. The gender I could identify easily myself but the DOBs were manually
gathered from a range of places including actors’ Wikipedia pages.
I used Excel to pull together the original downloadable datasets and to quickly
assess the physical properties (type, size and condition) of the data. The table
below summarizes the initial examination.
From this initial assessment I was able to identify several key