Running head: DATA VISUALIZATION
Discussion Post 1:
Employee Satisfaction Survey
[Bar chart: average Likert-scale satisfaction ratings (0–6) for 'Opportunities for growth', 'Chance to be creative', 'Friendly working environment', and 'Employees reviews' across the Customer Service Unit, Marketing Unit, and Sales Department.]
What does the visual represent?
Recently I carried out an employee satisfaction survey. The data was collected using a Likert
scale, and the employees' responses fell into one of the categories below:
• Agree
• Neutral
• Disagree
The survey data visualization shows that the organization under study is not seen as a place
where employees can advance their careers across the three different departments. This is
mainly because, in two of the departments, senior officers are externally sourced whenever a
position falls vacant instead of being promoted internally. The other interesting finding is that
the marketing department is not satisfied with its working environment. The topic can be
discussed in the next company meeting.
What was its purpose?
The main objective of the survey was to gain insights into the motivation of the employees
across the three main departments within the organization.
Who was your audience?
The main audience of the survey is the line supervisors, departmental heads, managers,
directors, and the Chief Executive Officer. Through this survey they stand a better chance of
understanding the issues and identifying the appropriate steps necessary to motivate the staff
in the three different departments.
Why did you compose the visual this way?
The visual clearly differentiates employee satisfaction across the various departments. It shows the
average satisfaction level based on data from the three departments, which is valuable in
understanding whether a particular department's employees are satisfied above or below the overall
average. Arranging the categories from least satisfied to most satisfied, based on the average,
clearly identifies where the organization should focus.
Comment 1:
Hi,
I am glad that the idea you described in your post relates to the overall perception of how we ought
to make data visualization meaningful. In fact, I think you may agree with me that all the steps
you have described entail a function that adds more value to data. I also noticed that you have
given particular attention to how the audience should receive the data. This is a broad view of
data visualization. I still believe that there exist other scientific perspectives on data visualization.
For instance, if we consider data visualization in terms of how a computer works, there may
be some different steps. However, the same logic still flows. Your post is quite informative.
Comment 2:
Hi,
Nice explanation. I want to add a few more points to your discussion. We need data visualization
because a visual summary of information makes it easier to identify patterns and trends than
looking through thousands of rows on a spreadsheet. It’s the way the human brain works. Since
the purpose of data analysis is to gain insights, data is much more valuable when it is visualized.
Even if a data analyst can pull insights from data without visualization, it will be more difficult to
communicate the meaning without visualization. Charts and graphs make communicating data
findings easier even if you can identify the patterns without them.
Reference
Kirk, A. (2016). Data Visualization: A Handbook for Data Driven Design. SAGE
Publications Ltd.
2 Working With Data
In Chapter 3 the workflow process was initiated by exploring the defining matters around context and vision.
The discussion about curiosity, framing not just the subject matter of interest but also a specific enquiry that
you are seeking an answer to, in particular leads your thinking towards this second stage of the process:
working with data.
In this chapter I will start by covering some of the most salient aspects of data and statistical literacy. This
section will be helpful for those readers without any – or at least with no extensive – prior data experience. For
those who have more experience and confidence with this topic, maybe through their previous studies, it
might merely offer a reminder of some of the things you will need to focus on when working with data on a
visualisation project.
There is a lot of hard work that goes into the activities encapsulated by ‘working with data’. I have broken
these down into four different groups of action, each creating substantial demands on your time:
Data acquisition: Gathering the raw material.
Data examination: Identifying physical properties and meaning.
Data transformation: Enhancing your data through modification and consolidation.
Data exploration: Using exploratory analysis and research techniques to learn.
You will find that there are overlapping concerns between this chapter and the nature of Chapter 5, where you
will establish your editorial thinking. The present chapter generally focuses more on the mechanics of
familiarisation with the characteristics and qualities of your data; the next chapter will build on this to shape
what you will actually do with it.
As you might expect, the activities covered in this chapter are associated with the assistance of relevant tools
and technology. However, the focus for the book will remain concentrated on identifying which tasks you
have to undertake and look less at exactly how you will undertake these. There will be tool-specific references
in the curated collection of resources that are published in the digital companion.
2.1 Data Literacy: Love, Fear and Loathing
I frequently come across people in the field who declare their love for data. I don’t love data. For me it would
be like claiming ‘I love food’ when, realistically, that would be misleading. I like sprouts but hate carrots. And
don’t get me started on mushrooms.
At the very start of the book, I mentioned that data might occasionally prove to be a villain in your quest for
developing confidence with data visualisation. If data were an animal it would almost certainly be a cat: it has a
capacity to earn and merit love but it demands a lot of attention and always seems to be conspiring against
you.
I love data that gives me something interesting to do analysis-wise and then, subsequently, also visually.
Sometimes that just does not happen.
I love data that is neatly structured, clean and complete. This rarely exists. Location data will have inconsistent
place-name spellings, there will be dates that have a mixture of US and UK formats, and aggregated data that
does not let me get to the underlying components.
You don’t need to love data but, equally, you shouldn’t fear data. You should simply respect it by appreciating
that it will potentially need lots of care and attention and a shift in your thinking about its role in the creative
process. Just look to develop a rapport with it, embracing its role as the absolutely critical raw material of this
process, and learn how to nurture its potential.
For some of you reading this book, you might have interest in data but possibly not much knowledge of the
specific activities involving data as you work on a visualisation design solution. An assumed prerequisite for
anyone working in data visualisation is an appreciation of data and statistical literacy. However, this is not
always the case. One of the biggest causes of failure in data visualisations – especially in relation to the principle
I introduced about ‘trustworthy design’ – comes from a poor understanding of these numerate literacies. This
can be overcome, though.
‘When I first started learning about visualisation, I naively assumed that datasets arrived at your doorstep
ready to roll. Begrudgingly I accepted that before you can plot or graph anything, you have to find the data,
understand it, evaluate it, clean it, and perhaps restructure it.' Marcia Gray, Graphic Designer
I discussed in the Introduction the different entry points from which people doing data visualisation work
come. Typically – but absolutely not universally – those who join from the more creative backgrounds of
graphic design and development might not be expected to have developed the same level of data and statistical
knowledge as somebody from the more numerate disciplines. If you are part of this creative cohort and can
identify with this generalisation, then this chapter will ease you through the learning process (and in doing so
hopefully dispel any myth that it is especially complicated).
Conversely, many others may think they do not know enough about data but in reality they already do ‘get’ it
– they just need to learn more about its role in visualisation and possibly realign their understanding of some
of the terminology. Therefore, before delving further into this chapter’s tasks, there are a few ‘defining’ matters
I need to address to cover the basics in both data and statistical literacy.
Data Assets and Tabulation Types
Firstly, let’s consider some of the fundamentals about what a dataset is as well as what shape and form it comes
in.
When working on a visualisation I generally find there are two main categories of data 'assets': data that exists in
tables, known as datasets; and data that exists as isolated values.
For the purpose of this book I describe this type of data as being raw because it has not yet been statistically
or mathematically manipulated and it has not been modified in any other way from its original state.
Tabulated datasets are what we are mainly interested in at this point. Data as isolated values refers to data that
exists as individual facts and statistical figures. These do not necessarily belong in, nor are they normally
collected in, a table. They are just potentially useful values that are dispersed around the Web or across reports:
individual facts or figures that you might come across during your data gathering or research stages. Later on in
your work you might use these to inform calculations (e.g. applying a currency conversion) or to incorporate a
fact into a title or caption (e.g. 78% of staff participated in the survey), but they are not your main focus for
now.
Tabulated data is unquestionably the most common form of data asset that you will work with, but it too can
exist in slightly different shapes and sizes. A primary difference lies between what can be termed normalised
datasets (Figure 4.1) and cross-tabulated datasets (Figure 4.2).
A normalised dataset might loosely be described as looking like lists of data values. In spreadsheet parlance, you
would see this as a series of columns and rows of data, while in database parlance it is the arrangement of fields
and records. This form of tabulated data is generally the most detailed form of data available for you to work
with. The table in Figure 4.1 is an example of normalised data where the columns of variables provide
different descriptive values for each movie (or record) held in the table.
Figure 4.1 Example of a Normalised Dataset
Cross-tabulated data is presented in a reconfigured form where, instead of displaying raw data values, the table
of cells contain the results of statistical operations (like summed totals, maximums, averages). These values are
aggregated calculations formed from the relationship between two variables held in the normalised form of the
data. In Figure 4.2, you will see the cross-tabulated result of the normalised table of movie data, now showing
a statistical summary for each movie category. The statistic under ‘Max Critic Rating’ is formed from an
aggregating calculation based on the ‘Critic Rating’ and ‘Category’ variables seen in Figure 4.1.
Figure 4.2 Example of a Cross-tabulated Dataset
Typically, if you receive data in an already cross-tabulated form, you do not have access to the original data.
This means you will not be able to ‘reverse-engineer’ it back into its raw form, which, in turn, means you have
reduced the scope of your potential analysis. In contrast, normalised data gives you complete freedom to
explore, manipulate and aggregate across multiple dimensions. You may choose to convert the data into 'cross-tabulated' form but that is merely an option that comes with the luxury of having access to the detailed form
of your data. In summary, it is always preferable, where possible, to work with normalised data.
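The relationship between the two shapes can be sketched in a few lines of code. This is a minimal illustration using hypothetical movie records (the names and values are invented, not taken from Figures 4.1 and 4.2): each row of the normalised form is aggregated per category to produce the cross-tabulated form, and the per-movie detail is lost in the result.

```python
from collections import defaultdict

# A small normalised dataset: one row (record) per movie, with
# hypothetical values echoing the structure of Figure 4.1.
movies = [
    {"title": "Movie A", "category": "Drama",  "critic_rating": 7.8},
    {"title": "Movie B", "category": "Drama",  "critic_rating": 8.4},
    {"title": "Movie C", "category": "Comedy", "critic_rating": 6.1},
    {"title": "Movie D", "category": "Comedy", "critic_rating": 7.2},
]

# Cross-tabulating: aggregate the normalised rows per category, as in
# Figure 4.2. You cannot reverse-engineer the individual movies from it.
ratings_by_category = defaultdict(list)
for movie in movies:
    ratings_by_category[movie["category"]].append(movie["critic_rating"])

cross_tab = {
    category: {"count": len(ratings), "max_critic_rating": max(ratings)}
    for category, ratings in ratings_by_category.items()
}
print(cross_tab)
```

Note that the aggregation is a one-way street: given only `cross_tab`, the original four records cannot be recovered, which is exactly the loss of analytical scope described above.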
Data Types
One of the key parts of the design process concerns understanding the different types of data (sometimes
known as levels of data or scales of measurement). Defining the types of data will have a huge influence on
so many aspects of this workflow, such as determining:
the type of exploratory data analysis you can undertake;
the editorial thinking you establish;
the specific chart types you might use;
the colour choices and layout decisions around composition.
In the simplest sense, data types are distinguished by being either qualitative or quantitative in nature. Beneath
this distinction there are several further separations that need to be understood. The most useful taxonomy I
have found to describe these different types of data is based on an approach devised by the psychologist
researcher Stanley Stevens. He developed the acronym NOIR as a mnemonic device to cover the different
types of data you may come to work with, particularly in social research: Nominal, Ordinal, Interval, and
Ratio. I have extended this, adding onto the front a ‘T’ – for Textual – which, admittedly, somewhat
undermines the grace of the original acronym but better reflects the experiences of handling data today. It is
important to describe, define and compare these different types of data.
Textual (Qualitative)
Textual data is qualitative data and generally exists as unstructured streams of words. Examples of textual data
might include:
‘Any other comments?’ data submitted in a survey.
Descriptive details of a weather forecast for a given city.
The full title of an academic research project.
The description of a product on Amazon.
The URL of an image of Usain Bolt’s victory in the 100m at the 2012 Olympics.
Figure 4.3 Graphic Language: The Curse of the CEO
In its native form, textual data is likely to offer rich potential but it can prove quite demanding to unlock this.
To work with textual data in an analysis and visualisation context will generally require certain natural language
processing techniques to derive or extract classifications, sentiments, quantitative properties and relational
characteristics.
An example of how you can use textual data is seen in the graphic of CEO swear word usage shown in Figure
4.3. This analysis provides a breakdown of the profanities used by CEOs from a review of recorded conference
calls over a period of 10 years. This work shows the two ways of utilising textual data in visualisation. Firstly,
you can derive categorical classifications and quantitative measurements to count the use of certain words
compared to others and track their usage over time. Secondly, the original form of the textual data can be of
direct value for annotation purposes, without the need for any analytical treatment, to include as captions.
Working with textual data will always involve a judgement of reward vs effort: how much effort will I need to
expend in order to extract usable, valuable content from the text? There is an increasing array of tools and
algorithmic techniques to help with this transformational approach but whether you conduct it manually or
with some degree of automation it can be quite a significant undertaking. However, the value of the insights
you are able to extract may entirely justify the commitment. As ever, your judgment of the aims of your work,
the nature of your subject and the interests of your audience will influence your decision.
Nominal (Qualitative)
Nominal data is the next form of qualitative data in the list of distinct data types. This type of data exists in
categorical form, offering a means of distinguishing, labelling and organising values. Examples of nominal data
might include:
The ‘gender’ selected by a survey participant.
The regional identifier (location name) shown in a weather forecast.
The university department of an academic member of staff.
The language of a book on Amazon.
An athletic event at the Olympics.
Often a dataset will hold multiple nominal variables, maybe offering different organising and naming
perspectives, for example the gender, eye colour and hair colour of a class of school kids.
Additionally, there might be a hierarchical relationship existing between two or more nominal variables,
representing major and sub-categorical values: for example, a major category holding details of ‘Country’ and a
sub-category holding ‘Airport’; or a major category holding details of ‘Industry’ and a sub-category holding
details of ‘Company Names’. Recognising this type of relationship will become important when considering
the options for which angles of analysis you might decide to focus on and how you may portray them visually
using certain chart types.
Nominal data does not necessarily mean text-based data; nominal values can be numeric. For example, a
student ID number is a categorical device used uniquely to identify all students. The shirt number of a
footballer is a way of helping teammates, spectators and officials to recognise each player. It is important to be
aware of occasions when any categorical values are shown as numbers in your data, especially in order to
understand that these cannot have (meaningful) arithmetic operations applied to them. You might find logic
statements like TRUE or FALSE stated as a 1 and a 0, or data captured about gender may exist as a 1 (male), 2
(female) and 3 (other), but these numeric values should not be considered quantitative values – adding ‘1’ to
‘2’ does not equal ‘3’ (other) for gender.
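A small sketch of the point, using the hypothetical 1/2/3 gender coding mentioned above: decode nominal codes to their category labels and count them, rather than performing arithmetic on the codes themselves.

```python
from collections import Counter

# Numeric codes that are really nominal labels (the hypothetical coding
# scheme from the text). Arithmetic on the codes is meaningless, so
# decode them to category names before any analysis.
GENDER_CODES = {1: "male", 2: "female", 3: "other"}

responses = [1, 2, 2, 3, 1]
decoded = [GENDER_CODES[code] for code in responses]

# Counting categories is a meaningful operation on nominal data...
print(Counter(decoded))
# ...whereas sum(responses) == 9 says nothing about gender at all.
```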
Ordinal (Qualitative)
Ordinal data is still categorical and qualitative in nature but, instead of there being an arbitrary relationship
between the categorical values, there are now characteristics of order. Examples of ordinal data might include:
The response to a survey question: based on a scale of 1 (unhappy) to 5 (very happy).
The general weather forecast: expressed as Very Hot, Hot, Mild, Cold, Freezing.
The academic rank of a member of staff.
The delivery options for an Amazon order: Express, Next Day, Super Saver.
The medal category for an athletic event: Gold, Silver, Bronze.
Whereas nominal data is a categorical device to help distinguish values, ordinal data is also a means of
classifying values, usually in some kind of ranking. The hierarchical order of some ordinal values goes through
a single ascending/descending rank from high or good values to low or bad values. Other ordinal values have a
natural ‘pivot’ where the direction changes around a recognisable mid-point, such as the happiness scale which
might pivot about ‘no feeling’ or weather forecast data that pivots about ‘Mild’. Awareness of these different
approaches to ‘order’ will become relevant when you reach the design stages involving the classifying of data
through colour scales.
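A quick sketch of why the order matters in practice: default (alphabetical) sorting would scramble the weather-forecast scale above, so the ranking has to be stated explicitly. The labels follow the example scale in the text.

```python
# Ordinal categories carry an order that alphabetical sorting destroys,
# so encode the rank of each label explicitly.
FORECAST_ORDER = ["Freezing", "Cold", "Mild", "Hot", "Very Hot"]
rank = {label: position for position, label in enumerate(FORECAST_ORDER)}

observed = ["Hot", "Freezing", "Mild", "Very Hot", "Cold"]

# Sorting by rank restores the meaningful ordinal sequence;
# sorted(observed) alone would put "Cold" before "Freezing".
print(sorted(observed, key=rank.__getitem__))
```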
Interval (Quantitative)
Interval data is the less common form of quantitative data, but it is still important to be aware of and to
understand its unique characteristics. An interval variable is a quantitative and numeric measurement defined by
difference on a scale but not by relative scale. This means the difference between two values is meaningful but
an arithmetic operation such as multiplication is not.
The most common example is the measure for temperature in a weather forecast, presented in units of Celsius.
The absolute difference between 15°C and 20°C is the same difference as between 5°C and 10°C. However,
the relative difference between 5°C and 10°C is not the same as the difference between 10°C and 20°C (where
in both cases you multiply by two or increase by 100%). This is because a zero value is arbitrary and often
means very little or indeed is impossible. A temperature reading of 0°C does not mean there is no temperature,
it is a quantitative scale for measuring relative temperature. You cannot have a shoe size or Body Mass Index of
zero.
Ratio (Quantitative)
Ratio data is the most common quantitative variable you are likely to come across. It comprises numeric
measurements that have properties of difference and scale. Examples of ratio data might include:
The age of a survey participant in years.
The forecasted amount of rainfall in millimetres.
The estimated budget for a research grant proposal in GBP (£).
The number of sales of a book on Amazon.
The distance of the winning long jump at the 2012 Olympics in metres.
Unlike interval data, for ratio data variables zero means something. The absolute difference in age between a 10
and 20 year old is the same as the difference between a 40 and 50 year old. The relative difference between a 10
and a 20 year old is the same as the difference between a 40 and an 80 year old (‘twice as old’).
Whereas most of the quantitative measurements you will deal with are based on a linear scale, there are
exceptions. Variables about the strength of sound (decibels) and magnitude of earthquakes (Richter) are
actually based on a logarithmic scale. An earthquake with a magnitude of 4.0 on the Richter scale is 1000 times
stronger based on the amount of energy released than an earthquake of magnitude 2.0. Some consider these as
types of data that are different from ratio variables. Most still define them as ratio variables but separate them
as non-linear scaled variables.
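The 1000-times comparison can be reproduced with a one-line calculation. The usual approximation is that radiated seismic energy grows as 10^(1.5 × magnitude), which is the assumption baked into this sketch:

```python
# Richter magnitudes are logarithmic: radiated energy grows roughly as
# 10 ** (1.5 * magnitude), so a gap of 2.0 in magnitude corresponds to
# a factor of 10 ** 3 = 1000 in energy, as stated in the text.
def energy_ratio(magnitude_a, magnitude_b):
    return 10 ** (1.5 * (magnitude_a - magnitude_b))

print(energy_ratio(4.0, 2.0))  # 1000.0
```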
If temperature values were measured in kelvin, where there is an absolute zero, this would be considered a
ratio scale, not an interval one.
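A tiny sketch of the interval-versus-ratio distinction in code: dividing Celsius values suggests a "doubling" that is not physically real, while the same division after converting to kelvin (which has a true zero) is meaningful.

```python
# Celsius is an interval scale: differences are meaningful, ratios are not.
# Kelvin is a ratio scale, because 0 K is an absolute zero.
def celsius_to_kelvin(celsius):
    return celsius + 273.15

# 2.0 — but 20°C is not "twice as hot" as 10°C; the zero is arbitrary.
print(20 / 10)

# ≈ 1.035 — the physically meaningful ratio between the two temperatures.
print(celsius_to_kelvin(20) / celsius_to_kelvin(10))
```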
Temporal Data
Time-based data is worth mentioning separately because it can be a frustrating type of data to deal with,
especially in attempting to define its place within the TNOIR classification. The reason for this is that
different components of time can be positioned against almost all data types, depending simply on what form
your time data takes:
Textual: ‘Four o’clock in the afternoon on Monday, 12 March 2016’
Ordinal: ‘PM’, ‘Afternoon’, ‘March’, ‘Q1’
Interval: ‘12’, ‘12/03/2016’, ‘2016’
Ratio: ‘16:00’
Note that time-based data is a separate concern from duration data, which, while often formatted in structures
such as hh:mm:ss, should be seen as a ratio measure. To work with duration data it is often useful to
transform it into single units of time, such as total seconds or minutes.
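That transformation is straightforward; here is a minimal sketch for hh:mm:ss strings (the helper name is my own invention):

```python
# Durations formatted as hh:mm:ss are ratio data; transforming them into
# a single unit (total seconds) makes them straightforward to analyse.
def duration_to_seconds(hhmmss):
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

print(duration_to_seconds("01:30:15"))  # 5415
```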
Discrete vs Continuous
Another important distinction to make about your data, and something that cuts across the TNOIR
classification, is whether the data is discrete or continuous. This distinction is influential in how you might
analyse it statistically and visually.
The relatively simple explanation is that discrete data is associated with all classifying variables that have no ‘in-between’ state. This applies to all qualitative data types and any quantitative values for which only a whole number is
possible. Examples might be:
Heads or tails for a coin toss.
Days of the week.
The size of shoes.
Numbers of seats in a theatre.
In contrast, continuous variables can hold the value of an in-between state and, in theory, could take on any
value between the natural upper and lower limits if it was possible to take measurements in fine degrees of
detail, such as:
Height and weight.
Temperature.
Time.
One of the classifications that is hard to nail down involves data that could, on the TNOIR scale, arguably fall
under both ordinal and ratio definitions based on its usage. This makes it hard to determine if it should be
considered discrete or continuous. An example would be the star system used for rating a movie or the
happiness rating. When a star rating value is originally captured, the likelihood is that the input data was
discrete in nature. However, for analysis purposes, the statistical operations applied to data that is based on
different star ratings could reasonably be treated either as discrete classifications or, feasibly, as continuous
numeric values. For both star review ratings or happiness ratings decimal averages could be calculated as a way
of formulating average score. (The median and mode would still be discrete.) The suitability of this approach
will depend on whether the absolute difference between classifying values can be considered equal.
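A short sketch of the two treatments using an invented sample of star ratings: the mean behaves as a continuous decimal value, while the median and mode remain on the discrete star scale.

```python
import statistics

# Hypothetical sample of star ratings (a discrete 1–5 scale as captured).
stars = [5, 4, 4, 3, 5, 2, 4]

# Treated as continuous: the mean produces an in-between decimal score.
print(statistics.mean(stars))    # ≈ 3.857

# Treated as discrete: the median and mode stay on the original scale.
print(statistics.median(stars))  # 4
print(statistics.mode(stars))    # 4
```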
2.2 Statistical Literacy
If the fear of data is misplaced, I can sympathise with anybody’s trepidation towards statistics. For many,
statistics can feel complicated to understand and too difficult a prospect to master. Even for those relatively
comfortable with stats, it is unquestionably a discipline that can easily become rusty without practice, which
can also undermine your confidence. Furthermore, the fear of making mistakes with delicate and rule-based
statistical calculations depresses confidence levels lower than they need to be.
The problem is that you cannot avoid the need to use some statistical techniques if you are going to work with
data. It is therefore important to better understand statistics and its role in visualisation, as you must do with
data. Perhaps you can make the problem more surmountable by packaging the whole of statistics into smaller,
manageable elements that will dispel the perception of overwhelming complexity.
I do believe that it is possible to overstate the range and level of statistical techniques most people will need to
employ on most of their visualisation tasks. The caveats are important as I know there will be people with
visualisation experience who are exposed to a tremendous amount of statistical thinking in their work, but it is
a relevant point.
It all depends, of course. From my experience, however, the majority of data visualisation challenges will
generally involve relatively straightforward univariate and multivariate statistical techniques. Univariate
techniques help you to understand the shape, size and range of quantitative values. Multivariate techniques help
you to explore the possible relationships between different combinations of variables and variable types. I will
describe some of the most relevant statistical operations associated with these techniques later in this chapter, at
the point in your thinking where they are most applicable.
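As a preview, a univariate profile of a single quantitative variable often amounts to no more than a handful of summary statistics describing its shape, size and range. This sketch uses invented values and Python's standard statistics module:

```python
import statistics

# A minimal univariate profile of one quantitative variable
# (hypothetical values, purely for illustration).
values = [12, 15, 11, 19, 24, 15, 13, 30, 17, 14]

profile = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}
print(profile)
```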
As you get more advanced in your work (and your confidence increases) you might have occasion to employ
inference techniques. These include concepts such as data modelling and the use of regression analysis:
attempting to measure the relationships between variables to explore correlations and (the holy grail)
causations. Many of you will likely experience visualisation challenges that require an understanding of
probabilities, testing hypotheses and becoming acquainted with terms like confidence intervals. You might use
these techniques to assist with forecasting or modelling risk and uncertainty. Above and beyond that, you are
moving towards more advanced statistical modelling and algorithm design.
It is somewhat dissatisfactory to allocate only a small part of this text to discussing the role of descriptive and
exploratory statistics. However, for the scope of this book, and seeking to achieve a pragmatic balance, the
most sensible compromise is just to flag up which statistical activities you might need to consider and where
these apply. It can take years to learn about the myriad advanced techniques that exist and it takes experience to
know when and how to deploy all the different methods.
There are hundreds of books better placed to offer the depth of detail you truly need to fulfil these activities
and there is no real need to reinvent the wheel – and indeed reinvent an inferior wheel. That statistics is just
one part of the visualisation challenge, and is in itself such a prolific field, further demonstrates the variety and
depth of this subject.
2.3 Data Acquisition
The first step in working with data naturally involves getting it. As I outlined in the contextual discussion
about the different types of trigger curiosities, you will only have data in place before now if the opportunity
presented by the data was the factor that triggered this work. You will recall this scenario was described as
pursuing a curiosity born out of ‘potential intrigue’. Otherwise, you will only be in a position to know what
data you need after having established your specific or general motivating curiosity. In these situations, once
you have sufficiently progressed your thinking around ‘formulating your brief’, you will need to switch your
thinking onto the task of acquiring your data:
What data do you need and why?
From where, how, and by whom will the data be acquired?
When can you obtain it?
What Data Do You Need?
Your primary concern is to ensure you can gather sufficient data about the subject in which you are interested
to pursue your identified curiosity. By ‘sufficient’, I mean you will need to establish some general criteria in
your mind for what data you do need and what data you do not need. There is no harm in getting more than
you need at this stage but it can result in wasted efforts, waste that you would do well to avoid.
Let’s propose you have defined your curiosity to be ‘I wonder what a map of McDonald’s restaurant openings
looks like over time?’. In this scenario you are going to try to find a source of data that will provide you with
details of all the McDonald’s restaurants that have ever opened. A shopping list of data items would probably
include the date of opening, the location details (as specific as possible) and maybe even a closing date to ensure
you can distinguish between still operating and closed-down restaurants.
You will need to conduct some research, a perpetual strand of activity that runs throughout the workflow, as I
explained earlier. In this scenario you might need first to research a bit of the history of McDonald’s restaurants
to discover, for instance, when the first one opened, how many there are, and in which countries they are
located. This will establish an initial sense of the timeframe (number of years) and scale (outlets, global spread)
of your potential data. You might also discover significant differences between what is considered a restaurant
and what is just a franchise positioned in shopping malls or transit hubs. Sensitivities around the qualifying
criteria or general counting rules of a subject are important to discover, as they will help significantly to
substantiate the integrity and accuracy of your work.
Unless you know or have been told where to find this restaurant data, you will then need to research from
where the data might be obtainable. Will this type of information be published on the Web, perhaps on the
commercial pages of McDonald’s own site? You might have to get in touch with somebody (yes, a human) in
the commercial or PR department to access some advice. Perhaps there will be some fast-food enthusiast in
some niche corner of the Web who has already gathered and made available data like this?
Suppose you locate a dataset that includes not just McDonald’s restaurants but all fast-food outlets. This
could potentially broaden the scope of your curiosity, enabling broader analysis about the growth of the fast-food industry at large to contextualise McDonald’s contribution to this. Naturally, if you have any
stakeholders involved in your project, you might need to discuss with them the merits of this wider
perspective.
Another judgement to make concerns the resolution of the data you anticipate needing. This is especially
relevant if you are working with big, heavy datasets. You might genuinely want and need all available data.
This would be considered full resolution – down to the most detailed grain (e.g. all details about all
McDonald’s restaurants, not just totals per city or country). Sometimes, in this initial gathering activity, it
may be more practical just to obtain a sample of your data. If this is the case, what will be the criteria used to
identify a sufficient sample and how will you select or exclude records? What percentage of your data will be
sufficient to be representative of the range and diversity (an important feature we will need to examine next)?
Perhaps you only need a statistical, high-level summary (total number of restaurants opened by year)?
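To make the sampling idea concrete, here is a minimal Python sketch. The restaurant records are invented for illustration (this is not real McDonald’s data), showing a simple random sample alongside the kind of high-level summary mentioned above:

```python
import random

# Hypothetical full-resolution dataset: one record per restaurant.
# The field names are illustrative, not a real restaurant data schema.
restaurants = [
    {"id": i, "country": country, "opened": 1970 + (i % 50)}
    for i, country in enumerate(["US", "UK", "JP", "DE", "FR"] * 40)
]

random.seed(42)  # fix the seed so the sample is reproducible

# Simple random sample: 10% of records, selected without replacement.
sample_size = max(1, len(restaurants) // 10)
sample = random.sample(restaurants, sample_size)

# A high-level statistical summary instead of full resolution:
# total restaurants opened per year.
opened_per_year = {}
for r in restaurants:
    opened_per_year[r["opened"]] = opened_per_year.get(r["opened"], 0) + 1

print(len(sample))                                        # 20
print(sum(opened_per_year.values()) == len(restaurants))  # True
```

Whether 10% is a representative sample depends entirely on the range and diversity of the underlying data, which is exactly the judgement discussed above.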
The chances are that you will not truly know what data you want or need until you at least get something to
start with and learn from there. You might have to revisit or repeat the gathering of your data, so an attitude of
‘what I have is good enough to start with’ is often sensible.
From Where, How and By Whom Will the Data Be Acquired?
There are several different origins and methods involved in acquiring data, depending on whether it will
involve your doing the heavy work to curate the data or if this will be the main responsibility of others.
Curated by You
This group of data-gathering tasks or methods is characterised by your having to do most of the work to bring
the data together into a convenient digital form.
Primary data collection: If the data you need does not exist or you need to have full control over its
provenance and collection, you will have to consider embarking on gathering ‘primary’ data. In contrast to
secondary data, primary data involves you measuring and collecting the raw data yourself. Typically, this relates
to situations where you gather quite small, bespoke datasets about phenomena that are specific to your needs.
It might be a research experiment you have designed and launched for participants to submit responses. You
may manually record data from other measurement devices, such as your daily weight as measured by your
bathroom scales, or the number of times you interacted face-to-face with friends and family. Some people take
daily photographs of themselves, their family members or their gardens, in order to stitch these back together
eventually to portray stories of change. This data-gathering activity can be expensive in terms of both the time
and cost. The benefit however is that you have carefully controlled the collection of the data to optimise its
value for your needs.
Manual collection and data foraging: If the data you need does not exist digitally or in a convenient
singular location, you will need to forage for it. This again might typically relate to situations where you are
sourcing relatively small datasets. An example might be researching historical data from archived newspapers
that were only published in print form and not available digitally. You might look to pull data from multiple
sources to create a single dataset: for example, if you were comparing the attributes of a range of different cars
and weighing up which to buy. To achieve this you would probably need to source different parts of the data
you need from several different places. Often, data foraging is something you undertake in order to finish off
data collected by other means that might have a few missing values. It is sometimes more efficient to find the
remaining data items yourself by hand to complete the dataset. This can be somewhat time-consuming
depending on the extent of the manual gathering required, but it does provide you with greater assurance over
the final condition of the data you have collected.
Extracted from pdf files: A special subset of data foraging – or a variation at least – involves those
occasions when your data is digital but essentially locked away in a pdf file. For many years now reports
containing valuable data have been published on the Web in pdf form. Increasingly, movements like ‘open
data’ are helping to shift the attitudes of organisations towards providing additional, fully accessible digital
versions of data. Progress is being made but it will take time before all industries and government bodies adopt
this as a common standard. In the meantime, there are several tools on the market (free and proprietary) that
will assist you in extracting tables of data from pdf files and converting these to more usable Excel or CSV
formats.
Some data acquisition tasks may be repetitive and, should you possess the skills and have access to the
necessary resources, there will be scope for exploring ways to automate these. However, you always have to
consider the respective effort and ongoing worth of your approach. If you do go to the trouble of authoring
an automation routine (of any description) you could end up spending more time on that than you would
otherwise collecting by more manual methods. If it is going to be a regular piece of analysis the efficiency
gains from your automation will unquestionably prove valuable going forward, but, for any one-off projects, it
may not ultimately be worth it.
Web scraping (also known as web harvesting): This involves using special tools or programs to
extract structured and unstructured items of data published in web pages and convert these into tabulated form
for analysis. For example, you may wish to extract several years’ worth of test cricket results from a sports
website. Depending on the tools used, you can often set routines in motion to extract data across multiple
pages of a site based on the connected links that exist within it. This is known as web crawling. Using the same
example (let’s imagine), you could further your gathering of test cricket data by programmatically fetching data
back from the associated links pointing to the team line-ups. An important consideration to bear in mind with
any web scraping or crawling activity concerns rules of access and the legalities of extracting the data held on
certain sites. Always check – and respect – the terms of use before undertaking this.
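As an illustration of the tabulation step only (not of fetching, which should always follow a check of the site’s terms of use), the following Python sketch parses a small embedded HTML fragment – standing in for a downloaded results page – into rows, using nothing beyond the standard library:

```python
from html.parser import HTMLParser

# Illustrative HTML standing in for a fetched sports-results page.
# In a real scrape you would first download the page (e.g. with
# urllib.request), and only after checking the site's terms of use.
HTML = """
<table id="results">
  <tr><th>Match</th><th>Result</th></tr>
  <tr><td>England v Australia</td><td>Draw</td></tr>
  <tr><td>India v England</td><td>India won</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

parser = TableExtractor()
parser.feed(HTML)
header, *records = parser.rows
print(header)   # ['Match', 'Result']
print(records)  # [['England v Australia', 'Draw'], ['India v England', 'India won']]
```

Dedicated scraping libraries offer far more robust parsing than this sketch, but the principle is the same: unstructured page content in, tabulated records out.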
Curated by Others
In contrast to the list of methods I have profiled, this next set of data-gathering approaches is characterised by
other people having done most of the work to source and compile the data. They will make it available for
you to access in different ways without needing the extent of manual efforts often required with the methods
presented already. You might occasionally still have to intervene by hand to fine-tune your data, but others
would generally have put in the core effort.
Issued to you: On the occasions when you are commissioned by a stakeholder (client, colleague) you will
often be provided with the data you need (and probably much more besides), most commonly in a spreadsheet
format. The main task for you is therefore less about collection and more about familiarisation with the
contents of the data file(s) you are set to work with.
Download from the Web: Earlier I bemoaned the fact that there are still organisations publishing data
(through, for example, annual reports) in pdf form. To be fair, increasingly there are facilities being developed
that enable interested users to extract data in a more structured form. More sophisticated reporting interfaces
may offer users the opportunity to construct detailed queries to extract and download data that is highly
customised to their needs.
System report or export: This relates more to an internal context in organisations where there are
opportunities to extract data from corporate systems and databases. You might, for example, wish to conduct
some analysis about staff costs and so the personnel database may be where you can access the data about the
workforce and their salaries.
‘Don’t underestimate the importance of domain expertise. At the Office for National Statistics (ONS), I was
lucky in that I was very often working with the people who created the data – obviously, not everyone will
have that luxury. But most credible data producers will now produce something to accompany the data they
publish and help users interpret it – make sure you read it, as it will often include key findings as well as
notes on reliability and limitations of the data.’ Alan Smith OBE, Data Visualisation Editor, Financial Times
Third-party services: There is an ever-increasing marketplace for data and many commercial services out
there now offer extensive sources of curated and customised data that would otherwise be impossible to obtain
or very complex to gather. Such requests might include very large, customised extracts from social media
platforms like Twitter based on specific keywords and geo-locations.
API: An API (Application Programming Interface) offers the means to create applications that
programmatically access streams of data from sites or services, such as accessing a live feed from Transport for
London (TfL) to track the current status of trains on the London Underground system.
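As a sketch of working with such a feed, the following Python parses a sample JSON payload shaped like a line-status response. The field names here are invented for illustration and are not the real TfL API schema; consult the actual API documentation for its endpoints and response format:

```python
import json

# A sample payload *shaped like* a line-status feed. The field names
# are invented for illustration, not taken from the real TfL API.
payload = json.loads("""
[
  {"line": "Victoria", "status": "Good Service"},
  {"line": "Central",  "status": "Minor Delays"},
  {"line": "Jubilee",  "status": "Good Service"}
]
""")

# In a live setting you would fetch this over HTTP instead, e.g.:
#   with urllib.request.urlopen(url) as resp:
#       payload = json.load(resp)

disrupted = [item["line"] for item in payload
             if item["status"] != "Good Service"]
print(disrupted)  # ['Central']
```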
When Can the Data Be Acquired?
The issue of when data is ready and available for acquisition is a delicate one. If you are conducting analysis of
some survey results, naturally you will not have the full dataset of responses to work with until the survey is
closed. However, you could reasonably begin some of your analysis work early by using an initial sample of
what had been submitted so far. Ideally you will always work with data that is as complete as possible, but on
occasions it may be advantageous to take the opportunity to get an early sense of the nature of the submitted
responses in order to begin preparing your final analysis routines. Working on any dataset that may not yet be
complete is a risk. You do not want to progress too far ahead with your visualisation workflow if there is the
real prospect that any further data that emerges could offer new insights or even trigger different, more
interesting curiosities.
2.4 Data Examination
After acquiring your data your next step is to thoroughly examine it. As I have remarked, your data is your key
raw material from which the eventual visualisation output will be formed. Before you choose what meal to
cook, you need to know what ingredients you have and what you need to do to prepare them.
It may be that, in the act of acquiring the data, you have already achieved a certain degree of familiarity about
its status, characteristics and qualities, especially if you curated the data yourself. However, there is a definite
need to go much further than you have likely achieved before now. To do this you need to conduct an
examination of the physical properties and the meaning of your data.
As you progress through the stages of this workflow, your data will likely change considerably: you will bring
more of it in, you will remove some of it, and you will refine it to suit your needs. All these modifications will
alter the physical makeup of your data so you will need to keep revisiting this step to preserve your critical
familiarity.
Data Properties
The first part of familiarising yourself with your data is to undertake an examination of its physical
properties. Specifically you need to ascertain its type, size and condition. This task is quite mechanical in many
ways because you are in effect just ‘looking’ at the data, establishing its surface characteristics through visual
and/or statistical observations.
What To Look For?
The type and size of your data involve assessing the characteristics and amount of data you have to work with.
As you examine the data you also need to determine its condition: how good is its quality and is it fit for
purpose?
Data types: Firstly, you need to identify what data types you have. In gathering this data in the first
place you might already have a solid appreciation about what you have before you, but doing this
thoroughly helps to establish the attention to detail you will need to demonstrate throughout this stage.
Here you will need to refer to the definitions from earlier in the chapter about the different types of data
(TNOIR). Specifically you are looking to define each column or field of data based on whether it is
qualitative (text, nominal, ordinal) or quantitative (interval, ratio) and whether it is discrete or
continuous in nature.
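A first pass at this classification can even be automated. The sketch below (with illustrative field names; note that the interval/ratio and nominal/ordinal distinctions still need human judgement) flags each column as qualitative or quantitative:

```python
def infer_column_type(values):
    """Crude type inference for one column: quantitative if every
    non-empty value parses as a number, otherwise qualitative.
    Whether a quantitative column is interval or ratio, or a
    qualitative one nominal or ordinal, still needs human judgement."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "unknown"
    try:
        [float(v) for v in non_empty]
        return "quantitative"
    except (TypeError, ValueError):
        return "qualitative"

# Illustrative table: columns as lists of raw (string) cell values.
table = {
    "city":    ["Leeds", "York", "Hull"],
    "outlets": ["12", "4", "7"],
    "rating":  ["3.5", "", "4.1"],
}
types = {col: infer_column_type(vals) for col, vals in table.items()}
print(types)
# {'city': 'qualitative', 'outlets': 'quantitative', 'rating': 'quantitative'}
```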
Size: Within each column or field you next need to know what range of values exists and what are the
specific attributes/formats of the values held. For example, if you have a quantitative variable (interval or
ratio), what is the lowest and the highest value? In what number format is it presented (e.g. how many
decimal places, whether comma separated)? If it is a categorical variable (nominal or ordinal), how many
different values are held? If you have textual data, what is the maximum character length or word count?
Condition: This is the best moment to identify any data quality and completeness issues. Naturally,
unidentified and unresolved issues around data quality will come to bite hard later, undermining the
scope and, crucially, trust in the accuracy of your work. You will address these issues next in the
‘transformation’ step, but for now the focus is on identifying any problems. Things to look out for may
include the following:
Missing values, records or variables – Are empty cells assumed as being of no value (zero/nothing)
or no measurement (n/a, null)? This is a subtle but important difference.
Erroneous values – Typos and any value that clearly looks out of place (such as a gender value in
the age column).
Inconsistencies – Capitalisation, units of measurement, value formatting.
Duplicate records.
Out of date – Values that might have expired in accuracy, like someone’s age or any statistic that
would be reasonably expected to have subsequently changed.
Uncommon system characters or line breaks.
Leading or trailing spaces – the invisible evil!
Date issues around format (dd/mm/yy or mm/dd/yy) and basis (systems like Excel base their dates on
daily counts since 1 January 1900, but not all systems do).
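A few of these checks can be expressed directly in code. The following Python sketch, on invented records, flags missing values, padded text and duplicates, and converts an Excel-style serial date (using the conventional 30 December 1899 base, which accounts for Excel’s phantom 29 February 1900):

```python
from datetime import date, timedelta

# Invented records carrying typical condition problems.
records = [
    {"name": " Alice ", "age": "34", "joined": 43466},  # leading/trailing spaces
    {"name": "Bob",     "age": "",   "joined": 43500},  # missing age
    {"name": "Bob",     "age": "41", "joined": 43500},  # possible duplicate
]

# Missing values: is "" a zero, or 'not measured'? Flag it for follow-up.
missing_age = [r["name"].strip() for r in records if r["age"] == ""]

# Leading or trailing spaces -- the invisible evil.
padded = [r["name"] for r in records if r["name"] != r["name"].strip()]

# Duplicates (here defined as the same name and joined date).
seen, dupes = set(), []
for r in records:
    key = (r["name"].strip(), r["joined"])
    if key in seen:
        dupes.append(key)
    seen.add(key)

# Excel-style serial dates count days from an epoch; for Windows Excel
# the effective base is 30 December 1899 (its day 1 is 1 January 1900,
# offset by the phantom leap day it counts in February 1900).
def from_excel_serial(n):
    return date(1899, 12, 30) + timedelta(days=n)

print(missing_age)               # ['Bob']
print(padded)                    # [' Alice ']
print(dupes)                     # [('Bob', 43500)]
print(from_excel_serial(43466))  # 2019-01-01
```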
How to Approach This?
I explained in the earlier ‘Data literacy’ section the difference in asset types (data that exists in tables and data
that exists as isolated values) and also the difference in form (normalised data or cross-tabulated). Depending
on the asset and form of data, your examination of data types may involve slightly different approaches, but
the general task is the same. Performing this examination process will vary, though, based on the tools you are
using. The simplest approach, relevant to most, is to describe the task as you would undertake it using Excel,
given that this continues to be the common tool most people use or have the skills to use. Also, it is likely that
most visualisation tasks you undertake will involve data of a size that can be comfortably handled in Excel.
‘Data inspires me. I always open the data in its native format and look at the raw data just to get the lay of
the land. It’s much like looking at a map to begin a journey.’ Kim R ees, C o-fou nd er, Periscop ic
As you go through this task, it is good practice to note down a detailed overview of what data you have,
perhaps in the form of a table of data descriptions. This is not as technical a duty as would be associated with
the creation of a data dictionary but its role and value are similar, offering a convenient means to capture all the
descriptive properties of your various data assets.
Inspect and scan: Your first task is just to scan your table of data visually. Navigate around it using
the mouse/trackpad, use the arrow keys to move up or down and left or right, and just look at all the
data. Gain a sense of its overall dimension. How many columns and how many rows does it occupy?
How big a prospect might working with this be?
Data operations: Inspecting your data more closely might require the use of interrogation features
such as sorting columns and doing basic filters. This can be a quick and simple way to acquaint yourself
with the type of data and range of values.
Going further, once again depending on the technology (and assuming you have normalised data to start
with), you might apply a cross-tabulation or pivot table to create aggregated, summary views of
different angles and combinations of your data. This can be a useful approach to also check out the
unique range of values that exist under different categories as well as helping to establish how
sub-categories may relate to other categories hierarchically. This type of inspection will be furthered in the next
step of the ‘working with data’ process when you will undertake deeper visual interrogations of the type,
size and condition of your data.
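Outside of Excel, the same cross-tabulation idea can be sketched in a few lines of Python; the records below are invented for illustration:

```python
from collections import defaultdict

# Illustrative normalised records: one row per observation.
rows = [
    {"department": "Sales",     "month": "Jan", "value": 10},
    {"department": "Sales",     "month": "Feb", "value": 12},
    {"department": "Marketing", "month": "Jan", "value": 7},
    {"department": "Marketing", "month": "Feb", "value": 9},
]

# Cross-tabulate: departments as rows, months as columns, summed values.
pivot = defaultdict(dict)
for r in rows:
    cell = pivot[r["department"]]
    cell[r["month"]] = cell.get(r["month"], 0) + r["value"]

print(dict(pivot))
# {'Sales': {'Jan': 10, 'Feb': 12}, 'Marketing': {'Jan': 7, 'Feb': 9}}
```

The same aggregated, summary view a pivot table gives you: unique category values surface as the keys, and sub-category relationships as the nesting.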
If you have multiple tables, you will need to repeat this approach for each one as well as determine how
they are related collectively and on what basis. It could be that just considering one table as the standard
template, representative of each instance, is sufficient: for example, if each subsequent table is just a
different monthly view of the same activity.
For so-called ‘Big Data’ (see the glossary definition earlier), it is less likely that you can conduct this
examination work through relatively quick, visual observations using Excel. Instead it will need tools
based around statistical programming languages that will describe for you what is there rather than let you look at what
is there.
Statistical methods: The role of statistics in this examination stage generally involves relatively basic
quantitative analysis methods to help describe and understand the characteristics of each data variable.
The common term applied to this type of statistical approach is univariate, because it involves just
looking at one variable at a time (the best opportunity to perform the analysis of multiple variables
comes later). Here are some different types of statistical analyses you might find useful at this stage.
These are not the only methods you will ever need to use, but will likely prove to be among the most
common:
Frequency counts: applied to categorical values to understand the frequency of different
instances.
Frequency distribution: applied to quantitative values to learn about the type and shape of the
distribution of values.
Measurements of central tendency describe the summary attributes of a group of quantitative
values, including:
the mean (the average value);
the median (the middle value if all quantities were arranged from smallest to largest);
the mode (the most common value).
Measurements of spread are used to describe the dispersion of values above and below the mean:
Maximum, minimum and range: the highest value, the lowest value, and the magnitude of spread
between them.
Percentiles: the value below which x% of values fall (e.g. the 20th percentile is the value
below which 20% of all quantitative values fall).
Standard deviation: a calculated measure used to determine how spread out a series of
quantitative values are.
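All of these univariate measures are available in Python’s standard library. A brief sketch, on invented values:

```python
import statistics
from collections import Counter

# Frequency count on a categorical variable.
departments = ["Sales", "Sales", "Marketing", "HR", "Sales"]
freq = Counter(departments)
most_common = statistics.mode(departments)  # 'Sales'

# Univariate summaries of a single quantitative variable.
values = [4, 8, 15, 16, 23, 42]
summary = {
    "mean":   statistics.mean(values),
    "median": statistics.median(values),
    "min":    min(values),
    "max":    max(values),
    "range":  max(values) - min(values),
    "stdev":  statistics.stdev(values),  # sample standard deviation
    # quantiles(n=100) returns 99 cut points; index 19 is the 20th percentile
    "p20":    statistics.quantiles(values, n=100)[19],
}
print(freq["Sales"])      # 3
print(summary["mean"])    # 18
print(summary["median"])  # 15.5
```

For the frequency distribution of a quantitative variable you would typically bin the values and plot them, which is where the visual interrogation of the next step takes over.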
Data Meaning
Irrespective of whether you or others have curated the data, you need to be discerning about how much trust
you place in it, at least to begin with. As discussed in the ‘trustworthy design’ principle, there are provenance
issues, inaccuracies and biases that will affect its status on the journey from being created to being acquired.
These are matters you need to be concerned with in order to resolve or at least compensate for potential
shortcomings.
Knowing more about the physical properties of your data does not yet achieve full familiarity with its content
nor give you sufficient acquaintance with its qualities. You will have examined the data in a largely mechanical
and probably quite detached way from the underlying subject matter. You now need to think a little deeper
about its meaning, specifically what it does – and does not – truly represent.
‘A visualization is always a model (authored), never a mould (replica), of the real. That’s a huge
responsibility.’ Paolo Ciuccarelli, Scientific Director of DensityDesign Research Lab at Politecnico di Milano
What Phenomenon?
Determining the meaning of your data requires that you recognise this is more than just a bunch of numbers
and text values held in the cells of a table. Ask yourself, ‘What is it about? What activity, entity, instance or
phenomenon does it represent?’.
One of the most valuable pieces of advice I have seen regarding this task came from Kim Rees, co-founder of
Periscopic. Kim describes the process of taking one single row of data and using that as an entry point to learn
carefully about what each value means individually and then collectively. Breaking down the separation
between values created by the table’s cells, and then sticking the pieces back together, helps you appreciate the
parts and the whole far better.
‘Absorb the data. Read it, re-read it, read it backwards and understand the lyrical and human-centred
contribution.’ Kate McLean, Smellscape Mapper and Senior Lecturer, Graphic Design
You saw the various macro- and micro-level views applied to the context of the Texas Department of
Criminal Justice executed offenders information in the previous chapter. The underlying meaning of this data
– its phenomenon – was offenders who had been judged guilty of committing heinous crimes and had faced
the ultimate consequence. The availability of textual data describing the offenders’ last statements and details
of their crimes heightened the emotive potential of this data. It was heavy stuff. However, it was still just a
collection of values detailing dates, names, locations, categories. All datasets, whether on executed offenders or
the locations of McDonald’s restaurants, share the same properties as outlined by the TNOIR data-type
mnemonic. What distinguishes them is what these values mean.
What you are developing here is a more semantic appreciation of your data to substantiate the physical
definitions. You are then taking that collective appreciation of what your data stands for to influence how you
might decide to amplify or suppress the influence of this semantic meaning. This builds on the discussion in
the last chapter about the tonal dimension, specifically the difference between figurative and non-figurative
portrayals.
A bar chart (Figure 4.4) comprising two bars, one of height 43 and the other of height 1, arguably does not
quite encapsulate the emotive significance of Barack Obama becoming the first black US president, succeeding
the 43 white presidents who served before him. Perhaps a more potent approach may be to present a
chronological display of 44 photographs of each president in order to visually contrast Mr Obama’s headshot
in the final image in the sequence with the previous 43. Essentially, the value of 43 is almost irrelevant in its
detail – it could be 25 or 55 – it is about there being ‘many’ of the same thing followed by the ‘one’ that
is ‘different’. That’s what creates the impact. (What will image number 45 bring? A further striking ‘difference’
or a return to the standard mould?)
Figure 4.4 US Presidents by Ethnicity (1789 to 2015)
Learning about the underlying phenomena of your data helps you feel its spirit more strongly than just looking
at the rather agnostic physical properties. It also helps you in knowing what potential sits inside the data – the
qualities it possesses – so you are then equipped with the best understanding of how you might want to portray it.
Likewise it prepares you for the level of responsibility and potential sensitivity you will face in curating a visual
representation of this subject matter. As you saw with the case study of the ‘Florida Gun Crimes’ graphic,
some subjects are inherently more emotive than others, so we have to demonstrate a certain amount of courage
and conviction in deciding how to undertake such challenges.
‘Find loveliness in the unlovely. That is my guiding principle. Often, topics are disturbing or difficult;
inherently ugly. But if they are illustrated elegantly there is a special sort of beauty in the truthful
communication of something. Secondly, Kirk Goldsberry stresses that data visualization should ultimately be
true to a phenomenon, rather than a technique or the format of data. This has had a huge impact on how I
think about the creative process and its results.’ John Nelson, Cartographer
Completeness
Another aspect of examining the meaning of data is to determine how representative it is. I have touched on
data quality already, but inaccuracies in conclusions about what data is saying have arguably a greater impact on
trust and are more damaging than any individual missing elements of data.
The questions you need to ask of your data are: does it represent genuine observations about a given
phenomenon or is it influenced by the collection method? Does your data reflect the entirety of a particular
phenomenon, a recognised sample, or maybe even an obstructed view caused by hidden limitations in the
availability of data about that phenomenon?
Reflecting on the published executed offenders data, there would be a certain confidence that it is representative
of the total population of executions but with a specific caveat: it is all the executed offenders under the
jurisdiction of the Texas Department of Criminal Justice since 1982. It is not the whole of the executions
conducted across the entire USA nor is it representative of all the executions that have taken place throughout
the history of Texas. Any conclusions drawn from this data must be boxed within those parameters.
The matter of judging completeness can be less about the number of records and more a question of the
integrity of the data content. This executed offenders dataset would appear to be a trusted and reliable record
of each offender but would there/could there be an incentive for the curators of this data not to capture, for
example, the last statements as they were explicitly expressed? Could they have possibly been in any way
sanitised or edited, for example? These are the types of questions you need to pose. This is not aimless
cynicism, it is about seeking assurances of quality and condition so you can be confident about what you can
legitimately present and conclude from it (as well as what you should not).
Consider a different scenario. If you are looking to assess the political mood of a nation during a televised
election debate, you might consider analysing Twitter data by looking at the sentiments for and against the
candidates involved. Although this would offer an accessible source of rich data, it would not provide an
entirely reliable view of the national mood. It could only offer algorithmically determined insights (i.e.
through the process of determining the sentiment from natural language) of the people who have a Twitter
account, are watching the debate and have chosen to tweet about it during a given timeframe.
Now, just because you might not have access to a ‘whole’ population of political opinion data does not mean
it is not legitimate to work on a sample. Sometimes samples are astutely reflective of the population. And in
truth, if samples were not viable then most of the world’s analyses would need to cease immediately.
A final point is to encourage you to probe any absence of data. Sometimes you might choose to switch the
focus away from the data you have got towards the data you have not got. If the data you have is literally as
much as you can acquire but you know the subject should have more data about it, then perhaps shine a light
on the gaps, making that your story. Maybe you will unearth a discovery about the lack of intent or will to
make the data available, which in itself may be a fascinating discovery. As transparency increases, those who are
not transparent stand out all the more.
‘This is one of the first questions we should ask about any dataset: what is missing? What can we learn from
the gaps?’ Jer Thorp, Founder of The Office for Creative Research
Any identified lack of completeness or full representativeness is not an obstacle to progress, it just means you
need to tread carefully with regard to how you might represent and present any work that emerges from it. It is
about caution not cessation.
Influence on Process
This extensive examination work gives you an initial – but thorough – appreciation of the potential of your
data, the things it will offer and the things it will not. Of course this potential is as yet unrealised. Furthering
this examination will be the focus of the next activity, as you look to employ more visual techniques to help
unearth the as-yet-hidden qualities of understanding locked away in the data. For now, this examination work
takes your analytical and creative thinking forward another step.
Purpose map ‘tone’: Through deeper acquaintance with your data, you will have been able to
further consider the suitability of the potential tone of your work. By learning more about the inherent
characteristics of the subject, this might help to confirm or redefine your intentions for adopting a
utilitarian (reading) or sensation-based (feeling) tone.
Editorial angles: The main benefit of exploring the data types is to arrive at an understanding of
what you have and have not got to work with. More specifically, it guides your thinking towards what
possible angles of analysis may be viable and relevant, and which can be eliminated as not. For example,
if you do not have any location or spatial data, this rules out the immediate possibility of being able to
map your data. This is not something you could pursue with the current scope of your dataset. If you
do have time-based data then the prospect of conducting analysis that might show changes over time is
viable. You will learn more about this idea of editorial ‘angle’ in the next chapter but let me state now it
is one of the most important components of visualisation thinking.
Physical properties influence scale: Data is your raw material; your ideas are not. I stated
towards the end of Chapter 3 that you should embrace the instinctive manifestations of ideas and seek
influence and inspiration from other sources. However, with the shape and size of your data having such
an impact on any eventual designs, you must respect the need to be led by your data’s physical properties
and not just your ideas.
Figure 4.5 OECD Better Life Index
In particular, the range of values in your data will shape things significantly. The shape of data in the
‘Better Life Index’ project you saw earlier is a good example. Figure 4.5 presents an analysis of the
quality of life across the 36 OECD member states. Each country is a flower comprising 11 petals with
each representing a different quality of life indicator (the larger the petal, the better the measured quality
of life).
Consider this. Would this design concept still be viable if there were 20 indicators? Or just 3? How
about if the analysis was for 150 countries? The connection between data range and chart design involves
a discerning judgement about ‘fit’. You need to identify carefully the underlying shape of the data to be
displayed and what tolerances this might test in the shape of the possible design concepts used.
‘My design approach requires that I immerse myself deeply in the problem domain and available data very
early in the project, to get a feel for the unique characteristics of the data, its “texture” and the affordances it
brings. It is very important that the results from these explorations, which I also discuss in detail with my
clients, can influence the basic concept and main direction of the project. To put it in Hans Rosling’s words,
you need to “let the data set change your mind set”.’ Moritz Stefaner, Truth &amp; Beauty Operator
Another relevant concern involves the challenge of elegantly handling quantitative measures that have hugely
varied value ranges and contain (legitimate) outliers. Accommodating all the values into a single display can
have a hugely distorting impact on the space it occupies. For example, note the exceptional size of the shape for
Avatar in Figure 4.6, from the ‘Spotlight on profitability’ graphic you saw earlier. It is the one movie
included that bursts through the ceiling, far beyond the otherwise entirely suitable 1000 million maximum
scale value. As a single outlier, in this case, it was treated with a rather unique approach. As you can see, its
striking shape conveniently trespasses onto the space offered by the two empty rows above. The result
emphasises this value’s exceptional quality. You might seldom have the luxury of this type of effective
resolution, so the key point to stress is always be acutely aware of the existence of ‘Avatars’ in your data.
Figure 4.6 Spotlight on Profitability
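The spirit of this warning can be sketched in code: a simple interquartile-range check that flags ‘Avatar’-like values before they distort a scale. This is an illustrative Python sketch with hypothetical figures, not a treatment taken from the project itself.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical figures (millions); 2788 stands in for an 'Avatar'-like value.
budgets = [50, 60, 75, 80, 90, 100, 110, 120, 2788]
print(flag_outliers(budgets))  # [2788]
```

Running a check like this before committing to a scale tells you whether a special treatment, like the trespassing shape used for Avatar, will be needed.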
2.5 Data Transformation
Having undertaken an examination of your data you will have a good idea about what needs to be done to
ensure it is entirely fit for purpose. The next activity is to work on transforming the data so it is in optimum
condition for your needs.
At this juncture, the linearity of a book becomes rather unsatisfactory. Transforming your data is something
that will take place before, during and after both the examination and (upcoming) exploration steps. It will
also continue beyond the boundaries of this stage of the workflow. For example, the need to transform data
may only emerge once you begin your ‘editorial thinking’, as covered by the next chapter (indeed you will
likely find yourself bouncing forwards and backwards between these sections of the book on a regular basis).
As you get into the design stage you will constantly stumble upon additional reasons to tweak the shape and
size of your data assets. The main point here is that your needs will evolve. This moment in the workflow is
not going to be the only or final occasion when you look to refine your data.
Two important notes to share upfront at this stage. Firstly, in accordance with the desire for trustworthy
design, any treatments you apply to your data need to be recorded and potentially shared with your audience.
You must be able to reveal the thinking behind any significant assumptions, calculations and modifications
you have made to your data.
Secondly, I must emphasise the critical value of keeping backups. Before you undertake any transformation,
make a copy of your dataset. After each major iteration remember to save a milestone version for backup
purposes. Additionally, when making changes, it is useful to preserve original (unaltered) data items nearby for
easy rollback should you need them. For example, suppose you are cleaning up a column of messy data to do
with ‘Gender’ that has a variety of inconsistent values (such as “M”, “Male”, “male”, “FEMALE”, “F”,
“Female”). Normally I would keep the original data, duplicate the column, and then tidy up this second
column of values. I have then gained access to both original and modified versions. If you are going to do any
transformation work that might involve a significant investment of time and (manual) effort, having an
opportunity to refer to a previous state is always useful in my experience.
There are four different types of potential activity involved in transforming your data: cleaning, converting,
creating and consolidating.
Transform to clean: I spoke about the importance of data quality (better quality in, better quality
out, etc.) in the examination section when looking at the physical condition of the data. There’s no need
to revisit the list of potential observations you might need to consider looking out for but this is the
point where you will need to begin to address these.
There is no single or best approach for how to conduct this task. Some issues can be addressed through a
straightforward ‘find and replace’ (or remove) operation. Some treatments will be possible using simple
functions to convert data into new states, such as using logic formulae that state ‘if this, do this,
otherwise do that’. For example, if the value in the ‘Gender’ column is “M” make it “Male”, if the value
is “MALE” make it “Male” etc. Other tasks might be much more intricate, requiring manual
intervention, often in combination with inspection features like ‘sort’ or ‘filter’, to find, isolate and then
modify problem values.
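As an illustration of the duplicate-then-tidy tactic and the ‘if this, do this, otherwise do that’ logic described above, here is a minimal Python sketch; the field names, values and mapping are all hypothetical.

```python
# Duplicate the messy column, then normalise the copy via a mapping with an
# 'otherwise' fallback. Field names and values are hypothetical.
rows = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": "FEMALE"},
    {"id": 3, "gender": "male"},
    {"id": 4, "gender": "F"},
]

CANONICAL = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

for row in rows:
    row["gender_raw"] = row["gender"]  # preserve the original for easy rollback
    row["gender"] = CANONICAL.get(row["gender"].strip().lower(), "Unknown")

print([row["gender"] for row in rows])  # ['Male', 'Female', 'Male', 'Female']
```

The `"Unknown"` fallback plays the role of the ‘otherwise do that’ branch, catching any value the mapping does not anticipate.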
Part of cleaning up your data involves the elimination of junk. Going back to the earlier scenario about
gathering data about McDonald’s restaurants, you probably would not need the name of the restaurant
manager, details of the opening times or the contact telephone number. It is down to your judgement at
the time of gathering the data to decide whether these extra items of detail – if they were as easily
acquirable as the other items of data that you really did need – may potentially provide value for your
analysis later in the process. My tactic is usually to gather as much data as I can and then reject/trim later;
later has arrived and now is the time to consider what to remove. Any fields or rows of data that you
know serve no ongoing value will take up space and attention, so get rid of these. You will need to
separate the wheat from the chaff to help reduce your problem.
Transform to convert: Often you will seek to create new data values out of existing ones. In the
illustration in Figure 4.7, it might be useful to extract the constituent parts of a ‘Release Date’ field in
order to group, analyse and use the data in different ways. You might use the ‘Month’ and ‘Year’ fields
to aggregate your analysis at these respective levels in order to explore within-year and across-year
seasonality. You could also create a ‘Full Release Date’ formatted version of the date to offer a more
presentable form of the release date value possibly for labelling purposes.
Figure 4.7 Example of Converted Data Transformation
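A sketch of this kind of conversion, assuming Python’s standard datetime module and a hypothetical record layout:

```python
from datetime import date

# Derive 'Month', 'Year' and a presentable 'Full Release Date' from a
# release-date field. The record layout and values are hypothetical.
record = {"title": "Example Film", "release_date": date(2009, 12, 18)}

record["year"] = record["release_date"].year
record["month"] = record["release_date"].month
record["full_release_date"] = record["release_date"].strftime("%d %B %Y")

print(record["full_release_date"])  # 18 December 2009
```

The numeric ‘Year’ and ‘Month’ fields support grouping and seasonality analysis, while the formatted string is the more presentable form suited to labelling.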
Extracting or deriving new forms of data will be necessary when it comes to handling qualitative
‘textual’ data. As stated in the ‘Data literacy’ section, if you have textual data you will almost always
need to transform this into various categorical or quantitative forms, unless its role is simply to provide
value as an annotation (such as a quoted caption or label). Some would argue that qualitative
visualisation involves special methods for the representation of data. I would disagree. I believe the
unique challenge of working with textual data lies with the task of transforming the data: visually
representing the extracted and derived properties from textual data involves the same suite of
representation options (i.e. chart types) that would be useful for portraying analysis of any other data
types.
Here is a breakdown of some of the conversions, calculations and extractions you could apply to textual
data. Some of these tasks can be quite straightforward (e.g. using the LEN function in Excel to
determine the number of characters) while others are more technical and will require more sophisticated
tools or programmes dedicated to handling textual data.
Categorical conversions:
Identify keywords or summary themes from text and convert these into categorical classifications.
Identify and flag up instances of certain cases existing or otherwise (e.g. X is mentioned in this
passage).
Identify and flag up the existence of certain relationships (e.g. A and B were both mentioned in
the same passage, C was always mentioned before D).
Use natural language-processing techniques to determine sentiments, to identify specific word
types (nouns, verbs, adjectives) or sentence structures (around clauses and punctuation marks).
With URLs, isolate and extract the different components of the website address and sub-folder
locations.
Quantitative conversions:
Calculate the frequency of certain words being used.
Analyse the attributes of text, such as total word count, physical length, potential reading duration.
Count the number of sentences or paragraphs, derived from the frequency of different punctuation
marks.
Position the temporal location of certain words/phrases in relation to other words/phrases or compared
to the whole (e.g. X was mentioned at 1m51s).
Position the spatial location of certain words/phrases in relation to other words/phrases or compared to
the whole.
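Several of the simpler quantitative conversions listed above can be sketched with nothing more than Python’s standard library; the sample text is invented for illustration:

```python
import re
from collections import Counter

# Simple quantitative conversions of text: word frequencies, a word count,
# and a sentence count derived from punctuation marks.
text = "The data was messy. The data was cleaned. Then the analysis began."

words = re.findall(r"[a-z']+", text.lower())
word_freq = Counter(words)
sentence_count = len(re.findall(r"[.!?]", text))

print(len(words), sentence_count, word_freq["the"])  # 12 3 3
```

More sophisticated conversions, such as sentiment or part-of-speech analysis, would require dedicated natural language-processing tools rather than simple pattern matching like this.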
A further challenge that falls under this ‘converting’ heading will sometimes emerge when you are working
with data supplied by others in spreadsheets. This concerns the obstacles created when trying to analyse data
that has been formatted visually, perhaps in readiness for printing. If you receive data in this form you will
need to unpack and reconstruct it into the normalised form described earlier, comprising all records and fields
included in a single table.
Any merged cells need unmerging or removing. You might have a heading that is common to a series of
columns. If you see this, unmerge it and replicate the same heading across each of the relevant columns
(perhaps appending an index number to each header to maintain some differentiation). Cells that have
visual formatting like background shading or font attributes (bold, coloured) to indicate a value or
status are useful when observing and reading the data, but for analysis operations these properties are
largely invisible. You will need to create new values in actual data form that are not visual (creating
categorical values, say, or status flags like ‘yes’ or ‘no’) to recreate the meaning of the formats. The data
provided to you – or that you create – via a spreadsheet does not need to be elegant in appearance, it
needs to be functional.
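The forward-fill-and-index treatment for merged headings might look like this in outline; the header values are hypothetical:

```python
# A merged heading spanning several columns often arrives as one value
# followed by blanks. Forward-fill it, then append an index number to
# repeated headings to keep each column distinct.
raw_headers = ["Region", "Sales", "", "", "Costs", ""]

filled, current = [], ""
for h in raw_headers:
    current = h or current  # carry the last non-blank heading forward
    filled.append(current)

seen, headers = {}, []
for h in filled:
    seen[h] = seen.get(h, 0) + 1
    headers.append(f"{h} {seen[h]}" if filled.count(h) > 1 else h)

print(headers)  # ['Region', 'Sales 1', 'Sales 2', 'Sales 3', 'Costs 1', 'Costs 2']
```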
Transform to create: This task is something I refer to as the hidden cleverness, where you are doing
background thinking to form new calculations, values, groupings and any other mathematical or manual
treatments that really expand the variety of data available.
A simple example might involve the need to create some percentage calculations in a new field, based on
related quantities elsewhere within your existing data. Perhaps you have pairs of ‘start date’ and ‘end
date’ values and you need to calculate the duration in days for all your records. You might use a logic
formula to assist in creating a new variable that summarises another – maybe something like (in
language terms) IF Age < 18 THEN status = “Child”, ELSE status = “Adult”. Alternatively, you might
want to create a calculation that standardises some quantities: you might need to source base population figures for
all the relevant locations in your data in order to convert some quantities into ‘per capita’ values. This
would be particularly necessary if you anticipate wanting to map the data as this will ensure you are
facilitating legitimate comparisons.
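A minimal sketch of these three kinds of created values – a duration, a rule-derived status, and a per-capita rate – using hypothetical field names and figures:

```python
from datetime import date

# Hypothetical record with the raw ingredients for three created values.
record = {
    "start_date": date(2016, 3, 1),
    "end_date": date(2016, 3, 29),
    "age": 12,
    "sales": 150_000,
    "population": 50_000,
}

# Duration in days from a start/end date pair.
record["duration_days"] = (record["end_date"] - record["start_date"]).days

# A status derived by a logic rule: IF Age < 18 THEN "Child" ELSE "Adult".
record["status"] = "Child" if record["age"] < 18 else "Adult"

# A per-capita conversion using a sourced population figure.
record["sales_per_capita"] = record["sales"] / record["population"]

print(record["duration_days"], record["status"], record["sales_per_capita"])
# 28 Child 3.0
```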
Transform to consolidate: This involves bringing in additional data to help expand (more
variables) or append (more records) to enhance the editorial and representation potential of your project.
An example of a need to expand your data would be if you had details about locations only at country
level but you wanted to be able to group and aggregate your analysis at continent level. You could gather
a dataset that holds values showing the relationships between country and continent and then add a new
variable to your dataset against which you would perform a simple lookup operation to fill in the
associated continent values.
Consolidating by appending data might occur if you had previously acquired a dataset that now had
more or newer data (specifically, additional records) available to bring it up to date. For instance, you
might have started some analysis on music record sales up to a certain point in time, but once you’d
actually started working on the task another week had elapsed and more data had become available.
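Both consolidation moves can be sketched as simple operations on a list of records; the countries and values are hypothetical:

```python
# Hypothetical dataset held at country level.
records = [
    {"country": "France", "sales": 120},
    {"country": "Japan", "sales": 95},
]

# Expand: add a 'continent' variable via a simple lookup.
continent_of = {"France": "Europe", "Japan": "Asia", "Brazil": "South America"}
for r in records:
    r["continent"] = continent_of.get(r["country"], "Unknown")

# Append: newer records that became available after the first acquisition.
records += [{"country": "Brazil", "sales": 80, "continent": continent_of["Brazil"]}]

print(len(records), records[0]["continent"])  # 3 Europe
```

The lookup dictionary plays the role of the country-to-continent dataset described above; in practice you would source it rather than type it in.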
Additionally, you may start to think about sourcing other media assets to enhance your presentation
options, beyond just gathering extra data. You might anticipate the potential value for gathering photos
(headshots of the people in your data), icons/symbols (country flags), links to articles (URLs), or videos
(clips of goals scored). All of these would contribute to broadening the scope of your annotation
options. Even though there is a while yet until we reach that particular layer of design thinking, it is
useful to start contemplating this as early as possible in case the collection of these additional assets requires
significant time and effort. It might also reveal any obstacles around having to obtain permissions for
usage or sufficiently high quality media. If you know you are going to have to do something, don’t
leave it too late – reduce the possibility of such stresses by acting early.
2.6 Data Exploration
The examination task was about forming a deep acquaintance with the physical properties and meaning of
your data. You now need to interrogate that data further – and differently – to find out what potential insights
and qualities of understanding it could provide.
Undertaking data exploration will involve the use of statistical and visual techniques to move beyond looking
at data and begin to start seeing it. You will be directly pursuing your initially defined curiosity, to determine
if answers exist and whether they are suitably enlightening in nature. Often you will not know for sure whether
what you initially thought was interesting is exactly that. This activity will confirm, refine or reject your core
curiosity and perhaps, if you are fortunate, present discoveries that will encourage other interesting avenues of
enquiry.
‘After the data exploration phase you may come to the conclusion that the data does not support the goal of
the project. The thing is: data is leading in a data visualization project – you cannot make up some data just
to comply with your initial ideas. So, you need to have some kind of an open mind and “listen to what the
data has to say”, and learn what its potential is for a visualisation. Sometimes this means that a project has
to stop if there is too much of a mismatch between the goal of the project and the available data. In other
cases this may mean that the goal needs to be adjusted and the project can continue.’ Jan Willem Tulp, Data Experience Designer
To frame this process, it is worth introducing something that will be covered in Chapter 5, where you will
consider some of the parallels between visualisation and photography. Before committing to take a photograph
you must first develop an appreciation of all the possible viewpoints that are available to you. Only then can
you determine which of these is best. The notion of ‘best’ will be defined in the next chapter, but for now you
need to think about identifying all the possible viewpoints in your data – to recognise the knowns and the
unknowns.
Widening the Viewpoint: Knowns and Unknowns
At a news briefing in February 2002, the US Secretary of Defense, Donald Rumsfeld, delivered his infamous
‘known knowns’ statement:
Reports that say that something hasn’t happened are always interesting to me, because as we know, there
are known knowns; there are things we know we know. We also know there are known unknowns; that
is to say we know there are some things we do not know. But there are also unknown unknowns – the
ones we don’t know we don’t know. And if one looks throughout the history of our country and other
free countries, it is the latter category that tend to be the difficult ones.
There was much commentary about the apparent lack of elegance in the language used and criticism of the
muddled meaning. I disagree with this analysis. I thought it was probably the most efficient way he could have
articulated what he was explaining, at least in written or verbal form. The essence of Rumsfeld’s statement was
to distinguish awareness of what is knowable about a subject (what knowledge exists) from the status of
acquiring this knowledge. There is a lot of value to be gained from using this structure (Figure 4.8) to shape
your approach to thinking about data exploration.
The known knowns are aspects of knowledge about your subject and about the qualities present in your data
that you are aware of – you are aware that you know these things. The nature of these known knowns might
mean you have confidence that the origin curiosity was relevant and the available insights that emerged in
response are suitably interesting. You cannot afford to be complacent, though. You will need to challenge
yourself to check that these curiosities are still legitimate and relevant. To support this, you should continue to
look and learn about the subject through research, topping up your awareness of the most potentially relevant
dynamics of the subject, and continue to interrogate your data accordingly.
Additionally, you should not just concentrate on this potentially quite narrow viewpoint. As I mentioned
earlier, it is important to give yourself as broad a view as possible across your subject and its data to optimise
your decisions about what other interesting enquiries might be available. This is where you need to consider
the other quadrants in this diagram.
Figure 4.8 Making Sense of the Known Knowns
On occasion, though I would argue rarely, there may be unknown knowns, things you did not realise you
knew or perhaps did not wish to acknowledge that you knew about a subject. This may relate to previous
understandings that have been forgotten, consciously ignored or buried. Regardless, you need to acknowledge
these.
For the knowledge that has yet to be acquired – the known unknowns and the even more elusive unknown
unknowns – tactics are needed to help plug these gaps as far, as deep and as wide as possible. You cannot
possibly achieve mastery of all the domains you work with. Instead, you need to have the capacity and be in
position to turn as many unknowns as possible into knowns, and in doing so optimise your understanding of a
subject. Only then will you be capable of appreciating the full array of viewpoints the data offers.
To make the best decisions you first need to be aware of all the options. This activity is about broadening your
awareness of the potentially interesting things you could show – and could say – about your data. The
resulting luxury of choice is something you will deal with in the next stage.
Exploratory Data Analysis
As I have stated, the aim throughout this book is to create a visualisation that will facilitate understanding for
others. That is the end goal. At this stage of the workflow the deficit in understanding lies with you. The task
of addressing the unknowns you have about a subject, as well as substantiating what knowns already exist,
involves the use of exploratory data analysis (EDA). This integrates statistical methods with visual analysis to
offer a way of extracting deeper understanding and widening the view to unlock as much of the potential as
possible from within your data.
The chart in Figure 4.9 is a great demonstration of the value in combining statistical and visual techniques to
understand your data better. It shows the results of nearly every major and many minor (full) marathon from
around the world. On the surface, the distribution of finishing times reveals the common bell shape found in
plots about many natural phenomena, such as the height measurements of a large group of people. However,
when you zoom in closer the data reveals some really interesting threshold patterns for finishing times on or
just before the three-, four- and five-hour marks. You can see that the influence of runners setting themselves
targets, often rounded to the hourly milestones, genuinely appeared to affect the results achieved.
Figure 4.9 What Good Marathons and Bad Investments Have in Common
Although statistical analysis of this data would have revealed many interesting facts, these unique patterns were
only realistically discoverable through studying the visual display of the data. This is the essence of EDA but
there is no instruction manual for it. As John Tukey, the father of EDA, described: ‘Exploratory data analysis
is an attitude, a flexibility, and a reliance on display, not a bundle of techniques’. There is no single path to
undertaking this activity effectively; it requires a number of different technical, practical and conceptual
capabilities.
Instinct of the analyst: This is the primary matter. The attitude and flexibility that Tukey
describes are about recognising the importance of the analyst’s traits. Effective EDA is not about the
tool. There are many vendors out there pitching their devices as the magic option where we just have to
‘point and click’ to uncover a deep discovery. Technology inevitably plays a key role in facilitating this
endeavour but the value of a good analyst should not be underestimated: it is arguably more influential than
the differentiating characteristics between one tool and the next. In the absence of a defined procedure
for conducting EDA, an analyst needs to possess the capacity to recognise and pursue the scent of
enquiry. A good analyst will have that special blend of natural inquisitiveness and the sense to know
what approaches (statistical or visual) to employ and when. Furthermore, when these traits collide with
a strong subject knowledge this means better judgments are made about which findings from the
analysis are meaningful and which are not.
Reasoning: Efficiency is a particularly important aspect of this exploration stage. The act of
interrogating data, waiting for it to volunteer its secrets, can take a lot of time and energy. Even with
smaller datasets you can find yourself tempted into trying out myriad combinations of analyses, driven
by the desire to find the killer insight in the shadows.
‘At the beginning, there’s a process of “interviewing” the data – first evaluating their source and means of
collection/aggregation/computation, and then trying to get a sense of what they say – and how well they say it
via quick sketches in Excel with pivot tables and charts. Do the data, in various slices, say anything
interesting? If I’m coming into this with certain assumptions, do the data confirm them, or refute them?’
Alyson Hurt, News Graphics Editor, NPR
Reasoning is an attempt to help reduce the size of the prospect. You cannot afford to try everything. There are
so many statistical methods and, as you will see, so many visual means for seeing views of data that you simply
cannot expect to have the capacity to try to unleash the full exploratory artillery. EDA is about being smart,
recognising that you need to be discerning about your tactics.
In academia there are two distinctions in approaches to reasoning – deductive and inductive – that I feel are
usefully applied in this discussion:
Deductive reasoning is targeted: You have a specific curiosity or hypothesis, framed by subject
knowledge, and you are going to interrogate the data in order to determine whether there is any evidence
of relevance or interest in the concluding finding. I consider this adopting a detective’s mindset
(Sherlock Holmes).
Inductive reasoning is much more open in nature: You will ‘play around’ with the data, based on your
sense or instinct about what might be of interest, and wait and see what emerges. In some ways this is
like prospecting, hoping for that moment of serendipity when you unearth gold.
In this exploration process you ideally need to accommodate both approaches. The deductive process will
focus on exploring further targeted curiosities, the inductive process will give you a fighting chance of finding
more of those slippery ‘unknowns’, often almost by accident. It is important to give yourself room to embark
on these somewhat less structured exploratory journeys.
I often think about EDA in the context of a comparison with the challenge of a ‘Where’s Wally?’ visual puzzle.
The process of finding Wally feels somewhat unscientific. Sometimes you let your eyes race around the scene
like a dog who has just been let out of the car and is torpedoing across a field. However, after the initial burst
of randomness, perhaps subconsciously, you then go through a more considered process of visual analysis.
Elimination takes place by working around different parts of the scene and sequentially declaring ‘Wally-free’
zones. This aids your focus and strategy for where to look next. As you then move across each mini-scene you
are pattern matching, looking out for the giveaway characteristics of the boy wearing glasses, a red-and-white-striped hat and jumper, and blue trousers.
The objective of this task is clear and singular in definition. The challenge of EDA is rarely that clean. There is
a source curiosity to follow, for sure, and you might find evidence of Wally somewhere in the data. However,
unlike the ‘Where’s Wally?’ challenge, in EDA you have the chance also to find other things that might change
the definition of what qualifies as an interesting insight. In unearthing other discoveries you might determine
that you no longer care about Wally; finding him no longer represents the main enquiry.
Inevitably you are faced with a trade-off between spare capacity in time and attention and your own internal
satisfaction that you have explored as many different angles of enquiry as possible.
Chart types: This is about seeing the data from all feasible angles. The power of the visual means
that we can easily rely on our pattern-matching and sense-making capabilities – in harmony with
contextual subject knowledge – to make observations about data that appear to have relevance.
The data representation gallery that you will encounter in Chapter 6 presents nearly 50 different chart
types, offering a broad repertoire of options for portraying data. The focus of the collection is on chart
types that could be used to communicate to others. However, within this gallery there are also many
chart types that help with pursuing EDA. In each chart profile, indications are given for those chart types
that may be particularly useful to support your exploratory activity. As a rough estimate, I would say about
half of these can prove to be great allies in this stage of discovery.
The visual methods used in EDA do not just involve charting, they also involve selective charting –
smart charting, ‘smarting’ if you like? (No, Andy, nobody likes that). Every chart type presented in the
gallery includes helpful descriptions that will give you an idea of their role and also what observations –
and potential interpretations – they might facilitate. It is important to know now that the chart types
are organised across five main families (categorical, hierarchical, relational, temporal, and spatial)
depending on the primary focus of your analysis. The focus of your analysis will, in turn, depend on the
types of data you have and what you are trying to see.
‘I kick it over into a rough picture as soon as possible. When I can see something then I am able to ask
better questions of it – then the what-about-this iterations begin. I try to look at the same data in as many
different dimensions as possible. For example, if I have a spreadsheet of bird sighting locations and times,
first I like to see where they happen, previewing it in some mapping software. I’ll also look for patterns in
the timing of the phenomenon, usually using a pivot table in a spreadsheet. The real magic happens when a
pattern reveals itself only when seen in both dimensions at the same time.’ John Nelson, Cartographer, on the value of visually exploring his data
Research: I have raised this already but make no apology for doing so again so soon. How you
conduct research and how much you can do will naturally depend on your circumstances, but it is
always important to exploit as many different approaches to learning about the domain and the data you
are working with. As you will recall, the middle stage of forming understanding – interpreting – is
about viewers translating what they have perceived from a display into meaning. They can only do this
with domain knowledge. Similarly, when it comes to conducting exploratory analysis using visual
methods, you might be able to perceive the charts you make, but without possessing or acquiring
sufficient domain knowledge you will not know if what you are seeing is meaningful. Sometimes the
consequence of this exploratory data analysis will only mean you have become better acquainted with
specific questions and more defined curiosities about a subject even if you possibly do not yet have any
answers.
The approach to research is largely common sense: you explore the places (books, websites) and consult
the people (experts, colleagues) that will collectively give you the best chance of getting accurate answers
to the questions you have. Good communication skills, therefore, are vital – it is not just about talking
to others, it is about listening. If you are in a dialogue with experts you will have to find an approach
that allows you to understand potentially complicated matters and also cut through to the most salient
matters of interest.
Statistical methods: Although the value of the univariate statistical techniques profiled earlier still
applies here, what you are often looking to undertake in EDA is multivariate analysis. This concerns
testing out the potential existence of a correlation between quantitative variables as well as determining
the possible causation variables – the holy grail of data analysis.
Typically, I find statistical analysis plays more of a supporting role during much of the exploration
activity rather than a leading role. Visual techniques will serve up tangible observations about whether
data relationships and quantities seem relevant, but to substantiate this you will need to conduct
statistical tests of significance.
One of the main exceptions is when dealing with large datasets. Here the first approach might be more
statistical in nature due to the amount of data obstructing rapid visual approaches. Going further,
algorithmic approaches – using techniques like machine learning – might help to scale the task of
statistically exploring large dimensions of data – and the endless permutations they offer. What these
approaches gain in productivity they clearly lose in human quality. The significance of this should not
be underestimated. It may be possible to take a blended approach where you might utilise machine
learning techniques to act as an initial battering ram to help reduce the problem, identifying the major
dimensions within the data that might hold certain key statistical attributes and then conducting further
exploration ‘by hand and by eye’.
Nothings: What if you have found nothing? You have hit a dead end, discovering no significant
relationships and finding nothing interesting about the shape or distribution of your data. What do you
do? In these situations you need to change your mindset: nothing is usually something. Dead ends and
discovering blind alleys are good news because they help you develop focus by eliminating different
dimensions of possible analysis. If you have traits of nothingness in your data or analysis – gaps, nulls,
zeroes and no insights – this could prove to be the insight. As described earlier, make the gaps the focus
of your story.
There is always something interesting in your data. If a value has not changed over time, maybe it was
supposed to – that is an insight. If everything is the same size, that is the story. If there is no significance
in the quantities, categories or spatial relationships, make those your insights. You will only know that
these findings are relevant by truly understanding the context of the subject matter. This is why you
must make as much effort as possible to convert your unknowns into knowns.
‘My main advice is not to be disheartened. Sometimes the data don’t show what you thought they would, or
they aren’t available in a usable or comparable form. But [in my world] sometimes that research still turns up
threads a reporter could pursue and turn into a really interesting story – there just might not be a viz in it.
Or maybe there’s no story at all. And that’s all okay. At minimum, you’ve still hopefully learned something
new in the process about a topic, or a data source (person or database), or a “gotcha” in a particular dataset
– lessons that can be applied to another project down the line.’ Alyson Hurt, News Graphics Editor, NPR
Not always needed: It is important to couch this discussion about exploration in pragmatic reality.
Not all visualisation challenges will involve much EDA. Your subject and your data might be
immediately understandable and you may have a sufficiently broad viewpoint of your subject (plenty of
known knowns already in place). Further EDA activity may have diminishing value. Additionally, if you
are faced with small tables of data this simply will not warrant multivariate investigation. You certainly
need to be ready and equipped with the capacity to undertake this type of exploration activity when it is
needed, but the key point here is to judge when.
Summary: Working with Data
This chapter first introduced key foundations for the requisite data literacy involved in visualisation,
specifically the importance of the distinction between normalised and cross-tabulated datasets as well as the
different types of data (using the TNOIR mnemonic):
Textual (qualitative): e.g. ‘Any other comments?’ data submitted in a survey.
Nominal (qualitative): e.g. The ‘gender’ selected by a survey participant.
Ordinal (qualitative): e.g. The response to a survey question, based on a scale of 1 (unhappy) to 5 (very
happy).
Interval (quantitative): e.g. The shoe size of a survey participant.
Ratio (quantitative): e.g. The age of a survey participant in years.
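To make the TNOIR distinctions operational, each field's measurement level can be recorded alongside the summaries that level meaningfully supports. The mapping below is a hypothetical illustration (the field names and the lists of supported summaries are invented, not from the chapter):

```python
# Hypothetical survey schema mapped to TNOIR measurement levels,
# paired with the summaries each level meaningfully supports.
TNOIR = {
    "other_comments": ("textual",  ["word counts", "thematic coding"]),
    "gender":         ("nominal",  ["mode", "frequency table"]),
    "happiness_1to5": ("ordinal",  ["mode", "median"]),
    "shoe_size":      ("interval", ["mean", "differences"]),
    "age_years":      ("ratio",    ["mean", "ratios ('twice as old')"]),
}

QUALITATIVE = {"textual", "nominal", "ordinal"}

def is_quantitative(field):
    """True if the field's measurement level supports arithmetic."""
    level, _ = TNOIR[field]
    return level not in QUALITATIVE
```

A lookup like this can guard later analysis code against, say, averaging an ordinal scale as if it were ratio data.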
You then walked through the four steps involved in working with data:
Acquisition: Different sources and methods for getting your data.
Curated by you: primary data collection, manual collection and data foraging, extraction from PDFs, web
scraping (also known as web harvesting).
Curated by others: issued to you, downloaded from the Web, system report or export, third-party
services, APIs.
Examination: Developing an intimate appreciation of the characteristics of this critical raw material:
Physical properties: type, size, and condition.
Meaning: phenomenon, completeness.
Transformation: Getting your data into shape, ready for its role in your exploratory analysis and
visualisation design:
Clean: resolve any data quality issues.
Create: consider new calculations and conversions.
Consolidate: what other data (to expand or append) or other assets could be sought to enhance your
project?
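The three transformation moves (clean, create, consolidate) can be sketched on a toy example. Everything below is hypothetical: the field names, the `region_lookup` table, and the `SURVEY_YEAR` constant are invented for illustration:

```python
# Hypothetical raw export with messy department labels and string numbers.
raw = [
    {"dept": " sales ", "born": 1990, "score": "4"},
    {"dept": "SALES",   "born": 1985, "score": "5"},
    {"dept": "Mktg",    "born": 1978, "score": "3"},
]

# Consolidate: reference data sourced elsewhere to expand each record.
region_lookup = {"sales": "North", "marketing": "South"}
aliases = {"mktg": "marketing"}

SURVEY_YEAR = 2024  # assumed reference year for the derived age field

clean = []
for row in raw:
    dept = row["dept"].strip().lower()          # Clean: normalise labels
    dept = aliases.get(dept, dept)              # Clean: resolve aliases
    clean.append({
        "dept": dept,
        "score": int(row["score"]),             # Clean: fix the data type
        "age": SURVEY_YEAR - row["born"],       # Create: a new calculation
        "region": region_lookup.get(dept),      # Consolidate: append data
    })
```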
Exploration: Using visual and statistical techniques to see the data’s qualities: what insights does it reveal to
you as you deepen your familiarity with it?
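A first statistical pass at exploration need not be elaborate; Python's standard `statistics` module is enough to compare each group's centre and spread against the overall picture. The department names and scores below are invented for illustration:

```python
import statistics as st

# Hypothetical satisfaction scores grouped by department.
scores = {
    "Sales":     [4, 5, 3, 4, 5],
    "Marketing": [2, 3, 2, 1, 3],
}

# Centre and spread per group show whether any department sits
# above or below the overall average.
summary = {
    dept: {"mean": st.mean(vals), "stdev": st.pstdev(vals)}
    for dept, vals in scores.items()
}
overall_mean = st.mean(v for vals in scores.values() for v in vals)
```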
Tips and Tactics
Perfect data (complete, accurate, up to date, truly representative) is an almost impossible standard to
reach, especially under time constraints, so the real decision is when good enough becomes good
enough: at what point do diminishing returns start to materialise?
Do not underestimate the demands on your time; working with data will always consume a great deal of
your attention and effort:
Ensure you have bui...