Big Data Now: 2012 Edition
O’Reilly Media, Inc.
Copyright © 2012 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://my.safaribooksonline.com). For
more information, contact our corporate/institutional sales department: (800)
998-9938 or corporate@oreilly.com.
Cover Designer: Karen Montgomery
Interior Designer: David Futato
October 2012: First Edition
Revision History for the First Edition:
2012-10-24: First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449356712 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered
trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their prod‐
ucts are claimed as trademarks. Where those designations appear in this book, and
O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed
in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher
and authors assume no responsibility for errors or omissions, or for damages resulting
from the use of the information contained herein.
ISBN: 978-1-449-35671-2
Table of Contents
1. Introduction
2. Getting Up to Speed with Big Data
  What Is Big Data?
  What Does Big Data Look Like?
  In Practice
  What Is Apache Hadoop?
  The Core of Hadoop: MapReduce
  Hadoop’s Lower Levels: HDFS and MapReduce
  Improving Programmability: Pig and Hive
  Improving Data Access: HBase, Sqoop, and Flume
  Coordination and Workflow: Zookeeper and Oozie
  Management and Deployment: Ambari and Whirr
  Machine Learning: Mahout
  Using Hadoop
  Why Big Data Is Big: The Digital Nervous System
  From Exoskeleton to Nervous System
  Charting the Transition
  Coming, Ready or Not
3. Big Data Tools, Techniques, and Strategies
  Designing Great Data Products
  Objective-based Data Products
  The Model Assembly Line: A Case Study of Optimal Decisions Group
  Drivetrain Approach to Recommender Systems
  Optimizing Lifetime Customer Value
  Best Practices from Physical Data Products
  The Future for Data Products
  What It Takes to Build Great Machine Learning Products
  Progress in Machine Learning
  Interesting Problems Are Never Off the Shelf
  Defining the Problem
4. The Application of Big Data
  Stories over Spreadsheets
  A Thought on Dashboards
  Full Interview
  Mining the Astronomical Literature
  Interview with Robert Simpson: Behind the Project and What Lies Ahead
  Science between the Cracks
  The Dark Side of Data
  The Digital Publishing Landscape
  Privacy by Design
5. What to Watch for in Big Data
  Big Data Is Our Generation’s Civil Rights Issue, and We Don’t Know It
  Three Kinds of Big Data
  Enterprise BI 2.0
  Civil Engineering
  Customer Relationship Optimization
  Headlong into the Trough
  Automated Science, Deep Data, and the Paradox of Information
  (Semi)Automated Science
  Deep Data
  The Paradox of Information
  The Chicken and Egg of Big Data Solutions
  Walking the Tightrope of Visualization Criticism
  The Visualization Ecosystem
  The Irrationality of Needs: Fast Food to Fine Dining
  Grown-up Criticism
  Final Thoughts
6. Big Data and Health Care
  Solving the Wanamaker Problem for Health Care
  Making Health Care More Effective
  More Data, More Sources
  Paying for Results
  Enabling Data
  Building the Health Care System We Want
  Recommended Reading
  Dr. Farzad Mostashari on Building the Health Information Infrastructure for the Modern ePatient
  John Wilbanks Discusses the Risks and Rewards of a Health Data Commons
  Esther Dyson on Health Data, “Preemptive Healthcare,” and the Next Big Thing
  A Marriage of Data and Caregivers Gives Dr. Atul Gawande Hope for Health Care
  Five Elements of Reform that Health Providers Would Rather Not Hear About
CHAPTER 1
Introduction
In the first edition of Big Data Now, the O’Reilly team tracked the birth
and early development of data tools and data science. Now, with this
second edition, we’re seeing what happens when big data grows up:
how it’s being applied, where it’s playing a role, and the consequences,
good and bad alike, of data’s ascendance.
We’ve organized the 2012 edition of Big Data Now into five areas:
• Getting Up to Speed with Big Data: essential information on the
structures and definitions of big data.
• Big Data Tools, Techniques, and Strategies: expert guidance for
turning big data theories into big data products.
• The Application of Big Data: examples of big data in action,
including a look at the downside of data.
• What to Watch for in Big Data: thoughts on how big data will
evolve and the role it will play across industries and domains.
• Big Data and Health Care: a special section exploring the possibilities
that arise when data and health care come together.
In addition to Big Data Now, you can stay on top of the latest data
developments with our ongoing analysis on O’Reilly Radar and
through our Strata coverage and events series.
CHAPTER 2
Getting Up to Speed with Big Data
What Is Big Data?
By Edd Dumbill
Big data is data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain value from this data,
you must choose an alternative way to process it.
The hot IT buzzword of 2012, big data has become viable as cost-effective
approaches have emerged to tame the volume, velocity, and
variability of massive data. Within this data lie valuable patterns and
information, previously hidden because of the amount of work re‐
quired to extract them. To leading corporations, such as Walmart or
Google, this power has been in reach for some time, but at fantastic
cost. Today’s commodity hardware, cloud architectures and open
source software bring big data processing into the reach of the less
well-resourced. Big data processing is eminently feasible for even the
small garage startups, who can cheaply rent server time in the cloud.
The value of big data to an organization falls into two categories: an‐
alytical use and enabling new products. Big data analytics can reveal
insights hidden previously by data too costly to process, such as peer
influence among customers, revealed by analyzing shoppers’ transac‐
tions and social and geographical data. Being able to process every
item of data in reasonable time removes the troublesome need for
sampling and promotes an investigative approach to data, in contrast
to the somewhat static nature of running predetermined reports.
The past decade’s successful web startups are prime examples of big
data used as an enabler of new products and services. For example, by
combining a large number of signals from a user’s actions and those
of their friends, Facebook has been able to craft a highly personalized
user experience and create a new kind of advertising business. It’s no
coincidence that the lion’s share of ideas and tools underpinning big
data have emerged from Google, Yahoo, Amazon, and Facebook.
The emergence of big data into the enterprise brings with it a necessary
counterpart: agility. Successfully exploiting the value in big data re‐
quires experimentation and exploration. Whether creating new prod‐
ucts or looking for ways to gain competitive advantage, the job calls
for curiosity and an entrepreneurial outlook.
What Does Big Data Look Like?
As a catch-all term, “big data” can be pretty nebulous, in the same way
that the term “cloud” covers diverse technologies. Input data to big
data systems could be chatter from social networks, web server logs,
traffic flow sensors, satellite imagery, broadcast audio streams, bank‐
ing transactions, MP3s of rock music, the content of web pages, scans
of government documents, GPS trails, telemetry from automobiles,
financial market data; the list goes on. Are these all really the same
thing?
To clarify matters, the three Vs of volume, velocity, and variety are
commonly used to characterize different aspects of big data. They’re
a helpful lens through which to view and understand the nature of the
data and the software platforms available to exploit them. Most prob‐
ably you will contend with each of the Vs to one degree or another.
Volume
The benefit gained from the ability to process large amounts of infor‐
mation is the main attraction of big data analytics. Having more data
beats out having better models: simple bits of math can be unreason‐
ably effective given large amounts of data. If you could run that forecast
taking into account 300 factors rather than 6, could you predict de‐
mand better? This volume presents the most immediate challenge to
conventional IT structures. It calls for scalable storage, and a distribut‐
ed approach to querying. Many companies already have large amounts
of archived data, perhaps in the form of logs, but not the capacity to
process it.
Assuming that the volumes of data are larger than those conventional
relational database infrastructures can cope with, processing options
break down broadly into a choice between massively parallel process‐
ing architectures — data warehouses or databases such as Green‐
plum — and Apache Hadoop-based solutions. This choice is often in‐
formed by the degree to which one of the other “Vs” — variety —
comes into play. Typically, data warehousing approaches involve pre‐
determined schemas, suiting a regular and slowly evolving dataset.
Apache Hadoop, on the other hand, places no conditions on the struc‐
ture of the data it can process.
At its core, Hadoop is a platform for distributing computing problems
across a number of servers. First developed and released as open source
by Yahoo, it implements the MapReduce approach pioneered by Goo‐
gle in compiling its search indexes. Hadoop’s MapReduce involves
distributing a dataset among multiple servers and operating on the
data: the “map” stage. The partial results are then recombined: the
“reduce” stage.
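The map and reduce stages described above can be illustrated with a minimal, single-process Python sketch. This is not Hadoop code: the function names (`map_stage`, `shuffle`, `reduce_stage`) and the word-count task are assumptions chosen for illustration, and the `partitions` list merely stands in for data spread across servers.

```python
from collections import defaultdict

def map_stage(partition):
    """Emit a (word, 1) pair for every word in one partition of the data."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle(mapped_partitions):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_stage(grouped):
    """Recombine the partial results: here, sum the counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical dataset, already split into two partitions ("servers").
partitions = [
    ["big data is big", "data moves fast"],
    ["big data is messy"],
]

word_counts = reduce_stage(shuffle(map_stage(p) for p in partitions))
```

Each call to `map_stage` needs only its own partition, which is what would let the work be spread over many machines; only the grouped intermediate results have to be brought together for the reduce step.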
To store data, Hadoop utilizes its own distributed filesystem, HDFS,
which makes data available to multiple computing nodes. A typical
Hadoop usage pattern involves three stages:
• loading data into HDFS,
• MapReduce operations, and
• retrieving results from HDFS.
This process is by nature a batch operation, suited for analytical or
non-interactive computing tasks. Because of this, Hadoop is not itself
a database or data warehouse solution, but can act as an analytical
adjunct to one.
One of the most well-known Hadoop users is Facebook, whose model
follows this pattern. A MySQL database stores the core data. This is
then reflected into Hadoop, where computations occur, such as cre‐
ating recommendations for you based on your friends’ interests. Face‐
book then transfers the results back into MySQL, for use in pages
served to users.
Velocity
The importance of data’s velocity — the increasing rate at which data
flows into an organization — has followed a similar pattern to that of
volume. Problems previously restricted to segments of industry are
now presenting themselves in a much broader setting. Specialized
companies such as financial traders have long turned systems that cope
with fast moving data to their advantage. Now it’s our turn.
Why is that so? The Internet and mobile era means that the way we
deliver and consume products and services is increasingly instrumen‐
ted, generating a data flow back to the provider. Online retailers are
able to compile large histories of customers’ every click and interaction:
not just the final sales. Those who are able to quickly utilize that in‐
formation, by recommending additional purchases, for instance, gain
competitive advantage. The smartphone era increases again the rate
of data inflow, as consumers carry with them a streaming source of
geolocated imagery and audio data.
It’s not just the velocity of the incoming data that’s the issue: it’s possible
to stream fast-moving data into bulk storage for later batch processing,
for example. The importance lies in the speed of the feedback loop,
taking data from input through to decision. A commercial from
IBM makes the point that you wouldn’t cross the road if all you had
was a five-minute old snapshot of traffic location. There are times
when you simply won’t be able to wait for a report to run or a Hadoop
job to complete.
Industry terminology for such fast-moving data tends to be either
“streaming data” or “complex event processing.” This latter term was
more established in product categories before streaming processing
data gained more widespread relevance, and seems likely to diminish
in favor of streaming.
There are two main reasons to consider streaming processing. The first
is when the input data are too fast to store in their entirety: in order to
keep storage requirements practical, some level of analysis must occur
as the data streams in. At the extreme end of the scale, the Large Ha‐
dron Collider at CERN generates so much data that scientists must
discard the overwhelming majority of it — hoping hard they’ve not
thrown away anything useful. The second reason to consider stream‐
ing is where the application mandates immediate response to the data.
Thanks to the rise of mobile applications and online gaming this is an
increasingly common situation.
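As a rough illustration of the first reason, analysis that happens as the data streams in, the following Python sketch keeps running statistics in constant memory using Welford's method. The `RunningStats` class and the `sensor_readings` generator are hypothetical names introduced here; a real streaming framework would look quite different.

```python
class RunningStats:
    """Track count, mean, and variance of a stream in constant memory (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new reading into the summary; the reading itself is not kept."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; returns 0.0 when fewer than two readings were seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

def sensor_readings():
    """Hypothetical stream: each reading is seen once and never stored."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in sensor_readings():
    stats.update(reading)
```

The summary occupies three numbers no matter how many readings arrive, which is the property that matters when the input is too fast to store in its entirety.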
Product categories for handling streaming data divide into established
proprietary products such as IBM’s InfoSphere Streams and the less-polished
and still emergent open source frameworks originating in the
web industry: Twitter’s Storm and Yahoo S4.
As mentioned above, it’s not just about input data. The velocity of a
system’s outputs can matter too. The tighter the feedback loop, the
greater the competitive advantage. The results might go directly into
a product, such as Facebook’s recommendations, or into dashboards
used to drive decision-making. It’s this need for speed, particularly on
the Web, that has driven the development of key-value stores and col‐
umnar databases, optimized for the fast retrieval of precomputed in‐
formation. These databases form part of an umbrella category known
as NoSQL, used when relational models aren’t the right fit.
Variety
Rarely does data present itself in a form perfectly ordered and ready
for processing. A common theme in big data systems is that the source
data is diverse, and doesn’t fall into neat relational structures. It could
be text from social networks, image data, a raw feed directly from a
sensor source. None of these things come ready for integration into an
application.
Even on the Web, where computer-to-computer communication
ought to bring some guarantees, the reality of data is messy. Different
browsers send different data, users withhold information, they may be
using differing software versions or vendors to communicate with you.
And you can bet that if part of the process involves a human, there will
be error and inconsistency.
A common use of big data processing is to take unstructured data and
extract ordered meaning, for consumption either by humans or as a
structured input to an application. One such example is entity reso‐
lution, the process of determining exactly what a name refers to. Is this
city London, England, or London, Texas? By the time your business
logic gets to it, you don’t want to be guessing.
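A very small sketch of entity resolution, under heavy simplifying assumptions, might pick between candidates by looking at the surrounding words. The `GAZETTEER` table, its context words, and the `resolve` function are all hypothetical; production systems use far richer signals.

```python
# Hypothetical gazetteer: each candidate lists words that tend to appear near it.
GAZETTEER = {
    "london": [
        {"id": "London, England", "context": {"england", "uk", "thames", "britain"}},
        {"id": "London, Texas", "context": {"texas", "tx"}},
    ],
}

def resolve(name, surrounding_text):
    """Return the candidate whose context words overlap most with the text, or None."""
    candidates = GAZETTEER.get(name.lower(), [])
    words = {w.strip(".,;:!?").lower() for w in surrounding_text.split()}
    best, best_score = None, 0
    for candidate in candidates:
        score = len(candidate["context"] & words)
        if score > best_score:
            best, best_score = candidate["id"], score
    return best  # None means "ambiguous or unknown": the caller should not guess

resolved_uk = resolve("London", "Flooding along the Thames closed roads in London.")
resolved_tx = resolve("London", "The rodeo in London, Texas, drew a record crowd.")
unresolved = resolve("London", "Shipping to London takes five days.")
```

Returning `None` when no context word matches keeps the ambiguity visible upstream, so the business logic never receives a silent guess.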
The process of moving from source data to processed application data
involves the loss of information. When you tidy up, you end up throw‐
ing stuff away. This underlines a principle of big data: when you can,
keep everything. There may well be useful signals in the bits you throw
away. If you lose the source data, there’s no going back.
Despite the popularity and well understood nature of relational data‐
bases, it is not the case that they should always be the destination for
data, even when tidied up. Certain data types suit certain classes of
database better. For instance, documents encoded as XML are most
versatile when stored in a dedicated XML store such as MarkLogic.
Social network relations are graphs by nature, and graph databases
such as Neo4J make operations on them simpler and more efficient.
Even where there’s not a radical data type mismatch, a disadvantage
of the relational database is the static nature of its schemas. In an agile,
exploratory environment, the results of computations will evolve with
the detection and extraction of more signals. Semi-structured NoSQL
databases meet this need for flexibility: they provide enough structure
to organize data, but do not require the exact schema of the data before
storing it.
In Practice
We have explored the nature of big data and surveyed the landscape
of big data from a high level. As usual, when it comes to deployment
there are dimensions to consider over and above tool selection.
Cloud or in-house?
The majority of big data solutions are now provided in three forms:
software-only, as an appliance, or cloud-based. The choice of route
will depend, among other things, on issues of data locality, privacy
and regulation, human resources, and project requirements. Many
organizations opt for a hybrid solution: using on-demand cloud
resources to supplement in-house deployments.
Big data is big
It is a fundamental fact that data that is too big to process conven‐
tionally is also too big to transport anywhere. IT is undergoing an
inversion of priorities: it’s the program that needs to move, not the
data. If you want to analyze data from the U.S. Census, it’s a lot easier
to run your code on Amazon’s web services platform, which hosts such
data locally, so you won’t spend time or money transferring it.
Even if the data isn’t too big to move, locality can still be an issue,
especially with rapidly updating data. Financial trading systems crowd
into data centers to get the fastest connection to source data, because
that millisecond difference in processing time equates to competitive
advantage.
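A back-of-the-envelope calculation shows why moving the program is often cheaper than moving the data. The Python sketch below estimates transfer time; the dataset size, the link speeds, and the 80% `efficiency` factor are assumed figures for illustration, not measurements.

```python
def transfer_days(size_terabytes, link_megabits_per_second, efficiency=0.8):
    """Estimate the days needed to move a dataset over a network link.

    efficiency is an assumed fraction of the nominal bandwidth actually achieved.
    """
    size_bits = size_terabytes * 1e12 * 8  # decimal terabytes to bits
    usable_bits_per_second = link_megabits_per_second * 1e6 * efficiency
    seconds = size_bits / usable_bits_per_second
    return seconds / 86_400  # seconds per day

# Assumed figures for illustration only.
days_100tb_over_1gbps = transfer_days(100, 1_000)
days_100tb_over_100mbps = transfer_days(100, 100)
```

Under these assumptions, 100 TB needs about 11.6 days on a 1 Gbit/s link and about 116 days on a 100 Mbit/s link, whereas the analysis code itself is usually a few megabytes.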
Big data is messy
It’s not all about infrastructure. Big data practitioners consistently report
that 80% of the effort involved in dealing with data is cleaning it
up in the first place, as Pete Warden observes in his Big Data Glossary:
“I probably spend more time turning messy source data into something
usable than I do on the rest of the data analysis process combined.”
Because of the high cost of data acquisition and cleaning, it’s worth
considering what you actually need to source yourself. Data market‐
places are a means of obtaining common data, and you are often able
to contribute improvements back. Quality can of course be variable,
but will increasingly be a benchmark on which data marketplaces
compete.
Culture
The phenomenon of big data is closely tied to the emergence of data
science, a discipline that combines math, programming, and scientific
instinct. Benefiting from big data means investing in teams with this
skillset, and surrounding them with an organizational willingness to
understand and use data for advantage.
In his report, “Building Data Science Teams,” D.J. Patil characterizes
data scientists as having the following qualities:
• Technical expertise: the best data scientists typically have deep
expertise in some scientific discipline.
• Curiosity: a desire to go beneath the surface and discover and
distill a problem down into a very clear set of hypotheses that can
be tested.
• Storytelling: the ability to use data to tell a story and to be able to
communicate it effectively.
• Cleverness: the ability to look at a problem in different, creative
ways.
The far-reaching nature of big data analytics projects can have uncomfortable
aspects: data must be broken out of silos in order to be
mined, and the organization must learn how to communicate and interpret
the results of analysis.
Those skills of storytelling and cleverness are the gateway factors that
ultimately dictate whether the benefits of analytical labors are absor‐
bed by an organization. The art and practice of visualizing data is be‐
coming ever more important in bridging the human-computer gap to
mediate analytical insight in a meaningful way.
Know where you want to go
Finally, remember that big data is no panacea. You can find patterns
and clues in your data, but then what? Christer Johnson, IBM’s leader
for advanced analytics in North America, gives this advice to busi‐
nesses starting out with big data: first, decide what problem you want
to solve.
If you pick a real business problem, such as how you can change your
advertising strategy to increase spend per customer, it will guide your
implementation. While big data work benefits from an enterprising
spirit, it also benefits strongly from a concrete goal.
What Is Apache Hadoop?
By Edd Dumbill
Apache Hadoop has been the driving force behind the growth of the
big data industry. You’ll hear it mentioned often, along with associated
technologies such as Hive and Pig. But what does it do, and why do
you need all its strangely named friends, such as Oozie, Zookeeper,
and Flume?
Hadoop brings the ability to cheaply process large amounts of data,
regardless of its structure. By large, we mean from 10-100 gigabytes
and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel at
processing structured data and can store massive amounts of data,
though at a cost: This requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes data
warehouses unsuited for agile exploration of massive heterogeneous
data. The amount of effort required to warehouse data often means
that valuable data sources in organizations are never mined. This is
where Hadoop can make a big difference.
This article examines the components of the Hadoop ecosystem and
explains the functions of each.
The Core of Hadoop: MapReduce
Created at Google in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most
of today’s big data processing. In addition to Hadoop, you’ll find Map‐
Reduce inside MPP and NoSQL databases, such as Vertica or Mon‐
goDB.
The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple nodes.
Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive computing
arrays.
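The idea of dividing a query and running it in parallel can be sketched in Python with a thread pool standing in for cluster nodes. Everything here is an assumption made for illustration: the `orders` data, the query (average value of orders over 100), and the helper names `split`, `partial_query`, and `combine`.

```python
from concurrent.futures import ThreadPoolExecutor

def split(records, parts):
    """Divide the dataset into roughly equal chunks, one per worker."""
    size = -(-len(records) // parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def partial_query(chunk):
    """Run the query on one chunk: total and count of order values over 100."""
    large = [value for value in chunk if value > 100]
    return sum(large), len(large)

def combine(partials):
    """Merge the per-chunk answers into the answer for the whole dataset."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0

orders = [40, 120, 310, 95, 150, 220, 80, 500]  # hypothetical order values

# Threads stand in for nodes; each worker sees only its own chunk.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_query, split(orders, 4)))

average_large_order = combine(partials)
```

Note that each worker returns a sum and a count rather than its own average: averaging the per-chunk averages would give the wrong answer when chunks contain different numbers of matching records.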
At its core, Hadoop is an open source MapReduce implementation.
Funded by Yahoo, it emerged in 2006 and, according to its creator
Doug Cutting, reached “web scale” capability in early 2008.
As the Hadoop project matured, it acquired further components to
enhance its usability and functionality. The name “Hadoop” has come
to represent this entire ecosystem. There are parallels with the emer‐
gence of Linux: The name refers strictly to the Linux kernel, but it has
gained acceptance as referring to a complete operating system.
Hadoop’s Lower Levels: HDFS and MapReduce
Above, we discussed the ability of MapReduce to distribute computa‐
tion over multiple servers. For that computation to take place, each
server must have access to the data. This is the role of HDFS, the Ha‐
doop Distributed File System.
HDFS and MapReduce are robust. Servers in a Hadoop cluster can fail
and not abort the computation process. HDFS ensures data is repli‐
cated with redundancy across the cluster. On completion of a calcu‐
lation, a node will write its results back into HDFS.
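To see why replication lets a computation survive server failures, consider this toy model in Python. It is only a sketch: the round-robin placement in `place_blocks` and the assumed replication factor of three are simplifications, and HDFS’s real placement policy is more sophisticated.

```python
def place_blocks(block_ids, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin (a toy policy).

    Assumes replicas <= len(nodes), so the chosen nodes are distinct.
    """
    placement = {}
    for index, block in enumerate(block_ids):
        start = index % len(nodes)
        placement[block] = {nodes[(start + k) % len(nodes)] for k in range(replicas)}
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical cluster
placement = place_blocks(range(10), nodes, replicas=3)

survives_two_failures = readable_blocks(placement, ["node-a", "node-b"])
survives_three_failures = readable_blocks(placement, ["node-a", "node-b", "node-c"])
```

With three copies of every block, any two nodes can fail without losing data; in this toy layout a third failure begins to make some blocks unreadable.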
There are no restrictions on the data that HDFS stores. Data may be
unstructured and schemaless. By contrast, relational databases require
that data be structured and schemas be defined before storing the data.
With HDFS, making sense of the data is the responsibility of the de‐
veloper’s code.
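The following Python sketch illustrates what it means for the developer’s code to make sense of the data at read time. The sample log lines and the `LOG_PATTERN` regular expression are hypothetical; the point is only that structure is applied when the data is read, not when it is stored.

```python
import re

# Raw lines are stored exactly as they arrived; no schema was declared up front.
RAW_LINES = [
    '203.0.113.7 - - [24/Oct/2012:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120',
    'malformed line from a misbehaving client',
    '198.51.100.4 - - [24/Oct/2012:10:15:40 +0000] "POST /cart HTTP/1.1" 500 312',
]

# The structure lives in the reader's code, in this hypothetical pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse(line):
    """Apply structure at read time; return None for lines that do not fit."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

records = [r for r in map(parse, RAW_LINES) if r is not None]
server_errors = [r["path"] for r in records if r["status"] >= 500]
```

Because the malformed line is still in storage, a later and smarter parser could recover it, which would not be possible had it been rejected at load time.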
Programming Hadoop at the MapReduce level is a case of working
with the Java APIs, and manually loading data files into HDFS.
Improving Programmability: Pig and Hive
Working directly with Java APIs can be tedious and error prone. It also
restricts usage of Hadoop to Java programmers. Hadoop offers two
solutions for making Hadoop programming easier.
• Pig is a programming language that simplifies the common tasks
of working with Hadoop: loading data, expressing transforma‐
tions on the data, and storing the final results. Pig’s built-in oper‐
ations can make sense of semi-structured data, such as log files,
and the language is extensible using Java to add support for custom
data types and transformations.
• Hive enables Hadoop to operate as a data warehouse. It superim‐
poses structure on data in HDFS and then permits queries over
the data using a familiar SQL-like syntax. As with Pig, Hive’s core
capabilities are extensible.
Choosing between Hive and Pig can be confusing. Hive is more suit‐
able for data warehousing tasks, with predominantly static structure
and the need for frequent analysis. Hive’s closeness to SQL makes it an
ideal point of integration between Hadoop and other business intelli‐
gence tools.
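The idea of superimposing structure on raw data and then querying it with SQL can be shown by analogy in Python. This sketch uses the standard-library `sqlite3` module as a stand-in and is not Hive: the `page_views` table, its columns, and the sample rows are hypothetical, and unlike Hive the rows are copied into the database rather than read in place.

```python
import csv
import io
import sqlite3

# Stand-in for a raw delimited file sitting in a distributed filesystem.
RAW_FILE = io.StringIO(
    "2012-10-24,/home,alice\n"
    "2012-10-24,/pricing,bob\n"
    "2012-10-25,/home,carol\n"
    "2012-10-25,/home,alice\n"
)

connection = sqlite3.connect(":memory:")
# Declare the structure we want to superimpose on the raw rows.
connection.execute("CREATE TABLE page_views (day TEXT, path TEXT, visitor TEXT)")
connection.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)", csv.reader(RAW_FILE)
)

# Once the structure exists, analysis is an ordinary SQL-style query.
views_per_path = dict(
    connection.execute(
        "SELECT path, COUNT(*) FROM page_views GROUP BY path ORDER BY path"
    ).fetchall()
)
connection.close()
```

The query itself would look familiar to anyone who knows SQL, which is the property that makes a SQL-like layer a natural integration point for business intelligence tools.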
Pig gives the developer more agility for the exploration of large data‐
sets, allowing the development of succinct scripts for transforming
data flows for incorporation into larger applications. Pig is a thinner
layer over Hadoop than Hive, and its main advantage is to drastically
cut the amount of code needed compared to direct use of Hadoop’s
Java APIs. As such, Pig’s intended audience remains primarily the
software developer.
Improving Data Access: HBase, Sqoop, and Flume
At its heart, Hadoop is a batch-oriented system. Data are loaded into
HDFS, processed, and then retrieved. This is somewhat of a computing
throwback, and often, interactive and random access to data is re‐
quired.
Enter HBase, a column-oriented database that runs on top of HDFS.
Modeled after Google’s BigTable, the project’s goal is to host billions
of rows of data for rapid access. MapReduce can use HBase as both a
source and a destination for its computations, and Hive and Pig can
be used in combination with HBase.
In order to grant random access to the data, HBase does impose a few
restrictions: Hive performance with HBase is 4-5 times slower than
with plain HDFS, and the maximum amount of data you can store in
HBase is approximately a petabyte, versus HDFS’ limit of over 30PB.
HBase is ill-suited to ad-hoc analytics and more appropriate for inte‐
grating big data as part of a larger application. Use cases include log‐
ging, counting, and storing time-series data.
The Hadoop Bestiary
Ambari      Deployment, configuration and monitoring
Flume       Collection and import of log and event data
HBase       Column-oriented database scaling to billions of rows
HCatalog    Schema and data type sharing over Pig, Hive and MapReduce
HDFS        Distributed redundant file system for Hadoop
Hive        Data warehouse with SQL-like access
Mahout      Library of machine learning and data mining algorithms
MapReduce   Parallel computation on server clusters
Pig         High-level programming language for Hadoop computations
Oozie       Orchestration and workflow management
Sqoop       Imports data from relational databases
Whirr       Cloud-agnostic deployment of clusters
Zookeeper   Configuration management and coordination
Getting data in and out
Improved interoperability with the rest of the data world is provided
by Sqoop and Flume. Sqoop is a tool designed to import data from
relational databases into Hadoop, either directly into HDFS or into
Hive. Flume is designed to import streaming flows of log data directly
into HDFS.
Hive’s SQL friendliness means that it can be used as a point of integration
with the vast universe of database tools capable of making
connections via JDBC or ODBC database drivers.
Coordination and Workflow: Zookeeper and Oozie
With a growing family of services running as part of a Hadoop cluster,
there’s a need for coordination and naming services. As computing
nodes can come and go, members of the cluster need to synchronize
with each other, know where to access services, and know how they
should be configured. This is the purpose of Zookeeper.
Production systems utilizing Hadoop can often contain complex pipelines
of transformations, each depending on others. For
example, the arrival of a new batch of data will trigger an import, which
must then trigger recalculations in dependent datasets. The Oozie
component provides features to manage the workflow and dependen‐
cies, removing the need for developers to code custom solutions.
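The dependency management described above amounts to running steps in an order that respects their prerequisites. The Python sketch below shows the idea with the standard-library `graphlib` module; it does not use Oozie, and the `PIPELINE` steps and the `run_pipeline` helper are hypothetical names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "import_batch": set(),
    "clean_records": {"import_batch"},
    "update_customer_totals": {"clean_records"},
    "update_product_totals": {"clean_records"},
    "refresh_dashboard": {"update_customer_totals", "update_product_totals"},
}

def run_pipeline(pipeline, run_step):
    """Run every step only after all of its dependencies have finished."""
    completed = []
    for step in TopologicalSorter(pipeline).static_order():
        run_step(step)
        completed.append(step)
    return completed

# A do-nothing run_step is enough to observe the ordering.
execution_order = run_pipeline(PIPELINE, run_step=lambda step: None)
```

The two `update_*` steps do not depend on each other, so a workflow engine is free to run them in either order or in parallel; only the constraints in the graph are guaranteed.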
Management and Deployment: Ambari and Whirr
One of the commonly added features incorporated into Hadoop by
distributors such as IBM and Microsoft is monitoring and adminis‐
tration. Though in an early stage, Ambari aims to add these features
to the core Hadoop project. Ambari is intended to help system ad‐
ministrators deploy and configure Hadoop, upgrade clusters, and
monitor services. Through an API, it may be integrated with other
system management tools.
Though not strictly part of Hadoop, Whirr is a highly complementary
component. It offers a way of running services, including Hadoop, on
cloud platforms. Whirr is cloud neutral and currently supports the
Amazon EC2 and Rackspace services.
Machine Learning: Mahout
Every organization’s data are diverse and particular to their needs.
However, there is much less diversity in the kinds of analyses per‐
formed on that data. The Mahout project is a library of Hadoop im‐
plementations of common analytical computations. Use cases include
user collaborative filtering, user recommendations, clustering, and
classification.
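As a hint of what a recommendation computation involves, here is a single-machine Python sketch of item co-occurrence scoring, one simple form of collaborative filtering. It does not use Mahout’s API; the `HISTORIES` data and the `cooccurrence` and `recommend` functions are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: user -> set of items.
HISTORIES = {
    "ann": {"tent", "stove", "lantern"},
    "ben": {"tent", "stove"},
    "cat": {"tent", "lantern", "boots"},
    "dan": {"stove", "kettle"},
}

def cooccurrence(histories):
    """Count how often each ordered pair of items appears in the same history."""
    counts = Counter()
    for items in histories.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(user, histories, top_n=2):
    """Score unseen items by co-occurrence with the items the user already has."""
    counts = cooccurrence(histories)
    owned = histories[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Sort by score, then name, so ties are broken deterministically.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]

suggestions_for_ben = recommend("ben", HISTORIES)
```

Counting pairs within each history is independent work per user, which is why this kind of computation lends itself to a distributed implementation.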
Using Hadoop
Normally, you will use Hadoop in the form of a distribution. Much as
with Linux before it, vendors integrate and test the components of the
Apache Hadoop ecosystem and add in tools and administrative fea‐
tures of their own.
Though not per se a distribution, a managed cloud installation of Ha‐
doop’s MapReduce is also available through Amazon’s Elastic MapRe‐
duce service.
Why Big Data Is Big: The Digital Nervous
System
By Edd Dumbill
Where does all the data in “big data” come from? And why isn’t big
data just a concern for companies such as Facebook and Google? The
answer is that the web companies are the forerunners. Driven by social,
mobile, and cloud technology, there is an important transition taking
place, leading us all to the data-enabled world that those companies
inhabit today.
From Exoskeleton to Nervous System
Until a few years ago, the main function of computer systems in society,
and business in particular, was as a digital support system. Applica‐
tions digitized existing real-world processes, such as word-processing,
payroll, and inventory. These systems had interfaces back out to the
real world through stores, people, telephone, shipping, and so on. The
now-quaint phrase “paperless office” alludes to this transfer of
preexisting paper processes into the computer. These computer systems
formed a digital exoskeleton, supporting a business in the real world.
The arrival of the Internet and the Web has added a new dimension,
bringing in an era of entirely digital business. Customer interaction,
payments, and often product delivery can exist entirely within com‐
puter systems. Data doesn’t just stay inside the exoskeleton any more,
but is a key element in the operation. We’re in an era where business
and society are acquiring a digital nervous system.
As my sketch below shows, an organization with a digital nervous sys‐
tem is characterized by a large number of inflows and outflows of data,
a high level of networking, both internally and externally, increased
data flow, and consequent complexity.
This transition is why big data is important. Techniques developed to
deal with interlinked, heterogeneous data acquired by massive web
companies will be our main tools as the rest of us transition to
digital-native operation. We see early examples of this, from catching
fraud in financial transactions to debugging and improving the hiring
process in HR; and almost everybody already pays attention to the massive
flow of social network information concerning them.
Charting the Transition
As technology has progressed within business, each step taken has
resulted in a leap in data volume. People looking at big data now
might reasonably ask why, when their business isn’t Google or
Facebook, big data applies to them.
The answer lies in the ability of web businesses to conduct 100% of
their activities online. Their digital nervous system easily stretches
from the beginning to the end of their operations. If you have factories,
shops, and other parts of the real world within your business, you’ve
further to go in incorporating them into the digital nervous system.
But “further to go” doesn’t mean it won’t happen. The drive of the Web,
social media, mobile, and the cloud is bringing more of each business
into a data-driven world. In the UK, the Government Digital Service
is unifying the delivery of services to citizens. The results are a radical
improvement of citizen experience, and for the first time many de‐
partments are able to get a real picture of how they’re doing. For any
retailer, companies such as Square, American Express, and Four‐
square are bringing payments into a social, responsive data ecosystem,
liberating that information from the silos of corporate accounting.
What does it mean to have a digital nervous system? The key trait is
to make an organization’s feedback loop entirely digital. That is, a
direct connection from sensing and monitoring inputs through to
product outputs. That’s straightforward on the Web. It’s getting
increasingly easy in retail. Perhaps the biggest shifts in our world will come as
sensors and robotics bring the advantages web companies have now
to domains such as industry, transport, and the military.
The reach of the digital nervous system has grown steadily over the
past 30 years, and each step brings gains in agility and flexibility, along
with an order of magnitude more data. First, from specific application
programs to general business use with the PC. Then, direct interaction
over the Web. Mobile adds awareness of time and place, along with
instant notification. The next step, to cloud, breaks down data silos
and adds storage and compute elasticity through cloud computing.
Now, we’re integrating smart agents, able to act on our behalf, and
connections to the real world through sensors and automation.
Coming, Ready or Not
If you’re not contemplating the advantages of taking more of your op‐
eration digital, you can bet your competitors are. As Marc Andreessen
wrote last year, “software is eating the world.” Everything is becoming
programmable.
It’s this growth of the digital nervous system that makes the techniques
and tools of big data relevant to us today. The challenges of massive
data flows, and the erosion of hierarchy and boundaries, will lead us
to the statistical approaches, systems thinking, and machine learning
we need to cope with the future we’re inventing.
CHAPTER 3
Big Data Tools, Techniques,
and Strategies
Designing Great Data Products
By Jeremy Howard, Margit Zwemer, and Mike Loukides
In the past few years, we’ve seen many data products based on predic‐
tive modeling. These products range from weather forecasting to rec‐
ommendation engines to services that predict airline flight times more
accurately than the airlines themselves. But these products are still just
making predictions, rather than asking what action they want some‐
one to take as a result of a prediction. Prediction technology can be
interesting and mathematically elegant, but we need to take the next
step. The technology exists to build data products that can revolu‐
tionize entire industries. So, why aren’t we building them?
To jump-start this process, we suggest a four-step approach that has
already transformed the insurance industry. We call it the Drivetrain
Approach, inspired by the emerging field of self-driving vehicles. En‐
gineers start by defining a clear objective: They want a car to drive safely
from point A to point B without human intervention. Great predictive
modeling is an important part of the solution, but it no longer stands
on its own; as products become more sophisticated, it disappears into
the plumbing. Someone using Google’s self-driving car is completely
unaware of the hundreds (if not thousands) of models and the peta‐
bytes of data that make it work. But as data scientists build increasingly
sophisticated products, they need a systematic design approach. We
don’t claim that the Drivetrain Approach is the best or only method;
our goal is to start a dialog within the data science and business com‐
munities to advance our collective vision.
Objective-based Data Products
We are entering the era of data as drivetrain, where we use data not
just to generate more data (in the form of predictions), but use data to
produce actionable outcomes. That is the goal of the Drivetrain Ap‐
proach. The best way to illustrate this process is with a familiar data
product: search engines. Back in 1997, AltaVista was king of the algo‐
rithmic search world. While their models were good at finding relevant
websites, the answer the user was most interested in was often buried
on page 100 of the search results. Then, Google came along and trans‐
formed online search by beginning with a simple question: What is
the user’s main objective in typing in a search query?
The four steps in the Drivetrain Approach.
Google realized that the objective was to show the most relevant search
result; for other companies, it might be increasing profit, improving
the customer experience, finding the best path for a robot, or balancing
the load in a data center. Once we have specified the goal, the second
step is to specify what inputs of the system we can control, the levers
we can pull to influence the final outcome. In Google’s case, they could
control the ranking of the search results. The third step was to consider
what new data they would need to produce such a ranking; they real‐
ized that the implicit information regarding which pages linked to
which other pages could be used for this purpose. Only after these first
three steps do we begin thinking about building the predictive mod‐
els. Our objective and available levers, what data we already have and
what additional data we will need to collect, determine the models we
can build. The models will take both the levers and any uncontrollable
variables as their inputs; the outputs from the models can be combined
to predict the final state for our objective.
Step 4 of the Drivetrain Approach for Google is now part of tech his‐
tory: Larry Page and Sergey Brin invented the graph traversal algo‐
rithm PageRank and built an engine on top of it that revolutionized
search. But you don’t have to invent the next PageRank to build a great
data product. We will show a systematic approach to step 4 that doesn’t
require a PhD in computer science.
The Model Assembly Line: A Case Study of Optimal
Decisions Group
Optimizing for an actionable outcome over the right predictive models
can be a company’s most important strategic decision. For an insur‐
ance company, policy price is the product, so an optimal pricing model
is to them what the assembly line is to automobile manufacturing.
Insurers have centuries of experience in prediction, but as recently as
10 years ago, the insurance companies often failed to make optimal
business decisions about what price to charge each new customer.
Their actuaries could build models to predict a customer’s likelihood
of being in an accident and the expected value of claims. But those
models did not solve the pricing problem, so the insurance companies
would set a price based on a combination of guesswork and market
studies.
This situation changed in 1999 with a company called Optimal Deci‐
sions Group (ODG). ODG approached this problem with an early use
of the Drivetrain Approach and a practical take on step 4 that can be
applied to a wide range of problems. They began by defining the ob‐
jective that the insurance company was trying to achieve: setting a price
that maximizes the net-present value of the profit from a new customer
over a multi-year time horizon, subject to certain constraints such as
maintaining market share. From there, they developed an optimized
pricing process that added hundreds of millions of dollars to the in‐
surers’ bottom lines. [Note: Co-author Jeremy Howard founded ODG.]
ODG identified which levers the insurance company could control:
what price to charge each customer, what types of accidents to cover,
how much to spend on marketing and customer service, and how to
react to their competitors’ pricing decisions. They also considered in‐
puts outside of their control, like competitors’ strategies, macroeco‐
nomic conditions, natural disasters, and customer “stickiness.” They
considered what additional data they would need to predict a cus‐
tomer’s reaction to changes in price. It was necessary to build this da‐
taset by randomly changing the prices of hundreds of thousands of
policies over many months. While the insurers were reluctant to con‐
duct these experiments on real customers, as they’d certainly lose some
customers as a result, they were swayed by the huge gains that opti‐
mized policy pricing might deliver. Finally, ODG started to design the
models that could be used to optimize the insurer’s profit.
Drivetrain Step 4: The Model Assembly Line. Picture a Model Assembly
Line for data products that transforms the raw data into an actionable
outcome. The Modeler takes the raw data and converts it into slightly
more refined predicted data.
The first component of ODG’s Modeler was a model of price elasticity
(the probability that a customer will accept a given price) for new pol‐
icies and for renewals. The price elasticity model is a curve of price
versus the probability of the customer accepting the policy conditional
on that price. This curve moves from almost certain acceptance at very
low prices to almost never at high prices.
The second component of ODG’s Modeler related price to the insur‐
ance company’s profit, conditional on the customer accepting this
price. The profit for a very low price will be in the red by the value of
expected claims in the first year, plus any overhead for acquiring and
servicing the new customer. Multiplying these two curves creates a
final curve that shows price versus expected profit (see Expected Profit
figure, below). The final curve has a clearly identifiable local maximum
that represents the best price to charge a customer for the first year.
Expected profit.
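As a minimal sketch of how the two Modeler curves combine, the Python below multiplies an acceptance curve by a profit-if-accepted line and scans a grid of prices for the peak. The logistic shape, the function names, and every number (midpoint, steepness, first-year cost) are illustrative assumptions, not ODG’s actual models.

```python
import math

def acceptance_probability(price, midpoint=500.0, steepness=0.02):
    """Assumed logistic curve: near 1 at very low prices, near 0 at high prices."""
    return 1.0 / (1.0 + math.exp(steepness * (price - midpoint)))

def profit_if_accepted(price, first_year_cost=300.0):
    """Assumed profit given acceptance: price minus expected claims and overhead."""
    return price - first_year_cost

def expected_profit(price):
    """Product of the two curves: P(accept | price) * profit given acceptance."""
    return acceptance_probability(price) * profit_if_accepted(price)

def best_price(prices):
    """Return the price on the grid with the highest expected profit."""
    return max(prices, key=expected_profit)

price_grid = range(0, 1001, 10)   # candidate first-year prices (hypothetical)
best = best_price(price_grid)
print(best, round(expected_profit(best), 2))
```

With these made-up parameters the expected profit is negative at very low prices, rises to a single peak, and falls away again as acceptance collapses, which is the shape the text describes.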
ODG also built models for customer retention. These models predic‐
ted whether customers would renew their policies in one year, allowing
for changes in price and willingness to jump to a competitor. These
additional models allow the annual models to be combined to predict
profit from a new customer over the next five years.
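One way to picture how annual models and retention models might be combined is an expected net-present-value calculation. The sketch below is only an assumption about the arithmetic: the function name, the 5% discount rate, and all profit and renewal figures are hypothetical and do not come from ODG.

```python
def five_year_value(annual_profits, renewal_probs, discount_rate=0.05):
    """Expected net present value of a new customer.

    annual_profits[t] is the profit in year t if the customer is still insured;
    renewal_probs[t] is the chance of renewing from year t into year t + 1.
    """
    value, p_active = 0.0, 1.0
    for year, profit in enumerate(annual_profits):
        # Weight each year's profit by the chance the customer is still
        # with us, then discount it back to the present.
        value += p_active * profit / (1.0 + discount_rate) ** year
        if year < len(renewal_probs):
            p_active *= renewal_probs[year]
    return value

# Hypothetical numbers: a loss-making first year recovered through renewals.
profits = [-50.0, 120.0, 120.0, 120.0, 120.0]
renewals = [0.8, 0.85, 0.9, 0.9]
print(round(five_year_value(profits, renewals), 2))
```

In a fuller version, each renewal probability would itself come from a retention model that depends on the price charged that year.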
This new suite of models is not a final answer because it only identifies
the outcome for a given set of inputs. The next machine on the
assembly line is a Simulator, which lets ODG ask the “what if” questions
to see how the levers affect the distribution of the final outcome. The
expected profit curve is just a slice of the surface of possible outcomes.
To build that entire surface, the Simulator runs the models over a wide
range of inputs. The operator can adjust the input levers to answer
specific questions like, “What will happen if our company offers the
customer a low teaser price in year one but then raises the premiums
in year two?” They can also explore how the distribution of profit is
shaped by the inputs outside of the insurer’s control: “What if the
economy crashes and the customer loses his job? What if a 100-year
flood hits his home? If a new competitor enters the market and our
company does not react, what will be the impact on our bottom line?”
Because the simulation is at a per-policy level, the insurer can view the
impact of a given set of price changes on revenue, market share, and
other metrics over time.
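The Simulator idea can be sketched as evaluating a model at every combination of levers and uncontrollable scenarios. The Python below is a deterministic toy under stated assumptions: the two-year profit model, the logistic acceptance curve, and the “economy shift” parameter are all hypothetical, and a real simulator would typically sample distributions rather than walk a small grid.

```python
import math
from itertools import product

def accept_prob(price, midpoint=500.0, steepness=0.02):
    """Assumed logistic acceptance curve (hypothetical parameters)."""
    return 1.0 / (1.0 + math.exp(steepness * (price - midpoint)))

def two_year_profit(year1_price, year2_price, annual_cost=300.0, economy_shift=0.0):
    """Expected profit over two years for one policy.

    economy_shift is an uncontrollable input: a downturn lowers the price
    at which customers walk away (modeled here as a shifted midpoint).
    """
    midpoint = 500.0 + economy_shift
    p_buy = accept_prob(year1_price, midpoint)
    p_renew = accept_prob(year2_price, midpoint)
    year1 = p_buy * (year1_price - annual_cost)
    year2 = p_buy * p_renew * (year2_price - annual_cost)
    return year1 + year2

def simulate(year1_prices, year2_prices, economy_shifts):
    """Evaluate the model at every combination of levers and scenarios."""
    return {
        (p1, p2, shift): two_year_profit(p1, p2, economy_shift=shift)
        for p1, p2, shift in product(year1_prices, year2_prices, economy_shifts)
    }

surface = simulate(
    year1_prices=[250, 350, 450],      # includes a below-cost teaser price
    year2_prices=[450, 550],
    economy_shifts=[0.0, -100.0],      # normal economy vs. downturn
)
teaser = surface[(250, 550, 0.0)]      # "what if" teaser, then raise premiums
print(round(teaser, 2))
```

The dictionary returned by `simulate` plays the role of the surface of possible outcomes: each key is one “what if” question and each value is the model’s answer.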
The Simulator’s result is fed to an Optimizer, which takes the surface
of possible outcomes and identifies the highest point. The Optimizer
not only finds the best outcomes, it can also identify catastrophic
outcomes and show how to avoid them. There are many different
optimization techniques to choose from (see “Optimization in the Real
World,” below), but it is a well-understood field with robust and
accessible solutions. ODG’s competitors use different techniques to
find an optimal price, but they are shipping the same overall data
product. What matters is that using a Drivetrain Approach combined
with a Model Assembly Line bridges the gap between predictive mod‐
els and actionable outcomes. Irfan Ahmed of CloudPhysics provides
a good taxonomy of predictive modeling that describes this entire as‐
sembly line process:
When dealing with hundreds or thousands of individual components
models to understand the behavior of the full-system, a search has to
be done. I think of this as a complicated machine (full-system) where
the curtain is withdrawn and you get to model each significant part
of the machine under controlled experiments and then simulate the
interactions. Note here the different levels: models of individual com‐
ponents, tied together in a simulation given a set of inputs, iterated
through over different input sets in a search optimizer.
Optimization in the Real World
Optimization is a classic problem that has been studied by everyone from
Newton and Gauss to mathematicians and engineers in the present
day. Many optimization procedures are iterative; they can be thought
of as taking a small step, checking our elevation and then taking another
small uphill step until we reach a point from which there is no direction
in which we can climb any higher. The danger in this hill-climbing
approach is that if the steps are too small, we may get stuck at one of
the many local maxima in the foothills, which will not tell us the best
set of controllable inputs. There are many techniques to avoid this
problem, some based on statistics and spreading our bets widely, and
others based on systems seen in nature, like biological evolution or the
cooling of atoms in glass.
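The hill-climbing picture, and one simple way of “spreading our bets widely,” can be sketched as follows. The one-dimensional landscape, the step size, and the use of random restarts are assumptions chosen for illustration; random restarts are only one of the many techniques alluded to above.

```python
import random

def hill_climb(f, start, step, lo, hi, max_iters=10_000):
    """Take small uphill steps until neither neighbor is higher."""
    x = start
    for _ in range(max_iters):
        neighbors = [n for n in (x - step, x + step) if lo <= n <= hi]
        best = max(neighbors, key=f)
        if f(best) <= f(x):
            return x          # no uphill direction left: a (possibly local) maximum
        x = best
    return x

def random_restart(f, lo, hi, step, restarts, seed=0):
    """Spread bets widely: climb from several random starts, keep the best peak."""
    rng = random.Random(seed)
    peaks = [hill_climb(f, rng.uniform(lo, hi), step, lo, hi) for _ in range(restarts)]
    return max(peaks, key=f)

def bumpy(x):
    """Hypothetical landscape: a foothill near x=2 and a taller peak near x=8."""
    return max(0.0, 3 - (x - 2) ** 2) + max(0.0, 10 - (x - 8) ** 2)

stuck = hill_climb(bumpy, start=1.0, step=0.1, lo=0.0, hi=10.0)
found = random_restart(bumpy, lo=0.0, hi=10.0, step=0.1, restarts=20)
print(round(stuck, 1), round(found, 1))
```

A single climb that starts in the foothills stops on the foothill; climbing from many random starting points and keeping the best result finds the taller peak.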
Optimization is a process we are all familiar with in our daily lives, even
if we have never used algorithms like gradient descent or simulated
annealing. A great image for optimization in the real world comes up
in a recent TechZing podcast with the co-founders of data-mining
competition platform Kaggle. One of the authors of this paper was
explaining an iterative optimization technique, and the host says, “So,
in a sense Jeremy, your approach was like that of doing a startup, which
is just get something out there and iterate and iterate and iterate.” The
takeaway, whether you are a tiny startup or a giant insurance company,
is that we unconsciously use optimization whenever we decide how to
get to where we want to go.
Drivetrain Approach to Recommender Systems
Let’s look at how we could apply this process to another industry:
marketing. We begin by applying the Drivetrain Approach to a familiar
example, recommendation engines, and then build this up into an
entire optimized marketing strategy.
Recommendation engines are a familiar example of a data product
based on well-built predictive models that do not achieve an optimal
objective. The current algorithms predict what products a customer
will like, based on purchase history and the histories of similar cus‐
tomers. A company like Amazon represents every purchase that has
ever been made as a giant sparse matrix, with customers as the rows
and products as the columns. Once they have the data in this format,
data scientists apply some form of collaborative filtering to “fill in the
matrix.” For example, if customer A buys products 1 and 10, and cus‐
tomer B buys products 1, 2, 4, and 10, the engine will recommend that
A buy 2 and 4. These models are good at predicting whether a customer
will like a given product, but they often suggest products that the cus‐
tomer already knows about or has already decided not to buy. Amazon’s
recommendation engine is probably the best one out there, but it’s easy
to get it to show its warts. Here is a screenshot of the “Customers Who
Bought This Item Also Bought” feed on Amazon from a search for the
latest book in Terry Pratchett’s “Discworld” series:
All of the recommendations are for other books in the same series, but
it’s a good assumption that a customer who searched for “Terry Pratch‐
ett” is already aware of these books. There may be some unexpected
recommendations on pages 2 through 14 of the feed, but how many
customers are going to bother clicking through?
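The “fill in the matrix” example given earlier (customer A buys products 1 and 10, customer B buys 1, 2, 4, and 10) can be reproduced with a deliberately naive sketch. The function below is an assumed, simplified form of user-based collaborative filtering in which any shared purchase counts as similarity; production systems use weighted similarity measures over far larger matrices.

```python
def recommend(purchases, customer):
    """Recommend products bought by customers whose baskets overlap with ours.

    A deliberately naive, assumed form of user-based collaborative filtering:
    any shared purchase makes another customer "similar".
    """
    own = purchases[customer]
    suggestions = set()
    for other, basket in purchases.items():
        if other != customer and own & basket:   # at least one product in common
            suggestions |= basket - own          # fill in the missing matrix cells
    return sorted(suggestions)

# Sparse purchase matrix stored as customer -> set of product ids.
purchases = {
    "A": {1, 10},
    "B": {1, 2, 4, 10},
}
print(recommend(purchases, "A"))   # [2, 4]
```

Note that nothing in this logic asks whether the customer already knows about products 2 and 4, which is exactly the weakness discussed above.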
Instead, let’s design an improved recommendation engine using the
Drivetrain Approach, starting by reconsidering our objective. The ob‐
jective of a recommendation engine is to drive additional sales by sur‐
prising and delighting the customer with books he or she would not
have purchased without the recommendation. What we would really
like to do is emulate the experience of Mark Johnson, CEO of Zite,
who gave a perfect example of what a customer’s recommendation
experience should be like in a recent TOC talk. He went into Strand
bookstore in New York City and asked for a book similar to Toni
Morrison’s Beloved. The girl behind the counter recommended William
Faulkner’s Absalom, Absalom! On Amazon, the top results for a similar
query lead to another book by Toni Morrison and several books by
well-known female authors of color. The Strand bookseller made a
brilliant but far-fetched recommendation probably based more on the
character of Morrison’s writing than superficial similarities between
Morrison and other authors. She cut through the chaff of the obvious
to make a recommendation that will send the customer home with a
new book, and returning to Strand again and again in the future.
This is not to say that Amazon’s recommendation engine could not
have made the same connection; the problem is that this helpful rec‐
ommendation will be buried far down in the recommendation feed,
beneath books that have more obvious similarities to Beloved. The
objective is to escape a recommendation filter bubble, a term which
was originally coined by Eli Pariser to describe the tendency of per‐
sonalized news feeds to only display articles that are blandly popular
or further confirm the readers’ existing biases.
As with the AltaVista-Google example, the lever a bookseller can con‐
trol is the ranking of the recommendations. New data must also be
collected to generate recommendations that will cause new sales. This
will require conducting many randomized experiments in order to
collect data about a wide range of recommendations for a wide range
of customers.
The final step in the drivetrain process is to build the Model Assembly
Line. One way to escape the recommendation bubble would be to build
a Modeler containing two models for purchase probabilities, condi‐
tional on seeing or not seeing a recommendation. The difference be‐
tween these two probabilities is a utility function for a given recom‐
mendation to a customer (see Recommendation Engine figure, be‐
low). It will be low in cases where the algorithm recommends a familiar
book that the customer has already rejected (both components are
small) or a book that he or she would have bought even without the
recommendation (both components are large and cancel each other
out). We can build a Simulator to test the utility of each of the many
possible books we have in stock, or perhaps just over all the outputs of
a collaborative filtering model of similar customer purchases, and then
build a simple Optimizer that ranks and displays the recommended
books based on their simulated utility. In general, when choosing an
objective function to optimize, we need less emphasis on the “function”
and more on the “objective.” What is the objective of the person using
our data product? What choice are we actually helping him or her
make?
Recommendation Engine.
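The utility function described above, the difference between the purchase probability with and without a recommendation, can be sketched in a few lines. The book labels and probabilities below are invented for illustration; in practice both probabilities would come from models trained on randomized experiments.

```python
def utility(p_buy_with_rec, p_buy_without_rec):
    """Incremental purchase probability caused by showing the recommendation."""
    return p_buy_with_rec - p_buy_without_rec

def rank_by_utility(candidates):
    """Sort candidate books by utility, highest first.

    candidates maps title -> (P(buy | shown), P(buy | not shown)).
    """
    return sorted(candidates, key=lambda title: utility(*candidates[title]), reverse=True)

# Hypothetical model outputs for one customer (all numbers invented).
candidates = {
    "already rejected":   (0.02, 0.01),   # both small
    "would buy anyway":   (0.90, 0.88),   # both large, cancel out
    "surprising and apt": (0.30, 0.03),   # recommendation changes the outcome
}
print(rank_by_utility(candidates))
```

Ranking by this difference pushes the book the customer would have bought anyway down the list, even though a pure “like” model would score it highest.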
Optimizing Lifetime Customer Value
This same systematic approach can be used to optimize the entire
marketing strategy. This encompasses all the interactions that a retailer
has with its customers outside of the actual buy-sell transaction,
whether making a product recommendation, encouraging the cus‐
tomer to check out a new feature of the online store, or sending sales
promotions. Making the wrong choices comes at a cost to the retailer
in the form of reduced margins (discounts that do not drive extra
sales), opportunity costs for the scarce real-estate on their homepage
(taking up space in the recommendation feed with products the cus‐
tomer doesn’t like or would have bought without a recommendation)
or the customer tuning out (sending so many unhelpful email pro‐
motions that the customer filters all future communications as spam).
We will show how to go about building an optimized marketing strat‐
egy that mitigates these effects.
As in each of the previous examples, we begin by asking: “What ob‐
jective is the marketing strategy trying to achieve?” Simple: we want
to optimize the lifetime value from each customer. Second question:
“What levers do we have at our disposal to achieve this objective?”
Quite a few. For example:
1. We can make product recommendations that surprise and delight
(using the optimized recommendation outlined in the previous
section).
2. We could offer tailored discounts or special offers on products the
customer was not quite ready to buy or would have bought else‐
where.
3. We can even make customer-care calls just to see how the user is
enjoying our site and make them feel that their feedback is valued.
What new data do we need to collect? This can vary case by case, but
a few online retailers are taking creative approaches to this step. Online
fashion retailer Zafu shows how to encourage the customer to partic‐
ipate in this collection process. Plenty of websites sell designer denim,
but for many women, high-end jeans are the one item of clothing they
never buy online because it’s hard to find the right pair without trying
them on. Zafu’s approach is not to send their customers directly to the
clothes, but to begin by asking a series of simple questions about the
customers’ body type, how well their other jeans fit, and their fashion
preferences. Only then does the customer get to browse a recom‐
mended selection of Zafu’s inventory. The data collection and recom‐
mendation steps are not an add-on; they are Zafu’s entire business
model — women’s jeans are now a data product. Zafu can tailor their
recommendations to fit as well as their jeans because their system is
asking the right questions.
Starting with the objective forces data scientists to consider what ad‐
ditional models they need to build for the Modeler. We can keep the
“like” model that we have already built as well as the causality model
for purchases with and without recommendations, and then take a
staged approach to adding additional models that we think will im‐
prove the marketing effectiveness. We could add a price elasticity
model to test how offering a discount might change the probability
that the customer will buy the item. We could construct a patience
model for the customers’ tolerance for poorly targeted communica‐
tions: When do they tune them out and filter our messages straight to
spam? (“If Hulu shows me that same dog food ad one more time, I’m
gonna stop watching!”) A purchase sequence causality model can be
used to identify key “entry products.” For example, a pair of jeans that
is often paired with a particular top, or the first part of a series of novels
that often leads to a sale of the whole set.
Once we have these models, we construct a Simulator and an Opti‐
mizer and run them over the combined models to find out what rec‐
ommendations will achieve our objectives: driving sales and improv‐
ing the customer experience.
A look inside the Modeler.
Best Practices from Physical Data Products
It is easy to stumble into the trap of thinking that since data exists
somewhere abstract, on a spreadsheet or in the cloud, data products
are just abstract algorithms. So, we would like to conclude by
showing you how objective-based data products are already a part of
the tangible world. What is most important about these examples is
that the engineers who designed these data products didn’t start by
building a neato robot and then looking for something to do with it.
They started with an objective like, “I want my car to drive me places,”
and then designed a covert data product to accomplish that task. En‐
gineers are often quietly on the leading edge of algorithmic applica‐
tions because they have long been thinking about their own modeling
challenges in an objective-based way. Industrial engineers were among
the first to begin using neural networks, applying them to problems
like the optimal design of assembly lines and quality control. Brian
Ripley’s seminal book on pattern recognition gives credit for many
ideas and techniques to largely forgotten engineering papers from the
1970s.
When designing a product or manufacturing process, a drivetrain-like
process followed by model integration, simulation and optimization
is a familiar part of the toolkit of systems engineers. In engineering, it
is often necessary to link many component models together so that
they can be simulated and optimized in tandem. These firms have
plenty of experience building models of each of the components and
systems in their final product, whether they’re building a server farm
or a fighter jet. There may be one detailed model for mechanical sys‐
tems, a separate model for thermal systems, and yet another for elec‐
trical systems, etc. All of these systems have critical interactions. For
example, resistance in the electrical system produces heat, which needs
to be included as an input for the thermal diffusion and cooling model.
That excess heat could cause mechanical components to warp, pro‐
ducing stresses that should be inputs to the mechanical models.
The screenshot below is taken from a model integration tool designed
by Phoenix Integration. Although it’s from a completely different en‐
gineering discipline, this diagram is very similar to the Drivetrain Ap‐
proach we’ve recommended for data products. The objective is clearly
defined: build an airplane wing. The wing box includes the design
levers like span, taper ratio, and sweep. The data is in the wing mate‐
rials’ physical properties; costs are listed in another tab of the appli‐
cation. There is a Modeler for aerodynamics and mechanical structure
that can then be fed to a Simulator to produce the Key Wing Outputs
of cost, weight, lift coefficient, and induced drag. These outcomes can
be fed to an Optimizer to build a functioning and cost-effective air‐
plane wing.
Screenshot from a model integration tool designed by Phoenix Integra‐
tion.
As predictive modeling and optimization become more vital to a wide
variety of activities, look out for the engineers to disrupt industries
that wouldn’t immediately appear to be in the data business. The in‐
spiration for the phrase “Drivetrain Approach,” for example, is already
on the streets of Mountain View. Instead of being data driven, we can
now let the data drive us.
Suppose we wanted to get from San Francisco to the Strata 2012 Con‐
ference in Santa Clara. We could just build a simple model of distance /
speed-limit to predict arrival time with little more than a ruler and a
road map. If we want a more sophisticated system, we can build an‐
other model for traffic congestion and yet another model to forecast
weather conditions and their effect on the safest maximum speed.
There are plenty of cool challenges in building these models, but by
themselves, they do not take us to our destination. These days, it is
trivial to use some type of heuristic search algorithm to predict the
drive times along various routes (a Simulator) and then pick the short‐
est one (an Optimizer) subject to constraints like avoiding bridge tolls
or maximizing gas mileage. But why not think bigger? Instead of the
femme-bot voice of the GPS unit telling us which route to take and
where to turn, what would it take to build a car that would make those
decisions by itself? Why not bundle simulation and optimization en‐
gines with a physical engine, all inside the black box of a car?
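The route-planning example can be sketched as a small model, Simulator, and Optimizer. Everything in the Python below is an assumption made for illustration: the route names, distances, speed limits, and congestion factor are invented, and the sketch simply evaluates every candidate route rather than using a heuristic search.

```python
def drive_time_hours(route, congestion_factor=1.0):
    """Simple model: sum of distance / speed limit over each leg, scaled by congestion."""
    return congestion_factor * sum(miles / mph for miles, mph in route["legs"])

def simulate_routes(routes, congestion):
    """Simulator: predict the drive time for every candidate route."""
    return {
        name: drive_time_hours(route, congestion.get(name, 1.0))
        for name, route in routes.items()
    }

def pick_route(routes, times, avoid_tolls=False):
    """Optimizer: shortest predicted time among routes that satisfy the constraint."""
    allowed = [n for n, r in routes.items() if not (avoid_tolls and r["toll"])]
    return min(allowed, key=times.get)

# Hypothetical routes: (miles, speed limit in mph) per leg; values are invented.
routes = {
    "highway": {"legs": [(40, 65), (5, 35)], "toll": False},
    "bridge":  {"legs": [(30, 65), (4, 35)], "toll": True},
    "surface": {"legs": [(42, 40)], "toll": False},
}
congestion = {"highway": 1.5}          # assumed output of a traffic model
times = simulate_routes(routes, congestion)
print(pick_route(routes, times), pick_route(routes, times, avoid_tolls=True))
```

Changing the constraint changes the answer without touching the models, which is the separation between prediction and decision that the Drivetrain Approach emphasizes.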
Let’s consider how this is an application of the Drivetrain Approach.
We have already defined our objective: building a car that drives itself.
The levers are the vehicle controls we are all familiar with: steering
wheel, accelerator, brakes, etc. Next, we consider what data the car
needs to collect; it needs sensors that gather data about the road as well
as cameras that can detect road signs, red or green lights, and unex‐
pected obstacles (including pedestrians). We need to define the mod‐
els we will need, such as physics models to predict the effects of steer‐
ing, braking and acceleration, and pattern recognition algorithms to
interpret data from the road signs.
As one engineer on the Google self-driving car project put it in a recent
Wired article, “We’re analyzing and predicting the world 20 times a
second.” What gets lost in the quote is what happens as a result of that
prediction. The vehicle needs to use a simulator to examine the results
of the possible actions it could take. If it turns left now, will it hit that
pedestrian? If it makes a right turn at 55 mph in these weather condi‐
tions, will it skid off the road? Merely predicting what will happen isn’t
good enough. The self-driving car needs to take the next step: after
simulating all the possibilities, it must optimize the results of the sim‐
ulation to pick the best combination of acceleration and braking,
steering and signaling, to get us safely to Santa Clara. Prediction only
tells us that there is going to be an accident. An optimizer tells us how
to avoid accidents.
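The simulate-then-optimize loop described above can be sketched in a
few lines. This is a toy illustration only, not how any real
self-driving system works: the Action fields, the skid rule, the
friction constant, and the candidate accelerations are all made-up
assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    steering_deg: float  # hypothetical lever: steering angle, 0 = straight
    accel_mps2: float    # hypothetical lever: negative values mean braking


def simulate(speed_mps, action, friction=0.7, horizon_s=1.0):
    """Toy simulator: predict the outcome of one action (not real physics)."""
    new_speed = max(0.0, speed_mps + action.accel_mps2 * horizon_s)
    # Made-up skid rule: lateral demand grows with speed squared and steering.
    lateral_demand = new_speed ** 2 * abs(action.steering_deg) / 1000.0
    return {"speed": new_speed, "skids": lateral_demand > friction * 9.81}


def score(outcome, target_speed_mps):
    """Objective: rule out skids, then stay close to the target speed."""
    if outcome["skids"]:
        return float("-inf")
    return -abs(outcome["speed"] - target_speed_mps)


def choose_action(speed_mps, target_speed_mps, steering_deg):
    """Optimizer: simulate every candidate action and keep the best one."""
    candidates = [Action(steering_deg, a) for a in (-6.0, -3.0, 0.0, 2.0)]
    return max(candidates,
               key=lambda a: score(simulate(speed_mps, a), target_speed_mps))
```

With these invented numbers, a car travelling at 25 m/s into a
15-degree turn is told to brake hard, because every gentler option is
predicted to skid; on a straight road it simply holds its speed. The
prediction alone flags the skid, and the optimizer is what turns that
prediction into a choice of levers.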
Improving the data collection and predictive models is very important,
but we want to emphasize the importance of beginning by defining a
clear objective with levers that produce actionable outcomes. Data
science is beginning to pervade even the most bricks-and-mortar el‐
ements of our lives. As scientists and engineers become more adept at
applying prediction and optimization to everyday problems, they are
expanding the art of the possible, optimizing everything from our
personal health to the houses and cities we live in. Models developed
to simulate fluid dynamics and turbulence have been applied to im‐
proving traffic and pedestrian flows by using the placement of exits
and crowd control barriers as levers. This has improved emergency
evacuation procedures for subway stations and reduced the danger of
crowd stampedes and trampling during sporting events. Nest is de‐
signing smart thermostats that learn the home-owner’s temperature
preferences and then optimize their energy consumption. For motor
vehicle traffic, IBM performed a project with the city of Stockholm to
optimize traffic flows that reduced congestion by nearly a quarter, and
increased the air quality in the inner city by 25%. What is particularly
interesting is that there was no need to build an elaborate new data
collection system. Any city with metered stoplights already has all the
necessary information; they just haven’t found a way to suck the
meaning out of it.
In another area where objective-based data products have the power
to change lives, the CMU extension in Silicon Valley has an active
project for building data products to help first responders after natural
or man-made disasters. Jeannie Stamberger of Carnegie Mellon Uni‐
versity Silicon Valley explained to us many of the possible applications
of predictive algorithms to disaster response, from text-mining and
sentiment analysis of tweets to determine the extent of the damage, to
swarms of autonomous robots for reconnaissance and rescue, to lo‐
gistic optimization tools that help multiple jurisdictions coordinate
their responses. These disaster applications are a particularly good
example of why data products need simple, well-designed interfaces
that produce concrete recommendations. In an emergency, a data
product that just produces more data is of little use. Data scientists
now have the predictive tools to build products that increase the com‐
mon good, but they need to be aware that building the models is not
enough if they do not also produce optimized, implementable out‐
comes.
The Future for Data Products
We introduced the Drivetrain Approach to provide a framework for
designing the next generation of great data products and described
how it relies at its heart on optimization. In the future, we hope to see
optimization taught in business schools as well as in statistics depart‐
ments. We hope to see data scientists ship products that are designed
to produce desirable business outcomes. This is still the dawn of data
science. We don’t know what design approaches will be developed in
the future, but right now, there is a need for the data science community
to coalesce around a shared vocabulary and product design process
that can be used to educate others on how to derive value from their
predictive models. If we do not do this, we will find that our models
only use data to create more data, rather than using data to create
actions, disrupt industries, and transform lives.
Do we want products that deliver data, or do we want products that
deliver results based on data? Jeremy Howard examined these questions
in his Strata California 12 session, “From Predictive Modelling to Opti‐
mization: The Next Frontier.” Full video from that session is available
here.
What It Takes to Build Great Machine Learning
Products
By Aria Haghighi
Machine learning (ML) is all the rage, riding tight on the coattails of
the “big data” wave. Like most technology hype, the enthusiasm far
exceeds the realization of actual products. Arguably, not since Google’s
tremendous innovations in the late ’90s/early 2000s has algorithmic
technology led to a product that has permeated the popular culture.
That’s not to say there haven’t been great ML wins since, but none have
been as impactful or had computational algorithms at their core.
Netflix may use recommendation technology, but Netflix is still Netflix
without it. There would be no Google if Page, Brin, et al., hadn’t ex‐
ploited the graph structure of the Web and anchor text to improve
search.
So why is this? It’s not for lack of trying. How many startups have aimed
to bring natural language processing (NLP) technology to the masses,
only to fade into oblivion after people actually try their products? The
challenge in building great products with ML lies not in just under‐
standing basic ML theory, but in understanding the domain and prob‐
lem sufficiently to operationalize intuitions into model design. Inter‐
esting problems don’t have simple off-the-shelf ML solutions. Progress
in important ML application areas, like NLP, comes from insights specific
to these problems, rather than generic ML machinery. Often,
specific insights into a problem and careful model design make the
difference between a system that doesn’t work at all and one that people
will actually use.
The goal of this essay is not to discourage people from building amaz‐
ing products with ML at their cores, but to be clear about where I think
the difficulty lies.
Progress in Machine Learning
Machine learning has come a long way over the last decade. Before I
started grad school, training a large-margin classifier (e.g., SVM) was
done via John Platt’s batch SMO algorithm. In that case, training time
scaled poorly with the amount of training data. Writing the algorithm
itself required understanding quadratic programming and was riddled
with heuristics for selecting active constraints and black-art parameter
tuning. Now, we know how to train a nearly performance-equivalent
large-margin classifier in linear time using a (relatively) simple online
algorithm (PDF). Similar strides have been made in (probabilistic)
graphical models: Markov-chain Monte Carlo (MCMC) and varia‐
tional methods have facilitated inference for arbitrarily complex
graphical models.1 Anecdotally, take a look at papers over the last
1. Although MCMC is a much older statistical technique, its broad use in large-scale
machine learning applications is relatively recent.
eight years in the proceedings of the Association for Computational
Linguistics (ACL), the premier natural language processing publication.
A top paper from 2011 has orders of magnitude more technical
ML sophistication than one from 2003.
On the education front, we’ve come a long way as well. As an undergrad
at Stanford in the early-to-mid 2000s, I took Andrew Ng’s ML
course and Daphne Koller’s probabilistic graphical model course. Both
of these classes were among the best I took at Stanford and were only
available to about 100 students a year. Koller’s course in particular was
not only the best course I took at Stanford, but the one that taught me
the most about teaching. Now, anyone can take these courses online.
As an applied ML person — specifically, natural language processing
— much of this progress has made aspects of research significantly
easier. However, the core decisions I make are not which abstract ML
algorithm, loss-function, or objective to use, but what features and
structure are relevant to solving my problem. This skill only comes
with practice. So, while it’s great that a much wider audience will have
an understanding of basic ML, it’s not the most difficult part of build‐
ing intelligent systems.
Interesting Problems Are Never Off the Shelf
The interesting problems that you’d actually want to solve are far
messier than the abstractions used to describe standard ML problems.
Take machine translation (MT), for example. Naively, MT looks like a
statistical classification problem: You get an input foreign sentence and
have to predict a target English sentence. Unfortunately, because the
space of possible English is combinatorially large, you can’t treat MT
as a black-box classification problem. Instead, like most interesting
ML applications, MT problems have a lot of structure and part of the
job of a good researcher is decomposing the problem into smaller
pieces that can be learned or encoded deterministically. My claim is
that progress in complex problems like MT comes mostly from how
we decompose and structure the solution space, rather than ML tech‐
niques used to learn within this space.
Machine translation has improved by leaps and bounds throughout
the last decade. I think this progress has largely, but not entirely, come
from keen insights into the specific problem, rather than generic ML
improvements. Modern statistical MT originates from an amazing
paper, “The mathematics of statistical machine translation” (PDF),
which introduced the noisy-channel architecture on which future MT
systems would be based. At a very simplistic level, this is how the model
works:2 For each foreign word, there are potential English translations
(including the null word for foreign words that have no English equiv‐
alent). Think of this as a probabilistic dictionary. These candidate
translation words are then re-ordered to create a plausible English
translation. There are many intricacies being glossed over: how to ef‐
ficiently consider candidate English sentences and their permutations,
what model is used to learn the systematic ways in which reordering
occurs between languages, and the details about how to score the
plausibility of the English candidate (the language model).
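To make the word-based picture concrete, here is a deliberately tiny
sketch of it: a probabilistic dictionary proposes English words, every
reordering is tried, and a bigram language model scores each
candidate. All of the probabilities and table names below are invented
for illustration, the null word is left out, and the brute-force
search over permutations is exactly the kind of intricacy a real
system has to handle far more efficiently.

```python
import math
from itertools import permutations, product

# Hypothetical toy tables; a real system would estimate these from
# parallel text.
TRANSLATION = {  # P(English word | foreign word): the dictionary
    "la": {"the": 0.9, "it": 0.1},
    "maison": {"house": 0.8, "home": 0.2},
    "rouge": {"red": 1.0},
}
BIGRAM = {  # P(word | previous word): a tiny bigram language model
    ("<s>", "the"): 0.5, ("the", "red"): 0.2, ("the", "house"): 0.3,
    ("red", "house"): 0.5, ("house", "red"): 0.01, ("red", "home"): 0.1,
}
UNSEEN = 1e-4  # crude floor for bigrams missing from the table


def lm_logprob(words):
    """Score the plausibility of an English word sequence."""
    total, prev = 0.0, "<s>"
    for word in words:
        total += math.log(BIGRAM.get((prev, word), UNSEEN))
        prev = word
    return total


def translate(foreign_words):
    """Try every word choice and every ordering; keep the best-scoring one."""
    best, best_score = None, float("-inf")
    options = [TRANSLATION[f].items() for f in foreign_words]
    for choice in product(*options):
        dictionary_score = sum(math.log(p) for _, p in choice)
        for order in permutations(w for w, _ in choice):
            total = dictionary_score + lm_logprob(order)
            if total > best_score:
                best, best_score = list(order), total
    return best
```

Calling translate(["la", "maison", "rouge"]) with these tables
returns ["the", "red", "house"]: the dictionary supplies the words and
the language model is what prefers "red house" over "house red".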
The core improvement in MT came from changing this model. So,
rather than learning translation probabilities of individual words, to
instead learn models of how to translate foreign phrases to English
phrases. For instance, the German word “abends” translates roughly
to the English prepositional phrase “in the evening.” Before phrase-based
translation (PDF), a word-based model would only get to translate
to a single English word, making it unlikely to arrive at the correct
English translation.3 Phrase-based translation generally results in
more accurate translations with fluid, idiomatic English output. Of
course, adding phrase-based emissions introduces several additional
complexities, including how to estimate phrase-emissions given
that we never observe phrase segmentation; no one tells us that “in
the evening” is a phrase that should match up to some foreign phrase.
What’s surprising here is that there aren’t general ML improvements
that are making this difference, but problem-specific model design.
People can and have implemented more sophisticated ML techniques
for various pieces of an MT system. And these do yield improvements,
but typically far smaller than good problem-specific research insights.
Franz Och, one of the authors of the original phrase-based papers,
went on to Google and became the principal person behind the search
company’s translation efforts. While the intellectual underpinnings of
Google’s system go back to Och’s days as a research scientist at the
Information Sciences Institute (and earlier as a graduate student),
2. The model is generative, so what’s being described here is from the point-of-view of
inference; the model’s generative story works in reverse.
3. IBM model 3 introduced the concept of fertility to allow a given word to generate
multiple independent target translation words. While this could generate the required
translation, the probability of the model doing so is relatively low.
much of the gains beyond the insights underlying phrase-based trans‐
lation (and minimum-error rate training, another of Och’s innova‐
tions) came from a massive software engineering effort to scale these
ideas to the Web. That effort itself yielded impressive research into
large-scale language models and other areas of NLP. It’s important to
note that Och, in addition to being a world-class researcher, is also, by
all accounts, an incredibly impressive hacker and builder. It’s this rare
combination of skills that can bring ideas all the way from a research
project to where Google Translate is today.
Defining the Problem
But I think there’s an even bigger barrier beyond ingenious model
design and engineering skills. In the case of machine translation and
speech recognition, the problem being solved is straightforward to
understand and well-specified. Many of the NLP technologies that I
think will revolutionize consumer products over the next decade are
much more vague. How, exactly, can we take the excellent research in
structured topic models, discourse processing, or sentiment analysis
and make a mass-appeal consumer product?
Consider summarization. We all know that in some way, we’ll want
products that summarize and structure content. However, for com‐
putational and research reasons, you need to restrict the scope of this
problem to something for which you can build a model, an algorithm,
and ultimately evaluate. For instance, in the summarization literature,
the problem of multi-document summarization is typically formula‐
ted as selecting a subset of sentences from the document collection
and ordering them. Is this the right problem to be solving? Is the best
way to summarize a piece of text a handful of full-length sentences?
Even if a summarization is accurate, does the Franken-sentence struc‐
ture yield summaries that feel inorganic to users?
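The "select a subset of sentences and order them" formulation can be
sketched as follows. This is one simple greedy variant written for
illustration, not a method taken from the summarization literature
cited here: the scoring rule (reward words that appear in many
sentences, and give no credit for words already covered) and the
ordering rule (original document position) are both assumptions.

```python
import re
from collections import Counter


def words(text):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"[a-z']+", text.lower()))


def summarize(documents, max_sentences=2):
    """Greedy extractive summary: select sentences, then order them."""
    sentences = []  # (document index, sentence index, text)
    for d, doc in enumerate(documents):
        for s, text in enumerate(re.split(r"(?<=[.!?])\s+", doc.strip())):
            if text:
                sentences.append((d, s, text))

    # Number of sentences each word appears in, across the collection.
    freq = Counter(w for _, _, text in sentences for w in words(text))

    chosen, covered = [], set()
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        # Gain: total frequency of the words a sentence would newly cover.
        best = max(remaining,
                   key=lambda item: sum(freq[w] for w in words(item[2]) - covered))
        if not words(best[2]) - covered:
            break  # nothing new left to say
        chosen.append(best)
        remaining.remove(best)
        covered |= words(best[2])

    # Ordering step: fall back to original document and sentence position.
    chosen.sort(key=lambda item: (item[0], item[1]))
    return [text for _, _, text in chosen]
```

Given the two documents "The storm hit the coast. Power is out." and
"The storm hit the coast on Monday. Schools are closed.", this returns
"Power is out." followed by "The storm hit the coast on Monday." The
redundant sentence is skipped, but the stitched-together ordering is a
small example of the Franken-sentence feel described above.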
Or, consider sentiment analysis. Do people really just want a coarse-grained
thumbs-up or thumbs-down on a product or event? Or do
they want a richer picture of sentiments toward individual aspects of
an item (e.g., loved the food, hated the decor)? Do people care about
determining sentiment attitudes of individual reviewers/utterances, or
producing an accurate assessment of aggregate sentiment?
Typically, these decisions are made by a product person and are passed
off to researchers and engineers to implement. The problem with this
approach is that ML-core products are intimately constrained by what
is technically and algorithmically feasible. In my experience, having a
technical understanding of the range of related ML problems can in‐
spire product ideas that might not occur to someone without this un‐
derstanding. To draw a loose analogy, it’s like architecture. So much of
the construction of a bridge is constrained by material resources and
physics that it doesn’t make sense to have people without that technical
background design a bridge.
The goal of all this is to say that if you want to build a rich ML product,
you need to have a rich product/design/research/engineering team.
All the way from the nitty gritty of how ML theory works to building
systems to domain knowledge to higher-level product thinking to
technical interaction and graphic design; preferably people who are
world-class in one of these areas but also good in several. Small talented
teams with all of these skills are better equipped to navigate the joint
uncertainty with respect to product vision as well as model design.
Large companies that have research and product people in entirely
different buildings are ill-equipped to tackle these kinds of problems.
The ML products of the future will come from startups with small
founding teams that have this full context and can all fit in the prov‐
erbial garage.
CHAPTER 4
The Application of Big Data
Stories over Spreadsheets
By Mac Slocum
I didn’t realize how much I dislike spreadsheets until I was presented
with a vision of the future where their dominance isn’t guaranteed.
That eye-opening was offered by Narrative Science CTO Kris Ham‐
mond (@whisperspace) during a recent interview. Hammond’s com‐
pany turns data into stories: They provide sentences and paragraphs
instead of rows and columns. To date, much of the attention Narrative
Science has received has focused on the media applications. That’s a
natural starting point. Heck, I asked him about those very same things
when I first met Hammond at Strata in New York last fall. But during
our most recent chat, Hammond explored the other applications of
narrative-driven data analysis.
“Companies, God bless them, had a great insight: They wanted to make
decisions based upon the data that’s out there and the evidence in front
of them,” Hammond said. “So they started gathering that data up. It
quickly exploded. And they ended up with huge data repositories they
had to manage. A lot of their effort ended up being focused on gath‐
ering that data, managing that data, doing analytics across that data,
and then the question was: What do we do with it?”
Hammond sees an opportunity to extract and communicate the in‐
sights locked within company data. “We’ll be the bridge between the
data you have, the insights that are in there, or insights we can gather,
and communicating that information to your clients, to your man‐
agement, and to your different product teams. We’ll turn it into some‐
thing that’s intelligible instead of a list of numbers, a spreadsheet, or a
graph or two. You get a real narrative; a real story in that data.”
My takeaway: The journalism applications of this are intriguing, but
these other use cases are empowering.
Why? Because most people don’t speak fluent “spreadsheet.” They see
all those neat rows and columns and charts, and they know something
important is tucked in there, but what that something is and how to
extract it aren’t immediately clear. Spreadsheets require effort. That’s
doubly true if you don’t know what you’re looking for. And if data
analysis is an adjacent part of a person’s job, more effort means those
spreadsheets will always be pushed to the side. “I’ll get to those next
week when I’ve got more time…”
We all know how that plays out.
But what if the spreadsheet wasn’t our default output anymore? What
if we could take things most of us are hard-wired to understand —
stories, sentences, clear guidance — and layer it over all that vital data?
Hammond touched on that:
For some people, a spreadsheet is a great device. For most people, not
so much so. The story. The paragraph. The report. The prediction.
The advisory. Those are much more powerful objects in our world,
and they’re what we’re used to.
He’s right. Spreadsheets push us (well, most of us) into a cognitive
corner. Open a spreadsheet and you’re forced to recalibrate your focus
to see the data. Then you have to work even harder to extract meaning.
This is the best we can do?
With that in mind, I asked Hammond if the spreadsheet’s days are
numbered.
“There will always be someone who uses a spreadsheet,” Hammond
said. “But, I think what we’re finding is that the story is really going to
be the endpoint. If you think about it, the spreadsheet is for somebody
who really embraces the data. And usually what that person does is
they reduce that data down to something that they’re going to use to
communicate with someone else.”
A Thought on Dashboards
I used to view dashboards as the logical step beyond raw data and
spreadsheets. I’m not so sure about that anymore, at least in terms of
broad adoption. Dashboards are good tools, and I anticipate we’ll have
them from now until the end of time, but they’re still weighed down
by a complexity that makes them inaccessible.
It’s not that people can’t master the buttons and custom reports in
dashboards; they simply don’t have time. These people — and I include
myself among them — need something faster and knob-free. Simplic‐
ity is the thing that will ultimately democratize data reporting and data
insights. That’s why the expansion of data analysis requires a refine‐
ment beyond our current dashboards. There’s a next step that hasn’t
been addressed.
Does the answer lie in narrative? Will visualizations lead the way? Will
a hybrid format take root? I don’t know what the final outputs will look
like, but the importance of data reporting means someone will even‐
tually crack the problem.
Full Interview
You can see the entire discussion with Hammond in this interview.
Mining the Astronomical Literature
By Alasdair Allan
There is a huge debate right now about making academic literature
freely accessible and moving toward open access. But what would be
possible if people stopped talking about it and just dug in and got on
with it?
NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian
Astrophysical Observatory (SAO), has quietly been working away
since the mid-’90s. Without much, if any, fanfare amongst the other
disciplines, it has moved astronomers into a world where access to the
literature is just a given. It’s something they don’t have to think about
all that much.
The ADS service provides access to abstracts for virtually all of the
astronomical literature. But it also provides access to the full text of
more than half a million papers, going right back to the start of peer-reviewed
journals in the 1800s. The service has links to online data
archives, along with reference and citation information for each of the
papers, and it’s all searchable and downloadable.
Number of papers published in the three main astronomy journals each
year. Credit: Robert Simpson
The existence of the ADS, along with the arXiv pre-print server, has
meant that most astronomers haven’t seen the inside of a brick-built
library since the late 1990s.
It also makes astronomy almost uniquely well placed for interesting
data mining experiments, experiments that hint at what the rest of
academia could do if they followed astronomy’s lead. The fact that the
discipline’s literature has been scanned, archived, indexed and cata‐
logued, and placed behind a RESTful API makes it a treasure trove,
both for hypothesis generation and sociological research.
For example, the .Astronomy series of conferences is a small workshop
that brings together the best and brightest of the technical community:
researchers, developers, educators, and communicators. Billed as
“20% time for astronomers,” it gives these people space to think about
how new technologies affect both how research is done and how it is
communicated to their peers and to the public.
[Disclosure: I’m a member of the advisory board to the .Astronomy con‐
ference, and I previously served as a member of the programme organ‐
ising committee for the conference series.]
It should perhaps come as little surprise that one of the more inter‐
esting projects to come out of a hack day held as part of this year’s .As‐
tronomy meeting in Heidelberg was work by Robert Simpson, Karen
Masters and Sarah Kendrew that focused on data mining the astro‐
nomical literature.
The team grabbed and processed the titles and abstracts of all the pa‐
pers from the Astrophysical Journal (ApJ), Astronomy & Astrophy‐
sics (A&A), and the Monthly Notices of the Royal Astronomical So‐
ciety (MNRAS) since each of those journals started publication — and
that’s 1827 in the case of MNRAS.
By the end of the day, they’d found some interesting results showing
how various terms have trended over time. The results were similar to
what’s found in Google Books’ Ngram Viewer.
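A term-trend calculation of this kind can be sketched very simply.
The article does not say how the team computed their trends, so the
function below is an assumption: it takes (year, text) pairs, where
the text stands in for a title plus abstract, and reports the fraction
of each year's papers that mention a term. Plain substring matching is
used for brevity and would need refining for real data.

```python
from collections import defaultdict


def term_trend(papers, term):
    """Fraction of papers per year whose text mentions the term.

    papers: iterable of (year, text) pairs (a hypothetical input format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    # Normalising by papers per year keeps growth in publishing volume
    # from masquerading as growth in a term's popularity.
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

Dividing by the yearly total matters here because the number of
papers published each year has itself changed enormously since 1827.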
The relative popularity of the names of telescopes in the literature. Hub‐
ble, Chandra, and Spitzer seem to have taken turns in hogging the lime‐
light, much as COBE, WMAP, and Planck have each contributed to our
knowledge of the cosmic microwave background in successive decades.
References to Planck are still on the rise. Credit: Robert Simpson.
After the meeting, however, Robert took his initial results and explored
the astronomical literature and his new corpus of data on the literature.
He has explored various visualisations of the data, including word
matrixes for related terms and for various astro-chemistry.
Correlation between terms related to Active Galactic Nuclei (AGN). The
opacity of each square represents the strength of the correlation between
the terms. Credit: Robert Simpson.
He has also taken a look at authorship in astronomy and is starting to
find some interesting trends.
Fraction of astronomical papers published with one, two, three, four, or
more authors. Credit: Robert Simpson
You can see that single-author papers dominated for most of the 20th
century. Around 1960, we see the decline begin, as two- and three-author
papers begin to become a significant chunk of the whole. In
1978, author papers become more prevalent than single-author pa‐
pers.
Compare the number of “active” research astronomers to the number of
papers published each year (across all the major journals). Credit: Robert
Simpson.
Here we see that people begin to outpace papers in the 1960s. This may
reflect the fact that as we get more technical as a field, and more spe‐
cialised, it takes more people to write the same number of papers,
which is a sort of interesting result all by itself.
Interview with Robert Simpson: Behind the Project and
What Lies Ahead
I recently talked with Rob about the work he, Karen Masters, and Sarah
Kendrew did at the meeting, and the work he has been doing since
with the newly gathered data.
What made you think about data mining the ADS?
Robert Simpson: At the .Astronomy 4 Hack Day in July, Sarah Ken‐
drew had the idea to try to do an astronomy version of BrainSCANr,
a project that generates new hypotheses in the neuroscience literature.
I’ve had a go at mining ADS and arXiv before, so it seemed like a great
excuse to dive back in.
Do you think there might be actual science that could be done here?
Robert Simpson: Yes, in the form of finding questions that were un‐
expected. With such large volumes of peer-reviewed papers being pro‐
duced daily in astronomy, there is a lot being said. Most researchers
can only try to keep up with it all — my daily RSS feed from arXiv is
next to useless, it’s so bloated. In amongst all that text, there must be
connections and relationships that are being missed by the community
at large, hidden in the chatter. Maybe we can develop simple techni‐
ques to highlight potential missed links, i.e., generate new hypotheses
from the mass of words and data.
Are the results coming out of the work useful for auditing academics?
Robert Simpson: Well, perhaps, but that would be tricky territory in
my opinion. I’ve only just begun to explore the data around authorship
in astronomy. One thing that is clear is that we can see a big trend
toward collaborative work. In 2012, only 6% of papers were single-author
efforts, compared with 70+% in the 1950s.
The above plot shows the average number of authors per paper since
1827. Credit: Robert Simpson.
We can measure how large groups are becoming, and who is part of
which groups. In that sense, we can audit research groups, and maybe
individual people. The big issue is keeping track of people through
variations in their names and affiliations. Identifying authors is prob‐
ably a solved problem if we look at ORCID.
What about citations? Can you draw any comparisons with h-index
data?
Robert Simpson: I haven’t looked at h-index stuff specifically, at least
not yet, but citations are fun. I looked at the trends surrounding the
term dark matter and saw something interesting. Mentions of dark
matter rise steadily after it first appears in the late ’70s.
Compare the term “dark matter” with a few other related terms: “cos‐
mology,” “big bang,” “dark energy,” and “wmap.” You can see cosmology
has been getting more popular since the 1990s, and dark energy is a recent
addition. Credit: Robert Simpson.
In the data, astronomy becomes more and more obsessed with dark
matter — the term appears in 1% of all papers by the end of the ’80s
and 6% today.
Looking at citations changes the picture. The community is writing
papers about dark matter more and more each year, but they are getting
fewer citations than they used to (the peak for this was in the late ’90s).
These trends are normalised, so the only recency effect I can think of
is that dark matter papers take more than 10 years to become citable.
Either that or dark matter studies are currently in a trough for impact.
Can you see where work is dropped by parts of the community and
picked up again?
Robert Simpson: Not yet, but I see what you mean. I need to build a
better picture of the community and its components.
Can you build a social graph of astronomers out of this data? What
about (academic) family trees?
Robert Simpson: Identifying unique authors is my next step, followed
by creating fingerprints of individuals at a given point in time. When
do people create their first-author papers, when do they have the most
impact in their careers, stuff like that.
What tools did you use? In hindsight, would you do it differently?
Robert Simpson: I’m using Ruby and Perl to grab the data, MySQL to store and query
it, JavaScript to display it (Google Charts and D3.js). I may still move
the database part to MongoDB because it was designed to store docu‐
ments. Similarly, I may switch from ADS to arXiv as the data source.
Using arXiv would allow me to grab the full text in many cases, even
if it does introduce a peer-review issue.
What’s next?
Robert Simpson: My aim is still to attempt real hypothesis generation.
I’ve begun the process by investigating correlations between terms in
the literature, but I think the power will be in being able to compare
all terms with all terms and looking for the unexpected. Terms may
correlate indirectly (via a third term, for example), so the entire corpus
needs to be processed and optimised to make it work comprehensively.
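As a rough idea of what comparing all terms with all terms might
involve, the sketch below computes a pairwise correlation between
terms based on whether they occur in the same abstracts. It is an
assumption about the approach rather than a description of Robert's
code: the input format and function name are hypothetical, substring
matching stands in for proper term extraction, and the measure is the
standard phi coefficient for two yes/no variables.

```python
import math
from itertools import combinations


def term_correlations(abstracts, terms):
    """Phi coefficient between every pair of terms across the abstracts."""
    n = len(abstracts)
    # One 0/1 indicator per abstract for each term.
    present = {t: [1 if t in a.lower() else 0 for a in abstracts]
               for t in terms}
    result = {}
    for a, b in combinations(terms, 2):
        p_a = sum(present[a]) / n
        p_b = sum(present[b]) / n
        p_ab = sum(x * y for x, y in zip(present[a], present[b])) / n
        spread = math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
        # A term found in every abstract, or in none, has no defined
        # correlation; report 0.0 in that case.
        result[(a, b)] = (p_ab - p_a * p_b) / spread if spread else 0.0
    return result
```

The number of pairs grows with the square of the number of terms,
which is one reason a full all-against-all comparison needs the corpus
to be processed and optimised first. Spotting indirect links would
then mean looking for pairs that are weakly correlated with each other
but both strongly correlated with some third term.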
Science between the Cracks
I’m really looking forward to seeing more results coming out of Rob‐
ert’s work. This sort of analysis hasn’t really been possible before. It’s
showing a lot of promise both from a sociological angle, with the ability
to do research into how science is done and how that has changed, but
also ultimately as a hypothesis engine — something that can generate
new science in and of itself. This is just a hack day experiment. Imagine
what could be done if the literature were more open and this sort of
analysis could be done across fields.
Right now, a lot of the most interesting science is being done in the
cracks between disciplines, but the hardest part of that sort of work is
often trying to understand the literature of the discipline that isn’t your
own. Robert’s project offers a lot of hope that this may soon become
easier.
The Dark Side of Data
By Mike Loukides
Tom Slee’s “Seeing Like a Geek” is a thoughtful article on the dark side
of open data. He starts with the story of a Dalit community in India,
whose land was transferred to a group of higher-caste Mudaliars
through bureaucratic manipulation under the guise of standardizing
and digitizing property records. While this sounds like a good idea, it
gave a wealthier, more powerful group a chance to erase older, tradi‐
tional records that hadn’t been properly codified. One effect of passing
laws requiring standardized, digital data is to marginalize all data that
can’t be standardized or digitized, and to marginalize the people who
don’t control the process of standardization.
That’s a serious problem. It’s sad to see oppression and property theft
riding in under the guise of transparency and openness. But the issue
isn’t open data; it’s how data is used.
Jesus said “the poor are with you always” not because the poor aren’t
a legitimate area of concern (only an American fundamentalist would
say that), but because they’re an intractable problem that won’t go
away. The poor are going to be the victims of any changes in technology;
it isn’t surprising that the wealthy in India used data to marginalize
the land holdings of the poor. In a similar vein, when Europeans
came to North America, I imagine they asked the natives “So,
you got a deed to all this land?” That narrative is still being played
out with indigenous people around the world.
The issue is how data is used. If the wealthy can manipulate legislators
to wipe out generations of records and folk knowledge as “inaccurate,”
then there’s a problem. A group like DataKind could go in and figure
out a way to codify that older generation of knowledge. Then at least,
if that isn’t acceptable to the government, it would be clear that the
problem lies in political manipulation, not in the data itself. And note
that a government could wipe out generations of “inaccurate records”
without any requirement that the new records be open. In years past
the monied classes would have just taken what they wanted, with the
government’s support. The availability of open data gives a plausible
pretext, but it’s certainly not a prerequisite (nor should it be blamed)
for manipulation by the 0.1%.
One can see the opposite happening, too: recent legislation in
North Carolina forbids the use of data that shows sea level rise. Open
data may be the only possible resource against forces that are interested
in suppressing science. What we’re seeing here is a full-scale retreat
from data and what it can teach us: an attempt to push the furniture
against the door to prevent...