CPSC6730 Big Data Analytics
Lecture # 4
Apache Hive
• Apache Hive is part of Data Access in the Hadoop ecosystem and can be installed when you install the Hortonworks Data Platform.
The Problem
• Until recently, most of the data maintained by an enterprise was stored in a relational database and analyzed using a structured query language (SQL). As a result, most data analysts today are familiar with SQL.
However, data in Hadoop is commonly analyzed using MapReduce.
Many data analysts are not familiar with MapReduce and would
require training to use it. This limits how quickly an enterprise can
derive value from a Hadoop deployment.
How do enterprises bridge this knowledge gap?
The Solution
• Apache Hive bridges the knowledge gap by enabling data analysts to
use familiar SQL-like commands that are automatically converted to
MapReduce jobs and executed across the Hadoop cluster.
Hive is a data warehouse infrastructure built on top of Hadoop. It was
designed to enable users with database experience to analyze data
using familiar SQL-like statements. Hive includes a SQL-like language
called Hive Query Language, or HQL. Hive and HQL enable an
enterprise to utilize existing skillsets to quickly derive value from a
Hadoop deployment.
OLTP or OLAP
• Hive is used for online analytical processing (OLAP) and not online transaction processing (OLTP). This is because Hive was originally designed to run batch jobs rather than to perform interactive queries or random table updates. Currently Hive offers no support for row-level inserts, updates, and deletes, which are commonly required for OLTP. When Hive is run over MapReduce, even the simplest Hive queries can take minutes to complete.
If you run Hive over Tez (which we will discuss in later classes) rather than MapReduce, Hive is still not designed for OLTP. While Tez improves interactive performance, Hive still has no support for row-level inserts, updates, and deletes. However, work is currently being done to add these features to Hive.
Structuring Unstructured Data
Hive is not a relational database although, on the surface, it can appear to be one.
Hadoop was built to collect, store, and analyze
massive amounts of data. As such, the Hadoop
distributed file system, called HDFS, is a reservoir
of data from multiple sources. The data is often a
mix of unstructured, semi-structured, and
structured data. Hive provides a mechanism to
project structure onto HDFS data and then query it
using HQL. However, there is a limit to what Hive
can do. Sometimes it is necessary to use another
tool, like Apache Pig, to pre-format the
unstructured data before processing it using Hive.
Structuring Unstructured Data
If you are familiar with databases, then you understand that unstructured
data has no schema associated with it. If you are not familiar with database
schemas, a schema defines the columns of a table along with the type of data
in each column. Data types include such things as a string, an integer, a
floating point number, or a date.
A Hive installation includes a metastore database. Several database types are
supported by Hive including an embedded Derby database used for
development or testing, or an external database like MySQL used for
production deployments. To project structure on HDFS data, HQL includes
statements to create a table with user-defined schema information. The
table schema is stored in the metastore database.
The user-defined schema is associated with the data stored in one or more
HDFS files when you use HQL statements to load the files into a table. The
format of the data on HDFS remains unchanged but it appears as structured
data when using HQL commands to submit queries.
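As a minimal sketch of projecting structure, assume tab-delimited log files already sit under a hypothetical HDFS directory named /user/me/weblogs/. An EXTERNAL table (one common approach) simply points a user-defined schema at that directory; the schema is recorded in the metastore while the files themselves are never rewritten:

CREATE EXTERNAL TABLE weblogs (
  ip STRING,
  request_time STRING,
  url STRING,
  bytes INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/me/weblogs/';

After this statement runs, the files under /user/me/weblogs/ can be queried with HQL even though their format on HDFS is unchanged.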
Submitting Hive Queries
Hive includes many methods to submit
queries. Queries submitted to either the
HiveServer or newer HiveServer2 result in a
MapReduce or Tez job being submitted to
YARN. YARN, the Hadoop resource scheduler,
works in concert with HDFS to run the job in
parallel across the machines in the cluster.
The Hive CLI is used to submit HQL commands to the HiveServer, either
interactively or non-interactively. The illustration shows the Hive CLI
being used interactively. Users enter HQL commands at the hive> prompt.
HQL commands can also be placed into a file and run non-interactively
using hive -f file_name
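As a small sketch of the non-interactive form, the HQL below could be saved in a hypothetical file named daily_counts.hql (the weblogs table is the one sketched earlier) and run from the shell with hive -f daily_counts.hql:

-- daily_counts.hql: count page hits per URL
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url;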
Submitting Hive Queries
The remaining three methods all submit HQL queries to the newer HiveServer2.
The Beeline CLI is a new JDBC client that connects to a local or remote HiveServer2.
When connecting locally, beeline works just like the Hive CLI. Beeline can connect to a
remote HiveServer2 using a variety of methods that include TCP and HTTP. For
example, HTTP access is useful for submitting queries to a firewall-protected cluster,
assuming the firewall will allow HTTP traffic.
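As a sketch, connecting Beeline to a remote HiveServer2 uses a JDBC URL; the host name and user name below are placeholders, and 10000 is the default HiveServer2 port:

beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n analyst

Once connected, HQL statements are entered at the Beeline prompt and, as with the Hive CLI, each statement ends with a semicolon.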
The Web UI, called the Hive Web Interface or HWI, enables you to submit Hive queries
remotely using HTTP. Again, this is useful for submitting queries to a firewall-protected
cluster. The difference between using the Web UI and Beeline is that no Hive client
software has to be installed to use the Web UI.
ODBC and JDBC drivers enable popular business intelligence tools to connect to Hive
to query, analyze, and visualize Hive data.
Example HQL Commands
• Create a table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT,
change FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
• Load data from file in HDFS:
LOAD DATA INPATH '/user/me/stockdata.csv'
OVERWRITE INTO TABLE stockinfo;
• View everything in the table:
SELECT * FROM stockinfo;
Example HQL Commands
• A few common HQL commands are illustrated here.
The first command creates a table named stockinfo. The table is created with
three columns named symbol, price, and change. Any data in the first column will
be treated as a string of characters. Any data in the second and third columns will
be treated as a floating point number. The ROW FORMAT clause tells Hive to
expect each row of data to be delimited by commas.
The second command loads the data in the HDFS file named
/user/me/stockdata.csv into a table named stockinfo. If any data has been
previously loaded into the table, it will be overwritten.
The final command displays the entire contents of the stockinfo table. Such a
command could take a long time to complete if there is a large amount of data.
HQL syntax rules require that every command end with a semicolon.
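As one further sketch using the same stockinfo table, an aggregate query like the following is exactly the kind of statement Hive compiles into a MapReduce or Tez job that runs across the cluster:

SELECT symbol, AVG(price) AS avg_price
FROM stockinfo
GROUP BY symbol;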
Hive Summary
• Hive was designed to enable experienced database users to analyze
data using familiar SQL-like statements.
• Hive is suitable for online analytical processing and not online
transaction processing.
• Hive over MapReduce supports batch queries. Hive over Tez supports
batch and interactive queries.
• Hive is not a relational database. It only projects structure on
unstructured data in HDFS.
• There are multiple methods to submit Hive queries.
Apache Pig
Apache Pig is part of Data Access in the Hadoop ecosystem and is installed when you install the Hortonworks Data Platform.
Why Does the World Need a Pig?
• Hadoop includes a tool named Apache Hive that enables data
analysts to use familiar SQL-like commands to analyze data. Hive can
even project a structure onto unstructured data. However, there are
limits to what Hive can do with unstructured data.
For example, Hive can read comma-separated value (CSV) files. But
what if one of the fields contains a comma? Hive would incorrectly
treat the comma as a field separator. In such an instance it would be
helpful to pre-structure the data before Hive analyzes it. But is there a
tool that can do this?
Apache Pig is such a tool.
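As a sketch of the idea (assuming the contributed piggybank library and its CSVExcelStorage loader are available, and using hypothetical file names), Pig can parse quoted CSV fields that contain embedded commas and re-save the data with a delimiter that Hive can split on safely:

-- the piggybank.jar path is an assumption; adjust for your installation
REGISTER piggybank.jar;
raw = LOAD 'stockdata.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (symbol:chararray, price:float, note:chararray);
STORE raw INTO 'stockdata_clean' USING PigStorage('|');

Hive could then read the pipe-delimited output without misinterpreting the commas inside the note field.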
What Does a Pig Do?
• Apache Pig is a high-level platform for transforming or analyzing large datasets.
Pig includes a procedural scripting language that excels at building data
pipelines to aggregate and add structure to data. Pig also provides data analysts
with tools to analyze data.
Pig was developed at Yahoo! to make it easier for data analysts to perform
complex MapReduce data transformations using a simple scripting
language. Data analysts can use Pig without having to learn the complexities of
writing MapReduce programs. Pig's infrastructure layer includes a compiler that
automatically converts a Pig script to a sequence of MapReduce jobs.
Pig and Hive are complementary tools that are often used together. Pig is often
used to pre-structure data for Hive, and Hive's query language is great for asking
questions of your data.
Pig Use-Cases
• Pig excels at performing a long series of data operations.
• This makes it ideal for three categories of big data jobs:
• Extract-transform-load (ETL) pipelines
• Research on raw data
• Iterative data processing
Extract-Transform-Load
ETL systems are commonly used to integrate data from multiple applications,
typically developed and supported by different vendors.
For example, data in Hadoop is generated by a variety of sources including
databases, logging programs, applications, social media, point-of-sale terminals,
and sensors. This data can be loaded by Pig, transformed by Pig scripts, and saved
back to HDFS. The transformed data can be loaded by Hive, or other tools, for
analysis.
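A minimal Pig Latin sketch of such a pipeline (the paths, fields, and filter condition here are hypothetical) loads raw records, drops malformed ones, reshapes the remaining fields, and writes the result back to HDFS where Hive or another tool can pick it up:

logs  = LOAD '/data/raw/sales' USING PigStorage('\t')
        AS (store_id:int, item:chararray, amount:float);
valid = FILTER logs BY amount IS NOT NULL AND amount > 0.0;
slim  = FOREACH valid GENERATE store_id, item, amount;
STORE slim INTO '/data/clean/sales' USING PigStorage(',');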
Research on Raw Data
Pig is also an analysis tool and is used to perform research on raw data.
Because Pig is a distributed application that runs in Hadoop, it can process
massive amounts of data. You can achieve more detailed and accurate
results by analyzing all the raw data rather than just a subset.
It is better not to sample data. The idea of sampling is to take a portion of
the data and run analysis only on that portion. The problem with not analyzing
all the data is that an analyst might miss key data points that could form
an important relationship.
Let us take two different spell checkers as an example. Engineers at one
software company sampled the most common spelling errors that people
made, such as doubled letters or transposed letters. From this sample
they built patterns to make educated guesses about users’ intentions.
Another software company solved the spelling-correction problem
entirely differently, and much more efficiently, by simply looking at a huge
database of users correcting their own errors. What did users most often
type next after failing to find what they wanted? Today this company is
recognized as having one of the best spell checkers in the world.
Iterative Data Processing
Iterative data processing examines the same dataset
multiple times, but in different ways. The purpose is to
find different patterns and correlations in the data that are
used to influence business decisions. This is made easier
by the fact that neither Pig nor Hadoop alters the raw data
between iterations. The same raw data can be analyzed
over and over again but with a different purpose each
time.
The process operates as follows. A source dataset is
loaded into the Hadoop cluster. A Pig script processes the
dataset and any results are output. Optionally the dataset
might be sent to another program for further analysis and
results. This cycle repeats using the same dataset until
there is no more useful information available from the
data.
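As a sketch (reusing the hypothetical sales data from the ETL example), two separate passes over the same unchanged input can ask entirely different questions:

-- Pass 1: one script computes total sales per store
raw      = LOAD '/data/raw/sales' USING PigStorage('\t')
           AS (store_id:int, item:chararray, amount:float);
by_store = GROUP raw BY store_id;
totals   = FOREACH by_store GENERATE group AS store_id, SUM(raw.amount) AS total;
STORE totals INTO '/results/pass1_totals';

-- Pass 2: a later script over the very same files counts item popularity
raw     = LOAD '/data/raw/sales' USING PigStorage('\t')
          AS (store_id:int, item:chararray, amount:float);
by_item = GROUP raw BY item;
counts  = FOREACH by_item GENERATE group AS item, COUNT(raw) AS n;
STORE counts INTO '/results/pass2_counts';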
What Do Pigs Speak?
Pig's language is Pig Latin. Pig Latin is a simple scripting language for expressing
data transformations. It includes commands for merging datasets, filtering datasets,
and applying functions to records or groups of records. Pig features a variety of
built-in functions for data processing. It also supports user-defined functions that
you design for any unique data processing needs.
Pig processes structured data that has a schema or unstructured data without a
schema. This includes textual data as well as binary data. Pig was designed to eat
anything.
Pig Latin uses SQL-like syntax that permits users to focus on semantics and flow
while allowing the Pig compiler to optimize execution.
Pig scripts read data from, and write data to, the Hadoop distributed file system,
called HDFS.
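As a small sketch of merging, filtering, and applying a built-in function (all file and field names are hypothetical), two datasets stored in HDFS can be joined on a shared key and reshaped in a few statements:

prices = LOAD 'prices.csv' USING PigStorage(',') AS (symbol:chararray, price:float);
names  = LOAD 'names.csv'  USING PigStorage(',') AS (symbol:chararray, company:chararray);
joined = JOIN prices BY symbol, names BY symbol;
report = FOREACH joined GENERATE UPPER(names::company) AS company, prices::price AS price;
DUMP report;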
Submitting Pig Latin Statements
Pig includes three methods to submit Pig
Latin statements.
Pig includes a command-line interface
named the Grunt shell. You may enter Pig
Latin statements interactively using the
Grunt shell.
You may enter Pig Latin statements into a
text file to create a Pig script. To execute the
Pig script use the pig command followed by
the file name. For example, enter pig
script_name.
You may also enter Pig Latin statements from
within a Java program using the
PigServer class.
Example Pig Statements
• A data processing step that creates or alters a dataset results in a relation.
• A relation represents a new dataset.
• Relations are assigned a name called an alias.
• In the following examples stockinfo and IBM_only are aliases.
• Load data from a file and apply a schema:
stockinfo = LOAD 'stockdata.csv' USING PigStorage(',')
AS (symbol:chararray, price:float, change:float);
• Display the data in stockinfo:
DUMP stockinfo;
• Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == 'IBM');
STORE IBM_only INTO 'ibm_stockinfo';
Example Pig Statements
To understand Pig Latin, you must understand the concept of a relation. A data processing
step in a Pig Latin script that creates or alters a dataset results in a new relation. The
relation represents a new dataset.
Relations are assigned a name called an alias. An alias can be as simple as a single letter,
like A, or more descriptive like only_IBM_stock_info. In the illustration, stockinfo and
IBM_only are both aliases for different relations, and each relation represents a different
dataset.
In the first example, the HDFS file stockdata.csv is loaded into a relation named stockinfo,
using a comma as the field separator. A schema is applied to the dataset. The schema
defines three columns named symbol, price,
and change. Any data in the first column will be treated as a string of characters. Any data
in the second and third columns will be treated as a floating point number.
The second example displays the entire contents of stockinfo to the screen. Such a
command could take a long time to complete if there is a large amount of data.
The third example creates a new dataset—a new relation—named IBM_only. The new
dataset is created by filtering the original stockinfo dataset. The new dataset contains only
IBM stock information. The new dataset is written to a file in HDFS named ibm_stockinfo.
Pig Latin syntax rules require that every statement end with a semicolon.
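As one further sketch building on the same stockinfo relation, a grouping step with the built-in AVG function mirrors the HQL aggregate sketched in the Hive portion of this lecture:

by_symbol = GROUP stockinfo BY symbol;
averages  = FOREACH by_symbol GENERATE group AS symbol, AVG(stockinfo.price) AS avg_price;
DUMP averages;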
Pig Summary
• Pig includes a procedural scripting language named Pig Latin that is
automatically converted to a sequence of MapReduce jobs.
• Pig is useful for extract-transform-load operations.
• Pig and Hive are complementary tools.
• You can achieve more detailed and accurate results by analyzing all
the raw data rather than just a subset.
• There are multiple methods to submit Pig Latin statements.
• A Pig relation is the result of a data processing step.