Big Data Analytics: Research Paper


Writing

Data Science big data analytics

ST Xaviers College

Description

Assignment # 3 is a research assignment.

We studied Apache Pig in lecture # 4. You are supposed to do online research and find one case study where Apache Pig was used to solve a particular problem. I am expecting a three-page write-up. Please provide as much technical detail as possible about the solution built with Apache Pig. Please draw technical diagrams to explain the solution.

I am expecting a maximum of one page for the business problem and two pages for the technical solution. I want everyone to do their own research and provide their own write-up.

I am not happy that some students are copying from websites rather than putting in the effort to do research. These assignments will contribute to your final grade, so if you want to score high marks, show originality in your research.

Attachment Preview

CPSC6730 Big Data Analytics Lecture # 4

Apache Hive
• Apache Hive is part of Data Access in the Hadoop ecosystem and can be installed when you install the Hortonworks Data Platform.

The Problem
• Until recently, most of the data maintained by an enterprise has been stored in a relational database and has been analyzed using a structured query language. As a result, most data analysts today are familiar with a structured query language. However, data in Hadoop is commonly analyzed using MapReduce. Many data analysts are not familiar with MapReduce and would require training to use it. This limits how quickly an enterprise can derive value from a Hadoop deployment. How do enterprises bridge this knowledge gap?

The Solution
• Apache Hive bridges the knowledge gap by enabling data analysts to use familiar SQL-like commands that are automatically converted to MapReduce jobs and executed across the Hadoop cluster. Hive is a data warehouse infrastructure built on top of Hadoop. It was designed to enable users with database experience to analyze data using familiar SQL-like statements. Hive includes a SQL-like language called Hive Query Language, or HQL. Hive and HQL enable an enterprise to utilize existing skill sets to quickly derive value from a Hadoop deployment.

OLTP or OLAP
• Hive is used for online analytical processing (OLAP) and not online transaction processing (OLTP). This is because Hive was originally designed to run batch jobs rather than perform interactive queries or random table updates. Currently Hive offers no support for row-level inserts, updates, and deletes, which are commonly required for OLTP. When Hive is run over MapReduce, even the simplest Hive queries can take minutes to complete. If you run Hive over Tez (we will discuss it in later classes) rather than MapReduce, Hive is still not designed for OLTP. While Tez increases interactive performance, Hive still has no support for row-level inserts, updates, and deletes. However, work is currently being done to add these features to Hive.

Structuring Unstructured Data
• Hive is not a relational database although, on the surface, it can appear like one. Hadoop was built to collect, store, and analyze massive amounts of data. As such, the Hadoop distributed file system, called HDFS, is a reservoir of data from multiple sources. The data is often a mix of unstructured, semi-structured, and structured data. Hive provides a mechanism to project structure onto HDFS data and then query it using HQL. However, there is a limit to what Hive can do. Sometimes it is necessary to use another tool, like Apache Pig, to pre-format the unstructured data before processing it using Hive.
• If you are familiar with databases, then you understand that unstructured data has no schema associated with it. If you are not familiar with database schemas, they define the columns of a database along with the type of data in each column. Data types include such things as a string, an integer, a floating point number, or a date. A Hive installation includes a metastore database. Several database types are supported by Hive, including an embedded Derby database used for development or testing, or an external database like MySQL used for production deployments. To project structure on HDFS data, HQL includes statements to create a table with user-defined schema information. The table schema is stored in the metastore database.
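To make "projecting structure onto HDFS data" concrete, here is a minimal HQL sketch (the table name, columns, and HDFS path are hypothetical, not taken from the lecture). An EXTERNAL table simply records a schema in the metastore over files that already sit in HDFS; the managed-table variant, where LOAD DATA associates files with a table, appears in the example commands further below:

CREATE EXTERNAL TABLE weblogs (ip STRING, request STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/me/weblogs';

SELECT status, COUNT(*) FROM weblogs GROUP BY status;

Hive compiles the SELECT into a MapReduce (or Tez) job that runs across the cluster; the raw files under /user/me/weblogs are left unchanged.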
• The user-defined schema is associated with the data stored in one or more HDFS files when you use HQL statements to load the files into a table. The format of the data on HDFS remains unchanged, but it appears as structured data when using HQL commands to submit queries.

Submitting Hive Queries
• Hive includes many methods to submit queries. Queries submitted to either the HiveServer or the newer HiveServer2 result in a MapReduce or Tez job being submitted to YARN. YARN, the Hadoop resource scheduler, works in concert with HDFS to run the job in parallel across the machines in the cluster. The Hive CLI is used to interactively or non-interactively submit HQL commands to the HiveServer. The illustration shows the Hive CLI being used interactively. Users enter HQL commands at the hive> prompt. HQL commands can also be placed into a file and run using hive -f file_name
• The remaining three methods all submit HQL queries to the newer HiveServer2. The Beeline CLI is a new JDBC client that connects to a local or remote HiveServer2. When connecting locally, Beeline works just like the Hive CLI. Beeline can connect to a remote HiveServer2 using a variety of methods that include TCP and HTTP. For example, HTTP access is useful for submitting queries to a firewall-protected cluster, assuming the firewall will allow HTTP traffic. The Web UI, called the Hive Web Interface or HWI, enables you to submit Hive queries remotely using HTTP. Again, this is useful for submitting queries to a firewall-protected cluster. The difference between using the Web UI and Beeline is that no Hive client software has to be installed to use the Web UI. ODBC and JDBC drivers enable popular business intelligence tools to connect to Hive to query, analyze, and visualize Hive data.

Example HQL Commands
• Create a table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT, change FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• Load data from a file in HDFS:
LOAD DATA INPATH '/user/me/stockdata.csv' OVERWRITE INTO TABLE stockinfo;
• View everything in the table:
SELECT * FROM stockinfo;
• A few common HQL commands are illustrated here. The first command creates a table named stockinfo. The table is created with three columns named symbol, price, and change. Any data in the first column will be treated as a string of characters. Any data in the second and third columns will be treated as a floating point number. The ROW FORMAT clause tells Hive to expect each row of data to be delimited by commas. The second command loads the data in the HDFS file named /user/me/stockdata.csv into a table named stockinfo. If any data has been previously loaded into the table, it will be overwritten. The final command displays the entire contents of the stockinfo table. Such a command could take a long time to complete if there is a large amount of data. HQL syntax rules require that every command end with a semicolon.

Hive Summary
• Hive was designed to enable experienced database users to analyze data using familiar SQL-like statements.
• Hive is suitable for online analytical processing and not online transaction processing.
• Hive over MapReduce supports batch queries. Hive over Tez supports batch and interactive queries.
• Hive is not a relational database. It only projects structure on unstructured data in HDFS.
• There are multiple methods to submit Hive queries.
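As a quick illustration of the query-submission methods summarized above, here is a minimal command-line sketch (the host name and script file name are hypothetical; port 10000 is the common HiveServer2 default):

hive -f stock_queries.hql
beeline -u jdbc:hive2://hiveserver.example.com:10000/default -f stock_queries.hql

The first line runs a file of HQL statements non-interactively through the Hive CLI; the second connects Beeline to a remote HiveServer2 over JDBC and runs the same file.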
Apache Pig
• Apache Pig is part of Data Access in the Hadoop ecosystem and is installed when you install the Hortonworks Data Platform.

Why Does the World Need a Pig?
• Hadoop includes a tool named Apache Hive that enables data analysts to use familiar SQL-like commands to analyze data. Hive can even project a structure onto unstructured data. However, there are limits to what Hive can do with unstructured data. For example, Hive can read comma-separated value (CSV) files. But what if one of the fields contains a comma? Hive would incorrectly treat the comma as a field separator. In such an instance it would be helpful to pre-structure the data before Hive analyzes it. But is there a tool that can do this? Apache Pig is such a tool.

What Does a Pig Do?
• Apache Pig is a high-level platform for transforming or analyzing large datasets. Pig includes a scripted, procedural-based language that excels at building data pipelines to aggregate and add structure to data. Pig also provides data analysts with tools to analyze data. Pig was developed at Yahoo! to make it easier for data analysts to perform complex MapReduce data transformations using a simple scripting language. Data analysts can use Pig without having to learn the complexities of writing MapReduce programs. Pig's infrastructure layer includes a compiler that automatically converts a Pig script to a sequence of MapReduce jobs. Pig and Hive are complementary tools that are often used together. Pig is often used to pre-structure data for Hive. Hive's query language is great for asking a question of your data.

Pig Use-Cases
• Pig excels at performing a long series of data operations.
• This makes it ideal for three categories of big data jobs:
• Extract-transform-load (ETL) pipelines
• Research on raw data
• Iterative data processing

Extract-Transform-Load
• ETL systems are commonly used to integrate data from multiple applications, typically developed and supported by different vendors. For example, data in Hadoop is generated by a variety of sources including databases, logging programs, applications, social media, point-of-sale terminals, and sensors. This data can be loaded by Pig, transformed by Pig scripts, and saved back to HDFS. The transformed data can be loaded by Hive, or other tools, for analysis. A short sketch of such a pipeline follows.
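To make the ETL use case concrete, here is a short Pig Latin sketch of the kind of pre-structuring pipeline described above (the paths, field names, and cleaning rule are hypothetical, not from the lecture). It loads raw pipe-delimited records, drops rows whose amount field is not numeric, casts the remaining fields, and stores tab-delimited output that Hive could then load into a table with a matching schema:

raw = LOAD '/data/raw/sales_log' USING PigStorage('|')
      AS (ts:chararray, store:chararray, item:chararray, amount:chararray);
-- keep only rows whose amount parses as a number (hypothetical cleaning rule)
clean = FILTER raw BY amount MATCHES '[0-9]+(\\.[0-9]+)?';
-- cast the amount to a float and keep the columns Hive will expect
shaped = FOREACH clean GENERATE ts, store, item, (float)amount AS amount;
STORE shaped INTO '/data/clean/sales' USING PigStorage('\t');

Each of the first three statements produces a new relation (raw, clean, shaped), and the Pig compiler turns the whole script into a sequence of MapReduce jobs.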
Research on Raw Data
• Pig is also an analysis tool and is used to perform research on raw data. Because Pig is a distributed application that runs in Hadoop, it can process massive amounts of data. You can achieve more detailed and accurate results by analyzing all the raw data rather than just a subset. It is better not to sample data. The idea of sampling is to take a portion of the data and run analysis on that portion. The problem with not analyzing all the data is that an analyst might miss key data points that could form an important relationship. Let us take two different spell checkers as an example. Engineers at one software company sampled the most common spelling errors that people made, such as doubled letters or transposed letters. From this sample they built patterns to make educated guesses about users' intentions. Another software company solved the spelling-correction problem entirely differently, and much more efficiently, by simply looking at a huge database of users correcting their own errors. What did users most often type next after failing to find what they wanted? Today this company is recognized as having one of the best spell checkers in the world.

Iterative Data Processing
• Iterative data processing examines the same dataset multiple times, but in different ways. The purpose is to find different patterns and correlations in the data that are used to influence business decisions. This is made easier by the fact that neither Pig nor Hadoop alters the raw data between iterations. The same raw data can be analyzed over and over again but with a different purpose each time. The process operates as follows. A source dataset is loaded into the Hadoop cluster. A Pig script processes the dataset and any results are output. Optionally the dataset might be sent to another program for further analysis and results. This cycle repeats using the same dataset until there is no more useful information available from the data.

What Do Pigs Speak?
• Pig's language is Pig Latin. Pig Latin is a simple scripting language for expressing data transformations. It includes commands for merging datasets, filtering datasets, and applying functions to records or groups of records. Pig features a variety of built-in functions for data processing. It also supports user-defined functions that you design for any unique data processing needs. Pig processes structured data that has a schema or unstructured data without a schema. This includes textual data as well as binary data. Pig was designed to eat anything. Pig Latin uses SQL-like syntax that permits users to focus on semantics and flow while allowing the Pig compiler to optimize execution. Pig scripts read data from, and write data to, the Hadoop distributed file system, called HDFS.

Submitting Pig Latin Statements
• Pig includes three methods to submit Pig Latin statements, as sketched below. Pig includes a command-line interface named the Grunt shell. You may enter Pig Latin statements interactively using the Grunt shell. You may enter Pig Latin statements into a text file to create a Pig script. To execute the Pig script use the pig command followed by the file name. For example, enter pig script_name. You may also enter Pig Latin statements from within a Java program using the PigServer class.
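As a minimal command-line sketch of those submission methods (the script name is hypothetical; the PigServer Java route is not shown here):

pig
pig stock_filter.pig
pig -x local stock_filter.pig

The first command opens the interactive Grunt shell (the grunt> prompt), the second runs a Pig Latin script on the cluster, and the third runs the same script in local mode for quick testing on small samples.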
Example Pig Statements
• A data processing step that creates or alters a dataset results in a relation.
• A relation represents a new dataset.
• Relations are assigned a name called an alias.
• In the following examples stockinfo and IBM_only are aliases.
• Load data from a file and apply a schema:
stockinfo = LOAD 'stockdata.csv' USING PigStorage(',') AS (symbol:chararray, price:float, change:float);
• Display the data in stockinfo:
DUMP stockinfo;
• Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == 'IBM');
STORE IBM_only INTO 'ibm_stockinfo';
• To understand Pig Latin, you must understand the concept of a relation. A data processing step in a Pig Latin script that creates or alters a dataset results in a new relation. The relation represents a new dataset. Relations are assigned a name called an alias. An alias can be as simple as a single letter, like A, or more descriptive like only_IBM_stock_info. In the illustration, stockinfo and IBM_only are both aliases for different relations, and each relation represents a different dataset. In the first example, the HDFS file stockdata.csv is loaded into a relation named stockinfo. A schema is applied to the dataset. The schema defines three columns named symbol, price, and change. Any data in the first column will be treated as a string of characters (chararray). Any data in the second and third columns will be treated as a floating point number. The second example displays the entire contents of stockinfo to the screen. Such a command could take a long time to complete if there is a large amount of data. The third example creates a new dataset (a new relation) named IBM_only. The new dataset is created by filtering the original stockinfo dataset. The new dataset contains only IBM stock information. The new dataset is written to a file in HDFS named ibm_stockinfo. Pig Latin syntax rules require that every statement end with a semicolon.

Pig Summary
• Pig includes a scripted, procedural-based language named Pig Latin that is automatically converted to a MapReduce job.
• Pig is useful for extract-transform-load operations.
• Pig and Hive are complementary tools.
• You can achieve more detailed and accurate results by analyzing all the raw data rather than just a subset.
• There are multiple methods to submit Pig Latin statements.
• A Pig relation is the result of a data processing step.

Explanation & Answer

Please find the attached file below. Please feel free to ask me in case of any questions. Thanks!

Running Head: Big Data Analytics.

Your name
Instructor
Course
Date

Sentiment Analysis Using Apache Pig: A Case Study of Twitter Data Analysis
Business Problem.
Companies launching new products or evaluating the quality of their service want to know what customers think about their products. They would like to know whether customers are satisfied, dissatisfied, or have proposals for modifications. This knowledge helps companies improve by modifying, or entirely redesigning, their products to meet customer demands.

Collecting data about customer sentiment manually is challenging and costly. The emergence of social media has enabled companies to solicit sentiments directly from customers, who express their attitudes, feelings, and thoughts about the products companies provide. Social media has created a large pool of consumers with whom businesses can interact freely. However, this has only partially solved the problem.

On social media, companies have asked customers to share their sentiments about the products or services they provide. In most instances, the sheer size of the customer pool makes it difficult to attend to each sentiment and classify it as positive or negative. Classifying emotions from customer expressions, a core part of social data analysis, is becoming more technically demanding as the number of customers interacting with their product manuf...
