University Of Michigan Flint Chapter 10 Hadoop and its Core Components Paper


Description

After reading Chapter 10 in your textbook, please provide a brief response to the following assessment questions.

In your own words and understanding after reading Chapter 10:

  • What is Hadoop?
  • Name the two main components of Hadoop and discuss the roles those two components play during system failures.

Your initial post should be at least 100 words.

***Please follow APA format***

YouTube link: youtube.com/watch?v=MfF750YVDxM

https://hadoopecosystemtable.github.io/

Unformatted Attachment Preview

Chap 10: Adv. Analytics – Tech & Tools: MapReduce and Hadoop

10.1 Analytics for Unstructured Data

10.1.1 Use Cases
  • IBM Watson – Jeopardy-playing machine
    • To educate Watson, Hadoop was utilized to process data sources such as encyclopedias, dictionaries, news wire feeds, literature, Wikipedia, etc.
  • LinkedIn – network of over 250 million users in 200 countries
    • Hadoop is used to process daily transaction logs, examine users' activities, feed extracted data back to production systems, restructure the data, and develop and test analytic models
  • Yahoo! – large Hadoop deployment
    • Search index creation and maintenance, webpage content optimization, spam filters, etc.

10.1.2 MapReduce
  • The MapReduce paradigm breaks a large task into smaller tasks, runs the tasks in parallel, and consolidates the outputs of the individual tasks into the final output
  • Map
    • Applies an operation to a piece of data
    • Provides some intermediate output
  • Reduce
    • Consolidates the intermediate outputs from the map steps
    • Provides the final output
  • Each step uses key/value pairs as input and output
  [Figure: MapReduce word count example]

10.1.3 Apache Hadoop
  • MapReduce is a simple paradigm to understand but not easy to implement
  • Executing a MapReduce job requires:
    • Jobs scheduled based on the system's workload
    • Input data spread across a cluster of machines
    • Map step spread across the distributed system
    • Intermediate outputs collected and provided to the proper machines for the reduce step
    • Final output made available to another user, another application, or another MapReduce job
  • Hadoop Distributed File System (HDFS)
    • A file system that distributes data across a cluster to take advantage of the parallel processing of MapReduce
    • HDFS uses three Java daemons (background processes):
      1. NameNode – determines and tracks where the various blocks of data are stored
      2. DataNode – manages the data stored on each machine
      3. Secondary NameNode – performs some of the NameNode tasks to reduce the load on the NameNode
  • Structuring a MapReduce Job in Hadoop
    • A typical MapReduce Java program has three classes:
      • Driver – provides details such as input file locations, the names of the mapper and reducer classes, the location of the reduce class output, etc.
      • Mapper – provides the logic to process each data block
      • Reducer – reduces the data provided by the mapper
    [Figure: A file stored in HDFS]
  • Additional Considerations in Structuring a MapReduce Job
    [Figures: Shuffle and Sort; Using a combiner; Using a custom partitioner]
  • Developing and Executing a Hadoop MapReduce Program
    • Common practice is to use an IDE tool such as Eclipse
    • The MapReduce program consists of three Java files: the driver code, map code, and reduce code
    • The Java code is compiled, stored in a JAR file, and executed against the specified HDFS input files
  • Yet Another Resource Negotiator (YARN)
    • Hadoop continues to undergo development
    • An important change was to separate the MapReduce functionality from the management of running jobs and the distributed environment
    • This rewrite is sometimes called Yet Another Resource Negotiator (YARN)

10.2 The Hadoop Ecosystem
  • Tools have been developed to make Hadoop easier to use and to provide additional functionality and features:
    • Pig – provides a high-level data-flow programming language
    • Hive – provides SQL-like access
    • HBase – provides real-time reads and writes
    • Mahout – provides analytical tools

10.2.1 Pig
  • Pig consists of:
    • A data flow language called Pig Latin
    • An environment to execute the Pig code
  [Figures: Example of Pig commands; Built-in Pig functions]

10.2.2 Hive
  • The Hive language, Hive Query Language (HiveQL), resembles SQL rather than a scripting language
  [Figure: Example Hive code]

10.2.3 HBase
  • Unlike Pig and Hive, which are intended for batch applications, HBase can provide real-time read and write access to huge datasets
    • Example – choosing a shipping address at checkout

10.2.4 Mahout
  • Mahout permits the application of analytical techniques within the Hadoop environment
  • Mahout provides Java code that implements the usual classification, clustering, and recommender/collaborative filtering algorithms
  • Users can download Apache Hadoop directly from www.apache.org or use commercial packages
    • For example, the Pivotal company provides Pivotal HD Enterprise
  [Figure: Components of Pivotal HD Enterprise]

10.3 NoSQL
  • NoSQL = Not only Structured Query Language
  • Four major categories of NoSQL tools:
    • Key/value stores – contain data (the value) accessed by the key
    • Document stores – good when the value of the key/value pair is a file
    • Column family stores – good for sparse datasets
    • Graph stores – good for items and the relationships between them
      • Social networks like Facebook and LinkedIn
  [Figure: Examples of NoSQL data stores]
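The map, shuffle/sort, and reduce steps from section 10.1.2 can be sketched with the classic word-count example. The following is a minimal single-process simulation in Python, not Hadoop's actual Java API; the sample documents are invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map step: apply an operation to each piece of data,
    # emitting an intermediate (word, 1) key/value pair per word.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle and sort: sort the intermediate pairs and group them by
    # key, so each reducer sees all values for a given word together.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    # Reduce step: consolidate the intermediate outputs into final totals.
    return {word: sum(counts) for word, counts in grouped}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_sort(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In real Hadoop, each of these stages runs as distributed tasks over HDFS blocks rather than as in-memory generators, but the key/value flow is the same.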
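The combiner mentioned under "Additional Considerations" performs a local, partial reduce on each mapper's output before the shuffle, so fewer intermediate pairs cross the network. A single-process sketch of that idea in Python (in real Hadoop the combiner is a Java class configured on the job, often the reducer class itself):

```python
from collections import Counter

def mapper_with_combiner(doc):
    # Without a combiner, the mapper emits one (word, 1) pair per word.
    # The combiner pre-aggregates counts locally on the mapper's node,
    # so only one (word, count) pair per distinct word is shuffled.
    return Counter(doc.lower().split())

def reducer(partial_counts):
    # The reducer's logic is unchanged: it still sums counts per word,
    # whether the incoming values are raw 1s or combined subtotals.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

docs = ["the quick brown fox", "the lazy dog", "the fox"]
total = reducer(mapper_with_combiner(d) for d in docs)
print(total["the"])  # 3
```

This only works because word-count's reduce operation is associative and commutative; a combiner cannot be used when partial aggregation would change the final answer.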
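The key/value store category from section 10.3 can be illustrated with a minimal in-memory sketch in Python. Real stores add persistence, replication, and distribution; the class and the key-naming scheme below are purely hypothetical, loosely echoing the shipping-address-at-checkout example from the HBase discussion:

```python
class KeyValueStore:
    """Minimal in-memory sketch of a NoSQL key/value store: the value is
    opaque to the store and can be retrieved only by its exact key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store any value under the key; no schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup is by key only; there is no query language over values.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42:shipping", {"street": "123 Main St", "city": "Flint"})
print(store.get("user:42:shipping")["city"])  # Flint
```

The trade-off is typical of NoSQL key/value stores: lookups by key are fast and simple, but there is no way to query by the contents of the value without scanning everything.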

Explanation & Answer

This work is ready! See the attached file and get back to me if you need changes.

Running head: HADOOP AND ITS CORE COMPONENTS

Hadoop and its Core Components
Student’s Name
Professor
Course
Date


Hadoop and its Core Components
Hadoop is an open-source software framework used to store and process large amounts of data across clusters of machines. It is an important part of the...

