IoT and Big Data, computer science assignment help

User Generated

ybpb_02

Computer Science

Description

  • Write a report of at least 3 pages on the Internet of Things (IoT) and Big Data. You can use the resources I uploaded, but feel free to find additional information for this assignment. Also be sure to cite sources in APA format. The report should have three parts: 1. a summary/explanation of “Big Data” and why it is important; 2. a summary/explanation of IoT and its importance; and 3. a description of how IoT and Big Data are related. Be sure to proofread and write carefully. You may use double spacing, and be sure to use 12 pt font or smaller.

Unformatted Attachment Preview

Why is BIG Data Important?
A Navint Partners White Paper, May 2012
www.navint.com

What is Big Data?

Big data is a term that refers to data sets, or combinations of data sets, whose size (volume), complexity (variability), and rate of growth (velocity) make them difficult to capture, manage, process, or analyze with conventional technologies and tools, such as relational databases and desktop statistics or visualization packages, within the time necessary to make them useful. While the size used to determine whether a particular data set is considered big data is not firmly defined and continues to change over time, most analysts and practitioners currently refer to data sets from 30-50 terabytes (a terabyte is 10^12 bytes, or 1,000 gigabytes) up to multiple petabytes (a petabyte is 10^15 bytes, or 1,000 terabytes) as big data.

The complex nature of big data is primarily driven by the unstructured nature of much of the data generated by modern technologies, such as web logs, radio frequency identification (RFID), sensors embedded in devices, machinery, and vehicles, Internet searches, social networks such as Facebook, portable computers, smartphones and other cell phones, GPS devices, and call center records. In most cases, in order to effectively utilize big data, it must be combined with structured data (typically from a relational database) from a more conventional business application, such as Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM).

Like the complexity, or variability, aspect of big data, its rate of growth, or velocity, is largely due to the ubiquitous nature of modern online, real-time data capture devices, systems, and networks. The rate of growth of big data is expected to continue increasing for the foreseeable future.

Specific new big data technologies and tools have been and continue to be developed. Much of the new big data technology relies heavily on massively parallel processing (MPP) databases, which can concurrently distribute the processing of very large sets of data across many servers. As another example, specific database query tools have been developed for working with the massive amounts of unstructured data being generated in big data environments.

BIG Data – Growth and Size Facts (*MGI estimates)
• There were 5 billion mobile phones in use in 2010.
• There are 30 billion pieces of content shared on Facebook each month.
• There is a 40% projected growth in global data generated per year vs. 5% growth in global IT spending.
• There were 235 terabytes of data collected by the US Library of Congress in April 2011.
• 15 out of 17 major business sectors in the United States have more data stored per company than the US Library of Congress.

Big Data – Value Potential (*)
• $300 billion annual value to US healthcare – more than twice the total annual healthcare spending in Spain.
• $600 billion – potential annual consumer surplus from using personal location data globally.
• 60% – potential increase in retailers’ operating margins possible via use of big data.

Big Data – Industry Examples
• A major utility company integrates usage data recorded from smart meters, in near real time, into its analysis of the national energy grid.
• Pay television providers have begun to customize ads based on individual household demographics and viewing patterns.
• A major entertainment company is able to analyze its data and customer patterns across its many and varied enterprises – for example, using park attendance, online purchase, and television viewership data.
• The security arm of a financial services firm detects fraud by correlating activities across multiple data sets. As new fraud methods are detected and understood, they are encoded as new algorithms in the fraud detection system.

Why is Big Data Important?

When big data is effectively and efficiently captured, processed, and analyzed, companies gain a more complete understanding of their business, customers, products, and competitors, which can lead to efficiency improvements, increased sales, lower costs, better customer service, and improved products and services. For example:

• Manufacturing companies deploy sensors in their products to return a stream of telemetry. Sometimes this is used to deliver services such as OnStar, which provides communications, security, and navigation services. Perhaps more importantly, this telemetry also reveals usage patterns, failure rates, and other opportunities for product improvement that can reduce development and assembly costs. (**Oracle)
• The proliferation of smartphones and other GPS devices offers advertisers an opportunity to target consumers when they are in close proximity to a store, a coffee shop, or a restaurant. This opens up new revenue for service providers and offers many businesses a chance to target new customers. (**)
• Retailers usually know who buys their products. Use of social media and web log files from their e-commerce sites can help them understand who did not buy and why, information not otherwise available to them today. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies. (**)

Other widely cited examples of the effective use of big data include:
• Using information technology (IT) logs to improve IT troubleshooting and security breach detection, speed, effectiveness, and future occurrence prevention.
• Using voluminous historical call center information more quickly, in order to improve customer interaction and satisfaction.
• Using social media content to better and more quickly understand customer sentiment about you and your customers, and to improve products, services, and customer interaction.
• Fraud detection and prevention in any industry that processes financial transactions online, such as shopping, banking, investing, insurance, and health care claims.
• Using financial market transaction information to more quickly assess risk and take corrective action.
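Many of the uses above reduce to the pattern the paper attributes to MPP databases: split a large data set into partitions, process the partitions in parallel, and combine the partial results. A minimal, hypothetical Python sketch of that idea using only the standard library and toy data (the record layout and values are invented for illustration):

# Toy illustration of MPP-style "partition, process in parallel, combine".
# The sales records below are invented; a real system would read them from
# distributed storage rather than build them in memory.
from multiprocessing import Pool
from collections import Counter

records = [("store_%d" % (i % 4), i % 50) for i in range(100_000)]  # (store, sale amount)

def partial_totals(chunk):
    """Aggregate one partition: total sales per store."""
    totals = Counter()
    for store, amount in chunk:
        totals[store] += amount
    return totals

if __name__ == "__main__":
    n_workers = 4
    chunks = [records[i::n_workers] for i in range(n_workers)]   # partition the data
    with Pool(n_workers) as pool:
        partials = pool.map(partial_totals, chunks)              # process partitions in parallel
    combined = sum(partials, Counter())                          # combine partial results
    print(combined.most_common(3))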
Key Big Data Challenges

• Understanding and Utilizing Big Data – In most industries and companies that deal with big data, it is a daunting task just to understand what data is available to be used and to determine the best use of that data based on the company’s industry, strategy, and tactics. These analyses also need to be performed on an ongoing basis, as the data landscape changes at an ever-increasing rate and as executives develop more and more of an appetite for analytics based on all available information.

• New, Complex, and Continuously Emerging Technologies – Since much of the technology required to utilize big data is new to most organizations, they will need to learn about these technologies at an ever-accelerating pace, and potentially engage with different technology providers and partners than they have used in the past. As with all technology, firms entering the world of big data will need to balance the business needs associated with big data against the costs of entering into and remaining engaged in big data capture, storage, processing, and analysis.

• Cloud-Based Solutions – A new class of business software applications has emerged in which company data is managed and stored in data centers around the globe. While these solutions range from ERP, CRM, document management, data warehouses, and business intelligence to many others, the common issue remains the safekeeping and management of confidential company data. These solutions often offer companies tremendous flexibility and cost-saving opportunities compared to more traditional on-premises solutions, but they raise a new dimension of data security and the overall management of an enterprise’s big data.

• Privacy, Security, and Regulatory Considerations – Given the volume and complexity of big data, it is challenging for most firms to obtain a reliable grasp of the content of all of their data and to capture and secure it adequately, so that confidential and/or private business and customer data are not accessed by or disclosed to unauthorized parties. The costs of a data privacy breach can be enormous: in the health care field, for instance, class action lawsuits have been filed in which the plaintiff sought $1,000 per patient record inappropriately accessed or lost. On the regulatory side, the proper storage and transmission of personally identifiable information (PII), including that contained in unstructured data such as emails, can be problematic and necessitate new and improved security measures and technologies. For companies doing business globally, there are significant differences in privacy laws between the U.S. and other countries. Lastly, it will be very important for most firms to tightly integrate their big data, data security/privacy, and regulatory functions.

• Archiving and Disposal of Big Data – Since big data loses its value to current decision-making over time, and since it is voluminous and varied in content and structure, new tools, technologies, and methods are needed to archive and delete big data without sacrificing its usefulness for current business needs.

• The Need for IT, Data Analyst, and Management Resources – It is estimated that approximately 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers are needed, either retrained or hired. Therefore, any firm that undertakes a big data initiative will likely need to retrain existing people or engage new people for the initiative to be successful.

Developing a Big Data Strategy

[Figure: a strategy framework progressing from Big Data Basics (business systems, social data, unstructured data, process data) through a Big Data Assessment (volumes and metrics, estimated growth, privacy and regulatory compliance) to a Big Data Strategy (sources and uses, organization impacts, opportunity analysis, methods and tools, impact/value potential, business case and ROI).]

About Navint Partners
Navint is a different kind of management consulting firm, excelling in large-scale business process change. With offices in New York, Chicago, Boston, Pittsburgh, Philadelphia, and Rochester, Navint’s consultants specialize in managing the alignment of people, processes, and technology when organizations face operational restructuring and IT transformation. A unique blend of experience and innovative thinking allows Navint consultants to address clients' business challenges in imaginative ways. http://www.navint.com/

Contents
Introduction
View Point – Phil Shelley, CTO, Sears Holdings
Making it Real – Industry Use Cases
  Retail – Extreme Personalization
  Airlines – Smart Pricing
  Auto – Warranty and Insurance Efficiency
  Financial Services – Fraud Detection
  Energy – Tapping Intelligence in Smart Grid / Meters
  Data Warehousing – Faster and Cost Effective
View Point – Doug Cutting, Co-founder, Apache Hadoop
Making it Real – Key Challenges
  Protecting Privacy
  Integrating with Enterprise Systems
  Handling Real Time Analytics
  Leveraging Cloud Computing
View Point – S. Gopalakrishnan (Kris), Co-Chairman, Infosys
Making it Real – Infosys Adoption Enablers
  Accelerators – Solution and Expertise
  Services – Extreme Data
  Product – Voice of Customer Analytics
  Platform – Social Edge for Big Data

Introduction

What is Big Data?

Today we live in a digital world. With increased digitization, the amount of structured and unstructured data being created and stored is exploding. The data is generated from various sources – transactions, social media, sensors, digital images, videos, audio, and clickstreams – across domains including healthcare, retail, energy, and utilities. In addition to businesses and organizations, individuals contribute to the data volume. For instance, 30 billion pieces of content are shared on Facebook every month, and the photos viewed in Picasa every 16 seconds could cover a football field. It gets more interesting: IDC terms this the ‘Digital Universe’ and predicts that it will explode to an unimaginable 8 zettabytes by 2015 – roughly a stack of DVDs reaching from Earth all the way to Mars. The term “Big Data” was coined to address the storage and processing of this massive volume of data.

[Figure: Big Data, characterized by volume, velocity, and variety, at the center of its sources – social data, transaction data, location/geo data, media, clickstream data, and sensor data.]

It is increasingly imperative for organizations to mine this data to stay competitive. Analyzing it can provide significant competitive advantage for an enterprise: when analyzed properly, the data yields a wealth of information that helps businesses redefine their strategies. However, current big data sets are too large and complex to be managed and processed by conventional relational database and data warehousing technologies. The volume, variety, and velocity of big data cause performance problems when it is created, managed, and analyzed with conventional data processing techniques. Using conventional techniques for big data storage and analysis is less efficient because memory access is slower; data collection is also challenging because data of many types and volumes has to be drawn from sources of different kinds; and the existing techniques require high-end hardware to handle data of such volume, velocity, and variety.

Big data is a relatively new phenomenon, and as with any new technology, its adoption depends on the tangible benefits it provides to the business. Large data sets that are often dismissed as information overload are invariably treasure troves of business insight. They have immense value for improving business forecasts, supporting decision making, and shaping business strategy against competitors. For instance, Facebook, blog, and Twitter data give insight into current business trends.

[Figure: as information assets grow, capabilities such as data mining, reporting, aggregated intelligence, forecasting, real-time analytics, and storage translate into faster business decisions and innovative business value.]

These data sets are beyond the capability of humans to analyze manually. Big data tools can run ad hoc queries against large data sets in less time and with reasonable performance. In the retail domain, for instance, understanding what makes a buyer look at a product online, or analyzing sentiment toward a product based on Facebook posts, tweets, and blogs, is of great value to the business and enables it to improve its services for customers.
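As a toy illustration of the product-sentiment tally just described, the hypothetical sketch below scores a handful of invented posts against a hand-made word list; real sentiment analysis over Facebook, Twitter, and blog data would use much larger corpora and trained models rather than keyword counts.

# Minimal keyword-based sentiment tally over invented posts (illustrative only).
POSITIVE = {"love", "great", "excellent", "good", "fast"}
NEGATIVE = {"hate", "slow", "broken", "bad", "poor"}

posts = [
    "love the new phone, camera is great",
    "battery is bad and support was slow",
    "excellent screen, good price",
]

def score(text):
    """Positive-keyword count minus negative-keyword count for one post."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

overall = sum(score(p) for p in posts)
print("per-post scores:", [score(p) for p in posts], "overall:", overall)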
Big data analysis enables executives to get the relevant data in less time for making decisions. Big data can pave the way for fraud analysis, customer segmentation based on in-store behavior, and loyalty programs that identify and target customers. It enables innovative analysis that changes the way we think about data.

Exploring the Big Data Spectrum

With unstructured data dominating the world of data, the way to exploit it is only now becoming clear. Information proliferation plays a vital role in creating opportunities, and it also presents a plethora of challenges. The industry opportunities presented by this wealth of data are plentiful, and understanding how to leverage them is a clear business need. The Big Data spectrum covered here spans use cases from five different industries: Retail, Airlines, Auto, Financial Services, and Energy. All of these opportunities come with a set of challenges; how to recognize and address them is discussed in the Key Challenges section. To name a few: data privacy, data security, integrating various technologies, catering to real-time flows of data, and leveraging cloud computing.

To dive deep into big data technology with the goal of a quick, well-managed, high-quality implementation, a set of enablers was designed by the architects at Infosys; the section on Adoption Enablers gives insight into these enablers. The sections are interleaved with viewpoints from Phil Shelley, CTO, Sears Holdings Corporation; Doug Cutting, co-founder of Apache Hadoop (popularly known as the father of Big Data); and Kris Gopalakrishnan, Co-Chairman, Infosys Ltd.

View Point – Phil Shelley, CTO, Sears Holdings Corporation
Dr. Shelley is a member of the CIO forum and the Big Data Chicago forum.

Phil, Sears is one of the early adopters of Big Data. What are the sweet-spot use cases in the retail industry?

There is business value to be mined from transactional data such as POS, web-based activity, loyalty-based activity, product push, seasonality, weather patterns, and the major trends that affect retail. When you add to this the data in the social space, the sheer amount of data is way beyond what traditional database solutions can handle. That is where Hadoop plays a role: to capture and keep this data at the finest level of detail.

What kind of challenges are you facing in implementing Big Data solutions?

Hadoop is relatively low cost to implement. However, to get started, you still need some kind of business case. It is a good idea to start small and have a very specific use case in mind. At the same time, a picture of where big data will be valuable long term is important as well. Focusing on a key use case that can demonstrate business value is probably the way to start; for any company, a big-bang approach is not something I would recommend.

How does Big Data management integrate with your enterprise systems?

Hadoop is not a panacea. Big data solutions will be a hybrid of traditional databases, data warehouse appliances, and Hadoop. The combination of high-speed SQL access and the heavy lifting of Hadoop can work together very well. This means that you need to synchronize the data between these data sources. One way to do this is to aggregate your data before you send it to an appliance. What data needs to be shared and how the synchronization happens will need some careful thinking. At least for the next few years it is going to be an ecosystem of Hadoop combined with a more traditional database system.
How do you handle data privacy and security issues in Big Data management?

Personally, I would not put very sensitive data on a public cloud, because the risk of exposure could be catastrophic to the company. A private cloud that is co-located with my data center, or a virtual private cloud that is physically caged, are approaches I would recommend. Out of the box, Hadoop has security limitations; you have to explicitly design your data for security. There are ways of securing credit card and personal information, and I would recommend that anyone looking to secure such data get some help on how to structure a big data solution and not expose themselves. This is an area that is somewhat new and prone to lax security.

Phil, one last question: as CTO of Sears, how do you see the connection between the real-time digital enterprise and Big Data?

Today Hadoop can have near real-time copies of transactional data and near real-time batch reporting with as little as minutes of latency. You can process the data in near real time and then access it from a data mart in real time, which creates many possibilities. But it is really going to be an ecosystem of the right tools for the right jobs.

Making it Real – Industry Use Cases

Retail – Extreme Personalization

Use Case Context
In the beginning there was a complete directory of the whole World Wide Web; one knew about all the servers in existence. Later, web directories such as Yahoo and AltaVista appeared, which kept a hierarchy of web pages based on their topics. In 1998, Google changed everything. Crawling, indexing, and searching paved the way for personalization and clickstream analysis, through which personalized recommendations based on social, geographical, and navigation data could be made. This personal flavor delivers higher value and makes customers more loyal and more profitable for businesses. Now the channels through which the voice of the customer can be collected have multiplied, leading to the big data explosion, and retailers have a huge opportunity to present customers with even more personalized promotions, deals, and recommendations.

Challenges
One of the significant challenges in architecting such a personalization system is the amount and diversity of data to be handled. For example, websites today generate user activity data that can easily run into terabytes in a matter of months. Equally problematic are the different formats and system interfaces. Once the data is loaded, the system applies correlation techniques to relate the data and draw inferences about the preferences of individual customers. Traditional relational data warehousing and OLAP-based systems struggle to process this massive amount of high-velocity data and provide insights. There is high latency between the user’s shopping activity and the generation of a recommendation, as well as limited granularity, which decreases the relevance of the recommendation to the end customer.

Solution and Architecture
The personalization system sources information about the customer’s profile, orders, preferences, and opinions from multiple sources – some within the enterprise and some outside. Examples of sources are the enterprise order management system (or channel-specific order management systems where applicable), customer browse analytics data from the website, and opinions and reviews from social networks. In addition to customer data, some master data, such as the sales catalog and marketing campaigns, is loaded.
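A minimal sketch of the sourcing step just described – folding records from several feeds into one profile per customer. The feeds and field names below are invented; a real system would pull them from the order management, web analytics, and social sources listed above.

# Sketch: merge records from several hypothetical feeds (orders, web analytics,
# social reviews) into one profile per customer id. Field names are invented.
from collections import defaultdict

orders       = [{"customer": "c1", "sku": "tv-55"}, {"customer": "c2", "sku": "xbox"}]
page_views   = [{"customer": "c1", "page": "/gaming"}, {"customer": "c1", "page": "/tv"}]
social_posts = [{"customer": "c2", "text": "I like Xbox"}]

profiles = defaultdict(lambda: {"purchases": [], "views": [], "mentions": []})
for o in orders:
    profiles[o["customer"]]["purchases"].append(o["sku"])
for v in page_views:
    profiles[v["customer"]]["views"].append(v["page"])
for s in social_posts:
    profiles[s["customer"]]["mentions"].append(s["text"])

print(dict(profiles))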
There are three basic levels of inference that can be made about what a consumer might be interested in:
1. Based on past actions and opinions.
2. Based on similar users (actual friends, as defined by the consumer in social networks, or statistically derived segments based on behavior).
3. Based on general public behavior.

[Figure: personalization architecture – transaction data from all channels, user profile data, marketing campaigns, master and channel-specific catalogs, reviews/ratings, web analytics, and social networks such as Twitter are mapped, transformed, and loaded into a distributed, unified file system with MapReduce processing and distributed cache management, which feeds navigation, product-context, and promotion services exposed through a REST API to the website, store POS, and phone/catalog sales channels.]

Inferences from the personalization system can be consumed by multiple scenarios and systems. A website can use them to show content that is more relevant to the customer, store systems can provide more relevant suggestions, and customer service representatives might know more about consumers than the consumers themselves! A personalization system has to process the data from the various sources mentioned above and produce personalized recommendations and business intelligence for individual customers. This is not a simple task: it involves processing terabytes or petabytes of data obtained from various sources, and traditional systems are not efficient or scalable enough to store and process that much data. Big data processing solutions are what such a system needs. A typical architecture for a big data processing system using the Apache Hadoop stack is shown below. In the context of extreme personalization, data extraction tools can load the data into a Hadoop system, and Mahout – a scalable machine learning library that comes with many algorithms for pattern mining, collaborative filtering, and recommendations – can be used on top of it. The latest trend is the evolution of real-time systems such as HStreaming and Twitter Storm, which perform the analysis almost instantly compared to Hadoop, which runs in batch mode.

[Figure: typical Hadoop-based big data processing stack – log/sensor information, user-generated social data, business transaction data, and other unstructured data are extracted with tools such as Flume, Sqoop, and Chukwa; orchestrated with ZooKeeper, Oozie, and Azkaban; processed with MapReduce, Mahout, Hive, and Pig; stored in HBase and HDFS; and delivered as recommendations, patterns, ad hoc query results, and analysis output.]

Business Value
Real-time personalized insights: By combining inputs from various channels (social, location, history, etc.) and analyzing them in real time, customers can be presented with almost instant recommendations. For example, if a customer tweets “I like Xbox,” the system can provide Xbox-related recommendations when she logs into the e-commerce site, show them as an ad on her social network profile, or even send an Xbox promotion coupon to her mobile phone if she is shopping in-store. This kind of highly personalized, instant recommendation is being experimented with and will become more prevalent going forward.

Personalized presentation filtering: One of the fundamental things that can be offered is the ability to present content tuned to the preferences of the consumer. This could be in terms of product types (Wii-related vs. Xbox-related, or full sleeve vs. half sleeve), brands (Sony vs. Samsung), prices (costliest vs. cheapest), or something else that we know about her. This can be provided as filtered navigation in a website or as a suggestive selling tip to a customer service representative while they are speaking to the consumer.
Context-specific and personalized external content aggregation: Presenting context-specific information that makes sense for the consumer is a key capability. A good example is the relevance of social context: if we are showing a product and can show the consumer that, of the 1,500 people who said they liked the product, 15 are friends (with the capability to know who those 15 are as well), the impact and relevance would be significant. This service is relevant only for electronic channels.

Personalized promotional content: Different consumers are attracted by different value propositions. Some like direct price cuts, some like more for the same money (note that these are not exactly the same), while others prefer earning more loyalty points. Showing the most appropriate promotion or offer based on their interests is another important capability that a personalization system can provide.

Big data processing solutions can process vast amounts of data (terabytes to petabytes) from sources such as browsing history, social network behavior, brand loyalty, and general public opinion about a product gathered from social networks. This kind of extremely useful, tailor-made information can only be obtained with big data processing solutions, and retailers must leverage them to make their business more appealing and personal to customers.

Airlines – Smart Pricing

Use Case Context
Air transportation is one of the toughest and most dynamic industries in the world. Constantly troubled by factors such as oil prices, thin profit margins, environmental concerns, and employee agitation, it is truly a world of survival of the fittest. Understanding end-consumer needs and discovering a price that is both competitive and attractive to the consumer is always a challenge. Airlines strive for key differentiators such as providing a personalized experience, enhancing the brand, and building predictive, intelligent systems for identifying profit and loss factors. In the airline industry, a few high-value customers can generate much more revenue than many low-value customers, which highlights the importance of CRM systems. Some of the questions airlines face in relation to CRM include:
• How are the customers segmented? What share of profits does each customer segment bring in?
• Some customers deserve greater attention than others. How can they be identified among the frequent flyers?
• What tactics should be adopted to acquire, convert, retain, and engage customers?

Personalization and fare processing involve large data sets. While current systems leverage conventional enterprise information, new data sources such as social media, web logs, call center logs, and competitor pricing would have to be considered to meet personalization requirements. The challenges here include highly scattered data sources, the huge effort required for data integration, and the high cost of data warehousing and storage solutions.

Scenario 1: Fare Processing – Context and Challenges
Airlines classify fares into the main classes of service (Economy, Business, First, etc.), and each of these is subdivided into booking classes. The number of booking classes depends on factors such as aircraft type, the sector of the flight, and the flight date. The objective is to have a greater level of control over the types of fares sold.
When making a fare change, the fare is re-priced, which means recalculating each fare, taking into consideration factors such as:
• Flown data / passenger booking information
• Forecasted data – load factor information
• Inventory
• Currency exchange rates at different points of sale

Information about these factors is distributed across various databases, so an attempt to process this data naturally runs into all the problems mentioned above. And these data sources may not be sufficient to discover a price aligned with the customer’s point of view.

Opportunity – Big Data for Fare Processing
Fare processing involves extracting and consolidating information from external sources such as agent systems and external pricing systems, and from internal systems such as revenue accounting, forecasting, inventory, and yield management. This data alone runs into tens of terabytes. Newer data sources, such as social media conversations about pricing decisions and competitors, and unstructured enterprise data such as customer service e-mail and call center logs, push the data volumes even further. In this context, high-cost data warehousing solutions are confronted with the following challenges:
• Expensive hardware and software: cost grows with data size.
• The analytics process needs to be 100% customizable and not be governed by the confines of SQL.
• The analytics process should not lengthen with the size of the data; terabytes should be processed in minutes rather than days or hours.
• Data loss is unacceptable; the solution requires high availability and failover.
• The learning curve should not be steep.

Scenario 2: Personalization for High-Value Customers – Context and Challenges
The airline industry is one where a few high-value customers are more significant than many low-value ones. Naturally, the winners are those who can do predictive analytics on the data and provide a personalized flying experience. A wealth of customer behavioral data can be gleaned from websites – route and class preferences, frequency of flying, baggage, dining information, and geo-location, to name a few. This type of analytics requires the accumulation of data from multiple sources, frequent processing of high-volume data, and flexibility and agility in the processing logic. These are all pain areas for traditional data warehouse solutions, and they are compounded as the data volume grows.

Opportunity – Holistic Data Analytics and Newer Data
Analyzing customer data from various systems, including loyalty, CRM, and sales and marketing, together with data from partner systems and social media data tied to the customer’s social profile, can help airlines build deeper personalization preferences and understand customers’ current social status and preferences.

Solution and Architecture
A big data architecture leveraging the Hadoop stack provides the ability to extract, aggregate, load, and process large volumes of data in a distributed manner, which in turn reduces the complexity and overall turnaround time of processing. The proposed solution is aimed primarily at addressing the following challenges:
• The ability to manage and pre-process large volumes of enterprise data, partner data, and log data, and to transform them into meaningful data for analytics (analysis and pre-processing for fare calculation and personalization services).
• The ability to run analytics on unstructured data from social and other sources to derive newer dimensions such as sentiment, buzzwords, and root causes from customer interactions and user-created content.

Data extracts from the various feeder systems are copied to the big data file system. The analytics platform then triggers the processing of this data while responding to information needs for various analytical scenarios.

[Figure: big data architecture for personalization and fare processing – fares and pricing data sources (reservation system, external pricing systems, the ATPCO consolidator, online agents, revenue accounting, GDS, price and yield management, fares database) and personalization data sources (CRM tool, departure control, ticketing, sales and marketing, loyalty program, social media) are gathered into an HDFS cluster processed with MapReduce, Pig, Hive, and a job scheduler, feeding fare analytics and CRM analytics data marts used by pricing and CRM analysts.]
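As a purely illustrative sketch of the kind of re-pricing computation such a platform might run over the Scenario 1 factors (bookings, forecast load factor, remaining inventory, and the exchange rate at the point of sale), the fragment below adjusts a base fare; the formula and numbers are assumptions, not anything prescribed by the paper.

# Illustrative fare re-pricing from the Scenario 1 inputs. The adjustment
# formula and all thresholds are invented for demonstration only.
def reprice(base_fare, load_factor, seats_left, fx_rate):
    demand_adj = 1.0 + 0.5 * max(load_factor - 0.8, 0.0)   # surcharge when forecast load is high
    scarcity_adj = 1.2 if seats_left < 10 else 1.0          # few seats left in the booking class
    return round(base_fare * demand_adj * scarcity_adj * fx_rate, 2)

# One booking class on one sector, priced for a point of sale with FX rate 1.1
print(reprice(base_fare=200.0, load_factor=0.92, seats_left=6, fx_rate=1.1))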
Proposed solution is aimed at primarily addressing the following challenges: • Ability to manage and pre-process large volume of enterprise data, partners data and logs data and transform these data into meaningful data for analytics (Analysis, Preprocessing for Fare calculation and personalization services) • Run analytics on un-structured data from social and other sources to derive newer dimension such as sentiment, buzz-words, root causes from customer interactions and user created content Data extracts from various feeder systems are copied to the Big Data file system. The Analytics platform will trigger the processing of this data, while responding to CRM Analyst Pricing Analyst Fares & Pricing data sources Personalization data sources Revenue Accounting Personalization CRM Tool Departure Control Ticketing Sales& Marketing Loyalty Program Social Media Analytics Fare Analytics CRM Datamart Fare Datamart RES External pricing Systems Consolidator (ATPCO) Online Agents Revenue Accounting system GDS HDFS Cluster Pig Price & Yield Mgmt. Hive Job Scheduler Map reduce Big Data Ecosystem Reservation system Fares DB Data Gathering Data Gathering Big Data Architecture Bi Personalization & Fare Processing 11 information needs for various analytical scenarios. The Analytics platforms contain the algorithms tailored to the business process, to extract/mine meaningful information out of the raw data. Business Value By switching over to a Big Data based solution, it is estimated that the fare processing times can be cut down from days to a matter of hours or even minutes. The airline industry operates in a high risk domain vulnerable to a large number of factors, and this agility will be a vital tool in maximizing revenues. Big data architecture solutions for personalization helps in understanding and deriving granular personalization parameters and understand customers social status and needs based on their interaction in social media systems. The potential benefits for an airline could include:• Increased revenue & profit ○○ better fare management ○○ efficient mechanisms to handle fare changes • Faster decision making and reduced time-to-market for key fares • Introduction of competitive fares & reactive fare response • Increased efficiencies to the airline in managing fare strategy for different sales channels • Deeper understanding of customers and personalization (N=1) Auto – Warranty & Insurance Efficiency Use case context Two of the biggest threats to the auto insurance companies are, economic downturn, which is making money unavailable, and other one is, insurance frauds. An insurance claim starts with the process of applying for a policy and setting up the right premium and then finishes with identifying the valid claims and minimize the frauds. The investigation done during the premium setting can help in minimizing the losses which can occur during the claim settlement to a great extent. These days there are lot of tools available in the market to help the companies mine the data and automate this whole process. These tools not only streamline the process but helps by providing suggestion in various steps. However, the information availability to the companies is huge and growing everyday in leaps and bounds. With the recent explosion of the information from both external channels such as the social media sites and internal channels such as BPOs, Collaboration mediums etc, it has become really challenging to mine such huge data to get the meaningful insight. 
Key challenges in a typical auto insurance workflow include:
• Verifying the data collected from the customer
• Profiling customer behavior and social networking influence
• Identifying the right insurance premium amount
• Verifying the claims raised
• Detecting fraud by analyzing data from disparate systems
• Reimbursing claims accurately

Solution and Architecture
The big data approach brings a new dimension to solving the challenges mentioned above. With big data, each stage can be augmented by verifying the data collected from multiple internal and external sources and running artificial intelligence across all of that information to identify patterns of fraud or other possible compromise. With big data technologies in place, solutions for two core challenges integral to these stages become practical and affordable:
1. The ability to store and crunch any volume of data in a cost-effective way.
2. The ability to statistically model a rare event like fraud, which needs a sample size close to the entire population in order to capture and predict the right signatures of such rare events.

The potential data sources that can be leveraged are:
• Internal systems:
  ○ Collect data about the customer from the internal CRM system.
  ○ Check the other products sold to the customer from in-house ERP and CRM systems.
  ○ Check the credit history and, based on that, provide the necessary quotes from other internal systems.
• External systems:
  ○ Collect data from social networking sites about the customer’s behavior – profile customers on their behavior, sentiments, social net worth, usage patterns, and clickstream analysis of their likes, dislikes, and spending patterns.
  ○ Check the reviews and ratings of the product for which insurance is required. For example, if insurance is required for a particular brand of car, reviews of that car can be mined and, using sentiment analysis, cars can be put into different categories: cars with better reviews and ratings can be offered a lower premium, and cars with bad reviews that break down frequently can be offered a higher premium.
  ○ Check the social network of the user. This can help in identifying the status of the customer and even in identifying potential customers.
  ○ Insurance companies can even make use of data such as the customer’s interest in car racing, or a network that is quite active in rally racing.
  ○ Other data obtainable from external systems includes information for checking the creditworthiness of the customer from third-party rating sites.

The solution architecture breaks the entire landscape into acquisition, integration, and information delivery/insights layers. These three layers provide seamless integration and flexible options to plug in additional sources and delivery channels while hiding the underlying data complexity in each layer. The core of managing such an architecture is handling the large volume of information and the varied formats of data flowing into the system; hence the solution is centered on big data and data virtualization techniques to manage this burst of information.
Acquisition layer – Once the sources of both structured and unstructured data are identified, externally and internally, the right set of adapters and crawlers is required to extract the big data from public and private locations outside the boundaries of the organization. This is the most critical aspect, and the right set of business rules about what needs to be pulled requires upfront preparation. The idea is to start with a smaller number of sources that are most likely to hold data relevant to the desired outcome, and later extend the strategy to other sources. In parallel, data integration and data virtualization technologies are leveraged to integrate the relevant data sources that can provide the auto insurance policy, customer details, product categories and characteristics, and claims history. Tools such as Informatica, IBM DataStage, and ODI can be used for data integration, and Denodo, Composite, and similar tools for data virtualization. A few parameters need to be factored in for better performance:
• The frequency of data capture from transactional systems and external sources
• Delta feeds, trickle feeds, and staging strategies, which can play an important role
• Effective business-rule and technical-rule filters to reject unwarranted data sets

Integration – This is where, after extraction and filtering, the information gets consolidated under a common data model. Integration requires a strong set of mapping rules to map both structured and unstructured data (the latter transformed into a structured format). The data model should support integrating the following information:
• Auto insurance policy details by vehicle type
• Claim history by vehicle type and customer
• Customer feedback on product features – sentiments and opinions
• Customer behavior pattern analysis – for example, a liking for red vehicles or sports vehicles
• Customer social network analysis – influence, near-neighbor financial capacity, etc.

[Figure: layered solution architecture – an acquisition layer of crawlers and adapters (e.g. Scribe, Flume, Sqoop) and data integration/virtualization tools (e.g. Informatica, Composite, Denodo) pulls structured content (CRM, ERP, insurance and warranty applications, data files/XLS) and unstructured content (Facebook, Twitter, blogs, mail); an integration layer built on big data technologies such as Hadoop consolidates product feedback, customer behavior and sentiment profiles, and insurance and warranty claim patterns into an integrated analytical store; and a delivery layer serves business process services (repair warranty identification, claim submission, adjudication and authorization, return material credits, supplier warranty and recovery) with claims visibility by product and customer, premium auto-suggestion by product category, and a holistic view of claim patterns and customer behavior.]

Delivery – The final stage delivers the insights generated from the integrated information as dashboards, scorecards, charts, and reports, with the flexibility for business analysts to explore the details and correlate information sets when making decisions on setting insurance premiums by vehicle and customer type. The advanced analytics techniques used to deliver this information will also help uncover claim fraud and unregulated auto insurance policies and claims. The delivery channels can be desktops via portals, mobile devices, internet-based application portals, and so on.
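As a toy version of the delivery-layer idea of suggesting premiums by vehicle and customer type, the sketch below maps a vehicle model's claim history and an upstream review-sentiment score to a premium tier. The thresholds, tier names, and inputs are invented for illustration.

# Sketch: suggest a premium tier from claim history plus a review-sentiment
# score produced upstream. All thresholds and labels are invented.
def premium_tier(claims_per_100_vehicles, review_sentiment):
    """review_sentiment in [-1, 1]; higher means better reviews."""
    risk = claims_per_100_vehicles - 10.0 * review_sentiment
    if risk < 5:
        return "low premium"
    if risk < 15:
        return "standard premium"
    return "high premium"

print(premium_tier(claims_per_100_vehicles=3, review_sentiment=0.6))    # well-reviewed, few claims
print(premium_tier(claims_per_100_vehicles=18, review_sentiment=-0.4))  # poorly reviewed, many claims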
Business Value
With big data technologies in place, solutions for the two core challenges integral to the problem statement become practical and affordable:
1. The ability to store and crunch any volume of data in a cost-effective way.
2. The ability to statistically model a rare event like fraud, which needs a sample size close to the entire population in order to capture and predict the right signatures of such rare events.

The immediate business value this solution can bring can be categorized as:
• The ability to set the right price on an insurance premium, with a holistic view and better insight
• Better claims visibility and attribution to product insurance premiums and warranties
• Fewer fraud cases and losses

While this solution addresses the specific use case of an auto insurance premium advisor, the natural extension of the solution and framework is applicable to manufacturing parts warranties, other insurance premiums and claims, and fraud detection processes in domains such as manufacturing, general insurance, and retail.

Financial Services – Fraud Detection

Use Case Context
Fraud detection use cases are diverse, and their complexity can be characterized by whether they are real-time or batch, use unstructured or only structured data, and use a rules engine or derive a pattern (where the rule is not known, but one is looking for a sequence of seemingly unrelated events that may be interpreted as fraud). Here we focus on one such use case and detail the applicable solution and its architecture.

Consider an online account opening form, which most financial services firms provide as an online service. The first step is to validate the identity of the user as entered in the form. Such forms can be misused by attackers to extract a valid combination of name, address, email ID, SSN, and so on using brute force. Most of these applications send a request to a web service to check the validity of the prospective client – and usually this web service is provided by an external vendor specializing in credit ratings. The requirement for the credit rating agency is to intercept calls made to the web service, look at the patterns of those calls, and identify any fraudulent behavior. At the web-service end, the characteristics of the challenge can be stated as follows:
• The evaluation of an individual request has to be done within seconds (depending on the SLA).
• Fraud evaluation requires historical data related to each request.
• Evaluation is based on a set of predefined rules, which may consider all the requests sent in a predefined past time window.

Solution and Architecture
Based on the definition of the problem and its characteristics, the technical requirements are the following:

[Figure: messages from the web service flow over a message bus into a distributed cache with counters and write-ahead logs spread across servers (node 1 through node n, each with RAM and disk), backed by a distributed columnar data store for archived data; a query processor either picks the current value of a predefined counter or converts a query into map-reduce functions before processing.]

1. The system should be able to ingest requests at the rate they arrive, which will vary over time.
2. The latency of the system has to be below the specified SLA limits, which may not allow the system to store the incoming data before the response is evaluated.
3. Evaluating a set of incoming requests needs a good amount of RAM to accommodate the data in memory, along with any derived structures created to store aggregates over history and the current stream.
4. With increased load, the system may need to use multiple nodes to ingest the data while keeping the counters valid across the nodes.
   a. The counters need to be stored in a shared memory pool across the nodes.
   b. The counters help reduce latency, since they are updated before the data is even written to disk (for history updates).
5. Distributing the system over multiple nodes provides parallelism and the ability to develop a fault-tolerant solution.
   a. The distributed nodes, as stated in the last point, can handle parallel writes across multiple nodes in a peer-to-peer architecture (not a master–slave architecture).
   b. With an increased number of nodes, the probability of a node failure increases, so historical data must be replicated across multiple nodes. Similarly, the data can be sharded across the nodes to help parallelize reads.

Assuming that the web service writes a message, with appropriate details, to the message queue for each request received, a typical architecture for such a solution is composed of the following components:
6. The acquisition layer
   a. This component reads messages from the message queue and distributes them to a set of worker processes, which continue with the rest of the acquisition process.
   b. Each worker process looks up the cache for an appropriate data structure based on the message details; if one is found, it updates the counter, otherwise it creates a new one.
7. The distributed cache – The role of the distributed cache is to act as the initial data store on which the analysis is done, helping to reduce the latency between a message’s arrival and its impact on the measurement. This needs:
   a. Initialization of the distributed cache at startup and also on a regular basis as data is flushed to the data disks.
   b. The ability to flush the data in the cache to the data disk when the cache size reaches a certain watermark.
   c. The ability to create a local structure on the node where the message is received and replicate it to the copies on other nodes.
   d. The ability to create and maintain a predefined set of replicas of the data structure across the nodes to support fault tolerance.
8. The storage/retrieval layer
   a. The ability to store serialized data structures for the related processing nodes, with adequate copies across multiple nodes to handle fault tolerance in the data storage layer.
   b. The ability to provide a secondary index on the data structures for alternate queries. The historical data stored will be time series in nature, and columnar distributed data stores are an appropriate way to handle it.
   c. The data can be sharded across data nodes to increase read responsiveness.

Business Value
The solution above provides the opportunity to:
• Reduce costs, with the ability to handle large volumes of varying load using commodity hardware.
• Meet risk requirements: this kind of latency would not be possible in a traditional RDBMS, where data would have to be stored and indexed before querying.
• Configure further alerts and event processing in the CEP engine to take appropriate action when a fraudulent request is detected.
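A single-node toy version of the counter idea above: track each requester's web-service calls in a sliding time window and flag bursts that look like brute-force probing. The window length and threshold are invented, and a plain dictionary stands in for the distributed cache described in the architecture.

# Toy sliding-window counter for the fraud-check web service. In the paper's
# architecture these counters live in a distributed cache; here a dict stands in.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # assumed evaluation window
MAX_REQUESTS = 5         # assumed per-requester threshold

recent = defaultdict(deque)   # requester id -> timestamps of recent requests

def is_suspicious(requester_id, now=None):
    now = time.time() if now is None else now
    q = recent[requester_id]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:   # drop requests outside the window
        q.popleft()
    return len(q) > MAX_REQUESTS

# Six rapid identity checks from the same source trip the flag on the sixth call.
print([is_suspicious("203.0.113.9", now=i) for i in range(6)])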
Energy – Tapping Intelligence in Smart Grids / Meters

Use Case Context
The two main issues that utility majors face across the world are environmental concerns and power delivery limitations and disturbances. To address these issues, and taking advantage of technology advances, electric power grids are being upgraded with smart meters installed at consumer sites and other grid sensors for efficient monitoring of the utility infrastructure. However, the true value of smart grids is unlocked only when this veritable explosion of data is ingested, processed, analyzed, and translated into meaningful decisions, such as forecasting electricity demand, responding to peak load events, and improving the sustainable use of energy by consumers. The major challenges utility providers face today are to:
• Curb inadequacies in generation, transmission, and distribution, and the inefficient use of electricity
• Reduce the aggregate technical and commercial (AT&C) losses that lead to substantial energy shortages
• Improve the quality of the power supply
• Increase revenue collection
• Provide adequate electricity to every household and improve consumer satisfaction

One solution for addressing these challenges is the implementation of smart grids with a big data analytics platform. A smart grid is a real-time, automated system for managing energy demand and response for optimal efficiency. In a smart grid environment, demand response (DR) optimization is a two-step process consisting of forecasting peak demand and selecting an effective response to it. Both tasks benefit greatly from accurate, real-time information on actual energy use and the supplementary factors that affect it. Analytical tools can process the consumption data coming in from an array of smart meters and provide intelligence that helps the utility plan better for capital expenditures. Hence, the software platform that collects, manages, and analyzes the information plays a vital role.

Solution and Architecture
The data sets (structured, unstructured, and semi-structured) coming out of smart meters are enormous, amounting to petabytes of data, and processing such data in a relational DBMS demands a large investment in cost and time. Big data technologies come in handy for storing such huge amounts of data. A Big Data Analytics Platform (BDAP) can help analyze these data sets and provide meaningful information that the utility can use for instantaneous as well as short- and long-term decisions; it is thus one solution to assuage the power industry’s pain points. Efficient information integration and data mining contribute to an architecture that can effectively address this need. The primary tasks are to:
• Ingest information coming from smart meters
• Detect critical anomalies
• Proceed with the non-critical task of annotating smart meter data with domain ontologies (the collective set of information models used by the electricity industry can be viewed as a federation of ontologies)
• Update the demand forecast using the latest information
• Respond to peak load or other detected events by interacting with the consumer

This entire process implicitly includes feedback, since any response taken will affect consumer energy usage, which is measured by subsequent readings of the smart meters.
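As a minimal sketch of the peak-demand-detection half of demand response, the fragment below compares each interval's aggregate meter reading against a rolling average and flags intervals that look like peak-load events; the readings and the threshold rule are invented for illustration.

# Flag peak-load intervals in a stream of aggregate smart-meter readings
# (kWh per interval). The readings and the 1.3x-of-rolling-average rule are
# invented for illustration.
from collections import deque

def peak_intervals(readings, window=4, factor=1.3):
    history = deque(maxlen=window)
    peaks = []
    for i, kwh in enumerate(readings):
        if len(history) == window and kwh > factor * (sum(history) / window):
            peaks.append(i)          # candidate interval for a demand-response action
        history.append(kwh)
    return peaks

hourly_load = [90, 95, 92, 94, 93, 150, 96, 91, 160, 94]
print(peak_intervals(hourly_load))   # indexes of intervals to respond to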
[Figure: smart grid big data analytics platform – smart meter information and electricity usage data from AMIs flow into a stream processing system that raises emergency notifications and critical responses based on a policy database; an evaluation and semantic information integration layer (with a privacy filter) enriches the data into an integrated database alongside a domain database; the analytics platform performs demand forecasting and predictive analytics, shares data through APIs and services, and integrates with CRM and billing systems, running on public and/or private cloud infrastructure.]

The technologies that enable these tasks include scalable stream processing systems, an evaluation layer, semantic information integration, and data mining systems. The scalable stream processing system is an open-architecture system that manages data from many different collection systems and provides secure, accurate, reliable data to a wide array of utility billing and analysis systems. It accepts meter readings streaming over the internet or other communication protocols and detects and reacts to emergency situations based on defined policies. The evaluation layer captures the raw events and result sets for predictive modeling and sends the information to the semantic information integration system. Semantic information integration plays a vital role by using a domain knowledge base to integrate diverse information, enhance the management of transmission and distribution grid capabilities, and improve operational efficiency across the utility value chain. The data mining systems use data-driven mining algorithms to identify patterns among a large class of information attributes in order to predict power usage and supply-demand mismatches. All of these tools run on scalable platforms that combine public and private cloud infrastructure and allow information sharing over web service APIs while enforcing data privacy rules. A mix of public and private clouds is necessary for data privacy, security, and reliability reasons: a core set of internal, regulated services may be hosted in the utility’s private cloud, while the public cloud is used for public-facing services and to off-load applications that exceed local computational capacity. For more accurate analytics and better demand forecasts, the data also needs to be integrated with the billing and CRM systems. Integrating billing and CRM systems inside the cloud may prove expensive, in which case it is better to keep the analytics outside the cloud.

Business Value
Smart metering with big data analytics provides an opportunity to focus on accounting and energy auditing, addressing the theft and billing problems that have vexed the industry. The business value that can be realized with big data analytics includes:
• Reducing AT&C (aggregate technical and commercial) losses: Enhanced analytics can be used to visualize where energy is being consumed and provide insight into how customers are using it. This helps identify peak load demand, thereby decreasing both the generation and the consumption of energy and reducing losses. Enhanced analytics also enables the provider to offer a fixed consumption schedule at a fixed price, reducing commercial losses.
• Analyzing consumer usage and behavior: Big data can be used for enhanced analytics that visualizes where energy is being consumed and provides insight into how customers are using it. This increases the efficiency of smart grid solutions, allowing utilities to provide smarter and cleaner energy to their customers at an economical rate. A significant amount of value is anticipated to reside in secondary consumer data: behavioral analytics of consumer usage data will have value to utilities, service providers, and vendors, in addition to the owners (consumers) of that data. Utilities and other energy service providers need this type of consumer data to effectively enlist support for future energy efficiency and demand response campaigns and programs that reward changes in energy consumption. CRM and analytics applications can deliver valuable information that lets utilities act as “trusted advisors” to consumers in reducing or shaping energy use.
This increases the efficiency of smart grid solutions, allowing utilities to provide smarter and cleaner energy to their customers at an economical rate. A significant amount of value is anticipated to reside in secondary consumer data: behavioral analytics of consumer usage data will have value to utilities, service providers and vendors, in addition to the owners (consumers) of that data. Utilities and other energy service providers need this type of consumer data to effectively enlist support for future energy efficiency and demand response campaigns and programs that reward changes in energy consumption. CRM and analytics applications can deliver valuable information that lets utilities act as "trusted advisors" to consumers in reducing or shaping energy use.
• Manage load congestion and shortfall: Analytical tools that process consumption data can help identify when demand is low or high. With this analysis the utility provider can decide when to begin shedding load or firing up peaker plants to avoid brownouts and blackouts. Long-term analysis of grids can provide more detailed information on seasonal and annual changes in both generation and demand, which can be used to model future demand and generation trends.
• In addition to the above benefits, smart grid implementation with Big Data analytics will play a key role in addressing global issues such as energy security and climate change.

Data Warehousing – Faster and Cost Effective

Use case context
The Enterprise Data Warehouse (EDW) is increasingly becoming the lifeline of enterprise business. Businesses routinely use the EDW environment to generate reports, gather intelligence about their business and derive strategies for the future. Database vendors such as Teradata, Oracle, IBM, Microsoft and many others have invested heavily in EDW and have robust products, and until now these products have served us well. All the data in the enterprise lands in the EDW environment in some form or other. The three characteristics of Big Data (volume, variety and velocity) pose a challenge not just to storing large data sets but to processing them and making them available for downstream consumption. As a result, the impact of Big Data on the EDW is huge. The current set of EDW products is based on an architecture that is difficult to scale: the volume of data overwhelms relational systems built around a controller (the DB engine), the controller becomes the bottleneck, and handling Big Data becomes expensive. In addition, semi-structured and unstructured data are not handled effectively by current EDW products.

Solution and Architecture
To address these challenges, we are now seeing a series of innovations in the Big Data space: parallel relational data warehouses, shared-nothing architectures, DW appliances and MPP (massively parallel processing) architectures. Some key products in this area are Hadoop/Hive, EMC Greenplum, Oracle Exadata, IBM Netezza and Vertica. Each of these solutions has scalability and cost-effectiveness at its core. A solution based on Hadoop and Hive provides a compelling alternative to the traditional EDW environment. Hadoop is a top-level Apache project created by Doug Cutting, who developed it while working at Yahoo; Hive is another open-source project and provides a data warehouse infrastructure built on top of Hadoop.
Some key benefits of using a combination of Hadoop and Hive include:
• Cost effectiveness: Both Hadoop and Hive are open source, so the initial software cost is zero. In addition, they are designed to run on commodity hardware, and the infrastructure cost is relatively low compared with conventional EDW hardware.
• Scalability: Hadoop can scale to thousands of nodes of commodity hardware.
• Strong ecosystem: Hadoop has become mainstream and enjoys massive support in the industry. The ecosystem is growing at a rapid pace while coexisting with the existing enterprise landscape.
To address the end-to-end needs of an Enterprise Data Warehouse, the following must be handled effectively:
• Data ingestion: Data needs to be ingested from a variety of sources into the Big Data environment (Hadoop plus Hive in this case). The data can be transactional data stored in an RDBMS or any other unstructured data that an enterprise might want to use in its EDW environment.

[Figure: Hadoop/Hive-based Enterprise Data Warehouse, showing data ingestion from source systems (ODS/data warehouse, online systems, web logs and clickstreams, social networks) into HDFS/Hive staging; data processing on Hadoop; and data publishing of processed data to data marts, reports, dashboards, low-latency and in-memory systems, mobile apps, ad-hoc data access, statistical/deep analysis and other outbound systems]

• Data processing: Once data is ingested into the platform, it needs to be processed to provide business value. Processing can take the form of aggregation, analytics or semantic analysis. It is worth noting that, unlike a conventional EDW platform, the Hadoop plus Hive environment is well suited to handling unstructured or semi-structured data; companies like Facebook, Google and Yahoo routinely process huge volumes of unstructured data and derive structured information from it. (A minimal illustration of this map/reduce-style processing appears at the end of this use case.)
• Data publishing: The processed data needs to be published to a variety of systems for end-user consumption, including BI solutions, dashboard applications and other outbound systems. As support for Hadoop continues to grow, many vendors are providing adapters to connect to Hadoop.

Business Value
The platform provides a compelling alternative to the conventional EDW, especially in the world of Big Data, and this kind of architecture is being evaluated by many of our clients. Development of accelerators can help package the above solution as a full-fledged platform and facilitate smoother adoption. The following are some key accelerators that an enterprise can look to develop in the short to medium term:
• Technical accelerators (Level 0): Big Data aggregator framework, parallel data ingestion framework, common data adapter
• Frameworks (Level 1): Analytics factory, semantic aggregation framework, matching engine, statistical analytics framework, clustering graphs
• Solutions (Level 2): Financial reconciliation, risk analysis, fraud management, retail use profiling
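The following is a minimal, single-machine sketch of the map/reduce pattern that Hadoop distributes across a cluster and that Hive generates from SQL-like queries. The record format and the aggregation (total sales per region) are illustrative assumptions, not part of the platform described above.

    from itertools import groupby
    from operator import itemgetter

    # Illustrative input: raw sales records as (region, amount) pairs.
    RECORDS = [("north", 120.0), ("south", 75.5), ("north", 30.0),
               ("east", 200.0), ("south", 24.5)]

    def map_phase(records):
        """Emit (key, value) pairs, as a Hadoop mapper would."""
        for region, amount in records:
            yield (region, amount)

    def shuffle(pairs):
        """Group values by key, as the framework's shuffle/sort step would."""
        for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
            yield key, [value for _, value in group]

    def reduce_phase(grouped):
        """Aggregate each key's values, as a Hadoop reducer would."""
        for key, values in grouped:
            yield key, sum(values)

    if __name__ == "__main__":
        # Roughly what a Hive query such as
        #   SELECT region, SUM(amount) FROM sales GROUP BY region;
        # would compile down to.
        for region, total in reduce_phase(shuffle(map_phase(RECORDS))):
            print(region, total)

On a real cluster the map and reduce functions run in parallel against HDFS blocks; this local version only illustrates the programming model.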
Doug Cutting
Co-founder, Apache Hadoop project; Chairman of The Apache Software Foundation; Architect at Cloudera

Doug, first Google, Yahoo and Facebook learned to manage Big Data, and now large enterprises have started to leverage Big Data. What are the most common use cases you are seeing in the context of the enterprise?
Most companies are motivated to start using Apache Hadoop by a specific task. They have an important data set that they cannot effectively process with other technologies. At companies with large websites, this "initial task" is often log analysis. For example, most websites are composed of many web servers, and a given user's requests may be logged on a number of these servers. Hadoop lets companies easily collate the logged requests from all servers to reconstruct each user's sessions. Such "sessionization" permits a company to see how its users actually move through its website and then optimize that site.
In other sectors, we have observed different initial tasks. Banks have a lot of data about their customers, bill payments, ATM transactions, deposits, etc. For example, banks can combine analysis of this data to better estimate creditworthiness. Improving the accuracy of this estimation directly increases a bank's profitability. Retailers have a lot of data about sales, inventory and shelf space that, when they can analyze it over multiple years, can help them optimize purchasing and pricing. The use cases vary by industry. Once companies have a Hadoop installation they tend to load data from more sources into it and find additional uses. The trends seem clear though: businesses continue to generate more data and Hadoop can help to harness it profitably.

What are the challenges enterprises are facing in the adoption of Big Data?
There's a big learning curve. It requires a different way of thinking about data processing than has been taught and practiced for the past few decades, so business and technical employees need to re-learn what's possible. IT organizations can also be reluctant to deploy these new technologies. They're often comfortable with the way they've been doing things and may resist requests to support new, unfamiliar systems like Hadoop. Often the initial installation starts as a proof of concept project implemented by a business group, and only after its utility to the company has been demonstrated is the IT organization brought in to help support production deployment.
Another challenge is simply that the technology stack is young. Tools and best practices have not yet been developed for many industry-specific vertical applications. The landscape is changing rapidly, but conventional enterprise technology has a multi-decade head start, so we'll be catching up for a while yet. Fortunately, there are lots of applications that don't require much specific business logic; many companies find they can start using Hadoop today and expand it to more applications as the technology continues to mature.

Is Hadoop the only credible technology solution for Big Data management? Are there any alternatives? And how does Hadoop fit into enterprise systems?
Hadoop is effectively the kernel of an operating system for Big Data. Nearly all the popular Big Data tools build on Hadoop in one way or another. I don't yet see any credible alternatives. The platform is architected so that if a strong alternative were to appear it should be possible to replace Hadoop. The stack is predominantly open source and there seems to be a strong preference for this approach. I don't believe that a core component that's not open source would gain much traction in this space, although I expect we'll start to see more proprietary applications on top, especially in vertical areas.

Hadoop started as a sub-project of a search engine and then became a main project. Now the Hadoop ecosystem has more than a dozen projects around it! How did this evolution happen in such a short span of time?
It's a testament to the utility of the technology and its open source development model. People find Hadoop useful from the start.
Then they want to enhance it, building new systems on top. Apache's community-based approach to software development lets users productively collaborate with other companies to build technologies they can all profitably share.

Doug, one last question: Hadoop creator, Chairman of The Apache Software Foundation and Architect at Cloudera – which role do you enjoy the most?
Hadoop is the product of a community. I contributed the name and parts of the software and am proud of these contributions. The Apache Software Foundation has been a wonderful home for my work over the past decade and I am pleased to be able to help sustain it. I enjoy working with the capable teams at Cloudera, bringing Hadoop to enterprises that would otherwise have taken much longer to adopt it. In the end, I still get most of my personal satisfaction from writing code, collaborating with developers from around the world to create useful software.

Making it Real – Key Challenges

Context
Digitization of business functions and consumer adoption of digital channels are producing a deluge of information: huge volumes of data generated at an increasing pace and in many forms and varieties. Big Data is disrupting value chains in several industries and offering significant business benefits to organizations that are able to exploit it. Data volumes are increasing while the costs of storage and processing are falling, and at the same time a whole new set of Big Data technologies, such as MapReduce and NoSQL solutions, has emerged. These technologies enable storage and processing of data at a higher order of magnitude and at much lower cost than was possible with traditional technologies.

Big Data Challenges
Enterprises face several challenges in capturing, processing and extracting value from Big Data. This section looks at some of the key challenges and the emerging solutions for them:
1. Protecting privacy
2. Integrating Big Data technologies into the enterprise landscape
3. Addressing increasing real-time needs alongside increasing data volumes and varieties
4. Leveraging cloud computing for Big Data storage and processing

Protecting Privacy
Data mining techniques provide the backbone for harnessing information quickly and efficiently from Big Data. However, this also creates the potential for extracting personal information in ways that compromise user privacy (see the sidebar on privacy violation scenarios). This section first describes principles that can be used to protect the privacy of end users at various stages of the data lifecycle, and then explores technical aspects of protecting privacy while processing Big Data.

Lifecycle of Data and Privacy
The Big Data lifecycle typically involves four stages: 1) collection, 2) storage, 3) processing to derive knowledge, and 4) usage of that knowledge. Privacy concerns can arise at all of these stages, and a combination of policy decisions and technical and legal mechanisms is used to address them. A brief description of some of the major principles for protecting the privacy of data over its lifecycle is given below.

Sidebar: Privacy Violation Scenarios
• Misusing user password and biometric databases (identity theft)
• Selling transaction databases or credit card databases for monetary gain
• Detecting the web access patterns of a particular user from a database of web accesses
• Identifying a person with a particular disease in a healthcare database
• Behavioral discovery of a user by correlating activities within a social networking site and outside it
• Setting up monitoring mechanisms to infer users' behavior patterns
• Disclosure of private and confidential information in public
• Mining the command history patterns of a user
• Deep packet inspection of network data to identify personal information such as passwords and credit card transactions
• Exporting sensitive data from a computer through malware, spyware, botnets, Trojan horses, rootkits, etc.

1. Data collection limitation: This principle limits unnecessary, excessive collection of personal data. Once the purpose for which data is collected is known, the data collected should be just sufficient for that purpose. This principle is clearly a policy decision on the part of the collector.
2. Usage limitation: While collecting sensitive or personal data, the collector needs to specify what the data will be used for and how, and limit its use for purposes other than the original one.
3. Security of data: It is an obligation of the data collector to keep the data safe once collected. Adequate security mechanisms should be in place to protect it from breaches.
4. Retention and destruction: There is a lifetime associated with the data; once its usage is over, the collected data needs to be destroyed safely. This prevents misuse and leakage into the wrong hands.
5. Transfer policy: The usage of data is often governed by laws prevalent at the place of collection and usage. If the data is moved outside that jurisdiction (which is common with the advent of cloud computing) to a place where those laws are not enforced, it carries the danger of misuse. Thus data is either not allowed to be transferred, or transferred only to places where comparable legal protection holds.
6. Accountability: When dealing with third-party data, the other party may require designation of a person who acts as the point of contact and takes responsibility for the safekeeping, processing and usage of the data.
Collection limitation is a policy decision on the part of the data collector; usage limitation, securing data, retention and destruction, and transfer policy can be addressed by technical means; and the last principle, accountability, is addressed by having a legal team sign a declaration. Using the collected data for analysis and deriving insight from it is an important technical step; the next section describes some privacy-preserving data mining techniques.

Privacy Preserving Data Mining Techniques
The objective of privacy-preserving data mining is to reveal interesting patterns without compromising user privacy. A variety of techniques are used depending on the type of database and the type of mining algorithm. The key techniques are the following (a minimal sketch of the randomization technique appears after this list):
1. Anonymization techniques: Replace sensitive attribute values with other values to prevent disclosure of private data. In some cases this simple replacement alone will not suffice and more sophisticated anonymization techniques must be employed: removing name and address is not enough, because quasi-identifiers such as age, sex and zip code, or direct identifiers such as social security numbers, can still be used to identify an individual or at least narrow the possibilities. The k-anonymity framework is one good example of this class.
To reduce the risk of identification, this technique requires that every tuple in the table be indistinguishably related to no fewer than k respondents. Other techniques in this category include the l-diversity and t-closeness models.
2. Generalization: Rare attribute values in data items are replaced with more generic terms. For example, suppose very few people in an employee database hold a Ph.D. A query that returns qualifications and age, correlated with the result of a query that lists salary and age, can reveal the identity of such a person. This can be prevented by replacing the qualification (Ph.D) with a more generic term such as "graduate degree", which makes it difficult to correlate and infer.
3. Randomization: Noise is added to the fields of records. This prevents retrieval of correct personal information while aggregate results are preserved. For example, the salary and age of individual employees are randomized, yet queries such as average age and average salary still return approximately correct replies. One advantage of randomization is that it can be applied to individual records (noise can be added at collection time) and does not require knowledge of other record values, so the method is well suited to data generated as a stream. This class includes additive randomization, multiplicative randomization and data swapping techniques.
4. Probabilistic or no results for queries: Query results that could compromise user privacy are modified so that, rather than giving exact results, they return either probabilistic answers or null results.
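As a minimal illustration of the additive randomization technique (item 3 above), the sketch below adds zero-mean noise to individual salaries so that single records no longer reveal true values while the average remains approximately unchanged. The field names and noise scale are illustrative assumptions.

    import random
    from statistics import mean

    def additive_randomize(values, noise_scale=5000.0, seed=42):
        """Perturb each value with zero-mean uniform noise.

        Individual records become unreliable for re-identification,
        but aggregates such as the mean are approximately preserved
        because the noise averages out over many records.
        """
        rng = random.Random(seed)
        return [v + rng.uniform(-noise_scale, noise_scale) for v in values]

    if __name__ == "__main__":
        salaries = [52000, 61000, 47000, 83000, 75000, 58000, 66000, 49000]
        noisy = additive_randomize(salaries)
        print("true mean:     ", round(mean(salaries), 2))
        print("perturbed mean:", round(mean(noisy), 2))  # close to the true mean
        print("first record, true vs perturbed:", salaries[0], round(noisy[0], 2))

In practice the noise distribution and scale are chosen to balance the privacy gained against the accuracy needed for aggregate analysis.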
Privacy Preserving Data Publishing
Once the data has been processed and insight gained from it, in some cases the knowledge is shared either publicly or with a limited audience. Precautions should also be taken to prevent misuse of such knowledge. One key question when sharing results is whether the raw data or just the inference is being shared. If raw data is shared, adequate care should be taken to mask the data so that the analysis is reproducible while individual identities remain hidden. To a major extent, publishing data is bounded by legal means. For a snapshot of privacy laws, see the sidebar on data privacy protection laws.

Sidebar: Data Privacy Protection Laws
• Payment Card Industry Data Security Standard (PCI DSS): Defined to protect financial transaction data from potential breaches
• Health Insurance Portability and Accountability Act (HIPAA): A US law that regulates the use and disclosure of Protected Health Information held by "covered entities"
• Sarbanes-Oxley Act (SOX): A US law setting requirements for public company boards and public accounting firms
• Personal Data Privacy and Security Act: Deals with the prevention and mitigation of identity theft
• US Privacy Act: Limits the collection and use of personal data by US federal agencies
• Data Protection Directive (95/46/EC): A European law regulating the processing of EU citizens' personal data and restricting its transfer outside the European Union
• UK Data Protection Act: Serves a purpose similar to the US Privacy Act

Integration of Big Data technologies into the enterprise landscape
Enterprise Data Warehousing (EDW) and Business Intelligence solutions form an integral part of business decision making in enterprises today. Large enterprises typically have one or more of these products already in use. Emerging Big Data technologies and solutions are largely complementary to some of these, and sometimes provide alternatives that address extreme volume, velocity or variety requirements at lower price points. A key challenge is determining where Big Data technologies fit in a typical enterprise and how they are used in conjunction with all the other products already in place. A graphical representation of typical BI/EDW capabilities and representative players is shown below. (Note: it is a representative list, not comprehensive, not indicative of any ranking, and there are vendors/solutions that cut across several functions.)

[Figure: Representative map of BI/EDW capabilities (Delivery & Visualization, ETL & Integration, Analytics, Storage & Mining, Data Quality & MDM, and Big Data solutions) with representative vendors and products such as Informatica, IBM Cognos, Business Objects, Talend, Tableau, JasperReports, QlikTech, SAS, SPSS, R, Matlab, Teradata, Oracle Exadata, IBM Netezza, Greenplum, Aster Data, Vertica, Hadoop distributions (Cloudera, MapR, Hortonworks), Cassandra, MongoDB, Flume, Sqoop, DataFlux, Trillium Software, Initiate Systems and Siperian]

The figure gives an idea of where Big Data solutions fit in the current enterprise context. Big Data technologies are today used primarily for storing and performing analytics on large amounts of data. Solutions like Hadoop and its associated frameworks, such as Pig and Hive, help distribute processing across a cluster of commodity hardware to perform analytic functions on data. Hadoop-based data stores as well as NoSQL data stores provide a low-cost and highly scalable infrastructure for storing large amounts of data.

Challenges with Leveraging Big Data Technologies in the Enterprise Landscape
Enterprises that want to adopt Big Data solutions have been facing a number of challenges in getting these tools to integrate with existing enterprise BI/EDW/storage solutions from vendors such as Teradata, Oracle, Informatica, Business Objects and SAS. Some of the challenges include:
Data capture and integration: There is a lack of proper ETL tooling that can load data from existing data sources into Big Data solutions like the Hadoop Distributed File System, Cassandra, MongoDB or GraphDB. Enterprises have large amounts of data stored in traditional data stores, including file systems, RDBMSs and DW systems. To derive any useful analytics from this information, it must first be loaded into the Big Data solutions, and most ETL systems do not support such bulk loading from traditional data sources into the options available in the Big Data space. There are several specialized open-source solutions, such as key-value stores (Cassandra, Redis, Riak, Couchbase), document stores (CouchDB, MongoDB, RavenDB), Bigtable-style column stores (HBase, Hypertable) and graph databases (Neo4j, GraphDB), but solutions for integrating them with enterprise data stores such as CRM and ERP systems are very limited.
Lack of data quality support: Traditional data quality solutions from vendors like IBM, DataFlux and Business Objects provide extensive capabilities for metadata management and for addressing data quality issues, but their integration with Big Data technologies is limited. This results in a lot of custom solutions in scenarios where Big Data technologies are used.
Richness of analytics/mining capabilities: The Big Data analytics solutions and frameworks available today, such as Apache Mahout, provide a limited number of algorithm implementations, and their usability is also limited compared with the features business analysts are used to in commercial solutions.
Limited data visualization and delivery capabilities: There is limited support for visualizing analysis results in existing Big Data solutions. A major requirement for business users is the ability to view analyzed data in a visually comprehensible manner. BI/DW reporting solutions let users generate visual charts and reports by connecting easily to traditional BI back ends, but support for Big Data solutions such as Hive, HBase and MongoDB in such popular reporting tools is limited at this point in time.
Limited integration with stream/event processing solutions: Several Big Data frameworks like Hadoop give good results for batch requirements but are not architected for real-time processing. Solutions such as CEP address real-time processing needs, but their integration with Big Data solutions is limited.
Limited integration with EDW/BI products: Traditional BI/EDW solutions provide advanced features like OLAP, enabling easy slicing and dicing of information and letting users define and analyze data through a user-friendly UI. This allows business analysts with limited technical expertise to use these solutions to address business requirements. The user-experience maturity of the Big Data solutions currently available is still at a very early stage, and a lot of work remains to make them more user-friendly.

Emerging Solutions and Where to Start
Most of the initial work in developing Big Data technologies that manage extreme data volumes came from internet giants like Google, Yahoo, Amazon and Facebook, who published or open sourced many of those solutions; companies like Cloudera, MapR, Hortonworks and DataStax now provide commercial support for them. Driven by the increasing adoption of these solutions, a number of established enterprise players with offerings in the BI/EDW/storage space have started offering or integrating Big Data solutions with their product stacks. In this section we look at some emerging solutions that overcome the challenges of integrating Big Data solutions with existing enterprise solutions.
Big Data integration support in ETL tools: Several ETL and BI tool vendors, such as Informatica and MicroStrategy, have started supporting Big Data solutions like Hadoop and Hive. A common integration requirement is extracting data from online RDBMS data stores into frameworks like Hadoop for further processing. A commonly used pattern is to first use the export capabilities of the data store, or an ETL tool like Informatica, to extract the data into flat files. In the next stage, frameworks like Pig are used to load the data into the Hadoop Distributed File System (HDFS) or into NoSQL data stores. After that, frameworks like MapReduce are used for processing and aggregation, and finally the result of the processing is loaded back into an RDBMS or a NoSQL data store. (A minimal sketch of this extract, process and load-back pattern follows below.) There are also emerging frameworks like Flume, Scribe and Chukwa that are designed to collect data reliably from multiple sources, aggregate it and load it into the Big Data stores.
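The following is a minimal, single-machine sketch of that extract, process and load-back pattern, with SQLite standing in for the source and target RDBMS, a CSV file standing in for the flat-file staging area, and a plain Python aggregation standing in for the Pig/MapReduce step. The table names and columns are illustrative assumptions.

    import csv
    import sqlite3
    from collections import defaultdict

    def extract_to_flat_file(conn, path):
        """Stage 1: export transactional rows from the source RDBMS into a flat file."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["customer", "amount"])
            for row in conn.execute("SELECT customer, amount FROM orders"):
                writer.writerow(row)

    def aggregate_flat_file(path):
        """Stage 2: process the staged file (stands in for the Pig/MapReduce step)."""
        totals = defaultdict(float)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["customer"]] += float(row["amount"])
        return totals

    def load_results(conn, totals):
        """Stage 3: load the aggregated results back into a relational table."""
        conn.execute("CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT, total REAL)")
        conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", totals.items())
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)",
                         [("acme", 10.0), ("globex", 4.5), ("acme", 7.25)])
        extract_to_flat_file(conn, "orders_staging.csv")
        totals = aggregate_flat_file("orders_staging.csv")
        load_results(conn, totals)
        print(list(conn.execute("SELECT * FROM customer_totals")))

In production each stage is replaced by the corresponding distributed component described above, but the overall flow of extract, stage, process and load back is the same.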
Big Data integration support with visualization tools: Big Data solutions are also used in scenarios such as analyzing sales, product and customer transaction information from existing data sources (RDBMSs, flat files) to generate aggregation and projection reports. A challenge here is integrating with an existing visualization tool to show the projection output through user-friendly charts. Reporting solutions such as Tableau, JasperReports and Pentaho now support connecting directly to different Big Data stores and generating charts and reports. Some challenges remain: responses from a Big Data store may be too slow for interactive analysis, or the available features may be limited. A commonly used solution pattern is to use frameworks like Hadoop for distributed processing and analytics and to store the results in a traditional RDBMS so that existing visualization tools can be used.
Big Data integration support for analytics tools: Big Data solutions enable processing and analytics at large scale in several scenarios. For example, large e-commerce websites with millions of customers are expected to provide a personalized browsing experience to each visitor. Because of the large amount of data involved, Big Data solutions are used to retrieve user transaction details from the web logs and identify user preferences; based on that, items are recommended to the user in near real time. The challenge is to integrate commercial analytics products with a Hadoop or NoSQL-based data store to perform the necessary analytics. A number of commercial BI offerings, such as the Teradata Aster MapReduce platform and IBM BigInsights, are integrating Big Data processing frameworks like MapReduce into their products. There are also initiatives in the open-source space, such as RHIPE, that aim to integrate well-known analytics packages like R with Big Data solutions like Hadoop, but these are still in their early stages.

Addressing increasing real-time needs with increasing data volumes and varieties
Another key challenge is the increasing need for real-time insights and for interventions based on those insights in real time. To control process efficiency and business effectiveness, it is becoming more and more imperative to make decisions "on the fly" based on "on the fly" data. Here are some scenarios with such needs:
Online commerce: Analyzing online customer behavior in real time and providing personalized recommendations based on identified customer preferences.
Location-based services: The location information of millions of mobile subscribers needs to be tracked continuously and combined with customer profile and preference information in real time to generate personalized offers related to local businesses.
Fraud detection: Financial services firms are developing applications that detect patterns indicative of fraud from previous transaction history and check for these patterns in real-time transaction data to prevent fraud as it happens.
Network monitoring and protection: Preventing malicious attacks requires continuous monitoring of application and network data in real time and reacting to suspicious activity.
Market data solutions: Financial services firms need to analyze external market data in real time and arrive at recommended financial transactions before the opportunity disappears.
Social media: Messaging based on Twitter, Facebook or smartphones can reach vast numbers of subscribers within seconds to market products or services. Analyzing this activity to identify key influencers and target them requires processing large volumes of data in real time.
All of the above scenarios pose significant challenges from the volume, variety and velocity perspectives.

Challenges with real-time needs
Some of the challenges and considerations are:
Capture and storage: The information sources are disparate: devices, sensors, social channels, live feeds. Most of the time the data arrives as messages streaming from these sources. The volume of information to be stored and analyzed can be huge; the variety of messages can be wide, with one message not necessarily in context with or related to the next; and the velocity of incoming messages can be as high as a million messages per second.
Processing and analytics: The processing of the data needs to happen in real time. Establishing patterns among messages involves complex computations such as detecting patterns across events (correlation), applying rules, filters, unions and joins, and triggering actions based on the presence or absence of events.
Result delivery: After processing, the information needs to be presented to the end user in real time in the form of appealing dashboards, KPIs, charts, reports and email, and intervention actions such as alerts must be delivered over the user's preferred channels (smartphone, tablet, or desktop web and thick clients).
Reliability and scalability: Systems that process such information need to be highly fault tolerant, since losing data by missing even a single message may at times be unaffordable. They also need to be scalable and elastic so they can easily handle increased processing demand.
Because of these challenges, performing analytics and data mining on Big Data in real time differs significantly from traditional BI: in real time it is not feasible to process the messages and derive insights using the conventional architecture of storing data and processing it in batch mode.

Emerging Solutions and Where to Start
To address the above challenges, enterprises need to include real-time data processing solutions in their enterprise Big Data strategy.
Store-and-analyze solutions: Data is mined and analyzed for historical patterns using a combination of emerging Big Data technologies, such as Hadoop with the MapReduce architecture pattern, and traditional BI solutions. Future trends and forecasts are determined using predictive analytics techniques. All of this is performed on data collected over a long period of time. Solutions like Hadoop help store and analyze large volumes of data but are not designed for real-time response.
Stream/event processing solutions: Processing streams of data with real-time response needs can be handled neither by a historical analytical DW nor by a Hadoop-based architecture, because of the challenges mentioned earlier. Such streams are therefore processed using stream-centric solutions such as Complex Event Processing (CEP). CEP is the continuous and incremental processing of event streams from multiple sources based on declarative query and pattern specifications, with near-zero latency.
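To illustrate the kind of incremental, pattern-based processing a CEP engine performs, here is a minimal single-threaded sketch that watches a stream of events and raises an alert when a simple pattern occurs (several failed logins from the same account within a short window). The event format, window length and threshold are illustrative assumptions, not features of any particular CEP product.

    from collections import defaultdict, deque

    # Illustrative pattern rule: alert if one account produces
    # 3 or more "login_failed" events within a 60-second window.
    WINDOW_SECONDS = 60
    THRESHOLD = 3

    class FailedLoginDetector:
        def __init__(self):
            self.recent = defaultdict(deque)   # account -> timestamps of recent failures

        def on_event(self, event):
            """Process one event incrementally; return an alert or None."""
            if event["type"] != "login_failed":
                return None
            window = self.recent[event["account"]]
            window.append(event["ts"])
            # Evict timestamps that have slid out of the window.
            while window and event["ts"] - window[0] > WINDOW_SECONDS:
                window.popleft()
            if len(window) >= THRESHOLD:
                return {"alert": "possible_brute_force",
                        "account": event["account"], "ts": event["ts"]}
            return None

    if __name__ == "__main__":
        detector = FailedLoginDetector()
        stream = [
            {"type": "login_failed", "account": "alice", "ts": 0},
            {"type": "login_ok",     "account": "bob",   "ts": 10},
            {"type": "login_failed", "account": "alice", "ts": 20},
            {"type": "login_failed", "account": "alice", "ts": 45},
            {"type": "login_failed", "account": "alice", "ts": 200},
        ]
        for event in stream:
            alert = detector.on_event(event)
            if alert:
                print(alert)

A production CEP engine expresses such rules declaratively, runs many of them in parallel against in-memory state, and combines them with enterprise reference data, but the sliding-window idea is the same.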
By combining these two techniques, store-and-analyze and stream processing, the requirements for processing and analyzing large amounts of data over a long period of time, while at the same time generating insights and forecasts and acting on them in real time, can be met. Historical patterns and forecasts from the first stage provide inputs to the second stage, which applies them in real time.

[Figure: Conceptual solution addressing volume and velocity requirements simultaneously, in which streaming data sources and external sources feed a streaming data processing solution that produces real-time insights, alerts and interventions, while Big Data storage (distributed file systems, key-value, document and column stores, queues) supports historical and predictive analytics whose forecasts feed back into real-time processing; results are surfaced through a visualization and data delivery layer and OLTP systems]

Stream/Event Processing Solution Architecture
A typical real-time streaming solution has three key components.
Data sources: Messages coming from diverse data sources such as devices, sensors, social channels and live feeds need to be captured. Most of the time, messages coming from the same source do not have a standard, consistent schema or structure. Input adapters capture and transfer these messages from the source systems.
Engine: The CEP engine processes the stream of messages in real time by applying mathematical algorithms, functions, filters and business rules and establishing patterns, while combining the results with insights or reference data from enterprise information stores. The engines generally use parallel processing, multithreading, very fast and large memory caches, and highly optimized algorithms to do all of this in flight.
Targets: Targets are responsible for presenting the processed information through different delivery channels. Output adapters connect to these varied presentation channels, and each presentation channel is responsible for rendering the information in the required form factor in a visually appealing manner.
StreamBase, TIBCO ActiveSpaces, IBM WebSphere Business Events, Sybase Aleri, ruleCore, ActiveInsight and Microsoft StreamInsight are some examples of commercial CEP solutions. BackType Storm, S4 and Esper are examples of open-source solutions.
As discussed, a significant number of business use cases cannot be addressed through a conventional enterprise Big Data BI architecture because of the unique challenges posed by velocity. To deal with this, a Complex Event Processing architecture is useful. However, CEP cannot be viewed in isolation from the enterprise Big Data BI architecture; it needs to be seamlessly integrated so that complex event processing is performed in the context of enterprise Big Data.

[Figure: Real-time CEP platform architecture, in which input adapters capture events from sources such as devices, sensors, web servers, and stock ticker and news feeds (e.g. Bloomberg.com, Reuters); the CEP engine applies rules, filters, unions, correlations and mining functions using high-speed in-memory caches, event stores and databases, and enterprise reference/OLTP data over a service bus; output adapters deliver results to targets such as KPI dashboards, SharePoint UIs, trading stations, pagers and monitoring devices]

Leveraging Cloud Computing for Big Data Storage and Processing
Cloud computing and Big Data are two key emerging disruptive technologies; they are accelerating business innovation and enabling new, disruptive solutions.
The adoption footprint of these two disruptive entrants has been a global phenomenon, cutting across multiple industry verticals and geographies, and the rate of adoption in today's enterprise landscape has been quite fast. Enterprises are seeking answers to key business imperatives through Big Data analysis: modeling true risk, customer churn analysis, flexible supply chains, loyalty pricing, recommendation engines, ad targeting, precision targeting, point-of-sale transaction analysis, threat analysis, trade surveillance, search quality optimization, and various blended mashups such as location, context, time, seasonal and behavioral ad targeting. To address these business requirements, a "Data Cloud" with an elastic and adaptive infrastructure, such as a public and private cloud platform for enterprise data warehousing and business intelligence functions, is being considered. Cloud computing and Big Data together allow analysts and decision makers to discover new insights for intelligence analysis, as demonstrated by Google, Yahoo and Amazon. By leveraging such "Data Cloud" platforms to store, search, mine and distribute massive amounts of data, businesses can get answers to their ad-hoc analysis questions faster and with more precision. The following table outlines a few important types of analytics that are performed on Big Data and, in many cases, can effectivel...

Explanation & Answer

Hi, I have finished working on your assignment, attached. Thank you.

A REPORT ON INTERNET OF THINGS AND BIG DATA

INTRODUCTION
Big Data and the Internet of Things (IoT) are closely linked and are often confused with each other or assumed to mean the same thing. The term Big Data has been around longer than the concept of the Internet of Things. Although the two are related and frequently used together, they have very different definitions. The growth and spread of the internet has enabled the close relationship between them, which is evident in the various ways discussed below.

SUMMARY OF BIG DATA
Big Data is characterized by variety, volume, velocity and veracity. It is a mixture of structured and unstructured information (variety), which can be of uncertain provenance (veracity), often arrives at real-time speed (velocity) and comes in large amounts (volume). These massively large data sets can be used by firms to look for patterns, trends and associations between one behavior and another. The data is generated from various sources, including sensors, digital images, videos, audio files, transactions, social media and clickstreams, in domains such as retail, healthcare, energy and utilities. The intention may be either to sell the data or to use it to develop better, enhanced products that fit the consumer. Organizations are finding it necessary to mine Big Data for these reasons in order to gain a competitive advantage over other organizations in the market. Big Data is difficult to capture, manage and analyze because its volume is too large for conventional relational databases and data warehousing technologies. Handling data of such volume, velocity and variety requires high-end hardware, and conventional techniques for Big Data storage and analysis are less efficient because data access is slower. Data collection is also challenging, as data of great volume and variety has to be gathered from many different types of sources.

Data sets grow rapidly because they are increasingly gathered by cheap and numerous information-sensing mobile devices, aerial sensors, software logs, cameras, microphon...

