ITS 836 UC Data Wrangling for Big Data Discussion

ITS 836

University of the Cumberlands


Question Description

I’m working on a Computer Science question and need guidance to help me study.


Please refer to the content listed below: "Data Wrangling - Big data.pdf". Do you agree with the conclusion in the article that "Data wrangling is a problem and an opportunity"? Please present your analysis.

1. One main post and 2 response posts are required. The main post carries 60% of the points and each response post carries 20% of the points.

2. Please use additional references as per the need.

3. Please follow APA guidelines.

4. Please do not plagiarize. Do not cut and paste from sources. You should cite them (some of the posts had cut & paste).

Data Wrangling - Big data.pdf

5. Please do not get confused between the discussion post and assignment. Please read the question before attempting to answer.

Please give me the answer as:

(1) Main post: 250-300 words (answer to this question), with a minimum of 2 APA references

(2) Two responses (additional text): two paragraphs of 150 words each (relating to this question topic)

With APA references



Unformatted Attachment Preview

Data Wrangling for Big Data: Challenges and Opportunities

Tim Furche, Dept. of Computer Science, Oxford University, Oxford OX1 3QD, UK
Georg Gottlob, Dept. of Computer Science, Oxford University, Oxford OX1 3QD, UK
Leonid Libkin, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
Giorgio Orsi, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
Norman W. Paton, School of Computer Science, University of Manchester, Manchester M13 9PL, UK

© 2016, Copyright is with the authors. Published in Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT
Data wrangling is the process by which the data required by an application is identified, extracted, cleaned and integrated, to yield a data set that is suitable for exploration and analysis. Although there are widely used Extract, Transform and Load (ETL) techniques and platforms, they often require manual work from technical and domain experts at different stages of the process. When confronted with the 4 V’s of big data (volume, velocity, variety and veracity), manual intervention may make ETL prohibitively expensive. This paper argues that providing cost-effective, highly-automated approaches to data wrangling involves significant research challenges, requiring fundamental changes to established areas such as data extraction, integration and cleaning, and to the ways in which these areas are brought together. Specifically, the paper discusses the importance of comprehensive support for context awareness within data wrangling, and the need for adaptive, pay-as-you-go solutions that automatically tune the wrangling process to the requirements and resources of the specific application.

1. INTRODUCTION
Data wrangling has been recognised as a recurring feature of big data life cycles. Data wrangling has been defined as:

    a process of iterative data exploration and transformation that enables analysis. ([21])

In some cases, definitions capture the assumption that there is significant manual effort in the process:

    the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. ([35])

The general requirement to reorganise data for analysis is nothing new, with both database vendors and data integration companies providing Extract, Transform and Load (ETL) products [34]. ETL platforms typically provide components for wrapping data sources, transforming and combining data from different sources, and for loading the resulting data into data warehouses, along with some means of orchestrating the components, such as a workflow language. Such platforms are clearly useful, but in being developed principally for enterprise settings, they tend to limit their scope to supporting the specification of wrangling workflows by expert developers.

Does big data make a difference to what is needed for ETL? Although there are many different flavors of big data applications, the 4 V’s of big data refer to some recurring characteristics: Volume represents scale either in terms of the size or number of data sources; Velocity represents either data arrival rates or the rate at which sources or their contents may change; Variety captures the diversity of sources of data, including sensors, databases, files and the deep web; and Veracity represents the uncertainty that is inevitable in such a complex environment. When all 4 V’s are present, the use of ETL processes involving manual intervention at some stage may lead to the sacrifice of one or more of the V’s to comply with resource and budget constraints. Currently,

    data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data. ([24])

and only a fraction of an expert’s time may be dedicated to value-added exploration and analysis.

In addition to the technical case for research in data wrangling, there is also a significant business case; for example, vendor revenue from big data hardware, software and services was valued at $13B in 2013, with an annual growth rate of 60%. However, just as significant is the nature of the associated activities. The UK Government’s Information Economy Strategy states:

    the overwhelming majority of information economy businesses – 95% of the 120,000 enterprises in the sector – employ fewer than 10 people. ([14])

As such, many of the organisations that stand to benefit from big data will not be able to devote substantial resources to value-added data analyses unless massive automation of wrangling processes is achieved, e.g., by limiting manual intervention to high-level feedback and to the specification of exceptions.

Example 1 (e-Commerce Price Intelligence). When running an e-Commerce site, it is necessary to understand pricing trends among competitors. This may involve getting to grips with: Volume – thousands of sites; Velocity – sites, site descriptions and contents that are continually changing; Variety – in format, content, targeted community, etc; and Veracity – unavailability, inconsistent descriptions, unavailable offers, etc. Manual data wrangling is likely to be expensive, partial, unreliable and poorly targeted. As a result, there is a need for research into how to make data wrangling more cost effective.
The contribution of this vision paper is to characterise research challenges emerging from data wrangling for the 4Vs (Section 2), to identify what existing work seems to be relevant and where it needs to be further developed (Section 3), and to provide a vision for a new research direction that is a prerequisite for widespread cost-effective exploitation of big data (Section 4).

2. DATA WRANGLING – RESEARCH CHALLENGES
As discussed in the introduction, there is a need for cost-effective data wrangling; the 4 V’s of big data are likely to lead to the manual production of a comprehensive data wrangling process being prohibitively expensive for many users. In practice this means that data wrangling for big data involves: (i) making compromises – as the perfect solution is not likely to be achievable, it is necessary to understand and capture the priorities of the users and to use these to target resources in a cost-effective manner; (ii) extending boundaries – as relevant data may be spread across many organisations and of many types; (iii) making use of all the available information – applications differ not only in the nature of the relevant data sources, but also in existing resources that could inform the wrangling process, and full use needs to be made of existing evidence; and (iv) adopting an incremental, pay-as-you-go approach – users need to be able to contribute effort to the wrangling process in whatever form they choose and at whatever moment they choose. The remainder of this section expands on these features, pointing out the challenges that they present to researchers.

2.1 Making Compromises
Faced with an application exhibiting the 4 V’s of big data, data scientists may feel overwhelmed by the scale and difficulty of the wrangling task. It will often be impossible to produce a comprehensive solution, so one challenge is to make well informed compromises. The user context of an application specifies functional and non-functional requirements of the users, and the trade-offs between them.

Example 2 (e-Commerce User Contexts). In price intelligence, following on from Example 1, there may be different user contexts. For example, routine price comparison may be able to work with a subset of high quality sources, and thus the user may prefer features such as accuracy and timeliness to completeness. In contrast, where sales of a popular item have been falling, the associated issue investigation may require a more complete picture for the product in question, at the risk of presenting the user with more incorrect or out-of-date data.

Thus a single application may have different user contexts, and any approach to data wrangling that hard-wires a process for selecting and integrating data risks the production of data sets that are not always fit for purpose. Making well informed compromises involves: (i) capturing and making explicit the requirements and priorities of users; and (ii) enabling these requirements to permeate the wrangling process. There has been significant work on decision-support, for example in relation to multi-criteria decision making [37], that provides both languages for capturing requirements and algorithms for exploring the space of possible solutions in ways that take the requirements into account. For example, in the widely used Analytic Hierarchy Process [31], users compare criteria (such as timeliness or completeness) in terms of their relative importance, which can be taken into account when making decisions (such as which mappings to use in data integration). Although data management researchers have investigated techniques that apply specific user criteria to inform decisions (e.g. for selecting sources based on their anticipated financial value [16]) and have sometimes traded off alternative objectives (e.g. precision and recall for mapping selection and refinement [5]), such results have tended to address specific steps within wrangling in isolation, often leading to bespoke solutions. Together with high automation, adaptivity and multi-criteria optimisation are of paramount importance for cost-effective wrangling processes.

2.2 Extending the Boundaries
ETL processes traditionally operate on data lying within the boundaries of an organisation or across a network of partners. As soon as companies started to leverage big data and data science, it became clear that data outside the boundaries of the organisation represent both new business opportunities and a means to optimize existing business processes. Data wrangling solutions recently started to offer connectors to external data sources but these are, for now, mostly limited to open government data and established social networks (e.g., Twitter) via formalised APIs. This makes wrangling processes dependent on the availability of APIs from third parties, thus limiting the availability of data and the scope of the wrangling processes. Recent advances in web data extraction [19, 30] have shown that fully-automated, large scale collection of long-tail, business-related data, e.g., products, jobs or locations, is possible. The challenge for data wrangling processes is now to make proper use of this wealth of “wild” data by coordinating extraction, integration and cleaning processes.

Example 3 (Business Locations). Many social networks offer the ability for users to check-in to places, e.g., restaurants, offices, cinemas, via their mobile apps. This gives social networks the ability to maintain a database of businesses, their locations, and profiles of users interacting with them that is immensely valuable for advertising purposes. On the other hand, this way of acquiring data is prone to data quality problems, e.g., wrong geo-locations, misspelled or fantasy places. A popular way to address these problems is to acquire a curated database of geo-located business locations. This is usually expensive and does not always guarantee that the data is really clean, as its quality depends on the quality of the (usually unknown) data acquisition and curation process. Another way is to define a wrangling process that collects this information right on the website of the business of interest, e.g., by wrapping the target data source directly. The extraction process can in this case be “informed” by existing integrated data, e.g., the business url and a database of already known addresses, to identify previously unknown locations and correct erroneous ones.

2.3 Using All the Available Information
Cost-effective data wrangling will need to make extensive use of automation for the different steps in the wrangling process. Automated processes must take advantage of all available information both when generating proposals and for comparing alternative proposals in the light of the user context. The data context of an application consists of the sources that may provide data for wrangling, and other information that may inform the wrangling process.

Example 4 (e-Commerce Data Context). In price intelligence, following on from Example 1, the data context includes the catalogs of the many online retailers that sell overlapping sets of products to overlapping markets. However, there are additional data resources that can inform the process. For example, the e-Commerce company has a product catalog that can be considered as master data by the wrangling process; the company is interested in price comparison only for the products it sells. In addition, for this domain there are standard formats for describing products and offers, and there are ontologies that describe products, such as The Product Types Ontology.
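To make the idea of a data context in Example 4 more concrete, the following is a minimal sketch, not the authors' method, of how master data and a product-type vocabulary might jointly filter competitor offers. The names `Offer`, `MASTER_CATALOG`, `TYPE_SYNONYMS` and `relevant_offers` are hypothetical, and the small synonym table is only a stand-in for a real product ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    retailer: str
    title: str
    product_type: str
    price: float


# Hypothetical master data: the company's own catalog, mapping a
# normalised product title to its canonical product type.
MASTER_CATALOG = {
    "acme kettle 1.7l": "kettle",
    "acme toaster 2-slice": "toaster",
}

# Hypothetical stand-in for a product ontology: maps retailer-specific
# type labels to the canonical types used in the master catalog.
TYPE_SYNONYMS = {
    "electric kettle": "kettle",
    "kettle": "kettle",
    "toaster": "toaster",
}


def normalise(title: str) -> str:
    """Lower-case the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def relevant_offers(offers):
    """Keep offers for products the company sells whose type also agrees."""
    kept = []
    for offer in offers:
        expected_type = MASTER_CATALOG.get(normalise(offer.title))
        if expected_type is None:
            continue  # not in the master catalog, so not of interest
        canonical = TYPE_SYNONYMS.get(offer.product_type.lower())
        if canonical != expected_type:
            continue  # title matched, but the type evidence disagrees
        kept.append(offer)
    return kept
```

In this sketch the master catalog restricts wrangling to products the company sells, while the type vocabulary supplements purely syntactic title matching with a second, independent piece of evidence.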
Thus applications have different data contexts, which include not only the data that the application seeks to use, but also local and third party sources that provide additional information about the domain or the data therein. To be cost-effective, automated techniques must be able to bring together all the available information. For example, a product types ontology could be used to inform the selection of sources based on their relevance, as an input to the matching of sources that supplements syntactic matching, and as a guide to the fusion of property values from records that have been obtained from different sources. To do this, automated processes must make well founded decisions, integrating evidence of different types. In data management, there are results of relevance to data wrangling that assimilate evidence to reach decisions (e.g. [36]), but work to date tends to be focused on small numbers of types of evidence, and individual data management tasks. Cost effective data wrangling requires more pervasive approaches.

2.4 Adopting a Pay-as-you-go Approach
As discussed in Section 1, potential users of big data will not always have access to substantial budgets or teams of skilled data scientists to support manual data wrangling. As such, rather than depending upon a continuous labor-intensive wrangling effort, to enable resources to be deployed on data wrangling in a targeted and flexible way, we propose an incremental, pay-as-you-go approach, in which the “payment” can take different forms.

Providing a pay-as-you-go approach, with flexible kinds of payment, means automating all steps in the wrangling process, and allowing feedback in whatever form the user chooses. This requires a flexible architecture in which feedback is combined with other sources of evidence (see Section 2.3) to enable the best possible decisions to be made.
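One simple way to picture feedback being combined with other evidence is a per-source trust score that starts from prior evidence and is updated as users mark values correct or incorrect. The sketch below uses a Beta-distribution pseudo-count update; this is an illustrative assumption and not a technique the paper prescribes, and `SourceTrust` and `rank_sources` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SourceTrust:
    """Beta(alpha, beta) belief that a value from this source is correct.

    The initial alpha and beta encode prior evidence as pseudo-counts
    (for example, from an earlier automated quality check).
    """

    alpha: float = 1.0  # pseudo-count of values believed correct
    beta: float = 1.0   # pseudo-count of values believed incorrect

    def record_feedback(self, correct: bool) -> None:
        """Fold one user annotation into the belief."""
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Expected probability that a value from this source is correct."""
        return self.alpha / (self.alpha + self.beta)


def rank_sources(trust: dict[str, SourceTrust]) -> list[str]:
    """Order source names from most to least trusted."""
    return sorted(trust, key=lambda name: trust[name].mean, reverse=True)
```

Because the same score can drive both source selection and value fusion, a single piece of feedback informs more than one wrangling step, which is the kind of reuse the section argues for.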
Feedback of one type should be able to inform many different steps in the wrangling process – for example, the identification of several correct (or incorrect) results may inform both source selection and mapping generation. Although there has been significant work on incremental, pay-as-you-go approaches to data management, building on the dataspaces vision [18], typically this has used one or a few types of feedback to inform a single activity. As such, there is significant work to be done to provide a more integrated approach in which feedback can inform all steps of the wrangling process.

Example 5 (e-Commerce Pay-as-you-go). In Example 1, automated approaches to data wrangling can be used to select sources of product data, and to fuse the values from such sources to provide reports on the pricing of different products. These reports are studied by the data scientists of the e-Commerce company who are reviewing the pricing of competitors, who can annotate the data values in the report, for example, to identify which are correct or incorrect, along with their relevance to decision-making. Such feedback can trigger the data wrangling system to revise the way in which such reports are produced, for example by prioritising results from different data sources. The provision of domain-expert feedback from the data scientists is a form of payment, as staff effort is required to provide it. However, it should also be possible to use crowdsourcing, with direct financial payment of crowd workers, for example to identify duplicates, and thereby to refine the automatically generated rules that determine when two records represent the same real-world object [20]. It is of paramount importance that these feedback-induced “reactions” do not trigger a re-processing of all datasets involved in the computation but rather limit the processing to the strictly necessary data.

3. DATA WRANGLING – RELATED WORK
As discussed in Section 2, cost-effective data wrangling is expected to involve best-effort approaches, in which multiple sources of evidence are combined by automated techniques, the results of which can be refined following a pay-as-you-go approach. Space precludes a comprehensive review of potentially relevant results, so in this section we focus on three areas with overlapping requirements and approaches, pointing out existing results on which data wrangling can build, but also areas in which these results need to be extended.

3.1 Knowledge Base Construction
In knowledge base construction (KBC) the objective is to automatically create structured representations of data, typically using the web as a source of facts for inclusion in the knowledge base. Prominent examples include YAGO [33], Elementary [28] and Google’s Knowledge Vault [15], all of which combine candidate facts from web data sources to create or extend descriptions of entities. Such proposals are relevant to data wrangling, in providing large scale, automatically generated representations of structured data extracted from diverse sources, taking account of the associated uncertainties. These techniques have produced impressive results but they tend to have a single, implicit user context, with a focus on consolidating slowly-changing, common sense knowledge that leans heavily on the assumption that correct facts occur frequently (instance-based redundancy). For data wrangling, the need to support diverse user contexts and highly transient information (e.g., pricing) means that user requirements need to be made explicit and to inform decision-making throughout automated processes. In addition, the focus on fully automated KBC at web-scale, without systematic support for incremental improvement in a pay-as-you-go manner, tends to require expert input, for example through the writing of rules (e.g., [28]).
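The redundancy assumption can be illustrated with a deliberately naive fusion rule that picks whichever candidate value is reported most often. This is a simplified sketch for illustration only, not how YAGO, Elementary or Knowledge Vault actually work, and `fuse_by_redundancy` is a hypothetical name.

```python
from collections import Counter


def fuse_by_redundancy(candidates):
    """Return the most frequently reported value among the candidates."""
    if not candidates:
        raise ValueError("no candidate values to fuse")
    value, _count = Counter(candidates).most_common(1)[0]
    return value


# Slowly changing fact: most sources agree, so frequency works well.
birth_year = fuse_by_redundancy([1879, 1879, 1897, 1879])

# Transient fact: three pages still show an old price and only one shows
# the new one, so frequency selects the out-of-date value.
price = fuse_by_redundancy([19.99, 19.99, 19.99, 17.49])
```

For the birth year the majority value is a sensible choice, whereas for the price the same rule favours the stale figure, which is why transient data such as pricing needs evidence beyond how often a value occurs.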
As such, KBC proposals share requirements with data wrangling, but have different emphases.

3.2 Pay-as-you-go Data Management
Pay-as-you-go data management, as represented by the dataspaces vision [18], involves the combination of an automated bootstrapping phase, followed by incremental improvement. There have been numerous results on different aspects of pay-as-you-go data management, across several activities of relevance to data wrangling, such as data extraction (e.g., [12]), matching [26], mapping [5] and entity resolution [20]. We note that in these proposals a single type of feedback is used to support a single data management task. The opportunities presented by crowdsourcing have provided a recent boost to this area, in which, typically, paid micro-tasks are submitted to public cro ...

Final Answer



Data Wrangling for Big Data
Date of submission




The research article examines data wrangling in detail and sets out the opportunities and
challenges that it presents for big data. I agree with the authors' argument that making data
wrangling more automated and cost-effective involves several significant research problems
(Furche, Gottlob, Libkin, Orsi, & Paton, 2016). It is also noted that traditional, manual
approaches could make the process of Extract, Transform, and Load (ETL) quite expensive
(Rattenbury, Hellerstein, Heer, Kandel, & Carreras, 2017). By discussing these issues, the
paper aims to raise awareness of data wrangling within data management. The authors stress
that, in big data management, the 4 V's are of great significance. The 4 V's are volume,
variety, velocity, and veracity, and when an application is confronted with all of them, ETL
could be difficult without the use of data wrangling.
Data wrangling could be acquired manually from the 4v's to come up with the most
cost-effective prove,...
