Auburn University Main Campus Chapter 2 Data Preprocessing Discussion Questions

User

furyylznl

Subject

Computer Science

School

Auburn University Main Campus

Description

Textbook: Chapter 1- Introduction to Data Mining, 2nd Edition, by Pang-Ning Tan, Michael Steinbach, et al. (Jan 4, 2018; ISBN 978-0133128901). After completing the reading of chapters 2 & 3, answer the following questions:

Chapter 2:

What is an attribute and note the importance?
What are the different types of attributes?
What is the difference between discrete and continuous data?
Why is data quality important?
What occurs in data preprocessing?
In section 2.4, review the measures of similarity and dissimilarity, select one topic and note the key factors.

Chapter 3:

Note the basic concepts in data classification.
Discuss the general framework for classification.
What is a decision tree and decision tree modifier? Note the importance.
What is a hyper-parameter?
Note the pitfalls of model selection and evaluation.

POST 1

1. What is an attribute and note the importance?

An attribute is a property or characteristic of an object that can vary, either from one object to another or from one time to another. For example, the columns of a database table that hold a person’s name, zip code, region, or city name are attributes: each one records a value for a specific object, such as a person or a demographic record. Attributes are important because they hold the essential data about objects and are what any analysis or report in any field is built on (Tan et al., 2019).

2. What are the different types of attributes?

The two attribute categories are categorical (also known as qualitative) and numeric (also known as quantitative). Categorical attributes are either nominal or ordinal, whereas numeric attributes are either interval or ratio (Tan et al., 2019).
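One way to make the four types concrete is to record which operations are meaningful for each. The sketch below is illustrative only: the names `AttributeType`, `OPERATIONS`, `is_categorical`, and the example attributes are assumptions made for this example, not something taken from the posts.

```python
from enum import Enum


class AttributeType(Enum):
    NOMINAL = "nominal"    # names only: values can be tested for equality
    ORDINAL = "ordinal"    # values can also be ordered
    INTERVAL = "interval"  # differences between values are meaningful
    RATIO = "ratio"        # ratios are meaningful as well (true zero)


# Each type supports the operations of the types listed before it.
OPERATIONS = {
    AttributeType.NOMINAL: {"equality"},
    AttributeType.ORDINAL: {"equality", "order"},
    AttributeType.INTERVAL: {"equality", "order", "difference"},
    AttributeType.RATIO: {"equality", "order", "difference", "ratio"},
}


def is_categorical(attr_type):
    """Nominal and ordinal attributes are categorical (qualitative)."""
    return attr_type in (AttributeType.NOMINAL, AttributeType.ORDINAL)


# Hypothetical example attributes, chosen only for illustration.
examples = {
    "zip_code": AttributeType.NOMINAL,
    "grade": AttributeType.ORDINAL,
    "temperature_celsius": AttributeType.INTERVAL,
    "weight_kg": AttributeType.RATIO,
}
```

For instance, a ratio such as "twice as heavy" makes sense for `weight_kg`, but "twice as hot" does not for a Celsius temperature, which is why `"ratio"` is absent from the interval entry.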

3. What is the difference between discrete and continuous data?

A discrete attribute has a finite or countably infinite set of values. Examples of this type of data are counts and identification numbers. Discrete attributes are often represented using integer variables (Tan et al., 2019).

A continuous attribute is one whose values are real numbers. Examples include temperature, the weight of an object, and height. Continuous attributes are typically represented as floating-point variables (Tan et al., 2019).

4. Why is data quality important?

Data quality is crucial for any algorithm or analytics task to produce useful results. Without good data quality, the information presented is inaccurate and can be misunderstood by its audience. Data cleansing tasks are implemented in several ways, depending on the end goal of the analysis. Data mining work includes detecting quality problems in the data and correcting them as part of the data cleaning process. Raw data is often not consumable until it has been analyzed and processed as needed (Tan et al., 2019).

5. What occurs in data preprocessing?

Data preprocessing is performed on the data to make it consumable by the data mining process. It involves several techniques, such as aggregation, sampling, dimensionality reduction, feature selection and creation, discretization and binarization, and variable transformation. These techniques are applied to the data to extract and summarize the valuable content that downstream processes consume (Tan et al., 2019).
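A few of these techniques are small enough to sketch directly. The example below assumes plain Python lists as input and uses hypothetical helper names (`min_max_scale`, `equal_width_bins`, `one_hot`). It shows one simple form each of variable transformation, discretization, and binarization, and is not the only way to carry them out.

```python
def min_max_scale(values):
    """Variable transformation: rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Discretization: map each value to the index of an equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(labels):
    """Binarization: one 0/1 column per distinct category, in sorted order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


ages = [18, 25, 40, 62]                   # hypothetical sample data
scaled = min_max_scale(ages)
bins = equal_width_bins(ages, 2)
encoded = one_hot(["red", "blue", "red"])
```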

6. In section 2.4, review the measures of similarity and dissimilarity, select one topic and note the key factors.

Similarity is a numeric measure of the degree to which two objects are alike. The closer two objects are in their properties, the higher their similarity. Similarities are usually non-negative and often fall between 0 (no similarity) and 1 (complete similarity) (Tan et al., 2019).

Dissimilarity is a numeric measure of the degree to which two objects differ. The more alike two objects are in their properties, the lower their dissimilarity. Distance is often used as a synonym for dissimilarity, although strictly it refers to a particular class of dissimilarities. Dissimilarity values commonly range from 0 to infinity but sometimes fall between 0 and 1 (Tan et al., 2019).

Transformations are used to convert a dissimilarity into a similarity, or the other way round. They can also bring a proximity measure into a desired range, such as between 0 and 1, which is the range most often used for similarities (Tan et al., 2019).
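The transformations described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance as the dissimilarity and uses two common transformations, s = 1 / (1 + d) and min-max rescaling. The function names are hypothetical, and other transformations are equally valid.

```python
import math


def euclidean_distance(x, y):
    """Dissimilarity between two numeric points; ranges from 0 to infinity."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity in (0, 1]; d = 0 gives 1."""
    return 1.0 / (1.0 + d)


def rescale_to_unit_interval(d, d_min, d_max):
    """Min-max rescaling of a bounded dissimilarity to the range [0, 1]."""
    return (d - d_min) / (d_max - d_min)


p, q = (0.0, 0.0), (3.0, 4.0)
d = euclidean_distance(p, q)
s = distance_to_similarity(d)
```

Note that `distance_to_similarity` is monotonically decreasing, so objects that are further apart always receive a lower similarity, but the transformation is nonlinear and compresses large distances toward 0.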

Chapter 3

1. Note the basic concepts in data classification.

The data for a classification task consists of a set of records, or instances. Every record is denoted by (x, y), where ‘x’ is the set of attribute values and ‘y’ is the class label (Tan et al., 2019). The attributes in ‘x’ can be of any type, whereas the class label ‘y’ must be categorical. The attribute set ‘x’ is fed into the classification model to obtain the class label ‘y’ as output (Tan et al., 2019).

2. Discuss the general framework for classification.

Classification is the process of assigning labels to unlabeled data instances, and a classifier is the model used to perform it. A classifier is built from a set of instances called the training set (Tan et al., 2019). For every record, the training set contains the attribute values and the class label. A learning algorithm is used to construct a classification model from the training data, and this process is known as induction. Induction is also known as ‘building a model’ or ‘learning a model’ (Tan et al., 2019).
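The framework can be illustrated with a deliberately tiny learner. The sketch below is an assumption-laden toy, not the textbook's algorithm: `induce` learns, for a single categorical attribute, the most common class label per attribute value, and `deduce` applies the learned model to a new instance (a step often called deduction). The function names and the training records are hypothetical.

```python
from collections import Counter, defaultdict


def induce(training_set):
    """Induction: build a model from (attribute value, class label) records."""
    counts = defaultdict(Counter)
    for x, y in training_set:
        counts[x][y] += 1
    # Fall back to the overall majority class for attribute values never seen.
    default = Counter(y for _, y in training_set).most_common(1)[0][0]
    rules = {x: c.most_common(1)[0][0] for x, c in counts.items()}
    return rules, default


def deduce(model, x):
    """Apply the learned model to predict the label of an unlabeled instance."""
    rules, default = model
    return rules.get(x, default)


# Hypothetical training set: (body temperature, class label).
training_set = [
    ("warm", "mammal"),
    ("warm", "mammal"),
    ("warm", "non-mammal"),
    ("cold", "non-mammal"),
    ("cold", "non-mammal"),
]
model = induce(training_set)
```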

3. What is a decision tree and decision tree modifier? Note the importance.

A decision tree is a basic classification technique that assigns an object to a class by asking a series of questions about its attributes (Tan et al., 2019). A decision tree has three types of nodes: a root node, which has no incoming links and zero or more outgoing links; internal nodes, each of which has exactly one incoming link and two or more outgoing links; and leaf nodes, also known as terminal nodes, each of which has exactly one incoming link and no outgoing links (Tan et al., 2019).
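The node types map naturally onto a small data structure. The following sketch assumes a hand-built tree with hypothetical attribute names (`body_temperature`, `gives_birth`). It shows only how a finished tree classifies a record, not how a tree is learned from data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Node:
    label: Optional[str] = None                   # class label, set on leaves
    test: Optional[Callable[[dict], str]] = None  # question asked at this node
    children: Dict[str, "Node"] = field(default_factory=dict)

    def is_leaf(self):
        """A leaf (terminal) node has no outgoing links."""
        return not self.children


def classify(node, record):
    """Follow the answers from the root down to a leaf and return its label."""
    while not node.is_leaf():
        node = node.children[node.test(record)]
    return node.label


# Root node: no incoming link, two outgoing links.
root = Node(
    test=lambda r: r["body_temperature"],
    children={
        "cold": Node(label="non-mammal"),           # leaf node
        "warm": Node(                               # internal node
            test=lambda r: r["gives_birth"],
            children={
                "yes": Node(label="mammal"),        # leaf node
                "no": Node(label="non-mammal"),     # leaf node
            },
        ),
    },
)
```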

4. What is a hyper-parameter?

A hyper-parameter is a parameter of a learning algorithm that must be set before the classification model is learned (Tan et al., 2019). Hyper-parameters are often denoted by symbols such as alpha, and they do not appear in the final classification model that is used to classify unlabeled instances (Tan et al., 2019).
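As a concrete illustration, the number of neighbors k in a nearest-neighbor classifier is a hyper-parameter: it is fixed before any prediction is made and is usually chosen by comparing candidate values on a validation set. The sketch below assumes one numeric attribute, and the function names and data are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def error_rate(train, data, k):
    """Fraction of records in `data` that are misclassified for a given k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)


def select_k(train, validation, candidates):
    """Pick the candidate k with the lowest validation error."""
    return min(candidates, key=lambda k: error_rate(train, validation, k))


# Hypothetical data; the record (3, "b") acts as a noisy training point.
train = [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (7, "b"), (8, "b"), (9, "b")]
validation = [(2.9, "a"), (3.2, "a"), (8.5, "b")]
best_k = select_k(train, validation, candidates=[1, 3])
```

With k = 1 the noisy point misleads both validation records near 3, while k = 3 outvotes it, so the larger value is selected here.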

5. Note the pitfalls of model selection and evaluation.

Pitfalls in model selection and evaluation lead to conclusions that are incorrect and usually misleading. Some pitfalls are simple and can easily be avoided, whereas others are subtle and complex (Tan et al., 2019). Two pitfalls that commonly occur are:

Overlap between training and test sets (Tan et al., 2019).
Use of validation errors as generalization errors (Tan et al., 2019).
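Both pitfalls can be guarded against by keeping three disjoint partitions: one for training, one for model selection, and one held back for the final error estimate. The sketch below is one simple way to do this under assumed split fractions. `three_way_split` and `overlap` are hypothetical helper names, and records are assumed to be hashable.

```python
import random


def three_way_split(records, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint train, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


def overlap(a, b):
    """Records present in both sets; should be empty for a clean split."""
    return set(a) & set(b)


records = list(range(100))  # stand-in for 100 record identifiers
train, validation, test = three_way_split(records)

# Pitfall 1: the training and test sets must not share records.
assert not overlap(train, test)
# Pitfall 2: tune on `validation`, but report generalization error on `test`.
assert not overlap(validation, test)
```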

POST 2

1. What is an attribute and note the importance?

Attributes define the characteristics or features of a place, person, or thing. For example, the shape of a box, which can be rectangular or square, is one of its attributes. Their importance can be explained from a shopkeeper's perspective. A shopkeeper might weigh and measure the different items a customer asks about and, based on those attributes, decide which one to suggest. The importance of attributes lies in this comparison: the shopkeeper weighs all the attributes before suggesting any product or item to a buyer.

2. What are the different types of attributes?

There are five types of attributes: simple, composite, single-valued, multi-valued, and derived.

3. What is the difference between discrete and continuous data?

Discrete data take fixed, distinct values, for example, the number of children enrolling in swimming this summer or the number of sales a business makes each month. Continuous data, on the other hand, can take any value within a range, and the values can change over time, for example, the weight of a man.

4. Why is data quality important?

Data quality is important because it helps to build trust. It ensures that information is accurate and up to date, allows services to run efficiently, and helps people downstream to make sound decisions based on accurate data. High-quality data can also improve an organization's ability to deliver top services (Sidi et al., 2012).

5. What occurs in data pre-processing?

Data pre-processing is a key step that involves converting the data into a meaningful format. Data from the real world can be incomplete or inconsistent, and it might lack certain key trends. It might also contain errors that need to be cleaned up.

6. In section 2.4, review the similarity and dissimilarity measures, select one topic, and note the key factors.

Distance measures help to place similar data points into the same cluster and dissimilar data points into different ones. Their performance is mostly evaluated in two- or three-dimensional spaces. To overcome this limitation, a framework has been suggested to examine and standardize the effect of various similarity measures on distance-based clustering algorithms.

Chapter 3:

1. Note the basic concepts in data classification.

Classification is the task of determining which set (class) a new observation belongs to. It distinguishes and differentiates classes and concepts.

2. Discuss the general framework for classification.

A basic framework is a design that can later grow and expand into something useful. The framework can be treated as a set of formulas together with a description of how they connect.

3. What is a decision tree and decision tree modifier? Note the importance.

Decision trees use various algorithms to split a node into multiple nodes, which helps to improve the purity of the resulting nodes. The importance of a decision tree modifier can be seen in the example of Modifier -25, where the fundamental question is when to use the modifier and when not to. A decision tree walks the user through that choice step by step.

4. What is a hyper-parameter?

Parameters defined before the learning process starts are called hyper-parameters. They can be tuned; one example is the number of branches in a decision tree.

5. Note the pitfalls of model selection and evaluation.

This is the phase in which we examine our model: we analyze its performance and decide on steps to improve it. Evaluation can mark the difference between a model that performs well and one that performs very well. Through evaluation, we learn what the model predicts well and where it can do better. This helps us to improve the performance up to 90%.

Explanation & Answer:

11 Questions

Tags: model selection Data Quality Continuous data data preprocessing hyper parameter

Chapter 2 and 3 Discussion Questions
Student’s Name
Institutional Affiliation

Chapter 2:
1. What is an attribute and note the importance?
Generally, an attribute is a property or characteristic, e.g., colour. In database management, an attribute is a variable characteristic of a database’s component that can be set to different values. Attributes are significant because they ensure that the required data is collected.
2. What are the different types of attributes?
Attributes can be optional or required, single-valued or multi-valued, and simple or composite. An attribute can also be categorized as derived if its value is obtained from another attribute. Additionally, an attribute can be classified as key or non-key, depending on whether or not it identifies an entity.
3. What is the difference between discrete and continuous data?
Discrete data are data or information that can be assigned specific values. Continuous
data are those that can assume any value. Discrete data include shoe sizes a...