Computer Science Question

User Generated


Computer Science

Intro to Data Mining

Rivier College

Description

Term Paper Topic and References

In this assignment, submit your topic and preliminary references, in APA format, that you will use when completing your final research paper. Your submission should include the following elements:

  1. Provide the title of your term paper (note: you may change the wording of the official title in the final version; however, you cannot change the topic once you select one). The topic can be any topic relating to data mining.
  2. Include an introduction on the topic. This introduction should be one to two pages in length.
  3. A minimum of three (ideally five) references in proper APA format.

Your final research paper will be due in week 8.

Example:

Implementation of data mining in securing the cyber medium

Data mining is the process of recognizing patterns in massive datasets. Data mining methods are increasingly used in scientific research (in order to process large quantities of raw experimental data) as well as in marketing, frequently to extract statistics and other relevant information that improve customer relationships and marketing strategies (Ge, Song, Ding, & Huang, 2017). Data mining has also proven to be a valuable instrument in cybersecurity solutions for identifying vulnerabilities and choosing indicators for baselining. Traditional cybersecurity solutions are static and signature-based.

These common solutions, combined with the application of analytic standards, big data, and machine learning, can be enhanced to reduce false triggers or to present relevant information that helps manage or limit the impact of warnings. This kind of creative solution falls under the definition of data mining for cybersecurity (Ge, Song, Ding, & Huang, 2017). Data mining plays an important part in cybersecurity by employing the potential of data (and big data), high-performance computing, and machine learning to defend users against cyber-crime. For this purpose, a strong data mining project needs an efficient methodology that covers all effects and provides adequate support. This paper introduces successful data mining implementations and analyzes them in light of cybersecurity challenges (Groenhof et al., 2020). A comparative analysis is also given to describe each methodology's strengths and weaknesses in the context of cybersecurity plans. Beyond identifying malware code, data mining can be applied efficiently to discover intrusions and to interpret audit events in order to detect unusual patterns. Malicious interventions may target systems, servers, databases, web clients, and operating systems.
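The baselining idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not a method from the cited papers: event counts are baselined with a mean and standard deviation, and time slots whose count deviates strongly from the baseline are flagged. The data, threshold, and function name are all invented.

```python
from statistics import mean, stdev

def flag_anomalies(counts, threshold=2.5):
    """Flag positions whose z-score against the baseline exceeds the threshold."""
    mu, sigma = mean(counts), stdev(counts)
    return [i for i, c in enumerate(counts)
            if sigma > 0 and abs(c - mu) / sigma > threshold]

# Invented hourly login counts with one burst (e.g., a brute-force attempt).
hourly_logins = [12, 15, 11, 14, 13, 12, 16, 300, 14, 13]
print(flag_anomalies(hourly_logins))  # → [7]
```

A real deployment would baseline per host and per hour of day rather than over one flat series, but the principle of flagging deviations from a learned baseline is the same.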

To identify host-based attacks, one needs to examine characteristics derived from programs, while to identify network-based intrusions, one needs to analyze network transactions. And, much as with malware detection, one can look for either unusual behavior or evidence of misuse. Fraudulent activities can be identified with the help of supervised and unsupervised learning. With supervised learning, all available examples are labeled as either fraudulent or non-fraudulent (Ge, Song, Ding, & Huang, 2017). This labeled data is then used to train a model that identifies potential fraud. The principal shortcoming of this approach is its inability to discover new classes of attacks. Unsupervised learning methods help to recognize fraud and security problems in the data without requiring labeled examples.
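The supervised approach described above can be made concrete with a toy model. A nearest-centroid classifier stands in for a real fraud detector here; the transactions (features: amount, hour of day), the labels, and all function names are invented for illustration.

```python
# Minimal supervised sketch: labeled transactions train a model that scores new ones.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(labeled):
    """labeled: list of (features, label) pairs with label 'fraud' or 'ok'."""
    return {lab: centroid([f for f, l in labeled if l == lab])
            for lab in {'fraud', 'ok'}}

def predict(model, x):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda lab: dist(model[lab], x))

# Invented training data: (amount in dollars, hour of day), label.
data = [((9.5, 14), 'ok'), ((12.0, 10), 'ok'),
        ((980.0, 3), 'fraud'), ((1200.0, 4), 'fraud')]
model = train(data)
print(predict(model, (15.0, 12)))   # → ok
print(predict(model, (1050.0, 2)))  # → fraud
```

As the paragraph notes, such a model can only recognize the classes it was trained on; a genuinely new attack pattern would still be scored by proximity to the known classes, which is why unsupervised methods complement it.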

References

Ge, Z., Song, Z., Ding, S. X., & Huang, B. (2017). Data mining and analytics in the process industry: The role of machine learning. IEEE Access, 5, 20590-20616.

Groenhof, T., Koers, L., Blasse, E., de Groot, M., Grobbee, D., Bots, M., ... Hoefer, I. (2020). Data mining information from electronic health records produced high yield and accuracy for current smoking status. Journal of Clinical Epidemiology, 118(1), 100-106.

Unformatted Attachment Preview

Introduction to Data Mining
Instructor's Solution Manual
Pang-Ning Tan, Michael Steinbach, Vipin Kumar
© 2006 Pearson Addison-Wesley. All rights reserved.

Contents
1 Introduction
2 Data
3 Exploring Data
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
5 Classification: Alternative Techniques
6 Association Analysis: Basic Concepts and Algorithms
7 Association Analysis: Advanced Concepts
8 Cluster Analysis: Basic Concepts and Algorithms
9 Cluster Analysis: Additional Issues and Algorithms
10 Anomaly Detection

1 Introduction

1. Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the die is fair, this is a probability calculation. If the die were not fair, and we needed to estimate the probabilities of each outcome from the data, then this is more like the problems considered by data mining. However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus, we wouldn't consider it to be data mining.
(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modelling.
We could use regression for this modelling, although researchers in many fields have developed a wide variety of techniques for predicting time series.
(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and raise an alarm when an unusual heart behavior occurred. This would involve the area of data mining known as anomaly detection. This could also be considered as a classification problem if we had examples of both normal and abnormal heart behavior.
(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic wave behavior associated with earthquake activities and raise an alarm when one of these different types of seismic activity was observed. This is an example of the area of data mining known as classification.
(i) Extracting the frequencies of a sound wave.
No. This is signal processing.

2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining, and anomaly detection can be applied.
The following are examples of possible answers.
• Clustering can group results with a similar theme and present them to the user in a more concise form, e.g., by reporting the 10 most frequent words in the cluster.
• Classification can assign results to pre-defined categories such as "Sports," "Politics," etc.
• Sequential association analysis can detect that certain queries follow certain other queries with a high probability, allowing for more efficient caching.
• Anomaly detection techniques can discover unusual patterns of user traffic, e.g., that one subject has suddenly become much more popular. Advertising strategies could be adjusted to take advantage of such developments.

3.
For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data collected from 1900-1950. No
(b) IP addresses and visit times of Web users who visit your Website. Yes
(c) Images from Earth-orbiting satellites. No
(d) Names and addresses of people from the telephone book. No
(e) Names and email addresses collected from the Web. No

2 Data

1. In the initial example of Chapter 2, the statistician says, "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
Field 2 ≈ 7 × Field 3 for the values displayed. While it can be dangerous to draw conclusions from such a small sample, the two fields seem to contain essentially the same information.

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Continuous, quantitative, ratio
(c) Brightness as measured by people's judgments. Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.) Discrete, qualitative, nominal (ISBN numbers do have order information, though)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
Discrete, qualitative, ordinal
(j) Military rank. Discrete, qualitative, ordinal
(k) Distance from the center of campus. Continuous, quantitative, interval/ratio (depends)
(l) Density of a substance in grams per cubic centimeter. Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.) Discrete, qualitative, nominal

3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"
(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
The boss is right. A better measure is given by
Satisfaction(product) = (number of complaints for the product) / (total number of sales for the product).
(b) What can you say about the attribute type of the original product satisfaction attribute?
Nothing can be said about the attribute type of the original measure. For example, two products that have the same level of customer satisfaction may have different numbers of complaints and vice-versa.

4. A few months later, you are again approached by the same marketing director as in Exercise 3.
This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"
(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.
Yes, the marketing director is in trouble. A customer may give inconsistent rankings. For example, a customer may prefer 1 to 2, 2 to 3, but 3 to 1.
(b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?
One solution: For three items, do only the first two comparisons. A more general solution: Put the choice to the customer as one of ordering the product, but still only allow pairwise comparisons. In general, creating an ordinal measurement scale based on pairwise comparison is difficult because of possible inconsistencies.
(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?
First, there is the issue that the scale is likely not an interval or ratio scale. Nonetheless, for practical purposes, an average may be good enough. A more important concern is that a few extreme ratings might result in an overall rating that is misleading. Thus, the median or a trimmed mean (see Chapter 3) might be a better choice.

5. Can you think of a situation in which identification numbers would be useful for prediction?
One example: Student IDs are a good predictor of graduation date.

6. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.
(a) How would you convert this data into a form suitable for association analysis?
Association rule analysis works with binary attributes, so you have to convert the original data into binary form as follows:

Q1=A  Q1=B  Q1=C  Q1=D  ...  Q100=A  Q100=B  Q100=C  Q100=D
  1     0     0     0   ...     1       0       0       0
  0     0     1     0   ...     0       1       0       0

(b) In particular, what type of attributes would you have and how many of them are there?
400 asymmetric binary attributes.

7. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
A feature shows spatial auto-correlation if locations that are closer to each other are more similar with respect to the values of that feature than locations that are farther away. It is more common for physically close locations to have similar temperatures than similar amounts of rainfall, since rainfall can be very localized; i.e., the amount of rainfall can change abruptly from one location to another. Therefore, daily temperature shows more spatial autocorrelation than daily rainfall.

8.
Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
The ij-th entry of a document-term matrix is the number of times that term j occurs in document i. Most documents contain only a small fraction of all the possible terms, and thus, zero entries are not very meaningful, either in describing or comparing documents. Thus, a document-term matrix has asymmetric discrete features. If we apply a TFIDF normalization to terms and normalize the documents to have an L2 norm of 1, then this creates a term-document matrix with continuous features. However, the features are still asymmetric because these transformations do not create non-zero entries for any entries that were previously 0, and thus, zero entries are still not very meaningful.

9. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.
Observational sciences have the issue of not being able to completely control the quality of the data that they obtain. For example, until Earth-orbiting satellites became available, measurements of sea surface temperature relied on measurements from ships. Likewise, weather measurements are often taken from stations located in towns or cities. Thus, it is necessary to work with the data available, rather than data from a carefully designed experiment. In that sense, data analysis for observational science resembles data mining.

10. Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.
The precision of floating point numbers is a maximum precision. More explicitly, precision is often expressed in terms of the number of significant digits used to represent a value.
Thus, a single precision number can only represent values with up to 32 bits, ≈ 9 decimal digits of precision. However, often the precision of a value represented using 32 bits (64 bits) is far less than 32 bits (64 bits).

11. Give at least two advantages to working with data stored in text files instead of in a binary format.
(1) Text files can be easily inspected by typing the file or viewing it with a text editor.
(2) Text files are more portable than binary files, both across systems and programs.
(3) Text files can be more easily modified, for example, using a text editor or perl.

12. Distinguish between noise and outliers. Be sure to consider the following questions.
(a) Is noise ever interesting or desirable? Outliers?
No, by definition. Yes. (See Chapter 10.)
(b) Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
(c) Are noise objects always outliers?
No. Random distortion can result in an object or value much like a normal one.
(d) Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different from normal objects.
(e) Can noise make a typical value into an unusual one, or vice versa?
Yes.

13. Consider the problem of finding the K nearest neighbors of a data object. A programmer designs Algorithm 2.1 for this task.

Algorithm 2.1 Algorithm for finding K nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first K distances of the sorted list
5: end for

(a) Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same.
There are several problems.
First, the order of duplicate objects on a nearest neighbor list will depend on details of the algorithm and the order of objects in the data set. Second, if there are enough duplicates, the nearest neighbor list may consist only of duplicates. Third, an object may not be its own nearest neighbor.
(b) How would you fix this problem?
There are various approaches depending on the situation. One approach is to keep only one object for each group of duplicate objects. In this case, each neighbor can represent either a single object or a group of duplicate objects.

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
These attributes are all numerical, but can have widely varying ranges of values, depending on the scale used to measure them. Furthermore, the attributes are not asymmetric and the magnitude of an attribute matters. These latter two facts eliminate the cosine and correlation measure. Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.

15. You are given a set of m objects that is divided into K groups, where the ith group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)
(a) We randomly select n × m_i / m elements from each group.
(b) We randomly select n elements from the data set, without regard for the group to which an object belongs.
The first scheme is guaranteed to get the same number of objects from each group, while for the second scheme, the number of objects from each group will vary.
More specifically, the second scheme only guarantees that, on average, the number of objects from each group will be n × m_i / m.

16. Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf'_ij = tf_ij × log(m / df_i),   (2.1)

where df_i is the number of documents in which the ith term appears and is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document? In every document?
Terms that occur in every document have 0 weight, while those that occur in one document have maximum weight, i.e., log m.
(b) What might be the purpose of this transformation?
This normalization reflects the observation that terms that occur in every document do not have any power to distinguish one document from another, while those that are relatively rare do.

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.
(a) What is the corresponding interval (a, b) in terms of x?
(a^2, b^2)
(b) Give an equation that relates y to x.
In this interval, y is a linear function of x* = √x.

18. This exercise compares and contrasts some similarity and distance measures.
(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.
x = 0101010001
y = 0100011000
Hamming distance = number of different bits = 3
Jaccard similarity = number of 1-1 matches / (number of bits − number of 0-0 matches) = 2 / 5 = 0.4
(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don't let this confuse you.)
The Hamming distance is similar to the SMC. In fact, SMC = Hamming distance / number of bits. The Jaccard measure is similar to the cosine measure because both ignore 0-0 matches.
(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)
Jaccard is more appropriate for comparing the genetic makeup of two organisms, since we want to see how many genes these two organisms share.
(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)
Two human beings share > 99.9% of the same genes. If we want to compare the genetic makeup of two human beings, we should focus on their differences. Thus, the Hamming distance is more appropriate in this situation.

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean
cos(x, y) = 1, corr(x, y) = 0/0 (undefined), Euclidean(x, y) = 2
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard
cos(x, y) = 0, corr(x, y) = −1, Euclidean(x, y) = 2, Jaccard(x, y) = 0
(c) x = (0, −1, 0, 1), y = (1, 0, −1, 0): cosine, correlation, Euclidean
cos(x, y) = 0, corr(x, y) = 0, Euclidean(x, y) = 2
(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard
cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6
(e) x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1): cosine, correlation
cos(x, y) = 0, corr(x, y) = 0

20. Here, we further explore the cosine and correlation measures.
(a) What is the range of values that are possible for the cosine measure?
[−1, 1]. Many times the data has only positive entries and in that case the range is [0, 1].
(b) If two objects have a cosine measure of 1, are they identical? Explain.
Not necessarily. All we know is that the values of their attributes differ by a constant factor.
(c) What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)
For two vectors, x and y, that have a mean of 0, corr(x, y) = cos(x, y).
(d) Figure 2.1(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?
Since all the 100,000 points fall on the curve, there is a functional relationship between Euclidean distance and cosine similarity for normalized data. More specifically, there is an inverse relationship between cosine similarity and Euclidean distance.
For example, if two data points are identical, their cosine similarity is one and their Euclidean distance is zero, but if two data points have a high Euclidean distance, their cosine value is close to zero. Note that all the sample data points were from the positive quadrant, i.e., had only positive values. This means that all cosine (and correlation) values will be positive.
(e) Figure 2.1(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?
Same as previous answer, but with correlation substituted for cosine.
(f) Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.
Let x and y be two vectors where each vector has an L2 length of 1. For such vectors, the sum of the squared attribute values is 1, and the cosine of the two vectors is their dot product.

d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)^2 )
        = sqrt( Σ_{k=1..n} (x_k^2 − 2 x_k y_k + y_k^2) )
        = sqrt( 1 − 2 cos(x, y) + 1 )
        = sqrt( 2 (1 − cos(x, y)) )

(g) Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.
Let x and y be two vectors where each vector has a mean of 0 and a standard deviation of 1. For such vectors, the sum of the squared attribute values is n (the variance times n), and the correlation between the two vectors is their dot product divided by n.

d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)^2 )
        = sqrt( Σ_{k=1..n} (x_k^2 − 2 x_k y_k + y_k^2) )
        = sqrt( n − 2n corr(x, y) + n )
        = sqrt( 2n (1 − corr(x, y)) )

21.
Show that the set difference metric given by d(A, B) = size(A − B) + size(B − A) satisfies the metric axioms given on page 70. A and B are sets and A − B is the set difference.

[Figure 2.1. (a) Relationship between Euclidean distance and the cosine measure. (b) Relationship between Euclidean distance and correlation. Figures for exercise 20.]

1(a). Because the size of a set is greater than or equal to 0, d(A, B) ≥ 0.
1(b). If A = B, then A − B = B − A = the empty set, and thus d(A, B) = 0.
2. d(A, B) = size(A − B) + size(B − A) = size(B − A) + size(A − B) = d(B, A).
3. First, note that d(A, B) = size(A) + size(B) − 2 size(A ∩ B).
Therefore, d(A, B) + d(B, C) = size(A) + size(C) + 2 size(B) − 2 size(A ∩ B) − 2 size(B ∩ C).
Since size(A ∩ B) ≤ size(B) and size(B ∩ C) ≤ size(B),
d(A, B) + d(B, C) ≥ size(A) + size(C) + 2 size(B) − 2 size(B) = size(A) + size(C) ≥ size(A) + size(C) − 2 size(A ∩ C) = d(A, C).
Therefore, d(A, C) ≤ d(A, B) + d(B, C).

22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.
For time series clustering, time series with relatively high positive correlation should be put together. For this purpose, the following transformation would be appropriate:
sim = corr if corr ≥ 0; sim = 0 if corr < 0.
For predicting the behavior of one time series from another, it is necessary to consider strong negative, as well as strong positive, correlation. In this case, the following transformation might be appropriate:
sim = |corr|.
Note that this assumes that you only want to predict magnitude, not direction.

23.
Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].
d = (1 − s) / s and d = −log s.

24. Proximity is typically defined between a pair of objects.
(a) Define two ways in which you might define the proximity among a group of objects.
Two examples are the following: (i) based on pairwise proximity, i.e., minimum pairwise similarity or maximum pairwise dissimilarity, or (ii) for points in Euclidean space, compute a centroid (the mean of all the points; see Section 8.2) and then compute the sum or average of the distances of the points to the centroid.
(b) How might you define the distance between two sets of points in Euclidean space?
One approach is to compute the distance between the centroids of the two sets of points.
(c) How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)
One approach is to compute the average pairwise proximity of objects in one group of objects with those objects in the other group. Other approaches are to take the minimum or maximum proximity. Note that the cohesion of a cluster is related to the notion of the proximity of a group of objects among themselves and that the separation of clusters is related to the concept of the proximity of two groups of objects. (See Section 8.4.) Furthermore, the proximity of two clusters is an important concept in agglomerative hierarchical clustering. (See Section 8.2.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)
(a) If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary?
Hint: The triangle inequality, d(x, z) ≤ d(x, y) + d(y, x), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).

Unfortunately, there is a typo and a lack of clarity in the hint. The hint should be phrased as follows:

Hint: If z is an arbitrary point of S, then the triangle inequality, d(x, y) ≤ d(x, z) + d(y, z), can be rewritten as d(y, z) ≥ d(x, y) − d(x, z).

Another application of the triangle inequality, starting with d(x, z) ≤ d(x, y) + d(y, z), shows that d(y, z) ≥ d(x, z) − d(x, y). If the lower bound of d(y, z) obtained from either of these inequalities is greater than ε, then d(y, z) does not need to be calculated. Also, if the upper bound of d(y, z) obtained from the inequality d(y, z) ≤ d(y, x) + d(x, z) is less than or equal to ε, then d(y, z) does not need to be calculated either, since z is guaranteed to be within ε of y.

(b) In general, how would the distance between x and y affect the number of distance calculations?

If x = y, then no calculations are necessary. As x becomes farther away, typically more distance calculations are needed.

(c) Suppose that you can find a small subset of points S′, from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

Let x and y be the two points and let x∗ and y∗ be the points in S′ that are closest to the two points, respectively. If d(x∗, y∗) + 2ε ≤ β, then we can safely conclude d(x, y) ≤ β. Likewise, if d(x∗, y∗) − 2ε ≥ β, then we can safely conclude d(x, y) ≥ β. These formulas are derived by considering the cases where x and y are as far from x∗ and y∗ as possible and as far from or as close to each other as possible.
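A minimal sketch of the pruning idea in part (a), on hypothetical random point data: distances from every point z to the reference point x are precomputed, and the lower bound d(y, z) ≥ |d(x, y) − d(x, z)| lets us skip the exact computation of d(y, z) for most points.

```python
import math
import random

def range_query(points, dists_to_x, x, y, eps):
    """Points within eps of y, pruning with the triangle inequality.

    dists_to_x[i] holds the precomputed distance d(x, points[i])."""
    d_xy = math.dist(x, y)
    hits, computed = [], 0
    for z, d_xz in zip(points, dists_to_x):
        # lower bound on d(y, z); if it already exceeds eps, skip z
        if abs(d_xy - d_xz) > eps:
            continue
        computed += 1
        if math.dist(y, z) <= eps:
            hits.append(z)
    return hits, computed

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(2000)]
x = (0.0, 0.0)
d_to_x = [math.dist(x, p) for p in pts]
y, eps = (0.9, 0.9), 0.1

hits, computed = range_query(pts, d_to_x, x, y, eps)
brute = [p for p in pts if math.dist(y, p) <= eps]
print(len(hits), "hits;", computed, "of", len(pts), "distances computed")
```

The pruning is exact: the result matches a brute-force scan, but only the points whose lower bound does not rule them out ever have their distance to y computed.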
26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 70. Specifically, d(x, y) = 1 − J(x, y).

1(a). Because J(x, y) ≤ 1, d(x, y) ≥ 0.
1(b). Because J(x, x) = 1, d(x, x) = 0.
2. Because J(x, y) = J(y, x), d(x, y) = d(y, x).
3. (Proof due to Jeffrey Ullman.)

minhash(x) is the index of the first nonzero entry of x. prob(minhash(x) = k) is the probability that minhash(x) = k when x is randomly permuted. Note that prob(minhash(x) = minhash(y)) = J(x, y) (the minhash lemma). Therefore,

d(x, y) = 1 − prob(minhash(x) = minhash(y)) = prob(minhash(x) ≠ minhash(y)).

We have to show that

prob(minhash(x) ≠ minhash(z)) ≤ prob(minhash(x) ≠ minhash(y)) + prob(minhash(y) ≠ minhash(z)).

However, note that whenever minhash(x) ≠ minhash(z), at least one of minhash(x) ≠ minhash(y) and minhash(y) ≠ minhash(z) must be true.

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 70. Specifically, d(x, y) = arccos(cos(x, y)). Note that angles are in the range 0◦ to 180◦.

1(a). Because −1 ≤ cos(x, y) ≤ 1 and arccos takes values in [0◦, 180◦], d(x, y) ≥ 0.
1(b). Because cos(x, x) = 1, d(x, x) = arccos(1) = 0.
2. Because cos(x, y) = cos(y, x), d(x, y) = d(y, x).
3. If the three vectors lie in a plane, then it is obvious that the angle between x and z must be less than or equal to the sum of the angles between x and y and between y and z. If y′ is the projection of y into the plane defined by x and z, then note that the angles between x and y and between y and z are greater than or equal to those between x and y′ and between y′ and z.

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

In general, an object can be a record whose fields (attributes) are of different types.
To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. This can be done straightforwardly by using Equations 2.15 or 2.16, but is still somewhat ad hoc, at least compared to proximity measures such as the Euclidean distance or correlation, which are mathematically well-founded. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.

3 Exploring Data

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.

MATLAB and R have excellent facilities for visualization. Most of the figures in this chapter were created using MATLAB. R is freely available from http://www.r-project.org/.

2. Identify at least two advantages and two disadvantages of using color to visually represent information.

Advantages: Color makes it much easier to visually distinguish visual elements from one another. For example, three clusters of two-dimensional points are more readily distinguished if the markers representing the points have different colors, rather than only different shapes. Also, figures with color are more interesting to look at.

Disadvantages: Some people are color blind and may not be able to properly interpret a color figure. Grayscale figures can show more detail in some cases. Color can be hard to use properly. For example, a poor color scheme can be garish or can focus attention on unimportant elements.

3. What are the arrangement issues that arise with respect to three-dimensional plots?

It would have been better to state this more generally as “What are the issues . . .
,” since selection, as well as arrangement, plays a key role in displaying a three-dimensional plot.

The key issue for three-dimensional plots is how to display information so that as little information is obscured as possible. If the plot is of a two-dimensional surface, then the choice of a viewpoint is critical. However, if the plot is in electronic form, then it is sometimes possible to interactively change the viewpoint to get a complete view of the surface. For three-dimensional solids, the situation is even more challenging. Typically, portions of the information must be omitted in order to provide the necessary information. For example, a slice or cross-section of a three-dimensional object is often shown. In some cases, transparency can be used. Again, the ability to change the arrangement of the visual elements interactively can be helpful.

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?

Simple random sampling is not the best approach, since it will eliminate most of the points in sparse regions. It is better to undersample the regions where data objects are too dense while keeping most or all of the data objects from sparse regions.

5. Describe how you would create visualizations to display information that describes the following types of systems. Be sure to address the following issues:

• Representation. How will you map objects, attributes, and relationships to visual elements?
• Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.
• Selection. How will you handle a large number of attributes and data objects?
The following solutions are intended for illustration.

(a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.

The connectivity of the network would best be represented as a graph, with the nodes being routers, gateways, or other communications devices and the links representing the connections. The bandwidth of each connection could be represented by the width of the corresponding link. Color could be used to show the percent usage of the links and nodes.

(b) The distribution of specific plant and animal species around the world for a specific moment in time.

The simplest approach is to display each species on a separate map of the world and to shade the regions of the world where the species occurs. If several species are to be shown at once, then icons for each species can be placed on a map of the world.

(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.

The resource usage of each program could be displayed as a bar plot of the three quantities. Since the three quantities would have different scales, a proper scaling of the resources would be necessary for this to work well. For example, resource usage could be displayed as a percentage of the total. Alternatively, we could use three bar plots, one for each type of resource usage. On each of these plots there would be a bar whose height represents the usage of the corresponding program. This approach would not require any scaling. Yet another option would be to display a line plot of each program’s resource usage. For each program, a line would be constructed by (1) considering processor time, main memory, and disk as different x locations, (2) letting the percentage resource usage of a particular program for the three quantities be the y values associated with the x values, and then (3) drawing a line to connect these three points.
Note that an ordering of the three quantities needs to be specified, but is arbitrary. For this approach, the resource usage of all programs could be displayed on the same plot.

(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.

For each gender, the occupation breakdown could be displayed as an array of pie charts, where each row of pie charts indicates a particular level of education and each column indicates a particular year. For convenience, the time gap between each column could be five or ten years. Alternatively, we could order the occupations and then, for each gender, compute the cumulative percent employment for each occupation. If this quantity is plotted for each gender, then the area between two successive lines shows the percentage of employment for this occupation. If a color is associated with each occupation, then the area between each set of lines can also be colored with the color associated with each occupation. A similar way to show the same information would be to use a sequence of stacked bar graphs.

6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.

A stem and leaf plot shows you the actual distribution of values. On the other hand, a stem and leaf plot becomes rather unwieldy for a large number of values.

7. How might you address the problem that a histogram depends on the number and location of the bins?

The best approach is to estimate what the actual distribution function of the data looks like using kernel density estimation. This branch of data analysis is relatively well-developed and is more appropriate if the widely available, but simplistic, approach of a histogram is not sufficient.

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed.
What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

(a) If the line representing the median of the data is in the middle of the box, then the data is symmetrically distributed, at least in terms of the 50% of the data between the first and third quartiles. For the remaining data, the length of the whiskers and outliers is also an indication, although, since these features do not involve as many points, they may be misleading.

(b) Sepal width and length seem to be relatively symmetrically distributed, petal length seems to be rather skewed, and petal width is somewhat skewed.

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.

For Setosa, sepal length > sepal width > petal length > petal width. For Versicolour and Virginica, sepal length > sepal width and petal length > petal width, but although sepal length > petal length, petal length > sepal width.

10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.

A great deal of information can be obtained by looking at (1) the box plots for each attribute, and (2) the box plots for a particular attribute across various categories of a second attribute. For example, if we compare the box plots of weight for different categories of age, we would see that weight increases with age.

11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.

We would expect such a distribution if the three species of Iris can be ordered according to their size, and if petal length and width are both correlated to the size of the plant and to each other.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

There is a relatively flat area in the curves of the empirical CDFs and the percentile plots for both petal length and petal width.
This indicates a set of flowers for which these attributes have a relatively uniform value.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

The fact that the attribute values are ordered.

14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.

Any set of data for which all combinations of values are unlikely to occur would produce sparse data cubes. This would include sets of continuous attributes where the set of objects described by the attributes doesn’t occupy the entire data space, but only a fraction of it, as well as discrete attributes, where many combinations of values don’t occur.

A dense data cube would tend to arise when either almost all combinations of the categories of the underlying attributes occur, or the level of aggregation is high enough so that all combinations are likely to have values. For example, consider a data set that contains the type of traffic accident, as well as its location and date. The original data cube would be very sparse, but if it is aggregated to have categories consisting of single- or multiple-car accidents, the state in which the accident occurred, and the month in which it occurred, then we would obtain a dense data cube.

15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

A summary statistic that would be of interest is the frequency with which values or combinations of values, target and otherwise, occur.
From this we could derive conditional relationships among various values. In turn, these relationships could be displayed using a graph similar to that used to display Bayesian networks.

16. Construct a data cube from Table 3.1. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.

Table 3.1. Fact table for Exercise 16.

Product ID   Location ID   Number Sold
    1             1            10
    1             3             6
    2             1             5
    2             2            22

The data cube is shown in Table 3.2. It is a dense cube; only two cells are empty.

Table 3.2. Data cube for Exercise 16.

                    Location ID
Product ID     1     2     3   Total
    1         10     0     6     16
    2          5    22     0     27
  Total       15    22     6     43

17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

The dimensionality reduction of PCA or SVD can be viewed as a projection of the data onto a reduced set of dimensions. In aggregation, groups of dimensions are combined. In some cases, as when days are aggregated into months or the sales of a product are aggregated by store location, the aggregation can be viewed as a change of scale. In contrast, the dimensionality reduction provided by PCA and SVD does not have such an interpretation.

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

A   B   C   D   Class
T   T   T   T     T
T   T   T   F     F
T   T   F   T     F
T   T   F   F     T
T   F   T   T     F
T   F   T   F     T
T   F   F   T     T
T   F   F   F     F
F   T   T   T     F
F   T   T   F     T
F   T   F   T     T
F   T   F   F     F
F   F   T   T     T
F   F   T   F     F
F   F   F   T     F
F   F   F   F     T

[Figure 4.1. Decision tree for the parity function of four Boolean attributes: every path from the root tests A, B, C, and D in turn, giving 16 leaves.]

The preceding tree cannot be simplified.

2. Consider the training examples shown in Table 4.1 for a binary classification problem.

Table 4.1. Data set for Exercise 2.
Customer ID   Gender   Car Type   Shirt Size    Class
     1          M      Family     Small          C0
     2          M      Sports     Medium         C0
     3          M      Sports     Medium         C0
     4          M      Sports     Large          C0
     5          M      Sports     Extra Large    C0
     6          M      Sports     Extra Large    C0
     7          F      Sports     Small          C0
     8          F      Sports     Small          C0
     9          F      Sports     Medium         C0
    10          F      Luxury     Large          C0
    11          M      Family     Large          C1
    12          M      Family     Extra Large    C1
    13          M      Family     Medium         C1
    14          M      Luxury     Extra Large    C1
    15          F      Luxury     Small          C1
    16          F      Luxury     Small          C1
    17          F      Luxury     Medium         C1
    18          F      Luxury     Medium         C1
    19          F      Luxury     Medium         C1
    20          F      Luxury     Large          C1

(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.

(b) Compute the Gini index for the Customer ID attribute.
Answer: The gini for each Customer ID value is 0. Therefore, the overall gini for Customer ID is 0.

(c) Compute the Gini index for the Gender attribute.
Answer: The gini for Male is 1 − 0.6² − 0.4² = 0.48. The gini for Female is also 0.48. Therefore, the overall gini for Gender is 0.5 × 0.48 + 0.5 × 0.48 = 0.48.

Table 4.2. Data set for Exercise 3.

Instance   a1   a2   a3   Target Class
   1        T    T   1.0       +
   2        T    T   6.0       +
   3        T    F   5.0       −
   4        F    F   4.0       +
   5        F    T   7.0       −
   6        F    T   3.0       −
   7        F    F   8.0       −
   8        T    F   7.0       +
   9        F    T   5.0       −

(d) Compute the Gini index for the Car Type attribute using multiway split.
Answer: The gini for Family car is 0.375, Sports car is 0, and Luxury car is 0.2188. The overall gini is 0.1625.

(e) Compute the Gini index for the Shirt Size attribute using multiway split.
Answer: The gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall gini for the Shirt Size attribute is 0.4914.

(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer: Car Type, because it has the lowest gini among the three attributes.

(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power, since new customers are assigned to new Customer IDs.
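The Gini computations in parts (a) through (e) can be reproduced with a short script; the class counts below are read directly off Table 4.1 (C0, C1 per attribute value), and the function names are our own.

```python
def gini(counts):
    """Gini index of a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(children):
    """Weighted Gini of a multiway split; children is a list of count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# (C0, C1) counts per attribute value, read off Table 4.1
overall  = gini([10, 10])
gender   = split_gini([[6, 4], [4, 6]])                   # M, F
car_type = split_gini([[1, 3], [8, 0], [1, 7]])           # Family, Sports, Luxury
shirt    = split_gini([[3, 2], [3, 4], [2, 2], [2, 2]])   # S, M, L, XL
cust_id  = split_gini([[1, 0]] * 10 + [[0, 1]] * 10)      # every ID is pure

print(round(overall, 4), round(gender, 4), round(car_type, 4),
      round(shirt, 4), round(cust_id, 4))
```

Running this confirms that Car Type has the lowest gini of the three non-trivial attributes, while Customer ID is 0 only because every value is a singleton.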
3. Consider the training examples shown in Table 4.2 for a binary classification problem.

(a) What is the entropy of this collection of training examples with respect to the positive class?
Answer: There are four positive examples and five negative examples. Thus, P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is −4/9 log2(4/9) − 5/9 log2(5/9) = 0.9911.

(b) What are the information gains of a1 and a2 relative to these training examples?
Answer: For attribute a1, the corresponding counts and probabilities are:

a1    +    −
T     3    1
F     1    4

The entropy for a1 is

(4/9)[−(3/4) log2(3/4) − (1/4) log2(1/4)] + (5/9)[−(1/5) log2(1/5) − (4/5) log2(4/5)] = 0.7616.

Therefore, the information gain for a1 is 0.9911 − 0.7616 = 0.2294.

For attribute a2, the corresponding counts and probabilities are:

a2    +    −
T     2    3
F     2    2

The entropy for a2 is

(5/9)[−(2/5) log2(2/5) − (3/5) log2(3/5)] + (4/9)[−(2/4) log2(2/4) − (2/4) log2(2/4)] = 0.9839.

Therefore, the information gain for a2 is 0.9911 − 0.9839 = 0.0072.

(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
Answer: Sorting the examples by a3 gives:

a3 (sorted)   1.0   3.0   4.0   5.0   5.0   6.0   7.0   7.0   8.0
Class label    +     −     +     −     −     +     +     −     −

Split point   2.0     3.5     4.5     5.5     6.5     7.5
Entropy      0.8484  0.9885  0.9183  0.9839  0.9728  0.8889
Info Gain    0.1427  0.0026  0.0728  0.0072  0.0183  0.1022

The best split for a3 occurs at split point 2.0.

(d) What is the best split (among a1, a2, and a3) according to the information gain?
Answer: According to the information gain, a1 produces the best split.

(e) What is the best split (between a1 and a2) according to the classification error rate?
Answer: For attribute a1: error rate = 2/9. For attribute a2: error rate = 4/9. Therefore, according to error rate, a1 produces the best split.

(f) What is the best split (between a1 and a2) according to the Gini index?
Answer: For attribute a1, the gini index is

(4/9)[1 − (3/4)² − (1/4)²] + (5/9)[1 − (1/5)² − (4/5)²] = 0.3444.
For attribute a2, the gini index is

(5/9)[1 − (2/5)² − (3/5)²] + (4/9)[1 − (2/4)² − (2/4)²] = 0.4889.

Since the gini index for a1 is smaller, it produces the better split.

4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.

Answer: Let Y = {y1, y2, · · · , yc} denote the c classes and X = {x1, x2, · · · , xk} denote the k attribute values of an attribute X. Before a node is split on X, the entropy is:

E(Y) = − Σ_{j=1}^{c} P(yj) log2 P(yj) = − Σ_{j=1}^{c} Σ_{i=1}^{k} P(xi, yj) log2 P(yj),   (4.1)

where we have used the fact that P(yj) = Σ_{i=1}^{k} P(xi, yj), from the law of total probability.

After splitting on X, the entropy for each child node X = xi is:

E(Y|xi) = − Σ_{j=1}^{c} P(yj|xi) log2 P(yj|xi),   (4.2)

where P(yj|xi) is the fraction of examples with X = xi that belong to class yj. The entropy after splitting on X is given by the weighted entropy of the children nodes:

E(Y|X) = Σ_{i=1}^{k} P(xi) E(Y|xi)
       = − Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi) P(yj|xi) log2 P(yj|xi)
       = − Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj|xi),   (4.3)

where we have used a known fact from probability theory that P(xi, yj) = P(yj|xi) × P(xi). Note that E(Y|X) is also known as the conditional entropy of Y given X.

To answer this question, we need to show that E(Y|X) ≤ E(Y). Let us compute the difference between the entropies after splitting and before splitting, i.e., E(Y|X) − E(Y), using Equations 4.1 and 4.3:

E(Y|X) − E(Y) = − Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj|xi) + Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj)
             = Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 [P(yj) / P(yj|xi)]
             = Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 [P(xi) P(yj) / P(xi, yj)]   (4.4)

To prove that Equation 4.4 is non-positive, we use the following property of a logarithmic function:

Σ_{k=1}^{d} ak log(zk) ≤ log( Σ_{k=1}^{d} ak zk ),   (4.5)

subject to the condition that Σ_{k=1}^{d} ak = 1.
This property is a special case of a more general theorem involving convex functions (which include the logarithmic function) known as Jensen’s inequality.

By applying Jensen’s inequality, Equation 4.4 can be bounded as follows:

E(Y|X) − E(Y) ≤ log2 [ Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) × P(xi) P(yj) / P(xi, yj) ]
             = log2 [ Σ_{i=1}^{k} P(xi) Σ_{j=1}^{c} P(yj) ]
             = log2(1)
             = 0

Because E(Y|X) − E(Y) ≤ 0, it follows that entropy never increases after splitting on an attribute.

5. Consider the following data set for a binary class problem.

A   B   Class Label
T   F        +
T   T        +
T   T        +
T   F        −
T   T        +
F   F        −
F   F        −
F   F        −
T   T        −
T   F        −

(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

Answer: The contingency tables after splitting on attributes A and B are:

      A=T   A=F            B=T   B=F
 +     4     0        +     3     1
 −     3     3        −     1     5

The overall entropy before splitting is:

Eorig = −0.4 log 0.4 − 0.6 log 0.6 = 0.9710

The information gain after splitting on A is:

EA=T = −(4/7) log(4/7) − (3/7) log(3/7) = 0.9852
EA=F = −(3/3) log(3/3) − (0/3) log(0/3) = 0
∆ = Eorig − (7/10) EA=T − (3/10) EA=F = 0.2813

The information gain after splitting on B is:

EB=T = −(3/4) log(3/4) − (1/4) log(1/4) = 0.8113
EB=F = −(1/6) log(1/6) − (5/6) log(5/6) = 0.6500
∆ = Eorig − (4/10) EB=T − (6/10) EB=F = 0.2565

Therefore, attribute A will be chosen to split the node.

(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

Answer: The overall gini before splitting is:

Gorig = 1 − 0.4² − 0.6² = 0.48

The gain in gini after splitting on A is:

GA=T = 1 − (4/7)² − (3/7)² = 0.4898
GA=F = 1 − (3/3)² − (0/3)² = 0
∆ = Gorig − (7/10) GA=T − (3/10) GA=F = 0.1371

The gain in gini after splitting on B is:

GB=T = 1 − (3/4)² − (1/4)² = 0.3750
GB=F = 1 − (1/6)² − (5/6)² = 0.2778
∆ = Gorig − (4/10) GB=T − (6/10) GB=F = 0.1633

Therefore, attribute B will be chosen to split the node.
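The calculations in parts (a) and (b) can be checked with a small script (function names are our own); it computes the gain for an arbitrary impurity measure, which makes the disagreement between the two criteria easy to reproduce.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini index of a class distribution given as counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(measure, parent, children):
    """Drop in impurity achieved by a split, for any impurity measure."""
    n = sum(parent)
    return measure(parent) - sum(sum(c) / n * measure(c) for c in children)

parent  = [4, 6]             # 4 positive, 6 negative examples
split_A = [[4, 3], [0, 3]]   # class counts for A = T and A = F
split_B = [[3, 1], [1, 5]]   # class counts for B = T and B = F

ig_A, ig_B = gain(entropy, parent, split_A), gain(entropy, parent, split_B)
gg_A, gg_B = gain(gini, parent, split_A), gain(gini, parent, split_B)
print(round(ig_A, 4), round(ig_B, 4))  # information gain prefers A
print(round(gg_A, 4), round(gg_B, 4))  # Gini gain prefers B
```

The output shows information gain ranking A above B while the Gini gain ranks B above A, which is exactly the situation part (c) asks about.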
(c) Figure 4.13 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.

Answer: Yes. Even though these measures have similar range and monotonic behavior, their respective gains, ∆, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts (a) and (b).

6. Consider the following set of training examples.

X   Y   Z   No. of Class C1 Examples   No. of Class C2 Examples
0   0   0              5                          40
0   0   1              0                          15
0   1   0             10                           5
0   1   1             45                           0
1   0   0             10                           5
1   0   1             25                           0
1   1   0              5                          20
1   1   1              0                          15

(a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?

Answer: Splitting Attribute at Level 1. To determine the test condition at the root node, we need to compute the error rates for attributes X, Y, and Z. For attribute X, the corresponding counts are:

X    C1   C2
0    60   60
1    40   40

Therefore, the error rate using attribute X is (60 + 40)/200 = 0.5.

For attribute Y, the corresponding counts are:

Y    C1   C2
0    40   60
1    60   40

Therefore, the error rate using attribute Y is (40 + 40)/200 = 0.4.

For attribute Z, the corresponding counts are:

Z    C1   C2
0    30   70
1    70   30

Therefore, the error rate using attribute Z is (30 + 30)/200 = 0.3.

Since Z gives the lowest error rate, it is chosen as the splitting attribute at level 1.

Splitting Attribute at Level 2. After splitting on attribute Z, the subsequent test condition may involve either attribute X or Y. This depends on the training examples distributed to the Z = 0 and Z = 1 child nodes. For Z = 0, the corresponding counts for attributes X and Y are the same, as shown in the table below.
X    C1   C2        Y    C1   C2
0    15   45        0    15   45
1    15   25        1    15   25

The error rate in both cases (X and Y) is (15 + 15)/100 = 0.3.

For Z = 1, the corresponding counts for attributes X and Y are shown in the tables below.

X    C1   C2        Y    C1   C2
0    45   15        0    25   15
1    25   15        1    45   15

Although the counts are somewhat different, their error rates remain the same, (15 + 15)/100 = 0.3.

[Two-level decision tree for part (a): the root splits on Z, and each child splits on X (or Y); both leaves of the Z = 0 child predict C2, and both leaves of the Z = 1 child predict C1.]

The overall error rate of the induced tree is (15 + 15 + 15 + 15)/200 = 0.3.

(b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?

Answer: After choosing attribute X to be the first splitting attribute, the subsequent test condition may involve either attribute Y or attribute Z.

For X = 0, the corresponding counts for attributes Y and Z are shown in the table below.

Y    C1   C2        Z    C1   C2
0     5   55        0    15   45
1    55    5        1    45   15

The error rates using attributes Y and Z are 10/120 and 30/120, respectively. Since attribute Y leads to a smaller error rate, it provides a better split.

For X = 1, the corresponding counts for attributes Y and Z are shown in the tables below.

Y    C1   C2        Z    C1   C2
0    35    5        0    15   25
1     5   35        1    25   15

The error rates using attributes Y and Z are 10/80 and 30/80, respectively. Since attribute Y leads to a smaller error rate, it provides a better split.

[Two-level decision tree for part (b): the root splits on X, and each child splits on Y; for X = 0 the leaves predict C2 (Y = 0) and C1 (Y = 1), and for X = 1 the leaves predict C1 (Y = 0) and C2 (Y = 1).]

The overall error rate of the induced tree is (10 + 10)/200 = 0.1.

(c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.

Answer: From the preceding results, the error rate for part (a) is significantly larger than that for part (b). This example shows that a greedy heuristic does not always produce an optimal solution.
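The error rates in parts (a) and (b) can be reproduced mechanically; the sketch below (names are our own) enumerates one- and two-level splits over the class counts from the table in Exercise 6.

```python
from itertools import product

# (x, y, z) -> (number of C1 examples, number of C2 examples)
counts = {
    (0, 0, 0): (5, 40),  (0, 0, 1): (0, 15),
    (0, 1, 0): (10, 5),  (0, 1, 1): (45, 0),
    (1, 0, 0): (10, 5),  (1, 0, 1): (25, 0),
    (1, 1, 0): (5, 20),  (1, 1, 1): (0, 15),
}

def leaf_errors(rows):
    """Misclassifications when a leaf predicts its majority class."""
    c1 = sum(counts[r][0] for r in rows)
    c2 = sum(counts[r][1] for r in rows)
    return min(c1, c2)

def split_errors(attr):
    """Total errors of a one-level split on attribute index attr (0=X, 1=Y, 2=Z)."""
    return sum(leaf_errors([r for r in counts if r[attr] == v]) for v in (0, 1))

def two_level_errors(first, second):
    """Errors of a two-level tree splitting on `first`, then `second`."""
    return sum(leaf_errors([r for r in counts
                            if r[first] == v1 and r[second] == v2])
               for v1, v2 in product((0, 1), repeat=2))

level1 = [split_errors(a) for a in (0, 1, 2)]          # X, Y, Z
greedy = min(two_level_errors(2, s) for s in (0, 1))   # greedy: Z first
better = min(two_level_errors(0, s) for s in (1, 2))   # X first
print(level1, greedy / 200, better / 200)
```

The level-1 error counts come out to 100, 80, and 60 for X, Y, and Z, so the greedy choice is Z, yet the Z-first tree misclassifies 60 of 200 instances while the X-first tree misclassifies only 20.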
7. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

A   B   C   Number of + Instances   Number of − Instances
T   T   T            5                        0
F   T   T            0                       20
T   F   T           20                        0
F   F   T            0                        5
T   T   F            0                        0
F   T   F           25                        0
T   F   F            0                        0
F   F   F            0                       25

(a) According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.

Answer: The error rate for the data without partitioning on any attribute is

Eorig = 1 − max(50/100, 50/100) = 50/100.

After splitting on attribute A, the gain in error rate is:

      A=T   A=F
 +     25    25
 −      0    50

EA=T = 1 − max(25/25, 0/25) = 0
EA=F = 1 − max(25/75, 50/75) = 25/75
∆A = Eorig − (25/100) EA=T − (75/100) EA=F = 50/100 − 0 − 25/100 = 25/100

After splitting on attribute B, the gain in error rate is:

      B=T   B=F
 +     30    20
 −     20    30

EB=T = 20/50
EB=F = 20/50
∆B = Eorig − (50/100) EB=T − (50/100) EB=F = 10/100

After splitting on attribute C, the gain in error rate is:

      C=T   C=F
 +     25    25
 −     25    25

EC=T = 25/50
EC=F = 25/50
∆C = Eorig − (50/100) EC=T − (50/100) EC=F = 0

The algorithm chooses attribute A because it has the highest gain.

(b) Repeat for the two children of the root node.

Answer: Because the A = T child node is pure, no further splitting is needed. For the A = F child node, the distribution of training instances is:

B   C    +    −
T   T    0   20
F   T    0    5
T   F   25    0
F   F    0   25

The classification error of the A = F child node is:

Eorig = 25/75.

After splitting on attribute B, the gain in error rate is:

      B=T   B=F
 +     25     0
 −     20    30

EB=T = 20/45
EB=F = 0
∆B = Eorig − (45/75) EB=T − (30/75) EB=F = 5/75

After splitting on attribute C, the gain in error rate is:

      C=T   C=F
 +      0    25
 −     25    25

EC=T = 0
EC=F = 25/50
∆C = Eorig − (25/75) EC=T − (50/75) EC=F = 0

The split will be made on attribute B.

(c) How many instances are misclassified by the resulting decision tree?

Answer: 20 instances are misclassified.
(The error rate is 20/100.)

(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.

Answer: For the C = T child node, the error rate before splitting is:

Eorig = 25/50.

After splitting on attribute A, the gain in error rate is:

      A=T   A=F
 +     25     0
 −      0    25

EA=T = 0
EA=F = 0
∆A = 25/50

After splitting on attribute B, the gain in error rate is:

      B=T   B=F
 +      5    20
 −     20     5

EB=T = 5/25
EB=F = 5/25
∆B = 15/50

Therefore, A is chosen as the splitting attribute.

For the C = F child node, the error rate before splitting is: Eorig = 25/50.

After splitting on attribute A, the error rate is:

      A=T   A=F
 +      0    25
 −      0    25

EA=T = 0
EA=F = 25/50
∆A = 0

After splitting on attribute B, the error rate is:

      B=T   B=F
 +     25     0
 −      0    25

EB=T = 0
EB=F = 0
∆B = 25/50

Therefore, B is used as the splitting attribute. The overall error rate of the induced tree is 0.

(e) Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.

The greedy heuristic does not necessarily lead to the best tree.

8. Consider the decision tree shown in Figure 4.2.

[Figure 4.2. Decision tree and data sets for Exercise 8. The root of the tree splits on A; the A = 0 child splits on B (B = 0 → +, B = 1 → −), and the A = 1 child splits on C (C = 0 → −, C = 1 → +).]

Training:

Instance   A   B   C   Class
    1      0   0   0     +
    2      0   0   1     +
    3      0   1   0     +
    4      0   1   1     −
    5      1   0   0     +
    6      1   0   0     +
    7      1   1   0     −
    8      1   0   1     +
    9      1   1   0     −
   10      1   1   0     −

Validation:

Instance   A   B   C   Class
   11      0   0   0     +
   12      0   1   1     +
   13      1   1   0     +
   14      1   0   1     −
   15      1   0   0     +

(a) Compute the generalization error rate of the tree using the optimistic approach.

Answer: According to the optimistic approach, the generalization error rate is 3/10 = 0.3.

(b) Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)

Answer: According to the pessimistic approach, the generalization error rate is (3 + 4 × 0.5)/10 = 0.5.

(c) Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.
Answer: According to the reduced error pruning approach, the generalization error rate is 4/5 = 0.8.

9. Consider the decision trees shown in Figure 4.3. [Figure 4.3: (a) a decision tree with 7 errors; (b) a decision tree with 4 errors.] Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of each decision tree according to the minimum description length (MDL) principle.

• The total description length of a tree is given by: Cost(tree, data) = Cost(tree) + Cost(data|tree).
• Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.
• Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.
• Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.
• Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?

Answer: Because there are 16 attributes, the cost for each internal node in the decision tree is log2(m) = log2(16) = 4 bits. Furthermore, because there are 3 classes, the cost for each leaf node is ⌈log2(k)⌉ = ⌈log2(3)⌉ = 2 bits. The cost for each misclassification error is log2(n) bits.

The overall cost for decision tree (a), which has 2 internal nodes and 3 leaf nodes, is 2×4 + 3×2 + 7 log2 n = 14 + 7 log2 n; the overall cost for decision tree (b), which has 4 internal nodes and 5 leaf nodes, is 4×4 + 5×2 + 4 log2 n = 26 + 4 log2 n. According to the MDL principle, tree (a) is better than (b) if n < 16 and is worse than (b) if n > 16.
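The MDL comparison can be reproduced numerically. This is a minimal sketch under the simplified encoding stated above (the function name and signature are mine):

```python
import math

def tree_cost(n_internal, n_leaves, n_errors, m_attrs=16, k_classes=3, n_train=None):
    """Total description length under the simplified MDL scheme above:
    log2(m) bits per internal node, ceil(log2(k)) bits per leaf,
    and log2(n) bits per training error."""
    return (n_internal * math.log2(m_attrs)
            + n_leaves * math.ceil(math.log2(k_classes))
            + n_errors * math.log2(n_train))

# Tree (a): 2 internal nodes, 3 leaves, 7 errors -> 14 + 7 log2(n)
# Tree (b): 4 internal nodes, 5 leaves, 4 errors -> 26 + 4 log2(n)
for n in (8, 16, 32):
    a, b = tree_cost(2, 3, 7, n_train=n), tree_cost(4, 5, 4, n_train=n)
    print(n, a, b)  # the preference flips at n = 16, where both cost 42 bits
```

At n = 16 both trees cost 42 bits; below that tree (a) wins, above it tree (b) wins, matching the answer.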
While the .632 bootstrap approach is useful for obtaining a reliable estimate of model accuracy, it has a known limitation. Consider a two-class problem where there are equal numbers of positive and negative examples in the data. Suppose the class labels for the examples are generated randomly. The classifier used is an unpruned decision tree (i.e., a perfect memorizer). Determine the accuracy of the classifier using each of the following methods.

(a) The holdout method, where two-thirds of the data are used for training and the remaining one-third are used for testing.

Answer: Assuming that the training and test samples are equally representative, the test error rate will be close to 50%.

(b) Ten-fold cross-validation.

Answer: Assuming that the training and test samples for each fold are equally representative, the test error rate will be close to 50%.

(c) The .632 bootstrap method.

Answer: The training accuracy for a perfect memorizer is 100%, while the accuracy on each bootstrap sample is close to 50%. Substituting this information into the formula for the .632 bootstrap method, the accuracy estimate is:

acc_boot = (1/b) Σ_{i=1}^{b} (0.632 × 0.5 + 0.368 × 1) = 0.684.

(d) From the results in parts (a), (b), and (c), which method provides a more reliable evaluation of the classifier's accuracy?

Answer: The ten-fold cross-validation and holdout methods provide a better error estimate than the .632 bootstrap method.

11. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, p_A be the accuracy of classifier A, p_B be the accuracy of classifier B, and p = (p_A + p_B)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

Z = (p_A − p_B) / sqrt(2p(1 − p)/N).

Classifier A is assumed to be better than classifier B if Z > 1.96.
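The Z-statistic just defined is straightforward to compute. In the sketch below the function name is mine and the accuracies in the example call are illustrative values, not taken from any table in this document:

```python
import math

def z_statistic(p_a, p_b, n):
    """Z = (pA - pB) / sqrt(2 p (1 - p) / N), with p = (pA + pB) / 2."""
    p = (p_a + p_b) / 2
    return (p_a - p_b) / math.sqrt(2 * p * (1 - p) / n)

# Hypothetical accuracies for two classifiers evaluated on 1000 test records:
z = z_statistic(0.85, 0.80, 1000)
print(round(z, 3), "A is significantly better" if z > 1.96 else "no significant difference")
```

Note that the statistic is antisymmetric in the two accuracies, so testing B against A simply flips the sign.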
Table 4.3 compares the accuracies of three different classifiers (decision tree classifiers, naïve Bayes classifiers, and support vector machines) on various data sets. (The latter two classifiers are described in Chapter 5.)

Table 4.3. Comparing the accuracy of various classification methods.

Data Set      Size (N)  Decision Tree (%)  Naïve Bayes (%)  Support vector machine (%)
Anneal           898        92.09              79.62              87.19
Australia        690        85.51              76.81              84.78
Auto             205        81.95              58.05              70.73
Breast           699        95.14              95.99              96.42
Cleve            303        76.24              83.50              84.49
Credit           690        85.80              77.54              85.07
Diabetes         768        72.40              75.91              76.82
German          1000        70.90              74.70              74.40
Glass            214        67.29              48.59              59.81
Heart            270        80.00              84.07              83.70
Hepatitis        155        81.94              83.23              87.10
Horse            368        85.33              78.80              82.61
Ionosphere       351        89.17              82.34              88.89
Iris             150        94.67              95.33              96.00
Labor             57        78.95              94.74              92.98
Led7            3200        73.34              73.16              73.56
Lymphography     148        77.03              83.11              86.49
Pima             768        74.35              76.04              76.95
Sonar            208        78.85              69.71              76.92
Tic-tac-toe      958        83.72              70.04              98.33
Vehicle          846        71.04              45.04              74.94
Wine             178        94.38              96.63              98.88
Zoo              101        93.07              93.07              96.04

Answer: A summary of the relative performance of the classifiers (win-loss-draw counts, row versus column) is given below:

win-loss-draw           Decision tree  Naïve Bayes  Support vector machine
Decision tree              0-0-23        9-3-11           2-7-14
Naïve Bayes                3-9-11        0-0-23           0-8-15
Support vector machine     7-2-14        8-0-15           0-0-23

12. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.

Answer: Let r = X/N. Since X has a binomial distribution, r also has the same distribution. The mean and variance for r can be computed as follows:

Mean: E[r] = E[X/N] = E[X]/N = (Np)/N = p;
Variance: E[(r − E[r])^2] = E[(X/N − E[X/N])^2] = E[(X − E[X])^2]/N^2 = Np(1 − p)/N^2 = p(1 − p)/N.

5 Classification: Alternative Techniques

1.
Consider a binary classification problem with the following set of attributes and attribute values:

• Air Conditioner = {Working, Broken}
• Engine = {Good, Bad}
• Mileage = {High, Medium, Low}
• Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low

(a) Are the rules mutually exclusive?
Answer: No.

(b) Is the rule set exhaustive?
Answer: Yes.

(c) Is ordering needed for this set of rules?
Answer: Yes, because a test instance may trigger more than one rule.

(d) Do you need a default class for the rule set?
Answer: No, because every instance is guaranteed to trigger at least one rule.

2. The RIPPER algorithm (by Cohen [1]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [3]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced-error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

R1: A → C
R2: A ∧ B → C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

v_IREP = (p + (N − n)) / (P + N),

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. v_IREP is actually similar to classification accuracy on the validation set.
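v_IREP can be computed directly from these counts. The sketch below (helper name is mine) uses the validation-set counts that appear in part (b) of this exercise: P = N = 500, with R1 covering 200 positives and 50 negatives, and R2 covering 100 positives and 5 negatives.

```python
def v_irep(p, n, P, N):
    """IREP's pruning measure: (p + (N - n)) / (P + N)."""
    return (p + (N - n)) / (P + N)

# Validation-set counts from part (b): P = 500 positives, N = 500 negatives.
print(v_irep(200, 50, 500, 500))  # 0.65  for R1
print(v_irep(100, 5, 500, 500))   # 0.595 for R2
```

Since v_IREP(R1) > v_IREP(R2), IREP would keep the more general rule R1, consistent with the answer to part (b).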
IREP favors rules that have higher values of v_IREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

v_RIPPER = (p − n) / (p + n).

(a) Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute FOIL's information gain for the rule R2 with respect to R1.

Answer: For this problem, p0 = 350, n0 = 150, p1 = 300, and n1 = 50. Therefore, FOIL's information gain for R2 with respect to R1 is:

Gain = 300 × (log2(300/350) − log2(350/500)) = 87.65.

(b) Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute v_IREP for both rules. Which rule does IREP prefer?

Answer: For this problem, P = 500 and N = 500.

For rule R1, p = 200 and n = 50. Therefore,
v_IREP(R1) = (p + (N − n))/(P + N) = (200 + (500 − 50))/1000 = 0.65.

For rule R2, p = 100 and n = 5:
v_IREP(R2) = (p + (N − n))/(P + N) = (100 + (500 − 5))/1000 = 0.595.

Thus, IREP prefers rule R1.

(c) Compute v_RIPPER for the previous problem. Which rule does RIPPER prefer?

Answer:
v_RIPPER(R1) = (p − n)/(p + n) = 150/250 = 0.6.
v_RIPPER(R2) = (p − n)/(p + n) = 95/105 ≈ 0.90.

Thus, RIPPER prefers the rule R2.

3. C4.5rules is an implementation of an indirect method for generating rules from a decision tree. RIPPER is an implementation of a direct method for generating rules directly from data.

(a) Discuss the strengths and weaknesses of both methods.

Answer: The C4.5rules algorithm generates classification rules from a global perspective.
This is because the rules are derived from decision trees, which are induced with the objective of partitioning the feature space into homogeneous regions, without focusing on any one class. In contrast, RIPPER generates its rules one class at a time. Thus, it is more biased towards the classes that are generated first.

(b) Consider a data set that has a large difference in the class sizes (i.e., some classes are much bigger than others). Which method (between C4.5rules and RIPPER) is better in terms of finding high-accuracy rules for the small classes?

Answer: The class-ordering scheme used by C4.5rules has an easier interpretation than the scheme used by RIPPER.

4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

(a) Rule accuracy.

Answer: The accuracies of the rules are 80% (for R1), 75% (for R2), and 52.6% (for R3), respectively. Therefore R1 is the best candidate and R3 is the worst candidate according to rule accuracy.

(b) FOIL's information gain.

Answer: Assume the initial rule is ∅ → +. This rule covers p0 = 100 positive examples and n0 = 400 negative examples.

The rule R1 covers p1 = 4 positive examples and n1 = 1 negative example. Therefore, FOIL's information gain for this rule is 4 × (log2(4/5) − log2(100/500)) = 8.

The rule R2 covers p1 = 30 positive examples and n1 = 10 negative examples. Therefore, FOIL's information gain for this rule is 30 × (log2(30/40) − log2(100/500)) = 57.2.

The rule R3 covers p1 = 100 positive examples and n1 = 90 negative examples. Therefore, FOIL's information gain for this rule is 100 × (log2(100/190) − log2(100/500)) = 139.6.
Therefore, R3 is the best candidate and R1 is the worst candidate according to FOIL's information gain.

(c) The likelihood ratio statistic.

Answer: For R1, the expected frequency for the positive class is 5 × 100/500 = 1 and the expected frequency for the negative class is 5 × 400/500 = 4. Therefore, the likelihood ratio for R1 is

2 × [4 × log2(4/1) + 1 × log2(1/4)] = 12.

For R2, the expected frequency for the positive class is 40 × 100/500 = 8 and the expected frequency for the negative class is 40 × 400/500 = 32. Therefore, the likelihood ratio for R2 is

2 × [30 × log2(30/8) + 10 × log2(10/32)] = 80.85.

For R3, the expected frequency for the positive class is 190 × 100/500 = 38 and the expected frequency for the negative class is 190 × 400/500 = 152. Therefore, the likelihood ratio for R3 is

2 × [100 × log2(100/38) + 90 × log2(90/152)] = 143.09.

Therefore, R3 is the best candidate and R1 is the worst candidate according to the likelihood ratio statistic.

(d) The Laplace measure.

Answer: The Laplace measures of the rules are 71.43% (for R1), 73.81% (for R2), and 52.6% (for R3), respectively. Therefore R2 is the best candidate and R3 is the worst candidate according to the Laplace measure.

(e) The m-estimate measure (with k = 2 and p+ = 0.2).

Answer: The m-estimate measures of the rules are 62.86% (for R1), 72.38% (for R2), and 52.3% (for R3), respectively. Therefore R2 is the best candidate and R3 is the worst candidate according to the m-estimate measure.

5. Figure 5.1 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:

(a) The likelihood ratio statistic.

Answer: There are 29 positive examples and 21 negative examples in the data set. R1 covers 12 positive examples and 3 negative examples. The expected frequency for the positive class is 15 × 29/50 = 8.7 and the expected frequency for the negative class is 15 × 21/50 = 6.3.
Therefore, the likelihood ratio for R1 is

2 × [12 × log2(12/8.7) + 3 × log2(3/6.3)] = 4.71.

[Figure 5.1: elimination of training records by the sequential covering algorithm. R1, R2, and R3 represent regions covered by three different rules.]

R2 covers 7 positive examples and 3 negative examples. The expected frequency for the positive class is 10 × 29/50 = 5.8 and the expected frequency for the negative class is 10 × 21/50 = 4.2. Therefore, the likelihood ratio for R2 is

2 × [7 × log2(7/5.8) + 3 × log2(3/4.2)] = 0.89.

R3 covers 8 positive examples and 4 negative examples. The expected frequency for the positive class is 12 × 29/50 = 6.96 and the expected frequency for the negative class is 12 × 21/50 = 5.04. Therefore, the likelihood ratio for R3 is

2 × [8 × log2(8/6.96) + 4 × log2(4/5.04)] = 0.5472.

R1 is the best rule and R3 is the worst rule according to the likelihood ratio statistic.

(b) The Laplace measure.

Answer: The Laplace measures for the rules are 76.47% (for R1), 66.67% (for R2), and 64.29% (for R3), respectively. Therefore R1 is the best rule and R3 is the worst rule according to the Laplace measure.

(c) The m-estimate measure (with k = 2 and p+ = 0.58).

Answer: The m-estimate measures for the rules are 77.41% (for R1), 68.0% (for R2), and 65.43% (for R3), respectively. Therefore R1 is the best rule and R3 is the worst rule according to the m-estimate measure.

(d) The rule accuracy after R1 has been discovered, where none of the examples covered by R1 are discarded.

Answer: If the examples covered by R1 are not discarded, then R2 will be chosen because it has a higher accuracy (70%) than R3 (66.7%).

(e) The rule accuracy after R1 has been discovered, where only the positive examples covered by R1 are discarded.
Answer: If the positive examples covered by R1 are discarded, the new accuracies for R2 and R3 are 70% and 60%, respectively. Therefore R2 is preferred over R3.

(f) The rule accuracy after R1 has been discovered, where both positive and negative examples covered by R1 are discarded.

Answer: If the positive and negative examples covered by R1 are discarded, the new accuracies for R2 and R3 are 70% and 75%, respectively. In this case, R3 is preferred over R2.

6. (a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

Answer: Given P(S|UG) = 0.15, P(S|G) = 0.23, P(G) = 0.2, and P(UG) = 0.8, we want to compute P(G|S). By Bayes' theorem,

P(G|S) = (0.23 × 0.2) / (0.15 × 0.8 + 0.23 × 0.2) = 0.277.    (5.1)

(b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

Answer: An undergraduate student, because P(UG) > P(G).

(c) Repeat part (b) assuming that the student is a smoker.

Answer: An undergraduate student, because P(UG|S) > P(G|S).

(d) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Answer: First, we need to estimate all the probabilities:

P(D|UG) = 0.1, P(D|G) = 0.3.
P(D) = P(UG)P(D|UG) + P(G)P(D|G) = 0.8 × 0.1 + 0.2 × 0.3 = 0.14.
P(S) = P(S|UG)P(UG) + P(S|G)P(G) = 0.15 × 0.8 + 0.23 × 0.2 = 0.166.
P(DS|G) = P(D|G) × P(S|G) = 0.3 × 0.23 = 0.069 (using the conditional independence assumption).
P(DS|UG) = P(D|UG) × P(S|UG) = 0.1 × 0.15 = 0.015.

We need to compare P(G|DS) and P(UG|DS):

P(G|DS) = (0.069 × 0.2)/P(DS) = 0.0138/P(DS);
P(UG|DS) = (0.015 × 0.8)/P(DS) = 0.012/P(DS).

Since P(G|DS) > P(UG|DS), he/she is more likely to be a graduate student.

7. Consider the data set shown in Table 5.1.

Table 5.1. Data set for Exercise 7.

Record  A  B  C  Class
  1     0  0  0    +
  2     0  0  1    −
  3     0  1  1    −
  4     0  1  1    −
  5     0  0  1    +
  6     1  0  1    +
  7     1  0  1    −
  8     1  0  1    −
  9     1  1  1    +
 10     1  0  1    +

(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).

Answer:
P(A=1|−) = 2/5 = 0.4, P(B=1|−) = 2/5 = 0.4, P(C=1|−) = 1,
P(A=0|−) = 3/5 = 0.6, P(B=0|−) = 3/5 = 0.6, P(C=0|−) = 0;
P(A=1|+) = 3/5 = 0.6, P(B=1|+) = 1/5 = 0.2, P(C=1|+) = 2/5 = 0.4,
P(A=0|+) = 2/5 = 0.4, P(B=0|+) = 4/5 = 0.8, P(C=0|+) = 3/5 = 0.6.

(b) Use the estimates of conditional probabilities given in the previous question to predict the class label for a test sample (A=0, B=1, C=0) using the naïve Bayes approach.

Answer: Let P(A=0, B=1, C=0) = K. Then

P(+|A=0, B=1, C=0)
= P(A=0, B=1, C=0|+) × P(+) / P(A=0, B=1, C=0)
= P(A=0|+) P(B=1|+) P(C=0|+) × P(+) / K
= 0.4 × 0.2 × 0.6 × 0.5 / K = 0.024/K.

P(−|A=0, B=1, C=0)
= P(A=0, B=1, C=0|−) × P(−) / P(A=0, B=1, C=0)
= P(A=0|−) P(B=1|−) P(C=0|−) × P(−) / K
= 0/K.

The class label should be '+'.

(c) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.

Answer:
P(A=0|+) = (2+2)/(5+4) = 4/9, P(A=0|−) = (3+2)/(5+4) = 5/9,
P(B=1|+) = (1+2)/(5+4) = 3/9, P(B=1|−) = (2+2)/(5+4) = 4/9,
P(C=0|+) = (3+2)/(5+4) = 5/9, P(C=0|−) = (0+2)/(5+4) = 2/9.

(d) Repeat part (b) using the conditional probabilities given in part (c).
Answer: Let P(A=0, B=1, C=0) = K. Then

P(+|A=0, B=1, C=0)
= P(A=0|+) P(B=1|+) P(C=0|+) × P(+) / K
= (4/9) × (3/9) × (5/9) × 0.5 / K
= 0.0412/K.

P(−|A=0, B=1, C=0)
= P(A=0|−) P(B=1|−) P(C=0|−) × P(−) / K
= (5/9) × (4/9) × (2/9) × 0.5 / K
= 0.0274/K.

The class label should be '+'.

(e) Compare the two methods for estimating probabilities. Which method is better and why?

Answer: When one of the conditional probabilities is zero, the m-estimate approach is better, since we do not want the entire expression to become zero.

8. Consider the data set shown in Table 5.2.

Table 5.2. Data set for Exercise 8.

Instance  A  B  C  Class
   1      0  0  1    −
   2      1  0  1    +
   3      0  1  0    −
   4      1  0  0    −
   5      1  0  1    +
   6      0  0  1    +
   7      1  1  0    −
   8      0  0  0    −
   9      0  1  0    +
  10      1  1  1    +

(a) Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−) using the same approach as in the previous problem.

Answer: P(A=1|+) = 0.6, P(B=1|+) = 0.4, P(C=1|+) = 0.8, P(A=1|−) = 0.4, P(B=1|−) = 0.4, and P(C=1|−) = 0.2.

(b) Use the conditional probabilities in part (a) to predict the class label for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.

Answer: Let R: (A=1, B=1, C=1) be the test record. To determine its class, we need to compute P(+|R) and P(−|R). Using Bayes' theorem, P(+|R) = P(R|+)P(+)/P(R) and P(−|R) = P(R|−)P(−)/P(R). Since P(+) = P(−) = 0.5 and P(R) is constant, R can be classified by comparing P(+|R) and P(−|R). For this question,

P(R|+) = P(A=1|+) × P(B=1|+) × P(C=1|+) = 0.6 × 0.4 × 0.8 = 0.192;
P(R|−) = P(A=1|−) × P(B=1|−) × P(C=1|−) = 0.4 × 0.4 × 0.2 = 0.032.

Since P(R|+) is larger, the record is assigned to the (+) class.
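The naïve Bayes computation for Table 5.2 can be verified with a short script (helper names are mine; the probabilities are maximum-likelihood estimates with no smoothing):

```python
# Table 5.2, transcribed as (A, B, C, class) tuples.
DATA = [
    (0, 0, 1, "-"), (1, 0, 1, "+"), (0, 1, 0, "-"), (1, 0, 0, "-"),
    (1, 0, 1, "+"), (0, 0, 1, "+"), (1, 1, 0, "-"), (0, 0, 0, "-"),
    (0, 1, 0, "+"), (1, 1, 1, "+"),
]

def conditional(attr_index, value, label):
    """Maximum-likelihood estimate of P(attribute = value | class = label)."""
    rows = [r for r in DATA if r[3] == label]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)

def naive_bayes_score(record, label):
    """P(record | label) * P(label) under the naive independence assumption."""
    score = sum(1 for r in DATA if r[3] == label) / len(DATA)  # the prior
    for i, v in enumerate(record):
        score *= conditional(i, v, label)
    return score

record = (1, 1, 1)
plus, minus = naive_bayes_score(record, "+"), naive_bayes_score(record, "-")
print(round(plus, 3), round(minus, 3))  # 0.096 0.016 -> class '+'
```

The two scores are 0.5 × 0.192 = 0.096 and 0.5 × 0.032 = 0.016, so the record is assigned to the (+) class, as in part (b).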
(c) Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationship between A and B.

Answer: P(A=1) = 0.5, P(B=1) = 0.4, and P(A=1, B=1) = 0.2. Since P(A=1, B=1) = P(A=1) × P(B=1), A and B are independent.

(d) Repeat the analysis in part (c) using P(A=1), P(B=0), and P(A=1, B=0).

Answer: P(A=1) = 0.5, P(B=0) = 0.6, and P(A=1, B=0) = 0.3 = P(A=1) × P(B=0). A and B are still independent.

(e) Compare P(A=1, B=1|Class=+) against P(A=1|Class=+) and P(B=1|Class=+). Are the variables conditionally independent given the class?

Answer: Compare P(A=1, B=1|+) = 0.2 against P(A=1|+) = 0.6 and P(B=1|+) = 0.4. Since the product of P(A=1|+) and P(B=1|+) is not the same as P(A=1, B=1|+), A and B are not conditionally independent given the class.

9. (a) Explain how naïve Bayes performs on the data set shown in Figure 5.2. [Figure 5.2: records of classes A and B, described by a set of distinguishing attributes and a set of noise attributes; each class consists of two subgroups (A1, A2 and B1, B2).]

Answer: Naïve Bayes will not do well on this data set because the conditional probabilities for each distinguishing attribute given the class are the same for both class A and class B.

(b) If each class is further divided such that there are four classes (A1, A2, B1, and B2), will naïve Bayes perform better?

Answer: The performance of naïve Bayes will improve on the subclasses because the product of conditional probabilities among the distinguishing attributes will be different for each subclass.

(c) How will a decision tree perform on this data set (for the two-class problem)? What if there are four classes?

Answer: For the two-class problem, a decision tree will not perform well because the entropy will not improve after splitting the data using the distinguishing attributes. If there are four classes, then the decision tree will improve considerably.
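The independence checks in parts (c) through (e) of Exercise 8 can likewise be reproduced directly from Table 5.2. A minimal sketch (the predicate helper is mine):

```python
# Table 5.2 again, as (A, B, C, class) tuples.
DATA = [
    (0, 0, 1, "-"), (1, 0, 1, "+"), (0, 1, 0, "-"), (1, 0, 0, "-"),
    (1, 0, 1, "+"), (0, 0, 1, "+"), (1, 1, 0, "-"), (0, 0, 0, "-"),
    (0, 1, 0, "+"), (1, 1, 1, "+"),
]

def prob(predicate, rows=DATA):
    """Fraction of rows satisfying a predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)

# Marginally, A and B are independent: P(A=1, B=1) = P(A=1) P(B=1).
p_a = prob(lambda r: r[0] == 1)                  # 0.5
p_b = prob(lambda r: r[1] == 1)                  # 0.4
p_ab = prob(lambda r: r[0] == 1 and r[1] == 1)   # 0.2
print(abs(p_a * p_b - p_ab) < 1e-12)             # True

# Given class '+', the factorization no longer holds: 0.6 * 0.4 != 0.2.
plus = [r for r in DATA if r[3] == "+"]
pa_plus = prob(lambda r: r[0] == 1, plus)
pb_plus = prob(lambda r: r[1] == 1, plus)
pab_plus = prob(lambda r: r[0] == 1 and r[1] == 1, plus)
print(round(pa_plus * pb_plus, 2), pab_plus)     # 0.24 0.2
```

This makes the point of Exercise 8(e) concrete: attributes can be marginally independent yet fail to be conditionally independent given the class.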
Repeat the analysis shown in Example 5.3 for finding the location of a decision boundary using the following information:

(a) The prior probabilities are P(Crocodile) = 2 × P(Alligator).
Answer: x̂ = 13.0379.

(b) The prior probabilities are P(Alligator) = 2 × P(Crocodile).
Answer: x̂ = 13.9621.

(c) The prior probabilities are the same, but their standard deviations are different; i.e., σ(Crocodile) = 4 and σ(Alligator) = 2.
Answer: x̂ = 22.1668.

11. Figure 5.3 illustrates the Bayesian belief network for the data set shown in Table 5.3. (Assume that all the attributes are binary.) [Figure 5.3: Mileage and Air Conditioner are root nodes; Mileage points to Engine, and Engine and Air Conditioner both point to Car Value.]

Table 5.3. Data set for Exercise 11.

Mileage  Engine  Air Conditioner  Records with Value=Hi  Records with Value=Lo
  Hi      Good      Working               3                      4
  Hi      Good      Broken                1                      2
  Hi      Bad       Working               1                      5
  Hi      Bad       Broken                0                      4
  Lo      Good      Working               9                      0
  Lo      Good      Broken                5                      1
  Lo      Bad       Working               1                      2
  Lo      Bad       Broken                0                      2

(a) Draw the probability table for each node in the network.

Answer:
P(Mileage=Hi) = 0.5
P(Air Cond=Working) = 0.625
P(Engine=Good|Mileage=Hi) = 0.5
P(Engine=Good|Mileage=Lo) = 0.75
P(Value=High|Engine=Good, Air Cond=Working) = 0.750
P(Value=High|Engine=Good, Air Cond=Broken) = 0.667
P(Value=High|Engine=Bad, Air Cond=Working) = 0.222
P(Value=High|Engine=Bad, Air Cond=Broken) = 0

[Figure 5.4: Bayesian belief network for Exercise 12. Battery (B) and Fuel (F) are root nodes; both point to Gauge (G) and to Start (S). The probability tables are:
P(B = bad) = 0.1, P(F = empty) = 0.2;
P(G = empty | B = good, F = not empty) = 0.1
P(G = empty | B = good, F = empty) = 0.8
P(G = empty | B = bad, F = not empty) = 0.2
P(G = empty | B = bad, F = empty) = 0.9
P(S = no | B = good, F = not empty) = 0.1
P(S = no | B = good, F = empty) = 0.8
P(S = no | B = bad, F = not empty) = 0.9
P(S = no | B = bad, F = empty) = 1.0]

(b) Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).
Answer:

P(Engine = Bad, Air Cond = Broken)
= Σ_{α,β} P(Engine = Bad, Air Cond = Broken, Mileage = α, Value = β)
= Σ_{α,β} P(Value = β | Engine = Bad, Air Cond = Broken) × P(Engine = Bad | Mileage = α) P(Mileage = α) P(Air Cond = Broken)
= 0.1453.

12. Given the Bayesian network shown in Figure 5.4, compute the following probabilities:

(a) P(B = good, F = empty, G = empty, S = yes).

Answer:
P(B = good, F = empty, G = empty, S = yes)
= P(B = good) × P(F = empty) × P(G = empty | B = good, F = empty) × P(S = yes | B = good, F = empty)
= 0.9 × 0.2 × 0.8 × 0.2 = 0.0288.

(b) P(B = bad, F = empty, G = not empty, S = no).

Answer:
P(B = bad, F = empty, G = not empty, S = no)
= P(B = bad) × P(F = empty) × P(G = not empty | B = bad, F = empty) × P(S = no | B = bad, F = empty)
= 0.1 × 0.2 × 0.1 × 1.0 = 0.002.

(c) Given that the battery is bad, compute the probability that the car will start.

Answer:
P(S = yes | B = bad)
= Σ_α P(S = yes | B = bad, F = α) P(F = α)
= 0.1 × 0.8 + 0 × 0.2 = 0.08.

13. Consider the one-dimensional data set shown in Table 5.4.

Table 5.4. Data set for Exercise 13.

x:  0.5  3.0  4.5  4.6  4.9  5.2  5.3  5.5  7.0  9.5
y:   −    −    +    +    +    −    −    +    −    −

(a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).

Answer: 1-nearest neighbor: +; 3-nearest neighbor: −; 5-nearest neighbor: +; 9-nearest neighbor: −.

(b) Repeat the previous analysis using the distance-weighted voting approach described in Section 5.2.1.

Answer: 1-nearest neighbor: +; 3-nearest neighbor: +; 5-nearest neighbor: +; 9-nearest neighbor: +.

14. The nearest-neighbor algorithm described in Section 5.2 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [2] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM).
Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:

d(V1, V2) = Σ_{i=1}^{k} | n_{i1}/n_1 − n_{i2}/n_2 |,    (5.2)

where n_{ij} is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.

Consider the training set for the loan classification problem shown in Figure 5.9. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.

Answer: The training set shown in Figure 5.9 can be summarized for the Home Owner and Marital Status attributes as follows:

Marital Status:
Class   Single  Married  Divorced
Yes       2        0        1
No        2        4        1

Home Owner:
Class   Yes  No
Yes      0    3
No       3    4

d(Single, Married) = 1
d(Single, Divorced) = 0
d(Married, Divorced) = 1
d(Home Owner=Yes, Home Owner=No) = 6/7

15. For each of the Boolean functions given below, state whether the problem is linearly separable.

(a) A AND B AND C
Answer: Yes.
(b) NOT A AND B
Answer: Yes.
(c) (A OR B) AND (A OR C)
Answer: Yes.
(d) (A XOR B) AND (A OR B)
Answer: No.

16. (a) Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables.

Answer: Let x1 and x2 be a pair of Boolean variables and y be the output. For the AND function, a possible perceptron model is:

y = sgn(x1 + x2 − 1.5).

For the OR function, a possible perceptron model is:

y = sgn(x1 + x2 − 0.5).

(b) Comment on the disadvantage of using linear functions as activation functions for multilayer neural networks.

Answer: Multilayer neural networks are useful for modeling nonlinear relationships between the input and output attributes. However, if linear functions are used as activation functions (instead of sigmoid or hyperbolic tangent functions), the output is still a linear combination of its input attributes. Such a network is just as expressive as a perceptron.

17. You are asked to evaluate the performance of two classification models, M1 and M2.
The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 5.5 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown.) As this is a two-class problem, P(−) = 1 − P(+) and P(−|A,...,Z) = 1 − P(+|A,...,Z). Assume that we are mostly interested in detecting instances from the positive class.

Table 5.5. Posterior probabilities for Exercise 17.

Instance  True Class  P(+|A,...,Z, M1)  P(+|A,...,Z, M2)
   1          +            0.73              0.61
   2          +            0.69              0.03
   3          −            0.44              0.68
   4          −            0.55              0.31
   5          +            0.67              0.45
   6          +            0.47              0.09
   7          −            0.08              0.38
   8          −            0.15              0.05
   9          +            0.45              0.01
  10          −            0.35              0.04

(a) Plot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.

Answer: The ROC curves for M1 and M2 are shown in Figure 5.5 [TPR plotted against FPR for both models]. M1 is better, since its area under the ROC curve is larger than the area under the ROC curve for M2.

(b) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instance whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.

Answer: When t = 0.5, the confusion matrix for M1 is shown below.

             Predicted +   Predicted −
Actual +          3             2
Actual −          1             4

Precision = 3/4 = 75%.
Recall = 3/5 = 60%.
F-measure = (2 × 0.75 × 0.6)/(0.75 + 0.6) = 0.667.

(c) Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?

Answer: When t = 0.5, the confusion matrix for M2 is shown below.

             Predicted +   Predicted −
Actual +          1             4
Actual −          1             4

Precision = 1/2 = 50%.
Recall = 1/5 = 20%.
F-measure = (2 × 0.5 × 0.2)/(0.5 + 0.2) = 0.2857.

Based on the F-measure, M1 is still better than M2. This result is consistent with the ROC plot.

(d) Repeat part (c) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?

Answer: When t = 0.1, the confusion matrix for M1 is shown below.

             Predicted +   Predicted −
Actual +          5             0
Actual −          4             1

Precision = 5/9 = 55.6%.
Recall = 5/5 = 100%.
F-measure = (2 × 0.556 × 1)/(0.556 + 1) = 0.715.

According to the F-measure, t = 0.1 is better than t = 0.5. However, when t = 0.1, FPR = 0.8 and TPR = 1, while when t = 0.5, FPR = 0.2 and TPR = 0.6. Since (0.2, 0.6) is closer to the point (0, 1), we favor t = 0.5. This result is inconsistent with the results using the F-measure. We can also show this by computing the area under the ROC curve:

For t = 0.5, area = 0.6 × (1 − 0.2) = 0.6 × 0.8 = 0.48.
For t = 0.1, area = 1 × (1 − 0.8) = 1 × 0.2 = 0.2.

Since the area for t = 0.5 is larger than the area for t = 0.1, we prefer t = 0.5.

18. Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "−". Each attribute can take three different values: 0, 1, or 2.

X  Y | Number of +  Number of −
0  0 |      0           100
1  0 |      0             0
2  0 |      0           100
0  1 |     10           100
1  1 |     10             0
2  1 |     10           100
0  2 |      0           100
1  2 |      0             0
2  2 |      0           100

The concept for the "+" class is Y = 1 and the concept for the "−" class is X = 0 ∨ X = 2.

(a) Build a decision tree on the data set. Does the tree capture the "+" and "−" concepts?

Answer: There are 30 positive and 600 negative examples in the data. Therefore, at the root node, the error rate is

E_orig = 1 − max(30/630, 600/630) = 30/630.

If we split on X, the gain in error rate is:

        +     −
X=0    10   300
X=1    10     0
X=2    10   300

E_{X=0} = 10/310, E_{X=1} = 0, E_{X=2} = 10/310,
Δ_X = E_orig − (310/630) E_{X=0} − (10/630) E_{X=1} − (310/630) E_{X=2} = 30/630 − 20/630 = 10/630.
If we split on Y, the class distributions of the child nodes are:

              Y = 0    Y = 1    Y = 2
    +            0       30        0
    −          200      200      200

with E_{Y=0} = 0, E_{Y=1} = 30/230, and E_{Y=2} = 0, so the gain in error rate is

    Δ_Y = E_orig − (230/630)(30/230) = 30/630 − 30/630 = 0.

Therefore, X is chosen to be the first splitting attribute. Since the X = 1 child node is pure, it does not require further splitting. We may use attribute Y to split the impure nodes, X = 0 and X = 2, as follows:

    • The Y = 0 and Y = 2 child nodes each contain 100 − instances.
    • The Y = 1 child nodes each contain 100 − and 10 + instances.

In all three cases for Y, the child nodes are labeled as −. The resulting concept is

    class = +, if X = 1; −, otherwise.

(b) What are the accuracy, precision, recall, and F1-measure of the decision tree? (Note that precision, recall, and F1-measure are defined with respect to the "+" class.)

Answer: The confusion matrix on the training data is:

                 Predicted +   Predicted −
    Actual +        10             20
    Actual −         0            600

Accuracy = 610/630 = 0.9683. Precision = 10/10 = 1.0. Recall = 10/30 = 0.3333. F1-measure = (2 × 0.3333 × 1.0)/(1.0 + 0.3333) = 0.5.

(c) Build a new decision tree with the following cost function:

    C(i, j) = 0,                                                    if i = j;
              1,                                                    if i = +, j = −;
              (number of − instances)/(number of + instances),      if i = −, j = +;

where i is the predicted class and j is the actual class. (Hint: only the leaves of the old decision tree need to be changed.) Does the new decision tree capture the "+" concept?

Answer: The cost matrix can be summarized as follows:

                 Predicted +   Predicted −
    Actual +         0          600/30 = 20
    Actual −         1              0

The decision tree in part (a) has 7 leaf nodes: X = 1, X = 0 ∧ Y = 0, X = 0 ∧ Y = 1, X = 0 ∧ Y = 2, X = 2 ∧ Y = 0, X = 2 ∧ Y = 1, and X = 2 ∧ Y = 2. Only X = 0 ∧ Y = 1 and X = 2 ∧ Y = 1 are impure nodes. The cost of labeling each of these impure nodes as the positive class is

    10 × 0 + 100 × 1 = 100,

while the cost of labeling it as the negative class is

    10 × 20 + 100 × 0 = 200.

These nodes are therefore labeled as +. The resulting concept is

    class = +, if X = 1 ∨ (X = 0 ∧ Y = 1) ∨ (X = 2 ∧ Y = 1); −, otherwise.
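The split-selection arithmetic in Exercise 18(a) can be reproduced with a few lines of code. This is a sketch of my own, not taken from the solutions manual; the class counts are copied from the data table, and the function names are mine.

```python
# Compute the gain in error rate for splitting on X (index 0) or Y (index 1).
from collections import defaultdict

# counts[(x, y)] = (number of '+' instances, number of '-' instances)
counts = {
    (0, 0): (0, 100),  (1, 0): (0, 0),   (2, 0): (0, 100),
    (0, 1): (10, 100), (1, 1): (10, 0),  (2, 1): (10, 100),
    (0, 2): (0, 100),  (1, 2): (0, 0),   (2, 2): (0, 100),
}

def error_rate(pos, neg):
    """Classification error of a node labeled with its majority class."""
    n = pos + neg
    return 0.0 if n == 0 else 1 - max(pos, neg) / n

def split_gain(attr_index):
    """Reduction in weighted error from splitting on one attribute."""
    total_pos = sum(p for p, n in counts.values())   # 30
    total_neg = sum(n for p, n in counts.values())   # 600
    total = total_pos + total_neg                    # 630
    e_orig = error_rate(total_pos, total_neg)        # 30/630
    by_value = defaultdict(lambda: [0, 0])
    for key, (p, n) in counts.items():
        by_value[key[attr_index]][0] += p
        by_value[key[attr_index]][1] += n
    weighted = sum((p + n) / total * error_rate(p, n)
                   for p, n in by_value.values())
    return e_orig - weighted

print(split_gain(0))  # gain for X: 10/630, about 0.0159
print(split_gain(1))  # gain for Y: about 0.0
```

Since the gain for X is positive and the gain for Y is zero, the tree splits on X first, matching the hand computation of Δ_X and Δ_Y.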
(d) What are the accuracy, precision, recall, and F1-measure of the new decision tree?

Answer: The confusion matrix of the new tree is:

                 Predicted +   Predicted −
    Actual +        30              0
    Actual −       200            400

Accuracy = 430/630 = 0.6825. Precision = 30/230 = 0.1304. Recall = 30/30 = 1.0. F1-measure = (2 × 0.1304 × 1.0)/(1.0 + 0.1304) = 0.2307.

19. (a) Consider the cost matrix for a two-class problem. Let C(+, +) = C(−, −) = p, C(+, −) = C(−, +) = q, and q > p. Show that minimizing the cost function is equivalent to maximizing the classifier's accuracy.

Answer: Let the confusion matrix be

                 Predicted +   Predicted −
    Actual +         a              c
    Actual −         b              d

with cost matrix C(+, +) = C(−, −) = p and C(+, −) = C(−, +) = q. The total cost is F = p(a + d) + q(b + c). Since acc = (a + d)/N, where N = a + b + c + d, we may write

    F = N[acc × p + (1 − acc) × q] = N[acc(p − q) + q].

Because p − q is negative, minimizing the total cost is equivalent to maximizing accuracy.

(b) Show that a cost matrix is scale-invariant. For example, if the cost matrix is rescaled from C(i, j) → βC(i, j), where β is the scaling factor, the decision threshold (Equation 5...
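The identity F = N[acc(p − q) + q] in Exercise 19(a) can be sanity-checked numerically. The sketch below is my own; the cost values and the two confusion matrices are made-up examples, with a and d counting correct predictions and b and c counting errors.

```python
# Numeric check of Exercise 19(a): with symmetric costs p (correct) and
# q (incorrect), q > p, total cost F = p*(a+d) + q*(b+c) = N*(acc*(p-q) + q),
# so the classifier with higher accuracy always has lower total cost.
p, q = 1.0, 5.0  # assumed example costs with q > p

def cost_and_acc(a, b, c, d):
    """Total cost and accuracy; a, d are correct counts, b, c are errors."""
    n = a + b + c + d
    cost = p * (a + d) + q * (b + c)
    acc = (a + d) / n
    # the identity derived in the answer above
    assert abs(cost - n * (acc * (p - q) + q)) < 1e-9
    return cost, acc

# two hypothetical classifiers evaluated on N = 100 instances
c1, acc1 = cost_and_acc(40, 10, 5, 45)   # 85 correct, 15 errors
c2, acc2 = cost_and_acc(30, 20, 15, 35)  # 65 correct, 35 errors
print(acc1 > acc2, c1 < c2)  # True True: higher accuracy, lower cost
```

Because p − q < 0, the term acc(p − q) decreases as accuracy increases, which is exactly why cost minimization and accuracy maximization coincide for this symmetric cost matrix.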

Explanation & Answer

Attached. Please let me know if you have any questions or need revisions.

Running head: DATA MINING APPLICATIONS IN THE HEALTHCARE SECTOR

Data Mining Applications in the Healthcare Sector: Addressing Health Promotion
Name:
Institutional Affiliation:
Date:



Table of Contents

Data Mining Applications in the Healthcare Sector: Addressing Health Promotion
Introduction
References


Data Mining Applications in the Healthcare Sector: Addressing Health Promotion
Introduction
Over the years, numerous technological developments have been made. These changes have allowed many organizations and stakeholders to respond more closely to emerging consumer needs, and they continue to reshape how communities live. Data mining is one of the most widely used of these technologies, applied to analyze and extract meaning from vast volumes of data (Islam, Hasan, Wang & Germack, 2018). The healthcare sector generates diverse records relating to the health outcomes of a given population, and healthcare stakeholders need the right technologies and solutions to focus more on disease prevention and health promotion. Understanding the patterns within a given community, condition, or illness is essential for healthcare stakeholders because it informs their decisions.
However, predicting diseases and conditions has long been a major challenge in the healthcare sector, making it hard to prepare for health events and build resilience. With the developments witnessed today, healthcare stakeholders can create an effective framework for handling community health issues. Data mining is a promising technology that combines frameworks for collecting, analyzing, and visualizing information in readiness for decision making.
This technology has been applied in a wide range of areas, including the healthcare sector
and business environments. When applied in the healthcare sector, practitioners and stakeholders
...

