ITS 632 University of the Cumberlands Intro to Data Mining Chapter 3 and 4 Questions

Question Description

Help me study for my Computer Science class. I’m stuck and don’t understand.

Here is the doc file attached.

Only edit the answers, which are marked in red, not the questions.

Don't change the questions, but do edit the answers so that they show a lower plagiarism percentage.

Unformatted Attachment Preview

ITS-632 Intro to Data Mining
Dr. Patrick Haney
Dept. of Information Technology & School of Computer and Information Sciences
University of the Cumberlands
Chapter 3 & 4 Assignment
[Dharmin Nareshbhai Patel]

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.

Ans) MATLAB and R have excellent facilities for visualization. Most of the figures in this chapter were created using MATLAB. R is freely available from http://www.r-project.org/.

2. Identify at least two advantages and two disadvantages of using color to visually represent information.

Ans) Advantages: Color makes it much easier to visually distinguish visual elements from one another. For example, three clusters of two-dimensional points are more readily distinguished if the markers representing the points have different colors, rather than only different shapes. Also, figures with color are more interesting to look at.

Disadvantages: Some people are color blind and may not be able to properly interpret a color figure. Grayscale figures can show more detail in some cases. Color can be hard to use properly. For example, a poor color scheme can be garish or can focus attention on unimportant elements.

3. What are the arrangement issues that arise with respect to three-dimensional plots?

Ans) It would have been better to state this more generally as “What are the issues ...” since selection, as well as arrangement, plays a key role in displaying a three-dimensional plot. The key issue for three-dimensional plots is how to display information so that as little information is obscured as possible. If the plot is of a two-dimensional surface, then the choice of a viewpoint is critical.
However, if the plot is in electronic form, then it is sometimes possible to interactively change the viewpoint to get a complete view of the surface. For three-dimensional solids, the situation is even more challenging. Typically, portions of the information must be omitted in order to provide the necessary information. For example, a slice or cross-section of a three-dimensional object is often shown. In some cases, transparency can be used. Again, the ability to change the arrangement of the visual elements interactively can be helpful.

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?

Ans) Simple random sampling is not the best approach since it will eliminate most of the points in sparse regions. It is better to undersample the regions where data objects are too dense while keeping most or all of the data objects from sparse regions.

5. Describe how you would create visualizations to display information that describes the following types of systems.
a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.
b) The distribution of specific plant and animal species around the world for a specific moment in time.
c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.
d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.
Be sure to address the following issues:
• Representation. How will you map objects, attributes, and relationships to visual elements?
• Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed?
Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.
• Selection. How will you handle a large number of attributes and data objects?

Ans) (a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.
The connectivity of the network would best be represented as a graph, with the nodes being routers, gateways, or other communications devices and the links representing the connections. The bandwidth of the connection could be represented by the width of the links. Color could be used to show the percent usage of the links and nodes.

(b) The distribution of specific plant and animal species around the world for a specific moment in time.
The simplest approach is to display each species on a separate map of the world and to shade the regions of the world where the species occurs. If several species are to be shown at once, then icons for each species can be placed on a map of the world.

(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.
The resource usage of each program could be displayed as a bar plot of the three quantities. Since the three quantities would have different scales, a proper scaling of the resources would be necessary for this to work well. For example, resource usage could be displayed as a percentage of the total. Alternatively, we could use three bar plots, one for each type of resource usage. On each of these plots there would be a bar whose height represents the usage of the corresponding program. This approach would not require any scaling. Yet another option would be to display a line plot of each program’s resource usage.
For each program, a line would be constructed by (1) considering processor time, main memory, and disk as different x locations, (2) letting the percentage resource usage of a particular program for the three quantities be the y values associated with the x values, and then (3) drawing a line to connect these three points. Note that an ordering of the three quantities needs to be specified, but is arbitrary. For this approach, the resource usage of all programs could be displayed on the same plot.

(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.
For each gender, the occupation breakdown could be displayed as an array of pie charts, where each row of pie charts indicates a particular level of education and each column indicates a particular year. For convenience, the time gap between each column could be five or ten years. Alternatively, we could order the occupations and then, for each gender, compute the cumulative percent employment for each occupation. If this quantity is plotted for each gender, then the area between two successive lines shows the percentage of employment for this occupation. If a color is associated with each occupation, then the area between each set of lines can also be colored with the color associated with each occupation. A similar way to show the same information would be to use a sequence of stacked bar graphs.

6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.

Ans) A stem and leaf plot shows you the actual distribution of values. On the other hand, a stem and leaf plot becomes rather unwieldy for a large number of values.

7. How might you address the problem that a histogram depends on the number and location of the bins?
Ans) The best approach is to estimate what the actual distribution function of the data looks like using kernel density estimation. This branch of data analysis is relatively well-developed and is more appropriate if the widely available, but simplistic approach of a histogram is not sufficient.

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

Ans) (a) If the line representing the median of the data is in the middle of the box, then the data is symmetrically distributed, at least in terms of the 50% of the data between the first and third quartiles. For the remaining data, the length of the whiskers and outliers is also an indication, although, since these features do not involve as many points, they may be misleading.
(b) Sepal width and length seem to be relatively symmetrically distributed, petal length seems to be rather skewed, and petal width is somewhat skewed.

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.

Ans) For Setosa, sepal length > sepal width > petal length > petal width. For Versicolour and Virginica, sepal length > sepal width and petal length > petal width, but although sepal length > petal length, petal length > sepal width.

10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.

Ans) A great deal of information can be obtained by looking at (1) the box plots for each attribute, and (2) the box plots for a particular attribute across various categories of a second attribute. For example, if we compare the box plots of weight for different categories of age, we would see that weight increases with age.

11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.
Ans) We would expect such a distribution if the three species of Iris can be ordered according to their size, and if petal length and width are both correlated to the size of the plant and each other.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

Ans) There is a relatively flat area in the curves of the empirical CDFs and the percentile plots for both petal length and petal width. This indicates a set of flowers for which these attributes have a relatively uniform value.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

Ans) The fact that the attribute values are ordered.

14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.

Ans) Any set of data for which all combinations of values are unlikely to occur would produce sparse data cubes. This would include sets of continuous attributes where the set of objects described by the attributes doesn’t occupy the entire data space, but only a fraction of it, as well as discrete attributes, where many combinations of values don’t occur. A dense data cube would tend to arise when either almost all combinations of the categories of the underlying attributes occur, or the level of aggregation is high enough so that all combinations are likely to have values. For example, consider a data set that contains the type of traffic accident, as well as its location and date.
The original data cube would be very sparse, but if it is aggregated to have categories consisting of single- or multiple-car accidents, the state of the accident, and the month in which it occurred, then we would obtain a dense data cube.

15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

Ans) Summary statistics that would be of interest are the frequencies with which values or combinations of values, target and otherwise, occur. From these we could derive conditional relationships among various values. In turn, these relationships could be displayed using a graph similar to that used to display Bayesian networks.

16. Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.

Ans) The data cube is shown in the table below. It is a dense cube; only two cells are empty.

17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

Ans) The dimensionality reduction of PCA or SVD can be viewed as a projection of the data onto a reduced set of dimensions. In aggregation, groups of dimensions are combined. In some cases, as when days are aggregated into months or the sales of a product are aggregated by store location, the aggregation can be viewed as a change of scale. In contrast, the dimensionality reduction provided by PCA and SVD does not have such an interpretation. ...
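The idea in answers 14 and 17 that aggregation acts as a change of scale (days into months) can be sketched in a few lines of Python. The records, store names, and the function name `aggregate_by_month` below are hypothetical and chosen only for illustration; the sketch covers aggregation only, not PCA or SVD.

```python
from collections import Counter

# Hypothetical daily sales records: (ISO date, store, units sold).
daily_sales = [
    ("2023-01-03", "A", 4),
    ("2023-01-17", "A", 6),
    ("2023-01-17", "B", 1),
    ("2023-02-02", "B", 5),
    ("2023-02-20", "A", 2),
]


def aggregate_by_month(records):
    """Sum units per (month, store) cell, replacing the day dimension with months."""
    totals = Counter()
    for date, store, units in records:
        month = date[:7]  # "YYYY-MM" prefix of the ISO date
        totals[(month, store)] += units
    return dict(totals)


monthly_sales = aggregate_by_month(daily_sales)
for cell in sorted(monthly_sales):
    print(cell, monthly_sales[cell])
```

At the day level only 5 of the 8 possible (date, store) cells hold a value, while after aggregation all 4 (month, store) cells are filled, which is the sense in which a higher level of aggregation tends to give a denser cube.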

Final Answer

Attached.

ITS-632 Intro to Data Mining
Dr. Patrick Haney
Dept. of Information Technology &
School of Computer and Information Sciences
University of the Cumberlands

Chapter 3 & 4 Assignment

[Dharmin Nareshbhai Patel]

1.

Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many
of the different visualization techniques described in the chapter as possible. The bibliographic
notes and book Web site provide pointers to visualization software.
Two programs that are excellent tools for visualization are MATLAB and R. Most of the figures in
this chapter were created with MATLAB. R can be downloaded for free from http://www.r-project.org/.
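For readers without MATLAB or R at hand, one of the simplest techniques, a histogram, can be sketched with the Python standard library alone. The values below are hypothetical measurements, not taken from any UCI data set, and the function names `histogram` and `render` are illustrative only.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    counts = [0] * num_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


def render(counts):
    """Draw one row of '*' characters per bin."""
    return "\n".join(f"bin {i}: " + "*" * c for i, c in enumerate(counts))


# Hypothetical attribute values, for illustration only.
values = [1.0, 1.5, 1.5, 2.0, 4.0, 4.5, 5.0, 5.0, 6.0]
counts = histogram(values, 5)
print(render(counts))
```

Changing `num_bins` changes the shape of the output, which is the bin-dependence that question 7 of this assignment asks about.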

2.

Identify at least two advantages and two disadvantages of using color to visually represent
information.
Advantages: With the availability of color, it is easier to distinguish the different visual elements.
An example would be three clusters of two-dimensional points. If the markers which represent
the points have various colors, it is easier to differentiate them, compared to if the markers only
have different shapes. Furthermore, color makes the figures more interesting and attractive.

Disadvantages: Color is not suitable for people who are color blind, who may not be able to
interpret a color figure properly. In some cases, grayscale figures can show more detail. Color
can also be difficult to use properly. A poor color scheme, for example, might be garish or might
draw attention to insignificant elements.
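One rough way to check whether a color scheme still works without hue, for example when printed in grayscale, is to compare the brightness of its colors. The sketch below uses the common Rec. 601 luma weights as an approximation of perceived brightness; the two palettes and the function names are hypothetical, and this is not a full model of color blindness.

```python
def luma(rgb):
    """Approximate grayscale brightness (0 to 255) of an (R, G, B) color, Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def min_luma_gap(palette):
    """Smallest brightness difference between any two colors in the palette."""
    values = sorted(luma(color) for color in palette)
    # After sorting, the closest pair is always a pair of neighbours.
    return min(b - a for a, b in zip(values, values[1:]))


# Hypothetical marker colors for three clusters.
hue_only = [(200, 0, 0), (0, 102, 0), (0, 0, 255)]     # red and green have nearly equal brightness
brightness_varied = [(0, 0, 128), (200, 0, 0), (255, 220, 0)]

print(min_luma_gap(hue_only))
print(min_luma_gap(brightness_varied))
```

A palette with a very small gap relies on hue alone, so two of its colors would look almost identical in grayscale; combining color with different marker shapes is one way to hedge against that.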

3.

What are the arrangement issues that arise with respect to three-dimensional plots?

It would be wiser to state this question more generally as “What are the issues”, because selection
matters as much as arrangement when displaying a three-dimensional plot. The main problem for
three-dimensional plots is how to display information so that as little of it as possible is hidden.
The choice of a viewpoint is critical if the plot is of a two-dimensional surface. However, if the
plot is in electronic form, it may be possible to change the viewpoint interactively to see a
complete view of the surface. The situation is even more challenging for three-dimensional solids.
Usually, some parts of the information must be left out so that the necessary information remains
visible. For example, a slice or cross-section of a three-dimensional object is often shown. In some
cases, transparency can be used. Once again, the ability to change the arrangement of the visual
elements interactively is helpful.
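The slice or cross-section idea can be illustrated with a small sketch in plain Python. The solid here is a hypothetical sphere stored as a nested list indexed `solid[z][y][x]`; the grid size, radius, and function names are assumptions made only for the example.

```python
def make_solid(size):
    """Build a size x size x size grid, indexed solid[z][y][x]: 1 inside a sphere, 0 outside."""
    center = (size - 1) / 2
    radius_sq = (size / 2.5) ** 2
    return [
        [
            [
                1 if (x - center) ** 2 + (y - center) ** 2 + (z - center) ** 2 <= radius_sq else 0
                for x in range(size)
            ]
            for y in range(size)
        ]
        for z in range(size)
    ]


def cross_section(solid, z):
    """Return the two-dimensional slice of the solid at height z."""
    return solid[z]


def render(slice_2d):
    """Draw a slice as text: '#' for filled cells, '.' for empty ones."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in slice_2d)


solid = make_solid(7)
print(render(cross_section(solid, 3)))  # slice through the middle of the sphere
```

Each slice omits everything above and below it, which is the trade-off the answer describes: part of the information is left out so that the interior becomes visible.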

4.

Discuss the advantages and di...
