Apply two data mining algorithms to the same dataset using R

Anonymous
Asked: Apr 25th, 2017

Question description

Find a dataset and apply TWO data mining algorithms to it.

1- The dataset should contain at least 1000 rows, and you should state the problem in this data that you are trying to solve.

2- The chosen algorithms should be one for clustering and one for classification. Clarify the purpose of using each of them (what changes in the dataset after applying the algorithm).

3- The classification algorithm should be a decision tree (DTree).

4- You have to apply the second algorithm to the result of the first algorithm.

5- The program you build for the dataset should be different from what is available online.

6- Submit a full report with an explanation and output screenshots, plus the R script file.

7- Don't forget to mention the URL of the dataset (an example of the report is attached).
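
The requirement to apply the second algorithm to the result of the first can be sketched as follows. This is only an illustration under stated assumptions: the data is simulated (the columns x1 and x2 are hypothetical, not from any real dataset), k-means stands in for the clustering algorithm, and rpart provides the decision tree. The clustering step adds a cluster label to the data, and the tree is then trained to predict that label.

```r
# Sketch only: simulated data stands in for a real dataset (hypothetical columns x1, x2).
library(rpart)  # decision trees; a recommended package shipped with standard R installs

set.seed(42)
n <- 1200  # at least 1000 rows, as the assignment requires
dat <- data.frame(
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 5)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 5))
)

# Step 1: clustering. k-means adds a new column (the cluster label) to the data.
km <- kmeans(dat[, c("x1", "x2")], centers = 2, nstart = 10)
dat$cluster <- factor(km$cluster)

# Step 2: classification. The decision tree is trained on the clustering result,
# learning rules that reproduce the cluster labels from the original attributes.
tree <- rpart(cluster ~ x1 + x2, data = dat, method = "class")

# Compare the tree's predictions with the cluster labels.
pred <- predict(tree, dat, type = "class")
accuracy <- mean(pred == dat$cluster)
print(table(predicted = pred, cluster = dat$cluster))
```

With a real dataset, the simulated data frame would be replaced by the loaded file, and the number of clusters would need to be chosen for that data.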

Airlines Active in the Data Mining

Abstract: In the data mining project for the course GCIS 544, the well-known Airlines Active data set was examined. The Airlines Active data set was derived from a simple hierarchical decision model. The major purpose of analyzing and visualizing the data set is to determine the factors that make an airline active (Y) or not active (N). The strategy used is classification, and we use the attributes of the data set, namely AirlineID, Name, Alias, IATA, ICAO, Callsign, and Country, to determine which data is task relevant. Our decision is built from those attributes. The goal of this project is to judge, through the Active attribute, whether the Airlines Active data set is Y or N. The Active attribute is the class label that the decision tree predicts. In more detail, if Y occurs more often than the other values, the data set is judged active.

Keywords: Airlines Active data set, frequent pattern, classification (decision tree).

I. INTRODUCTION

Nowadays a huge amount of data is collected and stored in databases across the globe. One of these is the Airlines Active data set. Creating and maintaining this database has required, and continues to require, an immense amount of work. The OpenFlights Airlines Database contains 5888 airlines. In this project we test the Airlines Active data set with R to find whether it is active or not active. Several strategies are tried, and it is determined which outcome, Y (airline active) or N (airline not active), is the correct judgment on this data set. If a high percentage of the Active attribute values are "Y", the data set is judged active.

II. THE PROBLEM

The purpose of this project is to analyze and visualize the Airlines Active data set with data mining algorithms, namely Frequent Pattern [11] and Classification [12] (Decision Tree), in order to understand the data set in more detail. After analysis and visualization, a judgment can be made on the data set. The clearer the analysis, the easier it is to understand the results.

We work with the Frequent Pattern and Classification algorithms on this data set for the following reasons. In data mining, frequent pattern mining (FPM) is one of the most intensively investigated problems in terms of computational and algorithmic development. The Frequent Pattern algorithm shows how many times each attribute value occurs, and from those counts a judgment can be made on the data set. Classification consists of predicting a certain outcome based on a given input. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the algorithm is given a data set it has not seen before, called a prediction set, which contains the same attributes except for the prediction attribute, which is not yet known. A clustering algorithm divides the set of entities into groups, called clusters; here the groups are formed from the attributes by using the Decision Tree. R code that draws the tree is used, for example rpart.plot() [1].

III. FIND OUT

Using Frequent Pattern and Classification (Decision Tree) in R is a straightforward way to carry out simple analyses that are common for the Airlines Active data set.

IV. ALGORITHMS

A. Algorithms explanation [2]

The important algorithms used in this project are Frequent Pattern, shown as Fig. 1 [9], and Classification, shown as Fig. 2, 3, and 4. Using table() in R gives how many times each attribute value occurs, and from those counts a judgment can be made on the data set.
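
The counting that table() performs can be illustrated with a small sketch. The data frame below is a hypothetical stand-in for the airlines data (the six rows and their values are made up for illustration); only the column names Country and Active follow the report.

```r
# Hypothetical stand-in for the airlines data; values are made up for illustration.
airlines <- data.frame(
  Country = c("United States", "Mexico", "United States",
              "Canada", "Mexico", "United States"),
  Active  = c("Y", "N", "N", "Y", "N", "N"),
  stringsAsFactors = TRUE
)

# One-way frequency: how many times each Active value occurs.
active_counts <- table(airlines$Active)
print(active_counts)

# Relative frequency of each Active value.
print(prop.table(active_counts))

# Two-way frequency: Active counts within each Country.
country_by_active <- with(airlines, table(Country, Active))
print(country_by_active)

# The most frequent Active value, which is what the report's judgment is based on.
most_frequent <- names(which.max(active_counts))
print(most_frequent)
```

A one-way table supports the kind of judgment described above (which Active value dominates), while the two-way form matches the with(data, table(...)) calls used later in the report.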
The key algorithms for frequent pattern mining are explored. These include join-based methods, such as Apriori, and pattern-growth methods. A clustering algorithm is also used to divide the set of entities into groups, called clusters, formed from the attributes by using the Decision Tree [5]. In R, code that draws the tree is used, for example rpart.plot() [9].

Figure 1. E.g. A generic frequent pattern mining algorithm
Figure 2. E.g. The lattice of item sets in the frequent pattern
Figure 3. E.g. Illustrating Classification Task
Figure 4. E.g. From Decision Trees to Rules

B. Key Concepts

i) Frequent Pattern [9]
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold

ii) Classification [10]
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set

V. IMPLEMENTATION

▪ R program packages

To start the analysis and visualization, R must be downloaded, because the R programming language is used for statistics, analysis, and visualization, and here it runs the Frequent Pattern and Classification algorithms. Some packages must also be installed, shown as Table 1 [7], because some of the code does not produce output without them. The following packages and libraries must be installed and loaded:
1) install.packages("rpart.plot")
2) library('rpart')
3) library('rpart.plot')

▪ Implementation
TABLE 1. PACKAGES AND LIBRARIES

i. DATA SET DESCRIPTION

▪ In a simple description of the Airlines Active data set:
• The Airlines Active data set was derived from a simple decision model
• The Airlines Active data set contains the input attributes AirlineID, Name, Alias, IATA, ICAO, Callsign, and Country

VI. ATTRIBUTE INFORMATION

▪ The data set consists of 6 attributes and 6049 records, shown as Table 2 and 3

1. Active values:
• "Y" if the airline is or has until recently been operational
• "N" if it is defunct

2. Attributes:
• Airline ID: Unique OpenFlights identifier for this airline.
• Name: Name of the airline.
• Alias: Alias of the airline. For example, All India Airways is commonly known as "IA".
• IATA: 2-letter IATA code, if available.
• ICAO: 3-letter ICAO code, if available.
• Callsign: Airline callsign.
• Country: Country or territory where the airline is incorporated.

▪ Implementation
TABLE 2. DATA SET BY EXCEL PROGRAM
TABLE 3. DATA SET BY R PROGRAM

3. Plot the data set, shown as Fig. 5 [7]

Figure 5. Plot the data set

ii. DATA SUMMARY

a) Data summary, shown as Fig. 6
▪ Implementation
Figure 6. Summary data

b) Select Attributes

i. After reading and summarizing the data set, the attributes are chosen and the Frequent Pattern analysis is done on them, as shown in Fig. 7 and 8. Frequent patterns are item sets, subsequences, or substructures that appear in the data set. We use two R functions:
• table()
• barplot()

Figure 7. Frequent patterns of the Active attribute
Figure 8. Plot of the selected attribute (Active)

ii. The frequency of pairs of attributes is found using Frequent Pattern, shown as Fig. 8, 9 and 10, with the R code:
• with(data, table(AirlineID, Active))
• with(data, table(AirlineID, Name))
• with(data, table(AirlineID, Alias))
• with(data, table(AirlineID, IATA))
• with(data, table(AirlineID, ICAO))
• with(data, table(AirlineID, Callsign))
• with(data, table(AirlineID, Country))

iii. Implementation
• Data frequency plot, shown as Fig. 9:
  Frequent <- with(data, table(AirlineID, Active))
  barplot(Frequent, col=rainbow(7), beside=TRUE, legend=TRUE, main="Frequency of AirlineID and Active")

Figure 9. Plot of the frequency of two different attributes

iv. Classification: Classification is a data mining technique used to predict group membership for data instances, shown as Fig. 10. Two R functions are used for the classification:
• rpart()
• rpart.plot()

Figure 10. Classification of the Airline Active Data Set

v. Decision Tree by R programming, shown as Fig. 11 [6]
• The attributes are: AirlineID, Name, Alias, IATA, ICAO, Callsign, Country
• The Active attribute is the class label (n, Y and N)
• summary(Tree1)

Figure 11. Summary of Tree1

• Display the cp table for the fitted rpart object with this R code, shown as Fig. 12:
  printcp(Tree1)

Figure 12. The cp table for the fitted rpart object of the Airlines Active data set

Because the class attribute takes the values Y or N, the cp table reports a root node error of 1162/6048 = 0.19213. In rpart this is the share of records outside the majority class, so about 81% of the records belong to one class. That majority class is N: most airlines are not active, shown as Fig. 13 and 14 [8].

Figure 13. Classification tree
Figure 14. printcp of Tree1

VII. CHALLENGES

Working with R was a big challenge, especially finding the correct packages to use in this project. The problem had to be separated into parts and worked on in parallel, each part with correct R code. One of the challenges in implementing the Classification algorithm on the Airlines Active data set was finding correct code. Choosing the right attributes was also a challenge.

VIII. CONCLUSION

The Airlines Active data set was used to predict whether an airline is active or not active. In this project, the following were learned:
• How to install R
• How to install R packages
• How to work with data mining functions in R

IX. FUTURE WORKS

It was exciting to work with this data set. In the future, this data set could be extended by:
• Adding different attributes
• Using more data mining functions
• Testing this data set with test code in R

X. ACKNOWLEDGMENT

I would like to thank my teacher, Prof. Sreela Sasi, for providing the information in Data Mining Concepts and Techniques, GCIS 544. In her class I learned a great deal, especially about R. I would also like to thank my family for their support in furthering my studies.

XI. REFERENCES

[1] M. Zaki and W. Meira Jr., "Data Mining and Analysis," 1st ed., 2014, p. 266.
[2] M. Zaki and W. Meira Jr., "Frequent Pattern Mining (FPM) and Classification (CLASS): Decision Trees (Video Lecture by the author)," (2015). Available: http://www.dataminingbook.info/uploads/videos/lecture12/. [Accessed: 17-Apr-2015].
[3] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Min Knowl Disc. (2007) 15:55–86.
[4] M. Bohanec, B. Zupan, "Airline Active Data Set," (2012). Available: http://openflights.org/data.html. [Accessed: 04-May-2015].
[5] M. Chapple, "Classification." Available: http://databases.about.com/od/datamining/g/classification.htm. [Accessed: 01-Apr-2015].
[6] "Decision Trees." Available: http://www.rdatamining.com/examples/decision-tree. [Accessed: 02-May-2015].
[7] W. King, "R Tutorials." Available: http://ww2.coastal.edu/kingw/statistics/R-tutorials/ [Accessed: 02-May-2015].
[8] OpenFlights Airlines, "Airline Dataset," (1987 to 2008). Available: https://www.datadr.org/doc/airline.html. [Accessed: 04-May-2015].
[9] C. Aggarwal, J. Han, "Frequent Pattern Mining," (2014). Available: http://www.charuaggarwal.net/freqbook.pdf. [Accessed: 01-May-2015].
[10] Tan, Steinbach, Kumar, "Data Mining Classification: Alternative Techniques," (2004). Available: http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap5_alternative_classification.pdf. [Accessed: 01-May-2015].
[11] J. Han, M. Kamber, J. Pei, "Data Mining Concepts and Techniques," (2013). Available: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fweb.engr.illinois.edu%2F~hanj%2Fcs412%2Fbk3_slides%2F06FPBasic.ppt&ei=LSZGVaj8O4qXNtitgXg&usg=AFQjCNES11J9IO6_Y3IHrJS67LKdkBiBMQ. [Accessed: 01-May-2015].
[12] J. Han, M. Kamber, J. Pei, "Data Mining Concepts and Techniques," (2011). Available: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCQQFjAB&url=http%3A%2F%2Fweb.engr.illinois.edu%2F~hanj%2Fcs412%2Fbk3_slides%2F08ClassBasic.ppt&ei=_idGVfDRKYz1gwTW74GoDw&usg=AFQjCNHkfp8z2Z6JoURezM3ZCHY8_AvEvQ. [Accessed: 01-May-2015].
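
The classification step in section VI relies on rpart() and printcp(). A minimal sketch of that workflow is given here under stated assumptions: the data is simulated, and the predictors HasIATA and Region, along with the probabilities used to generate Active, are invented for illustration and are not OpenFlights columns. The sketch also recomputes by hand the root node error that printcp() prints, which appears to be the same kind of ratio as the 1162/6048 = 0.19213 quoted in the report.

```r
# Sketch only: simulated data with hypothetical predictors (HasIATA, Region).
library(rpart)

set.seed(1)
n <- 1000
demo <- data.frame(
  HasIATA = factor(sample(c("yes", "no"), n, replace = TRUE)),
  Region  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Make Active depend on HasIATA so the tree has a pattern to learn.
p_active <- ifelse(demo$HasIATA == "yes", 0.7, 0.05)
demo$Active <- factor(ifelse(runif(n) < p_active, "Y", "N"))

# Fit a classification tree with Active as the class label.
Tree1 <- rpart(Active ~ HasIATA + Region, data = demo, method = "class")

# Print the cp table; its header includes the root node error.
printcp(Tree1)

# Root node error by hand: records outside the majority class / all records.
class_counts <- table(demo$Active)
root_error <- 1 - max(class_counts) / sum(class_counts)
print(root_error)

# Optional plot, only if the rpart.plot package is installed:
# rpart.plot::rpart.plot(Tree1)
```

A root node error below 0.5 in a two-class problem means one class holds the majority of the records; the class counts show which one.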