Predictive model using Weka tool

Computer Science

American Institute of Technology - NV

Question Description

I'm working on a machine learning question and need guidance to help me understand better.

I need help with an assignment. The assignment is about creating a simple predictive model using the Weka tool. I will provide relevant csv files.

Unformatted Attachment Preview

Overview

Attached Files:
• HBS_otus.csv (3.889 MB)
• HS_otus.csv (671.278 KB)
• HSS_otus.csv (2.165 MB)

Predictive Models for Microbiome Data

The datasets used for this project are taken from a large number of microbiome data analysis experiments. Three microbiome datasets are provided, each representing the abundance of microbial communities on different body sites or subjects. Operational taxonomic units (OTUs) constitute the features, and the body sites or subjects constitute the classes. Body sites (or subjects) are described by nominal attributes (target classes), while the OTUs are provided as numeric attributes (features). The objective of this project is to build predictive models from these datasets (one or more models for each dataset separately) that can reliably estimate the body site (or subject) from which the microbiome samples were taken. This is considered a general computational biology project.

Background

Recent DNA sequencing technologies provide high-dimensional data from human-associated microbial communities. Identifying important groups of microorganisms in the host is a major goal of these studies. Machine learning is an appropriate approach for microarray analysis. Supervised classifiers can be applied effectively to microbiota data both to select subsets of taxa that are highly discriminative of the type of community (feature ranking and selection) and to build models that can accurately classify unlabeled data (classification models).

For background reading on predicting microbiome communities with machine learning, see the links below.

http://www.epi.msu.edu/seminars/Microbiome-talk.pdf
http://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-010814-020351

Description of Datasets

There are three datasets in CSV format: Human Body Sites (HBS), Human Skin Sites (HSS), and Human Subjects (HS).
Dataset                   Samples   # of OTUs   # of classes
Human Body Sites (HBS)    622       2741        6
Human Skin Sites (HSS)    401       2227        12
Human Subjects (HS)       144       1592        7

The HBS dataset contains 622 samples (instances) with 2741 OTUs (features). The first column of this dataset includes the labels of the body sites (e.g., gut, skin, oral cavity, hair …) from which the microbiome sample was collected, so this is a multi-class classification problem. Please note that hair is relatively under-represented in this dataset. We are also interested in identifying the most discriminative (important) features.

The HSS dataset includes 401 samples (instances) with 2227 OTUs (features). The first column of this dataset includes the labels of the skin sites (e.g., palm, forehead, plantar foot …) from which the microbiome sample was collected, so this is also a multi-class classification problem, with 12 classes. As with the HBS dataset, we are interested in identifying the features with the most discriminative power to separate the different skin sites.

The HS dataset consists of 144 samples (instances) with 1592 OTUs (features). The first column of this dataset includes the ID of the subjects in this study (class). Samples in this study come from heterogeneous time points (June to September of 2012), which makes the problem challenging because an individual's microbiome community varies significantly over time. This multi-class classification problem can be applied to forensic identification. As with both problems above, we are interested in identifying the most discriminative (important) features. For this problem, we are also interested in an unsupervised model that can distinguish 7 clusters (for 7 subjects).

Training and Test Datasets

Use 10-fold cross-validation in Weka and report on the results.

Dataset Description

Column 1: Includes the sample IDs. This column must be removed when you perform the analysis.
Column 2: Includes the label of each sample (class). Classification will be assessed according to the labels in this column.
Columns 3 - N: Include the OTUs, or features. Note that the matrix of OTUs is sparse (it contains many zeros).

Deliverable Items

• For the HBS and HSS datasets, choose at least two different algorithms to train (e.g., Decision Tree, Random Forest). You can choose more than two algorithms. Choose algorithms that are different from each other and discuss why you chose them. For each algorithm, train a model on the full set of features in the dataset. Then train a model on a reduced subset of the features. The reduced feature set should be the same across algorithms. Report the features that are in the reduced feature set, and report which method you used to reduce the number of features. Compare and contrast the performance of your different models on the full and reduced feature sets. Include plots and tables (which can be created outside of Weka). Discuss why you think you got your results (was it due to the algorithm, the reduced feature set, or something else?).

• For the HS dataset, apply a clustering approach and generate 7 clusters (you need to instruct Weka to generate 7 clusters). Under "Cluster mode", select "Classes to clusters evaluation." The algorithm will create 7 clusters, and ideally these 7 clusters will be split across your 7 classes. How accurate was this approach? Now choose a classification algorithm and see how accurate that method is. How do the results compare? You do not need to do feature reduction (but you could for the classification algorithm if you wanted to).

Include all output results from Weka as text files attached to your assignment submission. Use a separate text file for each model-dataset run and name the files with the pattern "output_dataset_algorithm_featureset.txt" (e.g., output_hbs_randomforest_full.txt, output_hbs_randomforest_reduced.txt, etc.).
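As a rough illustration of what an accuracy figure means for the HS clustering task, the sketch below maps each cluster to its most common class and counts the samples that agree with that mapping. This majority-class mapping is a simplification chosen for illustration, not Weka's own procedure, so the number Weka reports under "Classes to clusters evaluation" may differ. The function name and the toy cluster and subject labels are hypothetical and are not taken from the HS dataset.

```python
from collections import Counter, defaultdict


def classes_to_clusters_accuracy(cluster_ids, class_labels):
    """Map each cluster to its majority class; return (accuracy, mapping).

    Simplified illustration only: this is an assumption about how such a
    score can be computed, not a reimplementation of Weka's evaluation.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Count how often each class label occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    # Assign every cluster the class that occurs most often in it.
    mapping = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in members.items()
    }

    # A sample counts as correct when its label equals its cluster's class.
    correct = sum(counts[mapping[cluster]] for cluster, counts in members.items())
    return correct / len(class_labels), mapping


# Hypothetical toy data: 6 samples, 2 clusters, 2 subjects.
clusters = [0, 0, 0, 1, 1, 1]
subjects = ["S1", "S1", "S2", "S2", "S2", "S2"]

accuracy, mapping = classes_to_clusters_accuracy(clusters, subjects)
print(f"accuracy={accuracy:.3f} mapping={mapping}")
```

Because several clusters can end up mapped to the same class under this scheme, the score tends to be optimistic; a stricter evaluation would require each cluster to map to a distinct subject.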
Here is an example of analyses that could be run:

Dataset   Algorithm Type   Algorithm       Features   Performance
HBS       Classification   One Rule        Full       .##
HBS       Classification   One Rule        Reduced    .##
HBS       Classification   Decision Tree   Full       .##
HBS       Classification   Decision Tree   Reduced    .##
HBS       Classification   Random Forest   Full       .##
HBS       Classification   Random Forest   Reduced    .##
HSS       Classification   One Rule        Full       .##
HSS       Classification   One Rule        Reduced    .##
HSS       Classification   Decision Tree   Full       .##
HSS       Classification   Decision Tree   Reduced    .##
HSS       Classification   Random Forest   Full       .##
HSS       Classification   Random Forest   Reduced    .##
HS        Clustering       K-means         Full       .##
HS        Classification   Naïve Bayes     Full       .##
HS        Classification   Naïve Bayes     Reduced    .##

Write your final project report (3000-4000 words) with the following main sections: Introduction, Methods, Results, Discussion. The subsections enumerated below are intended to guide the construction of the project report and may be augmented with other details as appropriate. Please refer to the course outline for the grading rubric.

• Introduction (25 points): a brief summary of the importance and objective of the experiments.

• Methods (50 points):
1. Any preprocessing or feature selection steps used, if applicable;
2. Data mining algorithms used and, if applicable, any non-default optional parameter values;
3. Be sure to discuss how your algorithms work. Why did you choose them?
4. Evaluation procedures: assessment of the "goodness" of the models and of the predictions made using these models. Why did you choose your evaluation metric(s)?

• Results (75 points):
1. How many features were in the full and reduced feature sets?
2. How did the algorithms perform?
3. Any visualizations that might be useful to summarize the results
4. Any tables that might be useful to summarize the results

• Discussion (50 points):
1. Any conclusions drawn from the experimental process;
2. Limitations of the data and associated information;
3. Comments on how you would extend this work if additional resources were available.
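When justifying the choice of evaluation metric, it can help to see how overall accuracy and a per-class average can disagree when one class is rare, as hair is in the HBS dataset. The sketch below computes both from a confusion matrix of the kind Weka prints with its results. All counts are invented for illustration and are not results from any of the datasets, and the function name is hypothetical.

```python
def accuracy_and_macro_recall(confusion):
    """Return (accuracy, macro-averaged recall).

    `confusion[actual][predicted]` holds the number of samples of class
    `actual` that the model assigned to class `predicted`.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Diagonal entries are the correctly classified samples.
    correct = sum(row.get(label, 0) for label, row in confusion.items())

    # Recall per class: correct predictions divided by actual class size.
    recalls = [
        row.get(label, 0) / sum(row.values())
        for label, row in confusion.items()
        if sum(row.values()) > 0
    ]
    return correct / total, sum(recalls) / len(recalls)


# Hypothetical counts (not real results): "hair" has only 5 samples.
confusion = {
    "gut": {"gut": 48, "skin": 2, "hair": 0},
    "skin": {"gut": 1, "skin": 44, "hair": 0},
    "hair": {"gut": 0, "skin": 4, "hair": 1},
}

acc, macro_recall = accuracy_and_macro_recall(confusion)
print(f"accuracy={acc:.3f} macro_recall={macro_recall:.3f}")
```

In this made-up example the accuracy looks high while the macro-averaged recall is noticeably lower, because the rare class is mostly misclassified. That gap is one possible argument for reporting a per-class metric alongside accuracy.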
Please refer to the Grading Rubric in the Course Resources section for how performance on this project will show mastery of Program Learning Outcome (PLO) #9: Evaluate machine learning methods and strategies for advanced data mining. ...

This question has not been answered.
