HBS_otus.csv (3.889 MB)
HS_otus.csv (671.278 KB)
HSS_otus.csv (2.165 MB)
Predictive Models for Microbiome Data
The datasets used in this project are drawn from a large number of microbiome data analysis
experiments. Three microbiome datasets are provided, each representing the abundance of
microbial communities across different body sites or subjects. Operational Taxonomic
Units (OTUs) constitute the features, and the body sites or subjects constitute the classes in
our study. Body sites (or subjects) are given as nominal attributes (target classes), while
the OTUs are given as numeric attributes (features). The objective of this project is to
build predictive models from these datasets (one or more models for each dataset separately)
that can reliably estimate the body site (or subject) from which the microbiome samples were taken.
This is considered a general computational biology project.
Recent DNA sequencing technologies provide high-dimensional data from human-associated
microbial communities. Identifying important groups of microorganisms in the host is a
major goal of these studies. Machine learning is an appropriate approach for microarray
analysis. Supervised classifiers can be applied effectively to microbiota data both to select
subsets of taxa that are highly discriminative of the type of community (feature ranking and
selection) and to build models that can accurately classify unlabeled data (classification models).
For background reading on predicting microbiome communities using machine
learning, see the links below.
Description of Datasets
There are three datasets in CSV format: Human Body Sites (HBS), Human Skin Sites (HSS), and
Human Subjects (HS).
Dataset                    # of OTUs    # of classes
Human Body Sites (HBS)     2741
Human Skin Sites (HSS)     2227
Human Subjects (HS)        1592         7
The HBS dataset contains 622 samples (instances) with 2741 OTUs (features). The first column of this
dataset holds the body-site labels (e.g., gut, skin, oral cavity, hair …) from which each microbiome
sample was collected, so this is a multi-class classification problem. Please note that hair is
relatively under-represented in this dataset. We are also interested in identifying the most
discriminative features.
The HSS dataset includes 401 samples (instances) with 2227 OTUs (features). The first column of this
dataset holds the labels of the different skin sites (e.g., palm, forehead, plantar foot …) from which
each microbiome sample was collected. Therefore, this is also a multi-class classification problem
with classes. As with the HBS dataset, we are interested in identifying the features with the most
discriminative power to separate different skin sites.
The HS dataset consists of 144 samples (instances) with 1592 OTUs (features). The first column of this
dataset holds the ID of each subject in the study (class). Samples in this study come from
heterogeneous time points (June to September of 2012), which makes the problem challenging because
the microbiome community of an individual varies significantly over time. This multi-class
classification problem can be applied to forensic identification. As with both problems above, we
are interested in identifying the most discriminative (important) features. For this problem, we are
also interested in an unsupervised model that can distinguish 7 clusters (for 7 subjects).
Training and Test Datasets
Use 10-fold cross-validation in Weka. Report on these results.
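As an illustration of what 10-fold cross-validation does with class labels, here is a minimal Python sketch of stratified fold assignment. It is not Weka's implementation; the function names `stratified_folds` and `train_test_splits`, the seed, and the toy label counts are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign each sample index to one of k folds, spreading every class as evenly as possible."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # randomize order within the class
        for idx in members:
            folds[position % k].append(idx)  # deal indices round-robin across folds
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold exactly once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy labels (made up): one class is deliberately rare.
toy_labels = ["gut"] * 30 + ["skin"] * 20 + ["hair"] * 5
folds = stratified_folds(toy_labels, k=10)
splits = list(train_test_splits(folds))
```

With only 5 samples of the rare class and 10 folds, half of the test folds in this toy example contain no sample of that class, which is worth keeping in mind when interpreting per-class results for an under-represented class.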
Column 1: Includes the sample IDs. This column must be removed when you perform the analysis.
Column 2: Includes the label of each sample (class). Classification will be assessed according to the
labels in this column.
Column 3 - N: Includes the OTUs or features. Note that the matrix of OTUs is sparse (so many
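The column layout described above can be sketched in Python as follows. This is only an illustrative sketch outside of Weka: the helper names `load_otu_table` and `sparsity`, the header names, and the toy values are assumptions, and the real files may differ in detail.

```python
import csv
import io

def load_otu_table(handle):
    """Read an OTU CSV: column 1 = sample ID (dropped), column 2 = class label,
    columns 3..N = OTU abundances."""
    reader = csv.reader(handle)
    header = next(reader)
    feature_names = header[2:]
    labels, rows = [], []
    for record in reader:
        if not record:
            continue  # skip blank lines
        labels.append(record[1])
        rows.append([float(value) for value in record[2:]])
    return feature_names, labels, rows

def sparsity(rows):
    """Fraction of entries in the OTU matrix that are exactly zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0

# Toy stand-in for one of the provided CSV files (values are made up).
toy_csv = io.StringIO(
    "sample_id,label,otu_1,otu_2,otu_3\n"
    "s1,gut,0,12,0\n"
    "s2,skin,3,0,0\n"
    "s3,gut,0,7,1\n"
)
feature_names, labels, rows = load_otu_table(toy_csv)
otu_sparsity = sparsity(rows)
```

To use it on a real file, pass an open file handle, for example `open("HBS_otus.csv", newline="")`, in place of the toy `StringIO` object.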
For the HBS and HSS datasets, choose at least two different algorithms to train
(e.g., Decision Tree, Random Forest). You can choose more than 2 algorithms.
Choose algorithms that are different from each other and discuss why you
chose these algorithms. For each algorithm, train a model on the full features in
the data set. Then, train a model on a reduced subset of the features. The
reduced features data set should be consistent across algorithms. Report the
features that are in the reduced features set. Also, report which method you
used to reduce the number of features. Compare and contrast the performance
of your different models on the full and reduced feature sets. Include plots and
tables (which can be created outside of Weka). Discuss why you think you got
your results (was it due to the algorithm, or the reduced feature set, or
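One possible way to build a reduced feature set is a filter that ranks OTUs by information gain. The sketch below is an assumption-laden illustration, not the required method: it binarizes each OTU into present (abundance > 0) or absent, which is a simplification chosen for this example, and the names `information_gain` and `top_k_features` and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Information gain of the class given OTU presence (abundance > 0) versus absence."""
    present = [y for x, y in zip(feature_values, labels) if x > 0]
    absent = [y for x, y in zip(feature_values, labels) if x <= 0]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - remainder

def top_k_features(rows, labels, feature_names, k):
    """Return the names of the k features with the highest information gain."""
    scores = []
    for j, name in enumerate(feature_names):
        column = [row[j] for row in rows]
        scores.append((information_gain(column, labels), name))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties broken by name
    return [name for _, name in scores[:k]]

# Toy data (made up): otu_a and otu_b separate the classes, otu_c does not.
toy_names = ["otu_a", "otu_b", "otu_c"]
toy_rows = [
    [5, 0, 1],
    [9, 0, 0],
    [0, 4, 2],
    [0, 7, 0],
]
toy_labels = ["gut", "gut", "skin", "skin"]
selected = top_k_features(toy_rows, toy_labels, toy_names, k=2)
```

Because the same ranked list is computed once from the data, the resulting reduced feature set can be reused unchanged for every algorithm, which keeps the comparison consistent.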
For HS dataset, perform a clustering approach and generate 7 clusters (you need to
instruct Weka to generate 7 clusters). Also, under "Cluster mode", you need to select
"Classes to clusters evaluation." Your algorithm will create 7 clusters and hopefully these
7 clusters will be split across your 7 classes. How accurate was this approach? Now,
choose a classification algorithm and see how accurate this method is. How do the
results compare? You do not need to do feature reduction (but you could for the
classification algorithm if you wanted to).
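To make the idea of scoring clusters against known classes concrete, here is a small Python sketch that searches for the one-to-one cluster-to-class mapping with the fewest errors. It is a brute-force illustration under the assumption of a one-to-one mapping; Weka's own procedure may differ in detail, and the function name `classes_to_clusters_accuracy` and the toy assignments are hypothetical.

```python
from collections import Counter
from itertools import permutations

def classes_to_clusters_accuracy(cluster_ids, class_labels):
    """Best accuracy over one-to-one mappings from clusters to classes (brute force)."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    counts = Counter(zip(cluster_ids, class_labels))
    # If there are more clusters than classes, extra clusters map to None (no class).
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits, best_map = -1, {}
    for assigned in permutations(padded, len(clusters)):
        hits = sum(counts[(c, k)] for c, k in zip(clusters, assigned))
        if hits > best_hits:
            best_hits, best_map = hits, dict(zip(clusters, assigned))
    return best_hits / len(class_labels), best_map

# Toy example (made up): 3 clusters, 3 classes, 8 samples.
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_classes = ["A", "A", "B", "B", "B", "C", "C", "A"]
acc, mapping = classes_to_clusters_accuracy(toy_clusters, toy_classes)
```

For 7 clusters and 7 classes the search covers 7! = 5040 mappings, which is small enough for brute force; for many more clusters an assignment algorithm would be the better choice.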
Include all output results from Weka as text files that you attach to your assignment
submission. Have a separate text file for each model-dataset run and name the files with
this pattern "output_dataset_algorithm_featureset.txt" (e.g.,
output_hbs_randomforest_full.txt, output_hbs_randomforest_reduced.txt, etc.)
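If the output files are named by script, the pattern above could be generated with a small helper such as the following sketch. The function name `output_filename` is hypothetical, and lowercasing and removing spaces are assumptions inferred from the examples given.

```python
def output_filename(dataset, algorithm, featureset):
    """Build a name following the pattern output_dataset_algorithm_featureset.txt."""
    parts = [p.strip().lower().replace(" ", "") for p in (dataset, algorithm, featureset)]
    return "output_{}_{}_{}.txt".format(*parts)

names = [output_filename("HBS", "Random Forest", fs) for fs in ("full", "reduced")]
```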
Here is an example of analyses that could be run:
Write your final project report (3000-4000 words) that includes the following main sections:
Introduction, Methods, Results, Discussion. The subsections enumerated below are intended to
guide the construction of the project report, and may be augmented with other details as
appropriate. Please refer to the course outline for the grading rubric.
Introduction (25 points): brief summary on the importance and objective of the
Methods (50 points):
1. Any preprocessing or feature selection processing steps used, if
2. Data mining algorithms used, and if applicable, any non-default
optional parameter values;
3. Be sure to discuss how your algorithms work. Why did you
4. Evaluation procedures: assessment of the “goodness” of the
models and the predictions made using these models; Why did
you choose your evaluation metric(s)?
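When weighing evaluation metrics, it can help to see how accuracy and a per-class measure diverge when one class is rare. The Python sketch below is illustrative only: the function names and the toy predictions are assumptions, and macro-averaged F1 is just one reasonable choice among several.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)

def per_class_f1(true_labels, predicted_labels):
    """F1 score for each class, computed from the confusion counts."""
    pairs = Counter(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    scores = {}
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(pairs[(t, c)] for t in classes if t != c)
        fn = sum(pairs[(c, p)] for p in classes if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Toy case (made up): a classifier that always predicts the majority class.
y_true = ["gut"] * 8 + ["hair"] * 2
y_pred = ["gut"] * 10
acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)
```

In this toy case accuracy looks respectable even though the rare class is never predicted, while the macro-averaged F1 is much lower because every class counts equally.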
Results (75 points):
1. How many features were in the full and reduced feature sets?
2. How did the algorithms perform?
3. Any visualizations that might be useful to summarize the results
4. Any tables that might be useful to summarize the results
Discussion (50 points):
1. Any conclusions drawn from the experimental process;
2. Limitations of the data and associated information;
3. Comments on how you would extend this work, if additional
resources were available.
Please refer to the Grading Rubric in the Course Resources section for how
performance on this project will show mastery of Program Learning Outcome (PLO) #9:
Evaluate machine learning methods and strategies for advanced data mining.