### Unformatted Attachment Preview

v. 1
CSE 142: Machine Learning, Winter 2021
Assignment #1
Due Tuesday, January 26 by 23:59 PT
Notes:
§ This assignment is to be done individually. You may discuss the problems at a general level with others in the class (e.g., about the concepts underlying the question, or what lecture or reading material may be relevant), but the work you turn in must be solely your own.
§ Be sure to re-read the “Policy on Academic Integrity” on the course website.
§ Be aware of the late policy in the course syllabus, so turn in what you have by the due time.
§ Justify every answer you give – show the work that achieves the answer or explain your response.
§ Any updates or corrections will be posted on the Assignments page (of the course website), so check there occasionally.
To turn in your assignment:
- Submission through Gradescope.
- Clearly indicate which question each part of your submission belongs to when you submit. If you don't do this correctly, the grader might see "no content is available".
Problem #1 [5 points]
Consider the problem of an adult learning to speak and understand a foreign language.
Explain how this process fits into the general learning model (Fig. 3 in the textbook) –
i.e., describe the domain objects, training data, model, learning algorithm, and output for
this scenario. Discuss what kind(s) of learning takes place.
Problem #2 [6 points]
You are asked to build a machine learning system to estimate someone’s blood pressure
(two numbers: systolic and diastolic; consider them to be real-valued) based on the
following inputs: the patient’s sex, age, weight, average grams of fat consumed per day,
number of servings of red meat per week, servings of fruits and vegetables per day,
smoker or non-smoker. You are given a training data set of values for all of these
variables and the blood pressure numbers for 10,000 patients.
Answer (and explain) the following questions:
(a) What kind of machine learning problem is this?
(b) Is it a predictive task or a descriptive task?
(c) Are you likely to use a geometric model, a probabilistic model, or a logical
model?
(d) Will your model be a grouping model or a grading model?
(e) What is the label space for this problem?
(f) What is the output space for this problem?
Problem #3 [8 points]
We (simplistically) describe a basketball player’s value in terms of the following statistics:

                           Stephen Curry (x1)   James Harden (x2)
Minutes played per game           33.6                36.7
Points scored per game            30.6                27.0
Rebounds per game                  4.8                 4.7
Assists per game                   6.8                11.3
Steals per game                    1.22                1.0
Fouls per game                     2.11                1.67
Turnovers per game                 3.00                3.83
Treating the statistics for each player (x1 and x2) as a feature vector, what is the distance
between them, measured in terms of (a) L1 distance, (b) L2 distance, (c) L10 distance, (d)
L100 distance?
(e) If a constant vector v = [5 5 2 2 0.5 0.1 1]^T is added to both x1 and x2, which (if any) of the L1, L2, L10, or L100 distances will change?
(f) If x1 and x2 are multiplied by a constant k, which (if any) of the L1, L2, or L10 distances will change?
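For parts (a)-(d), a quick way to check your arithmetic is a short script for the general Minkowski (Lp) distance, d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p). The sketch below (Python, not part of the assignment) copies the two feature vectors from the table above; note how the large-p distances approach the maximum coordinate difference.

```python
# Minimal sketch of the Minkowski (Lp) distance between two feature vectors.
def lp_distance(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p)"""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x1 = [33.6, 30.6, 4.8, 6.8, 1.22, 2.11, 3.00]  # Stephen Curry
x2 = [36.7, 27.0, 4.7, 11.3, 1.0, 1.67, 3.83]  # James Harden

for p in (1, 2, 10, 100):
    print(f"L{p} distance: {lp_distance(x1, x2, p):.4f}")
```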
Problem #4 [12 points]
The joint probability distribution of three variables – class, grade, and effort – can be computed from the following table, which shows the number of students in each bin:
                        class = 165B                class = basketweaving
grade    effort=Small   Medium   Large       effort=Small   Medium   Large
A              0           25     100              50          100     150
B             25           50      75              50           50      25
C             25           50      25              50           25       0
D             50           20       5               0            0       0
F             50            0       0               0            0       0
(a) What is the conditional probability distribution P(grade | class, effort)?
(b) What is the marginal probability distribution P(grade, effort)?
(c) What is the marginal probability distribution P(effort)?
(d) What is P(grade=A | class)?
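If you want to sanity-check your distributions, the sketch below (Python, not part of the assignment) builds the joint distribution P(class, grade, effort) by normalising the counts above, then marginalises it; the nested-dictionary layout is just one convenient encoding of the table.

```python
# Count table from the problem, indexed as rows[class][grade] = [Small, Medium, Large].
grades = ["A", "B", "C", "D", "F"]
efforts = ["Small", "Medium", "Large"]
rows = {
    "165B":          {"A": [0, 25, 100], "B": [25, 50, 75], "C": [25, 50, 25],
                      "D": [50, 20, 5],  "F": [50, 0, 0]},
    "basketweaving": {"A": [50, 100, 150], "B": [50, 50, 25], "C": [50, 25, 0],
                      "D": [0, 0, 0],      "F": [0, 0, 0]},
}
total = sum(n for cls in rows.values() for counts in cls.values() for n in counts)

# Joint distribution P(class, grade, effort) = count / total.
joint = {(c, g, e): rows[c][g][i] / total
         for c in rows for g in grades for i, e in enumerate(efforts)}

# (b) Marginal P(grade, effort): sum the joint over class.
p_grade_effort = {(g, e): sum(joint[c, g, e] for c in rows)
                  for g in grades for e in efforts}

# (c) Marginal P(effort): sum P(grade, effort) over grade.
p_effort = {e: sum(p_grade_effort[g, e] for g in grades) for e in efforts}
print(p_effort)  # the three values should sum to 1
```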
Problem #5 [12 points]
There are 100,000 emails used to train a spam detection system – 5,000 of them are spam and the rest are non-spam. To test the system, you have 10,000 emails – 2,000 spam and 8,000 non-spam – in your test set.
The results of the test are as follows: 250 of the spam emails are classified as non-spam,
and the rest are classified as spam; 250 of the non-spam emails are classified as spam,
and the rest are classified as non-spam.
(a) Show the contingency table for this binary classification experiment. Label it
clearly and fill out the table entries.
(b) What is the false positive rate of the system in this experiment?
(c) What is the false negative rate?
(d) What is the error rate?
(e) What is the precision?
(f) What is the accuracy?
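Once the contingency table in part (a) is filled in, parts (b)-(f) follow from simple ratios. The sketch below (Python, not part of the assignment) treats spam as the positive class, which the problem does not state explicitly but is the usual convention for spam detection; the four cell counts follow directly from the numbers given.

```python
# Contingency-table cells for the test set (spam = positive class, an assumption).
TP, FN = 2000 - 250, 250   # spam classified correctly / as non-spam
FP, TN = 250, 8000 - 250   # non-spam classified as spam / correctly
total = TP + FP + FN + TN  # 10,000 test emails

print("false positive rate:", FP / (FP + TN))  # negatives wrongly flagged
print("false negative rate:", FN / (FN + TP))  # positives missed
print("error rate:", (FP + FN) / total)
print("precision:", TP / (TP + FP))
print("accuracy:", (TP + TN) / total)
```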
Problem #6 [8 points]
A ranking classifier ranks 25 training examples {xi}, from highest to lowest rank, in the
following order:
Highest → Lowest:
x2, x3, x1, x5, x13, x6, x8, x7, x9, x10, x12, x11, x15, x4, x14, x21, x17, x20, x18, x22, x16, x19, x25, x23, x24
Examples x1 through x12 are in the positive class (which should be ranked higher);
examples x13 through x25 are in the negative class (which should be ranked lower).
(a) How many ranking errors are there?
(b) What is the ranking error rate?
(c) What is the ranking accuracy?
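A ranking error is a pair of one positive and one negative example in which the negative is ranked above the positive, so parts (a)-(c) reduce to counting pairs. The sketch below (Python, not part of the assignment) copies the ranking from the problem and counts misordered pairs out of the 12 x 13 total.

```python
# Ranking from the problem, highest rank first (numbers are the example indices).
ranking = [2, 3, 1, 5, 13, 6, 8, 7, 9, 10, 12, 11, 15, 4, 14,
           21, 17, 20, 18, 22, 16, 19, 25, 23, 24]
positive = set(range(1, 13))   # x1..x12: positive class
negative = set(range(13, 26))  # x13..x25: negative class

# Count one error for every positive example ranked below a negative one.
errors = sum(1 for i, neg in enumerate(ranking) if neg in negative
               for pos in ranking[i + 1:] if pos in positive)
pairs = len(positive) * len(negative)  # 12 * 13 = 156 pos-neg pairs
print(errors, errors / pairs, 1 - errors / pairs)  # (a), (b), (c)
```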
Problem #7 [14 points]
[Figure not reproduced here: two-feature training data with six proposed linear classifiers, C1 through C6, drawn as dotted lines; see the description below.]
The figure below shows training data with two features, with each example labeled as
being in the positive (filled-in points) or negative (open points) class. Proposed linear
discriminant functions (C1 through C6) are shown as dotted lines, each one indicating a
different classifier for this data. Each classifier classifies points to the upper-right of its
dotted line as positive and points to the lower-left of its dotted line as negative.
(a) Draw the coverage plot for this data and plot the different classifiers (and label
them as C1, C2, etc.).
(b) Draw the ROC plot and label the classifiers on the plot.
(c) Which classifiers have the highest and lowest accuracy?
(d) Which classifiers have the highest and lowest precision?
(e) Which classifiers have the highest and lowest recall?
(f) Which classifiers (if any) are complete?
(g) Which classifiers (if any) are consistent?
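Since the figure is not reproduced here, the sketch below (Python) only illustrates the bookkeeping behind parts (a) and (b): given each classifier's true-positive and false-positive counts, it produces the coverage-plot point (raw counts) and the ROC point (rates). The counts used are hypothetical placeholders, not values read off the missing figure; substitute the ones you count from the plot.

```python
# HYPOTHETICAL counts for illustration only; read the real ones off the figure.
POS, NEG = 10, 10                                    # total positive / negative examples
counts = {"C1": (2, 0), "C2": (5, 1), "C3": (8, 3)}  # name: (TP, FP); likewise for C4-C6

for name, (tp, fp) in counts.items():
    # The coverage plot uses raw counts; the ROC plot normalises them to rates.
    print(f"{name}: coverage (FP={fp}, TP={tp}), "
          f"ROC (FPR={fp / NEG:.2f}, TPR={tp / POS:.2f})")
```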
MACHINE LEARNING
The Art and Science of Algorithms
that Make Sense of Data
As one of the most comprehensive machine learning texts around, this book does justice to the field’s incredible richness, but without losing sight of the unifying principles.
Peter Flach’s clear, example-based approach begins by discussing how a spam filter works, which gives an immediate introduction to machine learning in action, with a minimum of technical fuss. He covers a wide range of logical, geometric and statistical models, and state-of-the-art topics such as matrix factorisation and ROC analysis. Particular attention is paid to the central role played by features.
Machine Learning will set a new standard as an introductory textbook:
- The Prologue and Chapter 1 are freely available on-line, providing an accessible first step into machine learning.
- The use of established terminology is balanced with the introduction of new and useful concepts.
- Well-chosen examples and illustrations form an integral part of the text.
- Boxes summarise relevant background material and provide pointers for revision.
- Each chapter concludes with a summary and suggestions for further reading.
- A list of ‘Important points to remember’ is included at the back of the book together with an extensive index to help readers navigate through the material.
MACHINE LEARNING
The Art and Science of Algorithms
that Make Sense of Data
PETER FLACH
cambridge university press
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Mexico City
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9781107096394
© Peter Flach 2012
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2012
Printed and bound in the United Kingdom by the MPG Books Group
A catalogue record for this publication is available from the British Library
ISBN 978-1-107-09639-4 Hardback
ISBN 978-1-107-42222-3 Paperback
Additional resources for this publication at www.cs.bris.ac.uk/home/flach/mlbook
Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to in
this publication, and does not guarantee that any content on such websites is,
or will remain, accurate or appropriate.
To Hessel Flach (1923–2006)
Brief Contents

Preface  xv
Prologue: A machine learning sampler  1
1  The ingredients of machine learning  13
2  Binary classification and related tasks  49
3  Beyond binary classification  81
4  Concept learning  104
5  Tree models  129
6  Rule models  157
7  Linear models  194
8  Distance-based models  231
9  Probabilistic models  262
10  Features  298
11  Model ensembles  330
12  Machine learning experiments  343
Epilogue: Where to go from here  360
Important points to remember  363
References  367
Index  383
Contents

Preface  xv
Prologue: A machine learning sampler  1
1  The ingredients of machine learning  13
   1.1  Tasks: the problems that can be solved with machine learning  14
        Looking for structure  16
        Evaluating performance on a task  18
   1.2  Models: the output of machine learning  20
        Geometric models  21
        Probabilistic models  25
        Logical models  32
        Grouping and grading  36
   1.3  Features: the workhorses of machine learning  38
        Two uses of features  40
        Feature construction and transformation  41
        Interaction between features  44
   1.4  Summary and outlook  46
        What you’ll find in the rest of the book  48
2  Binary classification and related tasks  49
   2.1  Classification  52
        Assessing classification performance  53
        Visualising classification performance  58
   2.2  Scoring and ranking  61
        Assessing and visualising ranking performance  63
        Turning rankers into classifiers  69
   2.3  Class probability estimation  72
        Assessing class probability estimates  73
        Turning rankers into class probability estimators  76
   2.4  Binary classification and related tasks: Summary and further reading  79
3  Beyond binary classification  81
   3.1  Handling more than two classes  81
        Multi-class classification  82
        Multi-class scores and probabilities  86
   3.2  Regression  91
   3.3  Unsupervised and descriptive learning  95
        Predictive and descriptive clustering  96
        Other descriptive models  100
   3.4  Beyond binary classification: Summary and further reading  102
4  Concept learning  104
   4.1  The hypothesis space  106
        Least general generalisation  108
        Internal disjunction  110
   4.2  Paths through the hypothesis space  112
        Most general consistent hypotheses  116
        Closed concepts  116
   4.3  Beyond conjunctive concepts  119
        Using first-order logic  122
   4.4  Learnability  124
   4.5  Concept learning: Summary and further reading  127
5  Tree models  129
   5.1  Decision trees  133
   5.2  Ranking and probability estimation trees  138
        Sensitivity to skewed class distributions  143
   5.3  Tree learning as variance reduction  148
        Regression trees  148
        Clustering trees  152
   5.4  Tree models: Summary and further reading  155
6  Rule models  157
   6.1  Learning ordered rule lists  158
        Rule lists for ranking and probability estimation  164
   6.2  Learning unordered rule sets  167
        Rule sets for ranking and probability estimation  173
        A closer look at rule overlap  174
   6.3  Descriptive rule learning  176
        Rule learning for subgroup discovery  178
        Association rule mining  182
   6.4  First-order rule learning  189
   6.5  Rule models: Summary and further reading  192
7  Linear models  194
   7.1  The least-squares method  196
        Multivariate linear regression  201
        Regularised regression  204
        Using least-squares regression for classification  205
   7.2  The perceptron  207
   7.3  Support vector machines  211
        Soft margin SVM  216
   7.4  Obtaining probabilities from linear classifiers  219
   7.5  Going beyond linearity with kernel methods  224
   7.6  Linear models: Summary and further reading  228
8  Distance-based models  231
   8.1  So many roads...  231
   8.2  Neighbours and exemplars  237
   8.3  Nearest-neighbour classification  242
   8.4  Distance-based clustering  245
        K-means algorithm  247
        Clustering around medoids  250
        Silhouettes  252
   8.5  Hierarchical clustering  253
   8.6  From kernels to distances  258
   8.7  Distance-based models: Summary and further reading  260
9  Probabilistic models  262
   9.1  The normal distribution and its geometric interpretations  266
   9.2  Probabilistic models for categorical data  273
        Using a naive Bayes model for classification  275
        Training a naive Bayes model  279
   9.3  Discriminative learning by optimising conditional likelihood  282
   9.4  Probabilistic models with hidden variables  286
        Expectation-Maximisation  288
        Gaussian mixture models  289
   9.5  Compression-based models  292
   9.6  Probabilistic models: Summary and further reading  295
10  Features  298
   10.1  Kinds of feature  299
        Calculations on features  299
        Categorical, ordinal and quantitative features  304
        Structured features  305
   10.2  Feature transformations  307
        Thresholding and discretisation  308
        Normalisation and calibration  314
        Incomplete features  321
   10.3  Feature construction and selection  322
        Matrix transformations and decompositions  324
   10.4  Features: Summary and further reading  327
11  Model ensembles  330
   11.1  Bagging and random forests  331
   11.2  Boosting  334
        Boosted rule learning  337
   11.3  Mapping the ensemble landscape  338
        Bias, variance and margins  338
        Other ensemble methods  339
        Meta-learning  340
   11.4  Model ensembles: Summary and further reading  341
12  Machine learning experiments  343
   12.1  What to measure  344
   12.2  How to measure it  348
   12.3  How to interpret it  351
        Interpretation of results over multiple data sets  354
   12.4  Machine learning experiments: Summary and further reading  357
Epilogue: Where to go from here  360
Important points to remember  363
References  367
Index  383
Preface
This book started life in the Summer of 2008, when my employer, the University of
Bristol, awarded me a one-year research fellowship. I decided to embark on writing
a general introduction to machine learning, for two reasons. One was that there was
scope for such a book, to complement the many mor ...