Computer Science 8751
Advanced Machine Learning
Homework Assignment 2 (20 points)
Due February 18, 2008
- Rich is comparing his learning algorithm to a similar baseline algorithm
to try to demonstrate that his algorithm outperforms the baseline.
The baseline algorithm can be run in a number of different ways (in the
same sense that decision trees often have multiple bells and whistles,
such as different pruning mechanisms, different ways of measuring "best",
etc.).
Rich decides to compare his algorithm to the version of the baseline that
is most similar to his own. In fact, his algorithm and that version share
one similar parameter, C.
Rich performs an initial pilot study with some data drawn from the set
STOCK_MARKET and finds that his algorithm does particularly well when using
a particular feature representation, so he decides to use that representation
in all of his tests for both algorithms.
Rich then performs his test, careful to use only data from STOCK_MARKET
that was not used in his pilot study.
Rich's test is a single 10-fold cross-validation study on all of the
remaining data.
Rich estimates the value to be used for C by running a cross-validation
study with his algorithm on the first training fold. He then uses that
value of C for all of the remaining folds and for the other algorithm.
Rich examines his results and notices that his F1 score (the harmonic mean
of the precision and recall) is best when he sets the threshold to 0.6, so he
uses that threshold for both algorithms.
Identify the problems with the above methodology and explain why each
problem could invalidate Rich's results.
- Exercise 5.3 on page 152.
- Exercise 3.1 on page 77.
- Exercise 3.2 on page 77.
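
Note on the F1 score mentioned in the first problem: as a reminder of the
definition only, the sketch below computes precision, recall, and their
harmonic mean at a given threshold. The function name f1_at_threshold, the
convention that a score at or above the threshold counts as a positive
prediction, and the toy scores and labels are all assumptions made for
illustration; none of them come from Rich's study.

```python
def f1_at_threshold(scores, labels, threshold):
    """Return (precision, recall, F1) for a given decision threshold.

    Assumption: an example is predicted positive when score >= threshold.
    """
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1          # true positive
        elif predicted_positive and not label:
            fp += 1          # false positive
        elif not predicted_positive and label:
            fn += 1          # false negative
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall: 2PR / (P + R).
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up toy data, for illustration only.
scores = [0.9, 0.7, 0.65, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
precision, recall, f1 = f1_at_threshold(scores, labels, 0.6)
print(precision, recall, f1)
```

Because F1 depends on the threshold, the same scores can give a different
F1 at a different threshold value.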