Computer Science 8751
Advanced Machine Learning
Homework Assignment 2 (20 points)
Due February 18, 2008

  1. Rich is comparing his learning algorithm to a similar baseline algorithm in order to demonstrate that his algorithm outperforms it. The baseline algorithm can be run in a number of different ways (in the same sense that decision trees often have multiple bells and whistles, such as different types of pruning mechanisms, different ways of measuring "best", etc.). Rich decides to compare his algorithm to the version of the baseline that is most similar to his own. In fact, his algorithm and that version share one similar parameter, C. Rich performs an initial pilot study with some data drawn from the set STOCK_MARKET and finds that his algorithm does particularly well with a particular feature representation, so he decides to use that representation in all of his tests for both algorithms. Rich then performs his test, careful to use only data from STOCK_MARKET that was not used in his pilot study. Rich's test is a single 10-fold cross-validation study on all of the remaining data. Rich estimates the value to be used for C using his algorithm and a cross-validation study on the first training fold. He then uses that value of C for all of the remaining folds and for the other algorithm. Rich examines his results and notices that his F1 score (the harmonic mean of the precision and recall) is best when he sets the threshold to 0.6, so he uses that threshold for both algorithms. Identify the problems with the above methodology and explain why each problem could invalidate Rich's results.
  2. Exercise 5.3 on page 152.
  3. Exercise 3.1 on page 77.
  4. Exercise 3.2 on page 77.
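
Note on problem 1: the F1 score mentioned there is the harmonic mean of precision and recall, computed after a decision threshold turns real-valued scores into positive/negative predictions. The sketch below is only an illustration of that definition. The function name f1_at_threshold, the convention that a score at or above the threshold counts as positive, and the example scores and labels are all assumptions made for this note; none of them come from Rich's study.

```python
def f1_at_threshold(scores, labels, threshold):
    """Return (precision, recall, F1) for a given decision threshold.

    Assumption for illustration: an example is predicted positive
    when its score is greater than or equal to the threshold.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)

    # Guard against division by zero when nothing is predicted
    # positive or when there are no true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up scores and labels, purely for illustration.
scores = [0.9, 0.7, 0.65, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
print(f1_at_threshold(scores, labels, 0.6))
```

Changing the threshold changes which examples count as positive, so precision, recall, and F1 all depend on it.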