Conclusions:

In general, my findings agree with those of the authors of the original paper. ANNs, random forests, and boosted trees perform well on most problems, while decision trees and Naive Bayes usually perform poorly.

My findings differ from the authors' for logistic regression, which performed well in my experiments. Similarly, there are a few data sets where KNN has the best performance.

The authors found that boosted decision trees performed best after calibration. Boosted trees did not perform as well in my experiments since I did not calibrate them.
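For readers unfamiliar with calibration, the following is a minimal sketch of one common approach, isotonic regression fitted with the pool-adjacent-violators algorithm. It is not the code used in these experiments or in the original paper; the function name `isotonic_calibrate` and the held-out scores and labels are hypothetical and chosen only for illustration.

```python
def isotonic_calibrate(scores, labels):
    """Fit a non-decreasing map from raw scores to probabilities (PAV).

    Returns the scores in sorted order and the calibrated probability
    assigned to each of them.
    """
    pairs = sorted(zip(scores, labels))
    # Each block holds [sum of labels, number of examples]; its mean is the
    # calibrated probability for every example in the block.
    blocks = []
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block's mean exceeds the
        # current block's mean, which would violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    calibrated = []
    for label_sum, count in blocks:
        calibrated.extend([label_sum / count] * count)
    return [score for score, _ in pairs], calibrated


# Hypothetical held-out scores from an uncalibrated model and true labels.
held_out_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
held_out_labels = [0, 1, 0, 1, 1, 1]
sorted_scores, calibrated = isotonic_calibrate(held_out_scores, held_out_labels)
print(calibrated)
```

In practice the mapping would be fitted on a held-out calibration set and then applied to the test-set scores, so that the calibration step does not see the data it is evaluated on.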

The differences in my results may be because I did not consider all of the data sets and metrics used by the authors. Omitting some of the metrics may have had a significant impact on the rankings. The methods used to convert nominal attributes to Boolean attributes may also have differed, leading to different rankings.
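To illustrate how the nominal-to-Boolean conversion can vary, here is a minimal sketch with two common variants: one Boolean column per distinct value, or one column fewer (dropping the first value as a reference level). The function `nominal_to_boolean`, its `drop_first` parameter, and the toy records are hypothetical; neither variant is claimed to be the one used in the original paper or in these experiments.

```python
def nominal_to_boolean(rows, nominal_columns, drop_first=False):
    """Expand each nominal column into Boolean indicator columns."""
    # Distinct values of each nominal column, in sorted order.
    categories = {
        col: sorted({row[col] for row in rows}) for col in nominal_columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in categories:
                values = categories[key][1:] if drop_first else categories[key]
                for category in values:
                    new_row[f"{key}={category}"] = (value == category)
            else:
                # Non-nominal attributes are copied through unchanged.
                new_row[key] = value
        encoded.append(new_row)
    return encoded


# Hypothetical toy records with one nominal and one numeric attribute.
rows = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.5},
    {"color": "green", "size": 0.7},
]
full = nominal_to_boolean(rows, ["color"])
reduced = nominal_to_boolean(rows, ["color"], drop_first=True)
print(full[0])
print(reduced[0])
```

The two variants give the learner a different number of input attributes, which is one way the same data set could produce different results for distance-based or linear methods.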

I agree with the authors that there is no single best learning algorithm (Naive Bayes performed best on the HS data set). The choice of algorithm depends on the data set and the performance metric being used.
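As a small illustration of how the metric alone can change a ranking, the sketch below scores two hypothetical models on the same four examples with accuracy and with mean squared error. The predictions are made up for this example and are not results from my experiments; `model_a`, `model_b`, and the helper functions are illustrative names only.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


def mean_squared_error(probs, labels):
    """Average squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


labels = [1, 1, 0, 0]
# Hypothetical model A: always on the right side of 0.5, but never confident.
model_a = [0.55, 0.55, 0.45, 0.45]
# Hypothetical model B: confident and right three times, slightly wrong once.
model_b = [0.95, 0.95, 0.05, 0.55]

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(probs, labels), mean_squared_error(probs, labels))
```

Here accuracy ranks model A first, while squared error ranks model B first, so leaving a metric out of a comparison can change which algorithm comes out on top.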

References:
Rich Caruana, Alexandru Niculescu-Mizil, An empirical comparison of supervised learning algorithms, Proceedings of the 23rd International Conference on Machine Learning, pp. 161-168, June 25-29, 2006, Pittsburgh, Pennsylvania, USA

Rich Caruana, Alexandru Niculescu-Mizil, Data mining in metric space: an empirical analysis of supervised learning performance criteria, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, WA, USA
