Conclusions:
In general, my findings agree with those of the authors of the original paper. ANNs, random forests, and boosted trees perform well on most problems. Decision trees and Naive Bayes usually exhibit poor performance.
My findings differ from the authors' for logistic regression, which performed well in my experiments. Similarly, there are a few data sets where KNN has the best performance.
The authors found that boosted decision trees performed best after calibration. Boosted trees did not perform as well in my experiments, since I did not calibrate them.
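To make the calibration step concrete, here is a minimal sketch of Platt scaling, one of the calibration methods the authors apply to boosted trees. It is not the code used in the paper or in these experiments: the names `fit_platt` and `calibrate`, the learning rate, and the toy scores and labels are assumptions for illustration, and plain gradient descent stands in for the optimizer and smoothed targets of Platt's original method. In practice the sigmoid should be fit on a held-out calibration set, not on the training data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss.

    scores: raw classifier outputs (e.g. boosted-tree margins)
    labels: 0/1 class labels for the same examples
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            # Derivative of log loss with respect to a and b.
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)


# Hypothetical held-out scores and labels (illustrative only).
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
probabilities = [calibrate(s, a, b) for s in scores]
```

The fitted sigmoid is monotonic, so it does not change how examples are ordered; it only changes the predicted probabilities, which affects probability-based metrics rather than ordering-based ones.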
The differences in my results may arise because I did not consider all of the data sets and metrics used by the authors. Omitting some of the metrics may have had a significant impact on the rankings. The methods used to convert nominal attributes to boolean attributes may also have differed, leading to different rankings.
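As an illustration of how the nominal-to-boolean conversion can vary, the sketch below shows one common scheme, in which each nominal attribute is replaced by one boolean attribute per observed value. This is an assumption about one reasonable encoding, not a description of what the authors or I actually used; the function name `one_hot_encode` and the toy rows are hypothetical.

```python
def one_hot_encode(rows, nominal_columns):
    """Replace each nominal column with one boolean column per observed value.

    rows: list of dicts mapping attribute name to value
    nominal_columns: names of the attributes to expand
    """
    # Collect the distinct values of each nominal column, sorted so the
    # generated column order is stable across runs.
    categories = {
        col: sorted({row[col] for row in rows}) for col in nominal_columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in categories:
                # One boolean attribute per category value.
                for category in categories[key]:
                    new_row[f"{key}={category}"] = (value == category)
            else:
                # Non-nominal attributes pass through unchanged.
                new_row[key] = value
        encoded.append(new_row)
    return encoded


# Hypothetical data with one nominal and one numeric attribute.
rows = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.5},
    {"color": "red", "size": 0.3},
]
encoded = one_hot_encode(rows, ["color"])
```

Other choices, such as dropping one of the generated columns or mapping a two-valued attribute to a single boolean, give the learner a different feature set, which is one way the same data set can yield different rankings.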
I agree with the authors that there is no single best learning algorithm (for example, Naive Bayes performed best on the HS data set).
The choice of algorithm depends on the data set and performance metric being used.
References:
Rich Caruana, Alexandru Niculescu-Mizil, An empirical comparison of supervised learning algorithms, Proceedings of the 23rd International Conference on Machine Learning, pp. 161-168, June 25-29, 2006, Pittsburgh, Pennsylvania
Rich Caruana, Alexandru Niculescu-Mizil, Data mining in metric space: an empirical analysis of supervised learning performance criteria, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, WA, USA