WekaClassify.java
=================

Version 0.4

Copyright (C) 2001-2004

Ted Pedersen, tpederse@umn.edu
University of Minnesota, Duluth

Satanjeev Banerjee, satanjeev@cmu.edu
Carnegie Mellon University

http://www.d.umn.edu/~tpederse/sensetools.html

0. Important note on versions:
------------------------------

WekaClassify version 0.4 compiles with Weka 3.4 and can be used with models generated by Weka 3.4, but not with earlier versions of Weka. For earlier versions, use WekaClassify version 0.3.

1. Introduction:
----------------

WekaClassify is a Java program that is part of the SenseTools package, a suite of programs that support the application of machine learning techniques to the problem of word sense disambiguation. While WekaClassify was developed by Satanjeev Banerjee as a part of that package, we have found it generally useful and are distributing it separately as well, in the hope that it will be useful to users of the machine learning suite Weka, regardless of their application area.

WekaClassify carries out classification based on a model previously learned by Weka, and produces output in which each possible answer is "scored" according to whatever criteria the learned model uses (confidence scores, probabilities, etc.). Thus, WekaClassify requires as input an ARFF file that represents the data to be classified, and a machine learning model that was learned by Weka from training examples.

2. Details
----------

This Java program takes a test/evaluation file in ARFF format and classifies the instances in the file using the stored representation of a model learned by Weka from a set of training data.

2.1. The Model:
---------------

WekaClassify uses a previously learned (and stored) model to classify instances in the given test ARFF file.
This model has to be created by training some classifier (such as a decision tree, a neural network, a naive Bayesian classifier, etc.) on training data consisting of instances of the same target word as that in the test ARFF file. Moreover, the model has to be created using Weka. One can create such a model in Weka by using the -d switch (see Weka's [1] help and documentation for more information). A model saved this way contains information about the type of classifier used as well as other facts "learned" from the training data. WekaClassify uses this model to classify the instances in the test file. The model file is passed to WekaClassify using the -d option.

Classification is done by calling the Weka library routines. Hence Weka must be available on the Java CLASSPATH for this program to run properly.

2.2. Other Command-line Options:
--------------------------------

Other command-line options include the -t switch to specify the test file and the -p switch to specify the level of precision required. By default, values are output to 4 decimal places.

2.3. The Output of WekaClassify:
--------------------------------

For each instance in the test file, WekaClassify outputs a probability distribution over all the possible 'class' values. As expected, these values range from 0 (implying that the instance has zero probability of belonging to the class) to 1 (implying that the instance belongs to the class). For SENSEVAL test files, "class" values are the possible senseid values that the target word in the test data may assume. This output is in the answer file format required by SENSEVAL (and used by the scorer.python program).

Following is the format of the output of WekaClassify:

...
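
As a rough illustration of how a probability distribution and a precision setting could turn into one line of output, the following Java sketch assembles a line in the layout seen in the sample output of this section (target word, instance id, then sense/probability pairs). The class name AnswerLineSketch, the method formatLine, and the fixed-width rounding are assumptions made for illustration; they are not taken from the WekaClassify source, which may format numbers differently (for instance by trimming trailing zeros).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class AnswerLineSketch {

    // Hypothetical helper: builds "<target> <instanceId> <sense>/<prob> ..."
    // with each probability rounded to 'precision' decimal places.
    public static String formatLine(String target, String instanceId,
                                    Map<String, Double> distribution,
                                    int precision) {
        StringBuilder sb = new StringBuilder(target).append(' ').append(instanceId);
        for (Map.Entry<String, Double> e : distribution.entrySet()) {
            // Locale.ROOT keeps '.' as the decimal separator on every system.
            String prob = String.format(Locale.ROOT, "%." + precision + "f", e.getValue());
            sb.append(' ').append(e.getKey()).append('/').append(prob);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order in which the senses are listed.
        Map<String, Double> dist = new LinkedHashMap<>();
        dist.put("art_gallery~1:06:00::", 0.25);
        dist.put("fine_art~1:06:00::", 0.25);
        dist.put("art~1:06:00::", 0.5);
        // Precision 4 mirrors the documented default of the -p switch.
        System.out.println(formatLine("art.n", "art.40002", dist, 4));
    }
}
```

Under these assumptions, running the sketch prints the three senses with probabilities 0.2500, 0.2500 and 0.5000 on a single line; passing a different precision changes only the number of digits after the decimal point.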

For example, suppose for our art.n.xml.arff file above we find the following:

  art.n art.40001 art_gallery~1:06:00::/0.0 fine_art~1:06:00::/0.0 art~1:06:00::/1.0
  art.n art.40002 art_gallery~1:06:00::/0.25 fine_art~1:06:00::/0.25 art~1:06:00::/0.50

For art.40001, this tells us that the first two senses have zero probability of occurring, while the third sense has 100 percent probability of occurring.

3. Copying:
-----------

This suite of programs is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file GPL.txt that you should have received with this distribution.

4. Acknowledgments:
-------------------

This work has been partially supported by a National Science Foundation Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid of Research, Artistry and Scholarship from the Office of the Vice President for Research and the Dean of the Graduate School of the University of Minnesota.

5. References:
--------------

1. Weka 3 - Machine Learning Software in Java. World Wide Web site: http://www.cs.waikato.ac.nz/ml/weka/.