The variable selection problem is defined as: Selecting a subset of original variables from the large set of input variables with the goal that the subset of variables predict best the target concept or function. Practically, the produced subset is very small that reduces the training costs and helps to prevent overfitting. Hence variable selection is a necessary aspect of inductive learning.
A varitey of approaches have been proposed to tackle the problem of variable selection. These approaches can be categorized as: filter methods and wrapper methods. Filter methods perform variable selection independently of the learning algorithm, while wrappers make learner-dependent selections.
In filter methods, the quality of variable subsets is evaluated using statistical measures. The set of variables that is best with respect to some specific quality measure is selected. The advantage with filter methods is that statistical measures require vey less computational cost as compared to runnig the algorithm. However, the disadvantage with these methods is that variables are not evaluated in the context of learning problem, hence we may get incorrect results in some cases.
Wrapper methods uses the performance of the learner to evaluate subset quality. Each candidate variable set is evaluated by executing the learning algorithm on the subset and then the accruacy of the hypothesis is tested. Wrapper methods use accuracy of the hypothesis to measure the quality of the subset of variables. The problem with wrapper methods is that there is a risk of running the learning algorithm several times. However, it has been observed that wrapper methods outperform filter emthods.
The variable selection problem is categorized in two forms of algorithm based on how the variable are searched to form the subset. A forward selection algorithm starts with an empty set and goes on adding to the set as it finds relevant variables. Contrarily, a backward elmination algorithm starts with a set of all variables and goes on removing variables as it discovers irrelevant variables. The advantage with forward selection algorithm is that the size of subsets always remains relatively small. The advantage of backward elmination is that it is easy to recognize irrelevant variable. Removing a relevant variable causes an immediate decline in performance while adding a relevant variable might not have immediate effect on the evaluation.