Wednesday, March 26, 2014

On the Dangers of Cross-Validation. An Experimental Evaluation (Rao et al., SDM 2008)

This paper discusses the use of k-fold cross-validation to estimate the generalization error of a learning algorithm. One reason for wanting such an estimate is to select which of several algorithms to use in an experiment or application. There have been a number of studies on how to estimate the generalization of a model and how to decide whether one model is better than another. For example, Thomas Dietterich (1998) recommends 5x2-fold cross-validation when an algorithm can be executed 10 times, and McNemar's test otherwise. The authors show that when the set of candidate algorithms becomes large, cross-validation is no longer a good measure of generalization performance, and accordingly can no longer be relied on for algorithm or feature selection.

For their methodology, they first define $\Omega(\mathcal{A})$, the amount by which cross-validation under-estimates the true error, as: $$\Omega(\mathcal{A}) = \text{true err} - \text{CV err} = \epsilon(\mathcal{A}(S)) - \hat{\epsilon}^{\mathcal{A}}_{CV}(S)$$ for a learning algorithm $\mathcal{A}$ and training set $S$, where $\epsilon(\mathcal{A}(S))$ is the true generalization error of the learned hypothesis over the distribution $\mathcal{D}$ from which the training instances are drawn. Cross-validation is also used to select from a set of learning algorithms; the quantity of interest is then the under-estimate for the selected algorithm $\mathcal{A}^*$: $$\Omega^*=\Omega(\mathcal{A}^*).$$
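Spelling out the selection step (my reading of the setup, with $\mathcal{A}_1,\dots,\mathcal{A}_M$ the candidate algorithms): the algorithm with the lowest cross-validation error estimate is chosen, and $\Omega^*$ measures how optimistic that chosen estimate turns out to be: $$\mathcal{A}^* = \arg\min_{\mathcal{A}\in\{\mathcal{A}_1,\dots,\mathcal{A}_M\}} \hat{\epsilon}^{\mathcal{A}}_{CV}(S), \qquad \Omega^* = \epsilon(\mathcal{A}^*(S)) - \hat{\epsilon}^{\mathcal{A}^*}_{CV}(S).$$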

To make the computation feasible, the authors use a family of regularized least-squares methods (ridge regression) for classification. Following Cawley and Talbot (2003), they compute the leave-one-out cross-validation error in closed form. To generate a large number of learning algorithms, they randomly assign $K_{dim}$ radial basis function centers for the RBF kernel used by the ridge regression. They run experiments on synthetic and benchmark data sets.
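A minimal numpy sketch of the standard closed-form (PRESS-style) leave-one-out shortcut for ridge regression on RBF features; the paper's exact formulation from Cawley and Talbot may differ in details such as the bias term, and `gamma`, `lam`, and the helper names here are illustrative choices rather than the authors':

```python
import numpy as np

def rbf_features(X, centres, gamma=1.0):
    # Map each x to [k(x, c_1), ..., k(x, c_K)] for a set of RBF centres.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def loo_error_ridge(Phi, y, lam=1e-2):
    """Closed-form leave-one-out error of ridge regression used as a classifier.

    PRESS identity: with hat matrix H = Phi (Phi'Phi + lam*I)^{-1} Phi',
    the LOO residual at point i is (y_i - yhat_i) / (1 - H_ii), so the LOO
    prediction at i is y_i minus that residual. Labels y are in {-1, +1}.
    """
    n, k = Phi.shape
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T)
    yhat = H @ y
    loo_pred = y - (y - yhat) / (1.0 - np.diag(H))
    return np.mean(np.sign(loo_pred) != y)
```

The point of the closed form is that the LOO error of one model costs a single matrix solve instead of $n$ refits, which is what makes it feasible to score millions of candidate models.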

For the synthetic data sets, each input vector $x_i$ is sampled from $[-1,1]^d \subset \Re^d$ according to the distribution $\mathcal{D}$. A training set $S$ is created by drawing $n$ samples from $\mathcal{D}$. The corresponding class label $y_i$ is then randomly assigned 1 or -1 independent of $x_i$, so the true error of every hypothesis on $\mathcal{D}$ is 0.5. Using $M$ models, the results (best CV accuracy among the $M$ candidates, in %) are summarized as:
| $M$ | $10^1$ | $10^2$ | $10^3$ | $10^4$ | $10^5$ | $10^6$ |
|---|---|---|---|---|---|---|
| Best CV accuracy (%) | 61.9 | 69.0 | 74.8 | 79.6 | 83.2 | 85.6 |
| Expected accuracy (%) | 50 | 50 | 50 | 50 | 50 | 50 |
As can be seen, the best CV accuracy grossly over-estimates performance: the selected model looks far better than chance even though no model can actually beat 50% on this data.
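A small simulation in the same spirit, reusing the `rbf_features` and `loo_error_ridge` helpers sketched above (the sample size, dimensionality, number of centres, and uniform input distribution are illustrative choices, not necessarily the paper's): draw labels at random, generate $M$ random RBF-centre models, and keep the best LOO accuracy. The selected accuracy should climb with $M$, mirroring the table above, even though every model's true accuracy is 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k_dim = 40, 20, 10                       # small n, high d: the regime the paper flags as worst

X = rng.uniform(-1.0, 1.0, size=(n, d))        # inputs from [-1, 1]^d
y = rng.choice([-1.0, 1.0], size=n)            # labels independent of X -> true error 0.5

for M in (10, 100, 1000, 10_000):
    best_acc = 0.0
    for _ in range(M):
        centres = rng.uniform(-1.0, 1.0, size=(k_dim, d))   # one random "learning algorithm"
        acc = 1.0 - loo_error_ridge(rbf_features(X, centres), y)
        best_acc = max(best_acc, acc)
    print(f"M={M:>6}  best LOO accuracy={best_acc:.3f}  (true accuracy 0.5)")
```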

The authors also examine how overfitting depends on the size and dimensionality of the data set. The CV over-estimate of accuracy is most pronounced for small data sets with high dimensionality. They further show that feature selection becomes unreliable when a large number of features are evaluated using cross-validation; allowing access to more and more features dramatically overfits the training set. The results are similar for the benchmark and real-world data sets.
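To illustrate how CV-driven feature selection inherits the same problem, here is a hypothetical greedy forward-selection loop (not the paper's exact protocol; it again reuses `loo_error_ridge` from the sketch above): each step adds whichever feature most lowers the LOO error, on data whose labels carry no signal. The reported LOO accuracy tends to climb as features are added even though the true accuracy stays at 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 200                                  # many candidate features, few samples
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)             # labels independent of X

selected = []
for _ in range(10):
    remaining = [j for j in range(d) if j not in selected]
    # Greedily add the feature whose inclusion gives the lowest LOO error.
    best_j = min(remaining, key=lambda j: loo_error_ridge(X[:, selected + [j]], y))
    selected.append(best_j)
    acc = 1.0 - loo_error_ridge(X[:, selected], y)
    print(f"{len(selected):2d} features selected, LOO accuracy {acc:.3f} (true accuracy 0.5)")
```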

It is generally accepted that increasing model complexity increases the likelihood of overfitting the data. The same applies to overfitting in "cross-validation space": when the number of learning algorithms considered is large, cross-validation ceases to be an effective estimate of generalization accuracy. In general, you want to use as much data as possible for training, which makes LOOCV attractive. But you also want enough data for testing, since the size of the test set affects the quality of the estimate of a model's predictive accuracy. In LOOCV only one data point is used for testing at a time, which yields a high-variance estimate; larger test sets reduce that variance. This makes Dietterich's 5x2 CV approach seem very reasonable, as it avoids the high variance of very small test sets.
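For concreteness, a sketch of Dietterich's 5x2cv paired t-test as I understand it (the estimator arguments and the scikit-learn helpers are my choices for illustration): run 2-fold cross-validation five times, obtain two error-difference estimates per replication, and compare the resulting statistic against a t distribution with 5 degrees of freedom.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def five_by_two_cv_t(clf_a, clf_b, X, y, seed=0):
    """Dietterich's 5x2cv paired t-test statistic for two classifiers."""
    variances = []
    first_diff = None
    for r in range(5):
        # One replication: split the data in half, use each half once for training.
        X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=seed + r)
        p = []
        for Xtr, ytr, Xte, yte in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
            err_a = np.mean(clone(clf_a).fit(Xtr, ytr).predict(Xte) != yte)
            err_b = np.mean(clone(clf_b).fit(Xtr, ytr).predict(Xte) != yte)
            p.append(err_a - err_b)
        p_bar = (p[0] + p[1]) / 2.0
        variances.append((p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2)
        if r == 0:
            first_diff = p[0]
    # Under H0 (equal error rates) this statistic follows a t distribution with 5 dof.
    return first_diff / np.sqrt(np.mean(variances))
```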

Whenever possible, a held-out test set should be used that is only examined after a model has been selected. Care must also be taken not to keep tuning parameters based on performance on that test set, since it would then no longer simulate real-world performance.
