For their methodology, they first define $\Omega(\mathcal{A})$, the amount by which cross validation under-estimates the true error, as: $$\Omega(\mathcal{A}) = \text{true err} - \text{CV err} = \epsilon(\mathcal{A}(S)) - \hat{\epsilon}^{\mathcal{A}}_{CV}(S)$$ for a learning algorithm $\mathcal{A}$ and training set $S$, where $\epsilon(\mathcal{A}(S))$ is the true generalization error of the learned hypothesis over the distribution $\mathcal{D}$ from which the training instances are drawn. When cross validation is also used to select the best algorithm $\mathcal{A}^*$ from a set of candidates, the quantity of interest is the under-estimate for the selected algorithm: $$\Omega^*=\Omega(\mathcal{A}^*).$$
To make the computation feasible, the authors use a family of regularized least squares methods (ridge regression) for classification. Using a closed-form result from Cawley and Talbot (2003), they compute the leave-one-out cross validation accuracy without retraining the model $n$ times. To generate a large number of learning algorithms, they randomly assign $K_{dim}$ radial basis function centers for the RBF kernel used by the ridge regression. They run experiments on synthetic and benchmark data sets.
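To see why the closed form matters, here is a minimal sketch (not the paper's code) of the standard hat-matrix shortcut for ridge regression over RBF features: the leave-one-out residual for point $i$ is the ordinary residual divided by $1 - H_{ii}$, so all $n$ held-out predictions come from a single fit. The function names and the `gamma`/`lam` defaults are illustrative assumptions.

```python
import numpy as np

def rbf_features(X, centers, gamma=1.0):
    # Map each input to RBF activations around the (randomly chosen) centers.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def loo_error_ridge(Phi, y, lam=1e-2):
    """Closed-form leave-one-out error for ridge regression on features Phi.

    Uses the identity: LOO residual_i = (y_i - yhat_i) / (1 - H_ii),
    where H = Phi (Phi^T Phi + lam*I)^{-1} Phi^T is the ridge hat matrix.
    Labels are +/-1, so the held-out prediction is thresholded at 0.
    """
    n, k = Phi.shape
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T)
    yhat = H @ y
    loo_pred = y - (y - yhat) / (1.0 - np.diag(H))  # prediction for i with point i held out
    return np.mean(np.sign(loo_pred) != y)
```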
For the synthetic data sets, the input vector $x_i$ is sampled from $[-1,1]^d \subset \Re^d$ according to the distribution $\mathcal{D}$. A training set $S$ is created by drawing $n$ samples from $\mathcal{D}$. The corresponding class label $y_i$ is then randomly assigned +1 or -1 independently of $x_i$, so the true error of every hypothesis on $\mathcal{D}$ is 0.5. Using $M$ models, the results are summarized in the table below:
| $M$ | $10^1$ | $10^2$ | $10^3$ | $10^4$ | $10^5$ | $10^6$ |
|---|---|---|---|---|---|---|
| Best CV accuracy (%) | 61.9 | 69.0 | 74.8 | 79.6 | 83.2 | 85.6 |
| Expected (true) accuracy (%) | 50 | 50 | 50 | 50 | 50 | 50 |
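The effect is easy to reproduce. Below is a minimal simulation in the spirit of this experiment (not the authors' code), reusing `rbf_features` and `loo_error_ridge` from the sketch above; the values of $n$, $d$, $K_{dim}$, $\lambda$, and $M$ are arbitrary assumptions, but the best LOOCV accuracy over the $M$ random models should climb well above the true 50%.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K_dim, lam, M = 50, 10, 20, 1e-2, 1000     # illustrative sizes

X = rng.uniform(-1.0, 1.0, size=(n, d))          # inputs drawn from [-1, 1]^d
y = rng.choice([-1.0, 1.0], size=n)              # labels independent of X: true accuracy is 50%

best_cv_acc = 0.0
for _ in range(M):
    centers = rng.uniform(-1.0, 1.0, size=(K_dim, d))  # each random set of centers = one "model"
    Phi = rbf_features(X, centers)
    cv_acc = 1.0 - loo_error_ridge(Phi, y, lam)
    best_cv_acc = max(best_cv_acc, cv_acc)

print(f"best LOOCV accuracy over {M} random models: {best_cv_acc:.3f} (true accuracy: 0.5)")
```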
The authors also examine how overfitting depends on the size and dimensionality of the data set: the CV over-estimate of the accuracy is most pronounced for small, high-dimensional data sets. They also show that feature selection is unreliable when a large number of features is selected using cross validation. In addition, allowing access to more and more features dramatically overfits the training set. The results are similar for the benchmark and real-world data sets.
It is generally accepted that increasing model complexity increases the likelihood of overfitting the data. The same applies to overfitting in "cross-validation space": when the number of learning algorithms being compared is large, cross validation ceases to be an effective estimate of generalization accuracy. In general, you want to use as much data as possible for training, which makes LOOCV attractive. But you also want as much data as possible for testing, since the size of the test set affects the quality of the estimate of a model's predictive accuracy. In LOOCV, only one data point is used for testing at a time, resulting in high variance; more test points reduce that variance. This makes Dietterich's 5x2 CV approach seem very reasonable, as it avoids the high variance of small test sets.
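For reference, here is a minimal sketch of the 5x2 CV protocol, assuming an sklearn-style estimator with `fit` and `score` methods (the function name and seed handling are my own): five repetitions of a random 50/50 split, with the model trained and tested in both directions, giving ten accuracy estimates whose mean and spread you can inspect.

```python
import numpy as np

def five_by_two_cv(model, X, y, seed=0):
    """Dietterich-style 5x2 CV: 5 repetitions of a 50/50 split,
    evaluated in both directions, yielding 10 accuracy estimates."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(5):
        perm = rng.permutation(len(y))
        half = len(y) // 2
        a, b = perm[:half], perm[half:]
        for train_idx, test_idx in ((a, b), (b, a)):
            model.fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```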
Whenever possible, a held-out test set should be used that is examined only after the model has been selected. Further care must be taken not to keep tuning parameters based on performance on that test set, as it would then no longer simulate real-world settings.