Wednesday, March 26, 2014

On the Dangers of Cross-Validation. An Experimental Evaluation (Rao et al, SDM 2008)

This paper discusses the issue of using k-fold cross-validation to estimate the generalization error of a learning algorithm. One of the reasons for wanting to estimate the generalization accuracy of a set of algorithms is to select which one to use in an experiment or application. Now, there have been a number studies on how to estimate the generalization of a model and how to select if one model is better than another. For example, Thomas Dietterich (1998) recommends using 5 by 2-fold cross-validation when an algorithm can be executed 10 times, otherwise use McNemar's test. When the set of algorithms to be consider becomes large, the authors show that cross-validation is no longer a good measure of generalization performance, and accordingly can no longer be used for algorithm or feature selection.

For their methodology, they first define $\Omega(\mathcal{A})$ as the cross validation under-estimate of true error as: $$\Omega(\mathcal{A}) = true\mbox{ }err-CV\mbox{ }err=\epsilon(\mathcal{A}(S))-\hat{\epsilon}^\mathcal{A}_{CV}(S)$$ for a learning algorithm $\mathcal{A}$ and training set $S$ where $\epsilon(\mathcal{A}(S))$ is the true generalization error of the hypothesis over the distribution of the training instances $\mathcal{D}$. Cross validation is also used to select from a set of learning algorithms: $$\Omega^*=\Omega(\mathcal{A}^*).$$

To make the computation feasible, the authors use a family of regularized least squares methods (Ridge regression) for classification. Using some fancy math from Cawley and Talbot, 2003, they compute the leave one out cross validation accuracy in closed form. To generate a large number of learning algorithms the randomly assign $K_{dim}$ radial basis function centers for an RBF kernel function provided for the Ridge regression. They run experiments on synthetic and benchmark data sets.

For the synthetic data sets, the input vector $x_i$ is sampled over $\Re^d$ in the range $[-1,1]^d$ according to the distribution $\nabla$. A training set $S$ is created by drawing $n$ samples from $\nabla$. The corresponding class label $y_i$ is then randomly assigned 1 or -1 independent of $x_i$. Therefore the true error for every hypothesis on $\nabla$ is 0.5. Using $M$ models, the results are summarized as:
M $10^1$ $10^2$ $10^3$ $10^4$ $10^5$ $10^6$
Best CV 61.9 69.0 74.8 79.6 83.2 85.6
Expected 50 50 50 50 50 50
As can be seen, the Best CV grossly overfits the data.

The authors also examine the impact of overfitting based on the size and dimensionality of the data set. The CV over-estimate of the accuracy is most pronounced for small data sets with high dimensionality. The authors also show that feature selection is unreliable when a large number of features are selected using cross validation. In addition allowing access to more and more features dramatically overfits the training set. The results are similar for the benchmark and real-world data sets.

It is generally accepted the increasing the model complexity increases the likelihood of overfitting the data. The same applies to overfitting on "cross-validation space." When the number of learning algorithms is large, cross validation ceases to be an effective estimate of generalization accuracy. In general, you want to use as much data from training as possible, which makes LOOCV attractive. Similarly, you want to use as much data for testing as well since it affects the estimate of the predictive accuracy of a model. In the case of LOOCV, only one data point is used for testing resulting in high variance. The more data points reduces the variance on the test set. This point makes Dietterich's 5 by 2 CV approach seem very reasonable as it will avoid variance in small test sets.

Whenever possible, a test set should be used that is only examined after selecting a model. Further care has to be taken to not continue tuning the parameters by observing the performance on the test set as it will no longer simulate real-world settings.

Wednesday, March 19, 2014

Auto-WEKA: Combined Selection of Hyperparameter Optimization of Classification Algorithms (Thornton et al., KDD 2013)

This paper deals with the problem of algorithm selection and hyperparameter selection. Generally, both of these problems (model selection and hyperparameter selection) have been dealt with in isolation most likely due to the fact that it is a hard problem with a very large search space.

Model selection is the process of selecting which learning algorithm $A$ from a set of learning algorithms $\mathcal{A}$ to use for a given data set $\mathcal{D}$. This can be done by splitting $\mathcal{D}$ into disjoint training set $\mathcal{D}_{train}^{(i)}$ and validation sets $\mathcal{D}_{valid}^{(i)}$ ($k$-fold cross validation). The model selection problem can then be written as: $$A^* \in \operatorname{argmin}\limits_{A \in \mathcal{A}} \frac{1}{k} \sum\limits_{i=1}^k \mathcal{L}(A,\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)}),$$ where $\mathcal{L}(A,\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)})$ is the loss (i.e. misclassification rate) achieved by $A$ when trained on $\mathcal{D}_{train}^{(i)}$ and evaluated on $\mathcal{D}_{valid}$.

Optimization of the hyperparameters $\lambda \in \Lambda$ for a given learning algorithm can be formulated similarly: $$\lambda^* \in \operatorname{argmin}\limits_{\lambda \in \Lambda} \frac{1}{k} \sum\limits_{i=1}^k \mathcal{L}(A_\lambda,\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)}).$$ With hyperparameters, the correlation structure between different hyperparameter settings $\lambda_1, \lambda_2 \in \Lambda$ can be exploited. For example, a hyperparameter  $\lambda_i$ can be conditional on another hyperparameter $\lambda_j$ if $\lambda_i$ is only active if $\lambda_j$ takes on certain value(s). Hyperparameter $\lambda_j$ is a parent of $\lambda_i$. Conditional hyperparameters can be parents of other conditional hyperparamters, thus, giving rise to a tree-structure space or a directed acyclic graph (which will be exploited in this paper).

The authors call the problem of simultaneously model selection and hyperparameter optimization the combined algroithm selection and hyperparameter optimization  (CASH) problem. It is formalized as: $$A_{\lambda^*}^* \in \operatorname{argmin}\limits_{A^{(j)}\in\mathcal{A}, \lambda \in \Lambda^{(j)}} \frac{1}{k} \sum\limits_{i=1}^k \mathcal{L}(A_\lambda^{(j)},\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)}).$$ This problem can be reformulated as a single hierarchical hyperparameter optimization problem with parameter space $\Lambda = \Lambda^{(1)} \cup \dots \cup \Lambda^{(k)} \cup \left \{ \lambda_r \right \}$, where $\lambda_r$ is a new root-level hyperparameter that selects between algorithms $A^{(1)}, \dots,A^{(k)}$. The root level parameters of each subspace $\Lambda^{(i)}$ are made conditional on $\lambda_r$ being instantiated to $A_i$.

The authors propose to attack this problem  using Bayesian optimization, specifically Sequential Model-Based Optimization (SMBO) that can exploit hierarchical structure stemming from condition parameters. SMBO first builds a model $\mathcal{M}_\mathcal{L}$ that captures the dependence of loss function $\mathcal{L}$ on hyperparameter settings $\lambda$. (This is done using SMAC or TPE in this paper). SMBO then iterates between:

  • using $\mathcal{M}_\mathcal{L}$ to determine a promising candidate configuration of hyperparameters $\lambda$ to evaluate next.
  • evaluate the loss $c$ of $\lambda$.
  • update the model $\mathcal{M}_\mathcal{L}$ with the new data point $(\lambda, c)$.
To select the next set of hyperparameter configuration use next using the model $\mathcal{M}_\mathcal{L}$, SMBO uses an acquisition function $a_{\mathcal{M}_\mathcal{L}}:\Lambda \rightarrow \mathbb{R}$. SMBO choose the $\lambda$ from $\Lambda$ that is the next most useful to use as determined by $a_{\mathcal{M}_\mathcal{L}}$. The authors use positive expected improvement (EI) as their acquisition function: $$I_{c_{min}}(\lambda:= max\left\{c_{min}-c(\lambda), 0\right\}$$ where $c_{min}$ is the lowest cost from a known $\lambda$ and $c(\lambda)$ denotes the error rate of hyperparameter configuration $\lambda$. Since $c(\lambda)$ is unknown, the expected EI value with respect to the current model $\mathcal{M}_\mathcal{L}$ can be computed instead: $$\mathbb{E}_{\mathcal{M}_\mathcal{L}}[I_{c_{min}}] := \int_{-\infty}^{c_{min}} max\left\{c_{min}-c,0\right\} \cdot p_{\mathcal{M}_\mathcal{L}}(c|\lambda) dc.$$ The quantity $p_{\mathcal{M}_\mathcal{L}}(c|\lambda)$ is unknown. To estimate it in closed form, the authors examine two SMBO methods that can handle hierarchical hyperparameters and can solve for $p_{\mathcal{M}_\mathcal{L}}(c|\lambda)$ in closed form.

The first method is sequential model-based algorithm configuration (SMAC) which uses random forests. SMAC obtains a predictive mean $\mu_\lambda$ and variance $\sigma_\lambda^2$ of $p(c|\lambda)$ as frequentist estimates over the predictions of its individual trees for $\lambda$ to model $p_{\mathcal{M}_\mathcal{L}}(c|\lambda)$ as a Gaussian $\mathcal{N}(\mu_\lambda,\sigma_\lambda^2)$. Using this, the expected EI can be calculated as: $$\mathbb{E}_{\mathcal{M}_\mathcal{L}}[I_{c_{min}}]=\sigma_\lambda \cdot [u \cdot \Phi(u) + \varphi(u)],$$ where $u=\frac{c_{min}-\mu_\lambda}{\sigma_\lambda}$ and $\varphi$ and $\Phi$ represent the probability density function and cumulative distribution function of a standard normal distribution respectively. To aid in exploration of the configuration space (rather then just exploitation of the known subspace), every second configuration is selected at random.

The second method is tree-structured parzen estimator (TPE). TPE uses separate models for $p(c)$ and $p(\lambda|c)$. $p(\lambda|c)$ is modeled as one of two density estimates conditional on whether $c$ is greater of less than a given tresholf value $c^*$:$$p(\lambda|c)=\begin{cases}l(\lambda, & \text{if }c<c^*\\g(\lambda),&\text{if }c\ge c^*\end{cases},$$ where $l(\cdot)$ is a density estimate learned from all previous hyperparameters with a corresponding loss smaller than $c^*$ and $g(\cdot)$ is a density estimate learned from all previous hyperparameters with a corresponding loss greater than or equal to $c^*$. $c^*$ is chosen as the $\gamma$-quantile of the losses TPE has obtained so far (default value is 0.15). Intuitively, this creates a density estimator for the hyperparameters that do well and a density estimator for the hyperparameters that do poorly. A proportional value of the expected EI value can be computed in closed form as:$$\mathbb{E}_{\mathcal{M}_\mathcal{L}}[I_{c_{min}}] \propto \left(\gamma + \frac{g(\lambda)}{l(\lambda)} \cdot (1-\gamma)\right)^{-1}.$$ TPE maximizes this expectation by generating many candidate hyperparameter configurations at random and picking $\lambda$ with the smallest value of $g(\lambda)/l(\lambda)$.

The ran a large number of experiments on generally decreased the error on a set of 21 data sets compared to the best algorithm with default hyperparameters and to an algorithm with its hyperparameters chosen using a random grid search. One of their conclusions that even for relatively simple approaches for the CASH problem can outperform algorithm selection by itself. Of course, the trade-off here is the computational resources that are required to search the space of hyperparameter configurations. The random grid search used an average fo 400 CPU hours per data set. Auto-WEKA used $4\times30$ (120) CPU hours. (The authors point out that their process can be ran in parallel with different random seeds which increases the performance of Auto-WEKA. This seems kind of obvious to me since the more configurations you try to better results you would be expected to achieve. I think that a nice contribution would be to have the 4 processes communicate such that $\mathcal{M}_\mathcal{L}$ can be updated from the results of all 4 runs rather than 4 single runs.) The results were also less pronounced on a test set as opposed to 10-fold cross validation.

How does Auto-WEKA compare with Metalearning

The question that conjures in my head is how does this (or any other Bayesian optimization technique, BOT, for hyperparameter selection) differ from metalearning. It seems that the main difference is that metalearning uses the meta-features from a data set and/or learning algorithm to suggest a learning algorithm/hyperparameter configuration. Not much work has been in metalearning on model selection and hyperparameter selection, as the authors have pointed out, it is a hard problem. Given a large number of meta-features and results on a set of data sets and learning algorithms, meta-learning could produce a suggested configuration faster given that the time was used previously. BOT, on the other hand, uses that computation time without using the knowledge that was previously obtained of the data sets and learning algorithms. Perhaps, one day a mechanism could be developed to combine the two.

Monday, March 17, 2014

Latex tips

Will be continually updated

To reduce spacing in math equations: \thinmuskip=0mu

Symbols:
$\mathbb{R}$ : Set of real valued numbers \mathbb{R}
$\mathbb{E}$ : Expected value \mathbb{E}
$\infty$ : Infinity \infty
$\int$ : Integral \int
$\cdot$ : dot product \cdot
$\propto$ : proportional to \propto
$\Re$ : fancy R for set of real numbers \Re
$\nabla$ : upside-down triangle (gradient symbol) \nabla

Curly braces
$\left \{ \dots \right \}$ : Curly braces \left \{ \right \}

$\times$ : Cross product/multiplication \times

$\cup$ : Union operator

An Efficient Approach of Assessing Hyperparamter Importance (Hutter et al., ICML 2014)

This paper, found here, deals with identifying which hyperparameters to a machine learning algorithm are the most important. First, the hyperparameter settings to a learning algorithm often make the difference between mediocre and state-of-the-art performance. Recognizing that only a few hyperparameters are important relative to the other hyperparameters is not new (see Bergstra and Bengio, 2012 for example). In this paper, the author try to go further and quantify the relative importance of the hyperparameters that do matter.

Current methods (grid search) only give information about how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters. However, algorithm designers and machine learning practitioners would like to know how the hyperparameters affect the performance in general. Trying all hyperparameter instantiations is obviously infeasible for all but the simplest cases. The authors go on to show that an approximate analysis based on predictive models can be used to solve this problem and quantify the performance of a hyperparameter instantiation in the context of all instantiations of the other hyperparameters.

The authors choose to use random forest models (using regression trees such that $\hat{y}: \Theta \rightarrow \mathbb{R}$) since they have been shown to achieve the best performance for model-based optimization in complex hyperparameter spaces (Thornton et al., 2013Eggensperger et al., 2013).  The basic idea seems to be that using a random forest to predict/map the performance of an algorithm given a hyperparameter configuration. They then use functional ANOVA to decompose the variance of the predicted algorithms performance into the contributions of variance from each hyperparameter. The assumption is made that the importance of each hyperparameter is quantified by the fraction of variance that it explains.

To train a random forest, the authors use a configuration space ${\bf \Theta}$ for an algorithm $A$ consisting of $\left\{ \Theta_1 \times \dots \times \Theta_n \right\}$. A complete instantiation of $A$'s $n$ hyperparameters is a vector ${\bf \theta} = \langle \theta_1,\dots,\theta_n \rangle$ with $\theta_i \in \Theta_i$. A performance metric $m({\bf \theta}, \pi)$ captures $A$'s performance with hyperparameter configurations ${\bf \theta} \in {\bf \Theta}$ on scenario $\pi$. To use in practical situations, the authors construct random forest predictors $\hat{y}' : \Theta \times \Pi \rightarrow \mathbb{R}$ and use them to predict, for every unique ${\bf \theta}$ in the training data, the average performance $\hat{m}_\theta = 1/m \sum_{i=1}^k m({\bf \theta}, \pi_i)$ across training scenarios $\pi_i,\dots,\pi_k$ where a training scenario is typically cross-folds. The random forest predictors $\hat{y} : {\bf \Theta} \rightarrow \mathbb{R}$ are trained using the tuples $({\bf \theta},\hat{m}_\theta)$. The authors conclude their paper showing how their procedure works in various problems.

In his paper introducing Random Forests, Breiman, 2001 described method for determining the importance of variables. I wonder if the importance could also be determined using the procedure set forth by Breiman. Using random forests to predict the performance of an algorithm with the hyperparameter configurations as the input features, a random forest can be used to determine which hyperparameters are the most important. This is because the random forest is composed of a forest of random trees (random feature selection) and each tree is trained using bootstrap aggregating (bagging)--also developed by Breiman. By measuring the variance between the trees in the forest, the importance of the input features (in this case hyperparameters) can be determined.

An implementation of their work can be found HERE.

Friday, March 14, 2014

A Discriminative Latent Variable Model for Online Clustering (ICML 2014)

This paper addresses the problem of coreference resolution in which denotative noun phrases (mentions) that refer to the same underlying entity are clustered together. For example:
[American President]1 [Bill Clinton]1 has been invited by the [Russian President]2, [Vladimir Putin]2, to visit [Russia]3. [President Clinton]3 said [he]1 looks forward to [his]1 visit.
This paper's main objective is to treat this as an online clustering problem since the mentions follow a natural order. This follows the linguistic intuition that humans are likely to use to resolve coreference in a statement.

They further present their procedure, the Latent Left-Linking Model (L3M). L3M assumes that since the data arrives in an order, to cluster a given item it is sufficient to only consider the previous items. L3M makes 3 modeling assumptions:

  1. Left-Linking. An item i can only link (probabilistically) to items before it
  2. Independence of Left-links. The probability of an item i linking to and antecedent item j is independent of any other item i' that links to j.
  3. Probabilistic Left-link. For an item set d, the probability of an item i ≥  1 linking to an antecedent item j (0 ≤ < i), $P[j \leftarrow i; d,{\bf w}] $: $$Pr[j \leftarrow i; d,{\bf w}] = \frac{exp(\frac{{\bf w}_{ij}}{\gamma})}{Z_i(d,{\bf w},\gamma)}$$
  4. where $\gamma$ is a temperature parameter and $Z_i$ is the normalization factor. Essentially, this means that $P[j \leftarrow i; d,{\bf w}] $ is $exp(\frac{{\bf w}_{ij}}{\gamma})$ normalized to be a probability (it sums to one). $w$ is a weight vector that consists of different features indicative of the similarity of items $i$ and $j$ such as the cosine similarity.
This model is similar to the Distance Dependent Chinese Restaurant Process (CRP) for generative clustering. L3M is differentiated from CRP in that is the first supervised discrimintative model to generalize the idea of using arbitrary pairwise features with learned weights.

The goal of the model is to cluster to items. An item is clustered into the cluster  to which is has the highest probability of belonging to (greedy approach). If an item has the highest probability with a dummy cluster, a new cluster is formed with that item. The probability that an item joins a previously formed cluster $c$ is the sum of probabilities of the $i$ linking to items in $c$: $$Pr[c \odot i;d,{\bf w}] = \sum_{j \in c, o\le j \lt i} Pr[j \leftarrow i; d {\bf w}]$$. The feature vector ${\bf w}$ is learned by minimizing the regularized negative log-likelihood of the data. Casting the problem as a latent tree, a log-likelihood loss function can be minimized to find ${\bf w}$ using stochastic gradient descent.

I am left with a few questions, namely, how is ${\bf w}$ initialized? At the conclusion of the paper, they randomly initialize ${\bf w}$ to show that their algorithm is robust to the non-convexity of the problem. But, it wasn't clear to me how ${\bf w}$ as initialized in the previous experiments.