This paper deals with the problems of algorithm selection and hyperparameter selection. Generally, the two problems (model selection and hyperparameter selection) have been dealt with in isolation, most likely because the combined problem is hard, with a very large search space.
Model selection is the process of selecting which learning algorithm $A$ from a set of learning algorithms $\mathcal{A}$ to use for a given data set $\mathcal{D}$. This can be done by splitting $\mathcal{D}$ into disjoint training sets $\mathcal{D}_{train}^{(i)}$ and validation sets $\mathcal{D}_{valid}^{(i)}$ (as in $k$-fold cross-validation). The model selection problem can then be written as: $$A^* \in \operatorname{argmin}\limits_{A \in \mathcal{A}} \frac{1}{k} \sum\limits_{i=1}^k \mathcal{L}(A,\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)}),$$ where $\mathcal{L}(A,\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)})$ is the loss (e.g., the misclassification rate) achieved by $A$ when trained on $\mathcal{D}_{train}^{(i)}$ and evaluated on $\mathcal{D}_{valid}^{(i)}$.
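To make the formulation concrete, here is a minimal sketch of CV-based model selection in Python; scikit-learn, the iris data set, and the two candidate classifiers are my own illustrative stand-ins for the paper's WEKA learners.

```python
# Minimal sketch of k-fold CV model selection; the candidate set is illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]

def cv_loss(A, X, y, k=10):
    """Mean misclassification rate of A over k train/validation splits."""
    return 1.0 - np.mean(cross_val_score(A, X, y, cv=k, scoring="accuracy"))

A_star = min(candidates, key=lambda A: cv_loss(A, X, y))  # the argmin above
```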
Optimization of the hyperparameters $\lambda \in \Lambda$ for a given learning algorithm $A$ can be formulated similarly: $$\lambda^* \in \operatorname{argmin}\limits_{\lambda \in \Lambda} \frac{1}{k} \sum\limits_{i=1}^k \mathcal{L}(A_\lambda,\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)}).$$ With hyperparameters, the correlation structure between different hyperparameter settings $\lambda_1, \lambda_2 \in \Lambda$ can be exploited. For example, a hyperparameter $\lambda_i$ can be conditional on another hyperparameter $\lambda_j$ if $\lambda_i$ is only active when $\lambda_j$ takes on certain value(s); $\lambda_j$ is then a parent of $\lambda_i$. Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space or a directed acyclic graph (which is exploited in this paper).
The authors call the problem of simultaneous model selection and hyperparameter optimization the combined algorithm selection and hyperparameter optimization (CASH) problem. It is formalized as: $$A_{\lambda^*}^* \in \operatorname{argmin}\limits_{A^{(j)}\in\mathcal{A}, \lambda \in \Lambda^{(j)}} \frac{1}{k} \sum\limits_{i=1}^k \mathcal{L}(A_\lambda^{(j)},\mathcal{D}_{train}^{(i)}, \mathcal{D}_{valid}^{(i)}).$$ This problem can be reformulated as a single hierarchical hyperparameter optimization problem with parameter space $\Lambda = \Lambda^{(1)} \cup \dots \cup \Lambda^{(n)} \cup \left \{ \lambda_r \right \}$, where $\lambda_r$ is a new root-level hyperparameter that selects between the $n$ algorithms $A^{(1)}, \dots, A^{(n)}$. The root-level parameters of each subspace $\Lambda^{(i)}$ are made conditional on $\lambda_r$ being instantiated to $A^{(i)}$.
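As a sketch of what the flattened hierarchical space might look like, here is one possible encoding in Python; the dictionary layout and all algorithm/parameter names are my own illustration, not the paper's.

```python
# Illustrative encoding of the hierarchical CASH space: lambda_r at the root
# selects the algorithm, and each subspace is conditional on that choice.
search_space = {
    "lambda_r": ["svm", "random_forest"],  # root hyperparameter
    "svm": {
        "C": ("log-uniform", 1e-3, 1e3),
        "kernel": ["linear", "rbf"],
        # gamma is itself conditional on another hyperparameter (kernel),
        # giving the tree/DAG structure described above.
        "gamma": {"conditional_on": ("kernel", "rbf"),
                  "range": ("log-uniform", 1e-4, 1e1)},
    },
    "random_forest": {
        "n_trees": ("int-uniform", 10, 500),
        "max_depth": ("int-uniform", 1, 30),
    },
}

def active_params(config):
    """Keep only the hyperparameters active under the chosen algorithm."""
    algo = config["lambda_r"]
    return {k: v for k, v in config.items()
            if k == "lambda_r" or k in search_space[algo]}
```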
The authors propose to attack this problem using Bayesian optimization, specifically Sequential Model-Based Optimization (SMBO), which can exploit the hierarchical structure stemming from conditional parameters. SMBO first builds a model $\mathcal{M}_\mathcal{L}$ that captures the dependence of the loss function $\mathcal{L}$ on hyperparameter settings $\lambda$ (done with SMAC or TPE in this paper). SMBO then iterates between the following steps (sketched in code after the list):
- using $\mathcal{M}_\mathcal{L}$ to determine a promising candidate configuration of hyperparameters $\lambda$ to evaluate next,
- evaluating the loss $c$ of $\lambda$, and
- updating the model $\mathcal{M}_\mathcal{L}$ with the new data point $(\lambda, c)$.
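A schematic of this loop follows; `fit_model`, `propose`, and `evaluate_loss` are hypothetical stand-ins for the surrogate model (SMAC's forest or TPE's density estimators), the acquisition-function maximizer, and the cross-validated loss evaluation.

```python
# Schematic SMBO loop; fit_model, propose, and evaluate_loss are hypothetical
# callables supplied by the caller, not functions from any real library.
def smbo(evaluate_loss, fit_model, propose, initial_configs, n_iterations):
    history = [(lam, evaluate_loss(lam)) for lam in initial_configs]
    for _ in range(n_iterations):
        model = fit_model(history)      # M_L: predicts p(c | lambda)
        lam = propose(model, history)   # maximize the acquisition a_{M_L}
        c = evaluate_loss(lam)          # cross-validated loss of lambda
        history.append((lam, c))        # new data point (lambda, c)
    return min(history, key=lambda pair: pair[1])  # best (lambda, c) found
```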
To select the next hyperparameter configuration to evaluate using the model $\mathcal{M}_\mathcal{L}$, SMBO uses an acquisition function $a_{\mathcal{M}_\mathcal{L}}:\Lambda \rightarrow \mathbb{R}$. SMBO chooses the $\lambda \in \Lambda$ that is most useful to evaluate next as determined by $a_{\mathcal{M}_\mathcal{L}}$. The authors use positive expected improvement (EI) as their acquisition function: $$I_{c_{min}}(\lambda) := \max\left\{c_{min}-c(\lambda), 0\right\},$$ where $c_{min}$ is the lowest cost from a known $\lambda$ and $c(\lambda)$ denotes the error rate of hyperparameter configuration $\lambda$. Since $c(\lambda)$ is unknown, the expectation of the improvement with respect to the current model $\mathcal{M}_\mathcal{L}$ is computed instead: $$\mathbb{E}_{\mathcal{M}_\mathcal{L}}[I_{c_{min}}(\lambda)] := \int_{-\infty}^{c_{min}} \max\left\{c_{min}-c,0\right\} \cdot p_{\mathcal{M}_\mathcal{L}}(c|\lambda)\, dc.$$ The quantity $p_{\mathcal{M}_\mathcal{L}}(c|\lambda)$ is unknown, so the authors examine two SMBO methods that can handle hierarchical hyperparameters and yield the expected improvement in closed form.
The first method is Sequential Model-based Algorithm Configuration (SMAC), which uses random forests. SMAC obtains a predictive mean $\mu_\lambda$ and variance $\sigma_\lambda^2$ of $p(c|\lambda)$ as frequentist estimates over the predictions of its individual trees for $\lambda$, and models $p_{\mathcal{M}_\mathcal{L}}(c|\lambda)$ as a Gaussian $\mathcal{N}(\mu_\lambda,\sigma_\lambda^2)$. Under this model, the expected improvement can be calculated as: $$\mathbb{E}_{\mathcal{M}_\mathcal{L}}[I_{c_{min}}]=\sigma_\lambda \cdot [u \cdot \Phi(u) + \varphi(u)],$$ where $u=\frac{c_{min}-\mu_\lambda}{\sigma_\lambda}$, and $\varphi$ and $\Phi$ denote the probability density function and cumulative distribution function of a standard normal distribution, respectively. To aid in exploration of the configuration space (rather than just exploitation of the known subspace), every second configuration is selected at random.
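The closed form is straightforward to implement; here is a small helper assuming SciPy (my own sketch, not code from the paper).

```python
# Closed-form EI under the Gaussian N(mu_lambda, sigma_lambda^2) that SMAC's
# random forest predicts for a configuration lambda.
from scipy.stats import norm

def expected_improvement(mu, sigma, c_min):
    """sigma * (u * Phi(u) + phi(u)) with u = (c_min - mu) / sigma."""
    if sigma == 0.0:
        return max(c_min - mu, 0.0)  # no predictive variance: plain improvement
    u = (c_min - mu) / sigma
    return sigma * (u * norm.cdf(u) + norm.pdf(u))
```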
The second method is the Tree-structured Parzen Estimator (TPE). TPE uses separate models for $p(c)$ and $p(\lambda|c)$, modeling $p(\lambda|c)$ with one of two density estimates, conditional on whether $c$ is greater or less than a given threshold value $c^*$: $$p(\lambda|c)=\begin{cases}l(\lambda), & \text{if }c<c^*\\g(\lambda),&\text{if }c\ge c^*\end{cases},$$ where $l(\cdot)$ is a density estimate learned from all previous hyperparameter configurations with a corresponding loss smaller than $c^*$, and $g(\cdot)$ is a density estimate learned from all previous hyperparameter configurations with a corresponding loss greater than or equal to $c^*$. $c^*$ is chosen as the $\gamma$-quantile of the losses TPE has obtained so far (with a default of $\gamma = 0.15$). Intuitively, this creates one density estimator for the hyperparameters that do well and another for the hyperparameters that do poorly. A value proportional to the expected improvement can be computed in closed form: $$\mathbb{E}_{\mathcal{M}_\mathcal{L}}[I_{c_{min}}] \propto \left(\gamma + \frac{g(\lambda)}{l(\lambda)} \cdot (1-\gamma)\right)^{-1}.$$ TPE maximizes this expectation by generating many candidate hyperparameter configurations and picking the $\lambda$ with the smallest value of $g(\lambda)/l(\lambda)$.
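A sketch of TPE's proposal step for a single continuous hyperparameter; `gaussian_kde` is a simple stand-in for TPE's adaptive Parzen estimators, drawing candidates from the "good" density $l$ is one reasonable choice (drawing them uniformly would also fit the description above), and the code assumes at least two observations on each side of the threshold.

```python
# Sketch of TPE's proposal for one continuous hyperparameter.
import numpy as np
from scipy.stats import gaussian_kde

def tpe_propose(lams, losses, gamma=0.15, n_candidates=64, seed=0):
    lams, losses = np.asarray(lams), np.asarray(losses)
    c_star = np.quantile(losses, gamma)        # threshold c* (gamma-quantile)
    l = gaussian_kde(lams[losses < c_star])    # density of well-performing lambdas
    g = gaussian_kde(lams[losses >= c_star])   # density of poorly-performing lambdas
    candidates = l.resample(n_candidates, seed=seed).ravel()
    # EI is maximized by the candidate with the smallest ratio g(lambda)/l(lambda)
    return candidates[np.argmin(g(candidates) / l(candidates))]
```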
They ran a large number of experiments on a set of 21 data sets; Auto-WEKA generally decreased the error compared to the best algorithm with default hyperparameters and to an algorithm with its hyperparameters chosen using a random grid search. One of their conclusions is that even relatively simple approaches to the CASH problem can outperform algorithm selection on its own. Of course, the trade-off here is the computational resources required to search the space of hyperparameter configurations. The random grid search used an average of 400 CPU hours per data set; Auto-WEKA used $4\times30$ (120) CPU hours. (The authors point out that their process can be run in parallel with different random seeds, which increases the performance of Auto-WEKA. This seems somewhat obvious to me, since the more configurations you try, the better the results you would expect to achieve. I think a nice contribution would be to have the 4 processes communicate so that $\mathcal{M}_\mathcal{L}$ could be updated from the results of all 4 runs rather than each run in isolation.) The improvements were also less pronounced on a held-out test set than under 10-fold cross-validation.
How does Auto-WEKA compare with metalearning?
The question that comes to mind is how this (or any other Bayesian optimization technique, BOT, for hyperparameter selection) differs from metalearning. The main difference seems to be that metalearning uses meta-features of a data set and/or learning algorithm to suggest a learning algorithm or hyperparameter configuration. Not much work has been done in metalearning on combined model selection and hyperparameter selection; as the authors point out, it is a hard problem. Given a large number of meta-features and results on a set of data sets and learning algorithms, metalearning could produce a suggested configuration faster, since the computation was spent ahead of time. A BOT, on the other hand, spends that computation at selection time without using any knowledge previously obtained from other data sets and learning algorithms. Perhaps one day a mechanism could be developed to combine the two.