This paper, found here, deals with identifying which hyperparameters to a machine learning algorithm are the most important. First, the hyperparameter settings to a learning algorithm often make the difference between mediocre and state-of-the-art performance. Recognizing that only a few hyperparameters are important relative to the other hyperparameters is not new (see Bergstra and Bengio, 2012 for example). In this paper, the author try to go further and quantify the relative importance of the hyperparameters that do matter.
Current methods (grid search) only give information about how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters. However, algorithm designers and machine learning practitioners would like to know how the hyperparameters affect the performance in general. Trying all hyperparameter instantiations is obviously infeasible for all but the simplest cases. The authors go on to show that an approximate analysis based on predictive models can be used to solve this problem and quantify the performance of a hyperparameter instantiation in the context of all instantiations of the other hyperparameters.
The authors choose to use random forest models (using regression trees such that $\hat{y}: \Theta \rightarrow \mathbb{R}$) since they have been shown to achieve the best performance for model-based optimization in complex hyperparameter spaces (Thornton et al., 2013, Eggensperger et al., 2013). The basic idea seems to be that using a random forest to predict/map the performance of an algorithm given a hyperparameter configuration. They then use functional ANOVA to decompose the variance of the predicted algorithms performance into the contributions of variance from each hyperparameter. The assumption is made that the importance of each hyperparameter is quantified by the fraction of variance that it explains.
To train a random forest, the authors use a configuration space ${\bf \Theta}$ for an algorithm $A$ consisting of $\left\{ \Theta_1 \times \dots \times \Theta_n \right\}$. A complete instantiation of $A$'s $n$ hyperparameters is a vector ${\bf \theta} = \langle \theta_1,\dots,\theta_n \rangle$ with $\theta_i \in \Theta_i$. A performance metric $m({\bf \theta}, \pi)$ captures $A$'s performance with hyperparameter configurations ${\bf \theta} \in {\bf \Theta}$ on scenario $\pi$. To use in practical situations, the authors construct random forest predictors $\hat{y}' : \Theta \times \Pi \rightarrow \mathbb{R}$ and use them to predict, for every unique ${\bf \theta}$ in the training data, the average performance $\hat{m}_\theta = 1/m \sum_{i=1}^k m({\bf \theta}, \pi_i)$ across training scenarios $\pi_i,\dots,\pi_k$ where a training scenario is typically cross-folds. The random forest predictors $\hat{y} : {\bf \Theta} \rightarrow \mathbb{R}$ are trained using the tuples $({\bf \theta},\hat{m}_\theta)$. The authors conclude their paper showing how their procedure works in various problems.
In his paper introducing Random Forests, Breiman, 2001 described method for determining the importance of variables. I wonder if the importance could also be determined using the procedure set forth by Breiman. Using random forests to predict the performance of an algorithm with the hyperparameter configurations as the input features, a random forest can be used to determine which hyperparameters are the most important. This is because the random forest is composed of a forest of random trees (random feature selection) and each tree is trained using bootstrap aggregating (bagging)--also developed by Breiman. By measuring the variance between the trees in the forest, the importance of the input features (in this case hyperparameters) can be determined.
An implementation of their work can be found HERE.
No comments:
Post a Comment