Using Gaussian Processes in Bayesian Optimization
In this post, I will explain the use of Gaussian Processes in Bayesian Optimization, applied to hyperparameter tuning in an existing Transfer Learning project.
Transfer Learning Project
The model whose hyperparameters are tuned is a Transfer Learning project with the following main blocks (a minimal sketch of the resulting model follows the list):
1. I’m using the CIFAR10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), with 60000 32x32 color images in 10 classes and 6000 images per class.
2. The first block preprocesses and adapts the inputs so they can be fed to the predefined Xception model. Xception is pre-trained on the ImageNet dataset and its default input size is 299x299. To enlarge the training set, I use data augmentation with a horizontal flip of the images.
3. The second block is the Xception network itself, with its head removed because that layer is optimized for the ImageNet dataset.
4. The third block adds layers at the output of Xception so the network can be trained for the dataset used here (CIFAR10). These layers include the output layer, a softmax that categorizes the inputs into the 10 CIFAR10 classes.
5. The fourth block is the Compile & Fit process.
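The following is a minimal sketch of how such a model could be assembled with tf.keras. The layer sizes and the initial momentum, L2, and dropout values are illustrative assumptions, not the project's actual code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Block 1: adapt the 32x32 CIFAR10 inputs to Xception and augment with a horizontal flip
inputs = layers.Input(shape=(32, 32, 3))
x = layers.RandomFlip("horizontal")(inputs)
x = layers.Resizing(299, 299)(x)                          # Xception's default input size
x = layers.Rescaling(scale=1.0 / 127.5, offset=-1.0)(x)   # scale to the [-1, 1] range Xception expects

# Block 2: Xception pre-trained on ImageNet, without its ImageNet-specific head
base = tf.keras.applications.Xception(include_top=False, weights="imagenet", pooling="avg")
base.trainable = False
x = base(x, training=False)

# Block 3: additional trainable layers ending in a 10-class softmax
x = layers.BatchNormalization(momentum=0.05)(x)           # momentum_1 hyperparameter
x = layers.Dense(256, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(0.05))(x)        # L2 hyperparameter
x = layers.Dropout(0.5)(x)                                # dropout hyperparameter
outputs = layers.Dense(10, activation="softmax",
                       kernel_regularizer=tf.keras.regularizers.l2(0.05))(x)  # L2_2 hyperparameter

# Block 4: Compile & Fit
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_split=0.1, epochs=50)
```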
Each one of these blocks has several hyperparameters and the goal is to maximize the validation accuracy metric.
Which hyperparameters?
The original results show a validation accuracy of 75.37% while the training accuracy is 97.53%; this gap is a good indicator of overfitting. Based on this observation, the hyperparameters to optimize should be those with a direct impact on reducing overfitting, such as momentum, L2 regularization, and dropout.
What is a Gaussian Process?
The book Gaussian Processes for Machine Learning by Rasmussen & Williams gives a complete definition of the Gaussian Process. I will try to explain the intuition behind it in my own words.
First, it is necessary to explain that a Gaussian Probability Distribution is a continuous probability distribution defined by two parameters (mean and standard deviation). Thanks to its features (symmetry, bell shape, and the Central Limit Theorem) it is widely used in machine learning, because from these two parameters it is possible to make a prediction and to attach a probability to that prediction.
A Gaussian Process is a generalization of the Gaussian Probability Distribution: it is a stochastic process that governs the properties of a function. For each value of x there is a continuous set of possible values of f(x), and each of these values has a probability determined by a mean and a standard deviation. In a Gaussian Process, the optimized value of f(x) would be the mean with zero variance.
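Here is a small NumPy sketch of this intuition, assuming a standard squared-exponential kernel and a toy sine function as the unknown black box: at every test point x, the GP posterior provides a Gaussian over f(x), i.e. a mean prediction and a variance measuring the uncertainty.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential covariance between two sets of 1-D points
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

# A few observations of the unknown "black box" function
x_train = np.array([0.1, 0.4, 0.7])
y_train = np.sin(2 * np.pi * x_train)
x_test = np.linspace(0.0, 1.0, 5)

# GP posterior: mean and covariance of f(x_test) given the observations
K = rbf_kernel(x_train, x_train) + 1e-6 * np.eye(len(x_train))   # small jitter for stability
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)
K_inv = np.linalg.inv(K)

mean = K_s.T @ K_inv @ y_train        # posterior mean at each test point
cov = K_ss - K_s.T @ K_inv @ K_s      # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print("posterior mean:", mean)
print("posterior std :", std)
```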
Bayesian Optimization
The use of resources (computational, time, money, people) in a highly complex project is a huge restriction when the target is to find a prediction based on the existing results.
There are several methodologies to find one value from a set of possibilities:
- Random choice
- Grid method
- Statistical methods
Random choice and grid methods consume a lot of resources and are very inefficient, while statistical methods offer greater efficiency: it is possible to determine the f(x) that has the maximum probability and, in this way, to find the optimal one with few points (few resources).
The Bayesian Optimization used in this project runs the Expected Improvement algorithm, which considers how much each candidate point can improve on the best result found so far. This means that the next point to evaluate is the one with the maximum expected improvement. One critical point is the iteration budget: only after several iterations, when the newly proposed points stop improving on the ones already found, can the maximum be considered optimal. A minimal sketch of this acquisition function is shown below.
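This is a hedged sketch of the Expected Improvement acquisition for a maximization problem, assuming the GP posterior mean and standard deviation at each candidate point are already available (GPyOpt computes all of this internally):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI(x) = E[max(f(x) - f_best - xi, 0)] under the GP posterior N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)        # avoid division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# The next point to evaluate is the candidate with the largest expected improvement
mu = np.array([0.80, 0.82, 0.81])           # posterior means at three candidate points
sigma = np.array([0.01, 0.03, 0.10])        # posterior standard deviations
print(np.argmax(expected_improvement(mu, sigma, f_best=0.8137)))
```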
Findings
As explained in the section Transfer Learning Project, several hyperparameters were chosen to optimize the validation accuracy metric.
The optimization has been implemented with the Python library GPyOpt. Each variable to optimize must include the following information:
- Name of the variable in the function to optimize
- Type of variable: discrete or continuous
- Maximum and minimum values, i.e. the range within which to determine the next best point
Besides this, GPyOpt needs additional information:
- Type of model: Gaussian Process
More information about the parameters used by GPyOpt can be found at https://github.com/SheffieldML/GPyOpt/blob/master/GPyOpt/methods/bayesian_optimization.py
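Below is a hedged sketch of how one of the hyperparameters (momentum_1, with the range used later in this section) could be tuned with GPyOpt. The function train_and_evaluate() is a hypothetical wrapper around the Keras model above that trains it and returns the validation accuracy for a given hyperparameter value.

```python
import GPyOpt

def objective(x):
    # GPyOpt passes a 2-D array of candidate points; here there is a single variable
    momentum = float(x[:, 0])
    val_acc = train_and_evaluate(momentum=momentum)   # hypothetical training wrapper
    return -val_acc    # GPyOpt minimizes, so negate the metric we want to maximize

domain = [{'name': 'momentum_1', 'type': 'continuous', 'domain': (0.001, 0.9)}]

optimizer = GPyOpt.methods.BayesianOptimization(
    f=objective,
    domain=domain,
    model_type='GP',          # Gaussian Process surrogate model
    acquisition_type='EI')    # Expected Improvement acquisition

optimizer.run_optimization(max_iter=20)
print('best momentum:', optimizer.x_opt, 'best validation accuracy:', -optimizer.fx_opt)
```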
First Hyperparameter / Momentum_1
This momentum hyperparameter is used in the Batch Normalization layer after the Xception neural network.
Range: 0.001 to 0.9
Initial Variable Value: 0.05
Initial Validation Accuracy: 0.8137
Final Variable Value: 0.0035208
Final Validation Accuracy: 0.82220
Second Hyperparameter / L2
The L2 hyperparameter penalizes large weights in order to reduce overfitting. This regularizer is applied in a Dense layer after the Xception neural network.
Range: 0.001 to 0.999
Initial Variable Value: 0.05
Initial Validation Accuracy: 0.8220
Final Variable Value: 0.02439626
Final Validation Accuracy: 0.82770
Third Hyperparameter / L2_2
This regularizer is applied in a Dense layer before the softmax that categorizes the outputs into the 10 classes.
Range: 0.001 to 0.999
Initial Variable Value: 0.05
Initial Validation Accuracy: 0.8220
Final Variable Value: 0.02167783
Final Validation Accuracy: 0.83080
Fourth Hyperparameter / Epochs
The epochs hyperparameter is the number of complete cycles through the full training dataset.
Discrete type
Range: 1 to 1000
Initial Variable Value: 50
Initial Validation Accuracy: 0.83080
Final Variable Value: 400
Final Validation Accuracy: 0.8322
Fifth Hyperparameter / Dropout
The Dropout hyperparameter is the fraction of units randomly dropped during training in order to reduce overfitting.
Continuous type
Range: 0 to 1
Initial Variable Value: 0.50
Initial Validation Accuracy: 0.83080
Final Variable Value: 0.055005614831763125
Final Validation Accuracy: 0.8327000141143799
Multivariate Bayesian Optimization
The optimizations above were run with only one hyperparameter at a time, but one of the features of the Gaussian Process is that it is multivariate, which means the black-box output can depend on more than one variable.
In the experiment, I included all of the hyperparameters described above in a single optimization loop, sketched below. Here are the results.
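This is a hedged sketch of what that single GPyOpt loop over all hyperparameters could look like; the bounds mirror the ranges listed above, and train_and_evaluate() is the same hypothetical training wrapper used in the single-variable sketch.

```python
import GPyOpt

domain = [
    {'name': 'momentum_1', 'type': 'continuous', 'domain': (0.001, 0.9)},
    {'name': 'l2',         'type': 'continuous', 'domain': (0.001, 0.999)},
    {'name': 'l2_2',       'type': 'continuous', 'domain': (0.001, 0.999)},
    {'name': 'epochs',     'type': 'discrete',   'domain': tuple(range(1, 1001))},
    {'name': 'dropout',    'type': 'continuous', 'domain': (0.0, 1.0)},
]

def objective(x):
    # Each column of x matches one entry of the domain above
    momentum, l2, l2_2, epochs, dropout = x[0]
    val_acc = train_and_evaluate(momentum=momentum, l2=l2, l2_2=l2_2,
                                 epochs=int(epochs), dropout=dropout)   # hypothetical wrapper
    return -val_acc   # negate because GPyOpt minimizes

optimizer = GPyOpt.methods.BayesianOptimization(
    f=objective, domain=domain, model_type='GP', acquisition_type='EI')
optimizer.run_optimization(max_iter=30)
print('best hyperparameters:', optimizer.x_opt)
```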
Conclusions
Bayesian Optimization with a Gaussian Process surrogate allows us to tune the hyperparameters, reaching better performance of the neural network with few iterations and few resources.
References
https://distill.pub/2020/bayesian-optimization/
https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.gaussianprocess.org/#williams-02
https://www.gaussianprocess.org/gpml/chapters/RW.pdf
https://www.asc.ohio-state.edu/gan.1/teaching/spring04/Chapter3.pdf