Using Gaussian Process in Bayesian Optimization
Thanks to https://haciaelespacio.aem.gob.mx/revistadigital/articul.php?interior=827


In this post, I will explain the use of Gaussian Processes in Bayesian Optimization, applied to hyperparameter tuning, specifically in an existing Transfer Learning project.

Transfer Learning Project.

The base model used for tuning the hyperparameters is a Transfer Learning project which has the following main blocks:

1.      I’m using the CIFAR10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), with 60000 32x32 color images in 10 classes, with 6000 images per class.

2.      The first block covers preprocessing and adaptation of the inputs so they can be fed to the predefined Xception model. This is a model pre-trained on the ImageNet dataset, and its default input size is 299x299. In order to enlarge the training set, I’m using data augmentation with a horizontal flip of the images.

3.      The second block is the Xception network itself, taking care to leave out its head, since that layer is optimized for the ImageNet dataset.

4.      The third block adds some layers at the output of Xception in order to adapt it to the dataset used (CIFAR10). These layers include the output layer, a softmax in charge of categorizing the inputs into the 10 classes of CIFAR10.

5.      The fourth block is the Compile & Fit process; a minimal sketch of these blocks is shown after this list.
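The following Keras sketch illustrates these blocks. It is only an illustration under assumptions: the layer sizes, the frozen base, and the hyperparameter values shown are placeholders, not the exact architecture of the project.

```python
# Minimal sketch of the transfer learning blocks (TensorFlow/Keras).
# Layer sizes and hyperparameter values are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

# Blocks 1-2: adapt CIFAR10 inputs (32x32) to the Xception input size (299x299);
# horizontal-flip augmentation is applied separately to the training data.
inputs = layers.Input(shape=(32, 32, 3))
x = layers.Lambda(lambda img: tf.image.resize(img, (299, 299)))(inputs)
x = layers.Lambda(tf.keras.applications.xception.preprocess_input)(x)

# Block 3: Xception pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.Xception(include_top=False, weights='imagenet', pooling='avg')
base.trainable = False
x = base(x, training=False)

# Block 4: additional layers trained on CIFAR10, ending in a 10-class softmax.
x = layers.BatchNormalization(momentum=0.05)(x)                      # Momentum_1
x = layers.Dense(256, activation='relu',
                 kernel_regularizer=regularizers.l2(0.05))(x)        # L2
x = layers.Dropout(0.5)(x)                                           # Dropout
outputs = layers.Dense(10, activation='softmax',
                       kernel_regularizer=regularizers.l2(0.05))(x)  # L2_2

# Block 5: compile & fit.
model = models.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
```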

Each one of these blocks has several hyperparameters and the goal is to maximize the validation accuracy metric. 

Which hyperparameters?

The original results show a Validation Accuracy of 75.37% while the Training Accuracy is 97.53%; this gap is a good indicator of overfitting. Based on this observation, the hyperparameters to optimize should be those with a direct impact on reducing overfitting, such as momentum, L2, and dropout.

What is a Gaussian Process?

In the book Gaussian Processes for Machine Learning by Rasmussen & Williams there is a complete definition of the Gaussian Process. I will try to explain the intuition behind it in my own words.

First, it is necessary to explain that a Gaussian Probability Distribution is a continuous probability distribution defined by two parameters (mean and standard deviation). Thanks to its features – symmetry, bell shape, and its role in the Central Limit Theorem – it is widely used in machine learning, because from these two parameters it is possible to compute a prediction and to attach a probability to that prediction.

A Gaussian Process is a generalization of the Gaussian Probability Distribution: it is a stochastic process that governs the properties of a function. For each value of x we have a continuous set of possible values of f(x), and each of these values has a probability determined by the mean and the standard deviation. In a Gaussian Process, the best estimate of f(x) is the mean, and this estimate becomes exact where the variance is zero.
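As a concrete illustration (using scikit-learn here just for clarity; the project itself relies on GPyOpt), the following sketch shows that for every x the GP returns a mean and a standard deviation, and that the uncertainty shrinks near the observed points:

```python
# A small sketch of the Gaussian Process idea (scikit-learn, for illustration only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def black_box(x):
    return np.sin(3 * x)                        # the unknown function f(x)

X_obs = np.array([[0.2], [0.9], [1.7]])         # a few observed points
y_obs = black_box(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5))
gp.fit(X_obs, y_obs)

X_new = np.linspace(0.0, 2.0, 50).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)  # for every x: a mean and a std
# std is close to zero near the observed points and grows far away from them
```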

Bayesian Optimization

The use of resources – computational, time, money, people – in a highly complex project is a serious restriction when the target is to find a prediction based on the existing results.

There are several methodologies to find one value from a set of possibilities:

-        Random choice

-        Grid method

-        Statistical methods

Random choice and grid methods consume a lot of resources and are very inefficient, whereas statistical methods offer greater efficiency: it is possible to determine the f(x) that has the maximum probability of being the best and, in this way, to find the optimum with few points (few resources).

The Bayesian Optimization used in this project runs the Expected Improvement algorithm, which considers how much the objective can improve: the next point to evaluate is the one with the largest expected improvement. One critical point is the number of iterations, because only after several iterations, when the newly proposed points stop changing, is it reasonable to take the best value found as the optimal maximum.
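A sketch of how Expected Improvement can be computed from the GP posterior (for a maximization problem) is shown below; the exact formulation used internally by GPyOpt may differ in details.

```python
# Expected Improvement from the GP posterior mean/std (maximization), a minimal sketch.
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far, xi=0.01):
    """EI at candidate points, given the GP posterior and the best value observed so far."""
    std = np.maximum(std, 1e-9)                 # avoid division by zero
    improvement = mean - best_so_far - xi       # how much each candidate could improve
    z = improvement / std
    return improvement * norm.cdf(z) + std * norm.pdf(z)

# The next point to evaluate is the candidate with the largest expected improvement.
```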

Findings.

As explained in the Transfer Learning Project section, several hyperparameters were chosen to optimize the Validation Accuracy metric.

The optimization has been implemented with the Python library GPyOpt. Each variable to optimize must include the following information:

-        Name of the variable in the function to optimize

-        Type of variable: discrete or continuous

-        Minimum and maximum values, defining the range within which the next best point is searched

Besides this, GPyOpt needs additional information:

-        Type of model: Gaussian Process

More information about the parameters used by GPyOpt can be found at https://github.com/SheffieldML/GPyOpt/blob/master/GPyOpt/methods/bayesian_optimization.py
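A minimal GPyOpt sketch for a single hyperparameter (the momentum of the Batch Normalization layer) could look like the following. The function train_and_evaluate is a hypothetical wrapper around the build/compile/fit steps of the model that returns the validation accuracy; it is not part of GPyOpt.

```python
# Single-variable Bayesian Optimization with GPyOpt (sketch).
import GPyOpt

def objective(x):
    # GPyOpt passes a 2D array of candidate points, one row per candidate.
    momentum = float(x[0, 0])
    val_acc = train_and_evaluate(momentum=momentum)  # hypothetical training wrapper
    return -val_acc                                  # GPyOpt minimizes, so negate accuracy

domain = [{'name': 'momentum_1', 'type': 'continuous', 'domain': (0.001, 0.9)}]

optimizer = GPyOpt.methods.BayesianOptimization(
    f=objective,
    domain=domain,
    model_type='GP',          # Gaussian Process surrogate
    acquisition_type='EI')    # Expected Improvement

optimizer.run_optimization(max_iter=20)
print(optimizer.x_opt, -optimizer.fx_opt)            # best momentum and its validation accuracy
```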

First Hyperparameter / Momentum_1

This momentum hyperparameter is used in the Batch Normalization layer after the Xception neural network.

Range: 0.001 to 0.9

Initial Variable Value: 0.05

Initial Validation Accuracy: 0.8137

Final Variable Value: 0.0035208

Final Validation Accuracy: 0.82220

Second Hyperparameter / L2

The L2 hyperparameter penalizes large (high-complexity) weights in order to reduce overfitting. This regularizer is applied to a Dense layer after the Xception network.
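For reference, this is roughly where the hyperparameter enters the model (the layer size is an illustrative assumption; the regularization value is the final one found below):

```python
# Dense layer after the Xception base with L2 regularization (sketch).
from tensorflow.keras import layers, regularizers

dense = layers.Dense(256, activation='relu',
                     kernel_regularizer=regularizers.l2(0.02439626))
```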

Range: 0.001 to 0.999

Initial Variable Value: 0.05

Initial Validation Accuracy: 0.8220

Final Variable Value: 0.02439626

Final Validation Accuracy: 0.82770


Third Hyperparameter / L2_2

This regularizer is applied to a Dense layer placed before the softmax used to categorize the outputs into the 10 classes.

Range: 0.001 to 0.999

Initial Variable Value: 0.05

Initial Validation Accuracy: 0.8220

Final Variable Value: 0.02167783

Final Validation Accuracy: 0.83080

Fourth Hyperparameter / Epochs

An epoch is one cycle through the full training dataset; this hyperparameter sets how many of these cycles are run.

Discrete type

Range: 1 to 1000

Initial Variable Value: 50

Initial Validation Accuracy: 0.83080

Final Variable Value: 400

Final Validation Accuracy: 0.8322


Fifth Hyperparameter / Dropout

The Dropout hyperparameter is the fraction of units that are randomly dropped during training, which also helps reduce overfitting.

Continuous type

Range: 0 to 1

Initial Variable Value: 0.50

Initial Validation Accuracy: 0.83080

Final Variable Value: 0.05500561

Final Validation Accuracy: 0.83270


Multivariate Bayesian Optimization

The optimizations above were implemented with only one hyperparameter at a time, but one of the features of the Gaussian Process is that it is multivariate, which means the black-box output can depend on more than one variable.

In the experiment, I included all the explained hyperparameters in a single optimization loop; a sketch of the combined search space is shown below.
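The combined domain, with the ranges listed in the sections above, could look like this (train_and_evaluate is the same hypothetical wrapper as before; the optimizer is built and run exactly as in the single-variable case):

```python
# Multivariate domain for GPyOpt, combining all five hyperparameters (sketch).
domain = [
    {'name': 'momentum_1', 'type': 'continuous', 'domain': (0.001, 0.9)},
    {'name': 'l2',         'type': 'continuous', 'domain': (0.001, 0.999)},
    {'name': 'l2_2',       'type': 'continuous', 'domain': (0.001, 0.999)},
    {'name': 'epochs',     'type': 'discrete',   'domain': tuple(range(1, 1001))},
    {'name': 'dropout',    'type': 'continuous', 'domain': (0, 1)},
]

def objective(x):
    # Each row of x holds one candidate: (momentum_1, l2, l2_2, epochs, dropout).
    momentum, l2, l2_2, epochs, dropout = x[0]
    val_acc = train_and_evaluate(momentum, l2, l2_2, int(epochs), dropout)  # hypothetical wrapper
    return -val_acc
```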

Conclusions

Bayesian Optimization with a Gaussian Process allows us to tune the hyperparameters, reaching better performance of the neural network with few iterations and few resources.

References

https://distill.pub/2020/bayesian-optimization/

https://distill.pub/2019/visual-exploration-gaussian-processes/

https://www.gaussianprocess.org/#williams-02

https://www.gaussianprocess.org/gpml/chapters/RW.pdf

https://www.asc.ohio-state.edu/gan.1/teaching/spring04/Chapter3.pdf
