Hyperparameters selection using Bayesian Optimization with GPyOpt over a Keras Neural Network


One of the main concerns in the development of neural networks is the correct selection of hyperparameters. These are values that we usually choose from experience or from the literature before training a network, but there is no guarantee that they are the best possible choices, which can be counterproductive in new models or for inexperienced developers. To address this, we are going to use a method known as Bayesian optimization, based on Gaussian processes, and apply it to a neural network built to recognize handwritten digits.

Gaussian processes

A Gaussian process is a random process where any point x∈Rd is assigned a random variable f(x) and where the joint distribution of a finite number of these variables p(f(x1),…,f(xN)) is itself Gaussian:

$$p(f \mid X) = \mathcal{N}(f \mid \mu, K) \tag{1}$$

In Equation (1), f=(f(x1),…,f(xN)), μ=(m(x1),…,m(xN)) and Kij=κ(xi,xj). m is the mean function, and it is common to use m(x)=0, as GPs are flexible enough to model the mean arbitrarily well. κ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, …) is defined by K. If points xi and xj are considered to be similar by the kernel, the function values at these points, f(xi) and f(xj), can be expected to be similar too.
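For illustration, here is a minimal NumPy sketch of a common choice of κ, the squared exponential (RBF) kernel; the rbf_kernel name and its length_scale/variance parameters are just for this example, not something used later in the article's code.

import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential kernel: kappa(x, x') = variance * exp(-||x - x'||^2 / (2 * length_scale^2))."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2
    sqdist = np.sum(X1 ** 2, axis=1)[:, None] + np.sum(X2 ** 2, axis=1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / length_scale ** 2)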

Given a training dataset with noise-free function values f at inputs X, a GP prior can be converted into a GP posterior p(f*|X*,X,f) which can then be used to make predictions f* at new inputs X*. By definition of a GP, the joint distribution of the observed values f and the predictions f* is again Gaussian and can be partitioned as:

$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix}\right) \tag{2}$$

where K* = κ(X, X*) and K** = κ(X*, X*). With N training points and N* new input points, K is an N×N matrix, K* an N×N* matrix and K** an N*×N* matrix. Using the standard rules for conditioning Gaussians, the predictive distribution is given by:

$$p(f_* \mid X_*, X, f) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*), \qquad \mu_* = K_*^T K^{-1} f, \qquad \Sigma_* = K_{**} - K_*^T K^{-1} K_* \tag{3}$$

An intuitive way to look at it is that the Gaussian process gives us a probability distribution over candidate functions, which we can use to estimate where the global minimum is likely to be while evaluating the function as few times as possible.
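Here is a minimal sketch of the conditioning step in Equation (3), reusing the rbf_kernel example above and assuming a zero mean function and noise-free training targets; it is an illustration, not part of the article's later code.

def gp_posterior(X_s, X_train, f_train, length_scale=1.0, variance=1.0):
    """Predictive mean and covariance of the GP at new inputs X_s (Equation 3)."""
    K = rbf_kernel(X_train, X_train, length_scale, variance)      # N x N
    K_s = rbf_kernel(X_train, X_s, length_scale, variance)        # N x N*
    K_ss = rbf_kernel(X_s, X_s, length_scale, variance)           # N* x N*
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(X_train)))        # small jitter for numerical stability
    mu_s = K_s.T @ K_inv @ f_train                                # predictive mean mu*
    cov_s = K_ss - K_s.T @ K_inv @ K_s                            # predictive covariance Sigma*
    return mu_s, cov_s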



Bayesian optimization

Many optimization problems in machine learning are black-box optimization problems where the objective function f(x) is a black-box function. We do not have an analytical expression for f nor do we know its derivatives. Evaluation of the function is restricted to sampling at a point x and getting a possibly noisy response.

If f is cheap to evaluate we could sample at many points e.g. via grid search, random search, or numeric gradient estimation. However, if function evaluation is expensive e.g. tuning hyperparameters of a deep neural network, probe drilling for oil at given geographic coordinates, or evaluating the effectiveness of a drug candidate taken from a chemical search space then it is important to minimize the number of samples drawn from the black box function f.

This is the domain where Bayesian optimization techniques are most useful. They attempt to find the global optimum in a minimum number of steps. Bayesian optimization incorporates prior belief about f and updates the prior with samples drawn from f to get a posterior that better approximates f. The model used for approximating the objective function is called surrogate model. Bayesian optimization also uses an acquisition function that directs sampling to areas where an improvement over the current best observation is likely.
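To make the acquisition step concrete, here is a minimal sketch of the expected improvement acquisition function for minimization under a Gaussian surrogate. It is a simplified stand-in for what GPyOpt computes internally; mu and sigma are assumed to come from a GP posterior like the one sketched above.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected improvement (for minimization) at candidate points, given the
    surrogate's predictive mean mu, standard deviation sigma, and the best
    (lowest) objective value f_best observed so far."""
    sigma = np.maximum(sigma, 1e-12)       # avoid division by zero
    improvement = f_best - mu - xi         # expected gain over the current best
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)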


Applying GPyOpt to a neural network

Now that we know how Bayesian optimization works, we are going to apply it to a real-life case. We will create a simple neural network using the Keras library and apply the Gaussian-process optimization to it. For this we will use the MNIST.npz dataset, which contains images of handwritten digits and which you can download here. Next, we will create the model, optimize it and finally train it. To do this we will use the following functions and select the hyperparameters to tune.

Importing modules

# Import modules

# Keras
import tensorflow.keras as K

# GPyOpt - note that capitalization matters in the module and class names
import GPyOpt
from GPyOpt.methods import BayesianOptimization

# numpy
import numpy as np

Function to create the Keras model: hyperparameters (lambtha, keep_prob)

def build_model(nx, layers, activations, lambtha, keep_prob):
    """
    Function that builds a neural network with the Keras library
    Args:
      nx is the number of input features to the network
      layers is a list containing the number of nodes in each layer of the
      network
      activations is a list containing the activation functions used for
      each layer of the network
      lambtha is the L2 regularization parameter
      keep_prob is the probability that a node will be kept for dropout
    Returns: the keras model
    """
    inputs = K.Input(shape=(nx,))
    regularizer = K.regularizers.l2(float(lambtha))

    output = K.layers.Dense(layers[0],
                            activation=activations[0],
                            kernel_regularizer=regularizer)(inputs)

    hidden_layers = range(len(layers))[1:]

    for i in hidden_layers:
        dropout = K.layers.Dropout(1 - float(keep_prob))(output)
        output = K.layers.Dense(layers[i], activation=activations[i],
                                kernel_regularizer=regularizer)(dropout)

    model = K.Model(inputs, output)

    return model

Function to optimize the model: hyperparameters (alpha, beta1)

def optimize_model(network, alpha, beta1, beta2):
    """
    Function that sets up Adam optimization for a keras model with categorical
    crossentropy loss and accuracy metrics
    Args:
    network is the model to optimize
    alpha is the learning rate
    beta1 is the first Adam optimization parameter
    beta2 is the second Adam optimization parameter
    Returns: None
    """
    adam = K.optimizers.Adam(learning_rate=float(alpha),
                             beta_1=float(beta1),
                             beta_2=float(beta2))

    network.compile(optimizer=adam,
                    loss="categorical_crossentropy",
                    metrics=['accuracy'])

Function to train the model with early stopping and saving of the best model: hyperparameter (batch_size)

def train_model(network, data, labels, batch_size, epochs,
                validation_data=None, early_stopping=False,
                patience=0, learning_rate_decay=False,
                alpha=0.1, decay_rate=1, save_best=False,
                filepath=None, verbose=False, shuffle=False):
    """
    Function that trains a model using mini-batch gradient descent
    Args:
    network is the model to train
    data is a numpy.ndarray of shape (m, nx) containing the input data
    labels is a one-hot numpy.ndarray of shape (m, classes) containing
    the labels of data
    batch_size is the size of the batch used for mini-batch gradient descent
    epochs is the number of passes through data for mini-batch gradient descent
    validation_data is the data to validate the model with, if not None
    Returns: the History object generated by training the model
    """
    def lr_schedule(epoch):
        """Inverse time decay of the learning rate (renamed so it does not
        shadow the learning_rate_decay flag)"""
        return alpha / (1 + (decay_rate * epoch))

    callbacks = []

    if validation_data:
        if early_stopping:
            early_stop = K.callbacks.EarlyStopping(patience=patience)
            callbacks.append(early_stop)

        if learning_rate_decay:
            decay = K.callbacks.LearningRateScheduler(lr_schedule,
                                                      verbose=verbose)
            callbacks.append(decay)

    if save_best:
        save = K.callbacks.ModelCheckpoint(filepath, save_best_only=True)
        callbacks.append(save)

    train = network.fit(x=data,
                        y=labels,
                        batch_size=int(batch_size),
                        epochs=epochs,
                        validation_data=validation_data,
                        callbacks=callbacks,
                        verbose=verbose,
                        shuffle=shuffle)

    return train

Using these functions we can create a generic neural network with several hyperparameters that we can optimize. In the model-creation function we will tune lambtha and keep_prob, the regularization parameters used by L2 regularization and dropout respectively. We could also modify the number of layers, the number of neurons in each of them, and the activation functions of each layer, but this would make the optimization take much longer, and for academic purposes we will keep things as simple as possible. In the optimization function we use the Adam method, which has three hyperparameters: alpha, beta_1 and beta_2. According to the literature, the parameters with the greatest influence on our network are alpha and beta_1, while beta_2 usually takes values close to 0.999, so we keep it fixed. Finally, in the training function we will tune the batch_size parameter: given the size of the dataset we divide it into small batches, and ideally we would take the smallest possible value at the cost of computing capacity, but we will see what happens in the results. A quick sketch of how the three functions fit together is shown below.
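Here is a minimal usage sketch with a single, hand-picked set of hyperparameters; the values are illustrative only, and X_train, Y_train_oh, X_valid and Y_valid_oh are assumed to have been loaded as in the objective function defined later.

# Hypothetical hand-picked hyperparameters, just to show the call sequence
network = build_model(784, [256, 256, 10], ['relu', 'relu', 'softmax'],
                      lambtha=0.0001, keep_prob=0.95)
optimize_model(network, alpha=0.001, beta1=0.9, beta2=0.999)
history = train_model(network, X_train, Y_train_oh, batch_size=64, epochs=5,
                      validation_data=(X_valid, Y_valid_oh),
                      early_stopping=True, patience=3)
print(history.history['val_loss'][-1])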

The first step in building our optimization system is to write a function that creates the different networks, evaluates them, and returns the loss obtained with the tested hyperparameters. To do this we pass the function a vector x containing all the hyperparameters we want to evaluate.

def object_function(x):
    """
    Function that sets the hyperparameters of a keras network:
        Args: x is a vector containing the parameters to optimize and train
            lambtha is the L2 regularization parameter
            keep_prob is the probability that a node will be kept for dropout
            alpha is the learning rate in the Adam optimizer
            beta1 is the first Adam optimization parameter
            batch_size is the size of the batch used for mini-batch gradient descent
        Returns: the validation loss of the model
    """
    # x is a 5-dimensional vector with the parameters we want to optimize
    lambtha = x[:, 0]
    keep_prob = x[:, 1]
    alpha = x[:, 2]
    beta1 = x[:, 3]
    batch_size = x[:, 4]

    # Loading the handwritten digits database (MNIST)
    datasets = np.load('/content/sample_data/MNIST.npz')
    X_train = datasets['X_train']
    X_train = X_train.reshape(X_train.shape[0], -1)
    Y_train = datasets['Y_train']
    Y_train_oh = one_hot(Y_train)
    X_valid = datasets['X_valid']
    X_valid = X_valid.reshape(X_valid.shape[0], -1)
    Y_valid = datasets['Y_valid']
    Y_valid_oh = one_hot(Y_valid)

    # Building the model using the Keras library
    network = build_model(784, [256, 256, 10], ['relu', 'relu', 'softmax'],
                          lambtha, keep_prob)

    # Optimizing the model using the Adam optimizer
    beta2 = 0.999
    optimize_model(network, alpha, beta1, beta2)

    # Training the model with early stopping and learning rate decay
    epochs = 100
    history = train_model(network, X_train, Y_train_oh, batch_size, epochs,
                          validation_data=(X_valid, Y_valid_oh), early_stopping=True,
                          patience=3, learning_rate_decay=True)

    return history.history['val_loss'][-1]
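The one_hot helper used above is not defined in this article; here is a minimal sketch of what it is assumed to do, using Keras' to_categorical utility.

def one_hot(labels, classes=None):
    """Convert a vector of integer labels into a one-hot matrix of shape (m, classes)."""
    return K.utils.to_categorical(labels, num_classes=classes)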

Once we have the function we want to optimize, we have to set the bounds within which the different variables will be evaluated, so that the Bayesian optimization can search for the best combination of parameters inside those bounds. These bounds are chosen from the literature or from previous experience.

# Setting the bounds of the network parameters for the Bayesian optimization
bounds = [{'name': 'lambtha', 'type': 'continuous', 'domain': (0.0001, 0.0005)},
          {'name': 'keep_prob', 'type': 'continuous', 'domain': (0.80, 0.95)},
          {'name': 'alpha', 'type': 'continuous', 'domain': (0.001, 0.005)},
          {'name': 'beta1', 'type': 'continuous', 'domain': (0.9, 0.99)},
          {'name': 'batch_size', 'type': 'discrete', 'domain': (50, 70)}]

Finally, we create the optimizer with the GPyOpt library, passing it our function and the bounds defined above. We also set stopping conditions so that the optimization ends after 30 iterations, or earlier if the distance between consecutive optimal points becomes very small, to save computation time.

# Creating the GPyOpt method using Bayesian Optimization
my_Bayes_opt = GPyOpt.methods.BayesianOptimization(object_function, domain=bounds)

# Stop conditions
max_time = None
max_iter = 30
tolerance = 1e-8

# Running the method
my_Bayes_opt.run_optimization(max_iter=max_iter,
                              max_time=max_time,
                              eps=tolerance)
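Once the run finishes, the best point found can be read off the optimizer. Here is a minimal sketch, assuming GPyOpt's x_opt and fx_opt attributes, of how a report like the one shown in the results below can be printed.

# Printing the best hyperparameters found and the corresponding loss
names = ['lambtha', 'keep_prob', 'alpha', 'beta1', 'batch_size']
print("===================")
for name, value in zip(names, my_Bayes_opt.x_opt):
    print("Value of {} that minimises the losses in the network is: {}".format(name, value))
print("Minimum value of the loss: {}".format(my_Bayes_opt.fx_opt))
print("=====================")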

Results

After a process of around 2 hours, we find that the best combination of parameters within the bounds and for the proposed network is the following.

===================
Value of lambtha that minimises the losses in the network is: 0.0005
Value of keep_prob that minimises the losses in the network is: 0.95
Value of alpha that minimises the losses in the network is: 0.005
Value of beta1 that minimises the losses in the network is: 0.9779735360107488
Value of batch_size that minimises the losses in the network is: 70.0
Minimum value of the loss: 0.32000303268432617
=====================

Also, we can visualize how the algorithm explored the space by looking at the distance between consecutive evaluations. Most of the time there is a sizeable distance between evaluations but on occasion, we see consecutive evaluations that are very close to each other - these evaluations typically correspond to a reduction in the value of the best-selected sample.

[Figure: distance between consecutive evaluations across iterations, together with the best selected sample.]
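A convergence plot like this can be generated directly from the optimizer; a minimal sketch, assuming GPyOpt's plot_convergence method:

# Plot the distance between consecutive evaluations and the best objective value per iteration
my_Bayes_opt.plot_convergence()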

If you want to check the full code and try it yourself, here is the link to a Google Colab notebook with the full code and a simple exercise:

https://colab.research.google.com/drive/1OWMyV8poLSJ0PV6Vc_YBVFtYsS6U7dZc?authuser=1#scrollTo=xlr91nN-lgvw

Conclusions

If we look carefully at the optimal hyperparameters, they tend to have values very close to the bounds, which suggests that the true optimum of the function very possibly lies outside the bounds we established, so it would be worth re-evaluating them. The second thing we can see is that in a small number of steps, only the first 3 iterations, we managed to find values very close to the optimum. In the last 3 iterations we again see a significant change, which supports the idea that the optimum lies outside the established bounds.

Bibliography
Gaussian processes: https://krasserm.github.io/2018/03/19/gaussian-processes/
Bayesian optimization: https://krasserm.github.io/2018/03/21/bayesian-optimization/
GPyOpt: https://www.blopig.com/blog/wp-content/uploads/2019/10/GPyOpt-Tutorial1.html
GPyOpt constrained optimization: https://nbviewer.org/github/SheffieldML/GPyOpt/blob/devel/manual/GPyOpt_constrained_optimization.ipynb
