Hyperparameters selection using Bayesian Optimization with GPyOpt over a Keras Neural Network


One of the main concerns in the development of neural networks is the correct selection of hyperparameters. These are values that we usually choose from experience or from the literature before training a network, but there is no guarantee that they are the best possible choices, which can be counterproductive in new models or for inexperienced developers. To address this, we are going to use a method known as Bayesian optimization, based on Gaussian processes, and apply it to a neural network built to recognize handwritten digits.

Gaussian processes

A Gaussian process is a random process where any point x∈Rd is assigned a random variable f(x) and where the joint distribution of a finite number of these variables p(f(x1),…,f(xN)) is itself Gaussian:

$$p(f \mid X) = \mathcal{N}(f \mid \mu, K) \tag{1}$$

In Equation (1), f=(f(x1),…,f(xN)), μ=(m(x1),…,m(xN)) and Kij=κ(xi,xj). m is the mean function, and it is common to use m(x)=0, as GPs are flexible enough to model the mean arbitrarily well. κ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, …) is defined by K. If points xi and xj are considered to be similar by the kernel, the function values at these points, f(xi) and f(xj), can be expected to be similar too.
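For illustration, here is a minimal NumPy sketch of a common choice of κ, the squared exponential (RBF) kernel; the rbf_kernel name and its length_scale/variance parameters are just for this example, not something used later in the article's code.

import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential kernel: kappa(x, x') = variance * exp(-||x - x'||^2 / (2 * length_scale^2))."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2
    sqdist = np.sum(X1 ** 2, axis=1)[:, None] + np.sum(X2 ** 2, axis=1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / length_scale ** 2)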

Given a training dataset with noise-free function values f at inputs X, a GP prior can be converted into a GP posterior p(f*|X*,X,f) which can then be used to make predictions f* at new inputs X*. By definition of a GP, the joint distribution of the observed values f and the predictions f* is again Gaussian and can be partitioned as:

$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix}\right) \tag{2}$$

where K* = κ(X, X*) and K** = κ(X*, X*). With N training points and N* new input points, K is an N×N matrix, K* an N×N* matrix and K** an N*×N* matrix. Using the standard rules for conditioning Gaussians, the predictive distribution is given by:

$$p(f_* \mid X_*, X, f) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*), \qquad \mu_* = K_*^T K^{-1} f, \qquad \Sigma_* = K_{**} - K_*^T K^{-1} K_* \tag{3}$$

An intuitive way to look at it is that the Gaussian process gives us a probability distribution over candidate functions, which we can use to estimate where the global minimum is likely to be while evaluating the function as few times as possible.
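Here is a minimal sketch of the conditioning step in Equation (3), reusing the rbf_kernel example above and assuming a zero mean function and noise-free training targets; it is an illustration, not part of the article's later code.

def gp_posterior(X_s, X_train, f_train, length_scale=1.0, variance=1.0):
    """Predictive mean and covariance of the GP at new inputs X_s (Equation 3)."""
    K = rbf_kernel(X_train, X_train, length_scale, variance)      # N x N
    K_s = rbf_kernel(X_train, X_s, length_scale, variance)        # N x N*
    K_ss = rbf_kernel(X_s, X_s, length_scale, variance)           # N* x N*
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(X_train)))        # small jitter for numerical stability
    mu_s = K_s.T @ K_inv @ f_train                                # predictive mean mu*
    cov_s = K_ss - K_s.T @ K_inv @ K_s                            # predictive covariance Sigma*
    return mu_s, cov_s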



Bayesian optimization

Many optimization problems in machine learning are black-box optimization problems where the objective function f(x) is a black-box function. We do not have an analytical expression for f nor do we know its derivatives. Evaluation of the function is restricted to sampling at a point x and getting a possibly noisy response.

If f is cheap to evaluate we could sample at many points e.g. via grid search, random search, or numeric gradient estimation. However, if function evaluation is expensive e.g. tuning hyperparameters of a deep neural network, probe drilling for oil at given geographic coordinates, or evaluating the effectiveness of a drug candidate taken from a chemical search space then it is important to minimize the number of samples drawn from the black box function f.

This is the domain where Bayesian optimization techniques are most useful. They attempt to find the global optimum in a minimum number of steps. Bayesian optimization incorporates prior belief about f and updates the prior with samples drawn from f to get a posterior that better approximates f. The model used for approximating the objective function is called surrogate model. Bayesian optimization also uses an acquisition function that directs sampling to areas where an improvement over the current best observation is likely.
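To make the acquisition step concrete, here is a minimal sketch of the expected improvement acquisition function for minimization under a Gaussian surrogate. It is a simplified stand-in for what GPyOpt computes internally; mu and sigma are assumed to come from a GP posterior like the one sketched above.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected improvement (for minimization) at candidate points, given the
    surrogate's predictive mean mu, standard deviation sigma, and the best
    (lowest) objective value f_best observed so far."""
    sigma = np.maximum(sigma, 1e-12)       # avoid division by zero
    improvement = f_best - mu - xi         # expected gain over the current best
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)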


Applying GPyOpt to a neural network

Now that we know how Bayesian optimization works, we are going to apply it to a real-life case. We will create a simple neural network using the Keras library and apply the Gaussian-process optimization to it. For this we will use the MNIST.npz dataset, which contains images of handwritten digits and which you can download here. Next, we will create the model, optimize it and finally train it. To do this we will use the following functions and select the hyperparameters to tune.

Importing modules

# Import modules

# Keras
import tensorflow.keras as K

# GPyOpt - note that capitalization matters in the module and class names
import GPyOpt
from GPyOpt.methods import BayesianOptimization

# numpy
import numpy as np

Function to create the Keras model: hyperparameters (lambtha, keep_prob)

def build_model(nx, layers, activations, lambtha, keep_prob):
    """
    Function that builds a neural network with the Keras library
    Args:
      nx is the number of input features to the network
      layers is a list containing the number of nodes in each layer of the
      network
      activations is a list containing the activation functions used for
      each layer of the network
      lambtha is the L2 regularization parameter
      keep_prob is the probability that a node will be kept for dropout
    Returns: the keras model
    """
    inputs = K.Input(shape=(nx,))
    regularizer = K.regularizers.l2(float(lambtha))

    output = K.layers.Dense(layers[0],
                            activation=activations[0],
                            kernel_regularizer=regularizer)(inputs)

    hidden_layers = range(len(layers))[1:]

    for i in hidden_layers:
        dropout = K.layers.Dropout(1 - float(keep_prob))(output)
        output = K.layers.Dense(layers[i], activation=activations[i],
                                kernel_regularizer=regularizer)(dropout)

    model = K.Model(inputs, output)

    return model

Function to optimize the model: hyperparameters (alpha, beta1)

def optimize_model(network, alpha, beta1, beta2):
    """
    Function that sets up Adam optimization for a keras model with categorical
    crossentropy loss and accuracy metrics
    Args:
    network is the model to optimize
    alpha is the learning rate
    beta1 is the first Adam optimization parameter
    beta2 is the second Adam optimization parameter
    Returns: None
    """
    adam = K.optimizers.Adam(learning_rate=float(alpha),
                             beta_1=float(beta1),
                             beta_2=float(beta2))

    network.compile(optimizer=adam,
                    loss="categorical_crossentropy",
                    metrics=['accuracy'])

Function to train the model with early stopping and saving of the best model: hyperparameter (batch_size)

def train_model(network, data, labels, batch_size, epochs,
                validation_data=None, early_stopping=False,
                patience=0, learning_rate_decay=False,
                alpha=0.1, decay_rate=1, save_best=False,
                filepath=None, verbose=False, shuffle=False):
    """
    Function that trains a model using mini-batch gradient descent
    Args:
    network is the model to train
    data is a numpy.ndarray of shape (m, nx) containing the input data
    labels is a one-hot numpy.ndarray of shape (m, classes) containing
    the labels of data
    batch_size is the size of the batch used for mini-batch gradient descent
    epochs is the number of passes through data for mini-batch gradient descent
    validation_data is the data to validate the model with, if not None
    Returns: the History object generated by training the model
    """
    def lr_schedule(epoch):
        """Inverse time decay of the learning rate (renamed so it does not
        shadow the learning_rate_decay flag)"""
        return alpha / (1 + (decay_rate * epoch))

    callbacks = []

    if validation_data:
        if early_stopping:
            early_stop = K.callbacks.EarlyStopping(patience=patience)
            callbacks.append(early_stop)

        if learning_rate_decay:
            decay = K.callbacks.LearningRateScheduler(lr_schedule,
                                                      verbose=verbose)
            callbacks.append(decay)

    if save_best:
        save = K.callbacks.ModelCheckpoint(filepath, save_best_only=True)
        callbacks.append(save)

    train = network.fit(x=data,
                        y=labels,
                        batch_size=int(batch_size),
                        epochs=epochs,
                        validation_data=validation_data,
                        callbacks=callbacks,
                        verbose=verbose,
                        shuffle=shuffle)

    return train

Using these functions we can create a generic neural network with several hyperparameters that we can optimize. In the model-creation function we will tune lambtha and keep_prob, the regularization parameters used by L2 regularization and dropout respectively. We could also modify the number of layers, the number of neurons in each of them, and the activation functions of each layer, but this would make the optimization take much longer, and for academic purposes we will keep things as simple as possible. In the optimization function we use the Adam method, which has three hyperparameters: alpha, beta_1 and beta_2. According to the literature, the parameters with the greatest influence on our network are alpha and beta_1, while beta_2 usually takes values close to 0.999, so we keep it fixed. Finally, in the training function we will tune the batch_size parameter: given the size of the dataset we divide it into small batches, and ideally we would take the smallest possible value at the cost of computing capacity, but we will see what happens in the results. A quick sketch of how the three functions fit together is shown below.
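Here is a minimal usage sketch with a single, hand-picked set of hyperparameters; the values are illustrative only, and X_train, Y_train_oh, X_valid and Y_valid_oh are assumed to have been loaded as in the objective function defined later.

# Hypothetical hand-picked hyperparameters, just to show the call sequence
network = build_model(784, [256, 256, 10], ['relu', 'relu', 'softmax'],
                      lambtha=0.0001, keep_prob=0.95)
optimize_model(network, alpha=0.001, beta1=0.9, beta2=0.999)
history = train_model(network, X_train, Y_train_oh, batch_size=64, epochs=5,
                      validation_data=(X_valid, Y_valid_oh),
                      early_stopping=True, patience=3)
print(history.history['val_loss'][-1])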

The first step in building our optimization system is to write a function that creates the different networks, evaluates them, and returns the loss obtained with the tested hyperparameters. To do this we pass the function a vector x containing all the hyperparameters we want to evaluate.

def object_function(x):
    """
    Function that sets the hyperparameters of a keras network:
        Args: x is a vector containing the parameters to optimize and train
            lambtha is the L2 regularization parameter
            keep_prob is the probability that a node will be kept for dropout
            alpha is the learning rate in the Adam optimizer
            beta1 is the first Adam optimization parameter
            batch_size is the size of the batch used for mini-batch gradient descent
        Returns: the validation loss of the model
    """
    # x is a 5-dimensional vector with the parameters we want to optimize
    lambtha = x[:, 0]
    keep_prob = x[:, 1]
    alpha = x[:, 2]
    beta1 = x[:, 3]
    batch_size = x[:, 4]

    # Loading the handwritten digits database (MNIST)
    datasets = np.load('/content/sample_data/MNIST.npz')
    X_train = datasets['X_train']
    X_train = X_train.reshape(X_train.shape[0], -1)
    Y_train = datasets['Y_train']
    Y_train_oh = one_hot(Y_train)
    X_valid = datasets['X_valid']
    X_valid = X_valid.reshape(X_valid.shape[0], -1)
    Y_valid = datasets['Y_valid']
    Y_valid_oh = one_hot(Y_valid)

    # Building the model using the Keras library
    network = build_model(784, [256, 256, 10], ['relu', 'relu', 'softmax'],
                          lambtha, keep_prob)

    # Optimizing the model using the Adam optimizer
    beta2 = 0.999
    optimize_model(network, alpha, beta1, beta2)

    # Training the model with early stopping and learning rate decay
    epochs = 100
    history = train_model(network, X_train, Y_train_oh, batch_size, epochs,
                          validation_data=(X_valid, Y_valid_oh), early_stopping=True,
                          patience=3, learning_rate_decay=True)

    return history.history['val_loss'][-1]
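The one_hot helper used above is not defined in this article; here is a minimal sketch of what it is assumed to do, using Keras' to_categorical utility.

def one_hot(labels, classes=None):
    """Convert a vector of integer labels into a one-hot matrix of shape (m, classes)."""
    return K.utils.to_categorical(labels, num_classes=classes)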

Once we have the function we want to optimize, we have to set the bounds within which the different variables will be evaluated, so that the Bayesian optimization can search for the best combination of parameters inside those bounds. These bounds are chosen from the literature or from previous experience.

# Setting the bounds of the network parameters for the Bayesian optimization
bounds = [{'name': 'lambtha', 'type': 'continuous', 'domain': (0.0001, 0.0005)},
          {'name': 'keep_prob', 'type': 'continuous', 'domain': (0.80, 0.95)},
          {'name': 'alpha', 'type': 'continuous', 'domain': (0.001, 0.005)},
          {'name': 'beta1', 'type': 'continuous', 'domain': (0.9, 0.99)},
          {'name': 'batch_size', 'type': 'discrete', 'domain': (50, 70)}]

Finally, we create the optimizer with the GPyOpt library, passing it our function and the bounds defined above. We also set stopping conditions so that the optimization ends after 30 iterations, or earlier if the distance between consecutive optimal points becomes very small, to save computation time.

# Creating the GPyOpt method using Bayesian Optimization
my_Bayes_opt = GPyOpt.methods.BayesianOptimization(object_function, domain=bounds)

# Stop conditions
max_time = None
max_iter = 30
tolerance = 1e-8

# Running the method
my_Bayes_opt.run_optimization(max_iter=max_iter,
                              max_time=max_time,
                              eps=tolerance)
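Once the run finishes, the best point found can be read off the optimizer. Here is a minimal sketch, assuming GPyOpt's x_opt and fx_opt attributes, of how a report like the one shown in the results below can be printed.

# Printing the best hyperparameters found and the corresponding loss
names = ['lambtha', 'keep_prob', 'alpha', 'beta1', 'batch_size']
print("===================")
for name, value in zip(names, my_Bayes_opt.x_opt):
    print("Value of {} that minimises the losses in the network is: {}".format(name, value))
print("Minimum value of the loss: {}".format(my_Bayes_opt.fx_opt))
print("=====================")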

Results

After a process of around 2 hours, we find that the best combination of parameters within the bounds and for the proposed network is the following.

===================
Value of lambtha that minimises the losses in the network is: 0.0005
Value of keep_prob that minimises the losses in the network is: 0.95
Value of alpha that minimises the losses in the network is: 0.005
Value of beta1 that minimises the losses in the network is: 0.9779735360107488
Value of batch_size that minimises the losses in the network is: 70.0
Minimum value of the loss: 0.32000303268432617
=====================

Also, we can visualize how the algorithm explored the space by looking at the distance between consecutive evaluations. Most of the time there is a sizeable distance between evaluations but on occasion, we see consecutive evaluations that are very close to each other - these evaluations typically correspond to a reduction in the value of the best-selected sample.

[Figure: distance between consecutive evaluations across iterations, together with the best selected sample.]
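A convergence plot like this can be generated directly from the optimizer; a minimal sketch, assuming GPyOpt's plot_convergence method:

# Plot the distance between consecutive evaluations and the best objective value per iteration
my_Bayes_opt.plot_convergence()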

If you want to check the full code and try it yourself, here is the link to a Google Colab notebook with the full code and a simple exercise:

https://colab.research.google.com/drive/1OWMyV8poLSJ0PV6Vc_YBVFtYsS6U7dZc?authuser=1#scrollTo=xlr91nN-lgvw

Conclusions

If we look carefully at the optimal hyperparameters, they tend to have values very close to the bounds, which suggests that the true optimum of the function very possibly lies outside the bounds we established, so it would be worth re-evaluating them. The second thing we can see is that in a small number of steps, only the first 3 iterations, we managed to find values very close to the optimum. In the last 3 iterations we again see a significant change, which supports the idea that the optimum lies outside the established bounds.

Bibliography
Gaussian processes: https://krasserm.github.io/2018/03/19/gaussian-processes/
Bayesian optimization: https://krasserm.github.io/2018/03/21/bayesian-optimization/
GPyOpt: https://www.blopig.com/blog/wp-content/uploads/2019/10/GPyOpt-Tutorial1.html
GPyOpt constrained optimization: https://nbviewer.org/github/SheffieldML/GPyOpt/blob/devel/manual/GPyOpt_constrained_optimization.ipynb
