Configure Deep Learning Architecture

Deep learning is used in a wide range of domains – e-commerce, supply chain, transportation, medicine, etc. – and there are good libraries available, such as Keras, TensorFlow, and PyTorch, that help you build models in just a few lines of code. However, configuring a good network is still a challenge.

Have you ever been confused about how to configure and tune a deep learning architecture to get better performance?

I was working on a deep learning project at General Mills and got stuck with poor accuracy. So I started researching ways to improve the architecture and tune hyperparameters to improve the performance of neural nets on a hold-out dataset, and I would like to share my learnings with you all.

Configuring a deep learning architecture is an art. There are no hard-and-fast rules or rules of thumb for any given problem. However, there are techniques that address specific issues when configuring and training a neural network.

Note - I am assuming the reader of this post has a basic working knowledge of deep learning and machine learning. I won't get into the mathematics of every technique, but wherever it is needed I will talk about it so everyone can follow along.

3 Major types of problems:

  • Not learning efficiently on the training set – The model could not capture the patterns in the training data and therefore performs poorly on the training set itself. This is known as underfitting.
  • Not generalizing – The model did exceptionally well on the training data but failed to generalize, giving poor performance on the test set. This is known as overfitting.
  • Problems with prediction – The stochastic training algorithm has a strong influence on the final model, causing high variance in its behavior and performance.

To tackle these problems:

  • Better Learning - Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
  • Better Generalization - Techniques that improve the performance of a neural network model on a holdout dataset.
  • Better Predictions - Techniques that reduce the variance in the performance of a final model.

One approach is to identify the type of problem you have and then look for a technique that works best for that problem.

Improving the performance of a deep learning model depends heavily on how you tune hyperparameters. Grid search is usually not a good idea: in deep learning there can be many hyperparameters, an exhaustive grid takes a lot of time, and it is difficult to know in advance which hyperparameters will matter most for your problem. Try grid search only when you have very few hyperparameters to tune. Otherwise, try random values and check how your model performs as you tune each hyperparameter.
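For example, here is a minimal random-search sketch in Python. The build_and_evaluate helper is hypothetical: it stands in for whatever code trains your model with a given configuration and returns a validation score, and the ranges are illustrative, not recommendations.

```python
import random

def build_and_evaluate(optimizer, learning_rate, hidden_units, hidden_layers, activation):
    """Hypothetical helper: train a model with these settings and return validation accuracy."""
    return random.random()  # placeholder, replace with real training and evaluation

best_score, best_config = float("-inf"), None
for _ in range(20):  # 20 random trials instead of an exhaustive grid
    config = {
        "optimizer": random.choice(["sgd", "adam", "rmsprop"]),
        "learning_rate": 10 ** random.uniform(-4, -1),  # sample on a log scale
        "hidden_units": random.choice([32, 64, 128, 256]),
        "hidden_layers": random.choice([1, 2, 3]),
        "activation": random.choice(["relu", "tanh"]),
    }
    score = build_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print("Best configuration found:", best_config)
```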

Again, there is no hard rule or fixed order to follow, but you can try this order when you start exploring and tuning hyperparameters to get better results:

  • Optimizer
  • Learning Rate
  • # of hidden units
  • # of hidden layers
  • Activation function

Let's explore these hyperparameters.

What is Optimizer?

The goal of machine learning algorithms is to minimize a loss function, where the loss function is a mathematical way to measure the performance of your model, i.e. how wrong your predictions are. During training, the algorithm tweaks the parameters (weights) of the model so as to minimize the loss function. But the questions are: how do we change the parameters? What is the method? By how much should we change them, and when? How will we know whether our tweaking of the weights is moving in the right direction?

To answer these questions, the optimizer comes into the picture. Optimizers tie together the loss function and the model parameters by updating the model in response to the output of the loss function. The loss function is the guide to the terrain, telling the optimizer when it is moving in the right or wrong direction.

One of the most popular optimizers is gradient descent. Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model; parameters refer to coefficients in linear regression and weights in neural networks. This algorithm is used to optimize/train most machine learning models. It is fast, robust, and flexible. Here is how gradient descent works (a minimal sketch follows the steps):

  1. Calculate what a small change in each individual weight would do to the loss function
  2. Adjust each individual weight based on its gradient (i.e. take a small step in the determined direction)
  3. Keep doing steps #1 and #2 until the loss function gets as low as possible
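As a concrete illustration of these three steps, here is a small NumPy sketch (not from the original article) that fits a toy linear model y ≈ wx + b with plain gradient descent:

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise; we fit w and b with plain gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0
learning_rate = 0.1

for step in range(200):
    y_pred = w * X[:, 0] + b
    error = y_pred - y
    loss = np.mean(error ** 2)             # how wrong the predictions are
    grad_w = 2 * np.mean(error * X[:, 0])  # step 1: what a small change in w does to the loss
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w            # step 2: small step against the gradient
    b -= learning_rate * grad_b            # step 3: repeat until the loss stops improving

print(f"w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
```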

Variants of Gradient Descent:

1) Stochastic Gradient Descent

2) Mini Batch Gradient Descent

3) Batch Gradient Descent

There are different algorithms used to further improve on gradient descent, and each has its pros and cons. Here are a few of them (a short sketch of how they are typically selected follows the list):

1) Momentum

2) Nesterov accelerated gradient

3) Adam (Adaptive Moment Estimation)

4) AdaDelta

5) AdaMax
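As a sketch of how these variants are typically selected in practice, here is how they map to optimizer classes in PyTorch's torch.optim module; the learning rates shown are only illustrative defaults.

```python
import torch
import torch.nn as nn

# A tiny model just to have parameters to optimize.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

optimizers = {
    "momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "nesterov": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True),
    "adam":     torch.optim.Adam(model.parameters(), lr=0.001),
    "adadelta": torch.optim.Adadelta(model.parameters()),
    "adamax":   torch.optim.Adamax(model.parameters(), lr=0.002),
}

# A typical training step with whichever optimizer you pick:
optimizer = optimizers["adam"]
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer.zero_grad()        # clear gradients from the previous step
loss = loss_fn(model(x), y)  # the loss measures how wrong the predictions are
loss.backward()              # compute gradients
optimizer.step()             # update the weights in response to the loss
```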

What is Learning Rate?

The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While a low learning rate might be a good idea in terms of making sure we do not miss any local minima, it also means we will take a long time to converge, especially if we get stuck on a plateau.

Changing our weights too fast by adding or subtracting too much (i.e. taking steps that are too large) can hinder our ability to minimize the loss function. We don’t want to make a jump so large that we skip over the optimal value for a given weight.
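To see the effect of the step size, here is a tiny self-contained sketch that minimizes the one-dimensional loss f(w) = (w - 5)^2 with different learning rates; the specific values are only illustrative.

```python
# Minimize f(w) = (w - 5)^2 with gradient descent at different learning rates.
def run(learning_rate, steps=25):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 5)         # df/dw
        w -= learning_rate * grad  # the update is proportional to the learning rate
    return w

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(f"lr={lr:<4} -> w after 25 steps: {run(lr):.3f}")
# Too small (0.01): still far from the optimum at w = 5 after 25 steps.
# Moderate (0.1, 0.5): converges to 5.
# Too large (1.1): each step overshoots and the value diverges.
```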

Closely related to the learning rate is how well gradients flow through your choice of nonlinearity, so here is a quick, opinionated comparison of activation functions (the sketch after this list shows how to swap them in a model):

  • Sigmoid: ancient, slow, poorly conditioned for learning with SGD
  • tanh: essentially a rescaled and shifted sigmoid; learns faster, but still best avoided
  • ReLU: quick, reliable, easy, not quite state of the art
  • Leaky ReLU: a little better than ReLU, but has an additional slope parameter to tune
  • PReLU: also better than ReLU and doesn't need to be tuned (the slope is learned), but can overfit a little
  • Randomized ReLU: maybe the best of both worlds; the default setting is usually good, and it doesn't overfit as much as ReLU
  • ELU: supposedly better than ReLU, though I have never seen it be so, and it uses more compute
  • Offset ReLU: subtract a small constant from ReLU to give it some negative activations; better than ReLU
  • Softmax: in a class of its own; usually only used at the end of the network as the final classifier
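Most of these come ready-made in deep learning libraries, so trying a different one is usually a one-line change. Here is a PyTorch sketch; the layer sizes are arbitrary, and there is no built-in "offset ReLU", so it is omitted.

```python
import torch.nn as nn

def make_mlp(act_factory):
    """Same architecture, different nonlinearity between the hidden layers."""
    return nn.Sequential(
        nn.Linear(20, 64), act_factory(),
        nn.Linear(64, 64), act_factory(),
        nn.Linear(64, 3),
        nn.Softmax(dim=1),  # softmax only at the end, as the final classifier
    )

candidates = {
    "sigmoid":    nn.Sigmoid,
    "tanh":       nn.Tanh,
    "relu":       nn.ReLU,
    "leaky_relu": lambda: nn.LeakyReLU(negative_slope=0.01),  # extra slope parameter to tune
    "prelu":      nn.PReLU,   # slope is learned during training
    "rrelu":      nn.RReLU,   # randomized leaky ReLU
    "elu":        nn.ELU,
}

models = {name: make_mlp(factory) for name, factory in candidates.items()}
```

(If you train with nn.CrossEntropyLoss, drop the final Softmax, since that loss expects raw logits.)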


How many hidden units should I use?

In most situations, there is no way to determine the best number of hidden units without training several networks and estimating the generalization error of each. If you have too few hidden units, you will get high training error and high generalization error due to underfitting and high statistical bias. If you have too many hidden units, you may get low training error but still have high generalization error due to overfitting and high variance.
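A minimal sketch of that "train several networks" loop, using PyTorch and synthetic data as a stand-in for a real training/validation split:

```python
import torch
import torch.nn as nn

# Synthetic data standing in for a real training/validation split.
torch.manual_seed(0)
X_train, y_train = torch.randn(800, 10), torch.randn(800, 1)
X_val, y_val = torch.randn(200, 10), torch.randn(200, 1)

def train_and_score(hidden_units, epochs=50):
    model = nn.Sequential(nn.Linear(10, hidden_units), nn.ReLU(),
                          nn.Linear(hidden_units, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        optimizer.step()
    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()  # generalization (hold-out) error

for units in [4, 16, 64, 256]:
    print(f"{units:>3} hidden units -> validation loss {train_and_score(units):.4f}")
```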

The best number of hidden units depends in a complex way on:

  • the numbers of input and output units
  • the number of training cases
  • the amount of noise in the targets
  • the complexity of the function or classification to be learned
  • the architecture
  • the type of hidden unit activation function
  • the training algorithm
  • regularization


What is hidden layer and how many hidden layers should I use?

As per Techopedia, a hidden layer in an artificial neural network is a layer between the input layer and the output layer, where artificial neurons take in a set of weighted inputs and produce an output through an activation function. It is a typical part of nearly any neural network in which engineers simulate the types of activity that go on in the human brain.

You may not need any hidden layers at all. Linear and generalized linear models are useful in a wide variety of applications (McCullagh and Nelder 1989). And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the non-linearities accurately.

In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality (Sontag 1992). In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the "universal approximation" property. But there is no theory yet to tell you how many hidden units are needed to approximate any given function.

If you have only one input, there seems to be no advantage to using more than one hidden layer. But things get much more complicated when there are two or more inputs.
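A small helper like the following (PyTorch, with illustrative defaults) makes the number of hidden layers a single argument, so it is easy to compare zero, one, or more hidden layers on your own data:

```python
import torch.nn as nn

def build_mlp(n_inputs, n_outputs, hidden_layers=2, hidden_units=64):
    """Stack `hidden_layers` fully connected hidden layers between input and output."""
    layers, width = [], n_inputs
    for _ in range(hidden_layers):
        layers += [nn.Linear(width, hidden_units), nn.ReLU()]
        width = hidden_units
    layers.append(nn.Linear(width, n_outputs))
    return nn.Sequential(*layers)

# hidden_layers=0 reduces to a plain linear model, which, as noted above,
# can already be enough for mildly nonlinear or noisy problems.
print(build_mlp(10, 1, hidden_layers=0))
print(build_mlp(10, 1, hidden_layers=2, hidden_units=32))
```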

Activation function

The activation function decides whether a neuron should be activated or not by calculating the weighted sum of its inputs and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron. A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
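A quick NumPy check makes the point concrete: two stacked linear layers with no activation in between collapse to a single linear layer, while inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

# Two "layers" with no activation function in between...
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
two_linear_layers = (x @ W1 + b1) @ W2 + b2

# ...are exactly equivalent to a single linear layer:
W, b = W1 @ W2, b1 @ W2 + b2
one_linear_layer = x @ W + b
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# With a nonlinearity (e.g. ReLU) in between, the collapse no longer happens.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_linear_layer))  # False
```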

Popular activation functions are listed below (a small NumPy sketch of each follows the list):

1) Step function:

Activation A = "activated" if y > threshold, else not.
Alternatively, A = 1 if y > threshold, 0 otherwise.

2) Linear function:

f(x) = cx

3) Sigmoid function:

f(x)=1/(1+e^-x)

4) Tanh

tanh(x)=2/(1+e^(-2x)) -1

5) ReLu

f(x)=max(0,x)

6) Leaky ReLu

f(x) = ax for x < 0
f(x) = x for x >= 0
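For reference, here is a direct NumPy transcription of the formulas above; the slope a = 0.01 for leaky ReLU is just a common default.

```python
import numpy as np

def step(x, threshold=0.0):
    return np.where(x > threshold, 1.0, 0.0)

def linear(x, c=1.0):
    return c * x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # same as np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x < 0, a * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (step, linear, sigmoid, tanh, relu, leaky_relu):
    print(f"{fn.__name__:>10}: {np.round(fn(x), 3)}")
```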

Depending on the properties of the problem, we may be able to make a better choice of activation function for easier and quicker convergence of the network:

  • Sigmoid functions and their combinations generally work better in the case of classifiers
  • Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient problem
  • ReLU function is a general activation function and is used in most cases these days
  • If we encounter a case of dead neurons in our networks the leaky ReLU function is the best choice
  • Always keep in mind that the ReLU function should only be used in the hidden layers
  • As a rule of thumb, you can begin with the ReLU function and then move on to other activation functions if ReLU does not give optimal results

There are other hyper-parameters and techniques to look into if your neural network is still not performing well (a short sketch follows the list):

  • Weight Regularization
  • Batch Normalization
  • Adding Noise
  • Early Stopping
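These are all available off the shelf. For example, here is a Keras sketch that combines the four; the synthetic data and all hyperparameter values are purely illustrative.

```python
import numpy as np
from tensorflow import keras

# Synthetic data standing in for a real dataset.
X = np.random.randn(1000, 20).astype("float32")
y = np.random.randn(1000, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.GaussianNoise(0.1),                                     # adding noise
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),  # weight regularization
    keras.layers.BatchNormalization(),                                   # batch normalization
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)   # early stopping
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```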

I have tried to combine and share my learnings from my deep learning project. I would love to hear how you tune your deep learning hyper-parameters and which ones helped you improve the performance of neural nets in your projects.

Happy Learning. Happy Sharing ...

References and credits:

https://cs231n.github.io/neural-networks-1/

https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html   

https://www.geeksforgeeks.org/activation-functions-neural-networks/

https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10

https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/

https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0

https://towardsdatascience.com/what-are-hyperparameters-and-how-to-tune-the-hyperparameters-in-a-deep-neural-network-d0604917584a

ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu

https://ruder.io/optimizing-gradient-descent/

https://machinelearningmastery.com/
