Configure Deep Learning Architecture

Deep learning is used in a wide range of domains – e-commerce, supply chain, transportation, medicine, etc. – and there are good libraries available, such as Keras, TensorFlow, and PyTorch, that help you build models in just a few lines of code. However, configuring a good network is still a challenge.

Have you ever been confused about how to configure and tune a deep learning architecture to get better performance?

I was working on a deep learning project at General Mills and got stuck with poor accuracy. So I started researching ways to improve the architecture and tune hyperparameters to improve the performance of neural nets on a hold-out dataset, and I would like to share my learnings with you all.

Configuring a deep learning architecture is an art. There are no hard-and-fast rules or rules of thumb for any given problem. However, there are techniques that address specific issues when configuring and training a neural network.

Note - I am assuming the reader of this post has a basic working knowledge of deep learning and machine learning. I won't get into the mathematics of every technique, but wherever it is needed I will talk about it so everyone can follow along.

3 Major types of problems:

  • Not learning efficiently on the training set – The model could not capture the patterns in the training data and therefore performs poorly on the training set itself. This is known as underfitting.
  • Not generalizing – The model did exceptionally well on the training data but failed to generalize, giving poor performance on the test set. This is known as overfitting.
  • Problems with prediction – The stochastic training algorithm has a strong influence on the final model, causing high variance in its behavior and performance.

To tackle these problems:

  • Better Learning - Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
  • Better Generalization - Techniques that improve the performance of a neural network model on a holdout dataset.
  • Better Predictions - Techniques that reduce the variance in the performance of a final model.

One approach is to identify the type of problem you have and then look for a technique that works best for that problem.

Improving the performance of a deep learning model depends heavily on how you tune hyperparameters. Grid search is usually not a good idea: in deep learning there can be many hyperparameters, an exhaustive grid takes a lot of time, and it is difficult to know in advance which hyperparameters will matter most for your problem. Try grid search only when you have very few hyperparameters to tune. Otherwise, try random values and check how your model performs as you tune each hyperparameter.
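For example, here is a minimal random-search sketch in Python. The build_and_evaluate helper is hypothetical: it stands in for whatever code trains your model with a given configuration and returns a validation score, and the ranges are illustrative, not recommendations.

```python
import random

def build_and_evaluate(optimizer, learning_rate, hidden_units, hidden_layers, activation):
    """Hypothetical helper: train a model with these settings and return validation accuracy."""
    return random.random()  # placeholder, replace with real training and evaluation

best_score, best_config = float("-inf"), None
for _ in range(20):  # 20 random trials instead of an exhaustive grid
    config = {
        "optimizer": random.choice(["sgd", "adam", "rmsprop"]),
        "learning_rate": 10 ** random.uniform(-4, -1),  # sample on a log scale
        "hidden_units": random.choice([32, 64, 128, 256]),
        "hidden_layers": random.choice([1, 2, 3]),
        "activation": random.choice(["relu", "tanh"]),
    }
    score = build_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print("Best configuration found:", best_config)
```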

Again, there is no hard rule or fixed order to follow, but you can try this order when you start exploring and tuning hyperparameters to get better results:

  • Optimizer
  • Learning Rate
  • # of hidden units
  • # of hidden layers
  • Activation function

Let's explore these hyperparameters.

What is Optimizer?

The goal of machine learning algorithms is to minimize a loss function, where the loss function is a mathematical way to measure the performance of your model, i.e. how wrong your predictions are. During training, the algorithm tweaks the parameters (weights) of the model so as to minimize the loss function. But the questions are: how do we change the parameters? What is the method? By how much should we change them, and when? How will we know whether our tweaking of the weights is moving in the right direction?

To answer these questions, the optimizer comes into the picture. Optimizers tie together the loss function and the model parameters by updating the model in response to the output of the loss function. The loss function is the guide to the terrain, telling the optimizer when it is moving in the right or wrong direction.

One of the most popular optimizers is gradient descent. Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model; parameters refer to coefficients in linear regression and weights in neural networks. This algorithm is used to optimize/train most machine learning models. It is fast, robust, and flexible. Here is how gradient descent works (a minimal sketch follows the steps):

  1. Calculate what a small change in each individual weight would do to the loss function
  2. Adjust each individual weight based on its gradient (i.e. take a small step in the determined direction)
  3. Keep doing steps #1 and #2 until the loss function gets as low as possible
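As a concrete illustration of these three steps, here is a small NumPy sketch (not from the original article) that fits a toy linear model y ≈ wx + b with plain gradient descent:

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise; we fit w and b with plain gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0
learning_rate = 0.1

for step in range(200):
    y_pred = w * X[:, 0] + b
    error = y_pred - y
    loss = np.mean(error ** 2)             # how wrong the predictions are
    grad_w = 2 * np.mean(error * X[:, 0])  # step 1: what a small change in w does to the loss
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w            # step 2: small step against the gradient
    b -= learning_rate * grad_b            # step 3: repeat until the loss stops improving

print(f"w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
```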

Variants of Gradient Descent:

1) Stochastic Gradient Descent

2) Mini Batch Gradient Descent

3) Batch Gradient Descent

There are different algorithms used to further improve on gradient descent, and each has its pros and cons. Here are a few of them (a short sketch of how they are typically selected follows the list):

1) Momentum

2) Nesterov accelerated gradient

3) Adam (Adaptive Moment Estimation)

4) AdaDelta

5) AdaMax
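As a sketch of how these variants are typically selected in practice, here is how they map to optimizer classes in PyTorch's torch.optim module; the learning rates shown are only illustrative defaults.

```python
import torch
import torch.nn as nn

# A tiny model just to have parameters to optimize.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

optimizers = {
    "momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "nesterov": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True),
    "adam":     torch.optim.Adam(model.parameters(), lr=0.001),
    "adadelta": torch.optim.Adadelta(model.parameters()),
    "adamax":   torch.optim.Adamax(model.parameters(), lr=0.002),
}

# A typical training step with whichever optimizer you pick:
optimizer = optimizers["adam"]
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer.zero_grad()        # clear gradients from the previous step
loss = loss_fn(model(x), y)  # the loss measures how wrong the predictions are
loss.backward()              # compute gradients
optimizer.step()             # update the weights in response to the loss
```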

What is Learning Rate?

The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While a low learning rate might be a good idea in terms of making sure we do not miss any local minima, it also means we will take a long time to converge, especially if we get stuck on a plateau.

Changing our weights too fast by adding or subtracting too much (i.e. taking steps that are too large) can hinder our ability to minimize the loss function. We don’t want to make a jump so large that we skip over the optimal value for a given weight.
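To see the effect of the step size, here is a tiny self-contained sketch that minimizes the one-dimensional loss f(w) = (w - 5)^2 with different learning rates; the specific values are only illustrative.

```python
# Minimize f(w) = (w - 5)^2 with gradient descent at different learning rates.
def run(learning_rate, steps=25):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 5)         # df/dw
        w -= learning_rate * grad  # the update is proportional to the learning rate
    return w

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(f"lr={lr:<4} -> w after 25 steps: {run(lr):.3f}")
# Too small (0.01): still far from the optimum at w = 5 after 25 steps.
# Moderate (0.1, 0.5): converges to 5.
# Too large (1.1): each step overshoots and the value diverges.
```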

Closely related to the learning rate is how well gradients flow through your choice of nonlinearity, so here is a quick, opinionated comparison of activation functions (the sketch after this list shows how to swap them in a model):

  • Sigmoid: ancient, slow, poorly conditioned for learning with SGD
  • tanh: essentially a rescaled and shifted sigmoid; learns faster, but still best avoided
  • ReLU: quick, reliable, easy, not quite state of the art
  • Leaky ReLU: a little better than ReLU, but has an additional slope parameter to tune
  • PReLU: also better than ReLU and doesn't need to be tuned (the slope is learned), but can overfit a little
  • Randomized ReLU: maybe the best of both worlds; the default setting is usually good, and it doesn't overfit as much as ReLU
  • ELU: supposedly better than ReLU, though I have never seen it be so, and it uses more compute
  • Offset ReLU: subtract a small constant from ReLU to give it some negative activations; better than ReLU
  • Softmax: in a class of its own; usually only used at the end of the network as the final classifier
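Most of these come ready-made in deep learning libraries, so trying a different one is usually a one-line change. Here is a PyTorch sketch; the layer sizes are arbitrary, and there is no built-in "offset ReLU", so it is omitted.

```python
import torch.nn as nn

def make_mlp(act_factory):
    """Same architecture, different nonlinearity between the hidden layers."""
    return nn.Sequential(
        nn.Linear(20, 64), act_factory(),
        nn.Linear(64, 64), act_factory(),
        nn.Linear(64, 3),
        nn.Softmax(dim=1),  # softmax only at the end, as the final classifier
    )

candidates = {
    "sigmoid":    nn.Sigmoid,
    "tanh":       nn.Tanh,
    "relu":       nn.ReLU,
    "leaky_relu": lambda: nn.LeakyReLU(negative_slope=0.01),  # extra slope parameter to tune
    "prelu":      nn.PReLU,   # slope is learned during training
    "rrelu":      nn.RReLU,   # randomized leaky ReLU
    "elu":        nn.ELU,
}

models = {name: make_mlp(factory) for name, factory in candidates.items()}
```

(If you train with nn.CrossEntropyLoss, drop the final Softmax, since that loss expects raw logits.)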


How many hidden units should I use?

In most situations, there is no way to determine the best number of hidden units without training several networks and estimating the generalization error of each. If you have too few hidden units, you will get high training error and high generalization error due to underfitting and high statistical bias. If you have too many hidden units, you may get low training error but still have high generalization error due to overfitting and high variance.
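A minimal sketch of that "train several networks" loop, using PyTorch and synthetic data as a stand-in for a real training/validation split:

```python
import torch
import torch.nn as nn

# Synthetic data standing in for a real training/validation split.
torch.manual_seed(0)
X_train, y_train = torch.randn(800, 10), torch.randn(800, 1)
X_val, y_val = torch.randn(200, 10), torch.randn(200, 1)

def train_and_score(hidden_units, epochs=50):
    model = nn.Sequential(nn.Linear(10, hidden_units), nn.ReLU(),
                          nn.Linear(hidden_units, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        optimizer.step()
    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()  # generalization (hold-out) error

for units in [4, 16, 64, 256]:
    print(f"{units:>3} hidden units -> validation loss {train_and_score(units):.4f}")
```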

The best number of hidden units depends in a complex way on:

  • the numbers of input and output units
  • the number of training cases
  • the amount of noise in the targets
  • the complexity of the function or classification to be learned
  • the architecture
  • the type of hidden unit activation function
  • the training algorithm
  • regularization


What is hidden layer and how many hidden layers should I use?

As per Techopedia, a hidden layer in an artificial neural network is a layer between the input layer and the output layer, where artificial neurons take in a set of weighted inputs and produce an output through an activation function. It is a typical part of nearly any neural network in which engineers simulate the types of activity that go on in the human brain.

You may not need any hidden layers at all. Linear and generalized linear models are useful in a wide variety of applications (McCullagh and Nelder 1989). And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the non-linearities accurately.

In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality (Sontag 1992). In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the "universal approximation" property. But there is no theory yet to tell you how many hidden units are needed to approximate any given function.

If you have only one input, there seems to be no advantage to using more than one hidden layer. But things get much more complicated when there are two or more inputs.
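A small helper like the following (PyTorch, with illustrative defaults) makes the number of hidden layers a single argument, so it is easy to compare zero, one, or more hidden layers on your own data:

```python
import torch.nn as nn

def build_mlp(n_inputs, n_outputs, hidden_layers=2, hidden_units=64):
    """Stack `hidden_layers` fully connected hidden layers between input and output."""
    layers, width = [], n_inputs
    for _ in range(hidden_layers):
        layers += [nn.Linear(width, hidden_units), nn.ReLU()]
        width = hidden_units
    layers.append(nn.Linear(width, n_outputs))
    return nn.Sequential(*layers)

# hidden_layers=0 reduces to a plain linear model, which, as noted above,
# can already be enough for mildly nonlinear or noisy problems.
print(build_mlp(10, 1, hidden_layers=0))
print(build_mlp(10, 1, hidden_layers=2, hidden_units=32))
```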

Activation function

The activation function decides whether a neuron should be activated or not by calculating the weighted sum of its inputs and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron. A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
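A quick NumPy check makes the point concrete: two stacked linear layers with no activation in between collapse to a single linear layer, while inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

# Two "layers" with no activation function in between...
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
two_linear_layers = (x @ W1 + b1) @ W2 + b2

# ...are exactly equivalent to a single linear layer:
W, b = W1 @ W2, b1 @ W2 + b2
one_linear_layer = x @ W + b
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# With a nonlinearity (e.g. ReLU) in between, the collapse no longer happens.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_linear_layer))  # False
```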

Popular activation functions are listed below (a small NumPy sketch of each follows the list):

1) Step function:

Activation A = "activated" if y > threshold, else not.
Alternatively, A = 1 if y > threshold, 0 otherwise.

2) Linear function:

f(x) = cx

3) Sigmoid function:

f(x)=1/(1+e^-x)

4) Tanh

tanh(x)=2/(1+e^(-2x)) -1

5) ReLu

f(x)=max(0,x)

6) Leaky ReLu

f(x) = ax for x < 0
f(x) = x for x >= 0
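For reference, here is a direct NumPy transcription of the formulas above; the slope a = 0.01 for leaky ReLU is just a common default.

```python
import numpy as np

def step(x, threshold=0.0):
    return np.where(x > threshold, 1.0, 0.0)

def linear(x, c=1.0):
    return c * x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # same as np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x < 0, a * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (step, linear, sigmoid, tanh, relu, leaky_relu):
    print(f"{fn.__name__:>10}: {np.round(fn(x), 3)}")
```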

Depending on the properties of the problem, we may be able to make a better choice of activation function for easier and quicker convergence of the network:

  • Sigmoid functions and their combinations generally work better in the case of classifiers
  • Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient problem
  • ReLU function is a general activation function and is used in most cases these days
  • If we encounter a case of dead neurons in our networks the leaky ReLU function is the best choice
  • Always keep in mind that the ReLU function should only be used in the hidden layers
  • As a rule of thumb, you can begin with the ReLU function and then move on to other activation functions if ReLU does not give optimal results

There are other hyper-parameters and techniques to look into if your neural network is still not performing well (a short sketch follows the list):

  • Weight Regularization
  • Batch Normalization
  • Adding Noise
  • Early Stopping
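These are all available off the shelf. For example, here is a Keras sketch that combines the four; the synthetic data and all hyperparameter values are purely illustrative.

```python
import numpy as np
from tensorflow import keras

# Synthetic data standing in for a real dataset.
X = np.random.randn(1000, 20).astype("float32")
y = np.random.randn(1000, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.GaussianNoise(0.1),                                     # adding noise
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),  # weight regularization
    keras.layers.BatchNormalization(),                                   # batch normalization
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)   # early stopping
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```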

I have tried to combine and share my learnings from my deep learning project. I would love to hear how you tune your deep learning hyper-parameters and which ones helped you improve the performance of neural nets in your projects.

Happy Learning. Happy Sharing ...

References and credits:

https://cs231n.github.io/neural-networks-1/

https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html   

https://www.geeksforgeeks.org/activation-functions-neural-networks/

https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10

https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/

https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0

https://towardsdatascience.com/what-are-hyperparameters-and-how-to-tune-the-hyperparameters-in-a-deep-neural-network-d0604917584a

ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu

https://ruder.io/optimizing-gradient-descent/

https://machinelearningmastery.com/
