Deep Learning - Introduction to Neural Networks

Building on my previous articles, 'Data Analytics - A Philosophical Perspective', 'Machine Learning - A Practical Perspective', and 'Social Media Marketing - A Novel Perspective', I plan to write a series of articles to share my thoughts about Deep Learning. The objective is to first introduce the concept and then dive deeper into specific applications and better learning practices.

Let's start by answering the basic question: what exactly does the term deep learning mean? Learning theory is an increasingly popular discipline that seeks to understand how humans perceive things and make decisions, and then to use that knowledge to build artificially intelligent machines. Deep learning is a particular way to achieve this, using something called a neural network.

Neural Network Architecture

A neural network is a complex non-linear architecture of connections between the features and the response of interest. In the context of a housing price prediction problem, the features can be the area of the house, the number of bedrooms, proximity to the market, etc., while the response is the price of the house. The term neural network comes from an architectural resemblance to the interaction of neurons in the nervous system.

Some other common applications of deep learning are online advertising (predicting whether a user will click on a particular ad using the ad and user information), image recognition (recognizing the objects in an image using the values of its pixels), speech recognition (transcribing an audio clip into its text script), machine translation (translating, say, a given English sentence into a French sentence), and autonomous driving (predicting the positions of other objects on the road using photographs of the road and radar information).

All of the above are nothing but supervised learning tasks. Hence, an obvious question arises: why can't we simply use traditional machine learning algorithms? The answer is that, empirically, the performance of most traditional algorithms tends to stagnate after they are exposed to a certain number of training examples. In contrast, the performance of a (sufficiently deep) neural network tends to keep improving steadily as it is exposed to more and more training data. Hence, in an age where massive datasets and computational capabilities are available, it makes sense to use neural networks to achieve a higher level of performance.

Let's now talk more about neural networks. A neural network is a complex network of neurons, which are its basic building blocks. A neuron receives inputs from the neurons in the previous layer and outputs some 'activation' of those inputs. An activation is nothing but some function of a 'transfer function' of the inputs, formed using some unknown weights. The transfer function is usually a weighted sum, i.e., the sum of products of the different inputs with their respective weights. Common choices of activation functions are the sigmoid, the rectified linear unit (ReLU), and the hyperbolic tangent. Empirically, the hyperbolic tangent and ReLU have been seen to work better than the sigmoid.

A pertinent question here is: why do we need non-linear activations? The answer follows from a theorem by Leshno et al. (1993), which says that as long as the activation is non-polynomial, a neural network can approximate essentially any continuous function. The simple intuition is that however complex a combination of polynomials I create through the network architecture, I will always end up with a polynomial; hence, in order to be able to learn non-polynomial functions, non-linear activations are required.
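To make this concrete, here is a minimal sketch of the three activation functions mentioned above, written in Python with NumPy (the article itself uses no code, so this is purely illustrative):

    import numpy as np

    def sigmoid(z):
        # squashes any real input into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        # zero for negative inputs, identity for positive inputs
        return np.maximum(0.0, z)

    def tanh(z):
        # squashes any real input into (-1, 1)
        return np.tanh(z)

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z))  # approximately [0.119 0.5 0.881]
    print(relu(z))     # [0. 0. 2.]
    print(tanh(z))     # approximately [-0.964 0. 0.964]

Note that all three are non-linear (and non-polynomial), which is exactly the property the Leshno et al. result requires.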

The ultimate goal of any learning exercise is to minimize some 'cost function' of interest. The cost function is nothing but a measure of the discrepancy between the observed and predicted (post-learning) responses. The most commonly used cost function is the squared error cost, which is the average of the squared distances between the observed and predicted responses over the different training/test examples.
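For instance, the squared error cost can be computed as follows (a sketch in Python; the observed and predicted responses are made-up numbers):

    import numpy as np

    def squared_error_cost(y_observed, y_predicted):
        # average of the squared distances between observed and predicted responses
        return np.mean((y_observed - y_predicted) ** 2)

    y_observed = np.array([250.0, 310.0, 190.0])    # e.g. observed house prices
    y_predicted = np.array([240.0, 330.0, 200.0])   # the network's predictions
    print(squared_error_cost(y_observed, y_predicted))  # 200.0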

Typically, neural networks are learned, i.e. the cost minimization is achieved, by 'iteratively' performing the following three steps: forward propagation, backpropagation, and optimization. The word iteratively refers to the fact that these steps are performed repeatedly until some 'convergence criterion' is satisfied.

Forward propagation refers to computing the activations of all the neurons in the network using the current weights, which are randomly initialized to start with. Random initialization of the weights is very necessary: if they are all initialized to the same value, say 0 or 1, all the units may end up learning the same thing. A smart way of initializing was proposed by Glorot and Bengio (2010), which suggests drawing the weights randomly from a uniform distribution over (-r, r), where r is a decreasing function of the number of incoming and outgoing connections of a particular neuron.
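A minimal sketch of such an initialization for a single layer, using the commonly seen r = sqrt(6 / (fan_in + fan_out)) variant of the Glorot and Bengio proposal (the function name and layer sizes here are mine, chosen for illustration):

    import numpy as np

    def glorot_uniform(fan_in, fan_out, seed=0):
        # r is a decreasing function of the number of incoming (fan_in)
        # and outgoing (fan_out) connections of a neuron
        rng = np.random.default_rng(seed)
        r = np.sqrt(6.0 / (fan_in + fan_out))
        # drawing uniformly from (-r, r) breaks the symmetry that an
        # all-zero (or all-equal) initialization would create
        return rng.uniform(-r, r, size=(fan_out, fan_in))

    W = glorot_uniform(fan_in=3, fan_out=4)  # a layer with 3 inputs and 4 neurons
    print(W.shape)  # (4, 3)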

Backpropagation is an algorithm popularized by Rumelhart, Hinton, and Williams (1986) for calculating the derivatives (gradients) of the cost function with respect to the unknown weights. The derivatives of interest are computed by exploiting the 'chain rule', which involves computing the derivatives with respect to the different activations while going backward from the last layer to the first layer. Hence the word backpropagation.
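To see the chain rule at work, consider a single sigmoid neuron with a squared error cost; the gradient with respect to the weights is a product of three simple factors (a hand-rolled Python sketch; the numbers are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])    # inputs to the neuron
    w = np.array([0.5, -0.3])   # current weights
    y = 1.0                     # observed response

    # forward pass: transfer function, activation, cost
    z = np.dot(w, x)            # weighted sum of the inputs
    a = sigmoid(z)              # activation
    cost = (a - y) ** 2

    # backward pass: chain rule, one factor per forward step
    dcost_da = 2.0 * (a - y)    # derivative of the cost w.r.t. the activation
    da_dz = a * (1.0 - a)       # derivative of the sigmoid w.r.t. z
    dz_dw = x                   # derivative of z w.r.t. the weights
    dcost_dw = dcost_da * da_dz * dz_dw
    print(dcost_dw)             # gradient of the cost w.r.t. each weight

In a multi-layer network the same idea applies, except that the factors are accumulated layer by layer while moving backward.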

The derivatives are then fed into a gradient-descent-type optimization algorithm to minimize the cost function with respect to the unknown weights. This is done by iteratively descending in the direction opposite to the derivative of the cost function at the current weights, where the size of each step is controlled by something called the 'learning rate'.

next weight = current weight - learning rate × gradient of cost at current weight

The learning rate is a positive number that controls the amount of change in the values of the weights between two successive iterations. A learning rate that is too small may make gradient descent slow to converge, while one that is too large may make it diverge.
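To see the update rule in action, here is a tiny Python sketch on an invented one-dimensional cost, cost(w) = (w - 3)^2, whose minimizer is w = 3:

    def grad(w):
        # derivative of the toy cost (w - 3)^2
        return 2.0 * (w - 3.0)

    w = 0.0                 # arbitrary starting weight
    learning_rate = 0.1
    for step in range(50):
        # next weight = current weight - learning rate x gradient
        w = w - learning_rate * grad(w)
    print(w)  # approaches 3.0, the minimizer of the cost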

A simple method for choosing the learning rate, suggested by Ng (2010), is to try some values such as ..., 0.001, 0.01, 0.1, 1, ... for some initial iterations and pick the one for which the cost decays at the fastest rate. Note that increasing the learning rate beyond some threshold will cause the cost to increase rather than decrease.
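A sketch of this heuristic on the same toy cost as above: run a handful of iterations for each candidate rate and compare how far the cost has decayed:

    def cost(w):
        return (w - 3.0) ** 2

    def grad(w):
        return 2.0 * (w - 3.0)

    for lr in [0.001, 0.01, 0.1, 1.0, 1.5]:
        w = 0.0
        for _ in range(20):
            w = w - lr * grad(w)
        print(lr, cost(w))  # 0.1 decays fastest here; 1.5 makes the cost blow up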

The above three steps (forward propagation, backpropagation, and optimization) are repeated until the difference between the cost function values of two successive iterations becomes negligible. The performance of the learned neural network is then evaluated on a test set, typically using some cross-validation scheme.
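Putting all three steps together, here is a sketch of a complete training loop for a tiny one-hidden-layer ReLU network on synthetic data (all names, layer sizes, and data are invented for illustration; a real project would typically use a library such as PyTorch or TensorFlow):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(100, 3))                      # 100 examples, 3 features
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

    # Glorot-style random initialization of both weight matrices
    r1 = np.sqrt(6.0 / (3 + 4)); W1 = rng.uniform(-r1, r1, size=(3, 4))
    r2 = np.sqrt(6.0 / (4 + 1)); W2 = rng.uniform(-r2, r2, size=(4, 1))

    lr, prev_cost = 0.1, np.inf
    for it in range(10000):
        # 1) forward propagation
        H = np.maximum(0.0, X @ W1)                     # hidden layer, ReLU activation
        y_hat = (H @ W2).ravel()                        # linear output layer
        cost = np.mean((y_hat - y) ** 2)                # squared error cost

        # 2) backpropagation (chain rule, last layer to first)
        d_out = 2.0 * (y_hat - y) / len(y)              # d(cost)/d(y_hat)
        dW2 = H.T @ d_out[:, None]
        dH = d_out[:, None] @ W2.T
        dH[H <= 0.0] = 0.0                              # ReLU passes gradient only where active
        dW1 = X.T @ dH

        # 3) optimization: one gradient descent step
        W1 -= lr * dW1
        W2 -= lr * dW2

        # convergence criterion: negligible change in the cost
        if abs(prev_cost - cost) < 1e-12:
            break
        prev_cost = cost

    print(it, cost)  # the cost should have decayed to a small value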

It is important to note that all the exogenously determined quantities, such as the activation function, the number of layers, the number of neurons in each layer, and the learning rate, affect the performance of the neural network to a great extent and hence need to be chosen judiciously. These exogenously determined quantities are called hyperparameters. The performance of the neural network also depends on the treatment of overfitting, the choice of optimization algorithm, etc. These issues, along with some other subtler aspects of neural networks, will be addressed in the next articles.

Mushfik Rizvi

Senior Manager- MDM (Finance & Operations)

5y

Knowledge booster... Thanks for sharing, sir, Abhishek K. Umrawal

Mustafa Lokhandwala

Applied Operations Research and Data Science | PhD Industrial Engineering

5y

This is a really good initiative. Keep posting, Abhishek K. Umrawal. I look forward to the rest of the posts in the series.

Rohit Dhankar

Associate Manager ML at Accenture

5y

Abhishek K. Umrawal - Sir, kindly pardon my enthusiasm. Here you have mentioned that we prefer neural nets as they provide better results with larger amounts of training data, which otherwise stagnates "traditional algorithms". Another reason for preferring neural nets: LOGISTIC REGRESSION (sigmoid, or binomial logistic regression) classifies the response variable into TWO classes. SOFTMAX REGRESSION (multinomial logistic regression) classifies the response variable into more than TWO classes. SOFTMAX is a generalization of the SIGMOID, as it can be used for any case of more than TWO classes of the dependent variable. The classes of the DEPENDENT variable need to be mutually exclusive with no overlap. Source_1 - https://github.com/Computer-Vision-Dhankar-Rohit/Computer-Vision---Open-Source_1 Source_2 - https://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/ Thanks

Rohit Dhankar

Associate Manager ML at Accenture

5y

Abhishek K. Umrawal - Sir, another critical point needs clarification; you hinted upon it, but it needs further reading/references: "choices of activation functions are sigmoid, rectified linear unit (ReLU), and hyperbolic tangent". I found this post useful to understand ReLU etc.: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

Rohit Dhankar

Associate Manager ML at Accenture

5y

Abhishek K. Umrawal - Having studied this from you physically in the classroom, I feel two things are missing here. 1) Why the term DEEP? How deep is a deep neural net? Why can't we have the same results with a shallow net of nodes? 2) Code :)
