Compute differentiable random variable and derivative gradients in artificial neural networks


As Yann LeCun wrote in his paper "A Theoretical Framework for Back-Propagation":

The central problem that back-propagation solves is the evaluation of the influence of a parameter on a function whose computation involves several elementary steps.

A neural network trained with back-propagation adjusts its parameters so that its output matches the ground truth. It illustrates rules, knowledge, and learning in solving the stated problems.

Artificial Neural Networks or Deep Learning models

Neural nets = weighted DAG (directed acyclic graph): n+1 roots (inputs), one leaf (output), plus a non-linearity at the other vertices.

Suppose, for example, that images are the input to a deep network. The network naturally integrates features at every layer through feature extraction. We have an input layer, hidden layers (intermediate layers), and an output layer.
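As a minimal sketch of this layered structure (the layer sizes and the PyTorch usage here are illustrative assumptions, not taken from the article), a feedforward network with an input, a hidden, and an output layer could look like:

```python
import torch
import torch.nn as nn

# A minimal feedforward network: input -> hidden -> output.
# The layer sizes (784, 128, 10) are illustrative assumptions.
model = nn.Sequential(
    nn.Linear(784, 128),  # input layer -> hidden layer
    nn.ReLU(),            # non-linearity on the hidden vertices
    nn.Linear(128, 10),   # hidden layer -> output layer
)

x = torch.randn(1, 784)   # e.g. a flattened 28x28 image
y = model(x)              # forward pass through every layer
print(y.shape)            # torch.Size([1, 10])
```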

At the architectural level, networks are organized into two categories:

1. Categories of learning:

a. Supervised learning

b. Semi-supervised learning

c. Unsupervised learning

d. Weakly supervised learning

2. Categories of structure: 1. Feedforward networks 2. Feedback networks

Image Feature extraction levels:

Low-level features: pixel intensities, edges, and dark spots

Tasks: Image enhancement and sharpening 

Mid-level features: Edges and contours of eyes, ears and nose

Tasks: Segmentation, Description & Classification.

High-level features: Facial structure with the regions

Tasks: Recognition

Processing is characterized by patterns of activation across simple processing units connected together into complex networks. Nowadays, most models proposed for such problems are multi-layer neural networks: complex networks that build many intermediate internal representations in their hidden units, which introduces hyperparameters that must be tuned to optimize the network and keep it stable.

Connectionism places an emphasis on learning internal representations.

Back-propagation

Back propagation = chain rule + gradient descent.

Problem: The central problem that back-propagation solves is the evaluation of the influence of a parameter on a function whose computation involves several elementary steps.

Solution: The solution to this problem is given by the chain rule.

The purpose of computing partial derivatives of the states with respect to the parameters is to minimize an objective function (cost function / loss function) that measures how far the behavior of the network is from the desired behavior.
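A small worked example may help (the function, values, and variable names are purely illustrative): the chain rule carries the derivative of the loss back through each elementary step.

```python
# Chain rule on a single-neuron example: loss L = (w*x - t)^2.
# Elementary steps: y = w*x, e = y - t, L = e^2.
x, t = 2.0, 1.0          # input and target (illustrative values)
w = 0.9                  # the parameter whose influence we want

y = w * x                # step 1
e = y - t                # step 2
L = e ** 2               # step 3 (objective / loss)

# Back-propagate with the chain rule: dL/dw = dL/de * de/dy * dy/dw
dL_de = 2 * e
de_dy = 1.0
dy_dw = x
dL_dw = dL_de * de_dy * dy_dw
print(L, dL_dw)          # loss 0.64 and gradient 3.2, used by gradient descent
```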

The use of back-propagated variables for computing derivatives is apparent in the classical literature. In optimal control, the back-propagated vector is called the co-state or adjoint state, and the corresponding backward system the adjoint system.

What is Back-Propagation ?

It is a learning law that describes the weight vector of the i-th processing unit at time instant (t+1) in terms of its weight vector at time instant (t),

where i = 1, 2, ..., n.

Learning laws use only local information for adjusting the weight of the connection between two units.
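Written generically (the symbols below are assumed for illustration, not taken from a particular textbook), a learning law updates the weight vector as

$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \Delta\mathbf{w}_i(t),$$

and for gradient-based laws such as the delta rule and back-propagation the increment is proportional to the negative gradient of the error:

$$\Delta\mathbf{w}_i(t) = -\eta\,\frac{\partial E}{\partial \mathbf{w}_i(t)}.$$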

Eg: 1. Hebbian learning law: initial weights = 0, unsupervised learning

2. Perceptron learning law: initial weights = random, supervised learning

3. Delta learning law: initial weights = random, supervised learning (see the sketch after this list)

4. Widrow-Hoff learning law: initial weights = random, supervised learning

5. Correlation learning law: initial weights = 0, supervised learning

6. Winner-take-all learning law: initial weights = random but normalized, unsupervised learning

7. Outstar learning law: initial weights = 0, supervised learning
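A minimal sketch of the delta learning law from the list above, for a single sigmoid unit (the data, learning rate, and initialization are illustrative assumptions):

```python
import numpy as np

# Delta learning law for one unit: w <- w + eta * (target - output) * f'(net) * x
rng = np.random.default_rng(0)
w = rng.normal(size=3)            # initial weights are random (supervised learning)
eta = 0.1                         # learning rate (assumed value)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 0.5, -1.0])    # one training input (illustrative)
t = 1.0                           # its target

for _ in range(100):
    net = w @ x                   # weighted sum at the unit
    out = sigmoid(net)
    # local update: uses only this unit's input, output and error
    w += eta * (t - out) * out * (1 - out) * x

print(out)                        # the output moves toward the target 1.0
```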

Summary of Learning methods

Hebbian Learning

  • Basic Hebbian learning
  • Differential Hebbian learning
  • Stochastic versions

Error correction learning - learning with a teacher

  • Perceptron learning
  • Delta learning
  • LMS learning

While more computationally powerful networks could be described, there was no algorithm to learn the connection weights of these systems.

Such networks required the postulation of additional internal or “hidden” processing units, which could adopt intermediate representational states in the mapping between input and output patterns.

An algorithm (back-propagation) able to learn those states was discovered independently several times. 

In the back-propagation process:

In the forward pass, we compute the activations (outputs) of each layer from the current weights.

In the backward pass, we compute the gradients of the loss with respect to those weights.

We then update the weights to minimize the back-propagation error.
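As a hand-written sketch of these two passes for a tiny one-hidden-layer network (sizes, data, and learning rate are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 features (illustrative)
t = rng.normal(size=(4, 1))        # targets

W1 = rng.normal(size=(3, 5))       # input -> hidden weights
W2 = rng.normal(size=(5, 1))       # hidden -> output weights

# Forward pass: compute the activations of each layer.
h = np.tanh(x @ W1)                # hidden activations
y = h @ W2                         # network output
loss = ((y - t) ** 2).mean()

# Backward pass: chain rule gives gradients of the loss w.r.t. the weights.
dy = 2 * (y - t) / len(x)          # dL/dy
dW2 = h.T @ dy                     # dL/dW2
dh = dy @ W2.T                     # dL/dh
dW1 = x.T @ (dh * (1 - h ** 2))    # dL/dW1, back through the tanh

# Gradient-descent update to reduce the back-propagation error.
lr = 0.01
W1 -= lr * dW1
W2 -= lr * dW2
print(loss)
```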

Gradients 

Gradients can also be interpreted in the space of models from the perspective of model uncertainty.

Model uncertainty: Uncertainty in model parameters due to limited data.

If the gradient is small, the model is certain about the given input x to the function f(x); if the gradient is large, the model is uncertain about it.

Gradient descent is therefore used to find a local minimum of the objective function, keeping the computation tractable and reducing the uncertainty in the model.
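A minimal sketch of gradient descent on a simple one-dimensional objective (the function, starting point, and step size are assumptions):

```python
# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
def f(x):
    return (x - 3.0) ** 2

def grad_f(x):
    return 2.0 * (x - 3.0)

x = 0.0          # starting point (assumed)
lr = 0.1         # learning rate (assumed)
for _ in range(50):
    x -= lr * grad_f(x)   # step against the gradient

print(x, f(x))   # x is close to 3, f(x) close to 0
```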

Autograd: automatic differentiation

Central to all neural networks in PyTorch is the autograd package. The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

requires_grad parameter

If you set its attribute .requires_grad as True, it starts to track all operations on it. When you finish your computation you can call .backward() and have all the gradients computed automatically. The gradient for this tensor will be accumulated into .grad attribute.
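A minimal sketch of this behaviour (the tensor shape and operations are illustrative):

```python
import torch

x = torch.ones(3, requires_grad=True)   # track all operations on x
y = (x * 2).sum()                       # build the computation graph
y.backward()                            # compute gradients automatically
print(x.grad)                           # tensor([2., 2., 2.]), accumulated in .grad
```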

Gradients

When out contains a single scalar, out.backward() is equivalent to out.backward(torch.tensor(1.)).

Let us call the output tensor o. The gradient ∂o/∂x_i with respect to each input element x_i is then itself a tensor of the same shape as x; in the simple example below every entry is the same constant.

To obtain o we compute a short chain of tensor operations, as in the sketch that follows.
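A minimal sketch in the spirit of that example (the shapes and constants are the usual tutorial-style assumptions):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)   # leaf tensor, gradients tracked
y = x + 2
z = y * y * 3
out = z.mean()                             # "o" is a single scalar

out.backward()                             # same as out.backward(torch.tensor(1.))
print(x.grad)                              # each entry is 1.5 * (x_i + 2) = 4.5
```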

Many existing methods are available to minimize the back propagation error in state-of-the-art models.

Extensions of back-propagation include backpropagation through time (BPTT) and gated recurrent architectures with forget gates.
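A minimal sketch of BPTT with a small recurrent network (the sizes, data, and final-step readout are assumptions; calling loss.backward() propagates gradients back through every time step):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

x = torch.randn(2, 10, 4)        # 2 sequences, 10 time steps, 4 features
t = torch.randn(2, 1)            # targets (illustrative)

out, h_n = rnn(x)                # forward through all time steps
pred = head(out[:, -1])          # read out the last time step
loss = ((pred - t) ** 2).mean()

loss.backward()                  # BPTT: gradients flow back through each time step
print(rnn.weight_hh_l0.grad.shape)   # recurrent weights receive a gradient
```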

Dropout

Dropout is a technique for improving neural networks by reducing overfitting. Random dropout breaks up these co-adaptations by making the presence of any particular hidden unit unreliable. This technique was found to improve the performance of neural nets in a wide variety of application domains including object classification, digit recognition, speech recognition, document classification and analysis of computational biology data. 

Dropout works best together with a high learning rate and momentum. Because units are dropped at random, fewer feature parameters are active at any step to match the ground truth, and the higher learning rate and momentum compensate for this so the network can still be optimized to a stable solution.
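A minimal sketch of dropout inside a small classifier (the layer sizes and dropout probability p are assumptions):

```python
import torch
import torch.nn as nn

# A small classifier with dropout between layers.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zero hidden units during training
    nn.Linear(256, 10),
)

x = torch.randn(8, 784)
model.train()             # dropout active: units are dropped at random
train_out = model(x)
model.eval()              # dropout disabled at evaluation time
eval_out = model(x)
```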


Interconnections

The interconnections in state-of-the-art systems include residual connections, shared weights, dropout, element-wise addition, and stacked layers with inductive bias.

For image processing and computer vision, the most commonly used networks are ConvNets (convolutional neural networks).

In the paper cited above, the author clearly describes how features of different levels are extracted by the individual processing units.
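As a minimal sketch of two of these interconnection patterns, a convolutional layer and a residual (element-wise addition) connection (the channel counts and input size are assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv layers plus an element-wise addition (residual) connection."""
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # residual connection: element-wise addition

x = torch.randn(1, 16, 32, 32)      # a small feature map (illustrative)
block = ResidualBlock()
print(block(x).shape)               # torch.Size([1, 16, 32, 32])
```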


To download the content, you can ping me. Thank you for your time.

Ranjith Katta
