Foundations of Neural Nets

It has been a while since I did anything related to Machine Learning or Deep Learning, so I decided to revisit it. With that in mind, I enrolled in Coursera's Deep Learning Specialisation taught by none other than Andrew Ng, and decided to write about my learnings along the way. So here goes my first article in this space. It is primarily about the foundational concepts behind Deep Learning and Deep Neural Nets.

Even though I try to go into detail, writing up everything covered in the course is not possible, so this is an abridged version of my learnings. If you are interested in knowing more, please check out Andrew's course on Coursera.

In this article I'll be covering the following:

  1. Logistic Regression for Binary Classification
  2. Logistic Regression's Cost function
  3. Gradient Descent & its algorithm
  4. Gradient Descent for Logistic Regression
  5. Vectorising Logistic Regression for m examples and n features with O(k) time complexity
  6. Shallow Neural Network

Logistic Regression for Binary Classification

First of all, you might be wondering what logistic regression is doing in an article about neural nets. The thing is, the math behind neural nets and logistic regression is very similar, and it is much easier to understand logistic regression than neural nets. When we get to the sixth section of this article I'll explain how logistic regression connects with neural nets.

So, What is Logistic Regression and where do we use it?

Logistic Regression is a predictive model which determines the category of a given data sample. Let me give you an example: when trained on historical transaction data, logistic regression can determine whether a transaction is fraudulent or not.

What is Binary Classification?

Binary classification is when the predictive model has at most 2 possible outcomes (ex: fraudulent transaction or normal transaction). If there are more than 2 possible outcomes, it is called multi-class classification.

Logistic Regression's Formula

When given some input data, logistic regression can make a prediction. Don't you think that is awesome? So how does logistic regression do what it does? Let's dive in....

It uses a formula

y = sigmoid( w * x + b )        

Let's go through each element of the formula:

  • y is the output variable (ex: fraud or not, cat or dog etc)
  • x is the input variable (ex: time, amount, location etc of a transaction)
  • w & b are the variables that the model learns while training (technically these are called weights and biases)
  • sigmoid is a function that squashes any real number into the range (0, 1) (more on this below)

A little note on sigmoid

import numpy as np

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))
[sigmoid graph: the S-shaped curve mapping any real number to a value between 0 and 1]

If you remember, `y = wx + b` is the formula of a straight line, so for any given value of x, y can take any continuous value (this type of prediction is called regression). To convert this regression output into a classification output (i.e. a continuous value into a category), we pass it through the sigmoid function, which squashes it into a probability between 0 and 1.
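For instance, here is a minimal sketch of how the raw linear output becomes a class prediction (the weight, bias and input values below are made up purely for illustration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# hypothetical learned parameters and a single input feature
w, b = 2.0, -1.0
x = 1.5

z = w * x + b                        # raw linear output (regression value), here 2.0
p = sigmoid(z)                       # probability of the positive class, ~0.88
prediction = 1 if p >= 0.5 else 0    # threshold at 0.5 to get a class label
print(p, prediction)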

Logistic Regression's Cost Function

Given training data such as { (x1, y1), (x2, y2), ...., (xm, ym) }, the above formula is applied to each training example (xi, yi) and the resulting prediction y^ (y hat) is calculated.

We want y^ to be as close to yi as possible. In order to do that, we first calculate how far off the prediction is (the loss or cost) and then employ ways to reduce that loss (we use gradient descent & backpropagation to achieve this).

Loss Function: the function used to calculate the loss for a single example (xi, yi). We can employ many loss functions such as mean squared error, log loss, Huber loss etc., but whichever loss function we choose, it has to keep the training problem convex. For binary classification, we use the following (log) loss:

L(y^, yi) = - ( yi * log(y^) + (1 - yi) * log(1 - y^) )
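To get a feel for this loss, here is a small worked example (the predicted probabilities are made up for illustration). When the true label yi is 1, the loss reduces to -log(y^), so a confident correct prediction is barely penalised while a confident wrong one is penalised heavily:

import numpy as np

def log_loss(y, y_hat):
    # loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1, 0.9))   # ~0.105, good prediction -> small loss
print(log_loss(1, 0.1))   # ~2.303, bad prediction  -> large loss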

Cost Function: the average of the losses over all m training examples.

# J(w, b) = 1/m * (sum of all losses)

def cost_func(y, y_hat, m):
    cost = 0
    for i in range(m):
        cost += -(y[i] * np.log(y_hat[i]) + (1 - y[i]) * np.log(1 - y_hat[i]))
    return cost / m
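As a quick sanity check, here is how the same cost can be computed without the explicit loop using NumPy (the labels and predictions below are made up for illustration):

import numpy as np

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.8, 0.6])
m = len(y)

# vectorised version of the loop above
cost = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
print(cost)   # should match cost_func(y, y_hat, m)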

Gradient Descent

Now that we have the total cost over the training set (i.e. we know how far our current predictions are from the true outputs), we can employ various methods to reduce that error and make the model's predictions as accurate as possible.

[gradient descent graph: the cost function plotted against w]

This is a 2D representation of the gradient descent approach, but there is a 3D version of it as well; we'll get to it shortly.

The idea behind this visualisation is that we plot the cost function against the variable w and we find the value of w for which the cost function is minimal.

We can extend the same idea to 3D by plotting the cost function against both changing variables, w & b. In order for this to work, the plot should be convex and have only one global minimum. That is the reason we need to use a loss function that keeps the cost surface convex during training.

The algorithm: We take the partial derivatives of the cost with respect to w and b to get the direction in which the cost increases, then move w and b a small step in the opposite direction, repeating until we reach the global minimum. Calculating these partial derivatives and updating the weights and biases is called BACKPROPAGATION.

# alpha is the learning rate
# dw is the partial derivative of the cost function with respect to w
w = w - alpha * dw

# db is the partial derivative of the cost function with respect to b
b = b - alpha * db
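Where do dw and db come from? For the log loss combined with a sigmoid activation, the chain rule collapses to a very simple form. Here is a minimal per-example sketch (the feature values, label and parameters are made up for illustration); this is exactly what the dz, dw and db terms in the next section are computing:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# hypothetical single example with 2 features
x = np.array([0.5, -1.2])
y = 1
w = np.array([0.1, 0.3])
b = 0.0

z = np.dot(w, x) + b
a = sigmoid(z)      # prediction y^

# chain rule: dL/dz = a - y (this is the "dz" used below)
dz = a - y
dw = x * dz         # dL/dw for each feature
db = dz             # dL/db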

Gradient Descent for Logistic Regression

We've discussed the gradient descent algorithm for only one input feature (x) and one weight (w), but in real-world scenarios there will be many input features (i.e. each x will be an array of length n), so we will need a separate weight for each feature (w also becomes an array of length n). For such a case, one pass of the algorithm over m examples looks like the following.

J = 0
dw = [0] * n                 # n is the number of features
db = 0
z, a, dz = [0] * m, [0] * m, [0] * m

for i in range(m):

    z[i] = np.dot(w, x[i]) + b   # w and x[i] are arrays of length n, so z[i] is a scalar

    a[i] = sigmoid(z[i])         # prediction y^ for example i

    J += -(y[i] * np.log(a[i]) + (1 - y[i]) * np.log(1 - a[i]))   # loss for example i

    # derivatives
    dz[i] = a[i] - y[i]

    for j in range(n):
        dw[j] += x[i][j] * dz[i]

    db += dz[i]

# average over the m examples
J /= m
dw = [d / m for d in dw]
db /= m

Optimisation: if you look at the above example, it is not optimal. We have two explicit loops, one going over the m examples and one going over the n features, making the time complexity O(m * n). We can optimise this using vectorisation. A vectorised version of the above algorithm looks like this:

# random initialisation (only the parameters are random; A and dz are computed)
w = np.random.randn(n) * 0.01   # one weight per feature
b = 0.0
alpha = 0.01                    # learning rate

Y = np.array(y)                 # labels y[1], ...., y[m], shape (m,)
X = np.array(x).T               # features stacked column-wise, shape (n, m)

for i in range(1000):           # this is for 1000 epochs

    Z = np.dot(w, X) + b        # forward pass for all m examples at once
    A = sigmoid(Z)              # predictions, shape (m,)

    dz = A - Y                  # shape (m,)

    dw = 1 / m * np.dot(X, dz)  # shape (n,), replaces the loop over features
    db = 1 / m * np.sum(dz)

    w = w - alpha * dw
    b = b - alpha * db

We still have an explicit loop for training the model on the same data for a fixed number of epochs, but that loop depends on neither m nor n. So the explicit Python-level looping goes from O(m * n) to O(k), where k is a constant number of epochs, with the per-example and per-feature work handed off to NumPy's optimised vector operations, which is much better.
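Once training finishes, making a prediction for a new example reuses the same forward pass. A minimal sketch (x_new is a hypothetical unseen example with n features):

import numpy as np

def predict(w, b, x_new):
    # same formula as training: sigmoid(w . x + b), thresholded at 0.5
    p = 1 / (1 + np.exp(-(np.dot(w, x_new) + b)))
    return 1 if p >= 0.5 else 0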

Shallow Neural Networks

So why did we discuss logistic regression in an article about neural networks?

That is because the two are very closely related. In logistic regression, there is just one unit of computation (i.e. using the formula to calculate a regression output and applying sigmoid on it to get a classification output).


A neural network, on the other hand, has many such computation units (also called neurons). These neurons are arranged in layers, with each layer containing many neurons, and a neural network can have many such layers.

Each layer can have its own activation function (like sigmoid in our logistic regression example). The idea is that once the cost function is computed, the loss is backpropagated through each neuron, updating its weights and bias.
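To make the connection concrete, here is a minimal sketch of the forward pass of a shallow (one-hidden-layer) network in NumPy. The layer sizes and the use of tanh in the hidden layer are illustrative assumptions, not something fixed by this article:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

n_x, n_h, m = 3, 4, 5   # hypothetical: 3 input features, 4 hidden neurons, 5 examples

# each layer has its own weights and biases
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.01
b2 = np.zeros((1, 1))

X = np.random.randn(n_x, m)   # one column per example

# forward pass: each layer is a logistic-regression-like computation stacked together
Z1 = np.dot(W1, X) + b1
A1 = np.tanh(Z1)              # hidden layer activation (tanh is an assumed choice)
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)              # output layer gives probabilities, just like logistic regression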

Why do neural networks work?

Neural networks, particularly deep neural networks, work very well for problems such as image recognition, speech recognition and text analysis. Stay tuned, because I'll be writing another article detailing why they work and how to tune the hyper-parameters to develop highly accurate deep neural nets.
