BxD Primer Series: Introduction to Neural Networks and Perceptron

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a ‘one-post-one-topic’ format. Today’s post is an Introduction to Neural Networks and the Perceptron. Let’s get started:

Introduction to Neural Networks

Neural networks are based on the philosophy that any output can be modeled as a function of the inputs, and this function need not be known beforehand; rather, some basic building blocks can be assembled and coupled with forward and backward propagation techniques to create a ‘data-learned network’. This ‘network’ becomes the function and will generate the output when given an input.

The analogy has roots in a functional understanding of living organisms, where the neuron is thought of as the basic unit of cognition and a functional living body (that can perform many general tasks) is thought of as a large collection of neurons.

Basic Building Blocks:

The basic building blocks of a neural network are neurons, weights, bias, activation functions, and layers:


Neurons: Neurons are the basic computational units in a neural network. They receive input, process it through a function, and produce an output.

  1. Input neurons: These neurons receive input from the environment or from other neurons in the network. They pass this input on to other neurons in the network.
  2. Hidden neurons: These neurons do not directly touch the input or output of the network. They are used to process the input and produce intermediate representations of the data.
  3. Output neurons: These neurons produce the final output of network. They take input from the hidden layer(s) and produce the final predictions or decisions.
  4. Recurrent neurons: These neurons have feedback connections that allow them to send output back to themselves or to other neurons in network. This allows the network to maintain a memory of past inputs and to produce output that depends on previous inputs.
  5. Convolutional neurons: These neurons are designed to process images or other multi-dimensional data by performing convolutions on the input.
  6. Long short-term memory (LSTM) neurons: These neurons are commonly used in recurrent neural networks (RNNs) and are designed to handle long-term dependencies in sequential data.
  7. Self-organizing map (SOM) neurons: SOM neurons are designed to cluster similar inputs together.
  8. Spiking neurons: These neurons model the behavior of biological neurons more closely than traditional artificial neurons. They use a spike-based information propagation instead of a continuous signal and can be used for temporal pattern recognition.
  9. Adaptive resonance theory (ART) neurons: These neurons are designed to learn and recognize patterns in noisy or ambiguous data. Also known as denoising neurons.


Weights: Weights are the parameters that a neural network learns during training. They are part of the function computed inside each neuron.


Bias: Bias is an additional parameter added to the weighted sum inside a neuron. It allows the neuron to learn an offset in the data.


Activation functions: Activation functions introduce non-linearity into the neural network. They are applied as the last calculation step of a neuron.

Reasons to use activation function:

  1. Introduce non-linearity: Without an activation function, the output of a neural network would be a linear function of its input, which limits its capacity to model complex relationships in data. Activation functions introduce non-linearity into the output of a neuron (see the small sketch after this list).
  2. Control the output range: Activation functions are designed to constrain the output of a neuron to a specific range, such as between 0 and 1. This helps to ensure that the output of the network is interpretable and can be used to make predictions.
  3. Enable backpropagation: Backpropagation is a key algorithm used to train neural networks by adjusting the weights of connections between neurons. Activation functions are required for this algorithm to work, as they enable calculation of the derivative of the output with respect to the input and weights.
  4. Improve convergence: Activation functions such as ReLU have been shown to improve the convergence of neural networks during training, which means that they can learn the desired function more quickly and with fewer training examples.
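
To see why non-linearity matters, here is a minimal NumPy sketch (the matrices and sizes are made up purely for illustration) showing that two stacked layers with no activation collapse into a single linear map, while inserting a ReLU between them does not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2,))                                 # 2 input features
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3,))   # layer 1 parameters
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1,))   # layer 2 parameters

# Two linear layers with no activation...
two_linear = W2 @ (W1 @ x + b1) + b2
# ...are equivalent to ONE linear layer with combined weights.
W_combined, b_combined = W2 @ W1, W2 @ b1 + b2
one_linear = W_combined @ x + b_combined
print(np.allclose(two_linear, one_linear))   # True

# With a ReLU in between, the equivalence breaks: non-linearity is added.
relu = lambda z: np.maximum(0, z)
non_linear = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(non_linear, one_linear))   # False (in general)
```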

Types of activation functions:

  • Identity function simply returns its input value. It is often used in regression problems, where the goal is to predict a continuous value.
  • Binary step function returns 1 if the input is greater than or equal to zero, and 0 otherwise. It is often used for binary classification problems.
  • Sigmoid function maps any real-valued number to a value between 0 and 1. It is often used for binary classification problems.
  • Tanh function maps any real-valued number to a value between -1 and 1.


  • ReLU (Rectified Linear Unit) function returns 0 for any negative input and the input itself for any non-negative input.
  • Leaky ReLU function is similar to ReLU but with a small positive slope for negative inputs. This helps to avoid the "dying ReLU" problem, where neurons in the network become inactive and stop learning.
  • Parametric ReLU (PReLU) is a variant of Leaky ReLU where the slope of the negative part of the function is learned during training instead of being fixed.
  • Exponential Linear Unit (ELU) is a variant of ReLU that uses an exponential function for negative inputs. This helps to reduce the vanishing gradient problem and improve the robustness of the network to noise in the input data.


  • Softplus function is a smoothed version of the ReLU function. It maps any real-valued number to a positive value, which makes it useful for ensuring that the output of a neural network is always positive.
  • Swish function is similar to ReLU but with a smoother, continuous curve. Swish has been shown to improve the accuracy of deep neural networks compared to ReLU.


  • Softmax function maps a vector of real numbers to a probability distribution, where the sum of all the probabilities is equal to 1. It is used as an output activation function for multi-class classification problems.

σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k),   for j = 1, …, K

Where,

  • z is a vector of real numbers
  • K is the dimensionality of the vector
  • σ(z)_j is the j’th element of the softmax function output
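
As a quick reference, here is a minimal NumPy sketch (the function names are ours, not from any particular library) of a few of the activation functions listed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                       # maps to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)               # 0 for negatives, identity otherwise

def leaky_relu(z, slope=0.01):
    return np.where(z >= 0, z, slope * z)   # small slope for negatives

def softmax(z):
    e = np.exp(z - np.max(z))               # subtract max for numerical stability
    return e / e.sum()                      # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), softmax(z).sum())            # softmax output sums to 1.0
```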


Layers: Layers are the basic organizational structure of a neural network. Each layer is composed of a group of neurons. The most common types of layers are input, hidden, and output layers (a short code sketch follows the list below).

  1. Input layer is the first layer of a neural network. It receives the input data and passes it to the next layer for processing.
  2. Hidden layer(s) are between the input and output layers. They perform complex calculations on the input data and pass the output to the next layer.
  3. Output layer is the last layer of a neural network. It produces the output of the network, which could be a prediction, classification, or generation based on the input data.
  4. Convolutional layers are commonly used for image processing tasks. They use filters to extract features from the input image.
  5. Pooling layers are often used in conjunction with convolutional layers. They reduce the size of the input by down-sampling the features extracted by the convolutional layer.
  6. Recurrent layers are used for processing sequential data, such as speech or text. They allow the network to maintain a memory of previous inputs and use that information to process the current input.
  7. Dropout layers are used to prevent overfitting in a neural network. They randomly drop out some of the neurons in a layer during training to prevent the network from becoming too specialized to the training data.
  8. Batch normalization layers are used to improve the stability and speed of training. They normalize the input data for each mini-batch during training.
  9. Embedding layers are commonly used in natural language processing (NLP) tasks. They map each word in a text corpus to a numerical vector representation, which can be used as input to the neural network.
  10. Attention layers are also used in NLP tasks, such as translation or text summarization. They allow the network to focus on certain parts of the input sequence, rather than processing the entire sequence at once.
  11. Normalization layers are similar to batch normalization layers, but they normalize the input data for each individual example, rather than for each mini-batch.
  12. Skip connection layers allow the network to bypass one or more layers and pass the output of a previous layer directly to a later layer. This helps to prevent the vanishing gradient problem and improves the flow of information through the network.
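
To make some of these layer types concrete, here is a minimal sketch using the Keras API (assuming TensorFlow is installed; the layer sizes and input shape are arbitrary choices for illustration) that stacks several of them into a small image classifier:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small illustrative stack: convolution -> pooling -> dense layers,
# with batch normalization and dropout for stability and regularization.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                     # input layer (grayscale images)
    layers.Conv2D(16, kernel_size=3, activation="relu"),   # convolutional layer
    layers.MaxPooling2D(pool_size=2),                      # pooling layer (down-sampling)
    layers.BatchNormalization(),                           # batch normalization layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                   # hidden (fully connected) layer
    layers.Dropout(0.5),                                   # dropout layer to reduce overfitting
    layers.Dense(10, activation="softmax"),                # output layer for 10 classes
])
model.summary()
```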


Notation:

Throughout our series, we will be using a slightly ‘counter-culture’ but more descriptive notation for neural networks. Aaron Master provides a more detailed explanation of this notation here.

In this notation, we have:

  • One input layer with inputs x1, x2 and a bias
  • One hidden layer with three neurons. The neurons use ReLU as the activation function, hence you see R written in them
  • One output layer, which uses the softmax activation function, hence you see S written in it
  • Hidden and output layers have their parameters written below them:
  • Superscripts [1] and [2] denote the first and second layers where outputs are generated
  • W[1] is the weight matrix of 3x2 dimension in the first layer: 3 neurons and 2 inputs (x1, x2)
  • W[2] is the weight matrix of 1x3 dimension in the second layer: 1 neuron and 3 inputs
  • b[1] is the bias matrix of 3x1 dimension in the first layer: 3 neurons and 1 bias input
  • b[2] is the bias matrix of 1x1 dimension in the second layer: 1 neuron and 1 bias input
  • ‘Parameters:’ signifies that these can be learned during training
  • Arrows denote the connectivity of neurons

[Figure: a 2-input, 3-hidden-neuron, 1-output network drawn in this notation]
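
As a sanity check on the dimensions above, here is a minimal NumPy sketch (values are random, purely illustrative) of the forward pass through this 2-input, 3-hidden-neuron, 1-output network:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(2, 1))          # input column vector: x1, x2

W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))  # layer [1] parameters
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))  # layer [2] parameters

relu = lambda z: np.maximum(0, z)

a1 = relu(W1 @ x + b1)               # hidden layer output, shape (3, 1)
z2 = W2 @ a1 + b2                    # output layer pre-activation, shape (1, 1)
print(a1.shape, z2.shape)            # (3, 1) (1, 1)
```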

General Steps for Building a Neural Network:

  1. Define the problem: It could be a classification task, a regression task, a probabilistic prediction, a generative task, or anything that you can formulate mathematically.
  2. Collect and preprocess the data: This involves cleaning the data, formatting it in a way that the neural network can understand, and splitting it into training, validation, and testing sets.
  3. Choose the architecture: This refers to the layout of neuron layers and the connections between them. There are many different types of architectures to choose from, such as feedforward, convolutional, recurrent, etc. The choice of architecture depends on the type of problem and the characteristics of the data.
  4. Select the activation function: The choice of activation function impacts the performance of the network, training time, and quality. ReLU, sigmoid, and tanh are common choices.
  5. Determine the number of layers: More layers can provide greater representational power, but can also make the network harder to train. The optimal number of layers depends on the complexity of the problem and the amount of data available.
  6. Choose the loss function: It is used to measure the difference between the predicted output of the network and the actual output. The choice of loss function depends on the type of problem, such as mean squared error for regression problems and binary cross-entropy loss for classification problems.
  7. Select the optimizer: It updates the parameters of the network during training to minimize the loss function. There are many different optimizers to choose from, such as stochastic gradient descent (SGD), Adam, Adagrad, etc.
  8. Set the hyper-parameters: Hyperparameters are settings that are chosen before training begins, such as the learning rate, batch size, and regularization strength. These choices have a significant impact on the performance of the network.
  9. Train the network on the training data (see the short sketch after this list). This involves feeding the data through the network, calculating the loss, and updating the parameters through backpropagation.
  10. Evaluate the performance: After training is complete, you need to evaluate the performance of the network on the validation and testing data.
  11. Tune the model: Based on the performance of the network, you may need to make changes to the architecture, hyperparameters, or other settings to improve its performance.
  12. Deploy the model: This involves integrating the network with other software systems, testing its performance in real-world scenarios, and ensuring that it is secure and reliable.
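
Steps 2-10 map onto a few lines of the Keras API. Here is a minimal sketch (assuming TensorFlow is installed; the synthetic data, layer sizes, and hyper-parameters are arbitrary stand-ins for illustration):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic stand-in data (step 2): 20 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

# Architecture, activation functions, and number of layers (steps 3-5)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # binary classification output
])

# Loss function, optimizer, and hyper-parameters (steps 6-8)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Training (step 9) and evaluation (step 10)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X_val, y_val, verbose=0))
```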

Using these basic building blocks and general steps, many neural network architectures can be formed.

Starting with Perceptron today:

The What:

The perceptron is the simplest type of artificial neural network. It consists of a single layer of artificial neurons. This layer takes in a set of input values, processes them using a set of weights, and produces an output value. The output value is then passed through an activation function to produce the final output.

The output can be fed into another perceptron or a layer of perceptrons to create a multi-layer perceptron (MLP). In an MLP, the output of one layer serves as the input to the next layer. MLPs can be used to solve more complex problems than a single perceptron by learning non-linear decision boundaries. However, each individual perceptron still consists of only a single layer of artificial neurons.

A perceptron can be thought of as a linear regression model followed by an activation function.

Note 1: In a stricter sense, the perceptron can only use the binary step activation function. The weight update is done using the simple error, rather than the gradient, because the gradient cannot be calculated for the discontinuous binary step activation function.

Note 2: A neuron is the more general version of the perceptron: it allows various activation functions and thereby allows the gradient to be used to update the weights. This also enables the back-propagation algorithm, which is the very foundation of training any ‘multi-layer’ neural network.

Note 3: However, nowadays it is common to use the terms perceptron and neuron interchangeably, so we will stick to the modern usage.

The How:

Here is how the perceptron algorithm works:

Given an input data vector x and a target output t, define a single perceptron/neuron with weight vector W and bias b.

Step 1: Initialize the weights of the perceptron/neuron to random values.

Step 2: Compute the weighted sum of inputs as the dot product of the input vector x and the weight vector W, plus the bias b:

z = W x + b

Step 3: Apply the activation function f(z) to the weighted sum z:

y = f(z)

Step 4: Compute the loss as a function of the predicted output y and the true output t:

L(y, t) = f_loss(y, t)

Step 5: Compute the gradient of the loss with respect to the weights and bias using the chain rule (a worked example follows the list below):

dL/dW = (df_loss/dy) * (dy/dz) * (dz/dW)

dL/db = (df_loss/dy) * (dy/dz) * (dz/db)

Where,

  • df_loss/dy is the derivative of the loss function with respect to the predicted output
  • dy/dz is the derivative of the activation function with respect to the weighted sum
  • dz/dW is x
  • dz/db is 1
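
For a concrete example (this particular loss/activation pairing is our choice for illustration): with a sigmoid activation y = 1 / (1 + e^(-z)) and a squared-error loss L = (1/2)(y - t)^2, the three factors are df_loss/dy = (y - t), dy/dz = y(1 - y), and dz/dW = x, so dL/dW = (y - t) * y * (1 - y) * x and, similarly, dL/db = (y - t) * y * (1 - y).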

Step 6: Update the weights and bias using the gradient descent algorithm:

W_new = W_old - alpha * dL/dW

b_new = b_old - alpha * dL/db

Where,

  • alpha is the learning rate, which controls the step size of the weight updates.
  • Weights and biases are updated in the opposite direction of the gradient, in order to minimize the loss.

Step 7: Repeat steps 2-6 for each input in the training set, with the goal of minimizing the loss across all inputs.

Step 8: Repeat steps 2-7 for multiple epochs: The training process is typically repeated for multiple epochs, where each epoch involves iterating over the entire training set once. This allows the perceptron to gradually improve its accuracy by adjusting the weights over multiple passes through the training data.

Step 9: Use the trained perceptron to make predictions: Once the perceptron has been trained, it can be used to make predictions on new, unseen inputs. The prediction for a given input x is obtained by computing the weighted sum and applying the activation function: y = f(W * x + b)

Output y is the predicted output for input x.
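
Putting steps 1-9 together, here is a minimal NumPy sketch of a single sigmoid neuron trained with squared-error loss and gradient descent (the toy dataset, learning rate, and epoch count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: 2 features, binary targets (AND-like)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])

rng = np.random.default_rng(0)
W = rng.normal(size=2)                     # Step 1: random initial weights
b = 0.0
alpha = 0.5                                # learning rate

for epoch in range(2000):                  # Step 8: multiple epochs
    for x_i, t_i in zip(X, t):             # Step 7: each training example
        z = W @ x_i + b                    # Step 2: weighted sum
        y = sigmoid(z)                     # Step 3: activation
        # Step 4: squared-error loss L = 0.5 * (y - t)^2 (implicit)
        # Step 5: chain rule -> dL/dz = (y - t) * y * (1 - y)
        dz = (y - t_i) * y * (1.0 - y)
        W -= alpha * dz * x_i              # Step 6: gradient descent update
        b -= alpha * dz

# Step 9: predictions on the training inputs
print(np.round(sigmoid(X @ W + b), 2))     # approaches [0, 0, 0, 1]
```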

Choosing Activation Function:

When selecting an activation function, it is important to consider the properties of the data and the task at hand.

  • Some activation functions work better than others for certain types of data or problems.
  • It is also possible to combine different activation functions in a single network, by using different activation functions in different layers of the network.

There is no universal theoretical framework for choosing an activation function. Here are some guidelines:

Sigmoid:

  • The sigmoid function suffers from the "vanishing gradient" problem, where the gradient of the function approaches zero for large or small input values. It becomes difficult for the network to learn from data with extreme values.
  • Output is bounded between 0 and 1, where the output represents the probability of the positive class.
  • Avoid using it in deep neural networks, where the vanishing gradient problem is more severe.

Hyperbolic tangent (tanh):

  • The tanh function is similar to the sigmoid function, but maps any real number to a value between -1 and 1.
  • It is preferred over sigmoid in situations where the average value of the input data is close to zero.

Rectified Linear Unit (ReLU):

  • For positive input values, the ReLU function does not suffer from the vanishing gradient problem, allowing for faster learning.
  • For negative input values, the ReLU function suffers from the "dying ReLU" problem, where the gradient of the function is zero, causing the corresponding neuron to stop learning.
  • It produces sparse representations, which can be useful for reducing overfitting.
  • Use it in deep neural networks where computational efficiency is important.
  • Avoid using it in networks with a small number of neurons, where the dying ReLU problem is more severe.

Leaky ReLU:

  • The Leaky ReLU function is similar to the ReLU function, but allows a small non-zero gradient for negative input values.
  • It prevents the "dead neurons" problem of ReLU, where a neuron always outputs zero and stops learning.

Parametric ReLU:

  • Parametric ReLU is Leaky ReLU with an adjustable slope for negative input values, allowing for better performance in some cases.

Exponential Linear Unit (ELU):

  • ELU can provide better performance in deep neural networks where the ReLU and Leaky ReLU functions are not performing well.

Softplus:

  • The Softplus function is always positive, allowing for better handling of negative input values than the ReLU family of functions.
  • Avoid using it in networks with a large number of neurons, where the computational overhead of the Softplus function outweighs the benefits.

Swish:

  • The Swish function is smooth and differentiable at all points, allowing for better handling of large or small input values than the ReLU family of functions.
  • Swish is a relatively new activation function and has shown promising results in some cases, but its performance has not been extensively tested in all scenarios.

Softmax:

  • Softmax is useful for multi-class classification problems, where the output of the network represents the probability of each class.
  • It ensures that the output probabilities are mutually exclusive, simplifying the classification task.
  • It can suffer from numerical instability when dealing with large input values.
  • Avoid it in problems where the output probabilities do not need to be mutually exclusive.

The Why:

Reasons to use a single-neuron network, a.k.a. a perceptron:

  1. For a binary classification problem where the data is linearly separable, a single-neuron network such as the perceptron can be a simple and effective solution.
  2. For data with a small number of dimensions (i.e. features), a single-neuron network can be a good choice because it is easy to interpret and can be trained quickly.
  3. If you need a model that can adapt to new data in real time, a single-neuron network can be a good choice because it supports online learning.
  4. If you need a model that can be deployed on devices with limited resources, a single-neuron network can be a good choice because it has low computational requirements.
  5. If you need to visualize the decision boundary of your model, a single-neuron network can be a good choice because it has only one output and is easy to interpret.

The Why Not:

Reasons not to use a single-neuron network, a.k.a. a perceptron:

  1. If the data is not linearly separable, a single-neuron network will not be able to learn the correct decision boundary, and you will need a more complex model.
  2. If the input data has a large number of dimensions, a single-neuron network will not be able to learn the complex relationships between the features and the output.
  3. If you need to learn complex features from your data, you will need a deep neural network with multiple layers, rather than a single-neuron network.
  4. If you need to classify data into more than two classes, a single-neuron network cannot be used directly, and you will need a multi-class classifier such as a softmax model.
  5. It is not appropriate for predicting a continuous output variable.

Time for you to support:

  1. Reply to this article with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In next edition, we will cover Feed Forward Neural Networks.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #Perceptron #neuralnetworks #primer
