BxD Primer Series: Feed Forward Neural Networks

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a 'one-post-one-topic' format. Today's post is on Feed Forward Neural Networks. Let's get started:

The What:

In a Feed Forward Neural Network (FFNN), information propagates forward through a series of interconnected layers of neurons, with each neuron receiving input from the previous layer and passing its output to the next layer.

The basic architecture of a feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the input data, the hidden layers perform a series of nonlinear transformations on it, and the output layer generates the final output, which can be a single value or a vector of values.

Each neuron in a feed-forward neural network has a set of weights and a bias associated with it. The weights determine the strength of the connections between the inputs and the neuron, while the bias determines how easily the neuron is activated. During training, the network learns to adjust these weights and biases to minimize the error between the predicted and actual output.
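As a minimal sketch of this computation (all shapes and values below are purely illustrative, not from this article), a single dense layer in NumPy looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))          # input vector: 4 features from the previous layer
W = rng.normal(size=(3, 4))        # weights: one row per neuron in this layer
b = np.zeros(3)                    # biases: one per neuron

z = W @ x + b                      # weighted sum of inputs plus bias
a = np.maximum(0, z)               # ReLU activation; passed forward to the next layer
print(a)
```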

Note: Many other architectures, such as convolutional neural networks, also have a feed-forward flow of input, but our use of the term here is limited to the type of network you see in the diagram below.

Below architecture has:

  • Input layer of n dimensions for an n-dimensional dataset
  • Output layer with one sigmoid-activated neuron that produces m discrete outputs
  • p-1 hidden layers with ReLU-activated neurons

[Image: feed-forward network architecture with an n-dimensional input layer, p-1 ReLU hidden layers, and a sigmoid output layer]
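A sketch of this kind of stack in Keras, with hypothetical sizes (n = 20 input features, p-1 = 3 hidden layers of 32 units each):

```python
import tensorflow as tf

n, num_hidden, units = 20, 3, 32

layers = [tf.keras.Input(shape=(n,))]                        # input layer: n features
layers += [tf.keras.layers.Dense(units, activation="relu")   # ReLU-activated hidden layers
           for _ in range(num_hidden)]
layers += [tf.keras.layers.Dense(1, activation="sigmoid")]   # sigmoid-activated output neuron
model = tf.keras.Sequential(layers)

model.summary()
```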

General Architecture Guidelines:

Architecture refers to the number of layers, the number and type of neurons in each layer, and how they are connected. Although there is no specific rule for deciding the architecture, we have compiled some guidelines you can use to get a good starting point.

Once this network is initialized, you can iteratively tune the architecture during training using techniques such as node pruning based on (small) values of weight vector after a certain number of training epochs. In other words, eliminating unnecessary/redundant nodes.

Creating the architecture means coming up with values for the number of layers and the number of nodes in each of those layers.

Input Layer: There is only one input layer per model, and the number of neurons in this layer is equal to the number of features (columns) in your data. Add one additional node for the bias term.

Output Layer: Every neural network has exactly one output layer. The number of neurons in this layer is completely determined by the output type of the model (see the sketch after this list):

  • If the neural network is a regressor, the output layer has a single node.
  • If the neural network is a binary classifier, it also has a single node. If softmax is used, the output layer has one node per class label in the model.
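In Keras terms, the cases above look like this (the class count of 10 is a hypothetical example):

```python
import tensorflow as tf

regression_head = tf.keras.layers.Dense(1)                        # regressor: one linear node
binary_head     = tf.keras.layers.Dense(1, activation="sigmoid")  # classifier: one node
softmax_head    = tf.keras.layers.Dense(10, activation="softmax") # softmax: one node per class
```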

Hidden Layer: Considerations in selecting number of hidden layers and number of neurons per layer:

Overfitting and underfitting:

  • Selecting too few neurons per layer or too few hidden layers may result in underfitting.
  • Selecting too many neurons per layer or too many hidden layers may result in overfitting.
  • The optimal number of neurons per layer and of hidden layers depends on the complexity of the data and the desired level of generalization.

Computational resources: As the number of neurons per layer and the number of hidden layers increase, so do the computational complexity of the network, the time required for training, and the amount of memory needed to store the network's parameters.

Learning rate: A network with a large number of neurons per layer and many hidden layers may require a smaller learning rate to prevent the weights from diverging during training.

  • A smaller learning rate, in turn, requires more training time to achieve good performance.

Regularization: A network with too many neurons per layer or too many hidden layers may overfit the training data, requiring regularization techniques such as dropout or L1/L2 penalties.

The optimal number of hidden layers is often determined through cross-validation, which evaluates the network's performance on a separate validation set. Some guidelines that can be useful:

  1. Number of neurons per layer: If you have an input with 100 features and an output with 10 classes, you might use a number of neurons per layer between 10 and 100.
  2. Number of hidden layers: A single hidden layer with enough neurons can achieve good performance for most problems.

You can usually prevent over-fitting without regularization if you keep the total number of hidden neurons below N_h:

N_h = N_s / (α * (N_i + N_o))

Where,

  • N_i is the number of input neurons.
  • N_o is the number of output neurons.
  • N_s is the number of samples in the training data set.
  • α is an arbitrary scaling factor, usually 2-10.
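As a quick illustration of this rule of thumb (the numbers below are hypothetical):

```python
# Rule-of-thumb upper bound on total hidden neurons, using the formula above.
N_i, N_o = 100, 10        # input and output neurons (hypothetical)
N_s = 50_000              # training samples (hypothetical)
alpha = 5                 # scaling factor, usually 2-10

N_h = N_s / (alpha * (N_i + N_o))
print(N_h)                # ~90 hidden neurons in total under these assumptions
```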

Connectivity of Neurons: Standard architecture for a feed-forward network is fully connected, meaning that every neuron in one layer is connected to every neuron in the next layer. This type of connectivity allows the network to learn complex, non-linear relationships between inputs and outputs.

However, a fully connected architecture may not always be the right choice:

  • High-dimensional data often requires deep networks, which face the problems of vanishing gradients and overfitting. Skip connections and dropout layers are the solution here.
  • Every connection is a weight parameter that needs to be optimized, which increases computational overhead. Sparse connectivity is the solution here.
  • When the input data is sparse, i.e. many of the input features are zero, it is best practice to use sparse connectivity.

Note 1: The idea of skip connections, also known as residual connections, is to add the output of an earlier layer directly to the input of a later layer, forming a shortcut connection (a minimal sketch follows below).

  • This helps to improve training speed, reduce the risk of vanishing gradients, and improve generalization performance.
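Here is a minimal sketch of such a shortcut in Keras; the layer width of 64 is a hypothetical choice, and the widths must match for the addition:

```python
import tensorflow as tf

# A skip (residual) connection: the block's input is added back to its
# output, giving gradients a shortcut path around the block.
inputs = tf.keras.Input(shape=(64,))
h = tf.keras.layers.Dense(64, activation="relu")(inputs)
h = tf.keras.layers.Dense(64)(h)
h = tf.keras.layers.Add()([h, inputs])           # shortcut: block output + original input
outputs = tf.keras.layers.Activation("relu")(h)

block = tf.keras.Model(inputs, outputs)
```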

Note 2: Dropout randomly drops a subset of neurons in a layer during training, effectively removing them from the network for that iteration. This forces the remaining neurons to take on more responsibility (a placement sketch follows the points below).

  • Placing dropout layers too early in the network can lead to information loss and degrade performance, as the input to later layers becomes noisier.
  • Placing dropout layers too late in the network may not regularize the network enough, as the earlier layers have already learned to co-adapt to noisy input.
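A minimal Keras sketch of typical dropout placement, with placeholder sizes and rates:

```python
import tensorflow as tf

# Dropout placed between hidden layers rather than directly on the raw
# input or on the output (all sizes and rates are illustrative).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),                  # drops 30% of activations each training step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```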

Note 3: Sparse connectivity helps reduce the number of parameters in the network. This is also called 'pruning connections'. Approaches to pruning (a simple magnitude-based sketch follows the list):

  • Randomly removing a fraction of connections between neurons.
  • Remove connections based on a specific pattern or structure in the network. For example, connections can be removed between neurons that are far apart in the network, or between neurons that have low weights.
  • Create groups of neurons in the network and selectively prune connections between groups. This approach preserves the group structure of network.
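As one concrete (and deliberately simple) example of the low-weight approach, the NumPy sketch below zeroes out the smallest-magnitude connections of a weight matrix; real pruning pipelines usually follow this with fine-tuning:

```python
import numpy as np

def prune_small_weights(W, fraction=0.5):
    """Zero out the given fraction of connections with the smallest |weight|."""
    threshold = np.quantile(np.abs(W), fraction)
    return W * (np.abs(W) >= threshold)

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 64))          # hypothetical weight matrix of one layer
W_pruned = prune_small_weights(W, fraction=0.5)
print((W_pruned == 0).mean())          # roughly 50% of connections removed
```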

The How:

We already provided a general process for building any neural network in a previous edition. Check here.

For that reason, we will focus on an important algorithm common to all neural networks:

Back-propagation: Back-propagation is needed to optimize the weights and biases of a neural network by minimizing the loss function. Since the loss can only be calculated at the last layer, i.e. where predicted and true values can be compared, we need a way for the other layers to know about the loss and update their weights proportionally.

A neural network can be considered an assembly of neurons in multiple layers, with each layer having multiple neurons and an associated weight matrix. Back-propagation allows the error to be propagated backwards through the network, from the output layer to the input layer.

Here is how it works:

Step 1: Calculate the loss at output layer.

Step 2: The general weight update rule for any neuron in layer l is given by:

W[l] := W[l] - η * dJ/dW[l]

Where,

  • W[l] is the weight matrix of layer l
  • η is the learning rate
  • dJ/dW[l] is the partial derivative of the loss function J with respect to the weights W[l] of layer l
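In code, this update is a single step of gradient descent; a minimal NumPy sketch with illustrative shapes:

```python
import numpy as np

eta = 0.01                               # learning rate (placeholder value)
W_l = np.random.randn(3, 4)              # current weights of layer l (3 neurons, 4 inputs)
dJ_dW_l = np.random.randn(3, 4)          # gradient of the loss w.r.t. these weights

W_l = W_l - eta * dJ_dW_l                # step against the gradient
```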

Step 3: J is a function of the predicted value, i.e. the output of the last layer, which in turn is a function of the activation function, weights and input of the last layer, which is a function of the output of the previous layer, and so on…

  • You can see it is a sequential chain. Hence, the chain rule of derivatives is used to calculate the derivative at any general step.

dJ/dW[l] = (dJ/da[L]) * (da[L]/dz[L]) * (dz[L]/da[L-1]) * … * (da[l]/dz[l]) * (dz[l]/dW[l]), where L denotes the last layer.

Let's consider a layer with m inputs and n neurons.

  • Let a[l-1] be the m×1 input to this layer,
  • W[l] be the weight matrix of size n×m,
  • b[l] be the bias vector of size n×1,
  • z[l] be the weighted input to the layer, given by: z[l] = W[l]*a[l-1] + b[l],
  • g(z[l]) be the activation function for this layer, where g is applied element-wise to the vector z[l],
  • a[l] = g(z[l]) be the output of this layer, a vector of size n×1,
  • Let J be the loss function of the neural network.

Then, the gradients of the loss J with respect to the weights W[l] and biases b[l] of this layer can be computed using the chain rule as:

dJ/dz[l] = dJ/da[l] ⊙ g'(z[l])   (⊙ denotes element-wise multiplication)
dJ/dW[l] = dJ/dz[l] * a[l-1]^T
dJ/db[l] = dJ/dz[l]
dJ/da[l-1] = W[l]^T * dJ/dz[l]   (passed backward to layer l-1)

This gradient is then used to update weights.
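A NumPy sketch of this single-layer backward pass, assuming a ReLU activation and illustrative shapes (m = 4 inputs, n = 3 neurons):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(4, 1))     # a[l-1], m x 1 input to the layer
W = rng.normal(size=(3, 4))          # W[l],   n x m weight matrix
b = np.zeros((3, 1))                 # b[l],   n x 1 bias vector

# Forward pass
z = W @ a_prev + b                   # z[l] = W[l]*a[l-1] + b[l]
a = relu(z)                          # a[l] = g(z[l])

# Backward pass (dJ/da[l] is assumed to arrive from the layer above)
dJ_da = rng.normal(size=(3, 1))
dJ_dz = dJ_da * relu_grad(z)         # element-wise product with g'(z[l])
dJ_dW = dJ_dz @ a_prev.T             # gradient for W[l], shape n x m
dJ_db = dJ_dz                        # gradient for b[l], shape n x 1
dJ_da_prev = W.T @ dJ_dz             # propagated backward to layer l-1
```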

For the last layer, dJ/da[l] can be calculated directly, since J is a direct function of a[l].

  • For the second-last layer, J becomes a function of its output through the entire last layer.
  • A similar chain continues 'backward' through the network.

You can observe that the weights of the last layer are updated first and the weights of the first layer are updated last. As gradients propagate backward through a deep network they tend to shrink, so earlier layers receive weaker updates; this is the 'vanishing gradient' problem.

Parameters for Problem Specific Optimization:

Several parameters can be tuned in a FFNN to make the basic architecture suitable for the problem and data at hand (a brief configuration sketch follows the list):

  • Input size: Number of input features or variables to neural network.
  • Output size: Number of output neurons in neural network.
  • Weight matrix: Matrix of weights that are learned during training and used to compute the weighted sum of inputs in each neuron.
  • Bias vector: Vector of biases that are learned during training and added to the weighted sum of inputs in each neuron.
  • Number of layers: Number of layers in neural network, not including input layer.
  • Number of neurons per layer: Number of neurons in each hidden layer of neural network.
  • Activation function for each layer: Activation function applied to output of each neuron in each layer of neural network.
  • Initialization: Method used to initialize weights and biases in neural network before training.
  • Epochs: Number of times the entire dataset is passed through neural network during training.
  • Loss function: Function used to measure the difference between predicted and true output.
  • Optimizer: Optimization algorithm used to update weights and biases during training.
  • Learning rate: Controls step size of weight updates during training.
  • Momentum: Controls impact of past weight updates on current update.
  • Learning rate decay rate: Controls the rate at which learning rate decreases during training.
  • Regularization: Techniques used to prevent overfitting by adding penalties to loss function.
  • Batch size: Number of samples used to compute the gradient during each training iteration.
  • Batch normalization: Technique that normalizes the inputs to each layer in neural network.
  • Dropout rate: Controls the probability of randomly dropping out neurons during training.
  • Weight decay: Regularization technique that adds a penalty to loss function based on the magnitude of weights.
  • Early stopping criteria: Technique used to stop training early based on certain criteria, such as when the validation loss stops improving.
  • Gradient clipping threshold: Technique used to limit the magnitude of gradients during training to prevent exploding gradients.
  • Sparsity: Technique used to create sparse connectivity patterns in neural network.
  • Optimization method parameters: Additional hyperparameters specific to the optimization algorithm, such as the beta1/beta2 parameters for the Adam optimizer.
  • Residual connections or skip connections: Technique used in deep residual networks to allow information to bypass certain layers and be passed directly to later layers.
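As a rough illustration of where many of these knobs live in practice, here is a Keras sketch; every value below is a placeholder, not a recommendation:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                                        # input size
    tf.keras.layers.Dense(32, activation="relu",                        # neurons per layer, activation
                          kernel_initializer="he_normal",               # initialization
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # weight decay / L2
    tf.keras.layers.Dropout(0.2),                                       # dropout rate
    tf.keras.layers.Dense(1, activation="sigmoid"),                     # output size
])

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3,                # learning rate
                                     beta_1=0.9, beta_2=0.999)          # optimizer-specific parameters
model.compile(optimizer=optimizer, loss="binary_crossentropy")          # loss function

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",       # early stopping criteria
                                              patience=5)

# model.fit(X_train, y_train, epochs=50, batch_size=64,                 # epochs, batch size
#           validation_split=0.2, callbacks=[early_stop])
```

The fit call is commented out because X_train and y_train are placeholders for your own data.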

The Why:

Reasons to use Feed Forward Neural Networks:

  1. Computationally efficient and can quickly process large amounts of data.
  2. Known to be universal approximators, meaning they can approximate any continuous function to a high degree of accuracy, given enough hidden neurons.
  3. Once trained, feed-forward neural networks can generalize well to new, unseen data, making them useful for complex tasks such as image recognition, speech recognition etc.
  4. Can be easily implemented on parallel computing architectures, making them suitable for tasks that require large-scale parallel processing.
  5. Robust to noisy inputs by incorporating regularization techniques, such as dropout.
  6. Can be implemented using standard machine learning libraries and frameworks.

The Why Not:

Reasons to not use Feed Forward Neural Networks:

  1. Difficult to interpret, as the relationships learned by the network are not easily visualizable.
  2. Have a large number of hyperparameters, which can be difficult to tune.
  3. Require large amounts of labeled training data to learn meaningful patterns and relationships, which can be a limiting factor in some applications.
  4. Prone to the vanishing gradient problem, which occurs when the gradient of error function becomes too small to effectively update the weights in earlier layers of network.
  5. Limited ability to capture long-term dependencies and context, which is a limiting factor in tasks such as natural language processing.
  6. Lack feedback connections, which means that they cannot use the output of network to adjust the input or to modify the network architecture dynamically.

Time for you to support:

  1. Reply to this article with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIN (here)

In next edition, we will cover Radial Basis Neural Networks.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #feedforward #neuralnetworks #primer
