BxD Primer Series: Feed Forward Neural Networks
Hey there!
Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Feed Forward Neural Networks. Let’s get started:
The What:
In Feed Forward Neural Networks (FFNN), information propagates forward through a series of interconnected layers of neurons, with each neuron receiving input from the previous layer and passing its output to the next layer.
The basic architecture of a feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the raw data, the hidden layers perform a series of non-linear transformations on it, and the output layer generates the final output, which can be a single value or a vector of values.
Each neuron in a feed-forward neural network has a set of weights and a bias associated with it. The weights determine the strength of the connection between each input and the neuron, while the bias determines how easily the neuron is activated. During training, the network learns to adjust these weights and biases to minimize the error between the predicted and actual output.
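To make this concrete, here is a minimal NumPy sketch of a forward pass; the layer sizes, ReLU activation and random weights are our illustrative assumptions, not part of the original post:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, layers):
    """Propagate an input vector through a list of (W, b) layer parameters."""
    a = x
    for W, b in layers:
        z = W @ a + b      # weighted sum of inputs plus bias
        a = relu(z)        # non-linear activation
    return a

# Toy network: 3 inputs -> 4 hidden neurons -> 2 outputs (randomly initialized)
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
print(forward(np.array([0.5, -1.2, 3.0]), layers))
```

In practice the output-layer activation would depend on the task (e.g. softmax for classification, identity for regression).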
Note: Information also flows in a feed-forward manner in many other architectures, such as convolutional neural networks, but our use of the term here is limited to the type of network shown in the diagram below.
The architecture shown below has:
General Architecture Guidelines:
Architecture refers to the number of layers, the number and type of neurons in each layer, and how they are connected. Although there is no fixed rule for deciding this architecture, we have compiled some guidelines you can use to get a good starting point.
Once this network is initialized, you can iteratively tune the architecture during training, using techniques such as node pruning, i.e. eliminating unnecessary or redundant nodes whose weight vectors remain small after a certain number of training epochs. A sketch of this idea follows below.
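As an illustration, a minimal sketch of magnitude-based node pruning, assuming one weight matrix into and one out of the hidden layer; the threshold value is an arbitrary assumption:

```python
import numpy as np

def prune_nodes(W_in, W_out, threshold=1e-2):
    """Drop hidden neurons whose incoming weight vector has a small L2 norm.

    W_in  : weights into the hidden layer, shape (n_hidden, n_inputs)
    W_out : weights out of the hidden layer, shape (n_outputs, n_hidden)
    """
    keep = np.linalg.norm(W_in, axis=1) > threshold   # neurons worth keeping
    return W_in[keep], W_out[:, keep]

# Example: the second hidden neuron has near-zero weights and gets removed
W_in = np.array([[0.8, -0.3], [1e-4, 2e-4], [0.5, 0.9]])
W_out = np.array([[0.2, 0.7, -0.1]])
W_in_p, W_out_p = prune_nodes(W_in, W_out)
print(W_in_p.shape, W_out_p.shape)   # (2, 2) (1, 2)
```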
Creating the architecture means deciding the number of layers and the number of nodes in each of these layers.
Input Layer: There is only one input layer per model, and the number of neurons in this layer is equal to the number of features (columns) in your data. Add one additional node for the bias term.
Output Layer: Every neural network has exactly one output layer. The number of neurons in this layer is entirely determined by the type of output the model produces; for example, a regression model typically needs one output neuron, while a k-class classification model needs k.
Hidden Layers: Considerations in selecting the number of hidden layers and the number of neurons per layer:
- Overfitting and underfitting: Too few hidden neurons or layers can underfit the data, while too many can overfit it.
- Computational resources: As the number of neurons per layer and the number of hidden layers increase, so do the computational complexity of the network, the time required for training, and the amount of memory needed to store the network’s parameters.
- Learning rate: A network with a large number of neurons per layer and many hidden layers may require a smaller learning rate to prevent the weights from diverging during training.
- Regularization: A network with too many neurons per layer or too many hidden layers may overfit the training data, requiring regularization techniques such as dropout or L1/L2 regularization to prevent overfitting.
The optimal number of hidden layers and neurons is often determined through cross-validation, which evaluates the network’s performance on a separate validation set (a short code sketch follows the sizing rule below). There are some guidelines that can be useful:
You can usually prevent over-fitting without regularization if you keep the total number of hidden neurons below N_h:
N_h = N_s / (α × (N_i + N_o))
Where,
- N_s = number of samples in the training set
- N_i = number of input neurons
- N_o = number of output neurons
- α = an arbitrary scaling factor, usually between 2 and 10
For example, with 10,000 training samples, 10 input features, 1 output neuron and α = 5, N_h ≈ 10,000 / (5 × 11) ≈ 182 hidden neurons.
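Beyond this rule of thumb, the cross-validation idea mentioned above can be sketched as follows; scikit-learn, the synthetic dataset and the candidate widths are our assumptions, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic classification data just for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Evaluate a few candidate hidden-layer widths and keep the best-scoring one
for width in (8, 16, 32, 64):
    model = MLPClassifier(hidden_layer_sizes=(width,), max_iter=500, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"hidden neurons = {width:3d}  mean CV accuracy = {score:.3f}")
```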
Connectivity of Neurons: The standard architecture for a feed-forward network is fully connected, meaning that every neuron in one layer is connected to every neuron in the next layer. This type of connectivity allows the network to learn complex, non-linear relationships between inputs and outputs.
However, a fully connected architecture may not always be the right choice:
Note 1: The idea of skip connections, also known as residual connections, is to add the output from a previous layer directly to the input of a later layer, forming a shortcut connection.
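A minimal sketch of such a shortcut (our illustration, assuming the block input and output have the same dimension so they can be added):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(x, W1, b1, W2, b2):
    """Two dense layers whose output is added back to the block's input."""
    h = relu(W1 @ x + b1)      # first transformation
    out = W2 @ h + b2          # second transformation
    return relu(out + x)       # skip connection: shortcut from input to output

# Shapes must allow the addition: here both x and the block output have size 4
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
print(residual_block(x, rng.standard_normal((8, 4)), np.zeros(8),
                        rng.standard_normal((4, 8)), np.zeros(4)))
```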
Note 2: Dropout randomly drops out a subset of neurons in a layer during training, effectively removing them from the network for that iteration. This has the effect of forcing the remaining neurons to take on more responsibility.
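A minimal sketch of ‘inverted’ dropout, the common variant that rescales the surviving activations (the drop probability here is an arbitrary assumption):

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    """Randomly zero out a fraction p of activations during training.

    Surviving activations are scaled by 1/(1-p) so the expected activation
    is unchanged at inference time, when dropout is switched off.
    """
    if not training:
        return a
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p    # keep each neuron with probability 1-p
    return a * mask / (1.0 - p)

# Example: roughly half of the hidden activations are zeroed on this pass
print(dropout(np.ones(10), p=0.5, rng=np.random.default_rng(0)))
```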
Note 3: Sparse connectivity helps to reduce the number of parameters in the network. This is also called ‘pruning connections’. One common approach, sketched below, zeroes out individual low-magnitude weights so that only the useful connections remain.
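A minimal sketch of magnitude-based connection pruning (the threshold value is an arbitrary assumption):

```python
import numpy as np

def prune_connections(W, threshold=1e-2):
    """Zero out individual low-magnitude weights to make the layer sparse."""
    mask = np.abs(W) >= threshold      # connections worth keeping
    return W * mask, mask              # reuse the mask to keep pruned weights at zero

W = np.array([[0.80, -0.003], [0.002, 0.65]])
W_sparse, mask = prune_connections(W)
print(W_sparse)    # the small off-diagonal weights are removed
```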
The How:
We already covered the general process of building any neural network in a previous edition. Check it here.
For that reason, we will focus here on an important algorithm common to all neural networks:
Back-propagation: Backpropagation is needed to optimize the weights and biases of a neural network by minimizing the loss function. Since the loss can only be calculated at the last layer, i.e. where the predicted and true values can be compared, we need a way for the other layers to know about the loss and update their weights proportionally.
Neural networks can be considered an assembly of neurons arranged in multiple layers, with a weight matrix associated with each layer. Backpropagation allows the error to be propagated backwards through the network, from the output layer to the input layer.
Here is how it works:
Step 1: Calculate the loss at the output layer.
Step 2: The general weight update rule for any neuron in layer ‘l’ is given by:
W[l] := W[l] − η × dJ/dW[l]
b[l] := b[l] − η × dJ/db[l]
Where,
- η is the learning rate
- J is the loss function
- dJ/dW[l] and dJ/db[l] are the gradients of the loss with respect to the weights and biases of layer l
Step 3: J is a function of the predicted value, i.e. the output of the last layer, which in turn is a function of the activation function, the weights and the input to the last layer, which is itself the output of the previous layer, and so on…
Let’s consider a layer with m inputs and n neurons.
Then, the gradients of the loss J with respect to the weights W[l] and biases b[l] of this layer can be computed using the chain rule. Writing z[l] = W[l]·a[l−1] + b[l] for the pre-activation and g for the activation function, so that a[l] = g(z[l]):
dJ/dW[l] = (dJ/da[l] ⊙ g′(z[l])) · a[l−1]ᵀ        (an n × m matrix)
dJ/db[l] = dJ/da[l] ⊙ g′(z[l])
dJ/da[l−1] = W[l]ᵀ · (dJ/da[l] ⊙ g′(z[l]))
Here ⊙ denotes element-wise multiplication; the last expression is what carries the error back to the previous layer.
This gradient is then used to update weights.
For the last layer, dJ/da[l] can be calculated directly, since J is a direct function of a[l].
You can observe that the weights of the last layer are updated first and the weights of the first layer are updated last. As the error signal propagates backward through a deep network, its influence diminishes at each layer; this phenomenon is known as the ‘vanishing gradient’ problem.
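To tie the steps together, here is a minimal NumPy sketch of one forward and backward pass through a two-layer network; the sigmoid activation, squared-error loss and layer sizes are our assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x      = rng.standard_normal(3)            # input with 3 features
y_true = np.array([1.0])                   # target value

# Layer parameters: 3 -> 4 -> 1
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)
lr = 0.1                                   # learning rate (eta)

# ---- Forward pass ----
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
J = 0.5 * np.sum((a2 - y_true) ** 2)       # Step 1: loss at the output layer

# ---- Backward pass (chain rule) ----
dJ_da2 = a2 - y_true                       # dJ/da for the last layer, computed directly
dJ_dz2 = dJ_da2 * a2 * (1 - a2)            # sigmoid'(z) = a * (1 - a)
dJ_dW2 = np.outer(dJ_dz2, a1)
dJ_db2 = dJ_dz2

dJ_da1 = W2.T @ dJ_dz2                     # error propagated back to the hidden layer
dJ_dz1 = dJ_da1 * a1 * (1 - a1)
dJ_dW1 = np.outer(dJ_dz1, x)
dJ_db1 = dJ_dz1

# ---- Step 2: weight updates (last layer first, then earlier layers) ----
W2 -= lr * dJ_dW2; b2 -= lr * dJ_db2
W1 -= lr * dJ_dW1; b1 -= lr * dJ_db1
print("loss:", J)
```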
Parameters for Problem-Specific Optimization:
Several parameters can be tuned in a FFNN to make the basic architecture suitable for the problem and data at hand:
The Why:
Reasons to use Feed Forward Neural Networks:
The Why Not:
Reasons not to use Feed Forward Neural Networks:
Time for you to support:
In next edition, we will cover Radial Basis Neural Networks.
Let us know your feedback!
Until then,
Have a great time!