Unveiling the Structure of Neural Networks: A Primer on the Basics

Artificial neural networks, often referred to simply as neural networks, are computational models inspired by the intricate structure and functioning of the human brain. They consist of interconnected nodes, or artificial neurons, organized into layers. The fundamental components of a neural network closely resemble the neurons found in the human brain, communicating with each other through connections known as synapses.

The neural network architecture typically comprises three main types of layers:

  • Input Layer: This layer receives the initial data or features. Each neuron in the input layer represents a feature or input variable.
  • Hidden Layers: Intermediate layers between the input and output layers where computations occur. These hidden layers enable the network to learn complex patterns and representations in the data. Deep neural networks have multiple hidden layers, contributing to the term "deep learning."
  • Output Layer: This layer produces the final result or prediction. The number of neurons in the output layer is determined by the nature of the task—classification, regression, etc.

Neural Network Structure

Each connection between neurons of the layers is assigned a weight, representing the strength of the connection. Biases are individual terms associated with each neuron, accounting for offsets and providing the network with flexibility to capture shifts in the data.

Neuron Processing

Each neuron applies an Activation Function ( more on this later in the article :-) ) to the weighted sum of its inputs, introducing non-linearity to the model and enabling it to learn intricate patterns.

In essence, the neural network architecture mirrors the interconnectedness of neurons in the human brain, allowing it to process information, learn from data, and make predictions or classifications. As data passes through the network during training, the weights and biases are adjusted based on the error at the output.

To understand further, in our simplified exploration, we'll focus on the fundamental training techniques that underpin the optimization of neural networks. Central to this process are methods such as Back-propagation, which leverages the Chain Rule, and the ubiquitous Gradient Descent that tries to minimize the value of the loss function. While there exist additional strategies for enhancing model performance, our attention in this article is dedicated to a closer examination of the intricacies of Gradient Descent.

Consider a basic illustration of a neural network featuring a single input neuron and one output neuron, complete with weights and biases to facilitate comprehension.

Simple 2 layer NN

In this scenario, we designate the:

  • Input value as 'x',
  • Weight as 'w', and
  • Bias as 'b'.

The output, denoted as 'y', is calculated through a simple formula:

y = (x * w) + b, highlighting that the output is the input multiplied by the weight and then shifted by the bias.
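The formula above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `neuron_output` and the sample values are assumptions made for this example and do not come from the article.

```python
def neuron_output(x, w, b):
    """Return the output of a single neuron with no activation: y = (x * w) + b."""
    return (x * w) + b


# Illustrative values only (assumed here, not taken from the article).
y = neuron_output(x=3.0, w=2.0, b=1.0)
print(y)  # 7.0
```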

Expanding our example to incorporate a hidden layer with a solitary neuron introduces an additional layer of complexity, involving a fresh set of weights and biases. Now, envision,

3 Layer NN with a hidden layer

'x' as the input value

'W_in', the weight for the connection between the input and hidden layer,

'B_in', the bias for the hidden layer neuron,

'W_out', the weight for the connection between the hidden and output layer, and

'B_out', the bias for the output layer.

The computations unfold as follows:

first, the hidden layer computation is expressed as

  • Y_hidden = (x * W_in) + B_in.

Subsequently, the output layer computation becomes

  • Y_output = (Y_hidden * W_out) + B_out
  • Y_output = (((x * W_in) + B_in) * W_out) + B_out

This depiction elucidates that, in a single hidden layer with one neuron per layer, the neural network generates an output based on the given input, integrating weights and biases from both the input-hidden and hidden-output layers.

Examining the outcomes of these functions, it becomes apparent that they yield linear outputs, reflecting a stark contrast from the intricate patterns present in real-world data. This simplified neural network, while effective in linear scenarios, lacks the complexity required to handle more nuanced and non-linear data representations.
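One way to see why the output stays linear is to expand the two-layer formula: the stacked layers collapse into a single weight and a single bias. The sketch below checks this numerically. The helper name `two_layer_linear` is hypothetical, and the parameter values are borrowed from the worked example later in the article.

```python
def two_layer_linear(x, w_in, b_in, w_out, b_out):
    """Forward pass through one hidden neuron and one output neuron, no activation."""
    y_hidden = (x * w_in) + b_in
    return (y_hidden * w_out) + b_out


# Parameter values reused from the article's worked example.
w_in, b_in, w_out, b_out = 2.0, 1.5, 0.5, 2.0

# Expanding Y_output = ((x * W_in) + B_in) * W_out + B_out gives
# Y_output = x * (W_in * W_out) + (B_in * W_out + B_out), which is one straight line.
w_equivalent = w_in * w_out
b_equivalent = (b_in * w_out) + b_out

for x in [-1.0, 0.0, 0.4, 3.0]:
    stacked = two_layer_linear(x, w_in, b_in, w_out, b_out)
    collapsed = (x * w_equivalent) + b_equivalent
    print(f"x={x:5.2f}  stacked={stacked:.4f}  collapsed={collapsed:.4f}")
```

Both columns should agree for every input, which suggests that adding more purely linear layers does not add expressive power.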

Linear pattern

To classify or make predictions with non-linear data, such as the pattern shown here,

Non-Linear pattern

a non-linear function is needed, and this is where activation functions come in.

The activation functions serve a crucial role in introducing non-linearity to the model. Without activation functions, the network would be limited to linear transformations, making it incapable of capturing complex patterns and relationships in the data. Activation functions enable neural networks to model and learn non-linear mappings, allowing them to approximate more intricate functions and patterns.

So the basic example, with an activation function applied to the single neuron in the hidden layer, becomes:

  • Y_hidden = (x * W_in) + B_in
  • Yaf_hidden=A_f(Y_hidden)

Here, A_f(Y_hidden) represents an activation function applied to Y_hidden. If the Sigmoid activation function is used, it is given by:

  • Yaf_hidden = 1 / (1 + e^(-Y_hidden))
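As a rough sketch, the sigmoid-activated hidden neuron could look like this in Python. The names `sigmoid` and `hidden_neuron` and the sample inputs are assumptions for illustration. Note that this plain version can overflow in `math.exp` for very large negative inputs.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z)). Very negative z can overflow math.exp."""
    return 1.0 / (1.0 + math.exp(-z))


def hidden_neuron(x, w_in, b_in):
    """Weighted sum followed by the sigmoid activation."""
    y_hidden = (x * w_in) + b_in
    return sigmoid(y_hidden)


# Illustrative inputs (assumed): equal steps in x no longer give equal steps in output.
outputs = [hidden_neuron(x, w_in=2.0, b_in=0.0) for x in (0.0, 1.0, 2.0)]
first_step = outputs[1] - outputs[0]
second_step = outputs[2] - outputs[1]
print(outputs)
print(first_step, second_step)
```

The two step sizes differ even though the inputs are evenly spaced, which is the non-linearity the activation function introduces.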

Many different activation functions are used in neural networks; the choice depends on the data and the problem to solve. Some are illustrated below.

Activation Function

Training

To be able to unleash the full potential of the model, the neural network needs to undergo a crucial phase known as training. This intricate process involves grappling with concepts like loss function, gradient descent, and backpropagation. Let's embark on a concise exploration, delving into the mathematical underpinnings of these fundamental concepts to unravel their significance in the training journey.

Loss Function

A loss function, also known as a cost function or objective function, is a mathematical measure that quantifies the difference between the predicted output of a model and the actual target values in the training dataset. The primary purpose of a loss function is to represent how well or poorly a model is performing on a given task. The goal during the training process is to minimize this loss, as a lower loss indicates better alignment between the model's predictions and the true values.

The choice of a specific loss function depends on the nature of the machine learning task. For regression tasks, where the goal is to predict a continuous value, Mean Squared Error (MSE) is commonly used as the loss function. For classification tasks, where the objective is to assign instances to predefined classes, Cross-Entropy Loss is widely employed. There are various other loss functions tailored to specific tasks, and the selection of the appropriate loss function is a critical aspect of designing an effective machine learning model.

One of the simplest loss functions, the Mean Squared Error (MSE), measures the average squared difference between the predicted output of the neural network and the actual target values. It is given by:

  • L(Y_output, Y_target) = (1/n) * Σ (Y_target – Y_output)^2

where Y_output is the predicted output, Y_target is the actual target value, n is the number of training samples, and the sum runs over those samples.
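A minimal sketch of the MSE computation, assuming the predictions and targets are given as equal-length Python lists. The function name `mse_loss` and the sample numbers are illustrative assumptions.

```python
def mse_loss(y_outputs, y_targets):
    """Mean Squared Error: (1/n) * sum of (Y_target - Y_output)^2 over n samples."""
    if len(y_outputs) != len(y_targets):
        raise ValueError("y_outputs and y_targets must have the same length")
    n = len(y_targets)
    return sum((t - o) ** 2 for o, t in zip(y_outputs, y_targets)) / n


# Illustrative predictions and targets (assumed values).
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]
print(mse_loss(predictions, targets))  # (0.25 + 0.25 + 0.0) / 3
```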

Gradient Descent

Gradient descent is an iterative optimization algorithm used to minimize a cost function or loss function in the context of training machine learning models, including neural networks. The primary objective of gradient descent is to find the minimum of a function by iteratively adjusting the model's parameters (weights and biases) in the direction that reduces the function's value most rapidly.

The core idea behind gradient descent is derived from calculus and involves computing the slope of the loss/cost function with respect to each parameter. The positive gradient points in the direction of the steepest ascent, and the negative gradient points in the direction of the steepest descent.

Gradient Descent

By moving opposite to the gradient, the algorithm aims to reach the minimum of the cost function.

The update rule for each Model parameter (Mp) in gradient descent is given by:

  • Mp = Mp – lrn * dL/dMp

where:

  • Mp represents a model parameter (weight or bias),
  • lrn is the learning rate, determining the step size in the parameter space,
  • dL/dMp is the partial derivative of the loss/cost function with respect to the parameter.

The learning rate is a hyperparameter that influences the convergence and stability of the algorithm. If the learning rate is too large, the algorithm may overshoot the minimum, and if it is too small, convergence may be slow.

The algorithm iteratively updates the parameters until convergence, where the model reaches a state where further adjustments do not significantly reduce the cost function.
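To make the role of the learning rate concrete, here is a small sketch that applies the update rule to a toy loss, L(Mp) = (Mp - 3)^2, whose minimum is at Mp = 3. The toy loss, the function names, and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(grad_fn, mp, lrn, steps):
    """Apply Mp = Mp - lrn * dL/dMp for a fixed number of steps."""
    for _ in range(steps):
        mp = mp - lrn * grad_fn(mp)
    return mp


def toy_gradient(mp):
    """Derivative of the toy loss L(Mp) = (Mp - 3)^2."""
    return 2 * (mp - 3.0)


# Same starting point and step count, three different learning rates (assumed values).
well_tuned = gradient_descent(toy_gradient, mp=0.0, lrn=0.1, steps=100)
too_small = gradient_descent(toy_gradient, mp=0.0, lrn=0.001, steps=100)
too_large = gradient_descent(toy_gradient, mp=0.0, lrn=1.1, steps=100)

print(well_tuned)  # close to the minimum at 3
print(too_small)   # still far from 3 after 100 steps
print(too_large)   # overshoots further on every step and diverges
```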

To derive the dL/dMp needed to update each parameter (Mp), backpropagation, which applies the chain rule, is used.

Backpropagation involves computing the gradients of the loss function with respect to the weights and biases in the network. The chain rule is fundamental to backpropagation, and the steps include:

  1. Compute the gradient of the loss with respect to the output layer's pre-activation (Y_output).
  2. Propagate this gradient backward to compute the gradients for the hidden layer's pre-activation (Y_hidden) and the input layer's weights and biases.
  3. Update the weights and biases using the gradient descent algorithm.

The chain rule is applied at each step to calculate the gradients efficiently.

Let's see how backpropagation and the chain rule are applied to a simple network:

Sample NN

Let L be the Mean Squared Error (MSE) loss function, with:

  • Y_target: Target output.
  • Y_output: Weighted sum of inputs in the output layer.
  • W_out: Weight connecting the hidden layer to the output layer.
  • B_out: Bias in the output layer.
  • Y_hidden: Weighted sum of inputs in the hidden layer.
  • W_in: Weight connecting the input layer to the hidden layer.
  • B_in: Bias in the hidden layer.
  • x: Input to the neural network.

Using the chain rule, the derivative of the loss function with respect to each model parameter is calculated as follows.

Loss with respect to W_out and B_out:

  • dL/dW_out = dL/dY_output * dY_output/dW_out
  • dL/dB_out = dL/dY_output * dY_output/dB_out

Loss with respect to W_in and B_in:

  • dL/dW_in = dL/dY_output * dY_output/dY_hidden * dY_hidden/dW_in
  • dL/dB_in = dL/dY_output * dY_output/dY_hidden * dY_hidden/dB_in

Using the gradient descent algorithm, the weights and biases are adjusted as:

  • W_out = W_out – lrn * dL/dW_out
  • B_out = B_out – lrn * dL/dB_out
  • W_in = W_in – lrn * dL/dW_in
  • B_in = B_in – lrn * dL/dB_in
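The chain-rule products above can be sketched in code and checked against a numerical (finite-difference) gradient. This sketch assumes no activation function, as in the sample network, and a single-sample squared-error loss. The function names, the parameter dictionary, and the sample values are all assumptions for illustration.

```python
def forward(x, w_in, b_in, w_out, b_out):
    """Return (Y_hidden, Y_output) for the two-layer network with no activation."""
    y_hidden = (x * w_in) + b_in
    y_output = (y_hidden * w_out) + b_out
    return y_hidden, y_output


def loss(y_output, y_target):
    """Squared error for a single sample."""
    return (y_output - y_target) ** 2


def chain_rule_gradients(x, y_target, w_in, b_in, w_out, b_out):
    """Gradients of the loss with respect to each parameter via the chain rule."""
    y_hidden, y_output = forward(x, w_in, b_in, w_out, b_out)
    dL_dy_output = 2 * (y_output - y_target)
    return {
        "w_out": dL_dy_output * y_hidden,   # dY_output/dW_out = Y_hidden
        "b_out": dL_dy_output * 1.0,        # dY_output/dB_out = 1
        "w_in": dL_dy_output * w_out * x,   # dY_output/dY_hidden = W_out, dY_hidden/dW_in = x
        "b_in": dL_dy_output * w_out * 1.0, # dY_hidden/dB_in = 1
    }


def numerical_gradient(name, params, x, y_target, eps=1e-6):
    """Central-difference estimate of dL/d(params[name]), used only as a cross-check."""
    up, down = dict(params), dict(params)
    up[name] += eps
    down[name] -= eps
    loss_up = loss(forward(x, **up)[1], y_target)
    loss_down = loss(forward(x, **down)[1], y_target)
    return (loss_up - loss_down) / (2 * eps)


# Illustrative values (assumed, not from the article).
params = {"w_in": 0.8, "b_in": -0.2, "w_out": 1.2, "b_out": 0.1}
x, y_target = 1.5, 0.5

analytic = chain_rule_gradients(x, y_target, **params)
for name in params:
    print(name, analytic[name], numerical_gradient(name, params, x, y_target))
```

The analytic and numerical columns should agree to several decimal places, which is a common sanity check for hand-derived gradients.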

The calculations can be illustrated with the following values:

  • Y_target = 2.15
  • W_out = 0.5
  • W_in = 2
  • B_out = 2
  • B_in = 1.5
  • x (input) = 0.4
  • lrn = 0.5 (the learning rate hyperparameter)

First, we calculate Y_output,

Y_output = (((x * W_in) + B_in) * W_out) + B_out

         = (((0.4 * 2) + 1.5) * 0.5) + 2

         = 3.15

The expected output is Y_target = 2.15.

The loss L, for this single sample (n = 1), is:

  • L = (3.15 – 2.15)^2 = 1

To adjust B_out, we use the back propagation and gradient descent techniques,

  • B_out = B_out – lrn * dL/dB_out

dL/dB_out = dL/dY_output * dY_output/dB_out

dL/dY_output = 2 * (3.15 – 2.15 ) = 2

dY_output/dB_out = 1

so, dL/dB_out = 2

And the new B_out will be B_out = 2 – (0.5 * 2) = 1.
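The arithmetic above can be reproduced step by step in Python. The values are the ones listed in the article; the variable names are assumptions, and the final re-check of the loss with only B_out updated is an extra step added here for illustration. Results are rounded when printed because floating-point arithmetic may differ from the hand calculation in the last digits.

```python
# Values from the article's worked example.
x, y_target = 0.4, 2.15
w_in, b_in, w_out, b_out = 2.0, 1.5, 0.5, 2.0
lrn = 0.5

# Forward pass.
y_hidden = (x * w_in) + b_in
y_output = (y_hidden * w_out) + b_out
loss = (y_output - y_target) ** 2

# Chain rule for B_out: dL/dB_out = dL/dY_output * dY_output/dB_out.
dL_dy_output = 2 * (y_output - y_target)
dy_output_db_out = 1.0
dL_db_out = dL_dy_output * dy_output_db_out

# Gradient descent update for B_out only.
new_b_out = b_out - lrn * dL_db_out

# Extra check (not in the article): forward pass again with only B_out updated.
new_y_output = (y_hidden * w_out) + new_b_out
new_loss = (new_y_output - y_target) ** 2

print(round(y_output, 4), round(loss, 4))          # 3.15 1.0
print(round(dL_db_out, 4), round(new_b_out, 4))    # 2.0 1.0
print(round(new_y_output, 4), round(new_loss, 4))  # 2.15 0.0
```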

In the subsequent training iteration, the updated value of B_out shall be used. Likewise, adjustments to weights and biases will be computed, and these refined values will be employed in subsequent training iterations.

Training a neural network involves the iterative process of adjusting its parameters, such as weights and biases, to minimize a defined loss function.

The initial weights and biases are set randomly, and during each training iteration, input data is fed forward through the network to make predictions. The predictions are then compared to the actual target values, and the difference is quantified by the loss function. The goal of training is to minimize this loss by updating the parameters with an optimization algorithm such as the gradient descent described above. The gradient of the loss function with respect to each parameter is computed during backpropagation, and the parameters are adjusted in the opposite direction of the gradient. This process continues until the loss stops decreasing meaningfully, indicating that the network has fit the patterns in the training data. The cycle works as follows.

Throughout the training phase, as the data is input into the system, the loss (MSE, for example) is computed by assessing the resultant output. The computation follows the formula:

  • L(Y_output, Y_target) = (1/n) * Σ (Y_target – Y_output)^2

If the loss is deemed unacceptable, the chain rule is employed to compute dL/dMp. Subsequently, through the backpropagation and gradient descent algorithm, the parameters undergo adjustment:

  • Mp = Mp – lrn * dL/dMp.

The newly computed Mp is then utilized to analyze the output by re-issuing the inputs, initiating a repetitive cycle of adjustments.
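The repeat-until-acceptable cycle described above might be sketched as follows for the two-layer network with a single training sample and no activation function. The starting parameters and the sample come from the article's worked example; the smaller learning rate of 0.05, the loss threshold, the iteration cap, and all function names are assumptions for this sketch.

```python
def forward(x, params):
    """Return (Y_hidden, Y_output) for the two-layer network with no activation."""
    y_hidden = (x * params["w_in"]) + params["b_in"]
    y_output = (y_hidden * params["w_out"]) + params["b_out"]
    return y_hidden, y_output


def train(x, y_target, params, lrn, acceptable_loss, max_iterations):
    """Repeat forward pass, loss check, backpropagation, and update until the loss is acceptable."""
    params = dict(params)  # work on a copy so the caller's values are untouched
    for iteration in range(max_iterations):
        y_hidden, y_output = forward(x, params)
        loss = (y_output - y_target) ** 2
        if loss <= acceptable_loss:
            return params, loss, iteration

        # Backpropagation: chain rule from the output back to each parameter.
        dL_dy_output = 2 * (y_output - y_target)
        grads = {
            "w_out": dL_dy_output * y_hidden,
            "b_out": dL_dy_output,
            "w_in": dL_dy_output * params["w_out"] * x,
            "b_in": dL_dy_output * params["w_out"],
        }

        # Gradient descent: Mp = Mp - lrn * dL/dMp for every parameter.
        for name in params:
            params[name] = params[name] - lrn * grads[name]

    _, y_output = forward(x, params)
    return params, (y_output - y_target) ** 2, max_iterations


start = {"w_in": 2.0, "b_in": 1.5, "w_out": 0.5, "b_out": 2.0}
trained, final_loss, iterations = train(
    x=0.4, y_target=2.15, params=start, lrn=0.05,
    acceptable_loss=1e-6, max_iterations=1000,
)
print(iterations, final_loss)
print(trained)
```

With one sample, many parameter combinations reach the target, so the trained values are just one of the possible solutions.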

In our previous discussion in the article https://www.dhirubhai.net/posts/prathap-thammanna-847043a_machinelearning-linearregression-gpus-activity-7124421078516420608-UUxf?utm_source=share&utm_medium=member_desktop, we methodically derived the weights and biases of a linear equation using algorithms. Building upon that knowledge, let's now leverage the insights gained about neural networks. In this context, we will apply these concepts and the architectural principles to construct a linear regression model. Our focus will be on delving into the code, showcasing how the fundamental concepts of Loss, Gradient Descent, and Backpropagation can be employed to determine the same weights and biases. This demonstration will highlight the model's ability to make predictions for linear data along with the code to train the model.

We model this simple network in code:

Example Linear NN
import numpy as np
import matplotlib.pyplot as plt

class LinearRegressionModel:
    def __init__(self):
        self.weights = None
        self.bias = None

    def train(self, X_train, y_train, learning_rate=0.01, epochs=100):
        # Initialize weights and bias
        np.random.seed(1)
        self.weights = np.random.randn(1)
        self.bias = np.random.randn(1)

        for epoch in range(epochs):
            total_loss = self._train_one_epoch(X_train, y_train, learning_rate)

            # Print the total loss for this epoch
            if epoch % 10 == 0:
                print(f'Epoch {epoch}, Total Loss: {total_loss}')

    def _train_one_epoch(self, X_train, y_train, learning_rate):
        total_loss = 0

        for i in range(len(X_train)):
            # Forward pass for a single data point
            prediction = self.predict(X_train[i])

            # Compute the squared error for this data point
            loss = (prediction - y_train[i]) ** 2
            total_loss += loss

            # Backward pass (gradient descent) for a single data point
            grad_weights, grad_bias = self._backpropagate(X_train[i], y_train[i], prediction)

            # Update weights and bias for a single data point
            self.weights -= learning_rate * grad_weights
            self.bias -= learning_rate * grad_bias

        return total_loss

    def _backpropagate(self, x, y_true, y_pred):
        # Chain rule for gradients
        grad_loss = 2 * (y_pred - y_true)
        grad_weights = grad_loss * x
        grad_bias = grad_loss

        return grad_weights, grad_bias

    def predict(self, x):
        return x * self.weights + self.bias

# Generate some random data for training
np.random.seed(0)
X_train = np.random.rand(100, 1)
y_train = 2 * X_train + 1 + 0.1 * np.random.randn(100, 1)

# Create and train the linear regression model
linear_model = LinearRegressionModel()
linear_model.train(X_train, y_train)

# Test the model on new data
X_test = np.array([[0.2], [0.5], [0.8]])
predictions = linear_model.predict(X_test)
print(predictions)
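One way to sanity-check a model like this is to compare the parameters found by gradient descent with the closed-form least-squares line for the same data. The sketch below does that in plain Python rather than NumPy, so it regenerates its own synthetic data with the standard `random` module; the data therefore differs from the arrays above. The function names, seeds, and starting values are assumptions for illustration.

```python
import random


def fit_sgd(xs, ys, learning_rate=0.01, epochs=100, seed=1):
    """Per-sample gradient descent on squared error for y = weight * x + bias."""
    rng = random.Random(seed)
    weight, bias = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    for _ in range(epochs):
        for x, y_true in zip(xs, ys):
            grad_loss = 2 * ((x * weight + bias) - y_true)
            weight -= learning_rate * grad_loss * x
            bias -= learning_rate * grad_loss
    return weight, bias


def fit_closed_form(xs, ys):
    """Ordinary least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Synthetic data in the spirit of the article: y = 2x + 1 plus small Gaussian noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [2 * x + 1 + 0.1 * rng.gauss(0.0, 1.0) for x in xs]

sgd_weight, sgd_bias = fit_sgd(xs, ys)
ols_weight, ols_bias = fit_closed_form(xs, ys)
print(f"gradient descent: weight={sgd_weight:.3f}, bias={sgd_bias:.3f}")
print(f"closed form:      weight={ols_weight:.3f}, bias={ols_bias:.3f}")
```

Both approaches should land near a weight of 2 and a bias of 1, with small differences caused by the noise and by the per-sample updates.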

The graph of the dataset and the linear regression line predicted by the neural network is shown below.

In conclusion, our journey through the intricacies of neural networks has unveiled the fundamental mechanisms that drive their learning and predictive capabilities. From the nuanced architecture inspired by the human brain to the indispensable concepts of loss, gradient descent, and backpropagation, we have navigated the landscape of artificial intelligence with a focus on neural networks. As we stand at the nexus of data science and computational innovation, understanding these principles becomes not just beneficial but essential for anyone venturing into the realms of machine learning and neural network applications. Armed with this knowledge, we can harness the power of neural networks to tackle complex problems, make accurate predictions, and drive advancements that redefine the boundaries of technological innovation. As we move forward, the synergy between human intelligence and artificial neural networks promises a future where the uncharted territories of knowledge and discovery are within our computational grasp.

In the next article, my focus will delve into the realm of TensorFlow, exploring its applications and the utilization of Graphics Processing Units (GPUs) to enhance the computational efficiency of neural networks. We will unravel the synergy between TensorFlow's powerful capabilities, and the accelerated processing potential offered by GPUs, shedding light on how this combination contributes to the optimization of neural network computations.

Kajal Singh

HR Operations | Implementation of HRIS systems & Employee Onboarding | HR Policies | Exit Interviews

7 months ago

Well-crafted post. Besides Support Vector Machines, between 1980 and 2010, researchers worked on expanding MultiLayer Perceptrons (MLPs) which were invented by Ivakhnenko and Lapa in 1965 and began to be called Deep Learning Networks (DLNs) in 1986. As mentioned in a previous blog, a one layer Perceptron network consists of an input layer connected to a hidden layer, which is connected to an output layer of Perceptrons (or vertices). The Perceptron multiplies incoming signals by their weights and adds them together. If the sum of the weighted signals exceeds a specified value, the Perceptron "fires". Activation functions, such as Tanh, ReLU, and Sigmoid, are used to determine if a Perceptron fires. Artificial Neural Networks (ANNs) are simply Perceptrons or other similar neurons that may have different activation functions. DLNs have more than one hidden layer and are complex due to the non-linear nature of activation functions, making them unexplainable "black boxes". Researchers like Hinton, LeCun and Schmidhuber popularized variants of DLNs, e.g., Fully Connected Networks, Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks, Long Short Term Memory, and Deep Belief Networks.
