Machine Learning Regularization


When training a machine learning model, there are a lot of uncertainties. For instance, the model could be trained on a dataset that doesn’t let it make strong predictions on other sets, or the model may become so accustomed to those specific examples that it cannot handle new ones.

These kinds of issues are errors that can be analyzed and, even better, controlled through regularization. In this post I will explain how to address them. Hope you enjoy it.


The topics treated are the following:


  • L1 regularization
  • L2 regularization
  • Dropout
  • Data Augmentation
  • Early Stopping


L1 regularization

L1 regularization is a technique that reduces the complexity of a model to prevent overfitting. Overfitting is an issue in which the model becomes so adjusted to the training data that it struggles with new data and makes inaccurate predictions.

In this method of regularization, a penalty term proportional to the absolute value of the model’s weights is added to the cost function. This penalty term is multiplied by a hyperparameter lambda, which controls the strength of the regularization. The resulting cost function is then minimized during training to find the optimal values of the model’s weights.


The formula can be written like this:

J(w) = L(y, f(x; w)) + λ ||w||₁

where:

  • J(w) is the regularized cost function.
  • L(y, f(x; w)) is the original cost function, such as mean squared error.
  • w is the vector of model weights.
  • λ is the regularization strength hyperparameter.
  • ||w||₁ is the L1 norm of the weight vector, which is the sum of the absolute values of the weights.
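
To make the formula concrete, here is a minimal NumPy sketch of how the L1 penalty could be added to a mean squared error cost (the function and variable names are illustrative placeholders):

import numpy as np

def l1_regularized_cost(y, y_pred, w, lam):
    """Mean squared error plus an L1 penalty on the weights."""
    mse = np.mean((y - y_pred) ** 2)        # original cost L(y, f(x; w))
    l1_penalty = lam * np.sum(np.abs(w))    # λ * ||w||₁ term
    return mse + l1_penalty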


L1 is a nice regularization technique; some of its pros are:

  • Encourages sparsity: tends to drive some of the weights to zero, resulting in a simpler, more efficient model.
  • Feature selection: by driving some weights to zero, this technique can be used for feature selection, so the model identifies the most important features for prediction (a small demonstration follows this list).
  • Can handle high-dimensional data: L1 is particularly useful for high-dimensional data, where the number of features is much larger than the number of samples.
  • Improves generalization performance: this method helps prevent overfitting by reducing the complexity while promoting sparsity, which improves the generalization performance of the model on new sets.
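
As a quick, hedged demonstration of the sparsity and feature selection points above, the following sketch fits scikit-learn's Lasso (an L1-regularized linear model), assuming scikit-learn is available and using random data purely for illustration:

import numpy as np
from sklearn.linear_model import Lasso

# Random data where only the first two features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)    # alpha plays the role of λ
model.fit(X, y)
print(model.coef_)          # most coefficients are driven to (or near) zero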


On the other hand, L1 also has some cons that must be taken into consideration:

  • Instability: L1 can be unstable when the number of features is much larger than the number of examples. In such cases, small changes in the data may result in large changes in the chosen features.
  • Requires tuning of hyperparameter: the strength of this method of regularization is controlled by a hyperparameter that needs to be tuned to achieve the best performance on the validation data.
  • Not suitable for correlated features: tends to select only one feature from a group of highly correlated features, which can lead to loss of information and poorer performance. In contrast, L2 regularization tends to spread weight across all correlated features.


L2 regularization

L2 adds a penalty term proportional to the squared magnitude of the model weights to the cost function. It encourages smooth solutions with smaller and more evenly distributed weights.


The formula can be written like this:

J(w) = L(y, f(x; w)) + λ ||w||₂²

where:

  • J(w) is the regularized cost function.
  • L(y, f(x; w)) is the original cost function, such as mean squared error.
  • w is the vector of model weights.
  • λ is the regularization strength hyperparameter.
  • ||w||₂² is the squared L2 norm of the weight vector, which is the sum of the squared values of the weights.
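
In the same spirit, here is a minimal NumPy sketch of an L2-regularized cost (again, the function and variable names are illustrative placeholders):

import numpy as np

def l2_regularized_cost(y, y_pred, w, lam):
    """Mean squared error plus an L2 penalty on the weights."""
    mse = np.mean((y - y_pred) ** 2)          # original cost L(y, f(x; w))
    l2_penalty = lam * np.sum(np.square(w))   # λ * ||w||₂² term
    return mse + l2_penalty

# In gradient descent, the L2 term simply adds 2 * λ * w to the gradient,
# which shrinks ("decays") the weights a little on every update:
# w = w - alpha * (grad_mse + 2 * lam * w)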


Some of the pros:

  • Encourages smoothness: L2 tends to encourage smoothness in the weight vector, which can make the model less sensitive to small changes in the input data.
  • Works well with correlated features: it is less likely to discard correlated features as it distributes the weight evenly among them.
  • Can improve generalization performance: helps prevent overfitting by reducing the complexity of the model and promoting smoothness in the weight vector, which can improve the generalization performance of the model on unseen data.


Some important cons to take into consideration:

  • May not produce sparse models: L2 does not encourage sparsity in the weight vector and can result in a model with many non-zero weights.
  • May not handle high-dimensional data well: where the number of features is much larger than the number of samples.
  • Requires tuning of hyperparameter: the strength of the regularization in L2 is controlled by a hyperparameter that needs to be tuned to achieve the best performance on the validation data.


Overall, L1 and L2 regularization are two distinct techniques; one is not an evolution of the other. They both prevent overfitting and improve the generalization performance of our models, but they have different properties and can be more suitable for different types of problems.


Dropout

This regularization technique is commonly used in deep neural networks to prevent overfitting while improving the generalization performance of models. Dropout works by randomly dropping out some of the neurons in the network during training, which forces the remaining neurons to take on more independent and robust features.


Here’s an example of forward propagation and gradient descent in NumPy:


import numpy as np


def dropout_forward_prop(X, weights, L, keep_prob):
    """
    Conducts forward propagation using Dropout

    X -- numpy.ndarray of shape (nx, m) containing the input data for
    the network
        nx -- number of input features
        m -- number of data points
    weights -- dictionary of the weights and biases of the neural network
    L -- number of layers in the network
    keep_prob -- probability that a node will be kept

    All layers except the last should use the tanh activation function
    The last layer should use the softmax activation function
    Returns: a dictionary containing the outputs of each layer and the
    dropout mask used on each layer
    """
    cache = {}
    cache['A0'] = X

    for l in range(1, L + 1):
        W = weights['W' + str(l)]
        b = weights['b' + str(l)]
        A_prev = cache['A' + str(l - 1)]

        # Compute the linear transformation of the previous layer
        Z = np.dot(W, A_prev) + b

        # Apply the activation function, except for the last layer
        if l < L:
            A = np.tanh(Z)  # tanh activation

            # Generate a binary mask to drop out some nodes
            D = np.random.rand(A.shape[0], A.shape[1])
            D = (D < keep_prob).astype(int)

            # Apply the mask to the output of the current layer
            A = np.multiply(A, D)

            # Normalize the output of the current layer
            A /= keep_prob

            # Store the mask for backpropagation
            cache['D' + str(l)] = D
        else:
            A = np.exp(Z) / np.sum(np.exp(Z), axis=0, keepdims=True)  # softmax

        cache['A' + str(l)] = A

    return cache



def dropout_gradient_descent(Y, weights, cache, alpha, keep_prob, L):
    """
    Updates the weights of a neural network with Dropout regularization
    using gradient descent

    Y -- one-hot numpy.ndarray of shape (classes, m) that contains the
    correct labels for the data
        classes -- number of classes
        m -- number of data points
    weights -- dictionary of the weights and biases of the neural network
    cache -- dictionary of the outputs and dropout masks of each layer of
    the neural network
    alpha -- learning rate
    keep_prob -- probability that a node will be kept
    L -- number of layers of the network

    All layers use the tanh activation function except the last, which uses
    the softmax activation function
    The weights of the network should be updated in place
    """
    # Compute the gradients for the last layer (softmax output)
    dZ = cache["A" + str(L)] - Y
    dW = np.dot(dZ, cache["A" + str(L - 1)].T) / Y.shape[1]
    db = np.sum(dZ, axis=1, keepdims=True) / Y.shape[1]
    dA_prev = np.dot(weights["W" + str(L)].T, dZ)

    # Update the last layer's weights
    weights["W" + str(L)] -= alpha * dW
    weights["b" + str(L)] -= alpha * db

    # Loop over the remaining layers, backpropagating and updating the weights
    for l in range(L - 1, 0, -1):
        # Apply the stored dropout mask and rescale, then the tanh derivative
        dA = dA_prev * (cache["D" + str(l)] / keep_prob)
        dZ = dA * (1 - np.power(cache["A" + str(l)], 2))
        dW = np.dot(dZ, cache["A" + str(l - 1)].T) / Y.shape[1]
        db = np.sum(dZ, axis=1, keepdims=True) / Y.shape[1]
        dA_prev = np.dot(weights["W" + str(l)].T, dZ)

        # Update the weights of this layer
        weights["W" + str(l)] -= alpha * dW
        weights["b" + str(l)] -= alpha * db


Dropout randomly selects a set of neurons to be dropped out with a specified probability, as in the example above. The output of the remaining neurons is scaled by the inverse of the keep probability, so that the expected sum of the outputs remains the same.
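
To make that scaling step concrete, here is a tiny standalone sketch of inverted dropout applied to one layer's activations, using arbitrary illustrative values:

import numpy as np

keep_prob = 0.8
A = np.random.rand(5, 4)                                  # activations of one layer

D = (np.random.rand(*A.shape) < keep_prob).astype(int)    # binary keep/drop mask
A_dropped = (A * D) / keep_prob                           # drop nodes and rescale

# On average, A_dropped has the same expected value as A,
# so no extra scaling is needed at test time.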

Some benefits of implementing dropout regularization:

  • Can be applied to various types of neural networks.
  • Can reduce training time: since it performs model averaging during training, it reduces the variance of the gradients and speeds up the convergence of the network.


Some drawbacks:

  • Can reduce accuracy on small datasets: dropout can introduce more noise and reduce the effective number of training examples, which hurts accuracy when data is scarce.
  • Requires careful tuning of the dropout probability: this hyperparameter needs to be tuned carefully, otherwise performance suffers. A dropout probability that is too high can underfit the model, while one that is too low may not prevent overfitting.


Data augmentation

Data augmentation is used to increase the size and diversity of the training dataset by creating new training examples from the existing ones. Data augmentation techniques typically apply random transformations to the existing samples, such as rotations, translations, scaling, flipping, cropping, and adding noise or distortions, to create new samples that are still representative of the original data but with some variation.

For instance, let’s say that I am training a model to scan an image and recognize which number is drawn in it. With data augmentation, I can apply all sorts of transformations, such as rotating the images, so the model is far better prepared to face other datasets.
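
As a rough sketch of such transformations, the following NumPy snippet flips, rotates and adds noise to a single image stored as an array (the image here is random and purely illustrative):

import numpy as np

image = np.random.rand(28, 28)                 # stand-in for a digit image

flipped = np.fliplr(image)                     # horizontal flip
rotated = np.rot90(image)                      # 90-degree rotation
noisy = image + np.random.normal(scale=0.05, size=image.shape)  # add noise

augmented = [flipped, rotated, noisy]          # new training examples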


Some of the most important benefits of data augmentation:

  • Increased dataset size.
  • Improved generalization: it can create new examples that capture more variations and patterns in the data.
  • Reduces overfitting: data augmentation can help reduce overfitting by providing more diverse training examples.


Now some drawbacks:

  • Increased computational cost: generating and preprocessing the augmented data requires additional computational resources, which may lead to longer training times and higher hardware requirements.
  • Limited effectiveness on certain problems: with data that is highly structured or has low variability, augmentation may add little value for some datasets or problems. In such scenarios, other techniques may be more effective.


To sum up, while it may have some potential drawbacks, data augmentation remains a powerful technique for improving the performance and robustness of our machine learning models, especially in computer vision and other areas where large and diverse training datasets are important.


Early stopping

This technique prevents overfitting and improves the generalization performance of a machine learning model. The idea is to stop the training process before the model starts to overfit the training data, by monitoring the validation loss during training.

[Image: training and validation loss curves showing the point where training should stop]

As the image illustrates, the idea with early stopping is that training ends just after the model has learned the training data and before overfitting occurs. Knowing the right moment to stop depends on the specific problem and the characteristics of the data, so early stopping should be used in conjunction with other techniques.


Here’s an implementation using Python:


def early_stopping(cost, opt_cost, threshold, patience, count):
    """
    Determines if you should stop gradient descent early

    Early stopping should occur when the validation cost of the network
    has not decreased relative to the optimal validation cost by more than
    the threshold over a specific patience count

    cost -- current validation cost of the neural network
    opt_cost -- lowest recorded validation cost of the neural network
    threshold -- threshold used for early stopping
    patience -- patience count used for early stopping
    count -- count of how long the threshold has not been met

    Returns: a boolean of whether the network should be stopped early,
    followed by the updated count
    """
    # Reset the count if the validation cost improved by more than the threshold
    if opt_cost - cost > threshold:
        count = 0
    else:
        count += 1

    # Stop only once the count reaches the patience limit
    if count != patience:
        return False, count
    return True, count
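
For context, here is a minimal sketch of how this function might sit inside a training loop; train_step and validation_cost are hypothetical placeholders, not functions defined above:

# Hypothetical training loop using early_stopping (placeholders for illustration)
opt_cost = float('inf')
count = 0

for epoch in range(1000):
    train_step()                    # hypothetical: one pass over the training data
    cost = validation_cost()        # hypothetical: current validation cost

    stop, count = early_stopping(cost, opt_cost, threshold=0.001,
                                 patience=10, count=count)
    opt_cost = min(opt_cost, cost)
    if stop:
        break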


Most notable pros:

  • Faster convergence: it can help the model converge faster by preventing unnecessary training on the training data.
  • Simpler models: it prevents the model from becoming too complex or specialized to the training data.


Some cons:

  • Stopping too early: it may stop the training process too early before the model has converged to a good solution. This may lead to underfitting and poor performance on the test set.
  • Increased complexity: early stopping adds additional operations to the training process and requires additional hyperparameters to be tuned. It may make it more difficult to optimize the performance of the model.



All in all, regularization techniques are methods that help machine learning models respond better to unseen data. They become really helpful when developing software that relies on a model making predictions on data it has never seen.

Hope you learned something new from this article. Thanks for reading.
