I created a cheatsheet that gives a practical explanation and comparison of the most popular loss functions for training models in PyTorch across various domains.
It pays most attention to comparing classification and regression losses, with notes on when you need each specific loss.
A loss function is a measurable way to gauge the performance and accuracy of a machine learning model, and it acts as a guide for the learning process within a model or machine learning algorithm. You can find more details about what a loss function is and how it works in [6].
Loss functions cheatsheet
Summary and comparison of each loss function
LogLoss, Cross Entropy Loss
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!
Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.
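To make the numbers concrete, here is a minimal sketch in plain Python of the per-example log loss, -[y*log(p) + (1-y)*log(1-p)]. The helper name `binary_log_loss` is mine, chosen for illustration; it is not a PyTorch function.

```python
import math

def binary_log_loss(y_true: int, p: float) -> float:
    """Log loss for a single example with label y_true in {0, 1}
    and predicted probability p of the positive class."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Confident and wrong: label is 1, predicted probability is 0.012
confident_wrong = binary_log_loss(1, 0.012)

# Confident and right: label is 1, predicted probability is 0.99
confident_right = binary_log_loss(1, 0.99)

print(f"p=0.012 -> loss {confident_wrong:.4f}")
print(f"p=0.99  -> loss {confident_right:.4f}")
```

The first loss is several hundred times larger than the second, which is the "confident and wrong" penalty described above.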
In PyTorch, nn.CrossEntropyLoss is comparable to nn.NLLLoss: the former applies nn.LogSoftmax internally, while with the latter you have to add it yourself. In the same way, nn.BCEWithLogitsLoss (aka BCE with sigmoid) applies the sigmoid internally, while nn.BCELoss expects probabilities.
Within each pair, both approaches give the same result.
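A small sketch of the multi-class equivalence, assuming a toy batch of random logits (the shapes and targets are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores: batch of 4, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

# Approach 1: CrossEntropyLoss takes raw logits and applies log-softmax itself
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# Approach 2: apply LogSoftmax explicitly, then NLLLoss on the log-probabilities
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, targets)

print(loss_ce.item(), loss_nll.item())
```

The two printed values should agree up to floating-point error.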
For binary classification you could use nn.CrossEntropyLoss() with a logit output of shape [batch_size, 2], or nn.BCELoss() with nn.Sigmoid() as the last layer (one output per example).
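The binary options can be compared in the same way. This is a sketch under one assumption worth stating: to line up the two-logit formulation with the single-logit one, the logit of class 0 is fixed at zero, so that softmax over [0, z] gives sigmoid(z) for class 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z = torch.randn(5)                          # one logit per example
y = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])  # binary labels as floats

# Option A: sigmoid in the model, BCELoss on probabilities
loss_a = nn.BCELoss()(torch.sigmoid(z), y)

# Option B: BCEWithLogitsLoss on raw logits (numerically more stable)
loss_b = nn.BCEWithLogitsLoss()(z, y)

# Option C: two logits per example with CrossEntropyLoss;
# class-0 logit pinned to 0 so that P(class 1) = sigmoid(z)
two_logits = torch.stack([torch.zeros_like(z), z], dim=1)  # shape [5, 2]
loss_c = nn.CrossEntropyLoss()(two_logits, y.long())

print(loss_a.item(), loss_b.item(), loss_c.item())
```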
MAE (L1), MSE (L2), RMSE
In short: L2 is more stable (balancing bias and variance), while L1 is more robust (it copes better with outliers).
RMSE has the same unit as the target variable, like MAE and unlike MSE, whose unit is squared. Like MSE, it penalizes large errors more than small ones, but the square root brings the value back to the scale of the target. In practice MSE is widely used.
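A quick sketch of the three regression losses on made-up data, where the last prediction misses badly to mimic an outlier:

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_pred = torch.tensor([1.5, 2.0, 2.0, 8.0])  # last prediction is far off

mae = nn.L1Loss()(y_pred, y_true)    # mean absolute error (L1)
mse = nn.MSELoss()(y_pred, y_true)   # mean squared error (L2)
rmse = torch.sqrt(mse)               # back in the units of the target

print(f"MAE={mae.item():.4f}  MSE={mse.item():.4f}  RMSE={rmse.item():.4f}")
```

The single large error contributes 4 of the 5.5 total absolute error, but 16 of the 17.25 total squared error, which is why MSE and RMSE react more strongly to outliers than MAE.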
Huber Loss
It sits between MSE and MAE and is controlled by a threshold parameter (often called delta). While the absolute error is within the threshold the loss is quadratic (like MSE); once the error exceeds the threshold the loss becomes linear (like MAE).
It’s less sensitive to outliers than MSE because it squares the error only inside that interval. Further information can be found at Huber Loss in Wikipedia.
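The piecewise definition can be sketched by hand and checked against PyTorch's built-in nn.HuberLoss (available in recent PyTorch versions). The function name `huber` and the toy data are mine, for illustration only.

```python
import torch
import torch.nn as nn

def huber(pred, target, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = pred - target
    abs_err = err.abs()
    quadratic = 0.5 * err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return torch.where(abs_err <= delta, quadratic, linear).mean()

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_pred = torch.tensor([1.5, 2.0, 2.0, 8.0])  # errors: 0.5, 0, -1, 4

manual = huber(y_pred, y_true, delta=1.0)
builtin = nn.HuberLoss(delta=1.0)(y_pred, y_true)

print(manual.item(), builtin.item())
```

Only the last error (4) falls in the linear branch, so it contributes 3.5 instead of the 8.0 it would contribute under a halved squared error.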
Focal Loss
Focal Loss was introduced by Lin et al. of Facebook AI Research in 2017 as a means of combating extremely imbalanced datasets where positive cases are relatively rare. Their paper "Focal Loss for Dense Object Detection" is retrievable here. In practice, the researchers used an alpha-modified version of the function, so I have included it in this implementation.
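For reference, a minimal sketch of the alpha-balanced binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). This is my own illustrative version, not necessarily the implementation referred to above; the name `binary_focal_loss` and the defaults alpha=0.25, gamma=2.0 (the values reported as working best in the paper) are assumptions.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for binary targets given as floats in {0, 1}."""
    # Per-element BCE equals -log(p_t)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t: probability the model assigns to the true class
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t: alpha for positives, (1 - alpha) for negatives
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

torch.manual_seed(0)
logits = torch.randn(8)
targets = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

focal = binary_focal_loss(logits, targets)
plain_bce = F.binary_cross_entropy_with_logits(logits, targets)
print(focal.item(), plain_bce.item())
```

With gamma=0 and alpha=0.5 the function reduces to half of the ordinary binary cross-entropy, which is a handy sanity check.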
Tversky Loss
This loss was introduced in "Tversky loss function for image segmentation using 3D fully convolutional deep networks", retrievable here. It was designed to optimize segmentation on imbalanced medical datasets by utilizing constants that can adjust how harshly different types of error are penalized in the loss function.
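A sketch of the idea, assuming the common formulation TI = TP / (TP + alpha*FP + beta*FN) with soft (probabilistic) counts and a small smoothing constant. The name `tversky_loss` and the `smooth` term are my additions for illustration.

```python
import torch

def tversky_loss(probs, targets, alpha=0.5, beta=0.5, smooth=1.0):
    """1 - Tversky index; alpha weights false positives, beta false negatives."""
    probs = probs.reshape(-1)
    targets = targets.reshape(-1)
    tp = (probs * targets).sum()        # soft true positives
    fp = (probs * (1 - targets)).sum()  # soft false positives
    fn = ((1 - probs) * targets).sum()  # soft false negatives
    tversky = (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
    return 1 - tversky

targets = torch.tensor([1.0, 1.0, 0.0, 0.0])
missed = torch.zeros(4)  # predicts background everywhere: 2 false negatives

# Raising beta makes the same false negatives cost more
loss_low_beta = tversky_loss(missed, targets, alpha=0.7, beta=0.3)
loss_high_beta = tversky_loss(missed, targets, alpha=0.3, beta=0.7)
print(loss_low_beta.item(), loss_high_beta.item())
```

With alpha = beta = 0.5 the index coincides with the Dice coefficient, so the two constants can be read as a way of tilting Dice toward recall or precision.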
CTC Loss
CTC (Connectionist Temporal Classification) is an algorithm employed for training deep neural networks in tasks like speech recognition and handwriting recognition, as well as other sequential problems where there is no explicit information about alignment between the input and output. CTC provides a way around not knowing how the inputs map to the outputs.
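In PyTorch this is available as nn.CTCLoss. The sketch below uses random tensors in place of a real model's output; the sizes T, N, C and S are arbitrary values picked for illustration, and index 0 is assumed to be the blank symbol.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, N, C = 50, 2, 20  # input time steps, batch size, classes (0 = blank)
S = 10               # target length per sample (fixed here for simplicity)

# Model output: log-probabilities of shape (T, N, C)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Target label sequences of shape (N, S); labels never use the blank index 0
targets = torch.randint(1, C, (N, S), dtype=torch.long)

# Lengths tell CTC how much of each input/target is real (no alignment given)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Note that only lengths are passed in, never a frame-by-frame alignment: the loss sums over all alignments that collapse to the target sequence.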
Dice-Sørensen Loss
The Dice coefficient, or Dice-Sørensen coefficient, is a common metric for pixel segmentation that can also be modified to act as a loss function.
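One common way to turn the coefficient into a loss is 1 - Dice computed on predicted probabilities. A minimal sketch, where the name `dice_loss` and the `smooth` constant (to avoid division by zero on empty masks) are my assumptions:

```python
import torch

def dice_loss(probs, targets, smooth=1.0):
    """1 - Dice coefficient on flattened probability and target masks."""
    probs = probs.reshape(-1)
    targets = targets.reshape(-1)
    intersection = (probs * targets).sum()
    dice = (2 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)
    return 1 - dice

targets = torch.tensor([1.0, 1.0, 0.0, 0.0])
perfect = targets.clone()
disjoint = torch.tensor([0.0, 0.0, 1.0, 1.0])

print(dice_loss(perfect, targets).item())   # no error
print(dice_loss(disjoint, targets).item())  # no overlap at all
```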
Jaccard (IoU) Loss
Intersection over Union (IoU) is used for object detection and segmentation. It measures how well a predicted object aligns with the actual object annotation.
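For segmentation masks, a soft version of IoU can be used as a loss in the form 1 - IoU. This is a sketch; the name `jaccard_loss` and the small `smooth` constant are mine, for illustration.

```python
import torch

def jaccard_loss(probs, targets, smooth=1e-6):
    """1 - soft IoU between a predicted probability mask and a binary mask."""
    probs = probs.reshape(-1)
    targets = targets.reshape(-1)
    intersection = (probs * targets).sum()
    union = probs.sum() + targets.sum() - intersection
    iou = (intersection + smooth) / (union + smooth)
    return 1 - iou

targets = torch.tensor([1.0, 1.0, 0.0, 0.0])
half_hit = torch.tensor([1.0, 0.0, 0.0, 0.0])  # covers one of two object pixels

loss = jaccard_loss(half_hit, targets)
print(loss.item())
```

Here the intersection is 1 pixel and the union is 2 pixels, so IoU is about 0.5 and the loss is about 0.5.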
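Hinge Loss
This is used for “maximum-margin” classification, most notably in support vector machines. It penalizes predictions that are on the wrong side of the decision boundary or that are correct but fall inside the margin.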
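The maximum-margin objective is usually written as the hinge loss, max(0, 1 - y * f(x)), with labels y in {-1, +1}. A hand-written sketch (the function name `hinge_loss` and the toy scores are mine; for the multi-class case PyTorch also ships nn.MultiMarginLoss):

```python
import torch

def hinge_loss(scores, targets):
    """Binary hinge loss; targets must be -1 or +1, scores are raw outputs."""
    return torch.clamp(1 - targets * scores, min=0).mean()

scores = torch.tensor([2.0, 0.5, -0.3, -1.5])
targets = torch.tensor([1.0, 1.0, 1.0, -1.0])

# Per-example terms: 0 (outside margin), 0.5 (inside margin),
# 1.3 (wrong side), 0 (outside margin)
loss = hinge_loss(scores, targets)
print(loss.item())
```

Correct predictions with a margin of at least 1 contribute nothing, which is what pushes the model toward a wide margin.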
Kullback-Leibler Loss
This is used in Variational Autoencoders (VAEs). It measures how one probability distribution differs from another.
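In a VAE the KL term typically compares the encoder's diagonal Gaussian N(mu, sigma^2) with a standard normal prior, which has a closed form. The sketch below assumes that setup and checks the closed form against torch.distributions; the values of `mu` and `logvar` are made up.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical encoder outputs: batch of 2, latent dimension 2
mu = torch.tensor([[0.5, -1.0], [0.0, 0.0]])
logvar = torch.tensor([[0.0, 0.2], [0.0, 0.0]])

# Closed form: KL(N(mu, sigma^2) || N(0, 1)) summed over latent dimensions
kl_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# Same quantity via torch.distributions
q = Normal(mu, torch.exp(0.5 * logvar))                  # approximate posterior
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))    # standard normal prior
kl_lib = kl_divergence(q, p).sum(dim=1)

print(kl_closed, kl_lib)
```

The second sample already matches the prior exactly, so its KL term is zero.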
Wasserstein Loss
This is used in Wasserstein GANs. It measures the distance between the data distribution observed in the training dataset and the distribution observed in the generated examples.
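In the usual WGAN setup the critic outputs unbounded scores rather than probabilities, and both losses are simple means of those scores. A sketch with made-up scores; the function names `critic_loss` and `generator_loss` are mine, and the Lipschitz constraint on the critic (weight clipping or a gradient penalty) is left out here.

```python
import torch

def critic_loss(real_scores, fake_scores):
    """Critic tries to score real samples higher than generated ones."""
    return fake_scores.mean() - real_scores.mean()

def generator_loss(fake_scores):
    """Generator tries to raise the critic's score on generated samples."""
    return -fake_scores.mean()

# Hypothetical critic outputs for a batch of real and generated samples
real_scores = torch.tensor([1.0, 2.0])
fake_scores = torch.tensor([-1.0, 0.0])

c_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
print(c_loss.item(), g_loss.item())
```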
Useful links: