Is Your Machine Learning Model That Good?
TensorBoard, an open-source machine learning visualization platform developed by Google, provides an excellent view of an artificial neural network model as it trains. The image below presents compelling precision and accuracy figures for a model in training.
image 1: A binary classifier for a stock forecaster. A (1) label represents the stock reaching a meaningful target % change within 21 days. You can clearly see how the model improves over time: with more data, its accuracy and precision improve. Hence, the model is learning.
The model is a binary classifier tasked with identifying one of two states:
- (1) - A buy.
- (0) - A sell if you were holding a position, or do nothing.
After 140 passes over the training data (called epochs in machine learning lingo), we've achieved 64% accuracy and 69% precision.
Accuracy: Measures how often the classifier is correct (64.4%). In other words, the number of (1) and (0) signals predicted correctly as a percentage of all predictions.
Precision: Indicates the rate of true positives among predicted buys (69%). Simply stated, precision is how often a predicted buy signal turns out to be correct, as a percentage of all predicted buy signals.
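As a quick sanity check, here is a minimal sketch of the arithmetic behind those two figures. The raw counts below are hypothetical (the actual counts behind the 64.4% and 69% readings are not shown); they are chosen only so the percentages match.

```python
# Hypothetical counts, for illustration only.
true_positives = 69      # predicted (1) and the stock hit the target
false_positives = 31     # predicted (1) but the stock missed the target
true_negatives = 253     # predicted (0) and the stock missed the target
false_negatives = 147    # predicted (0) but the stock hit the target

total = true_positives + false_positives + true_negatives + false_negatives

accuracy = (true_positives + true_negatives) / total             # correct / all predictions
precision = true_positives / (true_positives + false_positives)  # correct buys / predicted buys

print(f"accuracy:  {accuracy:.1%}")    # 64.4%
print(f"precision: {precision:.1%}")   # 69.0%
```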
On the surface, the statistics look compelling. If we assume that 69% of all of our trades will generate a profit, we could in theory run a very successful fund. However, this is only true if buy and sell signals were split 50/50 and the profit or loss on every transaction were identical, which is clearly never the case in reality. In short, we need more information to determine whether the model is good, as the following scenarios illustrate:
- Imagine we evaluate the model during bullish periods in which the underlying stocks move higher 70% of the time anyway. If the model did nothing but bet a buy, or (1), every time, it would achieve 70% accuracy by default.
- Imagine the model makes only one prediction and it happens to be successful. Does that really mean the model can reliably achieve 100% success? Probably not.
The default assumptions of a 50/50 signal distribution and a 50/50 chance of a (1) or (0) across our signal universe (often referred to as labeled data, i.e., data whose outcome is already known) clearly do not hold in the real world, and that makes evaluating a trading model's efficacy tricky.
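To make the pitfall concrete, here is a minimal sketch with synthetic labels: in a universe where roughly 70% of observations end up as (1), a "model" that always predicts a buy scores about 70% accuracy without learning anything.

```python
import random

random.seed(42)
# Synthetic bullish period: ~70% of observations end up as a (1).
labels = [1 if random.random() < 0.70 else 0 for _ in range(10_000)]

# A "model" that always predicts a buy, no matter what.
always_buy = [1] * len(labels)

accuracy = sum(p == y for p, y in zip(always_buy, labels)) / len(labels)
print(f"always-buy baseline accuracy: {accuracy:.1%}")  # roughly 70%
```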
Today, I want to take you step by step through grading a model with a single score, so that many models can be compared against each other more easily.
Confusion Matrix
It’s clear that measuring how often the model correctly identified (1) and (0) labels (true positives and true negatives) is not enough. The rate at which the model was wrong (false positives and false negatives) should be considered as well.
A confusion matrix is a tabular representation of the performance of a classification model. Given a set of validation data for which the true values are known, the confusion matrix shows how effective the model was at predicting the known outcomes. In our case we are looking at a binary classifier with two possible outputs: 1 (buy) or 0 (sell or do nothing).
image 2: A sample of a confusion matrix with 1000 labels. Predicted 1’s and 0’s vs actual 1’s and 0’s.
Blue cells: The total predicted 1's (buys) and 0's (sells), 340 and 660 respectively, for a total of 1,000 predictions.
Red cells: The total actual 1's (buys) and 0's (sells), 460 and 540 respectively, for a total of 1,000 observations.
Green cells: True positives (150 predictions) and true negatives (350 predictions), i.e., the cases where the classifier correctly predicted the known outcome. We want the model to maximize the values of the green cells.
Yellow cells: False positives (190 predictions) and false negatives (310 predictions), i.e., the model's incorrect predictions. We naturally favor a model that minimizes the yellow cells' values and maximizes the green cells' values.
As you can see, even with a simple binary 0/1 classification the outcome can be fairly confusing (no pun intended), let alone when there are three or more potential outcomes; for example, 1, 0, -1 (1 buy, 0 do nothing, -1 sell) would generate a three-by-three confusion matrix.
It turns out that there are quite a few valuable statistical measures that can be derived from the confusion matrix, some of which we’ve seen earlier. To name a few:
Accuracy - How often is the classifier correct? (TP + TN) / Total observations = (150 + 350) / 1000 = 50%
Precision - How often are the buy predictions correct? TP / Total buy predictions = 150 / 340 = 44.12%
Sensitivity - Also called True Positive Rate or Recall. What percentage of the buy labels was the classifier able to identify? TP / Actual buys = 150 / 460 = 32.61%
Specificity - Also called True Negative Rate. What percentage of the sell (0) labels were identified correctly by the classifier? TN / Actual sells = 350 / 540 = 64.81%
Misclassification Rate - How often is the classifier wrong? (FP + FN) / Total = (190 + 310) / 1000 = 50%
Null Error Rate - How often would we be wrong if we always predicted the majority class? In our case, sell is the majority class (540 cases): if we forecasted 1,000 sells we would be right in 540 cases and wrong in 460 (1,000 - 540 = 460). So 460 wrong forecasts against 1,000 predictions gives 460 / 1000 = 46%.
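For reference, here is a minimal sketch that reproduces every one of these statistics from the four cells of the sample confusion matrix above:

```python
# The four cells of the sample confusion matrix (1,000 labels).
TP, FP, FN, TN = 150, 190, 310, 350
total = TP + FP + FN + TN                           # 1,000 observations

accuracy          = (TP + TN) / total               # 0.50
precision         = TP / (TP + FP)                  # 0.4412 (150 / 340 predicted buys)
sensitivity       = TP / (TP + FN)                  # 0.3261 (150 / 460 actual buys)
specificity       = TN / (TN + FP)                  # 0.6481 (350 / 540 actual sells)
misclassification = (FP + FN) / total               # 0.50
null_error_rate   = min(TP + FN, TN + FP) / total   # 0.46 (always predict the majority class: sell)

for name, value in [
    ("accuracy", accuracy), ("precision", precision),
    ("sensitivity", sensitivity), ("specificity", specificity),
    ("misclassification rate", misclassification),
    ("null error rate", null_error_rate),
]:
    print(f"{name:>22}: {value:.2%}")
```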
Judging from the results above, the classifier didn't truly learn anything meaningful. However, with so many parameters to sift through, it's difficult to wrap your head around what each of them truly means. For example:
- Is the classifier making fewer bets in order to obtain a higher degree of precision?
- Is the classifier making better sell or do-nothing predictions than buy predictions? This could be meaningful for some investment paradigms, but typically we want to deploy our capital rather than sit on cash.
There must be a way to bring all these parameters to a common baseline so we can compare models' fitness for a predetermined investment objective.
Introducing the ROC curve!
ROC
Receiver Operating Characteristic is a graphical representation of the trade-off between a model's sensitivity and specificity across multiple probability thresholds. In essence, we evaluate a model's efficacy by its ability to separate the True Positive Rate from the True Negative Rate across a wide range of probability scores. The image below shows how a model distinguishes between true positives and true negatives at a 50% probability threshold (designated as the "cut off value").
image 3: A synthetic distribution for a hypothetical model that distinguishes very effectively (90% accuracy) between buy signals (true positives) and sell signals (true negatives). The shaded area where the two distribution curves overlap, on either side of the vertical threshold line, indicates the model's misclassifications (false positives and false negatives).
Blue curve: Represents the 0 (sell or do nothing) predictions.
Red curve: Represents the 1 (buy) predictions.
As you can see, we’ve introduced two additional concepts:
- Probability - A confidence score (normally 0 to 1, or 0 to 100) generated by the classifier; for example, the normalized softmax output for the (1) label.
- Threshold - The probability cutoff used to distinguish between (1) and (0) labels; in our example, between sell (blue) and buy (red) signals.
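Here is a minimal sketch of how a threshold turns probability scores into (1)/(0) signals. The scores below are made up for illustration and would normally come from the classifier (for example, the softmax output for the (1) class).

```python
# Hypothetical buy probabilities produced by a classifier.
buy_probabilities = [0.92, 0.35, 0.61, 0.48, 0.77, 0.12]

threshold = 0.50   # the "cut off value" in image 3
signals = [1 if p >= threshold else 0 for p in buy_probabilities]
print(signals)          # [1, 0, 1, 0, 1, 0]

# Raising the threshold produces fewer buy signals, each made with
# higher confidence.
signals_strict = [1 if p >= 0.75 else 0 for p in buy_probabilities]
print(signals_strict)   # [1, 0, 0, 0, 1, 0]
```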
It’s important to note that in the real world the data is rarely distributed as perfectly as depicted in image-3 above.
The two true positive vs. true negative distributions (also known as sensitivity vs. specificity) below do a much better job of distinguishing between a poor model (the first image) and a more effective model (the second).
images 4 and 5: Two models' true positive and true negative distributions. The first model is not very effective: the large overlap between its curves indicates a high rate of false positives and false negatives. The second model is much more effective, with a clear separation between true positives and true negatives, leading to 85% accuracy at a 0.5 (50%) threshold.
It's important to note that as we move the threshold up or down, the balance between true positives and false positives changes accordingly. The right probability threshold is mainly driven by business needs, or more specifically by how well the business can tolerate errors in the form of false positives or false negatives.
For example, if we were building a model that predicts the likelihood of a deadly disease such as cancer, we would probably use a low threshold, since we want to capture ALL cancer patients even at the cost of a few false positives. In trading models, however, we can be more tolerant of bad trades as long as we capture more profitable trades with superior net gains.
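To see the trade-off in action, here is a minimal sketch that sweeps the threshold over synthetic labels and scores and reports the resulting true positive and false positive rates; the data is fabricated purely for illustration.

```python
import random

random.seed(7)
# Synthetic labels, with positives given slightly higher scores so the
# "model" has some skill.
labels = [random.randint(0, 1) for _ in range(1_000)]
scores = [random.betavariate(3, 2) if y == 1 else random.betavariate(2, 3)
          for y in labels]

for threshold in (0.3, 0.5, 0.7):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    print(f"threshold {threshold:.1f}: TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")

# Lower thresholds capture more true buys (higher TPR) at the cost of more
# false alarms (higher FPR); higher thresholds do the opposite.
```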
Now that we are able to capture true positive rate and false positive rate for any probability threshold in a model, we can go ahead and capture it in a ROC graph.
We plot the ROC graph based on two measures for every possible probability threshold:
- X axis - False positive rate
- Y axis - True positive rate
Images 6 and 7: In the top image we marked two probability thresholds (A and B) for a model that separates the true positives from the true negatives effectively. This class separation can then be depicted along a ROC curve (bottom image), which plots the false positive rate on the X axis and the true positive rate on the Y axis.
The ROC graph in essence depicts the tradeoff between the true positive rate and the false positive rate as we vary the decision threshold. When comparing multiple ROC curves, a larger area under the curve (referred to as AUC) indicates a model that does a better job of separating the classes, or simply put, is more predictive.
To reduce the strength of a classification model to a single value, we simply measure the area under the ROC curve, the AUC. A larger value represents a better model.
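If scikit-learn and matplotlib are available, a minimal sketch of plotting a ROC curve and scoring its AUC might look like the following; the labels and scores are again synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1_000)
# Positives get higher scores on average, so the model beats chance.
scores = np.where(labels == 1, rng.beta(3, 2, 1_000), rng.beta(2, 3, 1_000))

fpr, tpr, thresholds = roc_curve(labels, scores)  # one point per threshold
auc = roc_auc_score(labels, scores)               # area under that curve

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.50)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```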
A comparison of three ROC graphs:
Blue – Random guesses, for which the true positive and true negative distributions completely overlap. AUC is measured at 0.5.
Green – A poor model, in which a large portion of the true positive curve overlaps the true negative curve. AUC is measured at 0.62.
Orange – A good model with a distinct separation between the true positive and true negative curves. AUC is measured at 0.93.
Conclusion
One of the biggest challenges in machine learning research is identifying the best model from an array of trained models. During hyper-parameter tuning, the system can generate hundreds if not thousands of models, and identifying which one is superior to its peers can be difficult. This has traditionally been a function of human intuition (some would call it art) as much as science. We now have a clean, scientific way to quantifiably measure the strength of a model. Further, we can modify or extend the learner's cost function to train for a greater AUC (Area Under the Curve) score.
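As an illustration of that workflow, here is a hypothetical sketch of picking the winner from a batch of candidate models by validation AUC. `candidate_models`, `X_val`, and `y_val` are placeholders for your own models and held-out validation data, and each model is assumed to expose scikit-learn's predict_proba interface.

```python
from sklearn.metrics import roc_auc_score

def best_by_auc(candidate_models, X_val, y_val):
    """Return (name, model, auc) for the candidate with the highest
    validation AUC. `candidate_models` is a dict of name -> fitted model."""
    scored = []
    for name, model in candidate_models.items():
        buy_probabilities = model.predict_proba(X_val)[:, 1]  # score for the (1) class
        scored.append((name, model, roc_auc_score(y_val, buy_probabilities)))
    return max(scored, key=lambda item: item[2])
```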
Another interesting approach, which can be used as an overlay to the AUC score, is Cohen's Kappa. Cohen's Kappa measures the inter-rater agreement of a classifier: it is a statistical measure of how much more meaningful a model's confusion matrix is than a random guess, or simple luck. In our case, it would be nice to further ascertain that our winning AUC score did not occur by chance. We will try to cover Cohen's Kappa in future posts.
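As a small preview, here is a minimal sketch of the calculation applied to the sample confusion matrix from earlier:

```python
# Cohen's Kappa from the sample confusion matrix (TP=150, FP=190, FN=310, TN=350).
TP, FP, FN, TN = 150, 190, 310, 350
total = TP + FP + FN + TN

observed_agreement = (TP + TN) / total          # plain accuracy: 0.50
expected_agreement = (                          # agreement expected by chance alone
    (TP + FP) * (TP + FN) +                     # predicted buys * actual buys
    (TN + FN) * (TN + FP)                       # predicted sells * actual sells
) / total**2                                    # 0.5128

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(f"Cohen's Kappa: {kappa:.3f}")   # about -0.026: no better than chance
```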