Accuracy: The Bias-Variance Trade-off
In the article “Which Machine Learning (ML) to choose?” [1], part of the “Architectural Blueprints—The ‘4+1’ View Model of Machine Learning,” which helps you choose the right ML for your data, we indicated that “From a business perspective, two of the most significant measurements are accuracy and interpretability.” [Interpretability/Explainability: “Seeing Machines Learn”]
We also claimed that “Evaluating the accuracy of a machine learning model is critical in selecting and deploying a machine learning model.”
But what factors affect model accuracy?
Accuracy is the percentage of correct predictions that a trained ML model makes. It is affected by how well your model fits the data, and model fitting depends on the Bias-Variance Trade-off in machine learning. Balancing bias and variance addresses both overfitting and underfitting.
Bullseye Diagram: The Distribution of Model Predictions. Image adapted: Domingo (2012) [2]
Additionally, accuracy is affected by your machine learning scenarios, which depend on learning categories, data types, and objectives. [Scenarios: Which Machine Learning (ML) to choose?]
Moreover, the computational complexity of an algorithm is a fundamental concept in computer science. It must be taken into account because it affects both the accuracy of your model and the amount of resources required to run it. [Complexity: Time, Space, & Sample]
Furthermore, future accuracy is affected by "data drift" and "concept drift". ML Operations (MLOps) and Continuous ML (CML) are sets of practices that aim to deploy and maintain machine learning models in production reliably and efficiently. [Operations: MLOps, Continuous ML, & AutoML]
- Definition
"Model fitting is a measure of how well [optimize] a machine learning model generalizes to similar [evaluation] data to that on which it was trained. A well-fitted model [optimal-fitted] produces more accurate outcomes ("Precisely Right"). A model that is overfitted matches the data too closely. A model that is under-fitted does not match closely enough." [3]
"In machine learning, overfitting occurs when a learning model customizes itself too much to describe the relationship between training data and the labels. Overfitting tends to make the model very complex by having too many parameters. By doing this, it loses its generalization power, which leads to poor performance on new [evaluation] data." [4]
"Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y)." [5]
Machine learning model complexity refers to the capacity of a model to fit complex patterns in the data. A more complex model can capture intricate relationships, but it also runs the risk of overfitting, which occurs when the model learns the training data too well and performs poorly on new, unseen data. Factors that contribute to model complexity include the number of parameters, the number of features, and the flexibility of the learning algorithm.
In essence, a more complex model is capable of learning more intricate patterns, but it also carries a higher risk of overfitting. Striking the right balance between model complexity and generalization ability is a key challenge in machine learning.
- Root Causes
Model fit depends on diagnosing the issue and balancing the trade-off between bias and variance.
"Understanding model fit is important for understanding the root cause for poor model accuracy. This understanding will guide you to take corrective steps. We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training and evaluation data." [6]
Variance is the degree of spread in a data set, indicating how far the data points are spread out from their mean [average] value. The variance of an estimated function indicates how much the function is capable of adjusting to a change in the data set. High variance results in overfitting, leading to an imprecise [not reliable] model. It can be caused by having too many features, building a more complex model than necessary, or capturing a high noise level. Generally, high-variance models adapt readily to a changing data set, but they are more complex and overly flexible.
Bias is the difference between the estimated value and the true value of the parameter being evaluated. High bias results in underfitting, leading to an inaccurate [not valid] ("Generally Wrong") model. It can be caused by training on a small data set, building a model too simple to capture complex patterns, or not taking into account all the features available for training, which causes the model to learn incorrect relations. Generally, high-bias models learn faster and are easier to understand, but they are less flexible. [7]
Cognitive biases are systematic patterns of deviation from norm or rationality in judgment. They are often studied in psychology, sociology and behavioral economics. Although the reality of most of these biases is confirmed by reproducible research, there are often controversies about how to classify these biases or how to explain them.
Cognitive Biases. Table: Justin Wright
Biases have a variety of forms and appear as cognitive ("cold") bias, such as mental noise, or motivational ("hot") bias, such as when beliefs are distorted by wishful thinking. Both effects can be present at the same time. There are also controversies over some of these biases as to whether they count as useless or irrational, or whether they result in useful attitudes or behavior. For example, when getting to know others, people tend to ask leading questions which seem biased towards confirming their assumptions about the person.
“A major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.”
Statistical bias is a systematic tendency that causes differences between results and facts. Statistical bias may be introduced at all stages of data analysis: data selection, hypothesis testing, estimator selection, analysis methods, and interpretation.
Statistical Bias Sources from Stages of Data Analysis. Diagram: Visual Science Informatics, LLC
Systematic error (bias) introduces noisy data with high bias but low variance. Although measurements are inaccurate (not valid), they are precise (reliable). Repeatable systematic error is associated with faulty equipment or a flawed experimental design and influences a measurement's accuracy ("Precisely Wrong").
Errors in Health Research. Chart: Unknown Author
Reproducibility (random) error (variance) introduces noisy data with low bias but high variance. Although measurements are accurate (valid), they are imprecise (not reliable). This random error is due to the measurement process and primarily influences a measurement's precision. Reproducibility refers to the variation in measurements made on a subject under changing conditions ("Generally Right").
Bias-Variance Trade-off. Graphs: Ivan Reznikov, PhD
Underfitting, Optimal-fitting, and Overfitting in Machine Learning. Graphs adapted from Scott Fortmann-Roe [8], Abhishek Shrivastava [9], and Andrew Ng [10]
Essentially, data quality, bias (systematic error), and variance (random reproducibility error) affect your ML model's accuracy.
- Trade-Off
“The expected test error of an ML model can be decomposed into its bias and variance through the following formula:
Expected Test Error = Bias² + Variance + Irreducible Error
So, to decrease the estimation error [to improve accuracy], you need to decrease both the bias and variance, which in general are inversely proportional and hence the trade-off." [11]
The bias-variance trade-off needs to be balanced to address any differences in accuracy. However, increasing bias generally (though not always) reduces variance, and vice versa.
- Classification Evaluation Metrics & Confusion Matrix
Once you fit your ML model, you must evaluate its performance on a test dataset.
Evaluating your model performance is critical, as your model performance allows you to choose between candidate models and to communicate how reasonable the model is at solving the problem.
The performance of a binary output prediction (classification), for instance, is captured in a specific table layout, a Confusion Matrix, which visualizes whether a model is confusing two classes. "An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. There are four possible outcomes for each output from a binary classifier." [Google] Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. Four measures are captured: True Positive, False Negative, False Positive, and True Negative.
Accuracy is derived from the four values in a confusion matrix. The accuracy of a diagnostic test is the proportion of subjects for whom the test gives the correct result. Additional classification evaluation metrics, with formulas shown on the right and below, include but are not limited to: Sensitivity, Specificity, Accuracy, Negative Predictive Value, and Precision.
Confusion Matrix for Model Evaluation and Formulas for Calculating Summary Statistics. Table: Rowland Pettit, et al.
"The False Positive Rate (FPR) is the proportion of all actual negatives that were classified incorrectly as positives, also known as the probability of false alarm. It is mathematically defined as:
False Positive Rate (FPR) Formula.
False positives are actual negatives that were misclassified, which is why they appear in the denominator. A perfect model would have zero false positives and therefore a FPR of 0.0, which is to say, a 0% false alarm rate. In an imbalanced dataset where the number of actual negatives is very, very low, say 1-2 examples in total, FPR is less meaningful and less useful as a metric." [Google]
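As an illustration (not part of the quoted Google material), the following Python sketch derives these confusion-matrix metrics with scikit-learn; the labels and predictions are made-up placeholders.

```python
# Minimal sketch: deriving confusion-matrix metrics with scikit-learn.
# The labels and predictions below are illustrative placeholders.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # model predictions

# ravel() unpacks the 2x2 matrix into the four basic counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # recall, true positive rate
specificity = tn / (tn + fp)          # true negative rate
fpr         = fp / (fp + tn)          # false alarm rate = 1 - specificity

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} FPR={fpr:.2f}")
```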
- Type I, Type II, and Type III Errors
In statistics, particularly in hypothesis testing, there are three main types of errors that can occur:
Type I Error
A Type I error occurs when you reject a null hypothesis that is actually true. This is often referred to as a "False Positive (FP)." For example, if a medical test incorrectly indicates that a person has a disease when they actually do not, that is a Type I error.
Type II Error
A Type II error occurs when you fail to reject a null hypothesis that is actually false. This is often referred to as a "False Negative (FN)." For example, if a medical test incorrectly indicates that a person does not have a disease when they actually do, that is a Type II error.
Type III Error
A Type III error occurs when you correctly reject the null hypothesis but draw the wrong conclusion about the alternative hypothesis. This is often less discussed but can be as problematic as Type I and Type II errors. For example, if you correctly identify that a new drug is effective, but mistakenly conclude that it is more effective than an existing drug, that is a Type III error.
Note: The balance between Type I and Type II errors is often considered in hypothesis testing. A higher significance level (alpha) increases the chance of a Type I error but decreases the chance of a Type II error. Conversely, a lower significance level decreases the chance of a Type I error but increases the chance of a Type II error. Type I and Type II errors can be caused by random sampling, but they can also be caused by bias. The likelihood of these errors can be reduced by increasing the sample size.
Type I, Type II, and Type III Errors. Table: Gemini
Confusion Matrix Heatmap. Heatmap: KNIME, AG
"In a binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. The probability score is not reality, or ground truth.
All of the metrics in this section are calculated at a single fixed threshold, and change when the threshold changes. Very often, the user tunes the threshold to optimize one of these metrics.
Note that the classification threshold is a value that a human chooses, not a value chosen by model training. Which evaluation metrics are most meaningful depends on the specific model and the specific task, the cost of different misclassifications, and whether the dataset is balanced or imbalanced." [Google]
Choice of Metric & Tradeoffs. Table: Google
In addition to accuracy, there are numerous model evaluation metrics. Three metrics that are commonly reported for a model on a binary classification problem are:
Precision quantifies the proportion of positive class predictions that actually belong to the positive class. Recall quantifies the proportion of all positive examples in the dataset that are predicted as positive. Precision and recall often show an inverse relationship, where improving one of them worsens the other. "The metrics form a hierarchy that starts by counting the true/false negatives/positives, at the bottom, continues by calculating the Precision and Recall (Sensitivity) metrics, and builds up by combining them to calculate the F1 score." [12]
Hierarchy of Metrics from Labeled Training Data and Classifier Predictions to F1 score. Diagram adapted: Teemu Kanstrén
Precision and Recall Formulas.
The F1 score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers.
F1 score Formula.
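The following minimal Python sketch, with made-up labels as placeholders, computes precision, recall, and the F1 score both through scikit-learn and directly from the harmonic-mean formula.

```python
# Minimal sketch: precision, recall, and their harmonic mean (F1).
# y_true / y_pred are illustrative placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"F1 (library)={f1_score(y_true, y_pred):.2f}  F1 (manual)={f1_manual:.2f}")
```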
Why F1 score?
Interpretation
When to Use F1 score
Matthews Correlation Coefficient (MCC)
MCC (Phi Coefficient) is a metric used to evaluate the performance of classification models for binary problems (two classes). It takes into account true positives, true negatives, false positives, and false negatives from the confusion matrix. MCC is a valuable tool for evaluating binary classification models, particularly when:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
-1: Complete disagreement between prediction and reality
-1 to 0: Indicate worse than random classification
0: Random classification (no better than chance)
0 to +1: Indicate different levels of good performance
+1: Perfect classification
Advantages of MCC:
Disadvantages of MCC:
By understanding MCC, you can gain a more comprehensive picture of your classification model's performance in binary problems.
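As a hedged illustration, the following Python sketch computes MCC from the confusion-matrix counts using the formula above and cross-checks it against scikit-learn's matthews_corrcoef; the labels are placeholders.

```python
# Minimal sketch: Matthews Correlation Coefficient, manually and via scikit-learn.
from math import sqrt
from sklearn.metrics import matthews_corrcoef, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_manual = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"MCC (manual)  = {mcc_manual:.3f}")
print(f"MCC (sklearn) = {matthews_corrcoef(y_true, y_pred):.3f}")
```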
Classification Evaluation Metrics. Table: Gemini
- Regression Evaluation Metrics
When evaluating the performance of a regression model, several metrics are commonly used. These metrics provide insights into how well the model fits the data and how accurate its predictions are.
R-squared (R2)
Adjusted R-squared
Mean Squared Error (MSE) - Average L2 Loss
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE) - Average L1 Loss
Mean Squared Prediction Error (MSPE)
Regression Evaluation Metrics. Table: Gemini
Choosing the Right Metric
The choice of metric depends on the specific context and goals of the regression analysis. For example, if outliers are a concern, MAE might be preferred over MSE. If interpretability is important, RMSE can be useful. Ultimately, a combination of metrics can provide a more comprehensive understanding of the model's performance.
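The following minimal Python sketch, using illustrative placeholder values, computes several of the regression metrics listed above with scikit-learn.

```python
# Minimal sketch: common regression metrics on illustrative values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.0, 6.5, 5.0])

mse  = mean_squared_error(y_true, y_pred)        # average L2 loss
rmse = np.sqrt(mse)                              # same units as the target
mae  = mean_absolute_error(y_true, y_pred)       # average L1 loss, robust to outliers
r2   = r2_score(y_true, y_pred)                  # proportion of variance explained

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```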
- Unsupervised Evaluation Metrics
When dealing with unsupervised learning tasks, where ground truth labels are not available, different metrics are used to assess the quality of the clustering results. Here are four commonly employed metrics:
Rand Index (RI)
Adjusted Rand Index (ARI)
Mutual Information (MI)
Normalized Mutual Information (NMI)
Unsupervised Evaluation Metrics. Table: Gemini
Note: While these metrics provide useful insights, it is important to consider their limitations. For instance, the Rand Index can be insensitive to certain types of clustering errors, and Mutual Information can be influenced by the number of clusters. In practice, a combination of metrics may be used to get a more comprehensive evaluation.
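As an illustrative sketch, the following Python snippet computes the Adjusted Rand Index and Normalized Mutual Information with scikit-learn; the reference labels and cluster assignments are placeholders.

```python
# Minimal sketch: comparing a clustering against reference labels
# (when such labels happen to exist) with ARI and NMI.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

labels_reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # illustrative ground truth
labels_clustered = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # illustrative clustering output

ari = adjusted_rand_score(labels_reference, labels_clustered)   # chance-corrected Rand Index
nmi = normalized_mutual_info_score(labels_reference, labels_clustered)

print(f"ARI={ari:.3f}  NMI={nmi:.3f}")   # values near 1.0 indicate strong agreement
```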
- Other Evaluation Metrics
Cross-Validation Errors (CV Errors)
Heuristic Methods to Find K (in Clustering)
Elbow Method: Plots the explained variance ratio against the number of clusters. The "elbow" point in the plot often indicates the optimal number of clusters.
Silhouette Coefficient: Measures how similar a data point is to its own cluster compared to other clusters. The optimal number of clusters is often the one that maximizes the average silhouette coefficient. The range is -1 to 1; higher is better (see the sketch after this list).
Gap Statistic: Compares the within-cluster dispersion to the expected dispersion under a null reference distribution. The optimal number of clusters is the one that maximizes the gap statistic.
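As a minimal sketch of the Elbow and Silhouette heuristics, the following Python snippet scans candidate values of k on synthetic data generated with scikit-learn's make_blobs; the data and parameter choices are illustrative assumptions.

```python
# Minimal sketch: scanning k with inertia (elbow) and the silhouette coefficient.
# make_blobs generates synthetic data purely for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Inertia (within-cluster sum of squares) keeps dropping as k grows; look for the "elbow".
    # The silhouette coefficient tends to peak near the natural number of clusters.
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```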
BLEU Score (BiLingual Evaluation Understudy)
These are just a few examples of other evaluation metrics that can be used in different contexts. The choice of metric depends on the specific task and the goals of the evaluation.
- Estimating Uncertainty in ML
Uncertainty in machine learning refers to the lack of confidence or certainty in a model's predictions. It is a crucial aspect of understanding a model's limitations and ensuring its reliability, especially in critical applications.
Uncertainty in Machine Learning. Table: Gemini
Uncertainty is an inherent aspect of machine learning models. It arises due to various factors, including noise in data, model complexity, and the inherent randomness of real-world phenomena. Understanding and quantifying uncertainty is crucial for building reliable and trustworthy ML models.
By understanding and addressing uncertainty in machine learning, you can build more reliable, trustworthy, and effective ML models.
- Estimating Future Accuracy Performance
Holdout method, cross-validation, and bootstrap sampling are techniques used in statistics and ML to evaluate the accuracy performance of models. They achieve this goal by resampling the data in different ways.
Comparison of Holdout validation, k-fold Cross-validation, and Bootstrap sampling. Diagrams: Vikas More
Holdout Method
This is a simple approach where the data is split into two sets: a training set and a test set.
Commonly, the split is 80% for training and 20% for testing. The advantage of this method is its simplicity. However, the accuracy performance estimate can be sensitive to how the data is split. A single random split might not be representative of the entire dataset.
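As a minimal sketch of the holdout method, the following Python snippet performs a single 80/20 split with scikit-learn; the breast-cancer dataset and logistic regression model are illustrative choices, not prescriptions.

```python
# Minimal sketch of the holdout method: a single 80/20 split.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.3f}")
```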
Cross-Validation
This is a more robust approach compared to the holdout method. It involves splitting the data into multiple folds (usually k folds), training the model on all but one fold, evaluating it on the held-out fold, and repeating until every fold has served once as the evaluation set.
Common variations include k-fold cross-validation (where k is a chosen number of folds) and leave-one-out cross-validation (where k is equal to the number of data points). This method provides a more reliable estimate of the model's generalizability.
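The following minimal sketch applies 5-fold cross-validation to the same illustrative setup; the dataset and model are again placeholder choices.

```python
# Minimal sketch of k-fold cross-validation (k=5).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"fold accuracies: {scores.round(3)}")
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")   # more stable than one split
```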
Bootstrap Sampling
This method involves creating new datasets, called bootstrap replicates, by sampling with replacement from the original data. This means a data point can be chosen multiple times in a single replicate, and some points might not be included at all. The replicates are then used to train the model, and the variability in the model's accuracy performance across these replicates is used to estimate the model's generalizability and uncertainty.
Bootstrap sampling is particularly useful for smaller datasets where holdout methods might not be reliable. It is also used to estimate the distribution of statistics, not just model accuracy performance.
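As a hedged sketch of bootstrap sampling, the following Python snippet trains on bootstrap replicates and evaluates on the out-of-bag points; the number of replicates, dataset, and model are illustrative assumptions.

```python
# Minimal sketch of bootstrap sampling: train on resampled replicates,
# evaluate on the out-of-bag points left out of each replicate.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(42)
n = len(y)
accuracies = []

for _ in range(50):                                  # 50 bootstrap replicates
    idx = rng.integers(0, n, size=n)                 # sample with replacement
    oob = np.setdiff1d(np.arange(n), idx)            # out-of-bag indices
    model = LogisticRegression(max_iter=5000).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))

acc = np.array(accuracies)
print(f"bootstrap accuracy: mean={acc.mean():.3f}, 95% interval="
      f"[{np.percentile(acc, 2.5):.3f}, {np.percentile(acc, 97.5):.3f}]")
```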
Here is a table summarizing the key differences:
Estimating Future Accuracy Performance Techniques. Table: Gemini
- Evaluation (Validation & Testing) in Traditional ML Workflow
Building a Machine Learning Model. Diagram: Chanin Nantasenamat
The traditional machine learning workflow is a structured process for developing and deploying machine learning models. It can be broken down into several key stages:
1. Data Preparation
2. Modeling
3. Evaluation
The traditional workflow above focuses on the core development stages of building and evaluating a machine learning model.
Traditional ML workflows primarily focus on model development and evaluation. However, the dynamic nature of data, characterized by data drift and concept drift, significantly impacts model accuracy over time. To address these challenges and maintain model reliability and efficiency in production, organizations adopt MLOps and Continuous ML (CML) practices. These methodologies encompass a comprehensive approach to deploying and maintaining ML models, including continuous monitoring, retraining, and redeployment, model versioning, experimentation, and robust collaboration between teams. [Operations: MLOps, Continuous ML, & AutoML]
A cross-validation model and a trained model are two key concepts in machine learning, each serving a distinct purpose in the development and evaluation of predictive models. Cross-validation assesses a model's performance on unseen data and helps prevent overfitting. The average performance metrics across all iterations provide a more robust estimate of the model's generalization ability. After selecting the best model based on cross-validation, you train it on the entire training dataset and deploy it to make predictions on new, unseen data based on the learned patterns in the training data.
Selecting the optimal estimator for a ML problem can be challenging due to the multitude of options available. Effectively navigating this process requires a deep understanding of the data, including its distribution, outliers, and missing values. By considering model bias, variance, and interpretability needs, practitioners can make informed decisions. Cross-validation and hyperparameter tuning are essential for building robust models. Additionally, feature engineering and domain knowledge play crucial roles in enhancing model performance. Ultimately, the choice of estimator is an iterative process that involves experimentation and evaluation.
Choosing the Right Estimator. Diagram: scikit-learn
The importance and interpretation of evaluation metrics depend on the domain and context of your ML model. For instance, medical tests are evaluated by specificity and sensitivity, while information retrieval systems are evaluated by precision and recall. Understanding the differences between precision and recall vs. specificity and sensitivity is significant in your model evaluation within a specific domain. [13]
Bias vs. Variance of ML Algorithms. Chart: Ega Skura
For ML model builders, understanding how accuracy is affected by model fitting is essential. An accurate classification model can correctly distinguish positives from negatives.
- Dataflow in a Traditional ML Workflow
Dataflow in a Traditional ML Workflow. Diagram: Visual Science Informatics
- Remedies
Overfitting and underfitting are common challenges in machine learning, where a model performs poorly on unseen data. There are effective techniques for solving overfitting and underfitting and building an optimally fitted ML model. Here is a breakdown of both issues and how to address them:
ABC of Data Science
ML is a form of Artificial Intelligence (AI) that makes predictions and decisions from data. It is the result of training algorithms and statistical models that analyze and draw inferences from patterns in data and that are able to learn and adapt without following explicit instructions. However, you need to:
The Assumptions, Biases, and Constraints (ABC) of data science, Data, and Models of ML can be captured in this formula:
Machine Learning = {Assumptions/Biases/Constraints, Data, Models}
Diagnosing ML Model "Goodness-of-Fit" using Learning Curves
“The term goodness-of-fit refers to a statistical test that determines how well sample data fits a distribution from a population with a normal distribution. Put simply, it hypothesizes whether a sample is skewed or represents the data you would expect to find in the actual population." [Investopedia]
Goodness-of-fit in machine learning assesses how closely the model's predictions align with the actual observed data points. A good-fitting model accurately captures the underlying patterns and relationships within the data, leading to reliable predictions and insights. By selecting appropriate goodness-of-fit metrics, you can effectively evaluate the performance of your machine learning models and make informed decisions.
Diagnosing ML Model "Goodness-of-Fit" using Learning Curves. Visual Science Informatics [14] [15] [16]
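As a minimal sketch (not the diagram's source code), the following Python snippet computes learning curves with scikit-learn to diagnose goodness-of-fit; the dataset and model are illustrative choices.

```python
# Minimal sketch: learning curves to diagnose goodness-of-fit.
# A large, persistent gap between the two curves suggests overfitting (high variance);
# curves that converge at a poor score suggest underfitting (high bias).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy"
)

for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"train size={size:4d}  train acc={tr:.3f}  validation acc={va:.3f}")
```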
Generalization Conditions
"Training a model that generalizes well implies the following dataset conditions:
Dimension Reduction Techniques
Dimension reduction is a technique used to simplify data by reducing the number of variables or features while preserving as much information as possible. This can be especially beneficial when dealing with high-dimensional datasets, which can be computationally expensive to analyze.
Why is dimension reduction important?
Dimension Reduction Techniques. Diagram: Javatpoint
1. Factor Analysis
- Assumes that observed variables are linear combinations of a smaller set of latent variables.
- Identifies underlying constructs or patterns in the data.
2. Principal Component Analysis (PCA)
- Finds a new set of uncorrelated variables (principal components) that explain the most variance in the data.
- Commonly used for linear dimensionality reduction.
- Often used for exploratory data analysis and visualization.
- Best suited for: Linear relationships, large datasets, and when the goal is to reduce dimensionality while preserving most of the variance.
- Example: Analyzing gene expression data to identify patterns or biomarkers.
3. Independent Component Analysis (ICA)
- Separates a multivariate signal into a set of statistically independent components. It assumes that the observed data is a linear mixture of underlying independent sources.
- Sensitive to the choice of independence criterion. May not be effective if the sources are highly correlated.
- Best fit for: handling non-Gaussian sources. Does not require prior knowledge of the mixing matrix. Can be computationally efficient.
- Examples: Blind Source Separation (BSS) - Separating mixed signals, such as audio signals or EEG data. Feature extraction - Extracting meaningful features from high-dimensional data. Medical imaging - Analyzing brain signals and medical images. Financial analysis - Identifying hidden factors in financial data.
4. Isometric Mapping (ISOMAP)
- Captures nonlinear relationships in data by aiming to preserve the geodesic distances between data points in a high-dimensional space.
- Can be computationally expensive for large datasets and sensitive to the choice of neighborhood parameters.
- Best suited for: when the data points lie on a nonlinear manifold and preserving the global structure of the manifold is important.
- Examples: Data visualization - Understanding the underlying structure of high-dimensional data. Image and video analysis - Extracting meaningful features from visual data. Machine learning - As a preprocessing step to improve the performance of algorithms.
5. t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Preserves local structure in the data and is useful for visualizing high-dimensional data in 2D or 3D.
- Best suited for: Visualizing high-dimensional data in 2D or 3D, preserving local structure, and understanding non-linear relationships.
- Example: Visualizing word embeddings to understand semantic relationships between words.
6. Uniform Manifold Approximation and Projection (UMAP)
- Reduces non-linear dimensionality and preserves local structure.
- Best suited for: Non-linear relationships, preserving global and local structure, and handling complex data structures.
- Example: Analyzing time series data to identify patterns or anomalies.
Note: PCA focuses on global structure and variance, while t-SNE and UMAP prioritize preserving local structure, making them better suited for visualization and exploring complex relationships. In essence, PCA reduces the number of features while retaining most of the information, whereas t-SNE and UMAP create a new, lower-dimensional representation of the data that emphasizes relationships between data points.
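To make the contrast concrete, here is a minimal Python sketch that applies PCA and t-SNE to scikit-learn's digits dataset; the dataset and hyperparameters (e.g., perplexity) are illustrative assumptions.

```python
# Minimal sketch: PCA for variance-preserving reduction vs. t-SNE for
# local-structure visualization, on the illustrative digits dataset.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 64-dimensional inputs

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_.round(3))

# t-SNE builds a new 2-D embedding that emphasizes local neighborhoods;
# it is for visualization, not a general-purpose transform for new data.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print("PCA shape:", X_pca.shape, " t-SNE shape:", X_tsne.shape)
```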
Choosing the Right Technique
Before applying any dimensionality reduction technique, it is crucial to understand the characteristics of your data. The best technique for a given dataset depends on factors such as:
- Goals of the analysis (e.g., visualization, classification, feature extraction)
- Nature of the data (e.g., numerical, categorical)
- Number of dimensions (e.g., features, variables)
- Combining multiple variables into a feature (e.g., aggregation, interaction, bucketing)
- Transforming raw data into features, which are more informative and relevant, via transformation techniques (e.g., transformation, encoding)
- Distribution of features (e.g., normal, skewed)
- Relationships between features (e.g., correlations, dependencies, causality)
- The computational resources available
Breakdown of the decision tree
1. Goal
- Visualization: t-SNE or UMAP are often preferred for their ability to preserve local structure.
- Feature extraction: PCA can be used to extract the most important features.
- Classification or regression: Consider techniques such as LDA or autoencoders.
2. Data Type and Dimensionality
- High-dimensional numerical data: Consider PCA, t-SNE, or UMAP.
- Low-dimensional numerical data: PCA might be sufficient.
- Categorical data: Consider techniques such as correspondence analysis or multidimensional scaling.
3. Linearity
- Linear relationships: PCA is a good choice.
- Non-linear relationships: t-SNE or UMAP are more suitable.
4. Computational Resources
- Large datasets: PCA might be computationally more efficient than t-SNE or UMAP.
Additional Considerations
- Domain knowledge: Incorporate your understanding of the data and the problem domain to make informed decisions.
- Experimentation: Try different techniques and evaluate their performance using appropriate metrics.
- Hyperparameter tuning: Fine-tune the parameters of each technique to optimize results.
Remember that this is a general guide, and the best choice often depends on specific data characteristics and goals. It is always a good practice to experiment with different techniques and evaluate their performance to find the most suitable one for your particular problem.
- Underfitting Example
Underfitting example. Python code: Skbkekas. Graphs: Visual Science Informatics
"A training set (left) and a test set (right) from the same statistical population are shown as blue points. Two predictive models are fit to the training set. Both fitted models are plotted with both the training and test sets. In the training set, the MSE of the fit shown in orange is about 10 whereas the MSE for the fit shown in green is about 8. In the test set, the MSE for the fit shown in orange is about 14 and the MSE for the fit shown in green is about 10. The orange curve severely underfits the training set, since its MSE increases by almost a factor of 4 when comparing the test set to the training set. The green curve underfits the training set much less, as its MSE increases by less than a factor of 2."
- Remedies for Underfitting
In underfitting, the model is too simple and fails to capture the underlying patterns in your data, leading to poor performance on both training and unseen data. Underfitting is the opposite of overfitting.
Data
Model Complexity
Training
- Overfitting Example
Overfitting example. Python code: Skbkekas. Graphs: Visual Science Informatics
"A training set (left) and a test set (right) from the same statistical population are shown as blue points. Two predictive models are fit to the training set. Both fitted models are plotted with both the training and test sets. In the training set, the MSE of the fit shown in orange is about 1 whereas the MSE for the fit shown in green is about 6. In the test set, the MSE for the fit shown in orange is about 11 and the MSE for the fit shown in green is about 7. The orange curve severely overfits the training set, since its MSE increases by almost a factor of 10 when comparing the test set to the training set. The green curve overfits the training set much less, as its MSE increases by less than a factor of 1."
- Remedies for Overfitting
Imagine a model memorizing every detail of your training data, including random noise. This makes it perform well on that specific data but fails to generalize to new examples.
Data
Model Complexity
Training
It is critical to find the balance between overfitting and underfitting. Experiment with different techniques and evaluate your model's performance on a validation set to determine the best approach for your specific scenario. Combine models via the Stacking ensemble method to reduce both bias and variance. In practice, L1 and L2 regularization are often combined in a technique called Elastic Net. This combines the benefits of both L1 and L2 regularization, promoting sparsity and preventing overfitting.
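As a hedged illustration of Elastic Net, the following scikit-learn sketch fits a model whose l1_ratio parameter mixes the L1 and L2 penalties; the dataset and hyperparameter values are placeholders.

```python
# Minimal sketch: Elastic Net combines L1 and L2 penalties; l1_ratio controls the mix
# (1.0 = pure Lasso/L1, 0.0 = pure Ridge/L2). Dataset choice is illustrative.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)   # alpha = overall regularization strength
model.fit(X_tr, y_tr)

print("R^2 on test set:", round(model.score(X_te, y_te), 3))
print("zeroed coefficients (sparsity from L1):", int((model.coef_ == 0).sum()))
```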
"Models trained on large datasets with few features generally outperform models trained on small datasets with a lot of features. It is possible to get good results from a small dataset if you are adapting an existing model already trained on large quantities of data from the same schema." [Google]
"Learning rate and regularization rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero; a high regularization rate pulls weights towards zero.
If the regularization rate is high with respect to the learning rate, the weak weights tend to produce a model that makes poor predictions. Conversely, if the learning rate is high with respect to the regularization rate, the strong weights tend to produce an overfit model.
Your goal is to find the equilibrium between learning rate and regularization rate. This can be challenging. Worst of all, once you find that elusive balance, you may have to ultimately change the learning rate. And, when you change the learning rate, you'll again have to find the ideal regularization rate." [Google]
Feedback loops in ML can occur when a model's output is used as input to the same or another model, creating a circular dependency. This can lead to unintended consequences, such as reinforcement (amplification) of biases, instability (unstable or oscillating behavior), or reduced accuracy (degradation over time). Here are techniques to address feedback loops:
By carefully considering these techniques, you can effectively address feedback loops in your machine learning systems and ensure their stability and accuracy.
"When creating a model with multiple features, the values of each feature should span roughly the same range. If one feature's values range from 500 to 100,000 and another feature's values range from 2 to 12, the model will need to have weights of extremely low or extremely high values to be able to combine these features effectively. This could result in a low quality model. To avoid this, normalize features in a multi-feature model.
Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:
Normalization is a common task in feature engineering. Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range." [Google]
Summary of Normalization Techniques. Table: Google
"Normalization provides the following benefits:
Warning: If you normalize a feature during training, you must also normalize that feature when making predictions in validation, testing, and production!
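The following minimal sketch shows this practice with scikit-learn's StandardScaler (z-score normalization): the scaler is fit on the training data only and reused for the test data; the dataset is an illustrative choice.

```python
# Minimal sketch: z-score normalization with StandardScaler. The scaler is fit on the
# training data only and then reused, so validation/test/production data are scaled
# with the same parameters, as the warning above requires.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)    # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the same mean/std at prediction time

print("train mean ~0:", X_train_scaled.mean(axis=0)[:3].round(2))
print("train std  ~1:", X_train_scaled.std(axis=0)[:3].round(2))
```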
- Data Visualization
Data visualization is a graphical representation of information and data. Using visual elements, such as charts, graphs, and maps, data visualization techniques provide a visual way to see and understand trends, outliers, and patterns in data. Visualization tools provide capabilities that help discover new insights by demonstrating relationships between data.
Anscombe's Quartet. Graphs: Schutz [18]
An additional benefit of visualizing data is revealing that data sets with similar descriptive statistics, such as mean, variance, correlation, linear regression, and coefficient of determination of the linear regression, can nevertheless have very different distributions and appear very different when graphed.
Anscombe's quartet [19], in the above image, comprises four data sets that demonstrate both the importance of graphing data when analyzing it and the effect of outliers and other influential observations on statistical properties.
In ML, the three major reasons for data visualization are understanding, diagnosing, and refining your model.
One important reason to visualize your model is to provide an interpretable (reasoning) predictive model and explainability. Other significant purposes are visualizing your model architecture, parameters, and metrics.
Also, you might need to visualize your model during debugging and improvements, comparison and selection, and teaching concepts.
Visualization is most relevant during training for monitoring and observing several metrics and tracking model training progression. After training, visualizing model inference, the process of drawing conclusions from a trained model, helps in interpreting and retracing how the model generates its estimates (Visualizing Machine Learning Models: Guide and Tools). [20]
"Visualizations are critical in understanding complex data patterns and relationships. They offer a concise way to understand the: intricacies of statistical models, validate model assumptions, and evaluate model performance." [Avi Chawla]
Plots in Data Science. Graphs: Avi Chawla
The following are plots that can be utilized for machine learning model validation and evaluation:
1) Kolmogorov-Smirnov (KS) plot - Visualizing distributional differences
The KS statistic acts as a statistical test for distributional differences. The KS plot itself offers more than a binary "same or different" answer. By visually inspecting the plot, you can gain insights into the nature of the difference between the distributions by:
2) SHAP plot - Unveiling feature importance with interplay
SHAP plots are incredibly valuable for understanding how features influence a model's predictions. While SHAP values can be used to rank features by importance, their true strength lies in explaining how individual features interact and contribute to specific predictions. They achieve this by:
3) Receiver Operating Characteristic (ROC) curve - Evaluating binary classification models
The ROC curve "depicts the trade-off between the true positive rate (good performance) and the false positive rate (bad performance) across different classification thresholds."
Interpreting the ROC curve:
The ROC curve plots TPR against FPR at different classification thresholds. This means it captures the trade-off between correctly identifying true positives and minimizing false positives as you adjust the model's sensitivity.
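As an illustrative sketch, the following Python snippet computes ROC-curve points and the AUC from predicted probabilities; the dataset and model are placeholder choices.

```python
# Minimal sketch: ROC curve points and AUC from predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)     # TPR vs. FPR across thresholds
print(f"AUC = {roc_auc_score(y_te, probs):.3f}")  # 0.5 = chance, 1.0 = perfect ranking
```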
4) Quantile-Quantile (QQ) plot - Comparing the distribution of your data to a theoretical distribution
QQ plot "assesses the distributional similarity between observed data and theoretical distribution. It plots the quantiles of the two distributions against each other. It visualize deviations from the straight line indicate a departure from the assumed distribution."
Understanding QQ plot:
Types of QQ plot:
Beyond the Basics:
5) Cumulative Explained Variance (CEV) plot - Utilizing as a tool for tasks such as dimensionality reduction
CEV plot is "useful in determining the number of dimensions that can be reduced for your data to while preserving max variance during Principal Component Analysis (PCA).
What it shows:
Interpreting the CEV plot:
6) Elbow curve - Utilizing as a tool particularly for determining the optimal number of clusters in K-means clustering
The Elbow curve "helps identify the optimal number of clusters for the k-means algorithm. The point of the elbow depicts the ideal number of clusters."
Understanding the Elbow curve:
Limitations of the Elbow curve:
7) Silhouette curve - Providing an alternative to the Elbow curve for determining the optimal number of clusters in K-means clustering
Silhouette curve is an alternative to the Elbow curve, which is often ineffective when you have a large number of clusters.
What it shows:
Interpreting the plot:
Benefits over the Elbow curve:
Limitations:
8) Gini Impurity and Entropy - Measuring splits in decision tree algorithms
Gini Impurity and Entropy serve the same purpose of quantifying impurity within a node, but they approach it from different perspectives. "They are used to measure the impurity or disorder of a node or split in a decision tree. The plot compares Gini impurity and Entropy across different splits. It provides insights into the trade-off between these measures."
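As a small illustration, the following Python sketch computes Gini impurity and entropy for a few example class proportions; the proportions are made-up placeholders.

```python
# Minimal sketch: Gini impurity and entropy for a node's class proportions.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)          # 0 = pure node, higher = more mixed

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # avoid log2(0)
    return -np.sum(p * np.log2(p))        # 0 = pure node, 1 = 50/50 binary split

for proportions in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(proportions,
          f"gini={gini(proportions):.3f}",
          f"entropy={entropy(proportions):.3f}")
```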
Gini Impurity:
Entropy:
Key differences:
Choosing the right metric:
9) Bias-Variance Trade-off plot - Finding the right balance between the bias and the variance of a model against complexity
The Bias-Variance Trade-off plot visually depicts the relationship between a model's complexity, its accuracy on the training data (bias), and its ability to generalize to unseen data (variance).
Understanding the plot:
Plot shape:
Interpreting the Trade-off:
10) Partial Dependence Plots (PDPs) - Utilizing as a tool for understanding how individual features influence a model's predictions
PDPs visually show the average marginal effect of a feature on the model's prediction, holding all other features constant. This helps you understand how a specific feature changes the prediction, independent of the influence of other features. PDPs depict the dependence between target and features.
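As a hedged sketch (assuming a recent scikit-learn version that provides PartialDependenceDisplay), the following snippet plots partial dependence for two arbitrarily chosen features of a fitted model.

```python
# Minimal sketch: partial dependence for two illustrative features of a fitted model.
# PartialDependenceDisplay averages predictions over the data while varying one feature.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Feature indices 2 and 8 are arbitrary illustrative choices.
PartialDependenceDisplay.from_estimator(model, X, features=[2, 8])
plt.show()
```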
Different types of PDPs:
Interpreting a PDP:
Benefits of PDPs:
Limitations of PDPs:
11) Precision-Recall (PR) curve - Evaluating binary classification models
The PR curve "depicts the trade-off between Precision and Recall across different classification thresholds."
Precision vs. Recall:
The PR curve plots precision against recall at different classification thresholds, similar to the ROC curve. However, unlike the ROC curve, which focuses on false positives, the PR curve directly examines the trade-off between precision and recall, which is often more relevant when class imbalance is significant or when false negatives are more costly than false positives (e.g., medical diagnosis).
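As an illustrative sketch, the following Python snippet computes precision-recall pairs and the average precision from predicted probabilities; the dataset and model are placeholder choices.

```python
# Minimal sketch: precision-recall pairs and average precision from predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)
print(f"average precision (area under the PR curve) = "
      f"{average_precision_score(y_te, probs):.3f}")
```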
Interpreting the Curve:
Dashboard Charts for Model Accuracy Evaluation in a Guided Automation. Charts: KNIME, AG
Dashboard charts for model accuracy evaluation in a guided automation provide a comprehensive overview of the performance of different machine learning models after undergoing a rigorous automation process.
The dashboard serves as a centralized location for:
Dashboard components:
Potential enhancements:
While the described dashboard provides valuable insights, consider adding these elements for further enrichment:
- In Essence
“Balancing bias and variance ... is the best way to ensure that model is sufficiently [optimally] fit on the data and performs well on new [evaluation] data.” Solving the issue of bias and variance is about dealing with overfitting and underfitting and building an optimal model. [29]
Next, read my "Complexity: Time, Space, & Sample" article at https://www.dhirubhai.net/pulse/complexity-time-space-sample-yair-rajwan-ms-dsc.
---------------------------------------------------------
[5] https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
[11] https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-and-visualizing-it-with-example-and-python-code-7af2681a10a7
[13] https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1
In-depth reading at https://www.enjoyalgorithms.com/blog/bias-variance-tradeoff-in-machine-learning
Read the "Complexity - Time, Space, & Sample" article at https://www.dhirubhai.net/pulse/complexity-time-space-sample-yair-rajwan-ms-dsc