Accuracy: The Bias-Variance Trade-off

In the article “Which Machine Learning (ML) to choose?" [1], as part of the "Architectural Blueprints—The “4+1” View Model of Machine Learning," which helps you to choose the right ML for your data, we indicated that “From a business perspective, two of the most significant measurements are accuracy and interpretability.” [Interpretability/Explainability: “Seeing Machines Learn”]

We also claimed that “Evaluating the accuracy of a machine learning model is critical in selecting and deploying a machine learning model.”

- But, what factors affect model accuracy?

Accuracy is the percentage of correct predictions that a trained ML model makes. Accuracy is affected by your model fitting. And, model fitting depends on the Bias-Variance Trade-off in machine learning. Balancing bias and variance can solve overfitting and underfitting.

Bullseye Diagram: The Distribution of Model Predictions. Diagram adapted: Domingos (2012) [2]

Additionally, accuracy is affected by your machine learning scenarios, which depend on learning categories, data types, and objectives. [Scenarios: Which Machine Learning (ML) to choose?]

Moreover, the computational complexity of an algorithm is a fundamental concept in computer science. It must be taken into account because it affects both the accuracy of your model and the amount of resources required to run it. [Complexity: Time, Space, & Sample]

Furthermore, future accuracy is affected by "data drift" and "concept drift". ML Operations (MLOps) and Continuous ML (CML) are sets of practices that aim to deploy and maintain machine learning models in production reliably and efficiently. [Operations: MLOps, Continuous ML, & AutoML]


- Definition

"Model fitting is a measure of how well [optimize] a machine learning model generalizes to similar [evaluation] data to that on which it was trained. A well-fitted model [optimal-fitted] produces more accurate outcomes ("Precisely Right"). A model that is overfitted matches the data too closely. A model that is under-fitted does not match closely enough." [3]

"In machine learning, overfitting occurs when a learning model customizes itself too much to describe the relationship between training data and the labels. Overfitting tends to make the model very complex by having too many parameters. By doing this, it loses its generalization power, which leads to poor performance on new [evaluation] data." [4]

"Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y)." [5]

Machine learning model complexity refers to the capacity of a model to fit complex patterns in the data. A more complex model can capture intricate relationships, but it also runs the risk of overfitting, which occurs when the model learns the training data too well and performs poorly on new, unseen data. Here are some factors that contribute to model complexity:

  • Number of parameters: Models with more parameters (e.g., weights and biases in neural networks) are generally more complex.
  • Depth of the model: Deeper models, such as deep neural networks, can learn more complex patterns.
  • Non-linearity: Models that use non-linear activation functions can capture more complex relationships.
  • Regularization: Regularization techniques, such as L1 or L2 regularization, can control model complexity by penalizing large weights.
  • Complexity metrics: A common metric, used in L2 regularization, measures complexity as the sum of the squares of the model's weights. It penalizes large weights, which can lead to overfitting (see the sketch after this list).
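As an illustrative sketch of that last point (toy data and scikit-learn's Ridge, not the article's own code), the L2 complexity of a fitted linear model can be read off as the sum of its squared weights; a larger regularization rate shrinks it:

```python
# Minimal sketch (assumed toy data): model complexity measured as the L2 penalty term,
# i.e., the sum of the squares of the fitted model's weights.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.1, size=200)

for alpha in (0.01, 1.0, 100.0):                    # regularization strength
    model = Ridge(alpha=alpha).fit(X, y)
    l2_complexity = np.sum(model.coef_ ** 2)        # sum of squared weights
    print(f"alpha={alpha:6.2f}  sum of squared weights={l2_complexity:.3f}")
```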

In essence, a more complex model is capable of learning more intricate patterns, but it also carries a higher risk of overfitting. Striking the right balance between model complexity and generalization ability is a key challenge in machine learning.


- Root Causes

Model fit depends on solving the issue and balancing the trade-off between bias and variance.

"Understanding model fit is important for understanding the root cause for poor model accuracy. This understanding will guide you to take corrective steps. We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training and evaluation data." [6]

Variance is the degree of spread in a data set which indicates how far a set of data points are spread out from their mean [average] value. The variance of an estimated function indicates how much the function is capable of adjusting to the change in a data set. High variance results in overfitting leading to an imprecise [not reliable] model. It can be caused by having too many features, building a more complex model than necessary, or capturing a high noise level. Generally, high variance models tune themselves and are more robust to a changing data set, but they are more complex and overly flexible.

Bias is the difference between the estimated value and the true value of the parameter being evaluated. High bias results in underfitting leading to an inaccurate [not valid] ("Generally Wrong") model. It can be caused by training on a small data set, building a simple model to capture complex patterns, or not taking into account all the features given for training which causes learning incorrect relations. Generally, high-bias models learn faster and are easy to understand, but they are less flexible. [7]

Cognitive biases are systematic patterns of deviation from norm or rationality in judgment. They are often studied in psychology, sociology and behavioral economics. Although the reality of most of these biases is confirmed by reproducible research, there are often controversies about how to classify these biases or how to explain them.

Cognitive Biases. Table: Justin Wright

Biases have a variety of forms and appear as cognitive ("cold") bias, such as mental noise, or motivational ("hot") bias, such as when beliefs are distorted by wishful thinking. Both effects can be present at the same time. There are also controversies over some of these biases as to whether they count as useless or irrational, or whether they result in useful attitudes or behavior. For example, when getting to know others, people tend to ask leading questions which seem biased towards confirming their assumptions about the person.

“A major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.”

Statistical bias is a systematic tendency that causes differences between results and facts. Statistical bias may be introduced at all stages of data analysis: data selection, hypothesis testing, estimator selection, analysis methods, and interpretation.

Statistical Bias Sources from Stages of Data Analysis. Diagram: Visual Science Informatics, LLC

Systematic error (bias) introduces noisy data with high bias but low variance. Although measurements are inaccurate (not valid), they are precise (reliable). Repeatable systematic error is associated with faulty equipment or a flawed experimental design and influences a measurement's accuracy ("Precisely Wrong").

Errors in Health Research. Chart: Unknown Author

Reproducibility (random) error (variance) introduces noisy data with low bias but high variance. Although measurements are accurate (valid), they are imprecise (not reliable). This random error is due to the measurement process and primarily influences a measurement's precision. Reproducibility refers to the variation in measurements made on a subject under changing conditions ("Generally Right").


Bias-Variance Trade-off. Graphs: Ivan Reznikov, PhD

Underfitting, Optimal-fitting, and Overfitting in Machine Learning. Graphs adapted from Scott Fortmann-Roe [8], Abhishek Shrivastava [9], and Andrew Ng [10]

Essentially, data quality, bias (systematic error), and variance (reproducibility random error) factors affect your ML model accuracy.


- Trade-Off

“The expected test error of an ML model can be decomposed into its bias and variance through the following formula:

Total Error = Bias² + Variance + Irreducible Error

So, to decrease the estimation error [to improve accuracy], you need to decrease both the bias and variance, which in general are inversely proportional and hence the trade-off." [11]

The bias-variance trade-off needs to be balanced to address any differences in accuracy. But increasing bias (though not always) reduces variance, and vice versa.
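A hedged, simulation-style sketch of this decomposition (synthetic sine data and NumPy polynomial fits, purely illustrative): refitting models of different complexity on many resampled training sets lets you estimate bias² and variance directly and watch the trade-off.

```python
# Illustrative sketch: empirically estimating bias^2 and variance by refitting
# polynomial models of different degrees on many independent training samples.
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(42)
x_test = np.linspace(0, 1, 50)

for degree in (1, 4, 15):                      # under-, reasonably-, over-parameterized
    preds = []
    for _ in range(200):                       # many independent training sets
        x_tr = rng.uniform(0, 1, 30)
        y_tr = true_f(x_tr) + rng.normal(scale=0.3, size=30)
        coefs = np.polyfit(x_tr, y_tr, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```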


- Classification Evaluation Metrics & Confusion Matrix

Once you fit your ML model, you must evaluate its performance on a test dataset.

Evaluating your model performance is critical, as your model performance allows you to choose between candidate models and to communicate how reasonable the model is at solving the problem.

Measuring, for instance, a binary output prediction (Classification) is captured in a specific table layout - a Confusion Matrix, which visualizes whether a model is confusing two classes. "An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. There are four possible outcomes for each output from a binary classifier." [Google] Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. Four measures are captured: True Positive, False Negative, False Positive, and True Negative.

Calculating accuracy is derived from the four values in a confusion matrix. The accuracy of diagnostic tests is the proportion of subjects who give the correct result. Additional metrics with formulas on the right and below are Classification Evaluation Metrics. These metrics include but are not limited to the following: Sensitivity, Specificity, Accuracy, Negative Predictive Value, and Precision.
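A minimal sketch with scikit-learn (made-up labels, not the article's data) showing how the four confusion-matrix cells yield these summary statistics:

```python
# Hedged sketch: deriving accuracy, sensitivity, specificity, precision, and
# negative predictive value from the confusion matrix of toy predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]        # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]        # model predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)                   # recall, true positive rate
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)                   # positive predictive value
npv         = tn / (tn + fn)                   # negative predictive value
print(accuracy, sensitivity, specificity, precision, npv)
```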

Confusion Matrix for Model Evaluation and Formulas for Calculating Summary Statistics. Table: Rowland Pettit, et al.

"The False Positive Rate (FPR) is the proportion of all actual negatives that were classified incorrectly as positives, also known as the probability of false alarm. It is mathematically defined as:

FPR = FP / (FP + TN)

False positives are actual negatives that were misclassified, which is why they appear in the denominator. A perfect model would have zero false positives and therefore a FPR of 0.0, which is to say, a 0% false alarm rate. In an imbalanced dataset where the number of actual negatives is very, very low, say 1-2 examples in total, FPR is less meaningful and less useful as a metric." [Google]


- Type I, Type II, and Type III Errors

In statistics, particularly in hypothesis testing, there are three main types of errors that can occur:

Type I Error

A Type I error occurs when you reject a null hypothesis that is actually true. This is often referred to as a "False Positive (FP)." For example, if a medical test incorrectly indicates that a person has a disease when they actually do not, that is a Type I error.

Type II Error

A Type II error occurs when you fail to reject a null hypothesis that is actually false. This is often referred to as a "False Negative (FN)." For example, if a medical test incorrectly indicates that a person does not have a disease when they actually do, that is a Type II error.

Type III Error

A Type III error occurs when you correctly reject the null hypothesis but draw the wrong conclusion about the alternative hypothesis. This is often less discussed but can be as problematic as Type I and Type II errors. For example, if you correctly identify that a new drug is effective, but mistakenly conclude that it is more effective than an existing drug, that is a Type III error.

Note: The balance between Type I and Type II errors is often considered in hypothesis testing. A higher significance level (alpha) increases the chance of a Type I error but decreases the chance of a Type II error. Conversely, a lower significance level decreases the chance of a Type I error but increases the chance of a Type II error. Type I and Type II errors can be caused by random sampling, but they can also be caused by bias. The likelihood of these errors can be reduced by increasing the sample size.

Type I, Type II, and Type III Errors. Table: Gemini


Confusion Matrix Heatmap. Heatmap: KNIME, AG

"In a binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. The probability score is not reality, or ground truth.

All of the metrics in this section are calculated at a single fixed threshold, and change when the threshold changes. Very often, the user tunes the threshold to optimize one of these metrics.

Note that the classification threshold is a value that a human chooses, not a value chosen by model training. Which evaluation metrics are most meaningful depends on the specific model and the specific task, the cost of different misclassifications, and whether the dataset is balanced or imbalanced." [Google]
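A small, hedged illustration (toy scores, illustrative threshold values) of how a human-chosen threshold converts probability scores into class predictions, and how the metrics move with it:

```python
# Illustrative sketch: applying different classification thresholds to predicted
# probabilities; precision and recall change as the threshold changes.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.55, 0.45])   # model scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred, zero_division=0),
          recall_score(y_true, y_pred))
```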

Choice of Metric & Tradeoffs. Table: Google


In addition to accuracy, there are numerous model evaluation metrics. Three metrics that are commonly reported for a model on a binary classification problem are:

  • Precision (positive predictive value)
  • Recall ("probability of detection")
  • F1 score ("roll-up")

Precision quantifies the number of positive class predictions that belong to the positive class. Recall quantifies the number of positive class predictions made out of all positive examples in the dataset. Precision and recall often show an inverse relationship, where improving one of them worsens the other. "The metrics form a hierarchy that starts by counting the true/false negatives/positives, at the bottom, continues by calculating the Precision and Recall (Sensitivity) metrics, and builds up by combining them to calculate the F1 score." [12]

Hierarchy of Metrics from Labeled Training Data and Classifier Predictions to F1 score. Diagram adapted: Teemu Kanstrén

Precision = TP / (TP + FP)

Recall (Sensitivity) = TP / (TP + FN)

The F1 score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two finer-grained classifiers.

F1 score = 2 × (Precision × Recall) / (Precision + Recall)
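The same hierarchy in code, as a minimal scikit-learn sketch on made-up labels; the reported F1 score equals the harmonic mean of precision and recall.

```python
# Hedged sketch: precision, recall, and F1 score on toy labels.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
assert abs(f1 - 2 * p * r / (p + r)) < 1e-12     # F1 is the harmonic mean of p and r
print(f"precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```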

Why F1 score?

  • Balances precision and recall: Provides a single metric that considers both the ability to correctly identify positive instances and the ability to avoid false positives. It gives more weight to the smaller value, ensuring that both precision and recall are considered equally important for the overall evaluation.
  • Useful for imbalanced datasets: Can be particularly informative when dealing with datasets where the number of positive and negative instances is significantly different.

Interpretation

  • F1 score of 0: Indicates that precision or recall (or both) is zero, which means the model made no correct positive predictions.
  • Lower F1 score: Suggests that either precision or recall is low, or both.
  • Higher F1 score: Indicates better overall performance, with a good balance of precision and recall.
  • F1 score of 1: Indicates perfect precision and recall, meaning the model correctly predicted all positive and negative instances.

When to Use F1 score

  • Imbalanced datasets: When the number of instances in different classes is significantly different.
  • Tasks where both precision and recall are important: For example, in medical diagnosis or information retrieval.

Matthews Correlation Coefficient (MCC)

MCC (Phi Coefficient) is a metric used to evaluate the performance of classification models for binary problems (two classes). It takes into account true positives, true negatives, false positives, and false negatives from the confusion matrix. MCC is a valuable tool for evaluating binary classification models, particularly when:

  • The dataset has imbalanced classes.
  • A single metric that accounts for both correct and incorrect predictions is needed.

  • Purpose: Measures the quality of binary classifications by considering both correct and incorrect predictions.
  • Advantage: Provides a balanced view compared to metrics such as accuracy, which can be misleading in imbalanced datasets.
  • Formula: MCC is calculated using the following formula:

MCC = (TP × TN - FP × FN) / √((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))

  • Interpretation: MCC ranges from -1 to +1, where:

-1: Complete disagreement between prediction and reality

-1 to 0: Indicate worse than random classification

0: Random classification (no better than chance)

0 to +1: Indicate different levels of good performance

+1: Perfect classification

Advantages of MCC:

  • Balanced: Considers both positive and negative predictions.
  • Robust: Less sensitive to imbalanced class distributions in the data.
  • Single Score: Provides a single value to summarize classification performance.

Disadvantages of MCC:

  • Less intuitive: Interpretation of the exact value might be less straightforward compared to accuracy.
  • Threshold Dependence: Defining a specific threshold for "good" performance can be subjective.

By understanding MCC, you can gain a more comprehensive picture of your classification model's performance in binary problems.
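A brief, hedged sketch (toy imbalanced labels) of why MCC is less forgiving than accuracy: a model that always predicts the majority class scores 95% accuracy but an MCC of 0.

```python
# Illustrative sketch: MCC vs. accuracy on a heavily imbalanced toy dataset.
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [0] * 95 + [1] * 5          # 95 negatives, 5 positives
y_pred = [0] * 100                   # always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))    # looks good: 0.95
print("MCC     :", matthews_corrcoef(y_true, y_pred)) # 0.0 - no better than chance
```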

Classification Evaluation Metrics. Table: Gemini

- Regression Evaluation Metrics

When evaluating the performance of a regression model, several metrics are commonly used. These metrics provide insights into how well the model fits the data and how accurate its predictions are.

R-squared (R2)

  • Interpretation: Measures the proportion of variance in the dependent variable that is explained by the independent variables.
  • Range: 0 to 1
  • Higher is better: A higher R2 indicates a better fit, with 1 being a perfect fit.
  • Caveat: R2 can be misleading in certain cases, especially when the number of independent variables is large relative to the sample size.

Adjusted R-squared

  • Interpretation: Similar to R2, but adjusts for the number of independent variables in the model.
  • Range: 0 to 1
  • Higher is better: A higher adjusted R2 indicates a better fit, considering the number of predictors.
  • Benefit: Helps to avoid overfitting by penalizing models with too many unnecessary predictors.

Mean Squared Error (MSE) - Average L2 Loss

  • Interpretation: Measures the average squared difference between the predicted values and the actual values.
  • Range: 0 to infinity
  • Lower is better: A lower MSE indicates a better fit.
  • Benefit: Suitable when you want to penalize large errors more heavily (e.g., in financial forecasting where large errors can have significant consequences).
  • Formula: MSE = (1/n) ∑(yᵢ - ŷᵢ)²

Root Mean Squared Error (RMSE)

  • Interpretation: The square root of the MSE.
  • Range: 0 to infinity
  • Lower is better: A lower RMSE indicates a better fit.
  • Benefit: Provides a more interpretable metric as it is in the same units as the dependent variable.

Mean Absolute Error (MAE) - Average L1 Loss

  • Interpretation: Measures the average absolute difference between the predicted values and the actual values.
  • Range: 0 to infinity
  • Lower is better: A lower MAE indicates a better fit.
  • Benefit: Less sensitive to outliers compared to MSE. Suitable when you want to treat all errors equally (e.g., in scenarios where outliers can significantly affect the results).

Mean Squared Prediction Error (MSPE)

  • Interpretation: Similar to MSE, but calculated using a holdout dataset to assess the model's generalization performance.
  • Range: 0 to infinity
  • Lower is better: A lower MSPE indicates better predictive accuracy.

Regression Evaluation Metrics. Table: Gemini
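As a quick, hedged check of these definitions (toy values, not the article's data), scikit-learn provides each metric directly:

```python
# Minimal sketch: common regression metrics on toy predictions.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5,  0.0, 2.1, 7.8, 3.9])

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                               # same units as the target
mae  = mean_absolute_error(y_true, y_pred)
r2   = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```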

Choosing the Right Metric

The choice of metric depends on the specific context and goals of the regression analysis. For example, if outliers are a concern, MAE might be preferred over MSE. If interpretability is important, RMSE can be useful. Ultimately, a combination of metrics can provide a more comprehensive understanding of the model's performance.

- Unsupervised Evaluation Metrics

When dealing with unsupervised learning tasks, where ground truth labels are not available, different metrics are used to assess the quality of the clustering results. Here are four commonly employed metrics:

Rand Index (RI)

  • Purpose: Measures the similarity between two clusterings.
  • Calculation: Calculates the proportion of pairs of data points that are either correctly grouped together or correctly separated in both clusterings.
  • Range: 0 to 1
  • Higher is better: A higher RI indicates better agreement between the two clusterings.

Adjusted Rand Index (ARI)

  • Purpose: Similar to RI but corrects for chance agreement.
  • Calculation: Adjusts the RI based on the expected number of agreements by chance.
  • Range: -1 to 1
  • Higher is better: A higher ARI indicates better agreement, considering chance.

Mutual Information (MI)

  • Purpose: Measures the dependence between two clusterings.
  • Calculation: Quantifies the information shared between the two clusterings.
  • Range: 0 to infinity
  • Higher is better: A higher MI indicates a stronger dependence between the two clusterings.

Normalized Mutual Information (NMI)

  • Purpose: A normalized version of MI for better interpretability.
  • Calculation: Divides MI by the geometric mean of the entropies of the individual clusterings.
  • Range: 0 to 1
  • Higher is better: A higher NMI indicates stronger dependence, normalized to a range of 0 to 1.

Unsupervised Evaluation Metrics. Table: Gemini
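A hedged sketch (assuming a recent scikit-learn, which exposes rand_score) comparing two toy clusterings with all four measures:

```python
# Illustrative sketch: RI, ARI, MI, and NMI between two labelings of the same points.
from sklearn.metrics import (rand_score, adjusted_rand_score,
                             mutual_info_score, normalized_mutual_info_score)

labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]    # one clustering (or reference labels)
labels_b = [0, 0, 1, 1, 1, 1, 2, 2, 0]    # another clustering of the same points

print("RI :", rand_score(labels_a, labels_b))
print("ARI:", adjusted_rand_score(labels_a, labels_b))
print("MI :", mutual_info_score(labels_a, labels_b))
print("NMI:", normalized_mutual_info_score(labels_a, labels_b))
```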

Note: While these metrics provide useful insights, it is important to consider their limitations. For instance, the Rand Index can be insensitive to certain types of clustering errors, and Mutual Information can be influenced by the number of clusters. In practice, a combination of metrics may be used to get a more comprehensive evaluation.

- Other Evaluation Metrics

Cross-Validation Errors (CV Errors)

  • Purpose: To assess the generalization performance of a model by training and testing it on different subsets of the data.
  • Process: Divide the dataset into multiple folds. Train the model on all but one fold and evaluate it on the remaining fold. Repeat this process for each fold. Calculate the average error across all folds.
  • Advantages: Helps to prevent overfitting by evaluating the model's performance on unseen data.
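A minimal sketch of this process with scikit-learn (the iris toy dataset and logistic regression chosen only for illustration):

```python
# Hedged sketch: 5-fold cross-validation accuracy with cross_val_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)     # accuracy on each of 5 folds
print("fold accuracies :", scores)
print("mean CV accuracy:", scores.mean())
```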

Heuristic Methods to Find K (in Clustering)

  • Purpose: To determine the optimal number of clusters in unsupervised learning.
  • Common methods:

Elbow Method: Plots the explained variance ratio against the number of clusters. The "elbow" point in the plot often indicates the optimal number of clusters.

Silhouette Coefficient: Measures how similar a data point is to its own cluster compared to other clusters. The optimal number of clusters is often the one that maximizes the average silhouette coefficient. Range is -1 to 1; higher is better.

Gap Statistic: Compares the within-cluster dispersion to the expected dispersion under a null reference distribution. The optimal number of clusters is the one that maximizes the gap statistic.
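A hedged sketch of the elbow and silhouette heuristics follows (synthetic blobs, k-means); in practice you would plot these values and look for the elbow or the silhouette maximum.

```python
# Illustrative sketch: scanning k for k-means using inertia (elbow method) and the
# average silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```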

BLEU Score (BiLingual Evaluation Understudy)

  • Purpose: To evaluate the quality of machine-translated text.
  • Calculation: Measures the n-gram precision between the machine-translated text and a set of reference translations. Range is 0 to 1; Higher is better.
  • Advantages: A widely used metric in natural language processing for evaluating machine translation systems.
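A hedged sketch, assuming the NLTK package, of a sentence-level BLEU score for a single hypothesis against one reference (smoothing avoids zero n-gram counts on short sentences):

```python
# Illustrative sketch: sentence-level BLEU with NLTK on toy tokenized sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate  = ["the", "cat", "sat", "on", "the", "mat"]    # machine-translated tokens

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```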

These are just a few examples of other evaluation metrics that can be used in different contexts. The choice of metric depends on the specific task and the goals of the evaluation.

- Estimating Uncertainty in ML

Uncertainty in machine learning refers to the lack of confidence or certainty in a model's predictions. It is a crucial aspect of understanding a model's limitations and ensuring its reliability, especially in critical applications.

Uncertainty in Machine Learning. Table: Gemini

Uncertainty is an inherent aspect of machine learning models. It arises due to various factors, including noise in data, model complexity, and the inherent randomness of real-world phenomena. Understanding and quantifying uncertainty is crucial for building reliable and trustworthy ML models.

By understanding and addressing uncertainty in machine learning, you can build more reliable, trustworthy, and effective ML models.

- Estimating Future Accuracy Performance

Holdout method, cross-validation, and bootstrap sampling are techniques used in statistics and ML to evaluate the accuracy performance of models. They achieve this goal by resampling the data in different ways.

Comparison of Holdout Validation, k-fold Cross-validation, and Bootstrap Sampling. Diagrams: Vikas More

Holdout Method

This is a simple approach where the data is split into two sets:

  • Training Set: The larger portion used to train the model.
  • Test Set: The remaining data used to evaluate the model's accuracy performance on unseen data.

Commonly, the split is 80% for training and 20% for testing. The advantage of this method is its simplicity. However, the accuracy performance estimate can be sensitive to how the data is split. A single random split might not be representative of the entire dataset.
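A minimal sketch of the holdout method with scikit-learn (breast-cancer toy dataset, 80/20 stratified split, illustrative pipeline):

```python
# Hedged sketch: holdout evaluation with an 80/20 train/test split.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```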

Cross-Validation

This is a more robust approach compared to the holdout method. It involves splitting the data into multiple folds (usually k folds) and iterating through the following steps:

  1. One fold is used as the test set, and the remaining k-1 folds are combined for training.
  2. The model is trained and evaluated on the test fold.
  3. This process is repeated for each fold, ensuring all data points are used for both training and testing.

Common variations include k-fold cross-validation (where k is a chosen number of folds) and leave-one-out cross-validation (where k is equal to the number of data points). This method provides a more reliable estimate of the model's generalizability.

Bootstrap Sampling

This method involves creating new datasets, called bootstrap replicates, by sampling with replacement from the original data. This means a data point can be chosen multiple times in a single replicate, and some points might not be included at all. The replicates are then used to train the model, and the variability in the model's accuracy performance across these replicates is used to estimate the model's generalizability and uncertainty.

Bootstrap sampling is particularly useful for smaller datasets where holdout methods might not be reliable. It is also used to estimate the distribution of statistics, not just model accuracy performance.
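A hedged sketch of bootstrap sampling for accuracy estimation (decision tree on a toy dataset; rows left out of each replicate serve as its evaluation set):

```python
# Illustrative sketch: bootstrap replicates to estimate the variability of accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
scores = []
for seed in range(100):
    idx = resample(np.arange(len(y)), replace=True, random_state=seed)  # bootstrap replicate
    oob = np.setdiff1d(np.arange(len(y)), idx)                          # out-of-bag rows
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

print("mean accuracy:", np.mean(scores))
print("95% interval :", np.percentile(scores, [2.5, 97.5]))
```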

Here is a table summarizing the key differences:

Estimating Future Accuracy Performance Techniques. Table: Gemini


- Evaluation (Validation & Testing) in Traditional ML Workflow

Building a Machine Learning Model. Diagram: Chanin Nantasenamat

The traditional machine learning workflow is a structured process for developing and deploying machine learning models. It can be broken down into several key stages:

1. Data Preparation

  • Data Collection: Gather the data relevant to your machine learning problem. This could involve collecting data from internal databases, external sources, or even scraping data from the web.
  • Data Cleaning: Clean and format the data to ensure consistency and remove errors or missing values, employing a consistent data imputation policy.
  • Feature Engineering: Select, extract, transform, encode, bin (bucket), scrub, normalize, and scale features, or create new synthetic features or feature crosses from existing ones that might be more informative for the model.

2. Modeling

  • Model Selection: Choose an appropriate machine learning algorithm based on the type of problem you're trying to solve (classification, regression, etc.).
  • Hyperparameter Tuning: Adjust the model's hyperparameters (e.g., learning rate, batch size, epochs) by finding the combination of hyperparameter values (settings that control the training process) that minimizes or maximizes a function measuring the quality of the model's predictions (see the sketch after this list).
  • Training the Model: Train the model on a portion of your prepared data (training set) to learn the patterns and relationships within the data.
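As a sketch of the hyperparameter-tuning step above (an illustrative SVM grid on the iris toy dataset, not a prescription):

```python
# Hedged sketch: exhaustive grid search with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}

search = GridSearchCV(SVC(), param_grid, cv=5)    # evaluates every combination
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best CV accuracy    :", search.best_score_)
```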

3. Evaluation

  • Model Assessment: Evaluate the model's performance on a separate portion of the data (test set) that it hasn't seen before. This helps assess how well the model generalizes to unseen data.
  • Refinement (Optional): If the model performance is not satisfactory, you may need to go back and refine the data preparation steps, choose a different model, reduce the model's complexity, or optimize the set of hyperparameters that minimizes a loss function on a given data set.

The traditional workflow above focuses on the core development stages of building and evaluating a machine learning model.

Traditional ML workflows primarily focus on model development and evaluation. However, the dynamic nature of data, characterized by data drift and concept drift, significantly impacts model accuracy over time. To address these challenges and maintain model reliability and efficiency in production, organizations adopt MLOps and Continuous ML (CML) practices. These methodologies encompass a comprehensive approach to deploying and maintaining ML models, including continuous monitoring, retraining, and redeployment, model versioning, experimentation, and robust collaboration between teams. [Operations: MLOps, Continuous ML, & AutoML]

Cross-validation models and trained models are two key concepts in machine learning, each serving a distinct purpose in the development and evaluation of predictive models. A cross-validation model assesses a model's performance on unseen data and helps prevent overfitting. Averaging the performance metrics across all iterations provides a more robust estimate of the model's generalization ability. After selecting the best model based on cross-validation, you train it on the entire training dataset and deploy it to make predictions on new, unseen data based on the patterns learned from the training data.

Selecting the optimal estimator for a ML problem can be challenging due to the multitude of options available. Effectively navigating this process requires a deep understanding of the data, including its distribution, outliers, and missing values. By considering model bias, variance, and interpretability needs, practitioners can make informed decisions. Cross-validation and hyperparameter tuning are essential for building robust models. Additionally, feature engineering and domain knowledge play crucial roles in enhancing model performance. Ultimately, the choice of estimator is an iterative process that involves experimentation and evaluation.

Choosing the Right Estimator. Diagram: scikit-learn

The importance and interpretation of evaluation metrics depend on the domain and context of your ML model. For instance, medical tests are evaluated by specificity and sensitivity, while information retrieval systems are evaluated by precision and recall. Understanding the differences between precision and recall vs. specificity and sensitivity is significant in your model evaluation within a specific domain. [13]

Bias vs. Variance of ML Algorithms. Chart: Ega Skura

For ML model builders, understanding how accuracy is affected by model fitting is essential. An accurate classification model can correctly distinguish positives from negatives.


- Dataflow in a Traditional ML Workflow

Dataflow in a Traditional ML Workflow. Diagram: Visual Science Informatics

- Remedies

Overfitting and underfitting are common challenges in machine learning, where a model performs poorly on unseen data. There are effective techniques for solving the issues of overfitting and underfitting and building an optimal-fitting ML model. Here is a breakdown of both issues and how to address them:

ABC of Data Science

ML is a form of Artificial Intelligence (AI), which makes predictions and decisions from data. It is the result of training algorithms and statistical models to analyze and draw inferences from patterns in data, which are able to learn and adapt without following explicit instructions. However, you need to:

  1. Check - double check your Assumptions,
  2. Mitigate - make sure you mitigate your Biases, and
  3. Validate - take your time to validate your Constraints.

The Assumptions, Biases, and Constraints (ABC) of data science, Data, and Models of ML can be captured in this formula:

Machine Learning = {Assumptions/Biases/Constraints, Data, Models}

Diagnosing ML Model "Goodness-of-Fit" using Learning Curves

  • Statistical Goodness-of-Fit

“The term goodness-of-fit refers to a statistical test that determines how well sample data fits a distribution from a population with a normal distribution. Put simply, it hypothesizes whether a sample is skewed or represents the data you would expect to find in the actual population." [Investopedia]

  • ML Model Goodness-of-Fit

Goodness-of-fit in machine learning assesses how closely the model's predictions align with the actual observed data points. A good-fitting model accurately captures the underlying patterns and relationships within the data, leading to reliable predictions and insights. By selecting appropriate goodness-of-fit metrics, you can effectively evaluate the performance of your machine learning models and make informed decisions.

Diagnosing ML Model "Goodness-of-Fit" using Learning Curves. Visual Science Informatics [14] [15] [16]
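A hedged sketch of producing such learning curves with scikit-learn (digits toy dataset, logistic regression for illustration): a persistent gap between training and validation scores points to overfitting, while two low, converging scores point to underfitting.

```python
# Illustrative sketch: learning curves for diagnosing goodness-of-fit.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```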

Generalization Conditions

"Training a model that generalizes well implies the following dataset conditions:

  • Examples must be independently and identically distributed (i.e., examples cannot influence each other).
  • The dataset is stationary, meaning the dataset does not change significantly over time.
  • The dataset partitions have the same distribution. That is, the examples in the training set are statistically similar to the examples in the validation set, test set, and production (real-world) data. Shuffle the examples in the dataset extensively before partitioning them." [Google]


Dimension Reduction Techniques

Dimension reduction is a technique used to simplify data by reducing the number of variables or features while preserving as much information as possible. This can be especially beneficial when dealing with high-dimensional datasets, which can be computationally expensive to analyze.

Why is dimension reduction important?

  • Improved Model Accuracy: In some cases, dimension reduction can lead to better model accuracy and performance.
  • Noise Reduction: Can help remove irrelevant or noisy features.
  • Computational Efficiency: Reduces the time and resources required for data processing and analysis.
  • Visualization: Makes it easier to visualize data in lower dimensions.

Dimension Reduction Techniques. Diagram: Javapoint

1. Factor Analysis

- Assumes that observed variables are linear combinations of a smaller set of latent variables.

- Identifies underlying constructs or patterns in the data.

2. Principal Component Analysis (PCA)

- Finds a new set of uncorrelated variables (principal components) that explain the most variance in the data.

- Commonly used for linear dimensionality reduction.

- Often used for exploratory data analysis and visualization.

- Best suited for: Linear relationships, large datasets, and when the goal is to reduce dimensionality while preserving most of the variance.

- Example: Analyzing gene expression data to identify patterns or biomarkers.

3. Independent Component Analysis (ICA)

- Separates a multivariate signal into a set of statistically independent components. It assumes that the observed data is a linear mixture of underlying independent sources.

- Sensitive to the choice of independence criterion. May not be effective if the sources are highly correlated.

- Best fit for: handling non-Gaussian sources. Does not require prior knowledge of the mixing matrix. Can be computationally efficient.

- Examples: Blind Source Separation (BSS) - Separating mixed signals, such as audio signals or EEG data. Feature extraction - Extracting meaningful features from high-dimensional data. Medical imaging - Analyzing brain signals and medical images. Financial analysis - Identifying hidden factors in financial data.

4. Isometric Mapping (ISOMAP)

- Captures nonlinear relationships in data that aims to preserve the geodesic distances between data points in a high-dimensional space.

- Can be computationally expensive for large datasets and sensitive to the choice of neighborhood parameters.

- Best suited for: cases where the data points lie on a nonlinear manifold and preserving the global structure of the manifold is important.

- Examples: Data visualization - Understanding the underlying structure of high-dimensional data. Image and video analysis - Extracting meaningful features from visual data. Machine learning - As a preprocessing step to improve the performance of algorithms.

5. t-Distributed Stochastic Neighbor Embedding (t-SNE)

- Preserves local structure in the data and is useful for visualizing high-dimensional data in 2D or 3D.

- Best suited for: Visualizing high-dimensional data in 2D or 3D, preserving local structure, and understanding non-linear relationships.

- Example: Visualizing word embeddings to understand semantic relationships between words.

6. Uniform Manifold Approximation and Projection (UMAP)

- Reduces non-linear dimensionality and preserves local structure.

- Best suited for: Non-linear relationships, preserving global and local structure, and handling complex data structures.

- Example: Analyzing time series data to identify patterns or anomalies.


Note: PCA focuses on global structure and variance, while t-SNE and UMAP prioritize preserving local structure, making them better suited for visualization and exploring complex relationships. In essence, PCA reduces the number of features while retaining most of the information, whereas t-SNE and UMAP create a new, lower-dimensional representation of the data that emphasizes relationships between data points.
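A brief, hedged sketch contrasting the two families on the 64-dimensional digits toy dataset (projections only; plotting is left out):

```python
# Illustrative sketch: linear (PCA) vs. non-linear (t-SNE) reduction to 2D.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                  # 64-dimensional features

X_pca = PCA(n_components=2).fit_transform(X)         # preserves global variance
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # preserves local structure

print("PCA projection  :", X_pca.shape)
print("t-SNE projection:", X_tsne.shape)
```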

Choosing the Right Technique

Before applying any dimensionality reduction technique, it is crucial to understand the characteristics of your data. The best technique for a given dataset depends on factors such as:

- Goals of the analysis (e.g., visualization, classification, feature extraction)

- Nature of the data (e.g., numerical, categorical)

- Number of dimensions (e.g., features, variables)

- Combining multiple variables into a feature (e.g., aggregation, interaction, bucketing)

- Transforming raw data into features, which are more informative and relevant, via transformation techniques (e.g., transformation, encoding)

- Distribution of features (e.g., normal, skewed)

- Relationships between features (e.g., correlations, dependencies, causality)

- The computational resources available

Breakdown of the decision tree

1. Goal

- Visualization: t-SNE or UMAP are often preferred for their ability to preserve local structure.

- Feature extraction: PCA can be used to extract the most important features.

- Classification or regression: Consider techniques such as LDA or autoencoders.

2. Data Type and Dimensionality

- High-dimensional numerical data: Consider PCA, t-SNE, or UMAP.

- Low-dimensional numerical data: PCA might be sufficient.

- Categorical data: Consider techniques such as correspondence analysis or multidimensional scaling.

3. Linearity

- Linear relationships: PCA is a good choice.

- Non-linear relationships: t-SNE or UMAP are more suitable.

4. Computational Resources

- Large datasets: PCA might be computationally more efficient than t-SNE or UMAP.

Additional Considerations

- Domain knowledge: Incorporate your understanding of the data and the problem domain to make informed decisions.

- Experimentation: Try different techniques and evaluate their performance using appropriate metrics.

- Hyperparameter tuning: Fine-tune the parameters of each technique to optimize results.

Remember that this is a general guide, and the best choice often depends on specific data characteristics and goals. It is always a good practice to experiment with different techniques and evaluate their performance to find the most suitable one for your particular problem.


- Underfitting Example

Underfitting example. Python code: Skbkekas. Graphs: Visual Science Informatics

"A training set (left) and a test set (right) from the same statistical population are shown as blue points. Two predictive models are fit to the training set. Both fitted models are plotted with both the training and test sets. In the training set, the MSE of the fit shown in orange is about 10 whereas the MSE for the fit shown in green is about 8. In the test set, the MSE for the fit shown in orange is about 14 and the MSE for the fit shown in green is about 10. The orange curve severely underfits the training set, since its MSE increases by almost a factor of 4 when comparing the test set to the training set. The green curve underfits the training set much less, as its MSE increases by less than a factor of 2."

- Remedies for Underfitting

In underfitting, the model is too simple and fails to capture the underlying patterns in your data, leading to poor performance on both training and unseen data. Underfitting is the opposite of overfitting.

Data

  • Reduce noise in data: Focus on the significant patterns in your data to avoid confusing the model. Check for a batch that might contain a lot of outliers due to improper shuffling of batches or clipping. Also, the training dataset might contain repetitive sequences of examples. Verify that you are shuffling examples sufficiently.
  • Improve data quality: Check whether the input data contains one or more NaNs—for example, a value caused by a division by zero. Try different data imputation policies. Increase data variety. Good data quality is a necessary prerequisite to building an accurate ML model. [Data Science Approaches to Data Quality: From Raw Data to Datasets]

Model Complexity

  • Increase model complexity: If your model is very basic (e.g., linear regression), try using a more complex model architecture. If the current model architecture is not suitable for the problem at hand (e.g., linear model for nonlinear data), consider changing to a more appropriate model type that can better capture the data's complexity (e.g., switching from a linear model to a decision tree or neural network). If the model is too simple and underfits the data, increasing the complexity might help. This can involve adding more layers to neural networks, increasing the number of neurons, or using a more complex model architecture.
  • Add more features: Include relevant real or synthetic features or feature crosses that might help the model learn the patterns better. Improve the quality of input features by performing feature engineering. This might involve creating new features from existing ones or selecting more relevant features that can better represent the problem. Sometimes, underfitting occurs because the model lacks sufficient information to capture the underlying patterns in the data. Adding more informative features can help the model learn better.
  • Use ensemble methods: For combating underfitting, ensemble methods can combine multiple weak models to create a stronger model that performs better on the training data. Combine models via the Boosting ensemble method to reduce the bias.

Training

  • Increase training time: Allow the model more time to learn the complexities of your data. In some cases, underfitting might be due to insufficient training time or too few iterations. Allowing the model to train for longer, tuning the learning rate, or increasing the number of training epochs can sometimes improve performance. Or, utilize "more advanced optimizers such as Adagrad and Adam protect against this problem by changing the effective learning rate over time." [Google]
  • Reduce regularization rate: Regularization can sometimes prevent the model from learning well. Be mindful of the balance. If excessive regularization is causing underfitting, reducing the regularization rate parameters (such as decreasing the regularization strength in L1 or L2 regularization) can help the model fit the training data better.
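As a concrete, hedged illustration of the "increase model complexity" remedy above (synthetic non-linear data; a plain linear model underfits, a cubic polynomial model fits well):

```python
# Illustrative sketch: fixing underfitting by adding polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)                      # too simple: underfits
cubic = make_pipeline(PolynomialFeatures(degree=3),
                      LinearRegression()).fit(X, y)        # more capacity: better fit

print("linear R^2:", linear.score(X, y))
print("cubic  R^2:", cubic.score(X, y))
```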


- Overfitting Example

Overfitting example. Python code: Skbkekas. Graphs: Visual Science Informatics

"A training set (left) and a test set (right) from the same statistical population are shown as blue points. Two predictive models are fit to the training set. Both fitted models are plotted with both the training and test sets. In the training set, the MSE of the fit shown in orange is about 1 whereas the MSE for the fit shown in green is about 6. In the test set, the MSE for the fit shown in orange is about 11 and the MSE for the fit shown in green is about 7. The orange curve severely overfits the training set, since its MSE increases by almost a factor of 10 when comparing the test set to the training set. The green curve overfits the training set much less, as its MSE increases by less than a factor of 1."

- Remedies for Overfitting

Imagine a model memorizing every detail of your training data, including random noise. This makes it perform well on that specific data but fails to generalize to new examples.

Data

  • Increase training data size: More data provides more robust patterns for the model to learn from - Increase data Volume. Adding more training data can help the model generalize better and reduce overfitting, as it learns from a more diverse set of examples - Increase data Veracity.
  • Improve data quality: Verify your data is clean and free of noise that the model might latch onto. Prevent data leakage and label leakage. Check that the training set and test set are statistically equivalent (Separated, Unseparated, or Imbalanced). Any significant difference between the two means suggests that the model has some prediction bias with the training, verification, or testing datasets, or with the model itself. Any significant difference with the production data suggests a concept drift (behavioral change), a covariate shift (distribution change), and/or a prior probability shift (target class change). Rebalance an imbalanced dataset in two steps: Step 1: Down-sample the majority class; Step 2: Up-weight the down-sampled class. Good data quality is a necessary prerequisite to building an accurate ML model. [Data Science Approaches to Data Quality: From Raw Data to Datasets]
  • Use data augmentation (if applicable): Artificially create variations of your existing data to expose the model to a wider range of scenarios.

Model Complexity

  • Reduce model complexity: Reduce the complexity of the model by decreasing the number of parameters or using simpler architectures. This can be achieved by reducing the depth of neural networks, decreasing the number of hidden units, or using regularization techniques.
  • Reduce feature selection and dimensions: Identify and remove irrelevant features and dimensions that might be contributing to overfitting.
  • Use ensemble methods: Combine predictions from multiple, less complex models to get a more robust final prediction. Combine models via the Bagging ensemble method to reduce the variance.

Training

  • Implement cross-validation techniques: Assess model performance on unseen data. This helps in detecting overfitting early and adjusting model complexity accordingly.
  • Apply regularization techniques: Regularization is a broad set of techniques that penalize the model for becoming too complex, such as L1 (Lasso) or L2 (Ridge) regularization, which penalize large weights in the model. Increasing the regularization rate encourages the model to keep weights small, which can prevent overfitting. Entropy regularization is used for models that output probabilities; it adjusts the probability distribution towards a uniform distribution.
  • Apply dropout: Dropout is a regularization technique where randomly selected neurons are ignored during training. This helps prevent complex co-adaptations in the model, reducing overfitting.
  • Stop training early: Stop training the model once its performance on a validation set starts to degrade. This prevents it from memorizing noise in the training data. "Early stopping is one of the most commonly used strategies because it is straightforward and effective. It refers to the process of stopping the training when the training error is no longer decreasing but the validation error is starting to rise." [17]

It is critical to find the balance between overfitting and underfitting. Experiment with different techniques and evaluate your model's performance on a validation set to determine the best approach for your specific scenario. Combine models via the Stacking ensemble method to reduce both bias and variance. In practice, L1 and L2 regularization are often combined in a technique called Elastic Net. This combines the benefits of both L1 and L2 regularization, promoting sparsity and preventing overfitting.
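A hedged sketch of these regularization options on a deliberately overfit-prone problem (many noisy features, few samples; toy data generated with scikit-learn):

```python
# Illustrative sketch: OLS vs. Ridge (L2), Lasso (L1), and Elastic Net under
# cross-validation on a noisy, high-dimensional regression problem.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=80, n_informative=10,
                       noise=20.0, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=1.0)),
                    ("Elastic Net", ElasticNet(alpha=1.0, l1_ratio=0.5))]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:12s} mean CV R^2 = {score:.3f}")
```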


"Models trained on large datasets with few features generally outperform models trained on small datasets with a lot of features. It is possible to get good results from a small dataset if you are adapting an existing model already trained on large quantities of data from the same schema." [Google]

  • Finding equilibrium between learning rate and regularization rate

"Learning rate and regularization rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero; a high regularization rate pulls weights towards zero.

If the regularization rate is high with respect to the learning rate, the weak weights tend to produce a model that makes poor predictions. Conversely, if the learning rate is high with respect to the regularization rate, the strong weights tend to produce an overfit model.

Your goal is to find the equilibrium between learning rate and regularization rate. This can be challenging. Worst of all, once you find that elusive balance, you may have to ultimately change the learning rate. And, when you change the learning rate, you'll again have to find the ideal regularization rate." [Google]

  • Solving feedback loops

Feedback loops in ML can occur when a model's output is used as input to the same or another model, creating a circular dependency. This can lead to unintended consequences, such as reinforcement (amplification) of biases, instability (the system becomes unstable or oscillates), or reduced accuracy (performance degrades over time). Here are techniques to address feedback loops:

  1. Break the loop with a direct intervention (preventing the output from being used as input), or introduce a delay (to prevent immediate feedback).
  2. Implement negative feedback control mechanisms to counteract the effects of the feedback loop and stabilize the system.
  3. Apply L1 or L2 regularization (prevent the model from overfitting and reduce the impact of feedback loops).
  4. Use ensemble methods to create a more robust and stable system (combining multiple models is less susceptible to feedback loops).
  5. Retrain models by deploying continuous MLOps tools to: 1) continuously monitor the system for signs of feedback loops, 2) regularly evaluate the model's accuracy, and 3) retrain models automatically, on trigger or on demand, based on model accuracy drift or the availability of additional training data.

By carefully considering these techniques, you can effectively address feedback loops in your machine learning systems and ensure their stability and accuracy.

  • Normalizing features in a multi-feature model

"When creating a model with multiple features, the values of each feature should span roughly the same range. If one feature's values range from 500 to 100,000 and another feature's values range from 2 to 12, the model will need to have weights of extremely low or extremely high values to be able to combine these features effectively. This could result in a low quality model. To avoid this, normalize features in a multi-feature model.

[Normalization is,] broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

Normalization is a common task in feature engineering. Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range." [Google]

Summary of Normalization Techniques. Table: Google

"Normalization provides the following benefits:

  • Helps models converge more quickly during training. When different features have different ranges, gradient descent can "bounce" and slow convergence. That said, more advanced optimizers such as Adagrad and Adam protect against this problem by changing the effective learning rate over time.
  • Helps models infer better predictions. When different features have different ranges, the resulting model might make somewhat less useful predictions.
  • Helps avoid the "NaN trap" when feature values are very high. NaN is an abbreviation for not a number. When a value in a model exceeds the floating-point precision limit, the system sets the value to NaN instead of a number. When one number in the model becomes a NaN, other numbers in the model also eventually become a NaN.
  • Helps the model learn appropriate weights for each feature. Without feature scaling, the model pays too much attention to features with wide ranges and not enough attention to features with narrow ranges." [Google]

Warning: If you normalize a feature during training, you must also normalize that feature when making predictions in validation, testing, and production!
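A minimal sketch of that warning in practice (toy features with very different ranges): the scaler is fitted on the training data only and then reused for any new data.

```python
# Hedged sketch: min-max and z-score normalization with scikit-learn scalers.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[500.0, 2.0], [80000.0, 7.0], [100000.0, 12.0]])  # very different ranges
X_new   = np.array([[25000.0, 5.0]])                                  # validation/production row

minmax = MinMaxScaler().fit(X_train)        # maps each feature to [0, 1]
zscore = StandardScaler().fit(X_train)      # maps each feature to z-scores

print(minmax.transform(X_new))              # reuse the *fitted* scalers at prediction time
print(zscore.transform(X_new))
```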


- Data Visualization

Data visualization is a graphical representation of information and data. Using visual elements, such as charts, graphs, and maps, data visualization techniques provide a visual way to see and understand trends, outliers, and patterns in data. Visualization tools provide capabilities that help discover new insights by demonstrating relationships between data.

Anscombe's Quartet. Graphs: Schutz [18]

An additional benefit of visualizing data is that it exposes data sets that share similar descriptive statistics, such as mean, variance, correlation, linear-regression line, and coefficient of determination, yet have very different distributions and look very different when graphed.

Anscombe's quartet [19], in the above image, comprises four data sets that demonstrate both the importance of graphing data when analyzing it and the effect of outliers and other influential observations on statistical properties.

In ML, the three major reasons, for data visualization, are for understanding, diagnosis, and refinement of your model.

One important reason to visualize your model is to provide an interpretable (reasoning) predictive model and to support its explainability. Other significant purposes are visualizing your model's architecture, parameters, and metrics.

Also, you might need to visualize your model during debugging and improvement, during model comparison and selection, and when teaching concepts.

Visualization is most relevant during training, for monitoring several metrics and tracking training progression. After training, visualization supports model inference, the process of drawing conclusions from a trained model; visualizing the results helps in interpreting and retracing how the model generates its estimates (Visualizing Machine Learning Models: Guide and Tools). [20]

"Visualizations are critical in understanding complex data patterns and relationships. They offer a concise way to understand the: intricacies of statistical models, validate model assumptions, and evaluate model performance." [Avi Chawla]

Plots in Data Science. Graphs: Avi Chawla

The following plots can be utilized for machine learning model validation and evaluation:

1) Kolmogorov-Smirnov (KS) plot - Visualizing distributional differences

The KS statistic acts as a statistical test for distributional differences. The KS plot itself offers more than a binary "same or different" answer. By visually inspecting the plot (see the sketch after this list), you can gain insights into the nature of the difference between the distributions, including:

  • Magnitude of discrepancy: The maximum vertical distance between the Cumulative Distribution Functions (CDF) lines indicates the degree of difference. Larger distances imply less similarity.
  • Shape of differences: The way the lines deviate from each other provides clues about where the distributions differ most significantly. For example, a consistent vertical gap suggests differences across the entire range, while localized deviations point to specific regions of dissimilarity.
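
The following is a minimal sketch, on synthetic samples, of how a KS comparison is typically produced with SciPy and Matplotlib: the reported statistic is exactly the maximum vertical gap between the two empirical CDFs discussed above.

```python
# Minimal sketch: two-sample KS statistic plus the empirical CDFs it compares
# (synthetic samples standing in for, e.g., training-time vs. production scores).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=500)
b = rng.normal(0.3, 1.2, size=500)

res = ks_2samp(a, b)
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")

for sample, label in [(a, "sample A"), (b, "sample B")]:
    xs = np.sort(sample)
    cdf = np.arange(1, len(xs) + 1) / len(xs)
    plt.step(xs, cdf, where="post", label=label)
plt.xlabel("value"); plt.ylabel("empirical CDF"); plt.legend(); plt.show()
```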

2) SHAP plot - Unveiling feature importance with interplay

SHAP plots are incredibly valuable for understanding how features influence a model's predictions. While SHAP values can be used to rank features by importance, their true strength lies in explaining how individual features interact and contribute to specific predictions. They achieve this by:

  • Considering marginal contributions: Unlike traditional feature importance measures, SHAP values isolate the influence of each feature while holding all others constant. This provides a more accurate picture of their individual impact.
  • Accounting for feature dependencies: SHAP plots capture complex interactions between features by showcasing how their effect on the prediction changes based on other feature values.
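
As a minimal sketch, assuming the optional third-party shap package and a tree-based regressor on scikit-learn's diabetes dataset, the summary (beeswarm) plot below ranks features and shows the direction of each contribution; details of the shap API can vary between versions.

```python
# Minimal sketch: SHAP summary plot for a tree-based regressor.
# Assumes the optional `shap` package is installed; its API may differ by version.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)      # efficient explainer for tree ensembles
shap_values = explainer.shap_values(X)     # per-sample, per-feature contributions

shap.summary_plot(shap_values, X)          # beeswarm: importance + effect direction
```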

3) Receiver Operating Characteristic (ROC) curve - Evaluating binary classification models

The ROC curve "depicts the trade-off between the true positive rate (good performance) and the false positive rate (bad performance) across different classification thresholds."

Interpreting the ROC curve:

  • True Positive Rate (TPR): Proportion of actual positives correctly identified by the model.
  • False Positive Rate (FPR): Proportion of actual negatives incorrectly identified as positive.
  • When to use it: The ROC curve is appropriate when the observations are balanced between the classes and the costs of false positives and false negatives are roughly comparable.

The ROC curve plots TPR against FPR at different classification thresholds. This means it captures the trade-off between correctly identifying true positives and minimizing false positives as you adjust the model's sensitivity.

  • Ideal ROC curve: Lies close to the upper left corner, implying high TPR and low FPR.
  • Diagonal line: Represents random guessing, indicating poor model performance.
  • Area Under the Curve (AUC): A single metric summarizing overall performance, ranging from 0 (worst) to 1 (perfect).
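
Here is a minimal sketch, on a synthetic balanced binary task with scikit-learn, of plotting the ROC curve and its AUC against the random-guessing diagonal described above.

```python
# Minimal sketch: ROC curve and AUC for a logistic regression on synthetic data.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", round(roc_auc_score(y_te, scores), 3))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random guessing")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.legend(); plt.show()
```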

4) Quantile-Quantile (QQ) plot - Comparing the distribution of your data to a theoretical distribution

QQ plot "assesses the distributional similarity between observed data and theoretical distribution. It plots the quantiles of the two distributions against each other. It visualize deviations from the straight line indicate a departure from the assumed distribution."

Understanding QQ plot:

  • Each point on the plot represents a quantile of one distribution plotted against the corresponding quantile of the other.
  • If the points fall roughly along a straight line, then the two distributions are similar.
  • Deviations from the line indicate where the distributions differ.

Types of QQ plot:

  • Normal QQ Plot: Compares your data to a normal distribution. This is commonly used to check for normality assumptions in statistical tests.
  • Empirical QQ Plot: Compares your data to another dataset. This can be useful for seeing if two datasets come from the same population.

Beyond the Basics:

  • QQ plot can also be used to compare more complex distributions than just normal distributions.
  • There are different ways to calculate quantiles, which can lead to slightly different QQ plots.
  • You can use statistical tests to assess how statistically significant the deviations from the line are.
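
A minimal sketch of a normal QQ plot, using SciPy's probplot on simulated residuals: points hugging the reference line are consistent with the assumed normal distribution.

```python
# Minimal sketch: normal QQ plot of simulated residuals with scipy.stats.probplot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=300)   # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)       # draws points + reference line
plt.title("Normal QQ plot of residuals")
plt.show()
```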

5) Cumulative Explained Variance (CEV) plot - Utilizing as a tool for tasks such as dimensionality reduction

The CEV plot is "useful in determining the number of dimensions your data can be reduced to while preserving maximum variance during Principal Component Analysis (PCA)."

What it shows:

  • This plot displays the cumulative proportion of variance in the original data explained as Principal Components (PCs) are added, with the components ordered from most to least variance explained, so the curve rises toward 1.
  • By looking at the curve, you can understand how much information each PC captures and judge how many components are necessary to retain a significant amount of the original data's variance.

Interpreting the CEV plot:

  • Steep initial rise: Indicates initial PCs capturing a large chunk of variance.
  • Gradual flattening: Shows diminishing returns with later PCs explaining smaller portions of variance.
  • Elbow point: Often observed, suggesting a natural "stopping point" for selecting the number of PCs.
  • Interpretation of the "point": Varies depending on the specific goal and acceptable information loss. Sometimes, a threshold (e.g., 80% explained variance) is used, while other times, the interpretability and complexity trade-off might guide the decision.
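
The following minimal sketch, using PCA from scikit-learn on the breast-cancer dataset, draws the cumulative explained variance curve and reads off how many components clear an assumed 80% threshold.

```python
# Minimal sketch: cumulative explained variance plot for PCA; the 80% threshold
# is an illustrative assumption, not a universal rule.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA is scale-sensitive

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
plt.axhline(0.80, linestyle="--", label="80% threshold (assumed)")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.legend(); plt.show()

print("Components needed for 80% variance:", int(np.argmax(cumulative >= 0.80)) + 1)
```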

6) Elbow curve - Utilizing as a tool particularly for determining the optimal number of clusters in K-means clustering

The Elbow curve "helps identify the optimal number of clusters for the k-means algorithm. The point of the elbow depicts the ideal number of clusters."

Understanding the Elbow curve:

  • It plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters (k).
  • As you increase k, WCSS decreases because each additional cluster partitions the data more finely, leaving less variation within each cluster.
  • The "elbow" point where the decrease in WCSS starts to plateau or slow down significantly is considered the optimal k.

Limitations of the Elbow curve:

  • Subjectivity: Identifying the "elbow" is often subjective and dependent on visualization.
  • Not always present: In datasets with poorly defined clusters or many potential cluster numbers, the elbow might be unclear or absent.
  • Sensitive to initialization: K-means is sensitive to initial cluster centroids, and different initializations can lead to different elbow shapes.
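
A minimal sketch, on synthetic blobs with a known cluster count, of producing an Elbow curve: WCSS is scikit-learn's inertia_, plotted against k and inspected visually for the bend.

```python
# Minimal sketch: Elbow curve from KMeans inertia_ on synthetic blob data.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares (WCSS)")
plt.show()
```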

7) Silhouette curve - Providing an alternative to the Elbow curve for determining the optimal number of clusters in K-means clustering

The Silhouette curve is an alternative to the Elbow curve, which is often ineffective when you have a large number of candidate clusters.

What it shows:

  • The Silhouette curve plots the silhouette coefficient for each data point against the number of clusters (k).
  • Each data point's silhouette coefficient represents how well it's clustered:
  • Values close to 1: Represent data points well-assigned to their cluster, far from neighboring clusters.
  • Values around 0: Indicate uncertain clustering, where the point is roughly equidistant to its own and neighboring clusters.
  • Negative values: Suggest misplacement, where the point is closer to a different cluster than its own.

Interpreting the plot:

  • Unlike the Elbow curve, there's no single "ideal" shape. Look for:
  • Generally increasing trend: As k increases, silhouette coefficients should improve on average, indicating better clustering.
  • Stabilization or plateau: After a certain k, further increases in k might not significantly improve the average silhouette coefficient.
  • Clusters with consistently low coefficients: Investigate these, as they might indicate outliers or noisy data.

Benefits over the Elbow curve:

  • More objective: Silhouette coefficients quantify individual data point clustering quality, offering a less subjective measure than the "elbow" point.
  • Considers cluster separation: Unlike WCSS, silhouette coefficients incorporate information about between-cluster distances, leading to potentially more robust cluster selection.

Limitations:

  • Interpretation can be nuanced: Depending on data distribution and number of clusters, the curve might not have a clear peak. Expertise and domain knowledge might be needed.
  • Computational cost: Calculating silhouette coefficients can be more expensive than WCSS, especially for large datasets.
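
As a minimal sketch on the same kind of synthetic data, the loop below reports the mean silhouette coefficient for each candidate k; the k where the average is highest, or stops improving, is a reasonable pick.

```python
# Minimal sketch: mean silhouette score across candidate cluster counts.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):                        # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: mean silhouette = {silhouette_score(X, labels):.3f}")
```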

8) Gini Impurity and Entropy - Measuring splits in decision tree algorithms

Gini Impurity and Entropy serve the same purpose of quantifying impurity within a node, but they approach it from different perspectives. "They are used to measure the impurity or disorder of a node or split in a decision tree. The plot compares Gini impurity and Entropy across different splits. It provides insights into the trade-off between these measures."

Gini Impurity:

  • Focuses on misclassification: Calculates the probability of randomly misclassifying an instance from a node if it were labeled randomly according to the class distribution within that node.
  • Values range from 0 up to 1 - 1/k (0.5 for two classes): 0 signifies a pure node (all instances belong to the same class), while the maximum occurs at an equal class distribution, the most impure node.
  • Intuitive interpretation: A higher Gini Impurity implies greater randomness and uncertainty about the correct classification, hence, a need for further splitting.

Entropy:

  • Based on information theory: Calculates the average amount of information needed to correctly classify an instance from a node, considering the probabilities of each class.
  • Values range from 0 to log2(k), where k is the number of classes: Similar to Gini Impurity, 0 represents a pure node, and higher values imply greater uncertainty.
  • Less intuitive interpretation: Entropy might be less readily comprehensible for non-information theory experts compared to Gini Impurity.

Key differences:

  • Sensitivity to class distribution: Gini Impurity is sensitive to the difference between the majority and minority classes, while Entropy considers the distribution of all classes equally.
  • Computational cost: Gini Impurity is generally faster to calculate compared to Entropy.

Choosing the right metric:

  • No definitive answer: Both metrics are commonly used, and the choice often depends on personal preference or specific dataset characteristics.
  • Experimentation: Trying both metrics with your data and comparing the resulting trees might be helpful.
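
To make the two criteria concrete, here is a minimal sketch that evaluates Gini impurity and entropy directly from a node's class-probability vector, using the standard formulas (1 - Σp² and -Σp·log₂p).

```python
# Minimal sketch: Gini impurity vs. entropy for a few example class distributions.
import numpy as np

def gini(p):
    """Gini impurity 1 - sum(p_i^2) for class probabilities p (summing to 1)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy -sum(p_i * log2 p_i), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                             # skip zero probabilities (0 * log 0 := 0)
    return -np.sum(p * np.log2(p))

for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, f"gini={gini(dist):.3f}", f"entropy={entropy(dist):.3f}")
```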

9) Bias-Variance Trade-off plot - Finding the right balance between the bias and the variance of a model against complexity

The Bias-Variance Trade-off plot visually depicts the relationship between a model's complexity, its accuracy on the training data (bias), and its ability to generalize to unseen data (variance).

Understanding the plot:

  • X-axis: Typically represents the model complexity, which can be measured by the number of features used, the model's size, or hyperparameter values.
  • Y-axis: Often has two components:
  • Bias: Measured by the model's training error or its ability to fit the training data perfectly. Lower bias generally corresponds to a more complex model that can capture intricate details of the training data.
  • Variance: Measured by the model's generalization error or its ability to perform well on unseen data. Higher variance indicates a model that is overly sensitive to the specific training data and might not generalize well.

Plot shape:

  • Ideally, the total (validation) error traces a U-shaped curve as model complexity grows.

  • Left side: High bias and low variance; increasing complexity reduces bias faster than it adds variance, so total error falls.
  • Minimum point: Represents the "sweet spot" where the model balances bias and variance for optimal performance.
  • Right side: Bias continues to shrink, but variance grows rapidly, so total error rises again and the model begins to overfit.

Interpreting the Trade-off:

  • Underfitting: If the model operates on the left side of the curve, it has high bias and low variance. This means it might not capture the nuances of the training data well, leading to underfitting and poor performance on both training and unseen data.
  • Overfitting: On the right side of the curve, the model has low bias but high variance. It might fit the training data perfectly but fail to generalize to unseen data due to overfitting.
  • Finding the Balance: The goal is to choose a model complexity that falls near the minimum point of the curve, achieving a balance between bias and variance for optimal generalization performance.
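
One common way to draw this trade-off, sketched below on synthetic regression data, is scikit-learn's validation_curve: training error stands in for bias, and the gap to the cross-validated error reflects variance as tree depth (the assumed complexity knob) grows.

```python
# Minimal sketch: training vs. cross-validated error as model complexity grows
# (decision-tree depth is used as the complexity axis on synthetic data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

depths = np.arange(1, 15)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    scoring="neg_mean_squared_error", cv=5,
)

plt.plot(depths, -train_scores.mean(axis=1), label="training error")
plt.plot(depths, -val_scores.mean(axis=1), label="cross-validated error")
plt.xlabel("Model complexity (max_depth)")
plt.ylabel("Mean squared error")
plt.legend(); plt.show()
```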

10) Partial Dependence Plots (PDPs) - Utilizing as a tool for understanding how individual features influence a model's predictions

PDPs visually show the average marginal effect of a feature on the model's prediction, holding all other features constant. This helps you understand how a specific feature changes the prediction, independent of the influence of other features. PDPs depict the dependence between target and features.

Different types of PDPs:

  • 1-way PDP: Plots the average prediction for a single feature across its range of values.
  • 2-way PDP: Plots the average prediction for two features across their ranges, visualized as a heatmap or contour plot.
  • Individual Conditional Expectation (ICE) Plot: Similar to a 1-way PDP, but shows the prediction for each individual data point instead of the average.

Interpreting a PDP:

  • Slope of the plot: Indicates the direction and strength of the feature's effect.
  • Shape of the plot: Can reveal complex relationships, such as non-linearity or interactions with other features.
  • Comparison across different PDPs: Helps identify which features have the strongest impact on the predictions.

Benefits of PDPs:

  • Easy to understand: Provide a visually intuitive way to explore feature importance and interactions.
  • Model agnostic: Can be used with various machine learning models.
  • Explainable AI (XAI): Contribute to understanding and interpreting complex models.

Limitations of PDPs:

  • Assumption of independence: Assume features are independent, which might not always be true.
  • High dimensionality: Can be challenging to visualize with many features.
  • Interpretability caveats: Don't necessarily imply causality.
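
A minimal sketch, assuming a reasonably recent scikit-learn with PartialDependenceDisplay, of 1-way PDPs for two features of the diabetes dataset fitted with gradient boosting.

```python
# Minimal sketch: 1-way partial dependence plots via scikit-learn's
# PartialDependenceDisplay (requires a reasonably recent sklearn version).
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Average effect of 'bmi' and 'bp' on the prediction, other features held fixed.
PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "bp"])
plt.show()
```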

11) Precision-Recall (PR) curve - Evaluating binary classification models

The PR curve "depicts the trade-off between Precision and Recall across different classification thresholds."

Precision vs. Recall:

  • Precision: Represents the proportion of predicted positives that are actually true positives. Think of it as "how much you can trust the model's positive predictions."
  • Recall: Represents the proportion of actual positives that the model correctly identifies. Think of it as "how many true positives the model is not missing."

The PR curve plots precision against recall at different classification thresholds, similar to the ROC curve. However, unlike the ROC curve, which focuses on the false positive rate, the PR curve directly examines the trade-off between precision and recall. This trade-off is often more relevant when class imbalance is significant or when false negatives are more costly than false positives (e.g., medical diagnosis).

Interpreting the Curve:

  • Ideal PR curve: Lies close to the upper right corner, implying high precision and high recall.
  • Random guessing: A horizontal line at the positive-class prevalence (the fraction of positives in the data) represents random guessing, indicating poor model performance.
  • Average Precision (AP): A single metric summarizing overall performance, considering both precision and recall across all thresholds.
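
A minimal sketch on a deliberately imbalanced synthetic task: the PR curve and average precision are computed with scikit-learn, and the random baseline is drawn at the positive-class prevalence.

```python
# Minimal sketch: precision-recall curve and average precision on imbalanced data.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, scores)
print("Average precision:", round(average_precision_score(y_te, scores), 3))

plt.plot(recall, precision, label="model")
plt.axhline(y_te.mean(), linestyle="--", label="random baseline (prevalence)")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend(); plt.show()
```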

Dashboard Charts for Model Accuracy Evaluation in a Guided Automation. Charts: KNIME AG

Dashboard charts for model accuracy evaluation in a guided automation provide a comprehensive overview of the performance of different machine learning models after a rigorous automated training and evaluation process.

The dashboard serves as a centralized location for:

  • Model comparison: Quickly identify the best-performing models based on accuracy and other metrics.
  • Performance analysis: Delve into individual model performance through confusion matrices, ROC curves, and gain charts.
  • Decision support: Inform data-driven decisions by understanding model strengths and weaknesses.

Dashboard components:

  • Bar chart of accuracies and AUC scores: This visual compares models based on their overall accuracy and ability to discriminate between positive and negative classes.
  • ROC curves: These curves illustrate the trade-off between true positive rate and false positive rate, providing a deeper understanding of model performance.
  • Confusion matrices: These matrices offer a detailed breakdown of model predictions, revealing correct and incorrect classifications.
  • Cumulative gain charts: These charts show the cumulative proportion of positive cases identified by the model compared to a random model, helping assess model performance in terms of target population identification.
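
The numbers behind such a dashboard can be sketched in a few lines; the models, data, and metrics below are illustrative assumptions, not the KNIME workflow itself.

```python
# Minimal sketch: per-model accuracy, AUC, and confusion matrix on synthetic data
# (illustrative stand-ins for the dashboard components listed above).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1_000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(name,
          f"accuracy={accuracy_score(y_te, pred):.3f}",
          f"AUC={roc_auc_score(y_te, proba):.3f}")
    print(confusion_matrix(y_te, pred))
```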

Potential enhancements:

While the described dashboard provides valuable insights, consider adding these elements for further enrichment:

  • Feature importance: Visualize the impact of different features on model predictions.
  • Model explainability: Incorporate techniques to understand how models make decisions.
  • Hyperparameter tuning results: Display the optimal hyperparameter values for each model.
  • Computational cost: Include metrics to assess the computational efficiency of different models.

- In Essence

“Balancing bias and variance ... is the best way to ensure that model is sufficiently [optimally] fit on the data and performs well on new [evaluation] data.” Solving the issue of bias and variance is about dealing with overfitting and underfitting and building an optimal model. [29]

Next, read my "Complexity: Time, Space, & Sample" article at https://www.dhirubhai.net/pulse/complexity-time-space-sample-yair-rajwan-ms-dsc.

---------------------------------------------------------

[1] https://www.dhirubhai.net/pulse/machine-learning-101-which-ml-choose-yair-rajwan-ms-dsc

[2] https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

[3] https://www.datarobot.com/wiki/fitting

[4] https://prateekvjoshi.com/2013/06/09/overfitting-in-machine-learning

[5] https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

[6] https://medium.com/ml-research-lab/under-fitting-over-fitting-and-its-solution-dc6191e34250

[7] https://medium.datadriveninvestor.com/determining-perfect-fit-for-your-ml-model-339459eef670

[8] https://scott.fortmann-roe.com/docs/BiasVariance.html

[9] https://www.kaggle.com/getting-started/166897

[10] https://www.coursera.org/lecture/deep-neural-network/bias-variance-ZhclI

[11] https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-and-visualizing-it-with-example-and-python-code-7af2681a10a7

[12] https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

[13] https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1

[14] https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

[15] https://www.soapboxlabs.com/blog/common-machine-learning-problems/

[16] https://wandb.ai/mostafaibrahim17/ml-articles/reports/A-Deep-Dive-Into-Learning-Curves-in-Machine-Learning--Vmlldzo0NjA1ODY0

[17] https://theaisummer.com/regularization

[18] https://commons.wikimedia.org/wiki/User:Schutz

[19] https://www.tandfonline.com/doi/abs/10.1080/00031305.1973.10478966

[20] https://neptune.ai/blog/visualizing-machine-learning-models
