Must-Know Mathematical Measures for Every Data Scientist
Sadhiq Nazar
There are a large number of mathematical measures that every data scientist needs to be aware of. This article outlines the must-know statistical measures in a concise manner.
Mean
- Sum all values.
- Divide the sum by the total number of observations.
Mode
Take the most frequently occurring value in the sample.
Median
- Sort the numbers in ascending order.
- Take the middle value (if there is an even number of observations, take the average of the two middle values).
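A minimal sketch of these three measures in plain Python, using only the standard library (the sample values are made up for illustration):

```python
from collections import Counter

data = [4, 1, 7, 4, 3, 9, 4]

# Mean: sum all values, divide by the number of observations.
mean = sum(data) / len(data)

# Mode: the most frequently occurring value.
mode = Counter(data).most_common(1)[0][0]

# Median: sort ascending, take the middle value
# (average the two middle values when the count is even).
s = sorted(data)
n = len(s)
median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

print(mean, mode, median)  # 4.571..., 4, 4
```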
Variance
- Calculate the mean.
- Take the difference between each value and the mean.
- Square this difference.
- Sum all the squared differences.
- Finally, divide by the total number of observations.
Variance gives us the dispersion of the values around the mean.
Standard Deviation
Square root of variance.
Standard deviation gives us the dispersion of the values around the mean in the same units as the values (instead of the squared units of variance).
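A minimal sketch of both measures, following the steps above. Note that it computes the population variance (dividing by n); a sample variance would divide by n - 1 instead:

```python
import math

data = [4, 1, 7, 4, 3, 9, 4]

# Variance: mean of the squared differences from the mean
# (population form, dividing by n).
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance,
# expressed in the same units as the data.
std_dev = math.sqrt(variance)

print(variance, std_dev)
```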
Covariance
Covariance is used to find the relationship between two variables:
- Calculate the mean of each variable.
- For each observation, take the difference between each variable's value and that variable's mean, then multiply the two differences together.
- Sum all the multiplied differences.
- Divide by the total number of observations.
Correlation
Measures the strength of the co-movement between two variables. It is the standardized covariance of the two variables.
Correlation is always between -1 and 1: -1 indicates that the variables are perfectly negatively correlated, +1 that they are perfectly positively correlated, and 0 that there is no linear correlation between them.
- Calculate the covariance of the two variables.
- Calculate the standard deviation of each variable.
- Multiply the two standard deviations.
- Divide the covariance by the multiplied standard deviations.
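A sketch of covariance and correlation together, since correlation is built directly from covariance (the paired values are made up for illustration):

```python
import math

x = [2.1, 2.5, 4.0, 3.6]
y = [8.0, 12.0, 14.0, 10.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance: average of the products of the paired deviations
# from each mean (population form, dividing by n).
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Correlation: covariance divided by the product of the two
# standard deviations; always falls between -1 and 1.
std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
correlation = covariance / (std_x * std_y)

print(covariance, correlation)
```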
Explained Sum Of Squares
For a variable Y:
- Calculate the difference between each estimated value of Y and the mean of Y.
- Square the difference.
- Sum all of the squared differences.
Sum Of Squared Residuals
For a variable Y:
- Calculate the difference between each estimated value of Y and the actual value of Y.
- Square the difference.
- Sum all of the squared differences.
Residuals are also known as errors.
Total Sum of Squares
Explained Sum of Squares + Sum of Squared Residuals. It captures the total variation of Y around its mean, hence the name.
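A sketch of the three sums of squares for a toy set of actual values and model estimates (both made up for illustration). The TSS line follows this article's definition; for a least-squares fit with an intercept, ESS + SSR equals the sum of squared differences between each actual value and the mean:

```python
y_actual = [3.0, 5.0, 7.0, 9.0]
y_estimated = [2.8, 5.3, 6.9, 9.1]

mean_y = sum(y_actual) / len(y_actual)

# Explained Sum of Squares: squared gaps between the estimates and the mean of Y.
ess = sum((est - mean_y) ** 2 for est in y_estimated)

# Sum of Squared Residuals: squared gaps between the estimates and the actual values.
ssr = sum((est - act) ** 2 for est, act in zip(y_estimated, y_actual))

# Total Sum of Squares, following the definition above: ESS + SSR.
tss = ess + ssr

print(ess, ssr, tss)
```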
R-Squared
Measures the explained variation over the total variation. R squared is also known as the coefficient of determination, and it measures the quality of fit.
Formula to calculate R squared is:
- R squared = 1 - (Sum of Squared Residuals / Total Sum of Squares)
Adjusted R-Squared
R squared by itself is not good enough, as it does not consider the number of independent variables that produced that degree of determination. As a result, adjusted R squared is calculated:
Adjusted R squared = 1 - [(n - 1) / (n - k - 1)] x (1 - R squared)
- n = number of observations
- k = number of independent variables
It is adjusted for the number of predictors in the model.
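A sketch of both formulas with made-up values for the sums of squares, observation count, and predictor count:

```python
ssr = 0.15   # Sum of Squared Residuals (made-up value)
tss = 20.0   # Total Sum of Squares (made-up value)
n = 50       # number of observations
k = 3        # number of independent variables

# R squared: share of the total variation that the model explains.
r_squared = 1 - ssr / tss

# Adjusted R squared: penalises the fit for each extra predictor.
adj_r_squared = 1 - ((n - 1) / (n - k - 1)) * (1 - r_squared)

print(r_squared, adj_r_squared)
```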
Standard Error Of Regression
Measures the variability between the actual and estimated values of Y. It is the standard deviation of the residuals. It is calculated as:
SquareRoot(Sum of Squared Residuals / (n - k - 1))
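A sketch of this calculation, reusing the made-up sums-of-squares values from the previous example:

```python
import math

ssr = 0.15  # Sum of Squared Residuals (made-up value)
n = 50      # number of observations
k = 3       # number of independent variables

# Standard error of the regression: the residual standard deviation,
# adjusted for the degrees of freedom used up by the model.
se_regression = math.sqrt(ssr / (n - k - 1))
print(se_regression)
```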
Mean Absolute Error
- Calculate the absolute difference between each prediction and the actual observation.
- Sum the absolute differences.
- Divide the sum by the total number of observations.
Root Mean Squared Error
- Calculate the difference between each prediction and the actual observation.
- Square the difference.
- Sum the squared differences.
- Divide the sum of squared differences by the total number of observations.
- Calculate the square root of the result.
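A sketch of both error measures on the same toy predictions (made up for illustration); note that RMSE penalises large errors more heavily than MAE does:

```python
import math

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.3, 6.9, 9.1]
n = len(y_actual)

# Mean Absolute Error: average size of the errors, ignoring sign.
mae = sum(abs(p - a) for p, a in zip(y_predicted, y_actual)) / n

# Root Mean Squared Error: square root of the average squared error.
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(y_predicted, y_actual)) / n)

print(mae, rmse)
```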
F1
Used to measure the performance of classification-based supervised machine learning algorithms. It is the harmonic mean of the precision and recall of a model. The results are between 0 and 1: results tending towards 1 are considered the best, whereas those tending towards 0 are treated as the worst. F1 is used in classification tests where true negatives do not matter as much.
Confusion Matrix
A confusion matrix is a results table that summarises the performance of a classification algorithm when the actual values are known.
There are several terms used:
- True Positive: When the actual result is true and predicted value is also true
- True Negative: When the actual result is false and predicted value is also false
- False Positive: When the actual result is false but the predicted value is true
- False Negative: When the actual result is true but the predicted value is false
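A sketch that tallies the four cells from made-up labels and then derives precision, recall, and F1 from them:

```python
actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 1, 0, 0]

# Count each confusion-matrix cell by comparing paired labels.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn, precision, recall, f1)
```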
Euclidean Distance
Measures the distance between two points; the smaller the distance, the more similar they are:
- For each dimension, find the difference between the two values.
- Square the difference.
- Sum the squared differences.
- Take the square root of the sum.
Manhattan Distance
Measures the distance between two points:
- For each dimension, find the difference between the two values.
- Take the absolute value of the difference.
- Sum the absolute differences.
Minkowski Distance
A generalised metric form of the Euclidean and Manhattan distances.
Given a Minkowski power (a number) known as λ:
- For each dimension, find the difference between the two values.
- Take the absolute value of the difference.
- Raise the absolute difference to the power λ.
- Sum the powered differences.
- Take the λ-th root of the sum.
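A sketch of all three distances; setting λ = 1 in the Minkowski formula reproduces the Manhattan distance, and λ = 2 reproduces the Euclidean distance (the points are made up for illustration):

```python
import math

x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 5.0]

def euclidean(x, y):
    # Square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of the absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, lam):
    # λ-th root of the sum of absolute differences raised to λ.
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

print(euclidean(x, y), minkowski(x, y, 2))  # equal
print(manhattan(x, y), minkowski(x, y, 1))  # equal
```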
Cosine Similarity
Finds how similar two variables X and Y are:
- Multiply each corresponding value of X and Y.
- Sum the multiplied values (this is the dot product).
- Sum the squared values of X, and separately sum the squared values of Y.
- Take the square root of each sum and multiply the two roots together.
- Divide the dot product by this product.
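A sketch of the steps above; the result is the cosine of the angle between the two vectors (the values are made up for illustration):

```python
import math

x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 5.0]

# Dot product: sum of the pairwise products.
dot = sum(a * b for a, b in zip(x, y))

# Euclidean norm of each vector: square root of the sum of squares.
norm_x = math.sqrt(sum(a ** 2 for a in x))
norm_y = math.sqrt(sum(b ** 2 for b in y))

# Cosine similarity: dot product divided by the product of the norms.
cosine_similarity = dot / (norm_x * norm_y)
print(cosine_similarity)
```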