Must Know Mathematical Measures For Every Data Scientist

There are a number of mathematical measures that every data scientist needs to be aware of. This article outlines the must-know statistical measures concisely, with a short illustrative code sketch for each.

Mean

  1. Sum all values.
  2. Divide the sum by the total number of observations.

Mode

Take the most occurring value in the sample.

Median

  1. Sort the numbers in ascending order.
  2. Take the middle value (if there is an even number of observations, average the two middle values).
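
A minimal sketch of the three measures above, using only Python's built-in statistics module; the sample values are made up for illustration:

```python
import statistics

sample = [4, 2, 7, 4, 9, 4, 1]  # toy data

mean = sum(sample) / len(sample)    # sum all values, divide by the count
mode = statistics.mode(sample)      # most frequently occurring value
median = statistics.median(sample)  # middle value of the sorted sample

print(mean, mode, median)  # 4.428..., 4, 4
```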

Variance

  1. Calculate the mean.
  2. Take the difference between each value and the mean.
  3. Square this difference.
  4. Sum all the squared differences.
  5. Finally, divide by the total number of observations.

Variance gives us the dispersion of the values around the mean.

Standard Deviation

Square root of variance.

Standard deviation gives us the dispersion of the values around the mean in the same units as the values (instead of squared units, as with variance).
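
A minimal sketch of both measures, following the steps above on made-up data. It uses the population form (dividing by n), as described:

```python
import math

sample = [4, 2, 7, 4, 9, 4, 1]  # toy data

mean = sum(sample) / len(sample)
# squared differences from the mean, summed, divided by the number of observations
variance = sum((x - mean) ** 2 for x in sample) / len(sample)
std_dev = math.sqrt(variance)  # same units as the original values

print(variance, std_dev)
```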

Covariance

Covariance is used to find the relationship between two variables:

  1. Calculate the mean of each variable.
  2. Take the difference between each value and the mean of its variable. Multiply together the paired differences of the two variables.
  3. Sum all the multiplied differences.
  4. Divide by the total number of observations.
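
A minimal sketch in plain Python; the paired observations are made up for illustration:

```python
xs = [2.1, 2.5, 4.0, 3.6]  # toy paired observations
ys = [8.0, 10.0, 12.0, 14.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# multiply the paired deviations from each mean, sum them, divide by n
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
print(covariance)  # positive: the variables tend to move together
```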

Correlation

Measures the strength of the co-movement between two variables. It is the standardized covariance of the two variables.

Correlation is always between -1 and +1. -1 indicates that the variables are perfectly negatively correlated and +1 shows that the variables are perfectly positively correlated. 0 indicates that there is no linear correlation between the variables.

  1. Calculate the covariance of the two variables.
  2. Calculate the standard deviation of each variable.
  3. Multiply the two standard deviations.
  4. Divide the covariance by the product of the standard deviations.
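
A minimal sketch building on the covariance example above, with the same made-up data:

```python
import math

xs = [2.1, 2.5, 4.0, 3.6]  # toy paired observations
ys = [8.0, 10.0, 12.0, 14.0]
n = len(xs)

mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

correlation = cov / (std_x * std_y)  # always between -1 and +1
print(correlation)
```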

Explained Sum Of Squares

For a variable Y:

  1. Calculate difference between estimated value of Y and mean of Y
  2. Square the difference
  3. Sum all of the values

Sum Of Squared Residuals

For a variable Y:

  1. Calculate difference between estimated value of Y and actual value of Y
  2. Square the difference
  3. Sum all of the values

Residuals are also known as errors.

Total Sum of Squares

The sum of the Explained Sum of Squares and the Sum of Squared Residuals, hence the name. Equivalently, it is the sum of the squared differences between each actual value of Y and the mean of Y.

R-Squared

Measures explained variation over total variation. R squared is also known as the coefficient of determination, and it measures the quality of fit.

The formula to calculate R squared is:

  • R squared = 1 - (Sum of Squared Residuals / Total Sum of Squares)
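
The last four quantities fit together naturally. A minimal sketch with made-up x and y values, fitting a simple least-squares line so that Total = Explained + Residual holds exactly:

```python
x = [1.0, 2.0, 3.0, 4.0]  # toy predictor
y = [2.0, 4.5, 6.0, 9.5]  # toy actual values

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# ordinary least-squares slope and intercept for the toy data
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
y_pred = [intercept + slope * xi for xi in x]  # estimated values of Y

ess = sum((p - mean_y) ** 2 for p in y_pred)        # explained sum of squares
ssr = sum((p - a) ** 2 for p, a in zip(y_pred, y))  # sum of squared residuals
tss = sum((a - mean_y) ** 2 for a in y)             # total sum of squares: ess + ssr

r_squared = 1 - ssr / tss
print(ess, ssr, tss, r_squared)  # 28.8 0.7 29.5 0.976...
```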

Adjusted R-Squared

R squared by itself is not enough, as it does not consider the number of variables used to reach that degree of determination. As a result, adjusted R squared is calculated:

Adjusted R squared = 1 - [(n - 1) / (n - k - 1)] x (1 - R squared)

  • n = number of observations
  • k = number of independent variables

It is adjusted for the number of predictors in the model.
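
A minimal sketch of the adjustment; the R squared, n, and k values are made up:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust R squared for n observations and k independent variables."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r_squared)

print(adjusted_r_squared(0.9, n=50, k=3))  # ~0.8935, slightly below the raw 0.9
```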

Standard Error Of Regression

Measures the variability of the actual values of Y around the estimated values. It is the standard deviation of the residuals. It is calculated as:

SquareRoot[Sum of Squared Residuals / (n - k - 1)]

where n is the number of observations and k is the number of independent variables.
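
A minimal sketch using made-up actual and predicted values, assuming a model with one independent variable (k = 1):

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]  # toy actual values
y_pred = [2.8, 5.3, 6.9, 9.2]  # toy predictions from some fitted model
k = 1                          # assumed number of independent variables

ssr = sum((p - a) ** 2 for p, a in zip(y_pred, y_true))
ser = math.sqrt(ssr / (len(y_true) - k - 1))  # standard deviation of the residuals
print(ser)  # 0.3
```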

Mean Absolute Error

  1. Calculate absolute differences between prediction and actual observation
  2. Sum the absolute differences
  3. Divide sum by total number of observations
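
A minimal sketch with made-up actual and predicted values:

```python
y_true = [3.0, 5.0, 7.0, 9.0]  # toy actual values
y_pred = [2.8, 5.3, 6.9, 9.2]  # toy predictions

mae = sum(abs(p - a) for p, a in zip(y_pred, y_true)) / len(y_true)
print(mae)  # 0.2
```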

Root Mean Squared Error

  1. Calculate difference between prediction and actual observation.
  2. Square the difference
  3. Sum the squared differences
  4. Divide sum of squared differences by total number of observations
  5. Calculate square root of it
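
A minimal sketch with the same made-up values as above:

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]  # toy actual values
y_pred = [2.8, 5.3, 6.9, 9.2]  # toy predictions

mse = sum((p - a) ** 2 for p, a in zip(y_pred, y_true)) / len(y_true)
rmse = math.sqrt(mse)  # back in the same units as Y
print(rmse)  # 0.212...
```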

F1

Used to measure the performance of classification-based supervised machine learning algorithms. It is the harmonic mean of the precision and recall of a model. The results are between 0 and 1. Results tending towards 1 are considered the best whereas those tending towards 0 are treated as the worst. F1 is used in classification tests where true negatives do not matter as much.
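
A minimal sketch with made-up binary labels, assuming scikit-learn is installed:

```python
from sklearn.metrics import f1_score  # assumes scikit-learn is available

y_true = [1, 0, 1, 1, 0, 1]  # toy actual labels
y_pred = [1, 0, 0, 1, 1, 1]  # toy predicted labels

# harmonic mean of precision (3/4) and recall (3/4) on this toy data
print(f1_score(y_true, y_pred))  # 0.75
```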

Confusion Matrix

A confusion matrix is a table that summarises the results of a classification algorithm when the actual true values are known.

There are several terms used:

  • True Positive: When the actual result is true and predicted value is also true
  • True Negative: When the actual result is false and predicted value is also false
  • False Positive: When the actual result is false but the predicted value is true
  • False Negative: When the actual result is true but the predicted value is false
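
A minimal sketch with the same made-up labels as above, assuming scikit-learn is installed:

```python
from sklearn.metrics import confusion_matrix  # assumes scikit-learn is available

y_true = [1, 0, 1, 1, 0, 1]  # toy actual labels
y_pred = [1, 0, 0, 1, 1, 1]  # toy predicted labels

# rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] for binary labels 0/1
print(confusion_matrix(y_true, y_pred))  # [[1 1], [1 3]]
```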

Euclidean Distance

Finds the similarity between two variables (the smaller the distance, the more similar they are):

  1. Find the difference between each pair of corresponding values.
  2. Square each difference.
  3. Sum the squared differences.
  4. Take the square root of the sum.
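
A minimal sketch with two made-up points:

```python
import math

a = [1.0, 2.0, 3.0]  # toy points
b = [4.0, 0.0, 3.0]

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(euclidean)  # sqrt(9 + 4 + 0) = 3.605...
```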

Manhattan Distance

Finds the similarity between two variables (the smaller the distance, the more similar they are):

  1. Find the difference between each pair of corresponding values.
  2. Take the absolute value of each difference.
  3. Sum the absolute differences.
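
A minimal sketch with the same made-up points:

```python
a = [1.0, 2.0, 3.0]  # toy points
b = [4.0, 0.0, 3.0]

manhattan = sum(abs(x - y) for x, y in zip(a, b))
print(manhattan)  # 3 + 2 + 0 = 5.0
```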

Minkowski Distance

Generalised form of the Euclidean and Manhattan distances.

Given a Minkowski power (a number) known as λ:

  1. Find the difference between each pair of corresponding values.
  2. Take the absolute value of each difference.
  3. Raise each absolute difference to the power λ.
  4. Sum the results.
  5. Take the λ-th root of the sum.

λ = 1 gives the Manhattan distance and λ = 2 gives the Euclidean distance.
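
A minimal sketch with the same made-up points; note how λ = 1 and λ = 2 reproduce the two previous distances:

```python
def minkowski(a, b, lam):
    """Minkowski distance with power lam: lam=1 is Manhattan, lam=2 is Euclidean."""
    return sum(abs(x - y) ** lam for x, y in zip(a, b)) ** (1 / lam)

a = [1.0, 2.0, 3.0]  # toy points
b = [4.0, 0.0, 3.0]

print(minkowski(a, b, 1))  # 5.0 (Manhattan)
print(minkowski(a, b, 2))  # 3.605... (Euclidean)
```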

Cosine Similarity

Finds how similar two variables X and Y are:

  1. Multiply each pair of values of X and Y.
  2. Sum the multiplied values (this is the dot product).
  3. For each variable, square its values, sum the squares, and take the square root (this is the magnitude).
  4. Multiply the two magnitudes together.
  5. Divide the sum from step 2 by the product from step 4.
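
A minimal sketch with two made-up vectors:

```python
import math

a = [1.0, 2.0, 3.0]  # toy vectors
b = [4.0, 0.0, 3.0]

dot = sum(x * y for x, y in zip(a, b))      # sum of the pairwise products
norm_a = math.sqrt(sum(x ** 2 for x in a))  # magnitude of each vector
norm_b = math.sqrt(sum(y ** 2 for y in b))

cosine_similarity = dot / (norm_a * norm_b)
print(cosine_similarity)  # 0.694...
```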

