Some Statistical Operations For Machine Learning
Swapnil Sharma
Co-founder & Backend Architect of SongGPT | CEO at OvaDrive | AI/ML Engineer
Introduction
Machine learning is an interdisciplinary field that uses statistical methods to extract meaningful insights and knowledge from large and complex datasets. ML algorithms leverage statistical operations to understand patterns in data, make predictions, and automate decision-making. In this article, we will discuss some of the statistical operations that are essential for machine learning.
1. Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. The most commonly used descriptive statistics are the mean, median, mode, standard deviation, and variance. Together they provide insight into the central tendency, dispersion, and shape of the data.
In ML, descriptive statistics support data preprocessing tasks such as data cleaning, normalization, and feature scaling. For example, we can use the mean and standard deviation to standardize the features of a dataset so that they have zero mean and unit variance, as in the sketch below.
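As a minimal sketch (assuming the features are stored as a NumPy array of numeric values), standardization can be done directly with NumPy:
import numpy as np
# A small toy feature matrix (rows = samples, columns = features)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
# Standardize each column: subtract the column mean and divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]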
a. Measures of Central Tendency:
The measures of central tendency describe where the data points are centred. The most common measures of central tendency are:
- Mean: It is the average of all the data points in a dataset. For example, if we have the following dataset: [1, 2, 3, 4, 5], then the mean would be (1+2+3+4+5)/5 = 3.
- Median: It is the middle value that separates the sorted dataset into two halves. For example, if we have the following dataset: [1, 2, 3, 4, 5], then the median would be 3.
- Mode: It is the value that appears most frequently in a dataset. For example, if we have the following dataset: [1, 2, 2, 3, 4, 4, 4, 5], then the mode would be 4.
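As a quick check of the examples above, these measures can be computed with Python's built-in statistics module:
from statistics import mean, median, mode
print(mean([1, 2, 3, 4, 5]))           # 3
print(median([1, 2, 3, 4, 5]))         # 3
print(mode([1, 2, 2, 3, 4, 4, 4, 5]))  # 4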
b. Measures of Dispersion:
The measures of dispersion describe how spread out the data points are. The most common measures of dispersion are:
- Variance: It measures the degree of spread in a dataset. It is calculated as the average of the squared differences from the mean. For example, if we have the following dataset: [1, 2, 3, 4, 5], then the variance would be ((1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2) / 5 = 2.
- Standard Deviation: It is the square root of the variance. It measures how much the data points deviate from the mean. For the same dataset as above, the standard deviation would be the square root of the variance, i.e., sqrt(2) ≈ 1.41.
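The same values can be verified with NumPy; note that np.var and np.std divide by n (the population formulas), which matches the calculation above:
import numpy as np
data = [1, 2, 3, 4, 5]
print(np.var(data))  # 2.0
print(np.std(data))  # 1.4142135623730951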
Example:
For descriptive statistics, we can use the well-known Iris dataset. The Iris dataset contains information on the length and width of sepals and petals for three species of iris flowers. We can use descriptive statistics to summarize the data and gain insights into the distribution of the variables.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset("iris")
# Summary statistics
iris.describe()
Output:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
2. Inferential Statistics
Inferential statistics are used to make inferences about a population based on a sample of data. Inferential statistics include hypothesis testing, confidence intervals, and regression analysis. These methods help us to estimate population parameters such as the mean, standard deviation, and correlation.
In ML, inferential statistics are used to test hypotheses about the relationship between variables and to estimate the performance of ML models. For example, we can use hypothesis testing to determine whether the difference between the mean values of two groups is significant or not.
a. t-tests:
t-tests are used to determine whether there is a significant difference between the means of two groups. For example, if we have two groups of students, one group that studied for a test and one group that did not, we can use a t-test to determine if there is a significant difference in their test scores.
Example:
from scipy.stats import ttest_ind
# Two groups of students, one that studied and one that did not
group1 = [75, 80, 85, 90, 95]
group2 = [60, 65, 70, 75, 80]
# Perform a t-test to determine if there is a significant difference between the two groups
t_stat, p_val = ttest_ind(group1, group2)
print("t-statistic: ", t_stat)
print("p-value: ", p_val)
In this example, we have two groups of students, one that studied and one that did not. We perform a t-test using the ttest_ind function from the scipy.stats library to determine if there is a significant difference between their test scores.
Output:
t-statistic: 3.0
p-value: 0.0170716812337826340
The t-test output includes the t-statistic and the p-value. The t-statistic measures the difference between the means of the two groups relative to the variation within the groups. A higher t-statistic indicates a larger difference between the means of the groups. The p-value is the probability of observing a t-statistic as extreme as the one computed, assuming that the null hypothesis (that there is no difference between the two groups) is true. A lower p-value indicates stronger evidence against the null hypothesis and in favor of the alternative hypothesis (that there is a significant difference between the two groups).
b. Chi-squared tests:
Chi-squared tests are used to test the independence of two categorical variables. For example, if we want to determine if there is a relationship between smoking and lung cancer, we can use a chi-squared test to determine if there is a significant association.
Example:
import numpy as np
from scipy.stats import chi2_contingency
# Two categorical variables: smoking and lung cancer
smoking = [100, 50]
no_smoking = [200, 300]
data = np.array([smoking, no_smoking])
# Perform a chi-squared test to determine if there is a significant association between smoking and lung cancer
chi2, p_val, dof, expected = chi2_contingency(data)
print("Chi-squared statistic: ", chi2)
print("p-value: ", p_val)
print("Degree of Freedom: ", dof)
print("Expected Frequency: ", expected)
In this example, we have two categorical variables, smoking and lung cancer, and we want to determine if there is a significant association between them. We perform a chi-squared test using the chi2_contingency function from the scipy.stats library. The output includes the chi-squared statistic, the p-value, the degrees of freedom, and the expected frequencies.
Output:
Chi-squared statistic: 31.951575396825405
p-value: 1.5806413010081755e-08
Degree of Freedom: 1
Expected Frequency: [[ 69.23076923 80.76923077] [230.76923077 269.23076923]]
The chi-squared test output includes the chi-squared statistic, the p-value, the degrees of freedom, and the expected frequencies. The chi-squared statistic measures the difference between the observed frequencies and the expected frequencies, assuming that there is no association between the two categorical variables. A higher chi-squared statistic indicates a larger deviation from the expected frequencies, and thus stronger evidence against the null hypothesis (that there is no association between the two variables). The p-value is the probability of observing a chi-squared statistic as extreme as the one computed, assuming that the null hypothesis is true. A lower p-value indicates stronger evidence against the null hypothesis and in favor of the alternative hypothesis (that there is a significant association between the two variables).
The degrees of freedom refers to the number of cells in the contingency table that are free to vary, i.e., the number of cells whose counts are not predetermined by the other cells. For a contingency table with r rows and c columns, the degrees of freedom is (r-1) x (c-1), because once the row and column marginal totals are fixed, the counts in the remaining cells are determined by the counts in the other cells.
The expected frequencies are the frequencies we would expect to observe in each cell of the contingency table if there were no association between the two categorical variables. They are calculated from the marginal totals of the table under the assumption of independence between the variables.
c. ANOVA:
ANOVA (Analysis of Variance) is used to compare the means of three or more groups. For example, if we have three groups of students, one group that studied for a test, one group that studied with flashcards, and one group that did not study at all, we can use ANOVA to determine if there is a significant difference in their test scores.
Example:
from scipy.stats import f_oneway
# Three groups of students, with different study methods
studied = [80, 85, 90, 95, 100]
flashcards = [70, 75, 80, 85, 90]
no_study = [50, 55, 60, 65, 70]
# Perform an ANOVA to determine if there is a significant difference in test scores between the three groups
f_stat, p_val = f_oneway(studied, flashcards, no_study)
print("F-statistic: ", f_stat)
print("p-value: ", p_val)
In this example, we have three groups of students with different study methods, and we want to determine if there is a significant difference in their test scores. We perform an ANOVA using the f_oneway function from the scipy.stats library. The output includes the F-statistic and the p-value.
Output:
F-statistic: 18.6666666666666
p-value: 0.00020713081415688798
The ANOVA output includes the F-statistic and the p-value. The F-statistic measures the variation between the means of the three or more groups relative to the variation within the groups. A higher F-statistic indicates a larger difference between the means of the groups. The p-value is the probability of observing an F-statistic as extreme as the one computed, assuming that the null hypothesis (that there is no difference between the means of the groups) is true. A lower p-value indicates stronger evidence against the null hypothesis and in favor of the alternative hypothesis (that there is a significant difference between the means of the groups).
3. Probability Distributions
Probability distributions are mathematical functions that describe the likelihood of different outcomes in a random event. The most commonly used probability distributions in ML are the normal distribution, the Bernoulli distribution, and the Poisson distribution.
a. Normal Distribution:
The normal distribution is a continuous probability distribution that is symmetric around its mean. It is often used to model naturally occurring phenomena, such as height or weight. The normal distribution has two parameters: the mean and the standard deviation.
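As a short sketch, scipy.stats.norm can be used to work with a normal distribution; the mean and standard deviation below are arbitrary example values (roughly human heights in cm):
from scipy.stats import norm
mu, sigma = 170, 10  # example mean and standard deviation
samples = norm.rvs(loc=mu, scale=sigma, size=1000, random_state=0)  # draw random samples
print(samples.mean(), samples.std())       # close to 170 and 10
print(norm.pdf(170, loc=mu, scale=sigma))  # density at the mean
print(norm.cdf(180, loc=mu, scale=sigma))  # P(X <= 180), about 0.84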
b. Bernoulli Distribution
The Bernoulli distribution is a discrete probability distribution that models the probability of a binary outcome (e.g., success or failure). It has a single parameter, p, which represents the probability of success.
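A minimal sketch with scipy.stats.bernoulli, assuming an example success probability of 0.7:
from scipy.stats import bernoulli
p = 0.7  # example probability of success
samples = bernoulli.rvs(p, size=1000, random_state=0)   # simulate 1000 binary outcomes
print(samples.mean())                                   # close to 0.7
print(bernoulli.pmf(1, p), bernoulli.pmf(0, p))         # P(success), P(failure)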
c. Poisson Distribution
The Poisson distribution is a discrete probability distribution that models the number of occurrences of an event in a given time or space interval. It has a single parameter, lambda, which represents the rate of occurrence of the event.
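A minimal sketch with scipy.stats.poisson, assuming an example rate of 3 events per interval:
from scipy.stats import poisson
lam = 3  # example rate: 3 events per interval
samples = poisson.rvs(mu=lam, size=1000, random_state=0)  # simulate event counts
print(samples.mean())                                     # close to 3
print(poisson.pmf(5, mu=lam))                             # probability of exactly 5 events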
4. Bayesian Inference
Bayesian inference is a statistical framework for updating our beliefs about the probability of a hypothesis as new evidence arrives. It combines prior knowledge about the hypothesis with the likelihood of the observed data, and uses Bayes' theorem to compute the posterior probability of the hypothesis given that evidence.
For example, if we want to determine the probability of a customer buying a product based on their age and income, we can use Bayesian inference to update our prior probability based on new data. We can use Bayes' theorem to calculate the posterior probability of the hypothesis based on the likelihood of the data and the prior probability of the hypothesis.
As an example, we can use Bayesian inference to estimate the probability of success in a Bernoulli trial, given observed data, by placing a Beta prior on that probability and updating it with the data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Observed data: 60 successes out of 100 Bernoulli trials
n_trials = 100
n_successes = 60
# Beta prior on the probability of success
prior_alpha = 2
prior_beta = 2
# Conjugate update: Beta prior + binomial likelihood gives a Beta posterior
posterior_alpha = prior_alpha + n_successes
posterior_beta = prior_beta + n_trials - n_successes
# Evaluate the prior and posterior densities over [0, 1]
x = np.linspace(0, 1, 100)
prior = beta.pdf(x, prior_alpha, prior_beta)
posterior = beta.pdf(x, posterior_alpha, posterior_beta)
plt.plot(x, prior, label='Prior')
plt.plot(x, posterior, label='Posterior')
plt.legend()
plt.show()
This code shows how to use the beta distribution to update our beliefs about the probability of a certain event happening, given some observed data.
Specifically, the code assumes we observed 100 coin flips (n_trials = 100), of which 60 landed heads up (n_successes = 60).
We start with some prior belief about the probability of the coin landing heads up, represented by a beta distribution with parameters (prior_alpha = 2, prior_beta = 2). The shape of this distribution reflects our uncertainty about the probability.
Then, we update our belief about the probability of the coin landing heads up using the beta distribution again, but this time with the updated parameters (posterior_alpha = prior_alpha + n_successes, posterior_beta = prior_beta + n_trials - n_successes). This is known as the posterior distribution.
Finally, we plot the prior and posterior distributions using matplotlib, which allows us to visualize how our beliefs have changed after observing the data.
In this example, we can see that our prior belief about the probability of the coin landing heads up was relatively spread out, reflecting our uncertainty. After observing the data, our posterior belief is more concentrated around the observed proportion of heads, which is 60% in this case.
5. Correlation Analysis:
Correlation analysis is used to measure the strength and direction of the relationship between two variables. Correlation coefficients can be calculated to determine the degree to which two variables are linearly related.
a. Pearson Correlation Coefficient:
The Pearson correlation coefficient is used to measure the linear relationship between two continuous variables. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. For example, if we want to determine if there is a correlation between the number of hours of sleep and the level of productivity, we can calculate the Pearson correlation coefficient to determine the strength and direction of the relationship.
b. Spearman Rank Correlation Coefficient:
The Spearman rank correlation coefficient is used to measure the relationship between two variables when one or both are measured on an ordinal scale. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. For example, if we want to determine if there is a correlation between the rank of the students in a class and their scores on a test, we can calculate the Spearman rank correlation coefficient to determine the strength and direction of the relationship.
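As a minimal sketch (using made-up hours-of-sleep and productivity scores), both coefficients from the two subsections above can be computed with scipy.stats:
from scipy.stats import pearsonr, spearmanr
# Hypothetical data: hours of sleep and a productivity score
sleep = [5, 6, 7, 8, 9]
productivity = [55, 60, 70, 75, 85]
pearson_r, pearson_p = pearsonr(sleep, productivity)
spearman_r, spearman_p = spearmanr(sleep, productivity)
print("Pearson r: ", pearson_r)       # close to 1: strong positive linear relationship
print("Spearman rho: ", spearman_r)   # 1.0: the ranks are in perfect agreement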
6. Regression Analysis:
Regression analysis is used to model the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.
a. Linear Regression:
Linear regression is used when there is a linear relationship between the dependent variable and the independent variables. It is used to predict a continuous variable. For example, if we want to predict the salary of an employee based on their years of experience, we can use linear regression to model the relationship between these two variables.
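A minimal sketch with made-up years-of-experience and salary figures, using scikit-learn's LinearRegression:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical data: years of experience vs. salary (in thousands)
X = np.array([[1], [2], [3], [4], [5]])  # independent variable must be 2-D
y = np.array([45, 50, 60, 65, 75])
model = LinearRegression()
model.fit(X, y)
print("Slope: ", model.coef_[0])         # salary increase per year of experience
print("Intercept: ", model.intercept_)
print("Predicted salary for 6 years: ", model.predict([[6]])[0])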
b. Logistic Regression:
Logistic regression is used to model the relationship between a binary dependent variable and one or more independent variables. It assumes a linear relationship between the independent variables and the log odds of the dependent variable. For example, if we want to predict whether a customer will buy a product or not based on their age, gender, and income, we can use logistic regression to model the relationship between the independent variables and the probability of buying the product.
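A minimal sketch with made-up customer data (age, income) and a binary "bought" label, using scikit-learn's LogisticRegression:
import numpy as np
from sklearn.linear_model import LogisticRegression
# Hypothetical data: [age, income in thousands] and whether the customer bought the product
X = np.array([[22, 30], [25, 35], [35, 60], [45, 80], [50, 90], [28, 40]])
y = np.array([0, 0, 1, 1, 1, 0])
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print("Predicted class: ", model.predict([[40, 70]])[0])               # 0 = no purchase, 1 = purchase
print("Purchase probability: ", model.predict_proba([[40, 70]])[0, 1])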
7. Clustering analysis
Clustering analysis is used to group similar data points together based on their features, without prior knowledge of what those groups might be. Clustering algorithms can identify patterns in the data that may not be apparent through visual inspection, and the resulting subgroups may require different modelling approaches, which makes clustering an important technique in machine learning.
a. K-Means Clustering:
K-means clustering is a popular method used to cluster data points into k number of clusters. It involves selecting k initial centroids and then iteratively assigning data points to the closest centroid and updating the centroids until the clusters converge. For example, if we want to cluster customers based on their purchasing behaviour, we can use k-means clustering to group similar customers together based on their purchasing patterns.
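A minimal sketch with made-up two-feature customer data (annual spend and number of purchases), using scikit-learn's KMeans:
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical customer features: [annual spend, number of purchases]
X = np.array([[100, 2], [120, 3], [110, 2],
              [500, 20], [520, 22], [510, 21]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Cluster labels: ", labels)                # e.g., [0 0 0 1 1 1]
print("Cluster centres: ", kmeans.cluster_centers_)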
b. Hierarchical Clustering:
Hierarchical clustering is another method used to cluster data points into groups. It involves building a tree-like structure of clusters based on the similarity of the data points. For example, if we want to cluster patients based on their medical history, we can use hierarchical clustering to group similar patients together based on their medical conditions and treatments.
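A minimal sketch of agglomerative (hierarchical) clustering on the same kind of made-up data, using scikit-learn's AgglomerativeClustering:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# Hypothetical patient features (e.g., two standardized clinical measurements)
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [8.0, 8.5], [8.2, 8.4], [7.9, 8.6]])
clustering = AgglomerativeClustering(n_clusters=2)
labels = clustering.fit_predict(X)
print("Cluster labels: ", labels)  # two groups of similar patients, e.g., [0 0 0 1 1 1]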
8. Dimensionality reduction
Dimensionality reduction is used to reduce the number of features in a dataset while retaining the important information. It lowers computational complexity, eliminates irrelevant or redundant features, and helps overcome the curse of dimensionality, which can improve the performance of machine learning models. Principal component analysis (PCA) is the most common technique used for dimensionality reduction in machine learning.
a. Principal Component Analysis (PCA):
PCA is a popular method used for dimensionality reduction. It involves identifying the directions of maximum variance in the data and projecting the data onto those directions to create a new set of uncorrelated variables. For example, if we want to reduce the number of features in a dataset, we can use PCA to identify the most important features and reduce the dimensionality of the data.
The following code loads the Iris dataset and applies PCA to reduce the dimensionality of the data to 2 dimensions. It then plots the results as a scatter plot, coloring each point according to its class label.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# Load the iris dataset
iris = load_iris()
X = iris.data
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot the results
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
The plot shows the distribution of the data in a two-dimensional space, with the x-axis representing the first principal component and the y-axis representing the second principal component. Each point in the plot corresponds to an observation in the dataset, and its color represents the target class (species) of the observation.
We can see that the plot shows a clear separation between the different species of iris flowers, indicating that the two principal components have captured most of the variation in the data. The setosa species is clustered separately from the other two, while the versicolor and virginica species have some overlap between them.
Overall, the PCA plot provides a useful visualization of the distribution of the iris dataset in a reduced two-dimensional space, which can help in better understanding and analyzing the data.
b. t-SNE:
t-SNE (t-Distributed Stochastic Neighbor Embedding) is another method used for dimensionality reduction. It is particularly useful for visualizing high-dimensional data in a low-dimensional space. It involves creating a probability distribution over pairs of high-dimensional objects and a probability distribution over pairs of low-dimensional points and then minimizing the Kullback-Leibler divergence between the two distributions. For example, if we want to visualize a high-dimensional dataset, we can use t-SNE to create a low-dimensional representation of the data that preserves the structure of the high-dimensional space.
The following code loads the digits dataset and applies t-SNE to reduce the dimensionality of the data to 2 dimensions. It then plots the results as a scatter plot, coloring each point according to its class label.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
X_tsne = tsne.fit_transform(X)
# Plot the results
import matplotlib.pyplot as plt
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
The plot shows a low-dimensional representation of the data that preserves the structure of the high-dimensional space. We can see that the clusters are more clearly separated compared to the original high-dimensional data. This makes it easier to visualize and analyze the data. However, it's important to note that t-SNE is not guaranteed to preserve global structure, only local structure, so it should be used with caution and with consideration of the specific problem at hand.
Conclusion:
In summary, these statistical operations play an important role in machine learning. They help us understand the relationship between variables, model the relationship between variables, test hypotheses, and make more accurate predictions. Understanding these statistical concepts and applying them appropriately can help us build better machine learning models and make more informed decisions.