Statistics for Machine Learning

Statistics is a collection of tools and methods used to derive meaningful insights by performing mathematical computations on data. For example, if you eat different numbers of fries each day, statistics can help you predict how many fries you’ll get tomorrow and how confident you can be in that prediction.

Population and Sample:

  • Population: This refers to the entire set of individuals relevant to a particular statistical question.
  • Sample: A smaller group selected from the population for analysis. Samples are used when it’s impractical to study the entire population.

Parameters and Statistics:

  • Parameter: A descriptive measure of the population.
  • Statistic: A descriptive measure of the sample.

Sampling and Sampling Methods:

Sampling is a method used to select a subset from the population for study.

  • Probability Sampling: Every individual has a known, non-zero chance of being selected, determined by random chance.
  • Non-Probability Sampling: Selection is not based on random chance, so some individuals have a greater (or unknown) chance of being selected than others.

Types of Probability Sampling:

  • Simple Random Sampling: Every unit in the population has an equal chance of selection. This can be done with or without replacement.
  • Stratified Sampling: The population is divided into homogeneous, non-overlapping subgroups (strata) based on a common factor, and samples are randomly selected from each stratum. The strata should form a partition of the population, which ensures representation from every group.

  • Cluster Sampling: The population is divided into clusters; then either entire clusters are selected at random (one-stage), or clusters are selected and individuals are sampled from within them (two-stage).

  • Systematic Sampling: Individuals are selected at regular intervals from a list of the population, often after a random start.

  • Multi-Stage Sampling: Combines several sampling methods, usually in stages, to gather more comprehensive or practical samples.
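As a rough sketch of the first three methods above, here is how they might look in code. The population of 12 student IDs and the junior/senior strata are made up for illustration:

```python
import random

# Hypothetical population of 12 student IDs, tagged by grade level (the strata).
population = [("s%02d" % i, "junior" if i < 7 else "senior") for i in range(12)]

random.seed(0)

# Simple random sampling: every unit has an equal chance of selection.
simple = random.sample(population, k=4)

# Stratified sampling: partition by grade level, then sample within each stratum.
strata = {}
for unit in population:
    strata.setdefault(unit[1], []).append(unit)
stratified = [u for group in strata.values() for u in random.sample(group, k=2)]

# Systematic sampling: every k-th unit after a random start.
k = 3
start = random.randrange(k)
systematic = population[start::k]

print(len(simple), len(stratified), len(systematic))  # 4 4 4
```

Note how stratified sampling guarantees two units from each grade level, while simple random sampling might by chance pick all juniors.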

Sampling Errors:

  • Sampling error refers to the discrepancy between the sample result and the actual population parameter due to the sample being only a part of the entire population.


Descriptive Statistics: Measures of Central Tendency

These are methods used to find the “center” or typical value in a dataset:

Mean (Average):

  • It’s the total sum of all numbers divided by how many numbers there are.
  • Example: If you have test scores of 80, 85, and 90, the mean is (80 + 85 + 90) / 3 = 85.
  • It’s the most commonly used method but sensitive to outliers (extremely high or low values).
  • Use when the data is numerical and evenly distributed (no outliers).

Median:

  • The middle value in a sorted list.
  • Example: In the scores 70, 80, 85, 90, and 95, the median is 85.
  • It’s useful when your data has outliers, as it’s not affected by extreme values.
  • Use when there are outliers, like house prices, where a few extremely high prices can distort the average.

Mode:

  • The most frequent value in the dataset.
  • Example: In the numbers 2, 4, 4, 6, and 8, the mode is 4 because it appears the most.
  • It’s commonly used for categorical data (e.g., most common fruit people like).
  • Use for categorical data, like finding the most popular color or brand.
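The three measures above can be computed directly with Python's standard library, using the example numbers from the text plus one made-up outlier to show the mean's sensitivity:

```python
from statistics import mean, median, mode

print(mean([80, 85, 90]))            # 85
print(median([70, 80, 85, 90, 95]))  # 85
print(mode([2, 4, 4, 6, 8]))         # 4

# The mean is sensitive to outliers; the median is not:
with_outlier = [80, 85, 90, 300]
print(mean(with_outlier))            # 138.75
print(median(with_outlier))          # 87.5
```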

Measures of Spread (Variability)

These measures tell you how much the data is spread out or varies from the center.

We measure spread with four quantities: range, IQR, variance, and standard deviation.

Range:

  • The difference between the highest and lowest values.
  • Example: In scores 50, 60, 70, 80, 90, the range is 90–50 = 40.
  • It’s simple but can be misleading if there are outliers.

Interquartile Range (IQR):

  • The range of the middle 50% of data (between the 25th and 75th percentiles).
  • Example: If you split your test scores into quarters, IQR measures the range of the middle two quarters, giving a better idea of data spread, without being affected by outliers.

Variance:

  • Tells how much each data point differs from the mean. The larger the variance, the more spread out the data is.

  • Example: If most students’ test scores are close to the average, the variance is low. If they vary a lot, it’s high.
  • It’s measured in squared units, which can sometimes make it hard to interpret.

Standard Deviation:

  • This is just the square root of the variance, which brings the units back to the original scale.
  • Example: If test scores vary widely, the standard deviation will be large. If they are close to the average, it will be small.
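All four measures of spread can be sketched with the standard library's statistics module, using the score list from the range example:

```python
from statistics import pvariance, pstdev, quantiles

scores = [50, 60, 70, 80, 90]

# Range: highest value minus lowest value.
data_range = max(scores) - min(scores)   # 90 - 50 = 40

# IQR: 75th percentile minus 25th percentile.
q1, q2, q3 = quantiles(scores, n=4)      # quartile cut points
iqr = q3 - q1

# Variance (squared units) and standard deviation (original units).
var = pvariance(scores)
sd = pstdev(scores)

print(data_range, iqr, var, round(sd, 2))  # 40 30.0 200 14.14
```

Note that the standard deviation (about 14.1) is simply the square root of the variance (200), which is why it is easier to interpret on the original score scale.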

Measures of Position (Percentiles and Quartiles)

These measures determine where a value sits in the dataset relative to others:

Percentile:

  • It tells you the percentage of data points below a certain value.
  • Example: If you’re in the 90th percentile on a test, you scored better than 90% of students.

Quartile:

  • The dataset is divided into four equal parts (quarters). The median is the 50th percentile or second quartile.
  • The 1st quartile (25th percentile) and 3rd quartile (75th percentile) help understand the spread of data.

Z-Score (Standard Score)

  • This indicates how far away a value is from the mean, in terms of standard deviations.

  • Example: If a student’s z-score is 2, it means they scored 2 standard deviations above the mean (better than average). A negative z-score means the score is below average.
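The three position measures can be sketched together in a few lines. The ten scores are hypothetical, and `percentile_rank` is a helper defined here (not a library function) that counts the fraction of data points below a value:

```python
from statistics import mean, pstdev, quantiles

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

def percentile_rank(data, value):
    """Percentage of data points strictly below `value`."""
    return 100 * sum(x < value for x in data) / len(data)

def z_score(data, value):
    """Distance of `value` from the mean, in standard-deviation units."""
    return (value - mean(data)) / pstdev(data)

# Quartiles split the sorted data into four equal parts.
q1, q2, q3 = quantiles(scores, n=4)

print(percentile_rank(scores, 95))    # 80.0 -> better than 80% of scores
print(q2)                             # 77.5, the median (50th percentile)
print(round(z_score(scores, 95), 2))  # positive -> above the mean
```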


Descriptive vs. Inferential Statistics

  • Descriptive Statistics: This is used to summarize or describe the main features of a dataset. For example, if you measure the heights of 10 people and calculate the average, that’s descriptive statistics.
  • Inferential Statistics: This is used to make predictions or inferences about a population based on a sample of data. For example, if you want to predict the average height of all people in a city based on the 10 people you measured, you’re using inferential statistics.

Types of Inferential Statistics

Estimating Parameters: We use data from a sample to estimate unknown values about a population.

  • Example: You sample 100 people’s test scores to estimate the average score for all students in a school.

Hypothesis Testing: This allows us to test whether certain claims about a population are true or not.

  • Example: Testing whether a new teaching method improves student test scores by comparing scores of students taught with the new method to those taught with the old one.

Probability and Random Variables

Probability Distribution: This is a function that describes all possible outcomes of a random event and how often they occur. For example, if we know that 70% of people prefer pumpkin pie and 30% prefer blueberry pie, we can use this information to estimate how likely certain outcomes are when we ask people about their pie preference.

Random Variables: A random variable is a value that depends on the outcome of a random event. It can be:

  • Discrete: Can take only specific values (e.g., the number of heads in coin tosses).
  • Continuous: Can take any value within a range (e.g., a person’s height).

Types of Probability Distributions

Discrete Probability Distributions: Used for countable outcomes. Example: The number of times a six appears when rolling a die.

Types:

  • Binomial Distribution: For events with two outcomes (e.g., success or failure).
  • Poisson Distribution: For counting events over a certain time or space (e.g., number of phone calls in an hour).

Continuous Probability Distributions: Used for measurements. Example: Measuring people’s heights.

  • The most common continuous distribution is the Normal Distribution (bell curve), where most data is around the mean.
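The two discrete distributions above have simple closed-form probability mass functions, which can be written with the standard library's math module. The example values (10 die rolls, 5 calls per hour on average) are chosen to match the bullets:

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(exactly k events when events occur at an average rate lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Chance of exactly 2 sixes in 10 rolls of a fair die:
print(round(binomial_pmf(2, 10, 1/6), 4))  # 0.2907
# Chance of exactly 3 phone calls in an hour, if the average is 5 per hour:
print(round(poisson_pmf(3, 5), 4))         # 0.1404
```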


Normal Distribution

  • Normal Distribution (also called Gaussian or bell curve) is a common type of probability distribution where most data points are clustered around the mean.

  • Example: In a class, most students score close to the average, with a few scoring much higher or lower.
  • Characteristics: Mean = Median = Mode (they are all the same).
  • Symmetrical around the mean.
  • The area under the curve represents all possible outcomes, and it adds up to 1.
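One well-known consequence of these characteristics is the 68-95 rule: roughly 68% of values fall within one standard deviation of the mean and 95% within two. A quick simulation can check this; the mean of 70 and standard deviation of 10 are made-up exam-score parameters:

```python
import random

random.seed(1)
# Simulate 100,000 draws from a normal distribution (hypothetical exam
# scores with mean 70 and standard deviation 10).
draws = [random.gauss(70, 10) for _ in range(100_000)]

# Fraction of draws within one and two standard deviations of the mean:
within_1sd = sum(60 <= x <= 80 for x in draws) / len(draws)
within_2sd = sum(50 <= x <= 90 for x in draws) / len(draws)
print(round(within_1sd, 2), round(within_2sd, 2))  # close to 0.68 and 0.95
```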

Sampling Distribution and Central Limit Theorem

Sampling Distribution: When you take multiple samples from a population and calculate their means, the distribution of these sample means forms the sampling distribution.

  • Example: If you repeatedly measure the average height of different groups of 30 students, the average of those averages forms the sampling distribution.

Central Limit Theorem (CLT): This is a key principle in inferential statistics that says as the sample size grows, the sampling distribution of the sample mean becomes approximately normal, regardless of the population’s distribution.

  • Example: If you take a large enough sample from any population (e.g., heights of people), the average height of the samples will form a normal distribution, even if the original data was skewed.
  • Significance: The CLT allows us to make inferences about the population mean using the sample mean. The larger the sample, the more accurate the estimate of the population mean.
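The CLT can be demonstrated with a short simulation. Here the population is exponentially distributed (heavily skewed) with mean 1.0, a distribution chosen purely for illustration; the means of many size-30 samples still cluster tightly around the population mean:

```python
import random
from statistics import mean

random.seed(42)
population_mean = 1.0  # the exponential distribution below has mean 1.0

# Take 2,000 samples of size 30 from a skewed population and record each
# sample's mean.
sample_means = [mean(random.expovariate(1.0) for _ in range(30))
                for _ in range(2000)]

# The sample means cluster around the population mean even though the
# underlying distribution is far from normal.
print(round(mean(sample_means), 2))  # close to 1.0
```

Plotting `sample_means` as a histogram would show the familiar bell shape, despite the skew of the underlying exponential data.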


Models in Machine Learning

A model is a mathematical equation that represents relationships in the data and helps us make predictions. For example, a model could predict a person’s height based on their weight.

Example:

If the model is Height = 0.5 + 0.8 × Weight and someone weighs 2.1 units, you’d predict their height as 0.5 + (0.8 × 2.1) = 2.18.

Sum of Squared Residuals (SSR)

Residuals are the differences between the actual data points and the model’s predictions. The Sum of Squared Residuals (SSR) is a way to measure how well a model fits the data: smaller SSR values mean better fit.

Example:

If you predict someone’s height and compare it to their actual height, the difference is a residual. Squaring and summing all residuals gives you the SSR, which tells you how far off your model is overall.

Mean Squared Error (MSE)

The Mean Squared Error (MSE) is the SSR divided by the number of data points, which makes it easier to compare how well a model performs across datasets of different sizes.

Example:

If the SSR for a small dataset is 14, and you have 3 data points, the MSE is 14/3 ≈ 4.67.

R-Squared (R²)

R-squared is a measure of how well a model fits the data, ranging from 0 to 1.

A higher R² value indicates a better fit: R² gives the proportion of the variation in the data that is explained by the model.

Example:

If you use height to predict weight, R² tells you how much of the weight variation can be explained by height. An R² value of 0.7 means 70% of the variation in weight is explained by height.
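The three fit measures, SSR, MSE, and R², can be computed together. The observed heights below are made up; the model is the Height = 0.5 + 0.8 × Weight example from earlier:

```python
from statistics import mean

# Hypothetical data and the example model: height = 0.5 + 0.8 * weight
weights = [1.0, 2.0, 3.0]
heights = [1.4, 2.0, 3.0]                     # observed values (made up)
predicted = [0.5 + 0.8 * w for w in weights]  # 1.3, 2.1, 2.9

# SSR: sum of squared residuals (observed minus predicted).
ssr = sum((y - p) ** 2 for y, p in zip(heights, predicted))

# MSE: SSR divided by the number of data points.
mse = ssr / len(heights)

# R^2: 1 minus SSR over the total sum of squares around the mean.
sst = sum((y - mean(heights)) ** 2 for y in heights)
r2 = 1 - ssr / sst

print(round(ssr, 3), round(mse, 3), round(r2, 3))  # 0.03 0.01 0.977
```

Here an R² of about 0.977 means the model explains roughly 98% of the variation in the observed heights.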


What is a Hypothesis?

A hypothesis is a statement or assumption about a population that we want to test. It’s an idea made from limited evidence, serving as the starting point for further investigation.

Examples:

  • Hypothesis 1: Will I improve my grades if I study 4 hours every day?
  • Hypothesis 2: Does having breakfast help children perform better in school?
  • Hypothesis 3: Is the average time that college students spend studying 20 hours per week?

In statistics, we use sample data to test these hypotheses and see whether the results support or reject the initial assumption.

Hypothesis Testing

Hypothesis testing is a formal procedure that uses sample data to evaluate the credibility of a hypothesis about a population. In simple terms, it’s a rule that helps decide whether to accept or reject a claim based on the evidence provided by the data.

Example:

Imagine a school claims that their students score at least 70% on average in exams. To test this claim, we collect data from a sample of students and use hypothesis testing to verify if the school’s claim holds up.

Null and Alternative Hypotheses

In hypothesis testing, we always start with two hypotheses:

Null Hypothesis (H₀): The null hypothesis is the assumption that there is no effect or no difference. It represents the claim or statement we are testing. It can include: =, ≥, ≤

  • Example: “The average score of students is at least 70%.”

Alternative Hypothesis (H₁ or Hₐ): This hypothesis directly contradicts the null hypothesis and represents the effect or difference we are trying to find. It can include: ≠, >, <

  • Example: “The average score of students is less than 70%.”

Level of Significance (Alpha)

The level of significance, denoted by α (alpha), is the threshold used to determine whether to reject the null hypothesis. It represents the probability of rejecting the null hypothesis when it is actually true. Common values for α are 0.01, 0.05, or 0.1.

  • Example: If α = 0.05, this means that there’s a 5% risk of rejecting the null hypothesis when it is actually true.

Test Statistics and p-Value

To decide whether to accept or reject the null hypothesis, we rely on:

Test Statistic: A value calculated from the sample data that is used to make decisions in hypothesis testing. There are several types of test statistics:

  • Z-test (used when the sample size is large and the population standard deviation is known)
  • T-test (used when the sample size is small or the population standard deviation is unknown)
  • ANOVA (used to compare more than two groups)
  • Chi-square test (used for categorical data)

p-Value: The p-value measures the strength of evidence against the null hypothesis. A smaller p-value indicates stronger evidence to reject the null hypothesis.

  • Example: A p-value less than α=0.05 suggests that we reject the null hypothesis, meaning there is significant evidence to support the alternative hypothesis.

p-Values

A p-value helps determine whether the results of an experiment are significant or just due to random chance. A p-value less than 0.05 typically indicates that the result is statistically significant.

Example:

Imagine we had two antiviral drugs, A and B, and we wanted to know if they were different.

Suppose we ran the experiment with lots and lots of people, and Drug A cured almost everyone while Drug B hardly cured anyone. It’s possible that some of the people taking Drug A were actually cured by placebo, and some of the people taking Drug B were not cured because they had a rare allergy, but there are just too many people cured by Drug A, and too few cured by Drug B, for us to seriously think that these results are random and that Drug A is no different from Drug B.

The harder case is when the results are close. Suppose 37% of the people who took Drug A were cured, compared to 31% who took Drug B. Drug A cured a larger percentage of people, but given that no study is perfect and there are always a few random things that happen, how confident can we be that Drug A is different from Drug B? This is where p-values come in. p-values are numbers between 0 and 1 that, in this example, quantify how confident we should be that Drug A is different from Drug B. The closer a p-value is to 0, the more confidence we have that Drug A and Drug B are different. So, the question is, “how small does a p-value have to be before we’re sufficiently confident that Drug A is different from Drug B?” In other words, what threshold can we use to make a good decision about whether these drugs are different?

In practice, a commonly used threshold is 0.05. It means that if there’s no difference between Drug A and Drug B, and we did this exact same experiment many times, then only 5% of those experiments would result in the wrong decision (a false positive).
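The 37%-vs-31% comparison can be checked with a small permutation test, a simulation-based way of computing a p-value. The group sizes (1,000 people per drug) are made up for illustration; only the cure rates come from the example:

```python
import random

random.seed(7)
# Hypothetical counts matching the example percentages:
# 37% of 1,000 people cured on Drug A vs 31% of 1,000 on Drug B.
n_a, n_b = 1000, 1000
cured_a, cured_b = 370, 310
observed_diff = cured_a / n_a - cured_b / n_b  # 0.06
total_cured = cured_a + cured_b

# Under the null hypothesis the drugs are identical, so pool everyone
# and reshuffle who ends up in which group.
outcomes = [1] * total_cured + [0] * (n_a + n_b - total_cured)

trials, extreme = 5_000, 0
for _ in range(trials):
    random.shuffle(outcomes)
    perm_cured_a = sum(outcomes[:n_a])
    diff = perm_cured_a / n_a - (total_cured - perm_cured_a) / n_b
    if abs(diff) >= observed_diff:
        extreme += 1

p_value = extreme / trials
print(p_value)  # well under 0.05, so we reject "no difference"
```

With these counts the simulated p-value comes out far below 0.05, so a 6-percentage-point gap in groups of 1,000 is very unlikely to be random chance alone.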

Types of Tests: One-Tailed vs. Two-Tailed

Hypothesis tests can be either one-tailed or two-tailed, depending on the nature of the hypothesis.

One-tailed test: Used when we are interested in determining if a parameter is either greater than or less than a certain value.

  • Example: Testing whether students score less than 70% (left-tailed test) or more than 70% (right-tailed test).

Two-tailed test: Used when we are interested in testing whether a parameter is different from a specific value (it can be either higher or lower).

  • Example: Testing whether the average life of a car tire is exactly 36 months or not.


Critical Value and Rejection Region

The critical value is the point that separates the rejection region from the region where we fail to reject the null hypothesis. If the test statistic falls within the rejection region, we reject the null hypothesis.

Example:

  • If the critical value is 1.96 (for a two-tailed test at α=0.05), and the test statistic is greater than 1.96 or less than -1.96, we reject the null hypothesis.
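As a sketch of the critical-value decision in code, here is a two-tailed z-test using only the standard library (the normal CDF is built from `math.erf`). The sample numbers, n = 50 students, sample mean 67, claimed mean 70, population standard deviation 8, are hypothetical:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical data for testing a school's claim that the mean score is 70.
n, sample_mean, mu0, sigma = 50, 67.0, 70.0, 8.0

# Test statistic: how many standard errors the sample mean is from the claim.
z = (sample_mean - mu0) / (sigma / sqrt(n))

# Two-tailed p-value: probability of a result at least this extreme.
p_two_tailed = 2 * (1 - normal_cdf(abs(z)))

# Two-tailed test at alpha = 0.05: reject if |z| exceeds the critical value 1.96.
reject = abs(z) > 1.96
print(round(z, 2), round(p_two_tailed, 4), reject)
```

Here z is about -2.65, which falls in the rejection region (beyond -1.96), so we would reject the null hypothesis at α = 0.05.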


Types of Errors

There are two types of errors that can occur in hypothesis testing:

  1. Type 1 Error: This occurs when we reject a true null hypothesis. It is also called a false positive, and the probability of making this error is the level of significance α.

  • Example: If α=0.05, there’s a 5% chance of rejecting the null hypothesis when it is actually true.

  2. Type 2 Error: This happens when we fail to reject a false null hypothesis. It is also called a false negative, and the probability of making this error is denoted by β.

  • Example: Failing to detect a difference in student performance when a real difference exists.


References:

  1. The discussion of p-values is based on Prof. Josh Starmer’s YouTube channel, StatQuest with Josh Starmer.
  2. Coding Ninjas Data Science and Machine Learning Course.

