Statistics for Machine Learning
RISHABH SINGH
Actively looking for Full-time Opportunities in AI/ML/Robotics | Ex-Algorithms & ML Engineer @ Dynocardia Inc | Computer Vision Research Assistant & Robotics Graduate Student @Northeastern University
Statistics is a collection of tools and methods used to derive meaningful insights by performing mathematical computations on data. For example, if you eat a different number of fries each day, statistics can help you predict how many fries you’ll eat tomorrow and how confident you can be in that prediction.
Population and Sample:
Parameters and Statistics:
Sampling and Sampling Methods:
Sampling is a method used to select a subset from the population for study.
Types of Probability Sampling:
Sampling Errors:
Descriptive Statistics: Measures of Central Tendency
These are methods used to find the “center” or typical value in a dataset:
Mean (Average):
Median:
Mode:
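As a quick sketch with a made-up list of daily fry counts (my numbers, not the article’s), Python’s standard library computes all three:

```python
# Central tendency for a small, hypothetical dataset of daily fry counts.
from statistics import mean, median, mode

fries = [12, 15, 15, 18, 90]  # 90 is an outlier

print(mean(fries))    # 30.0 -- the mean is pulled up by the outlier
print(median(fries))  # 15   -- the median is robust to the outlier
print(mode(fries))    # 15   -- the most frequent value
```

Note how one outlier drags the mean far from the "typical" value while the median and mode stay put.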
Measures of Spread (Variance)
These measures tell you how much the data is spread out or varies from the center.
We measure spread using four quantities: Range, IQR, Variance, and Standard Deviation.
Range:
Interquartile Range (IQR):
Variance:
Standard Deviation:
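A minimal sketch of all four measures on a made-up dataset, using only the standard library (`statistics.quantiles` with its default method for the quartiles):

```python
# Spread measures for a small, hypothetical dataset.
from statistics import pvariance, pstdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)  # Range: max minus min -> 7
q1, q2, q3 = quantiles(data, n=4)   # quartiles Q1, Q2 (the median), Q3
iqr = q3 - q1                       # IQR: spread of the middle 50%
var = pvariance(data)               # population variance -> 4.0
std = pstdev(data)                  # population standard deviation -> 2.0
```

The standard deviation is simply the square root of the variance, which puts it back in the same units as the data.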
Measure of Position (Percentiles and Quartiles)
These measures determine the position of a value in the dataset relative to others:
Percentile:
Quartile:
Z-Score (Standard Score)
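The z-score standardizes a value: z = (x − μ) / σ, the number of standard deviations x lies from the mean. A small sketch with hypothetical exam scores:

```python
# Z-score: how many standard deviations a value lies from the mean.
from statistics import mean, pstdev

scores = [60, 70, 80, 90, 100]  # hypothetical exam scores
mu, sigma = mean(scores), pstdev(scores)

def z_score(x):
    return (x - mu) / sigma

print(z_score(80))   # 0.0 -- the mean itself
print(z_score(100))  # ~1.41 -- above the mean, so a positive z-score
```

Values below the mean get negative z-scores, values above it positive ones, which makes scores from different scales directly comparable.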
Descriptive vs. Inferential Statistics
Types of Inferential Statistics
Estimating Parameters: We use data from a sample to estimate unknown values about a population.
Hypothesis Testing: This allows us to test whether certain claims about a population are true or not.
Probability and Random Variables
Probability Distribution: This is a function that describes all possible outcomes of a random event and how often they occur. For example, if we know that 70% of people prefer pumpkin pie and 30% prefer blueberry pie, we can use this information to estimate how likely certain outcomes are when we ask people about their pie preference.
Random Variables: A random variable is a value that depends on the outcome of a random event. It can be:
Types of Probability Distributions
Discrete Probability Distributions: Used for countable outcomes. Example: The number of times a six appears when rolling a die.
Types:
Continuous Probability Distributions: Used for measurements. Example: Measuring people’s heights.
Normal Distribution
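As a sketch, the standard library’s `statistics.NormalDist` can verify the familiar 68-95-99.7 rule for the normal distribution:

```python
# The 68-95-99.7 rule: fraction of a normal distribution lying within
# 1, 2, and 3 standard deviations of the mean.
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # the standard normal

within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)  # ~0.6827
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)  # ~0.9545
within_3sd = std_normal.cdf(3) - std_normal.cdf(-3)  # ~0.9973
```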
Sampling Distribution and Central Limit Theorem
Sampling Distribution: When you take multiple samples from a population and calculate their means, the distribution of these sample means forms the sampling distribution.
Central Limit Theorem (CLT): This is a key principle in inferential statistics that says as the sample size grows, the sampling distribution of the sample mean becomes approximately normal, regardless of the population’s distribution.
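A simulation sketch of the CLT, with a made-up skewed population drawn from an exponential distribution (my choice of population, purely for illustration):

```python
# CLT sketch: means of repeated samples from a skewed population cluster
# around the population mean and look approximately normal.
import random
from statistics import mean, pstdev

random.seed(0)
population = [random.expovariate(1.0) for _ in range(100_000)]  # skewed

n = 50  # size of each sample
sample_means = [mean(random.sample(population, n)) for _ in range(1_000)]

# The sample means center on the population mean (~1.0 here), with a
# much smaller spread (close to sigma / sqrt(n)) than the population.
print(mean(sample_means), pstdev(sample_means))
```

Plotting `sample_means` as a histogram would show the familiar bell shape even though the population itself is heavily skewed.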
Models in Machine Learning
A model is a mathematical equation that represents relationships in the data and helps us make predictions. For example, a model could predict a person’s height based on their weight.
Example:
If the model is Height = 0.5 + 0.8 × Weight and someone weighs 2.1 units, you’d predict their height as 0.5 + (0.8 × 2.1) = 2.18.
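That arithmetic, as a one-line sketch:

```python
# The example model: Height = 0.5 + 0.8 * Weight
def predict_height(weight):
    return 0.5 + 0.8 * weight

print(round(predict_height(2.1), 2))  # 2.18
```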
Sum of Squared Residuals (SSR)
Residuals are the differences between the actual data points and the model’s predictions. The Sum of Squared Residuals (SSR) is a way to measure how well a model fits the data: smaller SSR values mean better fit.
Example:
If you predict someone’s height and compare it to their actual height, the difference is a residual. Squaring and summing all residuals gives you the SSR, which tells you how far off your model is overall.
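A sketch with made-up heights and predictions (the numbers are illustrative, not from the article):

```python
# Sum of Squared Residuals: square each residual and add them up.
def ssr(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

actual_heights    = [1.9, 2.3, 2.1]  # hypothetical observations
predicted_heights = [2.0, 2.2, 2.2]  # hypothetical model outputs

print(round(ssr(actual_heights, predicted_heights), 2))  # 0.03
```

Squaring makes every residual positive, so errors above and below the prediction can’t cancel each other out.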
Mean Squared Error (MSE)
The Mean Squared Error (MSE) is the SSR divided by the number of data points, i.e., the average squared residual. Because it normalizes for dataset size, it gives a better sense of how well a model performs across datasets of different sizes.
Example:
If the SSR for a small dataset is 14, and you have 3 data points, the MSE is 14/3 ≈ 4.67.
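The same calculation as a sketch:

```python
# MSE = SSR / n: the average squared residual.
def mse_from_ssr(ssr, n):
    return ssr / n

print(round(mse_from_ssr(14, 3), 2))  # 4.67
```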
R-Squared (R²)
R-squared is a measure of how well a model fits the data, ranging from 0 to 1.
A higher R² value indicates a better fit. R² is the proportion of the variation in the data explained by the model.
Example:
If you use height to predict weight, R² tells you how much of the weight variation can be explained by height. An R² value of 0.7 means 70% of the variation in weight is explained by height.
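A sketch with made-up weights and predictions: R² = 1 − SSR/SST, where SST is the total variation of the data around its mean.

```python
# R-squared: fraction of the total variation explained by the model.
from statistics import mean

def r_squared(actual, predicted):
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # SSR
    ss_tot = sum((a - y_bar) ** 2 for a in actual)                 # SST
    return 1 - ss_res / ss_tot

weights     = [50, 60, 70, 80]  # hypothetical observed weights
predictions = [52, 59, 69, 80]  # hypothetical model predictions

print(round(r_squared(weights, predictions), 3))  # 0.988
```

Here the model leaves only 1.2% of the variation unexplained; a model that always predicted the mean would score R² = 0.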
What is a Hypothesis?
A hypothesis is a statement or assumption about a population that we want to test. It’s an idea made from limited evidence, serving as the starting point for further investigation.
Examples:
In statistics, we use sample data to test these hypotheses and see whether the results support or reject the initial assumption.
Hypothesis Testing
Hypothesis testing is a formal procedure that uses sample data to evaluate the credibility of a hypothesis about a population. In simple terms, it’s a rule that helps decide whether to accept or reject a claim based on the evidence provided by the data.
Example:
Imagine a school claims that their students score at least 70% on average in exams. To test this claim, we collect data from a sample of students and use hypothesis testing to verify if the school’s claim holds up.
Null and Alternative Hypotheses
In hypothesis testing, we always start with two hypotheses:
Null Hypothesis (H₀): The null hypothesis is the assumption that there is no effect or no difference. It represents the claim or statement we are testing. Can include: =, ≥, ≤
Alternative Hypothesis (H₁ or Hₐ): This hypothesis directly contradicts the null hypothesis and represents the effect or difference we are trying to find. Can include: ≠, >, <
Level of Significance (Alpha)
The level of significance (alpha), denoted by “α”, is the threshold used to determine whether to reject the null hypothesis. It represents the probability of rejecting the null hypothesis when it is actually true. Common values for α are 0.01, 0.05, or 0.1.
Test Statistics and p-Value
To decide whether to accept or reject the null hypothesis, we rely on:
Test Statistic: A value calculated from the sample data that is used to make decisions in hypothesis testing. There are several types of test statistics:
p-Value: The p-value measures the strength of evidence against the null hypothesis. A smaller p-value indicates stronger evidence to reject the null hypothesis.
p-Values
A p-value helps determine whether the results of an experiment are significant or just due to random chance. A p-value less than 0.05 typically indicates that the result is statistically significant.
Example:
Imagine we had two antiviral drugs, A and B, and we wanted to know if they were different.
Imagine first that we ran the experiment with lots and lots of people, and Drug A cured far more people than Drug B, which hardly cured anyone. In that case it is pretty obvious that Drug A is different from Drug B: it would be unrealistic to suppose that results this lopsided were due to random chance and that there is no real difference between the drugs. It is possible that some of the people taking Drug A were actually cured by a placebo, and some of the people taking Drug B were not cured because they had a rare allergy, but there are just too many people cured by Drug A, and too few cured by Drug B, for us to seriously think the results are random.
Now imagine instead that 37% of the people who took Drug A were cured, compared to 31% who took Drug B. Drug A cured a larger percentage, but given that no study is perfect and there are always a few random things that happen, how confident can we be that Drug A is different from Drug B? This is where p-values come in. p-values are numbers between 0 and 1 that, in this example, quantify how confident we should be that Drug A and Drug B are different. The closer a p-value is to 0, the more confident we are that the drugs differ. So the question is: how small does a p-value have to be before we are sufficiently confident? In other words, what threshold can we use to make a good decision about whether these drugs are different?
In practice, a commonly used threshold is 0.05. It means that if there were no difference between Drug A and Drug B, and we repeated this exact experiment many times, only 5% of those experiments would lead to the wrong decision of concluding that the drugs differ.
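A sketch of that idea using hypothetical counts (100 patients per drug, my numbers rather than the article’s table): a permutation test simulates a world where the null hypothesis is true and asks how often chance alone produces a difference as large as the one observed.

```python
# Permutation test: estimate a p-value by shuffling pooled outcomes.
import random

random.seed(1)

cured_a, n_a = 37, 100  # hypothetical: 37% cured on Drug A
cured_b, n_b = 31, 100  # hypothetical: 31% cured on Drug B
observed_diff = abs(cured_a / n_a - cured_b / n_b)

# Under the null hypothesis the drug labels don't matter, so pool all
# outcomes (1 = cured, 0 = not), reshuffle, and re-split the groups.
outcomes = [1] * (cured_a + cured_b) + [0] * (n_a + n_b - cured_a - cured_b)

trials, extreme = 10_000, 0
for _ in range(trials):
    random.shuffle(outcomes)
    diff = abs(sum(outcomes[:n_a]) / n_a - sum(outcomes[n_a:]) / n_b)
    if diff >= observed_diff:
        extreme += 1

p_value = extreme / trials
print(p_value)  # well above 0.05 -> fail to reject the null at alpha = 0.05
```

With these particular counts a 6-percentage-point gap arises by chance quite often, so the evidence against the null is weak; with the lopsided counts from the first scenario, the same procedure would yield a p-value near 0.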
Types of Tests: One-Tailed vs. Two-Tailed
Hypothesis tests can be either one-tailed or two-tailed, depending on the nature of the hypothesis.
One-tailed test: Used when we are interested in determining if a parameter is either greater than or less than a certain value.
Two-tailed test: Used when we are interested in testing whether a parameter is different from a specific value (it can be either higher or lower).
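A sketch of the difference, computing both p-values from the same hypothetical z statistic using the standard normal CDF:

```python
# One-tailed vs two-tailed p-values from the same test statistic.
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mu=0, sigma=1
z = 1.8                    # hypothetical test statistic

p_one_tailed = 1 - std_normal.cdf(z)        # H1: parameter is greater
p_two_tailed = 2 * (1 - std_normal.cdf(z))  # H1: parameter differs

print(round(p_one_tailed, 4))  # ~0.0359 -> significant at alpha = 0.05
print(round(p_two_tailed, 4))  # ~0.0719 -> not significant at alpha = 0.05
```

The two-tailed p-value is twice the one-tailed one because extreme results in either direction count against the null, which is why the same data can be significant under one formulation and not the other.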
Critical Value and Rejection Region
The critical value is the point that divides the distribution into the region where we fail to reject the null hypothesis and the rejection region. If the test statistic falls within the rejection region, we reject the null hypothesis.
Types of Errors
There are two types of errors that can occur in hypothesis testing: a Type I error (rejecting the null hypothesis when it is actually true, a false positive) and a Type II error (failing to reject the null hypothesis when it is actually false, a false negative).