Machine Learning Day 2 - Statistics
Statistics (credit to @WallpaperAccess )

Statistics:

Statistics is a branch of mathematics that involves collecting, analyzing, interpreting, presenting, and organizing data.

It provides methods for drawing conclusions and making inferences about populations based on a sample of data.

Statistics is broadly divided into two main categories: descriptive statistics and inferential statistics.

I. Descriptive Analysis:

  • Objective: The primary goal of descriptive statistics is to summarize and describe the main features of a dataset. It involves organizing and presenting data in a meaningful way.
  • Analysis Type: Involves measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation), measures of distribution shape (skewness, kurtosis), and exploratory data analysis (EDA).
  • Application: Used to describe and summarize data. It helps organize information in a way that is easy to understand, providing insights into the main characteristics of a dataset.
  • Example: Calculating the average (mean) height of a group of people, finding the range of scores on a test, or determining the most frequently occurring value in a dataset.
  • Terminologies in Descriptive Analysis:

1. Mean: The mean, often referred to as the average, is the sum of all values in a dataset divided by the number of values. For n values x_1, ..., x_n, it is calculated using the formula:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

2. Median: The median is the middle value of a dataset when it is arranged in ascending or descending order. If there is an even number of observations, the median is the average of the two middle values.

3. Mode: The mode is the value that appears most frequently in a dataset. A dataset may have no mode, one mode (unimodal), or more than one mode (multimodal).
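
As a small illustration of the three measures of central tendency defined above, the sketch below uses Python's standard `statistics` module. The `scores` list is made-up data, chosen only so the results are easy to check by hand.

```python
import statistics

# Hypothetical test scores, chosen only for illustration.
scores = [70, 85, 85, 90, 100]

# Mean: sum of all values divided by the number of values.
mean_score = statistics.mean(scores)

# Median: middle value of the sorted data (average of the two
# middle values when the number of observations is even).
median_score = statistics.median(scores)
even_median = statistics.median([1, 2, 3, 4])

# Mode: multimode returns every most-frequent value as a list,
# so it also covers the multimodal case.
modes = statistics.multimode(scores)

print(mean_score, median_score, even_median, modes)
```

Using `multimode` rather than `mode` is a design choice here: it returns a list, so a dataset with several equally frequent values does not hide any of them.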

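4. Variance: Variance measures the spread or dispersion of a set of values. It is calculated as the average of the squared differences between each value and the mean. For a whole population of N values with mean \mu, the sum is divided by N. For a sample of n values with mean \bar{x}, the sum is divided by n - 1 instead. The formulas for the population variance (\sigma^2) and the sample variance (s^2) are:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$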

5. Range: The range is the difference between the maximum and minimum values in a dataset. It provides a simple measure of the spread of values.

$$\text{Range} = x_{\max} - x_{\min}$$

6. Standard Deviation: The standard deviation is another measure of the spread or dispersion of a set of values. It is the square root of the variance. The formula for standard deviation (s) is:

$$s = \sqrt{s^2}$$
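
The measures of spread described above (variance, range, standard deviation) can be sketched with Python's standard `statistics` module. The `data` list is hypothetical. The functions prefixed with `p` treat the data as a whole population and divide by the number of values, while `variance` and `stdev` treat it as a sample and divide by one less than that.

```python
import statistics

# Hypothetical dataset, chosen so the results are easy to verify by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the maximum and minimum values.
value_range = max(data) - min(data)

# Population variance and standard deviation (divide by N).
pop_variance = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample variance and standard deviation (divide by n - 1).
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print(value_range, pop_variance, pop_std, sample_variance, sample_std)
```

In both cases the standard deviation is the square root of the corresponding variance, which is why the sample values come out slightly larger than the population values for the same data.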


7. Exploratory Data Analysis (EDA): EDA is a part of descriptive statistics. It focuses on summarizing, visualizing, and understanding the main characteristics and patterns present in the data. It involves examining the dataset itself without making predictions.

The primary goals of EDA include:

  • Data Summarization: EDA aims to provide a summary of the main features of the dataset, such as central tendency, dispersion, and distribution of values.
  • Pattern Recognition: It involves identifying patterns, trends, and anomalies within the data using graphical and statistical methods.
  • Data Visualization: EDA often includes creating visualizations, such as histograms, scatter plots, box plots, and heatmaps, to better understand the relationships between variables.
  • Data Cleaning and Preprocessing: EDA helps in identifying missing values, outliers, or other data issues that may require cleaning or preprocessing before further analysis.
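
Two of the checks named in the last bullet, missing values and outliers, can be sketched in a few lines. This is only one possible approach: the `raw` list is made-up data, `None` is assumed to mark a missing value, and the 1.5 × IQR rule is a common convention for flagging outliers, not something the article prescribes.

```python
import statistics

# Hypothetical raw measurements; None is assumed to mark a missing value.
raw = [10, 12, None, 11, 13, 12, None, 95]

# Check 1: count the missing values and drop them.
missing_count = sum(1 for v in raw if v is None)
clean = [v for v in raw if v is not None]

# Check 2: flag outliers with the 1.5 * IQR rule (a common convention).
q1, _, q3 = statistics.quantiles(clean, n=4)  # quartile cut points
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in clean if v < lower_fence or v > upper_fence]

print(missing_count, outliers)
```

Whether a flagged value should be removed, corrected, or kept is a judgment call that depends on the dataset; the sketch only identifies candidates.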

EDA provides valuable insights into the dataset and lays the groundwork for subsequent analyses. Data cleaning and preprocessing in EDA involves the following steps:

[Figure: Data Cleaning and Preprocessing in EDA]

8. Skewness: Skewness measures the asymmetry or lack of symmetry in a distribution. It indicates whether the data is skewed to the left (negative skewness), meaning the left tail is longer or has more mass than the right tail, or skewed to the right (positive skewness), meaning the right tail is longer or has more mass than the left tail. A skewness of 0 indicates a perfectly symmetrical distribution.

9. Kurtosis: Kurtosis measures the "tailedness" of a distribution, indicating whether the data has heavier or lighter tails than a normal distribution. It is usually reported as excess kurtosis, that is, kurtosis minus 3, because a normal distribution has a kurtosis of 3. Positive excess kurtosis (leptokurtic) indicates heavier tails, while negative excess kurtosis (platykurtic) indicates lighter tails. An excess kurtosis of 0 (mesokurtic) implies that the distribution has tails similar to a normal distribution.

Skewness and Kurtosis are measures of the shape of a distribution in descriptive statistics.
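
Both shape measures can be computed from central moments. The sketch below uses the simple population (moment-based) definitions; statistical libraries often apply small-sample corrections, so their results may differ slightly. The helper names and the two datasets are assumptions made for illustration.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical datasets: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

The symmetric data gives a skewness of 0 and a negative excess kurtosis (lighter tails than a normal distribution), while the long right tail gives a positive skewness.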


II. Inferential Statistics:

  • Objective: The main objective of inferential statistics is to make predictions or inferences about a population based on a sample of data. It involves generalising the findings from a sample to a larger population.
  • Analysis Type: Involves techniques such as hypothesis testing, confidence intervals, and regression analysis. These methods help researchers make predictions, test hypotheses, and draw conclusions about a population based on a sample.
  • Application: Used when researchers want to make predictions or draw conclusions about a population based on a sample. It helps in determining whether observed differences or relationships are likely to be real and not just due to random chance.
  • Example: Conducting a hypothesis test to determine if there is a significant difference in test scores between two teaching methods or estimating a confidence interval for the average income of a population based on a sample.

Terminologies used in Inferential Analysis:

  1. Hypothesis Testing:

  • Hypothesis testing is a statistical method used to make inferences about a population based on a sample of data.
  • It involves formulating a null hypothesis (often denoted as H0) that there is no effect or no difference, and an alternative hypothesis (denoted as H1 or Ha) that there is an effect or difference.
  • The goal is to determine whether the evidence from the sample is sufficient to reject the null hypothesis in favour of the alternative hypothesis.
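
To make the logic concrete, here is a minimal sketch of one kind of hypothesis test, an exact permutation test, applied to the teaching-methods example. It is not the only option (a t-test is the more usual choice), but it needs nothing beyond the standard library. The scores are made-up and deliberately well separated; H0 is that the method makes no difference to the mean score.

```python
from itertools import combinations
from statistics import mean

# Hypothetical test scores under two teaching methods.
method_a = [60, 62, 64, 66, 68]
method_b = [80, 82, 84, 86, 88]

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Under H0 the group labels are interchangeable, so every way of
    splitting the pooled scores into two groups is equally likely.
    The p-value is the share of splits whose mean difference is at
    least as large as the one actually observed.
    """
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

p_value = permutation_p_value(method_a, method_b)
reject_null = p_value < 0.05  # 0.05 is a conventional significance level
print(p_value, reject_null)
```

With these scores only 2 of the 252 possible splits are as extreme as the observed one, so the p-value is well below 0.05 and H0 is rejected.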


  2. Confidence Intervals:

  • A confidence interval is a range of values constructed around a sample estimate that is likely to include the true population parameter with a certain level of confidence.
  • For example, a 95% confidence level means that if the same population were sampled many times and an interval were computed from each sample, about 95% of those intervals would contain the true parameter.
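
As a rough sketch, a 95% confidence interval for a mean can be built as the sample mean plus or minus a critical value times the standard error. The version below assumes a normal approximation (critical value of about 1.96); for a sample as small as this hypothetical one, a t-distribution critical value would normally be used and would give a wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample drawn from a larger population.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
confidence = 0.95

# Critical value from the standard normal distribution (about 1.96 for 95%).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

center = mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n).
standard_error = stdev(sample) / math.sqrt(len(sample))
margin = z * standard_error

lower, upper = center - margin, center + margin
print(f"{confidence:.0%} CI: ({lower:.2f}, {upper:.2f})")
```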

  3. Regression Analysis:

  • Regression analysis is a statistical technique used to examine the relationship between one dependent variable and one or more independent variables.
  • It aims to capture the underlying patterns and use them for predictive purposes.
  • Simple linear regression involves one independent variable, while multiple linear regression involves two or more.
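
Simple linear regression with one independent variable can be sketched with the ordinary least squares formulas: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and the intercept follows from the two means. The function name and the data points are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data: x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_simple_linear_regression(xs, ys)
prediction = intercept + slope * 6  # predicted y for a new x value of 6
print(slope, intercept, prediction)
```

Multiple linear regression follows the same idea with several independent variables, but is usually solved with matrix methods from a numerical library rather than by hand.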


  4. ANOVA (Analysis of Variance):

  • ANOVA is a statistical method used to compare means among more than two groups. It assesses whether there are any statistically significant differences between the means of groups.
  • ANOVA breaks down the total variance in a dataset into variance between groups and variance within groups to determine if the differences between groups are greater than expected due to random chance.
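
The split into between-group and within-group variance can be sketched as a one-way ANOVA F statistic. The function name and the three groups are made-up for illustration. The sketch stops at the F statistic: turning it into a p-value requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova_f(groups):
    """F statistic: between-group variance over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(f_statistic)
```

A large F statistic means the group means differ by more than the spread within the groups would suggest.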


  5. Chi-Square Test:

  • The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables.
  • It compares the observed distribution of data with the expected distribution under the assumption of independence.
  • In other words, it helps determine whether the observed frequencies of categories are different from what would be expected by chance.
  • The test produces a chi-square statistic, and by comparing it to a critical value or using a p-value, you can assess the significance of the relationship between the variables.
  • P-Value: The p-value associated with the chi-square statistic is the probability of observing the data (or more extreme results) if the null hypothesis of independence is true.
    - A low p-value (typically below a chosen significance level, such as 0.05) suggests that you can reject the null hypothesis. In other words, you have evidence of a significant association between the categorical variables.
    - A high p-value indicates that you do not have enough evidence to reject the idea of independence. The observed association could be due to random chance.
  • In simple words, the chi-square test is like a detective tool for categorical data. It helps us figure out if there's a real connection between two categories or if it's just random chance.
  • For example, imagine you're comparing expected and observed outcomes, like whether people prefer different ice cream flavours based on their favourite colours. If the actual choices are significantly different from what we'd expect by chance, the chi-square test tells us there's likely a meaningful connection between colour preference and ice cream choice. It's a way to spot patterns and relationships in categorical data.
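
The comparison of observed and expected frequencies can be sketched for a 2 × 2 table. The counts below are invented for the colour and ice cream example. Two assumptions are worth flagging: no continuity correction is applied, and the p-value formula used here is valid only for 1 degree of freedom, which is what a 2 × 2 table has.

```python
import math

# Hypothetical counts: rows are favourite colours, columns are flavours.
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals] for r in row_totals
]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# For 1 degree of freedom only: P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
print(chi_square, p_value)
```

Here the p-value falls just below 0.05, so at that significance level the hypothesis of independence would be rejected. Larger tables need the general chi-square distribution, which statistical libraries provide.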

In summary, descriptive statistics focus on summarizing and describing data, while inferential statistics involve making predictions or drawing conclusions about a population based on a sample. Descriptive statistics provide a snapshot of the data, while inferential statistics help researchers make broader inferences about the entire population.



