Basic Statistics for Exploratory Data Analysis (EDA)
Even though neural networks are very effective for large unstructured data like images, text, and speech, we still have to manually analyze data that is smaller, in a structured format, or both, like the data in relational databases, Excel sheets, or tables in general. In this article, I go over the concepts I learned from reading the book "Hands-on Exploratory Data Analysis with Python".
Before we look at EDA, let's go over the common types of data we find in structured formats like databases.
- Numerical data - There are two types of numerical data. Here we refer to the data as a variable because its value varies from row to row of a dataset.
  - Discrete data - A variable whose possible values are countable (values we could count on our fingers, i.e. integers like 3, 6, 798, 600)
  - Continuous data - A variable whose possible values are uncountable, since it can take any value in a range (fractional values like 3.5, 7.7, 6.123, etc.)
- Categorical data - Whenever data falls into one bucket of a given set of categories, it is called categorical data (for example, a person's blood type can be A, B, AB, or O). If there are only two categories (two buckets) it is called a "binary categorical variable", whereas if there are more it is called a "polytomous categorical variable".
While we are at it let us also look at the various measurement scales in statistics.
- Nominal - Used when labelling variables without any quantitative value. These scales are qualitative and are generally referred to as labels (for example, gender: male, female, other, unknown).
- Ordinal - On an ordinal scale, the order of the values also matters. A common example is the Likert scale, which has the options Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree.
- Interval - On this scale, both the order and the exact difference between values matter, but there is no true zero point. Interval scales are widely used in statistics.
- Ratio - Along with order and exact differences, this scale also has an absolute zero.
Now that we've covered data types and measurement scales, let's dive into the statistical concepts we'll need for EDA.
Distribution Functions
Continuous Function - A continuous function is any function whose value does not change abruptly or unexpectedly. Such abrupt changes are referred to as discontinuities. For example, consider the following cubic function:
y = x ** 3 + x ** 2 - 5 * x + 3
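As a quick check, we can evaluate this function on a dense grid of points and see that its values change smoothly, with no jumps (a minimal numpy sketch; the interval is an arbitrary choice):

import numpy as np

# Evaluate the cubic on a grid of points
x = np.linspace(-4, 4, 100)
y = x ** 3 + x ** 2 - 5 * x + 3

# Neighbouring values differ only gradually: no discontinuities
print(np.max(np.abs(np.diff(y))))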
Probability Density Function (PDF) - Describes how likely a continuous random variable is to take a value near x; the probability of the variable falling within a range is the area under the PDF over that range (the probability of any single exact value is zero).
Probability Mass Function (PMF) - The analogous function when the random variable is discrete rather than continuous: it gives the probability that the variable is exactly equal to x.
The probability distribution or probability function of a discrete random variable is a list of probabilities linked to each of its attainable values. Common continuous distributions include the normal, exponential, uniform, and gamma distributions, while the Poisson and binomial distributions are discrete. Let us look at the equations for some of them.
Uniform distribution:
f(x) = 1 / (b - a) if a <= x <= b else 0
Normal distribution:
f(x) = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
Exponential distribution:
f(x) = lam * np.exp(-(lam * x)) if x >= 0 else 0, where lam is the rate parameter lambda
Binomial distribution: A discrete distribution for experiments with only two possible outcomes per trial (e.g. success or failure); it gives the probability of a given number of successes in n independent trials.
Cumulative Distribution Function (CDF): The probability that the variable takes a value less than or equal to x:
F(x) = P[X <= x]
For a scalar continuous random variable, the CDF gives the area under the PDF from minus infinity to x. A CDF can also be used to specify the distribution of multivariate random variables.
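Rather than hand-coding these formulas, we can evaluate PDFs, PMFs, and CDFs with scipy.stats (a minimal sketch; the parameter values are arbitrary choices for illustration):

from scipy import stats

# PDF of a standard normal (mu = 0, sigma = 1) at x = 0
print(stats.norm.pdf(0, loc=0, scale=1))   # ~0.3989

# PMF of a binomial: probability of 3 successes in 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))     # ~0.1172

# CDF of an exponential with rate lambda = 2 (scipy uses scale = 1/lambda)
print(stats.expon.cdf(1, scale=0.5))       # P[X <= 1] ~0.8647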
Descriptive Statistics - There are two types of descriptive statistics: measures of central tendency and measures of dispersion.
Measure of central tendency
- Mean - the average of all the data values
- Median - the middle observation of the sorted data (the element at position (n + 1) / 2 when n is odd, or the average of the two middle elements when n is even)
- Mode - the value that appears most often in the data
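All three measures are one-liners in pandas (a minimal sketch; the sample numbers are made up for illustration):

import pandas as pd

data = pd.Series([2, 3, 3, 5, 7, 8, 8, 8, 10])

print(data.mean())     # mean: 6.0
print(data.median())   # median: 7.0 (middle value of the sorted data)
print(data.mode()[0])  # mode: 8 (appears most often)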
Measure of dispersion
Standard Deviation - This shows how much the data is spread out from the mean. It is the square root of the average squared difference between each value in the dataset and its mean.
Variance - The square of the standard deviation, i.e. the average squared deviation from the mean.
Skewness - The measure of asymmetry in a dataset about its mean.
Kurtosis - A statistical measure that illustrates how heavily the tails of a distribution differ from the tails of a normal distribution. It can identify whether a given distribution contains extreme values.
Types of kurtosis -
- Mesokurtic: Any dataset that follows a normal distribution is mesokurtic. It has a kurtosis of around 3, i.e. an excess kurtosis of around 0.
- Leptokurtic: The distribution has a kurtosis greater than 3 (positive excess kurtosis), and the fat tails indicate that the distribution produces more outliers.
- Platykurtic: The distribution has a kurtosis less than 3 (negative excess kurtosis), and its tails are very thin compared to the normal distribution.
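pandas exposes all of these measures of dispersion directly; note that kurt() reports excess kurtosis, so values near 0 indicate a roughly mesokurtic sample (a minimal sketch with made-up numbers):

import pandas as pd

data = pd.Series([2, 3, 3, 5, 7, 8, 8, 8, 10])

print(data.std())   # standard deviation: spread around the mean
print(data.var())   # variance: the square of the standard deviation
print(data.skew())  # skewness: sign indicates the direction of asymmetry
print(data.kurt())  # excess kurtosis: ~0 for a normal-like sample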
Calculating percentiles
Percentiles measure the percentage of values that lie below a given value.
percentile of X = ((number of observations less than X) / (total number of observations)) * 100
Quartile - The 25th percentile is referred to as Q1, the 50th percentile (the median) as Q2, and the 75th percentile as Q3; Q4 is the 100th percentile, i.e. the maximum.
We can visualise quartiles as box plots; a short sketch of computing them follows.
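A minimal sketch of computing the quartiles with numpy (the data is made up; matplotlib's plt.boxplot(data) would draw the corresponding box plot):

import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19])

print(np.percentile(data, 25))   # Q1
print(np.percentile(data, 50))   # Q2, the median
print(np.percentile(data, 75))   # Q3
print(np.percentile(data, 100))  # Q4, the maximum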
Correlation
Any dataset that we want to analyze will have different fields (that is, columns) of multiple observations (that is, variables) representing different facts. The columns of a dataset are most probably related to one another because they are collected from the same event. One field of a record may or may not affect the value of another field. To examine the type of relationships these columns have and to analyze the causes and effects between them, we have to find the dependencies that exist among the variables. The strength of such a relationship between two fields of a dataset is called correlation, which is represented by a numerical value between -1 and 1.
Correlation tells us how variables change together, both in the same or opposite directions and in the magnitude (that is, strength) of the relationship. To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho). This is obtained by dividing the covariance by the product of the standard deviations of the variables:
rho(x, y) = cov(x, y) / (std(x) * std(y))
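Here is a minimal sketch that computes ρ from the definition and checks it against numpy's built-in helper (the numbers are made up; ddof=1 keeps the standard deviations consistent with np.cov):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# Covariance divided by the product of the standard deviations
rho = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(rho)                      # ~0.853

print(np.corrcoef(x, y)[0, 1])  # same value from the built-in helper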
Types of correlation analysis:
- Univariate analysis - Univariate analysis is the simplest form of analyzing data. It means that our data has only one type of variable and that we perform analysis over it. The main purpose of univariate analysis is to take the data, summarize it, and find patterns among the values. It doesn't deal with causes or relationships between the values. Techniques that describe the patterns found in univariate data include the measures of central tendency (the mean, mode, and median) and of dispersion (the range, variance, maximum and minimum, quartiles (including the interquartile range), and standard deviation).
- Bivariate analysis - Bivariate analysis is used to find out whether there is a relationship between two different variables. When we create a scatter plot by plotting one variable against another on a Cartesian plane (think of the x and y axes), it gives us a picture of what the data is trying to tell us. If the data points seem to fit the line or curve, then there is a relationship or correlation between the two variables. Generally, bivariate analysis helps us to predict a value for one variable (that is, a dependent variable) if we are aware of the value of the independent variable.
- Multivariate analysis - One common way of plotting multivariate data is to make a matrix scatter plot, known as a pair plot. A matrix plot or pair plot shows each pair of variables plotted against each other. The pair plot allows us to see both the distribution of single variables and the relationships between two variables. We can use a matrix plot, a pairwise plot, the corr function from pandas, and a heatmap for multivariate analysis, as sketched below.
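A minimal sketch with seaborn (seaborn and its sample iris dataset are my choices here, purely for illustration):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")  # fetches a small sample dataset

sns.pairplot(df)  # matrix of pairwise scatter plots plus per-variable distributions
plt.show()

sns.heatmap(df.drop(columns=["species"]).corr(), annot=True)  # correlation heatmap
plt.show()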
Simpson's paradox - The difference that appears in a trend when a dataset is analyzed in two different situations: first, when the data is separated into groups and, second, when the data is aggregated. The trend can even reverse direction between the two views.
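The classic kidney-stone treatment numbers make this concrete (a minimal pandas sketch: treatment A has the higher success rate within each group, yet B looks better in aggregate):

import pandas as pd

df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "group":     ["small", "large", "small", "large"],
    "success":   [81, 192, 234, 55],
    "total":     [87, 263, 270, 80],
})

# Per-group success rates: A is higher in both groups
print(df.assign(rate=df["success"] / df["total"]))

# Aggregated success rates: B is higher overall
overall = df.groupby("treatment")[["success", "total"]].sum()
print(overall["success"] / overall["total"])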
Hypothesis Testing
- Null hypothesis - The default assumption made based on knowledge about the domain, typically that there is no effect or no difference.
- Alternate hypothesis - A different hypothesis that opposes the null hypothesis. The main task is to decide, based on the experimental results, whether to reject the null hypothesis in favor of the alternative.
A Type 1 error is a false positive (rejecting a null hypothesis that is actually true), and a Type 2 error is a false negative (failing to reject a null hypothesis that is actually false).
P-value - This is also referred to as the probability value or asymptotic significance. It is the probability, computed assuming the null hypothesis is true, of obtaining results at least as extreme as those actually observed. Generally, if the P-value is lower than a predetermined threshold, we reject the null hypothesis.
Level of significance: This is one of the most important concepts to be familiar with before running a hypothesis test. The level of significance is the threshold at which we decide whether to reject the null hypothesis. We must note that 100% certainty is not possible when accepting or rejecting. We generally select a level of significance based on our subject and domain; it is usually 0.05, or 5%, meaning we accept up to a 5% chance of wrongly rejecting a true null hypothesis.
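A minimal sketch of a two-sample t-test with scipy.stats on simulated data ties these ideas together (the means, sample sizes, and seed are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Null hypothesis: the two samples come from populations with equal means
sample_a = rng.normal(loc=5.0, scale=1.0, size=100)
sample_b = rng.normal(loc=5.5, scale=1.0, size=100)

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(p_value)

alpha = 0.05  # level of significance
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")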