Basic Statistics for Exploratory Data Analysis (EDA)
Even though neural networks are very effective for large unstructured data like images, text, and speech, we still have to manually analyze data that is smaller, in a structured format, or both, like the data in relational databases, Excel sheets, or tables in general. In this article, I go over the concepts I learned from reading the book "Hands-on Exploratory Data Analysis with Python".
Before we look at EDA, let's go over the common types of data we find in structured formats like databases.
- Numerical data - There are two types of numerical data. Here we refer to the data as a variable because its value varies from row to row of a dataset.
  - Discrete data - A variable whose possible values are countable (values we could count on our fingers, i.e. integers like 3, 6, 798, 600)
  - Continuous data - A variable whose possible values are uncountable, since it can take any value in a range (fractional values like 3.5, 7.7, 6.123, etc.)
- Categorical data - Whenever data falls into one bucket of a given set of categories, it is called categorical data (for example, a person's blood type can be A, B, AB, or O). If there are only two categories (two buckets) it is called a "binary categorical variable", whereas if there are more it is called a "polytomous categorical variable".
While we are at it let us also look at the various measurement scales in statistics.
- Nominal - Used when labelling variables without any quantitative value. These scales are qualitative and are generally referred to as labels (for example, gender: male, female, other, unknown).
- Ordinal - On an ordinal scale, the order of the values also matters. A common example is the Likert scale, which has the options Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree.
- Interval - On this scale, both the order and the exact difference between values matter, but there is no true zero point. Interval scales are widely used in statistics.
- Ratio - Along with order and exact differences, this scale also has an absolute zero.
Now that we've covered data types and measurement scales, let's dive into the statistical concepts we'll need for EDA.
Distribution Functions
Continuous Function - A continuous function is any function whose value does not change abruptly or unexpectedly. Such abrupt changes are referred to as discontinuities. For example, consider the following cubic function:
y = x ** 3 + x ** 2 - 5 * x + 3
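As a quick check, we can evaluate this function on a dense grid of points and see that its values change smoothly, with no jumps (a minimal numpy sketch; the interval is an arbitrary choice):

import numpy as np

# Evaluate the cubic on a grid of points
x = np.linspace(-4, 4, 100)
y = x ** 3 + x ** 2 - 5 * x + 3

# Neighbouring values differ only gradually: no discontinuities
print(np.max(np.abs(np.diff(y))))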
Probability Density Function (PDF) - Describes how likely a continuous random variable is to take a value near x; the probability of the variable falling within a range is the area under the PDF over that range (the probability of any single exact value is zero).
Probability Mass Function (PMF) - The analogous function when the random variable is discrete rather than continuous: it gives the probability that the variable is exactly equal to x.
The probability distribution or probability function of a discrete random variable is a list of probabilities linked to each of its attainable values. Common continuous distributions include the normal, exponential, uniform, and gamma distributions, while the Poisson and binomial distributions are discrete. Let us look at the equations for some of them.
Uniform distribution:
f(x) = 1 / (b - a) if a <= x <= b else 0
Normal distribution:
f(x) = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
Exponential distribution:
f(x) = lam * np.exp(-(lam * x)) if x >= 0 else 0, where lam is the rate parameter lambda
Binomial distribution: A discrete distribution for experiments with only two possible outcomes per trial (e.g. success or failure); it gives the probability of a given number of successes in n independent trials.
Cumulative Distribution Function (CDF): The probability that the variable takes a value less than or equal to x:
F(x) = P[X <= x]
For a scalar continuous random variable, the CDF gives the area under the PDF from minus infinity to x. A CDF can also be used to specify the distribution of multivariate random variables.
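Rather than hand-coding these formulas, we can evaluate PDFs, PMFs, and CDFs with scipy.stats (a minimal sketch; the parameter values are arbitrary choices for illustration):

from scipy import stats

# PDF of a standard normal (mu = 0, sigma = 1) at x = 0
print(stats.norm.pdf(0, loc=0, scale=1))   # ~0.3989

# PMF of a binomial: probability of 3 successes in 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))     # ~0.1172

# CDF of an exponential with rate lambda = 2 (scipy uses scale = 1/lambda)
print(stats.expon.cdf(1, scale=0.5))       # P[X <= 1] ~0.8647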
Descriptive Statistics - There are two types of descriptive statistics: measures of central tendency and measures of dispersion.
Measure of central tendency
- Mean - the average of all the data values
- Median - the middle observation of the sorted data (the element at position (n + 1) / 2 when n is odd, or the average of the two middle elements when n is even)
- Mode - the value that appears most often in the data
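All three measures are one-liners in pandas (a minimal sketch; the sample numbers are made up for illustration):

import pandas as pd

data = pd.Series([2, 3, 3, 5, 7, 8, 8, 8, 10])

print(data.mean())     # mean: 6.0
print(data.median())   # median: 7.0 (middle value of the sorted data)
print(data.mode()[0])  # mode: 8 (appears most often)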
Measure of dispersion
Standard Deviation - This shows how much the data is spread out from the mean. It is the square root of the average squared difference between each value in the dataset and its mean.
Variance - The square of the standard deviation, i.e. the average squared deviation from the mean.
Skewness - The measure of asymmetry in a dataset about its mean.
Kurtosis - A statistical measure that illustrates how heavily the tails of a distribution differ from the tails of a normal distribution. It can identify whether a given distribution contains extreme values.
Types of kurtosis -
- Mesokurtic: Any dataset that follows a normal distribution is mesokurtic. It has a kurtosis of around 3, i.e. an excess kurtosis of around 0.
- Leptokurtic: The distribution has a kurtosis greater than 3 (positive excess kurtosis), and the fat tails indicate that the distribution produces more outliers.
- Platykurtic: The distribution has a kurtosis less than 3 (negative excess kurtosis), and its tails are very thin compared to the normal distribution.
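pandas exposes all of these measures of dispersion directly; note that kurt() reports excess kurtosis, so values near 0 indicate a roughly mesokurtic sample (a minimal sketch with made-up numbers):

import pandas as pd

data = pd.Series([2, 3, 3, 5, 7, 8, 8, 8, 10])

print(data.std())   # standard deviation: spread around the mean
print(data.var())   # variance: the square of the standard deviation
print(data.skew())  # skewness: sign indicates the direction of asymmetry
print(data.kurt())  # excess kurtosis: ~0 for a normal-like sample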
Calculating percentiles
Percentiles measure the percentage of values that lie below a given value.
percentile of X = ((number of observations less than X) / (total number of observations)) * 100
Quartile - The 25th percentile is referred to as Q1, the 50th percentile (the median) as Q2, and the 75th percentile as Q3; Q4 is the 100th percentile, i.e. the maximum.
We can visualise quartiles as box plots; a short sketch of computing them follows.
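A minimal sketch of computing the quartiles with numpy (the data is made up; matplotlib's plt.boxplot(data) would draw the corresponding box plot):

import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19])

print(np.percentile(data, 25))   # Q1
print(np.percentile(data, 50))   # Q2, the median
print(np.percentile(data, 75))   # Q3
print(np.percentile(data, 100))  # Q4, the maximum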
Correlation
Any dataset that we want to analyze will have different fields (that is, columns) of multiple observations (that is, variables) representing different facts. The columns of a dataset are most probably related to one another because they are collected from the same event. One field of a record may or may not affect the value of another field. To examine the type of relationships these columns have and to analyze the causes and effects between them, we have to find the dependencies that exist among the variables. The strength of such a relationship between two fields of a dataset is called correlation, which is represented by a numerical value between -1 and 1.
Correlation tells us how variables change together, both in the same or opposite directions and in the magnitude (that is, strength) of the relationship. To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho). This is obtained by dividing the covariance by the product of the standard deviations of the variables:
rho(x, y) = cov(x, y) / (std(x) * std(y))
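Here is a minimal sketch that computes ρ from the definition and checks it against numpy's built-in helper (the numbers are made up; ddof=1 keeps the standard deviations consistent with np.cov):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# Covariance divided by the product of the standard deviations
rho = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(rho)                      # ~0.853

print(np.corrcoef(x, y)[0, 1])  # same value from the built-in helper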
Types of correlation analysis:
- Univariate analysis - Univariate analysis is the simplest form of analyzing data. It means that our data has only one type of variable and that we perform analysis over it. The main purpose of univariate analysis is to take the data, summarize it, and find patterns among the values. It doesn't deal with causes or relationships between the values. Techniques that describe the patterns found in univariate data include the measures of central tendency (the mean, mode, and median) and of dispersion (the range, variance, maximum and minimum, quartiles (including the interquartile range), and standard deviation).
- Bivariate analysis - Bivariate analysis is used to find out whether there is a relationship between two different variables. When we create a scatter plot by plotting one variable against another on a Cartesian plane (think of the x and y axes), it gives us a picture of what the data is trying to tell us. If the data points seem to fit the line or curve, then there is a relationship or correlation between the two variables. Generally, bivariate analysis helps us to predict a value for one variable (that is, a dependent variable) if we are aware of the value of the independent variable.
- Multivariate analysis - One common way of plotting multivariate data is to make a matrix scatter plot, known as a pair plot. A matrix plot or pair plot shows each pair of variables plotted against each other. The pair plot allows us to see both the distribution of single variables and the relationships between two variables. We can use a matrix plot, a pairwise plot, the corr function from pandas, and a heatmap for multivariate analysis, as sketched below.
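A minimal sketch with seaborn (seaborn and its sample iris dataset are my choices here, purely for illustration):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")  # fetches a small sample dataset

sns.pairplot(df)  # matrix of pairwise scatter plots plus per-variable distributions
plt.show()

sns.heatmap(df.drop(columns=["species"]).corr(), annot=True)  # correlation heatmap
plt.show()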
Simpson's paradox - The difference that appears in a trend when a dataset is analyzed in two different situations: first, when the data is separated into groups and, second, when the data is aggregated. The trend can even reverse direction between the two views.
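The classic kidney-stone treatment numbers make this concrete (a minimal pandas sketch: treatment A has the higher success rate within each group, yet B looks better in aggregate):

import pandas as pd

df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "group":     ["small", "large", "small", "large"],
    "success":   [81, 192, 234, 55],
    "total":     [87, 263, 270, 80],
})

# Per-group success rates: A is higher in both groups
print(df.assign(rate=df["success"] / df["total"]))

# Aggregated success rates: B is higher overall
overall = df.groupby("treatment")[["success", "total"]].sum()
print(overall["success"] / overall["total"])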
Hypothesis Testing
- Null hypothesis - The default assumption made based on knowledge about the domain, typically that there is no effect or no difference.
- Alternate hypothesis - A different hypothesis that opposes the null hypothesis. The main task is to decide, based on the experimental results, whether to reject the null hypothesis in favor of the alternative.
A Type 1 error is a false positive (rejecting a null hypothesis that is actually true), and a Type 2 error is a false negative (failing to reject a null hypothesis that is actually false).
P-value - This is also referred to as the probability value or asymptotic significance. It is the probability, computed assuming the null hypothesis is true, of obtaining results at least as extreme as those actually observed. Generally, if the P-value is lower than a predetermined threshold, we reject the null hypothesis.
Level of significance: This is one of the most important concepts to be familiar with before running a hypothesis test. The level of significance is the threshold at which we decide whether to reject the null hypothesis. We must note that 100% certainty is not possible when accepting or rejecting. We generally select a level of significance based on our subject and domain; it is usually 0.05, or 5%, meaning we accept up to a 5% chance of wrongly rejecting a true null hypothesis.
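A minimal sketch of a two-sample t-test with scipy.stats on simulated data ties these ideas together (the means, sample sizes, and seed are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Null hypothesis: the two samples come from populations with equal means
sample_a = rng.normal(loc=5.0, scale=1.0, size=100)
sample_b = rng.normal(loc=5.5, scale=1.0, size=100)

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(p_value)

alpha = 0.05  # level of significance
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")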