Concise Basic Stats - Part II: Summary Statistics & Basic Exploratory Analysis

Welcome back for the second part of the Concise Basic Stats series. This time we will look at what exploratory data analysis involves and, along the way, cover some key concepts: population vs. sample, and population parameters vs. sample statistics.

A population of interest is the entire set about which we want to derive knowledge. If I want to understand the voting preferences of all the students of a school, my population would be all the current students of that school (note that I'm not including professors and school staff, since they were not the target group when we defined the research). Ok, then what is a sample? The sample is a subset of that population which we gather information about. We have to take a few precautions when drawing a sample, as it can be subject to biases (I plan on doing an episode on sampling techniques at another opportunity); we therefore want a sample that is representative of our population. In our example, the simplest approach would be, for instance, to take the students in a particular classroom, although a single classroom may not represent the whole school.

Parameters vs. Statistics

To start, parameters refer to measurements obtained from the whole population. For instance, suppose the average height of the male population in the school is around 168 cm. To get this number for a particular year, the school had to measure the height of each male student, in order to capture every element in the population of interest. This piece of information is a parameter. Now, if we instead collect the heights of just one classroom of male students and find that the average is 164.9 cm, then we are working with a sample, and 164.9 cm is a statistic. A sample is a subset of our population, and from a sample we obtain statistics, which are descriptive metrics for a particular variable of interest (in this case, height). The main point of inferential statistics is to estimate the population parameters based on sample statistics.
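To make the distinction concrete, here is a minimal sketch using only Python's standard library. The population is simulated: the 500 students, the 168 cm mean and the 7 cm spread are made-up assumptions, not real school data. The point is only that the parameter is computed from every element, while the statistic is computed from a sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 500 male students
population = [random.gauss(168, 7) for _ in range(500)]

# Parameter: computed from every element of the population
population_mean = statistics.fmean(population)

# Statistic: computed from a random sample of 30 students
sample = random.sample(population, 30)
sample_mean = statistics.fmean(sample)

print(f"Population mean (parameter): {population_mean:.1f} cm")
print(f"Sample mean (statistic):     {sample_mean:.1f} cm")
```

The two numbers will usually be close but not identical, and that gap is what inferential statistics tries to quantify.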

Categorical vs. Numerical Variables

When we are talking about age, we are dealing with a numerical variable. That is, we can apply arithmetic operations to obtain its mean, median, and standard deviation. However, when we are talking about something like political affiliation, we are dealing with a categorical variable, since we cannot (at least trivially) make calculations to obtain an average for it. It's like asking: what is the mean of your political affiliations? What's its standard deviation? It wouldn't make sense.

Main Summary Statistics

When playing with a dataset, one of the first things to check is the summary statistics for our variables. For categorical variables, the most important summary statistics are the proportion and the absolute frequency of each value. For numerical variables, the most common are the mean, median, quantiles, variance, and standard deviation. Let us take a look at some of them in more detail and see how we can obtain them in Python. For the next examples we are going to use sklearn's built-in dataset called "Diabetes dataset". Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline. More info can be found here:

Let's load the dataset:

# Loading pandas and the built-in datasets module from sklearn
import pandas as pd
from sklearn import datasets

# Getting the data
data = datasets.load_diabetes()

# Put independent variables data in a pandas dataframe format
df = pd.DataFrame(
    data=data['data'],
    columns=data['feature_names']
)

# Transforming the 'sex' column into categorical (for purpose of example).
# load_diabetes returns mean-centered, scaled features, so 'sex' holds
# two float values; the F/M labels below are arbitrary.
df['sex'] = df['sex'].round(2)
df['sex'] = df['sex'].replace({
    -0.04: 'F',
    0.05: 'M'
})

Now that we have our data ready, let's extract some meaningful statistics from each of the variables, starting with the numerical ones.

# Selecting Numerical Variables in a dataset
df_numeric = df.select_dtypes(include=['float','int'])

# Getting the summary statistics
df_numeric.describe()
[Figure: output of the describe method on the dataframe of numerical variables]

As you can see above, we have some important summary statistics for each of the variables, which give us a general overview of how the data is distributed. We have the number of records (count), the mean, the standard deviation (std), the min, the max, and some quantiles (we will explain those shortly) for each of the variables. Let's now look at our categorical variables.

# Selecting Categorical Variables in a dataset
df_categorical = df.select_dtypes(include=['object'])

# Getting the summary statistics
df_categorical.describe()
[Figure: output of the describe method on the dataframe of categorical variables]

As you can see we only have 1 categorical variable ('sex'). The summary statistics are different from the ones we saw previously, as they pertain only to categorical-type variables. We can see the total count of records, the number of unique values, the most frequent value, and its absolute frequency.
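The two categorical summaries mentioned earlier, absolute frequency and proportion, can be computed for any list of labels. Here is a minimal standard-library sketch on a made-up list of affiliations (the values are hypothetical); on a pandas column, `value_counts()` and `value_counts(normalize=True)` give the equivalent results.

```python
from collections import Counter

# Hypothetical categorical variable: political affiliation of 8 students
affiliation = ["A", "B", "A", "C", "A", "B", "A", "C"]

counts = Counter(affiliation)   # absolute frequency of each value
total = len(affiliation)
proportions = {value: n / total for value, n in counts.items()}

# Most frequent value and its absolute frequency
# (the "top" and "freq" rows of describe)
top_value, top_count = counts.most_common(1)[0]

print(dict(counts))   # {'A': 4, 'B': 2, 'C': 2}
print(proportions)    # {'A': 0.5, 'B': 0.25, 'C': 0.25}
print(top_value, top_count)   # A 4
```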

About Quantiles:

To obtain quantiles for a particular set of data, we first need to arrange our numerical data in ascending order. Next, we find the values in this ordered list that split the data at a particular milestone. If we have a variable which tells us the ranking of a particular soccer team in a championship comprised of 10 teams, we could say that the team holding third place is at the 70% quantile (or 70th percentile) of the standings distribution, since 70% of the teams are below it. Similarly, the fifth team would be at the 50% quantile (aka the median), because 50% of the teams are below it.

See how you can extract the quantiles for any dataset in Python:

print('Quantile 30%', df_numeric.quantile(.30))
print('Quantile 65%', df_numeric.quantile(.65))
print('Quantile 75%', df_numeric.quantile(.75))

Each call returns a series containing the respective quantile value for each of the numeric-type variables.
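To see what happens under the hood, here is a small sketch of the "sort, then find the cut point" idea. It assumes linear interpolation between the two nearest sorted values, which is also the default of pandas' `quantile`; the helper name `quantile` and the sample values are made up for illustration.

```python
import math

def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)            # step 1: ascending order
    pos = q * (len(ordered) - 1)        # fractional index of the cut point
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    # step 2: interpolate between the neighbouring sorted values
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

data = [12, 7, 3, 25, 8, 19, 15, 10, 5, 21]  # hypothetical measurements

median = quantile(data, 0.50)
q75 = quantile(data, 0.75)
print(median)   # 11.0
print(q75)      # 18.0
```

Other interpolation rules exist, so different tools may report slightly different quantiles for small datasets.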

About Variance:

As the name suggests, the variance is the measure of overall variability of a particular set of observations. Let's take a look at the formula and get some intuition behind it.

[Figure: Google result for "variance formula"]
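Written out, and assuming the form that the following paragraphs describe (squared distances from x-bar, summed and divided by the number of points n), the variance is the first expression below. The second expression, which divides by n − 1 instead, is the sample variance found in many references; it is also what pandas' `var()` computes by default.

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad\text{and}\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```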

Let us bring some data to explore this notion. Let's imagine the set of univariate data below:

[Figure: some data points (single variable)]

If we were to calculate the average for this particular dataset, we would find that it lies on the red dotted line in the figure. In our formula, this number would be our x with a bar on top (x-bar), which is the mean of all observations.

The elements at the top of the variance ratio (the numerator) are known as the sum of squared distances. That is, we take the difference between the value of each point and the mean of the values (x-bar), which is a distance. Then, we square this value, turning it into a squared distance, and add up the squared distances of all points.

But why do we do that in the first place?

That is because the mean is the value that tries to be as close as possible to all points: it is the balance point around which the distances above and below cancel out. Had we taken the sum of just the distances themselves (without squaring them), the positive and negative distances would cancel and the result would be zero. That would not be a very useful statistic, as it would be zero for any data. To prevent that, we instead work with the notion of the "squared distance". Squaring prevents the summation from collapsing to zero, since the square of a number is never negative. The numerator then gives us an idea of how intensely values are dispersed around the mean. The further apart the values are from the mean, i.e. the bigger the distances, the bigger the value of the numerator will be. We then scale this value by dividing it by the number of points that we have, n. This gives us a reasonable metric/statistic for the dispersion of values.

Therefore, the variance is the average squared distance of our points from the mean. Its units are the square of the original unit. For instance, if we are talking about speed in km/h, the variance would be in (km/h)^2. This is not very easy to interpret in most cases, so we make use of the standard deviation, the square root of the variance, which brings the measure back to the original units.
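The reasoning above can be checked step by step on a handful of numbers. This is a sketch with made-up speeds in km/h, using only the standard library; it divides by n as described in the text, and shows the n − 1 version for comparison.

```python
import math
import statistics

speeds = [52.0, 60.0, 48.0, 71.0, 64.0]  # hypothetical speeds in km/h

n = len(speeds)
mean = sum(speeds) / n                       # x-bar
deviations = [x - mean for x in speeds]      # signed distances from the mean

# Without squaring, the distances cancel out
print(sum(deviations))                       # 0.0

# Sum of squared distances, scaled by n
variance = sum(d ** 2 for d in deviations) / n
std_dev = math.sqrt(variance)

print(variance)             # 68.0, in (km/h)^2
print(round(std_dev, 3))    # 8.246, back in km/h

# The standard library offers both conventions
print(statistics.pvariance(speeds))  # 68.0 (divide by n)
print(statistics.variance(speeds))   # 85.0 (divide by n - 1)
```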

Skewness and Kurtosis

Understanding the shape of data is crucial. It helps to understand where the values lie in a distribution and analyze the outliers in a given dataset.

Skewness is a measure of the asymmetry of a distribution. In the case of the Normal distribution, we talked about it being symmetric. Therefore, we can say that the theoretical normal distribution has zero skewness, as all its measures of central tendency lie in the middle. As we will see, depending on the model, skewness may violate model assumptions. There are different types of skewness. A positively skewed distribution typically has a mean greater than the median. In other words, most values are concentrated on the lower side, with a long tail stretching towards the higher values. A negatively skewed distribution is the mirror image. See the example below, with the distribution of scores from students for a particular test.

[Figure: positively and negatively skewed distributions, respectively]

Kurtosis: Kurtosis is a statistical measure that tells us whether the data is heavy-tailed or light-tailed compared to a normal distribution. If the distribution is heavy-tailed (more of the data sits far from the mean, so extreme values are more likely), it has a high kurtosis; such distributions are usually drawn as "pointy", with a sharp peak and long tails. If it is light-tailed (extreme values are rare), its kurtosis will be small; these are usually drawn as "squashed", with a flat top and short tails. See image below. Don't concern yourself too much with the weird terms in the image; just use it to clarify and get the intuition behind the concept of kurtosis.

[Figure: positive and negative kurtosis examples]

# Importing Visualization library
import seaborn as sns

# Plotting a distribution plot
ax = sns.displot(
    df_numeric['bmi'],
    kind='hist',
    kde=True
)
[Figure: Seaborn's displot output]
# Skewness and Kurtosis
print('Skewness:', df['bmi'].skew())
print('Kurtosis:', df['bmi'].kurtosis())

# Output:
Skewness: 0.5981484879110457
Kurtosis: 0.0950944742751707        
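For intuition about what these two numbers measure, here is a sketch that computes them from central moments on two tiny made-up datasets. It uses the plain moment-based definitions; pandas applies a bias correction, so its values differ slightly on small samples. Like pandas, the kurtosis below is "excess" kurtosis, where a normal distribution scores 0.

```python
import statistics

def central_moment(values, k):
    """Average of the k-th power of the distances from the mean."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    # Third moment, scaled by the standard deviation cubed
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth moment scaled by the variance squared, minus 3 (normal = 0)
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]           # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical, long tail to the right

print(skewness(symmetric))            # 0.0
print(skewness(right_skewed) > 0)     # True
print(statistics.fmean(right_skewed) > statistics.median(right_skewed))  # True
```

The last line illustrates the earlier remark that a positively skewed distribution tends to have its mean above its median.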

We therefore obtain a general sense of how this particular variable is distributed: its shape, its summary statistics, and its visual representation. The good part is that the seaborn library makes it easy to generate visualizations that help us get an intuition for our data. In this next example, I've used the optional argument "hue" to get a more detailed view of the 'bmi' distribution for each sex.

# Plotting distribution plots of the same variable across different gender filters

ax = sns.displot(
    data=df,
    x='bmi',
    kind='hist',
    kde=True,
    hue='sex'
)
[Figure: the same variable may have different distribution shapes when compared across different filters (M vs. F)]

Bivariate Analysis

Now that we are familiar with the main statistics and some descriptive measures associated with a single variable at a time, let's turn our attention to bivariate analysis. It refers to forms of quantitative analysis used when we wish to compare two variables. So far we have only looked at some forms of univariate analysis (only one variable is involved). Bivariate analysis is widely used in the day-to-day work of any analyst or scientist. It is useful for understanding the relationship between two different variables X and Y, and for discovering something new about the behavior of our data. When comparing two variables, we can have three types of scenarios: comparing two numerical variables, two categorical variables, or a numerical and a categorical variable. In this article we will focus solely on the numerical vs. numerical case, as I believe it is the most useful and best for learning purposes.

We are going to explore three ways of carrying out a bivariate analysis. These techniques are part of the fundamental toolkit of any analyst. So, let us begin by tackling the first one:

Scatterplots

Humans are very visual beings. We can quickly detect patterns and trends when looking at a visualization. Scatterplots are perhaps among the most famous and useful visualizations. A scatterplot is a visual tool that is of great help when trying to figure out how one variable behaves in the presence of another. By using them, we can quickly sift through our data and find which variable would best predict another, as the two may present a consistent linear relationship.

[Graph 1: a scatterplot of Wife's Age vs. Husband's Age]

In Graph 1, we see a positive linear association between the two variables. Positive means that, as one of the variables increases, the other tends to increase as well. But what does it mean to have a linear relationship? Visually, a perfect linear relationship occurs if the points of a scatterplot fall on a straight line. The relationship is still linear even if the points are not perfectly aligned, but the deviations from the line must be random rather than systematic, as systematic deviations would indicate a nonlinear relationship:

[Figure: examples of non-linear relationships]

Let's visualize the relationship between bmi and the target variable of interest in our dataset using Python:

# Importing Viz. library
import matplotlib.pyplot as plt

# Setting viz. style
plt.style.use('ggplot')

# Target variable (disease progression)
y = data['target']

# Plotting a scatterplot
ax = plt.scatter(
    x=df['bmi'],
    y=y
)
plt.ylabel('Disease Progression')
[Figure: Matplotlib's visualization output]

The bmi variable seems positively correlated with our target variable, and it may be a good candidate for a strong predictor in a future model.

Ok. We learned about how we can classify a linear relationship. But how can we have a statistical measure for the strength of such association? That is the subject of our next section.

Correlation Coefficients

In the second part of this article we are going to look at measures of association. In particular, we are going to learn about the famous Pearson's correlation metric. When talking about the population, Pearson's correlation parameter is denoted "ρ" (rho), and when it is measured in a sample, the corresponding notation is "r". Pearson's correlation is a measure of the strength of the linear relationship between two sets of values. Remember: Pearson's correlation is used when we want to identify and measure a linear relationship. When dealing with non-linear associations, Pearson's coefficient will NOT be a useful metric.

[Figure: formula for Pearson's correlation]
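For reference, and assuming the figure shows the usual sample form, Pearson's r for n paired observations can be written with the same x-bar notation used for the variance:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing the numerator and both sums in the denominator by n (or by n − 1) turns this into the covariance divided by the product of the two standard deviations, so both descriptions give the same value.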

Let us derive some understanding by looking at the formula. On the top, for each observation we multiply the difference between X's value and its mean by the difference between Y's value and its mean, and these products get summed up into a final number. When the two variables tend to be above (or below) their means at the same time, the products are positive and the numerator grows; when one tends to be above its mean while the other is below, the products are negative. We then scale that number by what is in the denominator, which is built from the standard deviations of each variable (recall that notion from the previous sections). Since the absolute value of what is on top cannot be bigger than what's on the bottom, Pearson's correlation coefficient takes values between -1 and 1.

When a variable has a strong correlation with another (meaning that, by observing X, we can strongly predict the behavior of Y), Pearson's correlation coefficient will be close to either 1 or -1. The sign of the value depends on the direction of the relationship. If positive, it indicates that, as one variable increases its value, the other also tends to increase. If negative, then as one variable increases, the other tends to decrease. Correlation values close to 0 indicate the absence of a linear relationship.
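To connect the formula with these interpretations, here is a sketch that computes r straight from the definition on two tiny made-up datasets. The helper `pearson_r` is written for illustration and is not a library function. The second dataset shows why values near 0 only rule out a linear relationship: y is fully determined by x, yet r is 0.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation computed directly from the definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of paired deviations from the means
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 5, 4, 5]            # noisy upward trend (hypothetical)

x_sym = list(range(-5, 6))
y_curve = [v ** 2 for v in x_sym]     # perfect, but non-linear, dependence

r_linear = pearson_r(x, y_linear)
r_curve = pearson_r(x_sym, y_curve)

print(round(r_linear, 4))   # 0.7746
print(r_curve)              # 0.0
```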

Let's see how to obtain this important metric using Python and our on-going example. Let's take the correlation between 'bmi' and the 'target' attribute.


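# Using Numpy
import numpy as np

X = df['bmi']
y = data['target']
print(np.corrcoef(X, y))

# Out:
[[1.         0.58645013]
 [0.58645013 1.        ]]

# Using Scipy
from scipy import stats
print(stats.pearsonr(X, y))

# Out:
PearsonRResult(statistic=0.5864501344746884, pvalue=3.466006445165805e-42)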

As you can see, there are different libraries you can use to achieve this. Pick whichever feels more suited to the type of data you are working with. A nice extra from scipy's method is that you also get a p-value for the underlying hypothesis test (whose null hypothesis is that the two variables are uncorrelated), which we will cover in more detail later.

Regression Analysis

Regression analysis is one of the most widely used techniques to understand relationships in bivariate, as well as multivariate, analysis. It is a statistical method to examine the relationship between a variable of interest, known as the dependent variable, and a set of other variables, known as independent variables. By understanding and measuring the impact that each independent variable has on the dependent variable, we can weed out the ones that may not be useful from a predictive standpoint, as they might have little to no association with our variable of interest. Knowing this proves very beneficial when building any kind of linear model, as we want to feed the model the variables that best help explain our target while removing unnecessary noise.

In the particular case of bivariate analysis, i.e. when we only have a single independent variable (this type of regression is known as simple linear regression), what we are in fact doing is finding the line in a scatterplot that passes as close as possible to all the points, i.e. with the minimal overall distance from them. Let's look at a representation of such a line:

# find line of best fit
# (X is the 'bmi' column and y the target, as used above)
a, b = np.polyfit(X, y, 1)

print('Coefficient (a) =', a)
print('Intercept (b) =', b)
print("Pearson's Correlation =", stats.pearsonr(X, y).statistic)

line_of_best_fit_values = a*X + b

# Plotting a scatterplot
ax = plt.scatter(
    x=df['bmi'],
    y=y
)

# Add line of best fit to plot
plt.plot(X, line_of_best_fit_values, color='purple')

plt.ylabel('Disease Progression')
[Figure: line of best fit overlaid on a scatterplot]

As you can see, we have a scatterplot of bivariate data, in which our dependent variable usually sits on the y-axis while our independent variable is on the x-axis. The purple line is the regression line. That is, it is the best possible line to "fit" this particular set of data. The process of fitting can be done in various ways, each with its pros and cons.
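One common way of fitting is ordinary least squares, which picks the line that minimizes the sum of squared vertical distances to the points. For a single independent variable it has a closed form, sketched below on a small made-up dataset (the numbers are hypothetical). A degree-1 `np.polyfit`, as used above, returns the same slope and intercept.

```python
# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sum of products of deviations, and sum of squared deviations of x
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

a = s_xy / s_xx           # slope
b = mean_y - a * mean_x   # intercept: the line passes through (mean_x, mean_y)

# Vertical distances between each point and the fitted line
residuals = [yi - (a * xi + b) for xi, yi in zip(x, y)]

print(round(a, 2), round(b, 2))   # 0.6 2.2
```

Note that the slope shares its numerator with Pearson's r, which is why the sign of the slope always matches the sign of the correlation.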

Final Comments

Feel free to try more analyses, metrics, and correlations for each of the variables in the dataset. The best way to learn is to just play around with the different methods using some toy data. Next up we will be diving deep into the Normal and Student's t distributions, so stick around to find out more!


