Descriptive Statistics in Data Science
Photo by Isaac Smith on Unsplash


Looking at numbers is the best way to understand the problem and to look for a solution.

Data Science is a multidisciplinary field that involves statistics, mathematics, programming, computer science, and business domain knowledge, with statistics as one of its fundamental pillars.

We will briefly discuss some fundamental concepts in statistical analysis: defining statistics, data types, descriptive statistics, univariate and bivariate analysis tools, central tendency measures, dispersion measures, distribution measures, and the correlation coefficient.

Defining Statistics

Statistics is the science that allows us to learn from data. We live in the age of Big Data, where data is generated at high volume, wide variety, and high speed, so it is easy to understand why statistics has become a crucial analysis tool today.

Therefore, we need techniques, tools, and processes to analyze this amount of data. Statistics provides many of these tools, helping us extract information relevant to understanding the current situation and to decision making.


1. Collect data: statistics provides us with tools for collecting data, that is, with sampling techniques; we will rarely be able to collect all the data on a phenomenon. A widespread example is electoral polling, where institutes survey samples of the population based on statistical techniques and procedures.

2. Organize data: in addition to collecting, we can organize the data with statistical tools. We can tabulate it, calculate frequencies, arrange the data in an organized way, and then carry out analysis or even predictive modeling.

3. Present data: with statistics, we can also present the data through statistical charts, where visualizations summarize or simplify what the data represents.

4. Describe data: we can describe the data. What is the mean of a given attribute, its median, or its highest value? Does the data follow a normal distribution or not? This description helps us understand how the data is organized and facilitates the work of decision making.

5. Interpret data: finally, we can accomplish perhaps the most critical work of all, interpreting the data. From this interpretation, through statistical tools, we can make inferences about whole populations from small samples.

In short, statistics offers us a series of tools that allow us to collect, organize, present, describe, and interpret data.

Data types

Therefore, we need to define what type of data we are working with in order to choose the most appropriate statistical analysis technique. We have two main classifications, quantitative and qualitative:

1. Nominal qualitative: profession, sex, religion; there is no defined hierarchy among the categories. Nominal qualitative data are descriptive labels and do not allow ranking.

2. Ordinal qualitative: in some situations, there is a clear ordering or hierarchy between the categories (a ranking), for example, education level, social class, and position in a queue.

3. Discrete quantitative: values that we can count, such as the number of children, the number of parked cars, the number of hits on a website, or likes on a post. These are finite, integer values.

4. Continuous quantitative: data that can assume any value within a range, e.g., weight, height, salary. These are observations that are measured rather than counted, usually expressed with decimal values.

This type of division is necessary because we will choose the best statistical technique depending on the data type. We have a set of methods for qualitative data and a set of procedures for quantitative data.

Types of Studies

We already know that statistics helps us collect, organize, present, describe, and interpret data. We also understand that data can be qualitative or quantitative. Now let's look at the types of studies:

1. Experimental study: in an experimental study, each individual is randomly assigned to a treatment group, and then specific data and characteristics are observed and collected. Experimental studies help protect against unknown biases that could interfere with the outcome of the analysis; which design suits a study best depends on our goal.

2. Observational study: in an observational study, specific data and characteristics are collected and observed, but there is no attempt to influence the subjects being studied. That is, we watch the phenomenon, collecting and analyzing the data. Observational studies do not offer the same level of protection against confounding factors that experimental studies do.


Descriptive statistics

As its name suggests, descriptive statistics is a set of statistical methods used to describe the main characteristics of the data; these methods may be graphical or numerical. We use descriptive statistics at the beginning of the analysis process to understand our data.

There are several methods available to help describe the data, each designed to provide a distinct insight into the available information or into an existing hypothesis.

1. Graphical methods: the primary purpose of graphical methods is to organize and present data in a concise and accessible way; data visualization plays a crucial role throughout the data science process.

2. Data summarization: descriptive statistics aims to summarize and present the data so that we can quickly get an overview of the information being analyzed and better understand a dataset through its main characteristics.

3. Main descriptive measures:

  • Representative values: mean and median
  • Dispersion and variation: variance and standard deviation
  • Nature (shape) of distribution: bell, uniform, or asymmetric

Therefore, we collect the data and apply descriptive statistics to obtain a representative value, evaluate the dispersion, and assess the distribution of the data.

Understand data characteristics

Based on the data and how it is organized, we decide which tools to use to treat, clean, transform, normalize, and standardize the data for predictive modeling. The decisions that follow depend on the mean, variance, standard deviation, distribution, and so on.

1. Frequency table to describe data: one of the simplest ways to describe data is through a frequency table, which reflects the observations made in the data and can also be represented as a chart. We observe a particular phenomenon, collect the data, and then tabulate it, creating a frequency table. Each row corresponds to a class (category), and the frequency is the count of each class in the set.

2. Frequency distribution: a frequency distribution is one of the main tools of descriptive statistics, showing how many observations fall within each interval; it is a way to put more information into a frequency table. To create a frequency distribution, we list the values, define the overall range, determine the number of classes, determine the width of each class, and build the distribution (a frequency table with more information) to better understand the data.
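
As a minimal sketch, here is how a frequency table and a binned frequency distribution might be built with pandas; the column names and values below are hypothetical example data, not from the article.

import pandas as pd

df = pd.DataFrame({
    "blood_type": ["A", "O", "O", "B", "A", "AB", "O", "A"],
    "height_cm": [162, 171, 180, 158, 175, 169, 183, 177],
})

# Frequency table: count of each class (category)
freq_table = df["blood_type"].value_counts()
print(freq_table)

# Frequency distribution: counts of a continuous variable within intervals (classes)
bins = pd.cut(df["height_cm"], bins=[150, 160, 170, 180, 190])
freq_dist = bins.value_counts().sort_index()
print(freq_dist)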

Other important descriptive tools

We will apply descriptive statistics primarily during the initial phase of an analysis project. Still, we can use the techniques and tools it offers at almost any stage of the process: tools to summarize data, visualize data, visualize relationships, summarize data frequency, and so on.

  1. Frequency table: shows the occurrence of elements within the dataset.
  2. Contingency table: used when we have two variables and want to visualize the relationship between them.
  3. Charts: help us understand how data is organized, distributed, and related.


Tools for Univariate Analysis

1. Frequency table: the basis for almost all the other tools.

2. Bar chart: one of the most used charts in data analysis. We can represent a frequency table through bars on a bar chart; each bar's height is proportional to the corresponding frequency in the table, as in the sketch below.
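
This is a minimal sketch of turning a frequency table into a bar chart with matplotlib; the blood-type values are a hypothetical example.

import pandas as pd
import matplotlib.pyplot as plt

# Build a frequency table from raw categorical observations
counts = pd.Series(["A", "O", "O", "B", "A", "AB", "O", "A"]).value_counts()

# Each bar's height is proportional to the class frequency in the table
plt.bar(counts.index, counts.values)
plt.xlabel("Blood type")
plt.ylabel("Frequency")
plt.title("Bar chart of a frequency table")
plt.show()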

3. Pareto chart: this can be constructed with bars representing each of the classes in the frequency table, sorted in decreasing order. The height of each bar is proportional to the frequency of its class.

The cumulative line that runs across the chart is constructed so that the leading causes of the problem appear on the left side and the less relevant ones on the right.


We can see that the leading cause of the incorrect medication problem is the dosage error, while the smallest of the issues is self-medication.

4. Pie chart: visually friendly, but rarely recommended. Every chart has its value as long as it is well constructed, but the pie chart can easily lead to misinterpretation in both its construction and its reading.

5. Line chart: we use the line chart to show a variable's evolution along the x-axis, with the measured or accumulated value on the y-axis.

The main caution with the line chart, as with the bar chart, concerns its scale: we can easily manipulate the message this type of chart conveys by changing its scale.

6. Stem and leaf: widely used in statistics, less so in data science. This display splits each value into two parts, where the stem holds the leading (most significant) digits, shown to the left of the vertical bar.


The leaves are the trailing (least significant) digits, shown to the right of the vertical bar. By listing all the leaves next to each stem, we can see how the data is distributed.

7. Histogram: in a histogram the bars touch each other, giving an appearance very close to a bar chart, but the information held in a histogram is about a single feature.

The main goal of a histogram is to show the frequency distribution and thus analyze whether or not the data follows a normal distribution; simply checking how the data is distributed is already of great help.

In general, we use the histogram during preprocessing, before starting the predictive modeling process. We take the dataset, create the histogram, and analyze how the data is distributed; depending on the algorithms we will use later, we may need to transform the data (for example, by normalizing or standardizing it) so that it approximates a normal distribution before feeding the algorithm.
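
Here is a minimal sketch of a histogram plus one common preprocessing step (a log transform followed by standardization); the salary data are simulated right-skewed values, not real figures.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
salaries = rng.lognormal(mean=10, sigma=0.5, size=1000)  # right-skewed sample

# Histogram: adjacent bars show the frequency distribution of a single feature
plt.hist(salaries, bins=30, edgecolor="black")
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.show()

# One possible adjustment before modeling: log-transform, then standardize (z-score)
log_salaries = np.log(salaries)
standardized = (log_salaries - log_salaries.mean()) / log_salaries.std()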

Tools for Bivariate Analysis

Here are two widely used tools for bivariate analysis, when we want to represent two variables and understand how they relate.

1. Contingency table: a table that shows the numerical relationship between two variables.


Note that we have the Male and Female labels, which are the values of the sex variable, and the counts for each type of animal (dog or cat); we are relating two different pieces of information (sex x animal), and the table also shows the totals.

This kind of table is widely used in machine learning classification problems to interpret the results of models; the confusion matrix, for instance, is a contingency table of predicted versus actual classes.
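
As a rough sketch, a contingency table like the one described above can be built with pandas.crosstab; the sex and animal values below are invented example data.

import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "animal": ["Dog", "Cat", "Dog", "Dog", "Cat", "Cat"],
})

# Rows = sex, columns = animal, with row and column totals (margins)
table = pd.crosstab(df["sex"], df["animal"], margins=True)
print(table)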

2. Scatter plot: one of the main tools for this type of analysis, allowing us to study the relationship between two variables.

A classic example is a scatter plot of per capita income against degree of happiness: unsurprisingly, as income increases, the reported happiness also tends to increase.

However, the objective of the scatter plot is not to establish causality; we cannot affirm, based on the chart alone, that a higher per capita income causes a higher degree of happiness, since happiness can be the consequence of factors other than income.
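
Below is a minimal sketch of such a scatter plot with matplotlib; the income and happiness values are simulated purely for illustration, not real survey data.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
income = rng.uniform(5_000, 80_000, size=200)                 # per capita income
happiness = 3 + 0.00004 * income + rng.normal(0, 0.5, 200)    # noisy upward trend

plt.scatter(income, happiness, alpha=0.6)
plt.xlabel("Per capita income")
plt.ylabel("Happiness score")
plt.show()  # the plot shows association, not causation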

Statistical Measurements

1. Measure of central tendency - centrality

In addition to the tools we saw earlier, descriptive statistics also offers us several measures that we can use to interpret the data: central tendency measures, dispersion measures, and shape measures.

Main central tendency measures

These are the main measures of central tendency used in descriptive statistics: the mode, the median, and the arithmetic mean.


1. Mean: the primary measure of central tendency, a number around which the entire dataset is distributed. It is a single number that summarizes the whole dataset, and averages are among the simplest ways to identify trends in data.

However, the mean can hide pitfalls that lead to distorted conclusions; we cannot rely on it alone, as it is only a starting point for analysis. Its main disadvantage appears when the set contains extreme values (outliers), which compromise how well the mean represents the data.

2. Median: the value that divides the data into two equal parts; that is, when the data is sorted in ascending or descending order, the number of terms on each side is the same.

The advantage of using the median is that it is a measure that is not affected by extreme values.

3. Mode: the value that appears most often in the dataset, that is, the one with the highest frequency. A dataset may have no mode when all values appear the same number of times.

If two values appear equally often and more often than the rest, the dataset is bimodal; if three values do, it is trimodal; and for n modes, the dataset is multimodal.
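
A minimal sketch of the three central tendency measures using Python's standard statistics module; the sample values are invented to show how an outlier pulls the mean but not the median.

import statistics

data = [2, 3, 3, 5, 7, 9, 3, 5, 120]  # note the outlier (120)

print(statistics.mean(data))    # pulled upward by the outlier
print(statistics.median(data))  # robust to the outlier
print(statistics.mode(data))    # most frequent value (3)
# statistics.multimode(data) returns every mode for bi/multimodal data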

2. Measures of dispersion - variability

The central tendency measures we saw above help us understand the centrality of the data. However, we also need to know how far the data lies from the center of the distribution; given the mean, we also need to understand how the data points are dispersed around it, that is, the variability of the dataset.

Therefore, we always work with central tendency measures in conjunction with dispersion measures; most of the time, the mean alone is not enough to get an idea of how the data is organized.

1. Standard deviation: a measure of the typical distance between each element and the mean of the set, that is, of how the data spreads around the average. A low standard deviation indicates that data points tend to be concentrated close to the mean, while a high standard deviation indicates that data points are spread more widely.


2. Variance: the square of the standard deviation. In some situations we use the variance, in others the standard deviation. The key practical difference is that the variance is expressed in squared units, while the standard deviation is in the same unit as the variable we are studying.

3. Range (amplitude): one of the most straightforward measures in descriptive statistics, the range is the difference between the highest and the lowest value in the dataset.

4. Percentile: a way to represent the position of a value in the dataset. To calculate percentiles, the values must be sorted in ascending order. There are 99 percentiles within a dataset; at any time, we can look up the value at a given percentile (position) and from that make statements such as "the vast majority of the data falls below (or above) a specific point."

5. Quartile: these values divide the sorted data into quarters, four sections each containing 25% of the elements of the set. The interquartile range is the difference Q3 - Q1, a measure that shows where the middle 50% of the data is concentrated.
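
A minimal sketch of these dispersion measures with NumPy; the sample values are invented for illustration.

import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 10, 12, 19, 25])

std = data.std(ddof=1)                   # sample standard deviation (same unit as the data)
var = data.var(ddof=1)                   # sample variance (squared unit)
data_range = data.max() - data.min()     # range (amplitude)

p90 = np.percentile(data, 90)            # 90th percentile
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range: spread of the middle 50%

print(std, var, data_range, p90, iqr)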


3. Measures of distribution - shape

The measures of skewness and kurtosis characterize the shape of the distribution of elements around the mean. This is a fundamental concept because, depending on the data's distribution, we may need to apply some technique to adjust the data before predictive modeling.


1. Perfect symmetry: in a perfect normal distribution, the tails on each side of the curve are exact mirror images, and the mean, median, and mode have the same value.

2. Positive asymmetry: when a distribution leans to the right, the tail on the right side of the curve is longer than the tail on the left side, and the mean is higher than the mode. This situation is called positive asymmetry (positive skew). Depending on our goal, if we identify a positively skewed distribution, we might need to apply a statistical transformation to bring the data closer to a symmetric distribution before feeding machine learning algorithms.

3. Negative asymmetry: when a distribution leans to the left, the tail on the left side of the curve is longer than the tail on the right side, and the mean is smaller than the mode. This situation is called negative asymmetry (negative skew).

To calculate the coefficient of asymmetry, we can use Pearson's mode-based coefficient of skewness, (mean - mode) / standard deviation, or the median-based version, 3 * (mean - median) / standard deviation.
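
A minimal sketch of the median-based coefficient in Python, compared with SciPy's moment-based skewness; the data are simulated (right-skewed), not from the article.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)  # positively skewed sample

mean, median, std = data.mean(), np.median(data), data.std(ddof=1)
pearson_second = 3 * (mean - median) / std    # median-based coefficient of skewness

print(pearson_second)     # > 0 indicates positive (right) skew
print(stats.skew(data))   # moment-based skewness, also positive here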



How to interpret the coefficient of asymmetry?

The sign gives the direction of the asymmetry.

  • Zero means no asymmetry. 
  • A negative value means that the distribution is negatively asymmetric. 
  • A positive value means that the distribution is positively asymmetric.

The coefficient compares the sample distribution with a normal distribution: the larger its absolute value, the more the distribution differs from a normal one.

Kurtosis coefficient

The kurtosis coefficient is one of the most used measures of the degree of flattening of a distribution curve. The percentile version, often written simply as k, is calculated from the interquartile range and the percentiles of order 10 and 90: k = (Q3 - Q1) / (2 * (P90 - P10)).
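
A minimal sketch of this percentile-based kurtosis coefficient in Python, assuming the formula above; the data are simulated from a normal distribution, for which k is approximately 0.263.

import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=0, scale=1, size=10_000)

q1, q3 = np.percentile(data, [25, 75])
p10, p90 = np.percentile(data, [10, 90])

k = (q3 - q1) / (2 * (p90 - p10))  # percentile coefficient of kurtosis
print(k)                           # close to 0.263 for normally distributed data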



Correlation Coefficient

So far, we have seen several measures that help describe the data: central tendency measures to gauge the centrality of the data and dispersion measures to determine its variability.

In some situations, we will want to go beyond these measures and check the relationship between two variables; for this, we calculate the correlation coefficient.

The correlation coefficient is often used during the exploratory analysis phase to go beyond describing a single variable and understand its relationship to the other variables in the set.

Correlation allows us to determine how strongly pairs of variables are related; that is, we analyze two variables and extract the strength of their relationship. The main result is the correlation coefficient (r), which ranges from -1.0 to +1.0. The closer r is to -1.0 or +1.0, the stronger the relationship, and the sign indicates whether the variables move in the same or in opposite directions.
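
A minimal sketch of computing Pearson's r with NumPy; the x and y values are simulated to show a strong positive relationship.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0, 1.5, size=100)  # y grows with x, plus noise

r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r)                     # close to +1.0: strong positive linear relationship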

It is up to the data scientist to choose the appropriate tool for each step of the data analysis process. We will continue in Part II with the concepts of Probability and Inferential Statistics, the other two pillars of statistical analysis.

And there we have it. I hope you have found this useful. Thank you for reading.
