A complete Exploratory Data Analysis in Python
This tutorial explores a rental dataset, performs transformations, and reorganizes the data to prepare it for modeling.
Visit the Jupyter Notebook to see the concepts that we will cover about Exploratory Data Analysis. Note: essential functions, outputs, and terms are bold to facilitate understanding (at least mine).
When we receive data to solve a problem, we are expected to analyze and explore it: look for relationships, understand how the variables are organized, and decide whether or not to transform the data. All of this is done through Exploratory Analysis.
Case study
We will analyze rental data for bikes used as taxis in New York City. New York City is absolute urban chaos, and traffic is chaotic. Because of this, it is widespread to find bikes used as taxis, carrying more than one person. This dataset represents this type of rental.
Importing libraries
Let’s import the packages we’ll work with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.__version__
Importing data
We use the read_csv function from Pandas and store the result in the df object:
df = pd.read_csv('Taxis-Exploratory-Analysis.csv')
Checking types
We can see that we have a DataFrame type object to manipulate:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
Checking data
The first column is the index created automatically by Pandas; the other columns were collected from the city of New York: date, trip distance, and trip time.
df.head(10)
Checking data type
Let’s see how Pandas determined the types of each column when loading them:
df.dtypes
Data object
Distancia float64
Tempo object
dtype: object
However, the first column, Data, should be a DateTime, not a string. When Pandas did the import, it did not detect that the column was a DateTime, and that affects our analysis process.
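If the data is already loaded, a common alternative (a minimal sketch, not the approach used below) is to convert the column after the fact with pd.to_datetime; the explicit format string is an assumption based on the month/day/year values seen in head():
df['Data'] = pd.to_datetime(df['Data'], format = '%m/%d/%y')  # assumed M/D/YY format
df.dtypes  # 'Data' should now show as datetime64[ns]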
Checking columns
Returns a list of the columns we have:
df.columns
Index(['Data', 'Distancia', 'Tempo'], dtype='object')
Adjusting column name
Here we have code to adjust the column names, translating the Portuguese names into English. Just create a list of the strings we want and assign this list to the columns attribute of the DataFrame df:
cols = ['Date', 'Distance', 'Time']
df.columns = cols # columns receive cols
Checking indexes
The index starts at 0 and stops just before 81, in steps of 1:
df.index
RangeIndex(start=0, stop=81, step=1)
Looking at one column
We call the DataFrame df and filter by the Date column:
df['Date'].head()
0 8/2/15
1 8/3/15
2 8/4/15
3 8/6/15
4 8/7/15
Name: Date, dtype: object
Converting Date column type
The Date column was loaded as a string, but we can change this behavior at load time by including the parse_dates parameter of read_csv.
Attention: when using parse_dates, we have to readjust the column names again, since read_csv reloads the file with the original Portuguese names:
df = pd.read_csv('Taxis-Exploratory-Analysis.csv', parse_dates = ['Data'])
cols = ['Date', 'Distance', 'Time']
df.columns = cols
Check the?type?of the Date column:
df['Date'].head()
0   2015-08-02
1   2015-08-03
2   2015-08-04
3   2015-08-06
4   2015-08-07
Name: Date, dtype: datetime64[ns]
As we can see, the dtype of Date now returns as datetime64, so we successfully performed the conversion. This change is critical since each type has its own specific attributes and methods.
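For instance, a datetime64 column exposes the .dt accessor, which a plain string column does not. A minimal sketch (still before setting the index, so Date is a regular column):
df['Date'].dt.month.head()       # month number of each trip
df['Date'].dt.day_name().head()  # weekday name of each trip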
Setting Date to Index
We will define the Date column as the index of our df table. Since we are going to analyze the data over time, it makes sense to use Date as the index.
By placing Date as the index, we get all the power of indexing that Pandas offers:
df.set_index('Date', inplace = True)
df.head(10)
Now we can perform all kinds of slicing and assemble time series to track the data over time; we have many more possibilities. This conversion depends, of course, on our goals.
Creating plot
When we call the plot function, it automatically recognizes that Date is the index and creates a time series.
df.plot()
Above we can see the evolution of the bike trip distance over time. We can already extract the information that the average distance hovers around 5, and that at a certain point there is a relevant peak.
Consulting help
With the ? help operator, we can visualize all the parameters, some examples of using the function, values we can use, and more:
?df.plot
Changing chart type
We'll make some changes to the chart, starting with its type. First, let's set the kind parameter to get a bar chart:
df.plot(kind = 'bar')
plt.show()
The clutter below the chart is the x-axis labels, crowded together because of the high density of information.
Changing chart
Adjust the kind parameter for an area chart and set the y-axis label:
df.plot(kind = 'area')
plt.ylabel("Count")
Attention!
The peak at the end of the chart shows that something relevant occurred, but if we keep it, we will have problems later in predictive modeling: the average distance is otherwise fairly constant, and the peak is an outlier that deviates entirely from that average.
The ideal is to take the outlier out, analyze it separately, and remove it from the dataset, so the average is preserved when we train the model.
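A minimal sketch of one way to do that, flagging trips more than three standard deviations above the mean; the threshold is an assumption, not something dictated by this dataset:
mean, std = df['Distance'].mean(), df['Distance'].std()
outliers = df[df['Distance'] > mean + 3 * std]    # inspect these separately
df_no_peak = df[df['Distance'] <= mean + 3 * std]  # modeling copy without the peak
Note that rows with a missing Distance also fall out of df_no_peak, since NaN fails the comparison.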
Statistical summary
Get the statistical summary of this dataset with a simple method:
df.describe()
Slicing
We can slice by Date since Date is now an index:
df.loc['2015-11']  # .loc is needed for partial-date selection in recent pandas versions
Notice that the filter was made by month and year, so the return was every day of November 2015. This is a great advantage of using the date as the index.
Length
We can get the number of rows returned by the filter with the len function:
len(df.loc['2015-11'])
13
Attention!
So far, we've run every cell of the Jupyter Notebook in sequence. However, if we close the browser, we would have to rerun everything.
Saving in CSV
Therefore, we can save intermediate results to a CSV file so as not to lose what was done. For this, we call the to_csv function and store df in a CSV called dataframe_saved_v1:
df.to_csv('dataframe_saved_v1.csv')
We can finish our day here, with .to_csv preserving the work done so far.
Notebook - 2
Uploading saved data
Let's load the file we saved through to_csv:
df = pd.read_csv('dataframe_saved_v1.csv')
df.head()
Checking type
Be very careful here! Note that the Date column has been loaded back in object format!
When loading the CSV, we need to pass parse_dates again:
df.dtypes
Date object
Distance float64
Time object
dtype: object
Reloading file
This time, let's reload the file with parse_dates. Since we saved the file after renaming the columns, they are already in English, so the parameter now refers to Date and no renaming is needed:
df = pd.read_csv('dataframe_saved_v1.csv', parse_dates = ['Date'])
df.dtypes
Date datetime64[ns]
Distance float64
Time object
By rechecking the types, we can be confident in proceeding with Date in the datetime64 format.
Setting Date to Index
Let's set the Date column as the index with the set_index function once again and check the change with head:
df.set_index('Date', inplace = True)
df.head()
Plot
df.plot()
Converting time to Seconds
We can identify some needs during the exploratory analysis phase, such as breaking the time column into parts.
First, we use the reserved word def, name the function total_seconds, and give it a time parameter.
The function first checks whether the value passed in is missing (NaN). If so, it returns NaN and moves on.
Then we split the string in the Time column on each ':' to separate it into hours, minutes, and seconds.
Finally, a calculation converts the hours, minutes, and seconds into total seconds. That is, instead of having the trip duration as a clock time, we will have it standardized as total seconds:
def total_seconds(time):
    # pd.isna is more robust than 'time is np.nan' for detecting missing values
    if pd.isna(time):
        return np.nan
    hrs, mins, seconds = str(time).split(':')
    seconds = int(seconds) + 60 * int(mins) + 60 * 60 * int(hrs)
    return seconds
Applying the function with Map
Let's apply the function to the Time column with the map method, and record the result in a new column called Seconds in df:
df['Seconds'] = df.Time.map(total_seconds)
df.head(10)
We keep the Time column intact and now have the new Seconds column with the trip duration converted to total seconds. Very useful.
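As a side note, newer versions of pandas can do the same conversion without a custom function, via to_timedelta; a sketch, assuming every non-missing Time value follows the H:MM:SS pattern:
# Vectorized alternative: NaN values become NaT and end up as NaN seconds
df['Seconds'] = pd.to_timedelta(df['Time']).dt.total_seconds()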
Statistical summary
Statistical summary for the Distance and Seconds columns, which are in numeric format:
df.describe()
Filling missing values
If we think it's necessary, we can fill in the NaN values with 0 and then call the statistical summary:
df.fillna(0).describe()
The statistical values have now changed because we filled the missing values with 0. There are more techniques to solve this, as sketched below, and we'll revisit them later.
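A sketch of two such alternatives, applied to copies so df itself stays untouched; the names df_median and df_interp are just illustrative:
# Median imputation: less sensitive to outliers than filling with 0 or the mean
df_median = df.fillna(df.median(numeric_only = True))
# Time-based interpolation: estimate gaps from the surrounding dates
df_interp = df.copy()
df_interp['Seconds'] = df_interp['Seconds'].interpolate(method = 'time')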
Converting seconds to minutes
Now we can quantify how many minutes each trip lasted, since we already have the count in seconds.
Let's take the Seconds column and apply the map method. As a parameter, we pass an anonymous function using a lambda expression: for each element x in Seconds, divide it by 60 and write the output to the Minutes column:
df['Minutes'] = df['Seconds'].map(lambda x: x / 60)
Fill in NaN values with 0 and request the statistical summary:
df.fillna(0).describe()
Now we have one more column to analyze, the Minutes column. We are breaking down the Time column: so far we have the total Seconds and have quantified it into Minutes:
Plot
Let's take advantage of our now richer DataFrame and draw a scatter plot between Distance and Minutes:
df.plot(x = 'Distance',
y = 'Minutes',
kind = 'scatter')
df.corr(numeric_only = True)  # numeric_only skips the non-numeric Time column (needed in newer pandas)
Clearly, there is a positive relationship between the Distance variable and the Minutes variable. As Distance increases, the trip takes more Minutes; this shows that there is consistency in the data.
Checking correlation — Pearson
We can call the corr method, as we did above, to verify the correlation between the variables, producing the table of correlations. The diagonal of 1s is just the correlation of each variable with itself, the highest possible.
By default, the Pearson coefficient is used to calculate the correlation.
Checking correlation — Spearman
Switch the coefficient to Spearman:
df.corr(method = 'spearman', numeric_only = True)
Checking correlation — Kendall
Switch the coefficient to Kendall:
df.corr(method = 'kendall', numeric_only = True)
Boxplot
A boxplot helps us visualize outliers: values that fall far from the rest of the distribution, that is, extreme values.
The circles represent trips that lasted much longer than the average, and it is necessary to investigate these cases.
df.boxplot('Minutes', return_type = 'axes')
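To list the trips behind those circles, one common sketch uses the interquartile range; the 1.5 multiplier is the usual boxplot convention, not something specific to this dataset:
# Trips whose duration falls outside 1.5 * IQR beyond the quartiles
q1, q3 = df['Minutes'].quantile([0.25, 0.75])
iqr = q3 - q1
df[(df['Minutes'] < q1 - 1.5 * iqr) | (df['Minutes'] > q3 + 1.5 * iqr)]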
Creating other variables
We can create a variable that indicates the number of minutes per kilometer:
df['Min_Per_Km'] = df['Minutes'] / df['Distance']
df.fillna(0).describe()
Applying Histogram
The histogram shows us the frequency distribution of a variable, in this case, our new variable Min_Per_Km:
df.hist('Min_Per_Km')
df.hist('Min_Per_Km', bins = 20)
See that the vast majority of trips fall between 7.0 and 7.5 minutes per kilometer; this shows us how the data is distributed. Depending on the machine learning model, you may need to normalize the data, as sketched below.
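A minimal min-max scaling sketch in plain pandas (scikit-learn's MinMaxScaler would do the same job); the Min_Per_Km_Norm column name is just illustrative:
# Rescale Min_Per_Km to the [0, 1] range
col = df['Min_Per_Km']
df['Min_Per_Km_Norm'] = (col - col.min()) / (col.max() - col.min())
df['Min_Per_Km_Norm'].describe()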
Adjusting Histogram columns
We can change the number of columns with bins:
df.hist('Min_Per_Km', bins = 20, figsize = (10, 8))
plt.xlim((5, 11))
plt.ylim((0, 12))
plt.title("Histogram Minutes Per Km")
plt.grid(False)
plt.savefig('imagens/hist_minutes_per_km.png')
Plotting a variable
A simple plot of a single variable in the set:
df['Distance'].plot()
df.head(15)
Applying Time Windows
Here we have a fascinating concept: the time window. This concept is applied even more frequently when using Apache Spark to work with real-time data.
Within a window, we apply rolling statistics, which are statistics calculated specifically over that window. As the data goes by, we take a window of it, look at the data inside, and apply a statistic.
df['Distance'].plot()
df['Distance'].rolling(window = 2).mean().plot()
The Distance column is already a Series, so we apply the rolling function directly, with a window of two consecutive records, and take the average within each window.
Very interesting!
The blue line refers to the Distance variable that was already plotted. Now we have the new orange line, which refers to the rolling average computed in the window!
In a single chart, we have the distance traveled over time, and also the average of the distances traveled — All in the same chart.
Summing to Time Window
Let's use another statistic in the same two-record window:
df['Distance'].plot()
df['Distance'].rolling(window = 2).sum().plot()
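Related to rolling windows, pandas can also aggregate into fixed calendar windows with resample; a sketch, assuming weekly buckets make sense for this data:
# Total distance per calendar week, instead of a sliding two-record window
df['Distance'].resample('W').sum().plot()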
Indexing
Finally, we can use indexing.
df.index
DatetimeIndex(['2015-08-02', '2015-08-03', '2015-08-04', ...],
              dtype='datetime64[ns]', name='Date', freq=None)
Above, we have the indexes: the dates we set through set_index previously. Now we can apply indexing to filter by date:
df['2015-11':'2015-12']
Plotting a range
We apply indexing by date, select the Distance column, and finish with the plot method, all in a single simple command:
df['2015-11':'2016-1-1']['Distance'].plot()
Index Location
We can look up a row in the index using a specific date with loc:
df.loc['2015-8-12']
Distance 1.420000
Time 0:10:35
Seconds 635.000000
Minutes 10.583333
Min_Per_Km 7.453052
Name: 2015-08-12 00:00:00, dtype: object
Saving in CSV
Finally, when our work is completed, we can save our DataFrame to a new file.
df.to_csv('data/dataframe_saved_v2.csv')
Resetting the Index
If necessary, we can reset the index. See that the Date column is again an ordinary data column, and the index is again a number:
df.reset_index()
We shape the data during the exploratory analysis, collect the insights or information we seek, and then return the data to the original format if we wish. The Data Scientist decides what is best for their process.
And there we have it. I hope you have found this helpful. Thank you for reading.
Leonardo Anello