A complete Exploratory Data Analysis in Python

This tutorial explores a rental dataset, performs transformations, and reorganizes the data to prepare it for modeling.

Visit the Jupyter Notebook to see the concepts we will cover about Exploratory Data Analysis. Note: essential functions, outputs, and terms are in bold to facilitate understanding (at least mine).

When we receive data to solve a problem, we are expected to analyze and explore it: look for relationships, understand how the variables are organized, and decide whether or not to transform the data. All of this is done through Exploratory Analysis.

Case study

We will analyze rental data for bikes used as taxis in New York City. New York City is absolute urban chaos, and traffic is chaotic. Because of this, it is very common to find bikes used as taxis, which carry more than one person. This dataset represents this type of rental.

Importing libraries

Let’s import the packages we’ll work with:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.__version__        

Importing data

We use the read_csv function from Pandas and store the result in the df object:

df = pd.read_csv('Taxis-Exploratory-Analysis.csv')        

Checking types

We can see that we have a DataFrame type object to manipulate:

print(type(df))

<class 'pandas.core.frame.DataFrame'>        

Checking data

The first column is the index created automatically by Pandas; the other columns were collected from the city of New York: date, path distance, and path time.

df.head(10)        

Checking data type

Let’s see how Pandas determined the types of each column when loading them:

df.dtypes

Data          object
Distancia    float64
Tempo         object
dtype: object        

However, the first column, Data, should not be a string but a DateTime. When Pandas did the import, it did not detect that this column was a DateTime, and that affects our analysis process.
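
As an alternative to the reload we do below, the conversion could also be done in place with pd.to_datetime. This is just a sketch, assuming the column is still named Data at this point:

# Sketch only: convert the existing string column to datetime64 in place
df['Data'] = pd.to_datetime(df['Data'])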

Checking columns

This returns a list of the columns we have:

df.columns
Index(['Data', 'Distancia', 'Tempo'], dtype='object')        

Adjusting column name

Here we have code to adjust the column names; we will translate them from Portuguese into English. Just create a list with the strings we want and assign it to the columns attribute of the DataFrame df:

cols = ['Date', 'Distance', 'Time']

df.columns = cols # columns receive cols        

Checking indexes

The index starts at 0 and stops at 81 (the last label is 80), with a step of 1:

df.index

RangeIndex(start=0, stop=81, step=1)        

Looking at one column

We call the DataFrame df and filter by the Date column:

df['Date'].head()

0    8/2/15
1    8/3/15
2    8/4/15
3    8/6/15
4    8/7/15
Name: Date, dtype: object        

Converting Date column type

The Date column was loaded as a string, but we can change this behavior when we load the data. To do this, include the parse_dates parameter of read_csv:

Attention: when reloading with parse_dates, we have to reassign the column names as well.

df = pd.read_csv('Taxis-Exploratory-Analysis.csv', parse_dates = ['Data'])

cols = ['Date', 'Distance', 'Time']
df.columns = cols        

Check the type of the Date column:

df['Date'].head()

0   2015-08-02
1   2015-08-03
2   2015-08-04
3   2015-08-06
4   2015-08-07
Name: Date, dtype: datetime64[ns]        

As we can see, the dtype of Date now returns as datetime64, so we successfully performed the conversion. This change is critical since each object type has specific attributes and methods.

Setting Date to Index

We will define the Date column as the index of our df table. If we are going to analyze the data over time, it makes sense to use Date as the index.

By placing Date as the index, we have in our favor all the power of indexing that Pandas offers:

df.set_index('Date', inplace = True)

df.head(10)        

Now we can perform a series of slicing operations, assemble time series to track the data over time, and open up many more possibilities. This conversion depends, of course, on our goals.
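
As a small sketch of what this index enables (assuming the column names defined above and that these dates exist in the data), we could slice a range of months or look up a single day directly:

# Sketch: label-based slicing and lookup on the datetime index
df.loc['2015-09':'2015-10']           # all rows from September through October 2015
df.loc['2015-08-02', 'Distance']      # Distance value for one specific day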

Creating plot

When we call the plot function, it automatically recognizes that Date is the index and creates a time series.

df.plot()        

We can see above the evolution of the bike path distance over time. We can clearly extract the information that the average distance is around 5, and at a given point there is a relevant peak.
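
We can back up this visual impression with a quick numeric check; a minimal sketch:

# Sketch: mean distance versus the largest recorded distance
df['Distance'].mean(), df['Distance'].max()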

Consulting help

With help (the ? prefix in Jupyter), we can view all the parameters, some usage examples of the function, the values we can use, and so on:

?df.plot        

Changing chart type

We’ll make some changes to the chart, starting with its type. First, let’s set the kind parameter to get a bar chart:

df.plot(kind = 'bar')

plt.show()        

The clutter below the chart is the x-axis labels: there is too much information compacted into a small space.
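
One possible way to reduce that clutter, shown here only as a sketch (the weekly tick step is an arbitrary choice), is to thin out and rotate the x-axis labels:

# Sketch: keep one date label per week of data and rotate the labels
ax = df.plot(kind = 'bar', figsize = (12, 4))
ax.set_xticks(range(0, len(df), 7))
ax.set_xticklabels(df.index.strftime('%Y-%m-%d')[::7], rotation = 45)
plt.show()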

Changing chart

Set the kind parameter to get an area chart and define the y-axis label:

df.plot(kind = 'area')

plt.ylabel("Count")        

Attention!

The peak at the end of the chart shows that something relevant occurred, but if we keep it, we will have problems later in predictive modeling: the distance has a fairly constant average, and the peak is an outlier that runs entirely away from that average.

The ideal would be to take the outlier, analyze it separately, and remove it from the dataset, so the average is preserved when we train a model.
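
One possible way to do that, not taken from the original notebook, is a simple IQR rule on the Distance column:

# Sketch: flag and drop distances far above the interquartile range
q1, q3 = df['Distance'].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = df[df['Distance'] > upper]       # inspect these trips separately
df_clean = df[df['Distance'] <= upper]      # data with the peak removed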

Statistical summary

Get the statistical summary of this dataset with a simple method:

df.describe()        

Slicing

We can slice by Date since Date is now an index:

df['2015-11']        

Notice that the filter was made by month and year, so the return was every day of November 2015 present in the data. This is a great advantage of using the date as the index.

Length

We can get the length of the filtered result with the len function:

len(df['2015-11'])
13        
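
The same kind of question can be answered for every month at once by resampling on the datetime index; a minimal sketch:

# Sketch: number of non-missing Distance records per month
df['Distance'].resample('M').count()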

Attention!

So far, we’ve run every cell of the Jupyter Notebook. However, if we close the browser, we would have to rerun it all.

Saving in CSV

Therefore, we can save intermediate results in a CSV file so we don’t lose what was done. For this, we call the to_csv method and store df in a CSV called dataframe_saved_v1:

df.to_csv('dataframe_saved_v1.csv')        
We finish the first notebook here, after saving with to_csv.

Notebook - 2

Visit the Jupyter Notebook to see the concepts we will cover about Exploratory Data Analysis. Note: important functions, outputs, and terms are in bold to facilitate understanding (at least mine).

Uploading saved data

Let’s load the file we saved with to_csv:

df = pd.read_csv('dataframe_saved_v1.csv')

df.head()        

Checking type

Be very careful here! Note that the Date column has been loaded in object format!

When loading the CSV, we need to specify parse_dates again:

df.dtypes

Date          object
Distance     float64
Time          object
dtype: object        

Reloading file

This time, let’s reload the file with parse_dates:

df = pd.read_csv('dataframe_saved_v1.csv', parse_dates = ['Date'])

cols = ['Date', 'Distance', 'Time'] # reassign the column names
df.columns = cols
df.dtypes

Date         datetime64[ns]
Distance            float64
Time                 object
dtype: object        

By rechecking the types, we can safely proceed with Date in datetime64 format.

Setting Date to Index

Let’s set the Date column as the index with the set_index function once again and check the change with head:

df.set_index('Date', inplace = True)

df.head()        

Plot

df.plot()        

Converting time to Seconds

We can identify some needs during the exploratory analysis phase, such as breaking the Time column into parts.

First, we use the reserved word def, name the function total_seconds, and give it a time parameter.

The function starts by checking whether the value passed is NaN. If so, it returns NaN and moves on.

Then we split the Time string on ":" to separate it into hours, minutes, and seconds.

Finally, a calculation converts hours, minutes, and seconds into total seconds. That is, instead of having the trip duration as a clock time, we will have it standardized as total seconds:

def total_seconds(time):
    # keep missing values as missing
    if time is np.nan:
        return np.nan
    # split the "h:mm:ss" string into its three parts
    hrs, mins, seconds = str(time).split(':')
    # convert hours and minutes to seconds and add everything up
    seconds = int(seconds) + 60 * int(mins) + 60 * 60 * int(hrs)
    return seconds        

Applying the function with Map

Let’s apply the function to the Time column with the map method and record the result in a new column called Seconds in df:

df['Seconds'] = df.Time.map(total_seconds)
df.head(10)        

We keep the Time column intact and now have the new Seconds column, the trip duration converted to total seconds. Very useful.


Statistical summary

Statistical summary for the Distance and Seconds columns, which are in numeric format:

df.describe()        

Filling missing values

If we think it’s necessary, we can fill in the NaN values with 0 and then call the statistical summary:

df.fillna(0).describe()        

The statistical values have now changed because we filled the missing values with 0. There are other techniques to handle this, which we’ll see later.
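
One of those alternatives, sketched here only as an illustration, is to impute with the median instead of 0, which distorts the summary statistics less:

# Sketch: median imputation on a copy, so the original df stays untouched
df_imputed = df.copy()
df_imputed['Seconds'] = df_imputed['Seconds'].fillna(df_imputed['Seconds'].median())
df_imputed.describe()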

Converting seconds to minutes

Now we can quantify how many minutes each trip lasted, since we already have the count in Seconds.

Let’s take the Seconds column and apply the map method. As a parameter, we pass an anonymous function using a lambda expression: for each element x in Seconds, divide it by 60 and write the output to the Minutes column:

df['Minutes'] = df['Seconds'].map(lambda x: x / 60)        

Fill in the NaN values with 0 and request the statistical summary:

df.fillna(0).describe()        

Now we have one more column to analyze, the Minutes column. We are breaking up the Time column, and so far we have the total in Seconds and the same quantity expressed in Minutes.

Plot

Let’s take advantage of the now richer DataFrame and draw a scatter plot between Distance and Minutes:

df.plot(x = 'Distance', 
        y = 'Minutes', 
        kind = 'scatter')        
df.corr()        

Clearly, there is a positive relationship between the Distance and Minutes variables. As Distance increases, the trip takes more Minutes, which shows that the data is consistent.

Checking correlation — Pearson

We can call the corr method to verify the correlation between the variables, creating the correlation table. The diagonal of 1s is just the correlation of each variable with itself, the highest possible value.

The Pearson coefficient is used to calculate the standard correlation.
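
For a single pair of variables, the same Pearson coefficient can be obtained directly; a small sketch:

# Sketch: pairwise correlation between two columns (defaults to Pearson)
df['Distance'].corr(df['Minutes'])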

Checking correlation — Spearman

Switch the coefficient to Spearman:

df.corr(method = 'spearman')        

Checking correlation — Kendall

Switch the coefficient to Kendall:

df.corr(method = 'kendall')        

Boxplot

The boxplot helps us visualize outliers, values that fall far from the rest of the distribution (extreme values).

The circles represent the trips that lasted much longer than the average, and it is necessary to investigate these cases.

df.boxplot('Minutes', return_type = 'axes')        

Creating other variables

We can create a variable that indicates the minutes per kilometer:

df['Min_Per_Km'] = df['Minutes'] / df['Distance']

df.fillna(0).describe()        

Applying Histogram

The histogram shows us the frequency distribution of a variable, in this case, our new variable Min_Per_Km:

df.hist('Min_Per_Km')        
df.hist('Min_Per_Km', bins = 20)        

See that the vast majority of trips take between 7.0 and 7.5 minutes per kilometer; this shows us how the data is distributed. Depending on the machine learning model, you may need to normalize the data.
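
As a minimal sketch, assuming we want the numeric columns scaled to the [0, 1] range before modeling, a min-max normalization could look like this:

# Sketch: min-max scaling of the numeric columns into [0, 1]
num_cols = ['Distance', 'Seconds', 'Minutes', 'Min_Per_Km']
df_norm = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
df_norm.describe()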

Adjusting Histogram columns

We can change the number of columns with bins, and also adjust the figure size, axis limits, title, and grid, then save the result to a file:

df.hist('Min_Per_Km', bins = 20, figsize = (10, 8))

plt.xlim((5, 11))
plt.ylim((0, 12))
plt.title("Histogram Minutes Per Km")
plt.grid(False)
plt.savefig('imagens/hist_minutes_per_km.png')        

Plotting a variable

A simple plot of a single variable in the set:

df['Distance'].plot()        
df.head(15)        

Applying Time Windows

Here we have a fascinating concept: the time window. This concept is applied frequently when using Apache Spark to work with real-time data.

Within the window, we apply rolling statistics, which are statistics calculated specifically over that window. As the data passes, we take a time window, look at the data inside it, and apply statistics.

df['Distance'].plot()
pd.Series(df['Distance']).rolling(window = 2).mean().plot()        

First, we take the Distance column as a Series, apply the rolling function with a window of two observations, and compute the average within that window.


Very interesting!

The blue line refers to the Distance variable that was already plotted. Now we have a new orange line that refers to the average computed within the window!

In a single chart, we have the distance traveled over time and also the rolling average of those distances.
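
Other rolling statistics follow the same pattern; a sketch with an arbitrary seven-observation window:

# Sketch: rolling mean and rolling standard deviation over 7 observations
df['Distance'].rolling(window = 7).mean().plot(label = '7-period mean')
df['Distance'].rolling(window = 7).std().plot(label = '7-period std')
plt.legend()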

Summing over the Time Window

Let’s use another statistic within the same two-observation window:

df['Distance'].plot()
pd.Series(df['Distance']).rolling(window = 2).sum().plot()        

Indexing

Finally, we can use indexing.

df.index

DatetimeIndex([...], dtype='datetime64[ns]', name='Date', freq=None)        

Above, we have the indexes, which are the dates we set through set_index earlier. Now we can apply indexing to filter by date:

df['2015-11':'2015-12']        

Plotting a range

We apply indexing by Date, select the Distance column, and finish with the plot method, all in a single command:

df['2015-11':'2016-1-1']['Distance'].plot()        

Index Location

We can find an index using a specific Date:

df.loc['2015-8-12']

Distance        1.420000
Time             0:10:35
Seconds       635.000000
Minutes        10.583333
Min_Per_Km      7.453052
Name: 2015-08-12 00:00:00, dtype: object        

Saving in CSV

Finally, when our work is complete, we can save our DataFrame to a new file.

df.to_csv('data/dataframe_saved_v2.csv')        

Resetting the Index

If necessary, we can reset the index. See that the Date column becomes a regular column again, and the index goes back to being a number:

df.reset_index()        

We transform the data during the exploratory analysis, collect the insights or information we are looking for, and then return the data to the original format if we wish. The Data Scientist decides what is best for their process.

And there we have it. I hope you have found this helpful. Thank you for reading.

Leonardo Anello
