A complete Exploratory Data Analysis in Python
This tutorial explores a rental dataset, performs transformations, and reorganizes the data to prepare it for modeling.
Visit the Jupyter Notebook to see the concepts that we will cover about Exploratory Data Analysis. Note: essential functions, outputs, and terms are bold to facilitate understanding (at least mine).
When we receive data to solve a problem, we are expected to analyze and explore it: look for relationships, understand how the variables are organized, and decide whether or not to transform the data. All of this is done through Exploratory Analysis.
Case study
We will analyze rental data for bikes used as taxis in New York City. New York City is absolute urban chaos, and traffic is chaotic. Because of this, it is widespread to find bikes used as taxis, carrying more than one person. This dataset represents this type of rental.
Importing libraries
Let’s import the packages we’ll work with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.__version__
Importing data
We use the read_csv function from Pandas and store the result in the df object:
df = pd.read_csv('Taxis-Exploratory-Analysis.csv')
Checking types
We can see that we have a DataFrame type object to manipulate:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
Checking data
The first column is the index created automatically by Pandas; the other columns were collected from the city of New York: date, trip distance, and trip time.
df.head(10)
Checking data type
Let’s see how Pandas determined the types of each column when loading them:
df.dtypes
Data object
Distancia float64
Tempo object
dtype: object
However, the first column, Data, should be a DateTime, not a string. When Pandas did the import, it did not detect that the column was a DateTime, and that affects our analysis process.
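If the data is already loaded, a common alternative (a minimal sketch, not the approach used below) is to convert the column after the fact with pd.to_datetime; the explicit format string is an assumption based on the month/day/year values seen in head():
df['Data'] = pd.to_datetime(df['Data'], format = '%m/%d/%y')  # assumed M/D/YY format
df.dtypes  # 'Data' should now show as datetime64[ns]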
Checking columns
Returns a list of the columns we have:
df.columns
Index(['Data', 'Distancia', 'Tempo'], dtype='object')
Adjusting column name
Here we have code to adjust the column names, translating the Portuguese names into English. Just create a list of the strings we want and assign this list to the columns attribute of the DataFrame df:
cols = ['Date', 'Distance', 'Time']
df.columns = cols # columns receive cols
Checking indexes
The index starts at 0 and stops just before 81, in steps of 1:
df.index
RangeIndex(start=0, stop=81, step=1)
Looking at one column
We call the DataFrame df and filter by the Date column:
df['Date'].head()
0 8/2/15
1 8/3/15
2 8/4/15
3 8/6/15
4 8/7/15
Name: Date, dtype: object
Converting Date column type
The Date column was loaded as a string, but we can change this behavior at load time by including the parse_dates parameter of read_csv.
Attention: when using parse_dates, we have to readjust the column names again, since read_csv reloads the file with the original Portuguese names:
df = pd.read_csv('Taxis-Exploratory-Analysis.csv', parse_dates = ['Data'])
cols = ['Date', 'Distance', 'Time']
df.columns = cols
Check the?type?of the Date column:
df['Date'].head()
0   2015-08-02
1   2015-08-03
2   2015-08-04
3   2015-08-06
4   2015-08-07
Name: Date, dtype: datetime64[ns]
As we can see, the dtype of Date now returns as datetime64, so we successfully performed the conversion. This change is critical since each type has its own specific attributes and methods.
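For instance, a datetime64 column exposes the .dt accessor, which a plain string column does not. A minimal sketch (still before setting the index, so Date is a regular column):
df['Date'].dt.month.head()       # month number of each trip
df['Date'].dt.day_name().head()  # weekday name of each trip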
Setting Date to Index
We will define the Date column as the index of our df table. Since we are going to analyze the data over time, it makes sense to use Date as the index.
By placing Date as the index, we get all the power of indexing that Pandas offers:
df.set_index('Date', inplace = True)
df.head(10)
Now we can perform all kinds of slicing and assemble time series to track the data over time; we have many more possibilities. This conversion depends, of course, on our goals.
Creating plot
When we call the plot function, it automatically recognizes that Date is the index and creates a time series.
df.plot()
Above we can see the evolution of the bike trip distance over time. We can already extract the information that the average distance hovers around 5, and that at a certain point there is a relevant peak.
Consulting help
With the ? help operator, we can visualize all the parameters, some examples of using the function, values we can use, and more:
?df.plot
Changing chart type
We'll make some changes to the chart, starting with its type. First, let's set the kind parameter to get a bar chart:
df.plot(kind = 'bar')
plt.show()
The clutter below the chart is the x-axis labels, crowded together because of the high density of information.
Changing chart
Adjust the kind parameter for an area chart and set the y-axis label:
df.plot(kind = 'area')
plt.ylabel("Count")
Attention!
The peak at the end of the chart shows that something relevant occurred, but if we keep it, we will have problems later in predictive modeling: the average distance is otherwise fairly constant, and the peak is an outlier that deviates entirely from that average.
The ideal is to take the outlier out, analyze it separately, and remove it from the dataset, so the average is preserved when we train the model.
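A minimal sketch of one way to do that, flagging trips more than three standard deviations above the mean; the threshold is an assumption, not something dictated by this dataset:
mean, std = df['Distance'].mean(), df['Distance'].std()
outliers = df[df['Distance'] > mean + 3 * std]    # inspect these separately
df_no_peak = df[df['Distance'] <= mean + 3 * std]  # modeling copy without the peak
Note that rows with a missing Distance also fall out of df_no_peak, since NaN fails the comparison.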
Statistical summary
Get the statistical summary of this dataset with a simple method:
df.describe()
Slicing
We can slice by Date since Date is now an index:
df.loc['2015-11']  # .loc is needed for partial-date selection in recent pandas versions
Notice that the filter was made by month and year, so the return was every day of November 2015. This is a great advantage of using the date as the index.
Length
We can get the number of rows returned by the filter with the len function:
len(df.loc['2015-11'])
13
Attention!
So far, we've run every cell of the Jupyter Notebook in sequence. However, if we close the browser, we would have to rerun everything.
Saving in CSV
Therefore, we can save intermediate results to a CSV file so as not to lose what was done. For this, we call the to_csv function and store df in a CSV called dataframe_saved_v1:
df.to_csv('dataframe_saved_v1.csv')
We can finish our day here, with .to_csv preserving the work done so far.
Notebook - 2
Uploading saved data
Let's load the file we saved through to_csv:
df = pd.read_csv('dataframe_saved_v1.csv')
df.head()
Checking type
Be very careful here! Note that the Date column has been loaded back in object format!
When loading the CSV, we need to pass parse_dates again:
df.dtypes
Date object
Distance float64
Time object
dtype: object
Reloading file
This time, let's reload the file with parse_dates. Since we saved the file after renaming the columns, they are already in English, so the parameter now refers to Date and no renaming is needed:
df = pd.read_csv('dataframe_saved_v1.csv', parse_dates = ['Date'])
df.dtypes
Date datetime64[ns]
Distance float64
Time object
By rechecking the types, we can be confident in proceeding with Date in the datetime64 format.
Setting Date to Index
Let's set the Date column as the index with the set_index function once again and check the change with head:
df.set_index('Date', inplace = True)
df.head()
Plot
df.plot()
Converting time to Seconds
We can identify some needs during the exploratory analysis phase, such as breaking the time column into parts.
First, we use the reserved word def, name the function total_seconds, and give it a time parameter.
The function first checks whether the value passed in is missing (NaN). If so, it returns NaN and moves on.
Then we split the string in the Time column on each ':' to separate it into hours, minutes, and seconds.
Finally, a calculation converts the hours, minutes, and seconds into total seconds. That is, instead of having the trip duration as a clock time, we will have it standardized as total seconds:
def total_seconds(time):
    # pd.isna is more robust than 'time is np.nan' for detecting missing values
    if pd.isna(time):
        return np.nan
    hrs, mins, seconds = str(time).split(':')
    seconds = int(seconds) + 60 * int(mins) + 60 * 60 * int(hrs)
    return seconds
Applying the function with Map
Let's apply the function to the Time column with the map method, and record the result in a new column called Seconds in df:
df['Seconds'] = df.Time.map(total_seconds)
df.head(10)
We keep the Time column intact and now have the new Seconds column with the trip duration converted to total seconds. Very useful.
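As a side note, newer versions of pandas can do the same conversion without a custom function, via to_timedelta; a sketch, assuming every non-missing Time value follows the H:MM:SS pattern:
# Vectorized alternative: NaN values become NaT and end up as NaN seconds
df['Seconds'] = pd.to_timedelta(df['Time']).dt.total_seconds()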
Statistical summary
Statistical summary for the Distance and Seconds columns, which are in numeric format:
df.describe()
Filling missing values
If we think it's necessary, we can fill in the NaN values with 0 and then call the statistical summary:
df.fillna(0).describe()
The statistical values have now changed because we filled the missing values with 0. There are more techniques to solve this, as sketched below, and we'll revisit them later.
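A sketch of two such alternatives, applied to copies so df itself stays untouched; the names df_median and df_interp are just illustrative:
# Median imputation: less sensitive to outliers than filling with 0 or the mean
df_median = df.fillna(df.median(numeric_only = True))
# Time-based interpolation: estimate gaps from the surrounding dates
df_interp = df.copy()
df_interp['Seconds'] = df_interp['Seconds'].interpolate(method = 'time')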
Converting seconds to minutes
Now we can quantify how many minutes each trip lasted, since we already have the count in seconds.
Let's take the Seconds column and apply the map method. As a parameter, we pass an anonymous function using a lambda expression: for each element x in Seconds, divide it by 60 and write the output to the Minutes column:
df['Minutes'] = df['Seconds'].map(lambda x: x / 60)
Fill in NaN values with 0 and request the statistical summary:
df.fillna(0).describe()
Now we have one more column to analyze, the Minutes column. We are breaking down the Time column: so far we have the total Seconds and have quantified it into Minutes:
Plot
Let's take advantage of our now richer DataFrame and draw a scatter plot between Distance and Minutes:
df.plot(x = 'Distance',
y = 'Minutes',
kind = 'scatter')
df.corr(numeric_only = True)  # numeric_only skips the non-numeric Time column (needed in newer pandas)
Clearly, there is a positive relationship between the Distance variable and the Minutes variable. As Distance increases, the trip takes more Minutes; this shows that there is consistency in the data.
Checking correlation — Pearson
We can call the corr method, as we did above, to verify the correlation between the variables, producing the table of correlations. The diagonal of 1s is just the correlation of each variable with itself, the highest possible.
By default, the Pearson coefficient is used to calculate the correlation.
Checking correlation — Spearman
Switch the coefficient to Spearman:
df.corr(method = 'spearman', numeric_only = True)
Checking correlation — Kendall
Switch the coefficient to Kendall:
df.corr(method = 'kendall', numeric_only = True)
Boxplot
A boxplot helps us visualize outliers: values that fall far from the rest of the distribution, that is, extreme values.
The circles represent trips that lasted much longer than the average, and it is necessary to investigate these cases.
df.boxplot('Minutes', return_type = 'axes')
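To list the trips behind those circles, one common sketch uses the interquartile range; the 1.5 multiplier is the usual boxplot convention, not something specific to this dataset:
# Trips whose duration falls outside 1.5 * IQR beyond the quartiles
q1, q3 = df['Minutes'].quantile([0.25, 0.75])
iqr = q3 - q1
df[(df['Minutes'] < q1 - 1.5 * iqr) | (df['Minutes'] > q3 + 1.5 * iqr)]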
Creating other variables
We can create a variable that indicates the number of minutes per kilometer:
df['Min_Per_Km'] = df['Minutes'] / df['Distance']
df.fillna(0).describe()
Applying Histogram
The histogram shows us the frequency distribution of a variable, in this case, our new variable Min_Per_Km:
df.hist('Min_Per_Km')
df.hist('Min_Per_Km', bins = 20)
See that the vast majority of trips fall between 7.0 and 7.5 minutes per kilometer; this shows us how the data is distributed. Depending on the machine learning model, you may need to normalize the data, as sketched below.
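A minimal min-max scaling sketch in plain pandas (scikit-learn's MinMaxScaler would do the same job); the Min_Per_Km_Norm column name is just illustrative:
# Rescale Min_Per_Km to the [0, 1] range
col = df['Min_Per_Km']
df['Min_Per_Km_Norm'] = (col - col.min()) / (col.max() - col.min())
df['Min_Per_Km_Norm'].describe()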
Adjusting Histogram columns
We can change the number of columns with bins:
df.hist('Min_Per_Km', bins = 20, figsize = (10, 8))
plt.xlim((5, 11))
plt.ylim((0, 12))
plt.title("Histogram Minutes Per Km")
plt.grid(False)
plt.savefig('imagens/hist_minutes_per_km.png')
Plotting a variable
A simple plot of a single variable in the set:
df['Distance'].plot()
df.head(15)
Applying Time Windows
Here we have a fascinating concept: the time window. This concept is applied even more frequently when using Apache Spark to work with real-time data.
Within a window, we apply rolling statistics, which are statistics calculated specifically over that window. As the data goes by, we take a window of it, look at the data inside, and apply a statistic.
df['Distance'].plot()
df['Distance'].rolling(window = 2).mean().plot()
The Distance column is already a Series, so we apply the rolling function directly, with a window of two consecutive records, and take the average within each window.
Very interesting!
The blue line refers to the Distance variable that was already plotted. Now we have the new orange line, which refers to the rolling average computed in the window!
In a single chart, we have the distance traveled over time, and also the average of the distances traveled — All in the same chart.
Summing to Time Window
Let's use another statistic in the same two-record window:
df['Distance'].plot()
df['Distance'].rolling(window = 2).sum().plot()
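Related to rolling windows, pandas can also aggregate into fixed calendar windows with resample; a sketch, assuming weekly buckets make sense for this data:
# Total distance per calendar week, instead of a sliding two-record window
df['Distance'].resample('W').sum().plot()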
Indexing
Finally, we can use indexing.
df.index
DatetimeIndex(['2015-08-02', '2015-08-03', '2015-08-04', ...],
              dtype='datetime64[ns]', name='Date', freq=None)
Above, we have the indexes: the dates we set through set_index previously. Now we can apply indexing to filter by date:
df['2015-11':'2015-12']
Plotting a range
We apply indexing by date, select the Distance column, and finish with the plot method, all in a single simple command:
df['2015-11':'2016-1-1']['Distance'].plot()
Index Location
We can look up a row in the index using a specific date with loc:
df.loc['2015-8-12']
Distance 1.420000
Time 0:10:35
Seconds 635.000000
Minutes 10.583333
Min_Per_Km 7.453052
Name: 2015-08-12 00:00:00, dtype: object
Saving in CSV
Finally, when our work is completed, we can save our DataFrame to a new file.
df.to_csv('data/dataframe_saved_v2.csv')
Resetting the Index
If necessary, we can reset the index. See that the Date column is again an ordinary data column, and the index is again a number:
df.reset_index()
We shape the data during the exploratory analysis, collect the insights or information we seek, and then return the data to the original format if we wish. The Data Scientist decides what is best for their process.
And there we have it. I hope you have found this helpful. Thank you for reading.
Leonardo Anello