The Five-Step Process for Data Exploration in a Jupyter Notebook
Teddy Petrou
Python Data Science Expert - Author of Multiple Books and Python Libraries
This post is an excerpt from the book Master Data Analysis with Python, Volume 1. Its goal is to give you a practical and repeatable approach to doing data analysis with pandas in a Jupyter Notebook. This simple process can help keep your notebooks clean, accurate, and readable.
I also have a video on the Dunder Data YouTube channel where I demonstrate this entire process. I believe this material is much better watched than read, so if you have the time, see the video below.
A major pain point for beginners is writing too many lines of code in a single cell. When you are learning, you need to get feedback on every single line of code that you write and verify that it is in fact correct. Only once you have verified the result should you move on to the next line of code.
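For instance, a crowded cell like the hypothetical one below, built from the same operations used later in this post, makes it hard to tell which line introduced a problem:

bikes = pd.read_csv('../data/bikes.csv')
bikes = bikes.set_index('trip_id')
bikes = bikes[['gender', 'tripduration']]
bikes.head()

The five-step process below splits work like this into one main line per cell, so each result can be verified on its own.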
To help increase your ability to do data exploration in Jupyter Notebooks, I recommend the following five-step process:
- Write and execute a single line of code to explore your data
- Verify that this line of code works by inspecting the output
- Assign the result to a variable
- Within the same cell, on a second line, output the head of the DataFrame or Series
- Continue to the next cell. Do not add more lines of code to the cell
Apply the process to every part of the analysis
You can apply this process to every part of your data analysis. Let’s see this process in action with a few examples. We will start by reading in the data.
import pandas as pd
Step 1: Write and execute a single line of code to explore your data
In this step, we make a call to the read_csv function.
pd.read_csv('../data/bikes.csv')
Step 2: Verify that this line of code works by inspecting the output
Looking above, the output appears to be correct. Of course, we can’t inspect every single value, but we can do a sanity check to see if a reasonable-looking DataFrame is produced.
Step 3: Assign the result to a variable
You would normally do this step in the same cell, but for this demonstration, we will place it in the cell below.
bikes = pd.read_csv('../data/bikes.csv')
Step 4: Within the same cell, on a second line, output the head of the DataFrame or Series
Again, all these steps would be combined in the same cell.
bikes.head()
Step 5: Continue to the next cell. Do not add more lines of code to the cell
It is tempting to do more analysis in a single cell. I advise against doing so when you are a beginner. By limiting your analysis to a single main line per cell, and outputting that result, you can easily trace your work from one step to the next. Most lines of code in a notebook will apply some operation to the data. It is vital that you can see exactly what this operation is doing. If you put multiple lines of code in a single cell, you lose track of what is happening and can’t easily determine the veracity of each operation.
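Putting steps 1 through 4 together, the single cell for this first example would contain just these two lines:

bikes = pd.read_csv('../data/bikes.csv')
bikes.head()

Step 5 is then simply moving on to a fresh cell for the next operation.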
More examples
Let’s see another simple example of the five-step process for data exploration in the notebook. Instead of walking through each of the five steps in its own cell, the final result is shown first, with an explanation following.
bikes_id = bikes.set_index('trip_id')
bikes_id.head()
In this part of the analysis, we want to set one of the columns as the index. During step 1, we write a single line of code, bikes.set_index('trip_id'). In step 2, we manually verify that the output looks correct. In step 3, we assign the result to a variable with bikes_id = bikes.set_index('trip_id'). In step 4, we output the head as another line of code, and in step 5, we move on to the next cell.
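As one more sketch of the same pattern, sorting the rides by duration would follow the identical shape. The choice of sort column here is just an illustration using the dataset's tripduration column:

bikes_sorted = bikes.sort_values('tripduration')
bikes_sorted.head()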
No strict requirement for one line of code
The above examples each had a single main line of code followed by outputting the head of the DataFrame. Oftentimes, there will be a few more very simple lines of code that can be written in the same cell. You should not adhere strictly to writing a single line of code; instead, think about keeping the amount of code in a single cell to a minimum.
For instance, the following block has three lines of code. The first is very simple and creates a list of column names as strings. This is an instance where multiple lines of code are easily interpreted.
cols = ['gender', 'tripduration']
bikes_gt = bikes[cols]
bikes_gt.head()
When to assign the result to a variable
Not all operations on our data need to be assigned to a variable. We might just be interested in seeing the results. But for many operations, you will want to continue working with the newly transformed data. By assigning the result to a variable, you keep immediate access to that result for later steps.
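For example, a quick glance at the distribution of a column might not need a variable at all. This is just a sketch, using the gender column from this dataset:

bikes['gender'].value_counts()

If the counts mattered for a later step, you would assign them to a variable instead.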
When to create a new variable name
In the second example, bikes_id was used as the new variable name for the result. Instead, we could have assigned the result to the same variable like this:
bikes = bikes.set_index('trip_id')
This would have the advantage of saving us some memory. Using two variable names keeps both DataFrames bikes and bikes_id in memory. The disadvantage of overwriting a variable name is that we lose traceability within our code. We no longer have access to the original bikes DataFrame. When you are first examining a dataset, I recommend creating new variable names for each new DataFrame/Series that you create. This way, you can access the state of your data at any time.
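For example, because bikes was never overwritten in the earlier example, both DataFrames can still be inspected at any point, each in its own cell per the process above:

bikes.head()      # original data with the default integer index
bikes_id.head()   # the same data indexed by trip_id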
Continuously verifying results
Regardless of how adept you become at doing data explorations, it is good practice to verify each line of code. Data science is difficult and it is easy to make mistakes, even with trivial tasks. Data is also messy and it is good to be skeptical while proceeding through an analysis. Getting visual verification that each line of code is producing the desired result is important. Doing this also provides feedback to help you think about what avenues to explore next.
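A few lightweight checks can support this habit. The lines below are only a sketch of the kinds of verification you might run, each in its own cell; they are not part of the five-step process itself:

bikes.shape          # confirm the number of rows and columns matches expectations
bikes.dtypes         # confirm each column was read in with a sensible data type
bikes.isna().sum()   # count missing values per column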
Get the book
If you’d like to learn more and support my work, please consider purchasing the book Master Data Analysis with Python Volume 1. It is a comprehensive guide to doing data analysis with Python and contains over 600 pages, 300 exercises, multiple projects, and detailed solutions.
Subscribe to the Dunder Data YouTube channel. New videos are released every day at 10 a.m. and 2 p.m.
Follow me on Twitter @TedPetrou for daily musings on Python data science.