The Five-Step Process for Data Exploration in a Jupyter Notebook
Teddy Petrou
Python Data Science Expert - Author of Multiple Books and Python Libraries
This post is an excerpt from the book Master Data Analysis with Python, Volume 1. Its goal is to give you a practical and repeatable approach to doing data analysis with pandas in a Jupyter Notebook. This simple process can help keep your notebooks clean, accurate, and readable.
I also have a video on the Dunder Data YouTube channel where I demonstrate this entire process. I believe this material is much better watched than read, so if you have the time, see the video below.
A major pain point for beginners is writing too many lines of code in a single cell. When you are learning, you need to get feedback on every single line of code that you write and verify that it is in fact correct. Only once you have verified the result should you move on to the next line of code.
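For instance, a crowded cell like the hypothetical one below, built from the same operations used later in this post, makes it hard to tell which line introduced a problem:

bikes = pd.read_csv('../data/bikes.csv')
bikes = bikes.set_index('trip_id')
bikes = bikes[['gender', 'tripduration']]
bikes.head()

The five-step process below splits work like this into one main line per cell, so each result can be verified on its own.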
To help increase your ability to do data exploration in Jupyter Notebooks, I recommend the following five-step process:
- Write and execute a single line of code to explore your data
- Verify that this line of code works by inspecting the output
- Assign the result to a variable
- Within the same cell, on a second line, output the head of the DataFrame or Series
- Continue to the next cell. Do not add more lines of code to the cell
Apply the process to every part of the analysis
You can apply this process to every part of your data analysis. Let’s see this process in action with a few examples. We will start by reading in the data.
import pandas as pd
Step 1: Write and execute a single line of code to explore your data
In this step, we make a call to the read_csv function.
pd.read_csv('../data/bikes.csv')
Step 2: Verify that this line of code works by inspecting the output
Looking above, the output appears to be correct. Of course, we can’t inspect every single value, but we can do a sanity check to see if a reasonable-looking DataFrame is produced.
Step 3: Assign the result to a variable
You would normally do this step in the same cell, but for this demonstration, we will place it in the cell below.
bikes = pd.read_csv('../data/bikes.csv')
Step 4: Within the same cell, on a second line, output the head of the DataFrame or Series
Again, all these steps would be combined in the same cell.
bikes.head()
Step 5: Continue to the next cell. Do not add more lines of code to the cell
It is tempting to do more analysis in a single cell. I advise against doing so when you are a beginner. By limiting your analysis to a single main line per cell, and outputting that result, you can easily trace your work from one step to the next. Most lines of code in a notebook will apply some operation to the data. It is vital that you can see exactly what this operation is doing. If you put multiple lines of code in a single cell, you lose track of what is happening and can’t easily determine the veracity of each operation.
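Putting steps 1 through 4 together, the single cell for this first example would contain just these two lines:

bikes = pd.read_csv('../data/bikes.csv')
bikes.head()

Step 5 is then simply moving on to a fresh cell for the next operation.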
More examples
Let’s see another simple example of the five-step process for data exploration in the notebook. Instead of walking through each of the five steps in its own cell, the final result is shown first, with an explanation following.
bikes_id = bikes.set_index('trip_id')
bikes_id.head()
In this part of the analysis, we want to set one of the columns as the index. During step 1, we write a single line of code, bikes.set_index('trip_id'). In step 2, we manually verify that the output looks correct. In step 3, we assign the result to a variable with bikes_id = bikes.set_index('trip_id'). In step 4, we output the head as another line of code, and in step 5, we move on to the next cell.
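As one more sketch of the same pattern, sorting the rides by duration would follow the identical shape. The choice of sort column here is just an illustration using the dataset's tripduration column:

bikes_sorted = bikes.sort_values('tripduration')
bikes_sorted.head()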
No strict requirement for one line of code
The above examples each had a single main line of code followed by outputting the head of the DataFrame. Oftentimes, there will be a few more very simple lines of code that can be written in the same cell. You should not adhere strictly to writing a single line of code; instead, think about keeping the amount of code in a single cell to a minimum.
For instance, the following block has three lines of code. The first is very simple and creates a list of column names as strings. This is an instance where multiple lines of code are easily interpreted.
cols = ['gender', 'tripduration']
bikes_gt = bikes[cols]
bikes_gt.head()
When to assign the result to a variable
Not all operations on our data need to be assigned to a variable. We might just be interested in seeing the results. But for many operations, you will want to continue working with the newly transformed data. By assigning the result to a variable, you keep immediate access to that result for later steps.
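For example, a quick glance at the distribution of a column might not need a variable at all. This is just a sketch, using the gender column from this dataset:

bikes['gender'].value_counts()

If the counts mattered for a later step, you would assign them to a variable instead.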
When to create a new variable name
In the second example, bikes_id was used as the new variable name for the result. Instead, we could have assigned the result to the same variable like this:
bikes = bikes.set_index('trip_id')
This would have the advantage of saving us some memory. Using two variable names keeps both DataFrames bikes and bikes_id in memory. The disadvantage of overwriting a variable name is that we lose traceability within our code. We no longer have access to the original bikes DataFrame. When you are first examining a dataset, I recommend creating new variable names for each new DataFrame/Series that you create. This way, you can access the state of your data at any time.
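For example, because bikes was never overwritten in the earlier example, both DataFrames can still be inspected at any point, each in its own cell per the process above:

bikes.head()      # original data with the default integer index
bikes_id.head()   # the same data indexed by trip_id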
Continuously verifying results
Regardless of how adept you become at doing data explorations, it is good practice to verify each line of code. Data science is difficult and it is easy to make mistakes, even with trivial tasks. Data is also messy and it is good to be skeptical while proceeding through an analysis. Getting visual verification that each line of code is producing the desired result is important. Doing this also provides feedback to help you think about what avenues to explore next.
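A few lightweight checks can support this habit. The lines below are only a sketch of the kinds of verification you might run, each in its own cell; they are not part of the five-step process itself:

bikes.shape          # confirm the number of rows and columns matches expectations
bikes.dtypes         # confirm each column was read in with a sensible data type
bikes.isna().sum()   # count missing values per column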
Get the book
If you’d like to learn more and support my work, please consider purchasing the book Master Data Analysis with Python Volume 1. It is a comprehensive guide to doing data analysis with Python and contains over 600 pages, 300 exercises, multiple projects, and detailed solutions.
Subscribe to the Dunder Data YouTube channel. New videos are released every day at 10 a.m. and 2 p.m.
Follow me on Twitter @TedPetrou for daily musings on Python data science.