Data Loading with Pandas: Understanding the Intricacies of the read_csv Function
Introduction
One of the most popular formats for structured data is CSV (Comma-Separated Values). CSV files are plain text files with a simple structure: values are separated by commas (or other delimiters like tabs or semicolons), and each row represents a record. This simplicity makes CSV files easy to understand and work with, even for non-technical users.
Pandas is one of the most widely used libraries for working with structured data. Due to the popularity of CSV files, one of the most frequently used Pandas functions is read_csv().
Most people use this function simply to load CSV files and stop there. However, few are aware of the many parameters it offers. The read_csv() function is a powerful tool capable of handling the complexities of real-world data with precision.
In this article, I’ll explore some of its lesser-known parameters: usecols, parse_dates, dtype, skiprows, nrows, and na_values. I will show how you can use them effectively in your analysis.
1. The usecols parameter
By default, the read_csv() function loads an entire file. This is fine for small files, but when dealing with large datasets, it can negatively impact performance, especially in memory-constrained environments. In such cases, it’s more efficient to load only the necessary columns. The read_csv() function provides the usecols parameter, which allows you to specify which columns to load. Here is an example:
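Here is a minimal sketch of such a call, assuming the data lives in the Electronics_Sales.csv file used throughout this article:

```python
import pandas as pd

# Load only the two columns we need instead of the entire file
df = pd.read_csv(
    "Electronics_Sales.csv",
    usecols=["Product Type", "Payment Method"],
)

print(df.head())
```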
In this code, instead of loading the entire dataset, we use the usecols parameter to specify that we only want the 'Product Type' and 'Payment Method' columns. Loading only the necessary columns improves performance, especially when working with large files.
Build the Confidence to Tackle Data Analysis Projects [40% OFF]
To build a successful data analysis project, one must have skills in data cleaning and preprocessing, visualization, modeling, EDA, and so forth. The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end of 50 days, you should be confident enough to take on any data analysis project with Python. Take advantage of the March discount by clicking: 50-day challenge now.
Other Resources
Want to learn Python fundamentals the easy way? Check out Master Python Fundamentals: The Ultimate Python Course for Beginners
Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.
Want 100 Python tips and tricks? Check out Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.
2. The parse_dates parameter
When performing time-series analysis, it’s crucial to ensure that date columns are parsed as datetime objects. Without this, date-related operations (such as calculating the difference between specific dates) may not produce accurate results.
By default, if we load the data from the "Electronics_Sales.csv" file without specifying any parameters, the "Purchase Date" column will be treated as an object rather than a datetime type, as shown below.
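One quick way to confirm this is to load the file with the defaults and inspect the inferred types (a minimal sketch; the exact output depends on your copy of the file):

```python
import pandas as pd

# Load with default settings and check the inferred data types
df = pd.read_csv("Electronics_Sales.csv")
print(df.dtypes)  # "Purchase Date" is reported as object, not datetime64[ns]
```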
To convert the "Purchase Date" column into a datetime object, we can pass it to the parse_dates parameter, as shown below:
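A minimal sketch of that call, again assuming the Electronics_Sales.csv file:

```python
import pandas as pd

# Ask read_csv to parse the "Purchase Date" column as dates while loading
df = pd.read_csv("Electronics_Sales.csv", parse_dates=["Purchase Date"])

print(df.dtypes)                           # "Purchase Date" is now datetime64[ns]
print(df["Purchase Date"].dt.year.head())  # datetime accessors now work
```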
You can see that the "Purchase Date" column has now been converted to the datetime64[ns] data type. With this format, you can carry out time-series analysis on this column.
3. The dtype Parameter
By default, when loading data using the read_csv() function, Pandas infers data types from the file. While this is convenient, it can lead to several issues. Automatic type inference adds extra processing time, especially for large datasets, and may result in inefficient memory usage. For instance, Pandas might assign float64 instead of float32, consuming more memory than necessary.
To avoid these issues, it's often better to manually specify data types using the dtype parameter in read_csv(). This ensures accuracy, improves performance, and optimizes memory usage.
In our loaded dataset, the inferred data type for the "Total Price" and "Unit Price" columns is float64. We can improve efficiency by explicitly setting them to float32 using the dtype parameter, as shown below:
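A sketch of what that looks like, using the column names introduced above:

```python
import pandas as pd

# Request float32 for the two price columns instead of the inferred float64,
# roughly halving the memory those columns consume
df = pd.read_csv(
    "Electronics_Sales.csv",
    dtype={"Total Price": "float32", "Unit Price": "float32"},
)

print(df.dtypes)
```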
You can see in the output that the data type of the two columns has changed to float32.
4. The skiprows and nrows parameters
By default, the read_csv() function loads data from the first row to the last row of the dataset. While this is convenient, it may not always be efficient. For instance, the first few rows might contain metadata that isn’t relevant to your analysis. Instead of loading unnecessary rows, it’s better to skip them.
Similarly, if you're running a test or working with large datasets, you may not need to load the entire file. In such cases, limiting the number of rows can improve performance.
The skiprows parameter allows you to skip a specified number of rows, while the nrows parameter lets you define how many rows to load. See the example below:
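A minimal sketch of that call:

```python
import pandas as pd

# Skip the first two rows of the file and load only the next five rows.
# Because no header is specified, the first remaining row becomes the header.
df = pd.read_csv("Electronics_Sales.csv", skiprows=2, nrows=5)

print(df)
```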
Here, we have skipped the first two rows and limited the number of rows to five using the nrows parameter. Since we didn’t explicitly define the header, read_csv() automatically treats the first remaining row as the header.
5. The na_values parameter
By default, Pandas recognizes NaN, NA, and empty strings as missing, but you can extend this list. Using the na_values parameter, you can customize how missing values are interpreted. Here is an example:
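Here is a sketch of such a call; the exact columns passed to usecols are illustrative:

```python
import pandas as pd

# Treat the strings "missing" and "0.00" as NaN while loading,
# select a couple of columns, and read only the first five rows
df = pd.read_csv(
    "Electronics_Sales.csv",
    usecols=["Product Type", "Total Price"],  # illustrative column choice
    na_values=["missing", "0.00"],
    nrows=5,
)

print(df)
```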
Here we are using the usecols parameter to select specific columns from the dataset. The na_values=['missing', '0.00'] specifies which values will be treated as NaN when the data is loaded. Any occurrence of the string "missing" will be replaced with NaN, and any occurrence of "0.00" will also be replaced with NaN. Because we are using nrows=5, only five rows are loaded.
Wrap-Up
Just like that, we’ve explored five powerful parameters you can use with the popular read_csv() function. These are just a few of the many options available, each designed to help you fine-tune data loading, improve efficiency, and handle real-world datasets with ease.
By leveraging these parameters, you can optimize performance, avoid common pitfalls, and ensure cleaner data for your analysis.
Take the time to explore these options and see how they can enhance your workflow. Thanks for reading!
Newsletter Sponsorship
You can reach a highly engaged audience of over 345,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at benjaminbennettalexander@gmail.com today to learn more about the sponsorship opportunities.