Memory Matters: How Downcasting Can Optimize Your Pandas and NumPy Workflows
Munazza Bhombal
Data Analyst | Expertise in Python, SQL & Statistics | Pursuing a Post-Graduate degree in ML & AI from IIIT-B
We all deal with different kinds of values in programming, like whole numbers, decimals, and text. In Python, these are simply known as int, float, and str. You can easily convert between them, but there’s a catch: Python is dynamically typed and gives you no say in how much memory a value actually uses. Whether it’s a tiny integer or an enormous one, Python handles the storage under the hood. That’s convenient, but it’s not always efficient, especially when you’re working with big datasets.
Why Does Python's Flexibility Come at a Cost?
In basic Python, you can’t specify exactly how much memory a number should use. For example, Python’s int type automatically grows to accommodate the size of the number you store, which is handy but also takes up more memory than necessary. Here’s a simple example:
num = 10
print(type(num)) # Output: <class 'int'>
No matter how small or large num is, Python doesn’t give you a way to control the exact size of the integer. That’s where the inefficiency comes in, especially if you're working with a lot of numbers and don’t need all that extra space.
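You can actually see this overhead with the standard sys module. The snippet below is a quick sketch of my own (the byte counts are what a typical 64-bit CPython reports and can vary by build): even a tiny integer occupies around 28 bytes as a Python object, while a NumPy int16 needs just 2 bytes per value inside an array.
import sys
import numpy as np

small = 10
huge = 10 ** 30

# Every Python int is a full object, and its size grows with the value.
print(sys.getsizeof(small))          # ~28 bytes on 64-bit CPython
print(sys.getsizeof(huge))           # noticeably larger: the int grew to fit

# For comparison, one int16 element in a NumPy array:
print(np.dtype(np.int16).itemsize)   # 2 bytes per value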
Enter Downcasting in Pandas & NumPy: Control Over Data Types
So, let’s talk about downcasting, a very handy technique that helps you save memory when working with large datasets in Pandas and NumPy. Think of it as putting your data on a diet without sacrificing the important stuff!
Why Should You Downcast?
Why downcast, you ask? Well, it’s all about using the least amount of memory possible. If your integers are like tiny baby elephants (cute, right?), you don’t need to give them the entire elephant-sized enclosure of int64, which spends 8 bytes on every value. Instead, you can downcast them to int16, the compact, lightweight version that needs only 2 bytes per value. This way, you keep things efficient, which is super helpful when you’re dealing with mountains of data!
How to Downcast in Pandas
import pandas as pd
df = pd.DataFrame({'numbers': [1000, 2000, 3000]})
df['numbers'] = pd.to_numeric(df['numbers'], downcast='integer')
print(df['numbers'].dtype) # Output: int16
In this case, the integers are now stored as int16, reducing memory usage.
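If you’d like to see that saving in actual numbers rather than take my word for it, compare memory_usage before and after the downcast. Here is a quick sketch with a made-up column of a million values (on this data the downcast lands on int32, since the values are too big for int16):
import pandas as pd

df = pd.DataFrame({'numbers': range(1_000_000)})           # default dtype: int64 on most platforms
before = df['numbers'].memory_usage(deep=True)
df['numbers'] = pd.to_numeric(df['numbers'], downcast='integer')
after = df['numbers'].memory_usage(deep=True)

print(df['numbers'].dtype)                                  # int32
print(f"{before:,} bytes -> {after:,} bytes")               # roughly halved
Going from 8 bytes to 4 bytes per value cuts the column’s footprint roughly in half; dropping all the way to int16 would quarter it.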
df['numbers'] = df['numbers'].astype('int16')
This explicitly sets the data type to int16. One caveat: unlike pd.to_numeric with downcast, astype does not check whether your values actually fit in the target type, so out-of-range values can overflow silently instead of raising an error (the exact behavior depends on your pandas/NumPy version).
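Here is a minimal guard you could put in front of such an explicit cast; the bounds check with np.iinfo is my own addition, not part of the original example:
import numpy as np
import pandas as pd

s = pd.Series([1000, 2000, 40000])       # 40000 does not fit in int16
bounds = np.iinfo(np.int16)              # valid int16 range: -32768 to 32767
if s.min() >= bounds.min and s.max() <= bounds.max:
    s = s.astype('int16')
else:
    print('Values fall outside the int16 range; keeping a wider dtype')
This way you only narrow the type when it is actually safe to do so.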
df['genders'] = pd.Series(['male', 'female', 'female'], dtype='category')
print(df['genders'].dtype) # Output: category
This conversion can save a lot of memory when a column has many repeated entries, because pandas stores each unique value only once and keeps compact integer codes in its place.
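To get a feel for how big that saving can be, here is a small sketch of my own comparing the same repetitive column stored as plain strings versus as a categorical:
import pandas as pd

genders = pd.Series(['male', 'female', 'female'] * 100_000)   # lots of repeats
as_object = genders.memory_usage(deep=True)
as_category = genders.astype('category').memory_usage(deep=True)

print(f'object:   {as_object:,} bytes')
print(f'category: {as_category:,} bytes')
The more repetitive the column, the bigger the win; for a column of mostly unique strings, the category dtype can actually cost more.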
How to Downcast in NumPy
Downcasting in NumPy is also straightforward:
import numpy as np
# Create an array with 32-bit integers
array_int32 = np.array([1, 2, 3, 4, 5], dtype=np.int32)
print(array_int32.dtype) # Output: int32
# Create an array with 64-bit integers
array_int64 = np.array([1, 2, 3, 4, 5], dtype=np.int64)
# Downcast to 32-bit integers
array_int32_downcasted = array_int64.astype(np.int32)
print(array_int32_downcasted.dtype) # Output: int32
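NumPy also makes it easy to confirm what you saved, because every array reports its total size through the nbytes attribute. A quick sketch (the array contents are arbitrary):
import numpy as np

array_int64 = np.arange(1_000_000, dtype=np.int64)
array_int32 = array_int64.astype(np.int32)     # safe here: every value fits in int32

print(array_int64.nbytes)                      # 8000000 -> 8 bytes per element
print(array_int32.nbytes)                      # 4000000 -> 4 bytes per element
One caution: astype will happily cast values that do not fit the smaller type and let them wrap around, so check your minimum and maximum first (np.iinfo gives the valid range, as in the earlier pandas sketch).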
Memory Optimization Using Downcasting
In this section, I will demonstrate how to optimize memory usage in Python by applying the downcasting techniques above in Pandas and NumPy. The goal is to reduce the dataset's memory footprint without compromising its integrity.
Step 1: Import Libraries
First, I imported the necessary libraries:
import pandas as pd
import numpy as np
Step 2: Import a Dataset
For this demonstration I worked with the global COVID-19 deaths time series, a reasonably small CSV. Here’s how I loaded it:
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
Step 3: Check Initial Memory Usage
I checked the initial memory usage of the DataFrame:
# memory_usage returns bytes, so convert to MB
print(deaths.memory_usage(deep=True).sum() / (1024 ** 2))  # ~58.34 MB
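Before deciding what to convert, I find it useful to break the usage down per column; this step is a small addition of my own rather than part of the original walkthrough, but it quickly shows which columns (usually the string-typed object ones) dominate:
# Per-column memory usage in MB, largest first
per_column_mb = deaths.memory_usage(deep=True) / (1024 ** 2)
print(per_column_mb.sort_values(ascending=False))
print(deaths.dtypes)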
Step 4: Downcasting Numeric Columns and Converting Strings to Categorical
To optimize memory usage, I downcasted the numeric columns and converted string columns to categorical data types:
# Reduce data types for the DataFrame
deaths['Lat'] = deaths['Lat'].astype('float32')
deaths['Long'] = deaths['Long'].astype('float32')
deaths['num_deaths'] = pd.to_numeric(deaths['num_deaths'], downcast='integer')
deaths['Country/Region'] = deaths['Country/Region'].astype('category')
Step 5: Check Memory Usage After Downcasting
After applying downcasting, I checked the memory usage again:
print(deaths.memory_usage(deep=True).sum() / (1024 ** 2))  # ~26.79 MB
Step 6: Results Comparison
Finally, I compared the initial and optimized memory usage:
initial_memory = 58.34 # in MB
# Convert to MB
optimized_memory = deaths.memory_usage(deep=True).sum() / (1024 ** 2)
print(f"Initial Memory Usage: {initial_memory} MB") # 58.34
print(f"Optimized Memory Usage: {optimized_memory:.2f} MB") #26.79
The optimized memory usage after downcasting was approximately 26.79 MB, a reduction of roughly 54% from the original 58.34 MB. Although this dataset is small, the same memory optimization principles are essential for managing larger datasets effectively in real-world applications.
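If you end up repeating these conversions across projects, the whole recipe can be wrapped in a small helper. The function below is a sketch of my own, not something shipped with pandas: it downcasts integer and float columns and converts sufficiently repetitive string columns to category.
import pandas as pd

def shrink_dataframe(df, cat_threshold=0.5):
    """Return a memory-optimized copy of df: numeric columns are downcast
    and string columns with few unique values become categoricals."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif pd.api.types.is_object_dtype(out[col]):
            # Only convert repetitive columns; mostly-unique strings gain nothing
            if out[col].nunique(dropna=True) <= cat_threshold * len(out):
                out[col] = out[col].astype('category')
    return out

# Hypothetical usage on the DataFrame from the walkthrough:
# deaths_small = shrink_dataframe(deaths)
# print(deaths_small.memory_usage(deep=True).sum() / (1024 ** 2))
Keep in mind that a float downcast to float32 stores roughly seven significant digits, which is plenty for coordinates like Lat and Long but worth double-checking for high-precision scientific data.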
Thank you for taking the time to read my blog. I hope you found the insights helpful for your own projects. If you have any questions or experiences to share, please feel free to leave a comment below. Happy coding!