Memory Matters: How Downcasting Can Optimize Your Pandas and NumPy Workflows

We all deal with different kinds of values in programming, like whole numbers, decimals, and text. In Python, these are simply known as int, float, and str. You can easily convert between them, but there's a catch: plain Python gives you no say in how these values are stored in memory. Whether an integer is tiny or enormous, Python picks the representation for you under the hood. That's convenient, but it isn't always efficient, especially when you're working with big datasets.

Why Does Python's Flexibility Come at a Cost?

In basic Python, you can’t specify exactly how much memory a number should use. For example, Python’s int type automatically grows to accommodate the size of the number you store, which is handy but also takes up more memory than necessary. Here’s a simple example:

num = 10
print(type(num))  # Output: <class 'int'>        

No matter how small or large num is, Python doesn’t give you a way to control the exact size of the integer. That’s where the inefficiency comes in, especially if you're working with a lot of numbers and don’t need all that extra space.
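To see the difference concretely, here is a small comparison on a typical 64-bit CPython build (exact sizes can vary by interpreter and platform). NumPy's fixed-width types, covered below, let you choose the storage yourself:

import sys
import numpy as np

# A plain Python int is a full object with bookkeeping overhead
print(sys.getsizeof(10))      # 28 bytes on a typical 64-bit CPython

# NumPy's fixed-width integers use exactly the number of bytes you pick
print(np.int16(10).nbytes)    # 2 bytes
print(np.int64(10).nbytes)    # 8 bytes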

Enter Downcasting in Pandas & NumPy: Control Over Data Types

So, let's talk about downcasting, a handy technique that helps you save memory when working with large datasets in Pandas and NumPy. Think of it as putting your data on a diet, without sacrificing the important stuff!

Why Should You Downcast?

Why downcast, you ask? Well, it’s all about using the least amount of memory possible. If your integers are like tiny baby elephants (cute, right?), you don’t need to give them the entire elephant-sized space of int64. Instead, you can downcast them to int16—the more compact, lightweight version. This way, you keep things efficient, which is super helpful when you’re dealing with mountains of data!
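To put a rough number on that, here is a quick sketch comparing one million small integers stored at the default 64-bit width versus 16-bit (the figures below are approximate and include a little index overhead):

import numpy as np
import pandas as pd

s64 = pd.Series(np.random.randint(0, 1000, size=1_000_000), dtype='int64')
s16 = s64.astype('int16')

print(s64.memory_usage(deep=True))  # ~8,000,128 bytes
print(s16.memory_usage(deep=True))  # ~2,000,128 bytes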

How to Downcast in Pandas

  • Using pd.to_numeric() for Numbers: You can use the pd.to_numeric() function to downcast numeric columns. Here’s an example:

import pandas as pd

df = pd.DataFrame({'numbers': [1000, 2000, 3000]})
df['numbers'] = pd.to_numeric(df['numbers'], downcast='integer')
print(df['numbers'].dtype)  # Output: int16        

In this case, the integers are now stored as int16, reducing memory usage. (pd.to_numeric() can shrink floats and unsigned integers the same way; see the first note after this list.)

  • Using astype() for Specific Data Types: You can also directly change the data type of a column using astype():

df['numbers'] = df['numbers'].astype('int16')        

This explicitly sets the data type to int16. Note that astype() does not range-check values, a point illustrated after this list.

  • Converting Strings to Category: If you have a column with repeated values, like genders, you can convert it to the category type:

df['genders'] = pd.Series(['male', 'female', 'female'], dtype='category')
print(df['genders'].dtype)  # Output: category        

This conversion can save memory if there are many repeated entries, as the rough comparison after this list shows.
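A couple of quick notes on the examples above. First, the downcast argument of pd.to_numeric() also accepts 'float', 'signed', and 'unsigned', so the same trick works beyond plain integers. Here is a small sketch (the ratios and counts columns are made up for illustration):

# 'float' downcasts to the smallest float dtype (float32 at minimum)
df['ratios'] = pd.to_numeric(pd.Series([0.1, 0.2, 0.3]), downcast='float')
print(df['ratios'].dtype)  # float32

# 'unsigned' works when every value is non-negative
df['counts'] = pd.to_numeric(pd.Series([10, 20, 30]), downcast='unsigned')
print(df['counts'].dtype)  # uint8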
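Second, astype() does not check whether values actually fit in the target type, so anything outside the range would silently overflow. A minimal safety check, using the same df as above, is to compare against the target type's limits first:

import numpy as np

info = np.iinfo('int16')
values = df['numbers']

# Only downcast if every value fits in int16's range (-32768 to 32767)
if values.min() >= info.min and values.max() <= info.max:
    df['numbers'] = values.astype('int16')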
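Finally, to make the category savings concrete, here is a rough comparison of the same repetitive column stored as plain strings versus as a category (the figures are approximate and will vary by platform):

genders_obj = pd.Series(['male', 'female', 'female'] * 100_000)   # object dtype
genders_cat = genders_obj.astype('category')

print(genders_obj.memory_usage(deep=True))  # roughly 18 MB of string storage
print(genders_cat.memory_usage(deep=True))  # roughly 0.3 MB: one small code per row plus two labels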

How to Downcast in NumPy

Downcasting in NumPy is also straightforward:

  • Creating Arrays with Specified Data Types: When creating a NumPy array, you can specify the data type:

import numpy as np

# Create an array with 32-bit integers
array_int32 = np.array([1, 2, 3, 4, 5], dtype=np.int32)
print(array_int32.dtype)  # Output: int32        

  • Using astype() to Change Types: If you need to change the type later, you can do so with astype() (see the note after this list for verifying the savings and a caveat about float precision):

# Create an array with 64-bit integers
array_int64 = np.array([1, 2, 3, 4, 5], dtype=np.int64)

# Downcast to 32-bit integers
array_int32_downcasted = array_int64.astype(np.int32)
print(array_int32_downcasted.dtype)  # Output: int32
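To check that a NumPy downcast actually paid off, nbytes reports the raw buffer size. One caveat worth knowing: downcasting floats trades precision for memory, since float32 keeps only about 7 significant digits. A short sketch of both points (the example values are arbitrary):

# Verify the savings: nbytes is the size of the underlying data buffer
big = np.ones(1_000_000, dtype=np.int64)
small = big.astype(np.int16)
print(big.nbytes, small.nbytes)  # 8000000 2000000

# Float downcasting loses precision beyond ~7 significant digits
precise = np.array([123456.789012345], dtype=np.float64)
print(precise.astype(np.float32))  # [123456.79]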
        

Memory Optimization Using Downcasting

In this section, I will demonstrate how to optimize memory usage in Python by utilizing downcasting techniques in Pandas and NumPy. The goal is to reduce the memory footprint of my dataset without compromising its integrity.

Step 1: Import Libraries

First, I imported the necessary libraries:

import pandas as pd
import numpy as np        

Step 2: Import a Dataset

For demonstration purposes, I worked with a relatively small dataset of global COVID-19 deaths. Here's how I loaded it:

deaths = pd.read_csv('time_series_covid19_deaths_global.csv')        

Step 3: Check Initial Memory Usage

I checked the initial memory usage of the DataFrame:

print(deaths.memory_usage(deep=True).sum() / (1024 ** 2))  # ~58.34 MB
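Before changing anything, it helps to see where that memory actually lives. A per-column breakdown (or pandas' own summary via info) shows which columns are worth targeting:

# Per-column memory in MB
print(deaths.memory_usage(deep=True) / (1024 ** 2))

# Or let pandas summarize dtypes and total memory in one call
deaths.info(memory_usage='deep')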

Step 4: Downcasting Numeric Columns and Converting Strings to Categorical

To optimize memory usage, I downcasted the numeric columns and converted string columns to categorical data types:

# Reduce data types for the DataFrame

deaths['Lat'] = deaths['Lat'].astype('float32')
deaths['Long'] = deaths['Long'].astype('float32')
deaths['num_deaths'] = pd.to_numeric(deaths['num_deaths'], downcast='integer')
deaths['Country/Region'] = deaths['Country/Region'].astype('category')        
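Editing columns one by one works fine for a handful of columns. For wider tables, the same idea can be wrapped in a small helper; the downcast_dataframe function below is just one way to automate it, and its name and the 0.5 cardinality threshold are arbitrary choices rather than fixed rules:

def downcast_dataframe(df, cat_threshold=0.5):
    """Return a copy with numeric columns downcast and repetitive string columns as category."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if pd.api.types.is_integer_dtype(dtype):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(dtype):
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif dtype == object:
            # Convert only heavily repeated text columns; mostly unique text gains nothing as category
            if out[col].nunique() / len(out) < cat_threshold:
                out[col] = out[col].astype('category')
    return out

deaths_auto = downcast_dataframe(deaths)
print(deaths_auto.memory_usage(deep=True).sum() / (1024 ** 2))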

Step 5: Check Memory Usage After Downcasting

After applying downcasting, I checked the memory usage again:

print(deaths.memory_usage(deep=True).sum() / (1024 ** 2))  # ~26.79 MB

Step 6: Results Comparison

Finally, I compared the initial and optimized memory usage:

initial_memory = 58.34  # in MB

# Convert to MB
optimized_memory = deaths.memory_usage(deep=True).sum() / (1024 ** 2) 

print(f"Initial Memory Usage: {initial_memory} MB") # 58.34
print(f"Optimized Memory Usage: {optimized_memory:.2f} MB")  # 26.79

The optimized memory usage after downcasting was approximately 26.79 MB, a reduction of roughly 54% in the memory footprint. Although this dataset is modest in size, the same optimization principles are essential for managing larger datasets effectively in real-world applications.

Thank you for taking the time to read my blog. I hope you found the insights helpful for your own projects. If you have any questions or experiences to share, please feel free to leave a comment below. Happy coding!

