Memory Matters: How Downcasting Can Optimize Your Pandas and NumPy Workflows
Munazza Bhombal
Data Analyst | Expertise in Python, SQL & Statistics | Pursuing a Post-Graduate degree in ML & AI from IIIT-B
We all deal with different kinds of values in programming, like whole numbers, decimals, and text. In Python, these are simply known as int, float, and str. You can easily convert between them, but there’s a catch: Python is dynamically typed and gives you no say in how much memory a value actually uses. Whether it’s a tiny integer or an enormous one, Python handles the storage under the hood. That’s convenient, but it’s not always efficient, especially when you’re working with big datasets.
Why Does Python's Flexibility Come at a Cost?
In basic Python, you can’t specify exactly how much memory a number should use. For example, Python’s int type automatically grows to accommodate the size of the number you store, which is handy but also takes up more memory than necessary. Here’s a simple example:
num = 10
print(type(num)) # Output: <class 'int'>
No matter how small or large num is, Python doesn’t give you a way to control the exact size of the integer. That’s where the inefficiency comes in, especially if you're working with a lot of numbers and don’t need all that extra space.
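You can actually see this overhead with the standard sys module. The snippet below is a quick sketch of my own (the byte counts are what a typical 64-bit CPython reports and can vary by build): even a tiny integer occupies around 28 bytes as a Python object, while a NumPy int16 needs just 2 bytes per value inside an array.
import sys
import numpy as np

small = 10
huge = 10 ** 30

# Every Python int is a full object, and its size grows with the value.
print(sys.getsizeof(small))          # ~28 bytes on 64-bit CPython
print(sys.getsizeof(huge))           # noticeably larger: the int grew to fit

# For comparison, one int16 element in a NumPy array:
print(np.dtype(np.int16).itemsize)   # 2 bytes per value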
Enter Downcasting in Pandas & NumPy: Control Over Data Types
So, let’s talk about downcasting, a very handy technique that helps you save memory when working with large datasets in Pandas and NumPy. Think of it as putting your data on a diet without sacrificing the important stuff!
Why Should You Downcast?
Why downcast, you ask? Well, it’s all about using the least amount of memory possible. If your integers are like tiny baby elephants (cute, right?), you don’t need to give them the entire elephant-sized enclosure of int64, which spends 8 bytes on every value. Instead, you can downcast them to int16, the compact, lightweight version that needs only 2 bytes per value. This way, you keep things efficient, which is super helpful when you’re dealing with mountains of data!
How to Downcast in Pandas
import pandas as pd
df = pd.DataFrame({'numbers': [1000, 2000, 3000]})
df['numbers'] = pd.to_numeric(df['numbers'], downcast='integer')
print(df['numbers'].dtype) # Output: int16
In this case, the integers are now stored as int16, reducing memory usage.
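If you’d like to see that saving in actual numbers rather than take my word for it, compare memory_usage before and after the downcast. Here is a quick sketch with a made-up column of a million values (on this data the downcast lands on int32, since the values are too big for int16):
import pandas as pd

df = pd.DataFrame({'numbers': range(1_000_000)})           # default dtype: int64 on most platforms
before = df['numbers'].memory_usage(deep=True)
df['numbers'] = pd.to_numeric(df['numbers'], downcast='integer')
after = df['numbers'].memory_usage(deep=True)

print(df['numbers'].dtype)                                  # int32
print(f"{before:,} bytes -> {after:,} bytes")               # roughly halved
Going from 8 bytes to 4 bytes per value cuts the column’s footprint roughly in half; dropping all the way to int16 would quarter it.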
df['numbers'] = df['numbers'].astype('int16')
This explicitly sets the data type to int16. One caveat: unlike pd.to_numeric with downcast, astype does not check whether your values actually fit in the target type, so out-of-range values can overflow silently instead of raising an error (the exact behavior depends on your pandas/NumPy version).
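Here is a minimal guard you could put in front of such an explicit cast; the bounds check with np.iinfo is my own addition, not part of the original example:
import numpy as np
import pandas as pd

s = pd.Series([1000, 2000, 40000])       # 40000 does not fit in int16
bounds = np.iinfo(np.int16)              # valid int16 range: -32768 to 32767
if s.min() >= bounds.min and s.max() <= bounds.max:
    s = s.astype('int16')
else:
    print('Values fall outside the int16 range; keeping a wider dtype')
This way you only narrow the type when it is actually safe to do so.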
df['genders'] = pd.Series(['male', 'female', 'female'], dtype='category')
print(df['genders'].dtype) # Output: category
This conversion can save a lot of memory when a column has many repeated entries, because pandas stores each unique value only once and keeps compact integer codes in its place.
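To get a feel for how big that saving can be, here is a small sketch of my own comparing the same repetitive column stored as plain strings versus as a categorical:
import pandas as pd

genders = pd.Series(['male', 'female', 'female'] * 100_000)   # lots of repeats
as_object = genders.memory_usage(deep=True)
as_category = genders.astype('category').memory_usage(deep=True)

print(f'object:   {as_object:,} bytes')
print(f'category: {as_category:,} bytes')
The more repetitive the column, the bigger the win; for a column of mostly unique strings, the category dtype can actually cost more.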
How to Downcast in NumPy
Downcasting in NumPy is also straightforward:
import numpy as np
# Create an array with 32-bit integers
array_int32 = np.array([1, 2, 3, 4, 5], dtype=np.int32)
print(array_int32.dtype) # Output: int32
# Create an array with 64-bit integers
array_int64 = np.array([1, 2, 3, 4, 5], dtype=np.int64)
# Downcast to 32-bit integers
array_int32_downcasted = array_int64.astype(np.int32)
print(array_int32_downcasted.dtype) # Output: int32
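NumPy also makes it easy to confirm what you saved, because every array reports its total size through the nbytes attribute. A quick sketch (the array contents are arbitrary):
import numpy as np

array_int64 = np.arange(1_000_000, dtype=np.int64)
array_int32 = array_int64.astype(np.int32)     # safe here: every value fits in int32

print(array_int64.nbytes)                      # 8000000 -> 8 bytes per element
print(array_int32.nbytes)                      # 4000000 -> 4 bytes per element
One caution: astype will happily cast values that do not fit the smaller type and let them wrap around, so check your minimum and maximum first (np.iinfo gives the valid range, as in the earlier pandas sketch).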
Memory Optimization Using Downcasting
In this section, I will demonstrate how to optimize memory usage in Python by applying the downcasting techniques above in Pandas and NumPy. The goal is to reduce the dataset's memory footprint without compromising its integrity.
Step 1: Import Libraries
First, I imported the necessary libraries:
import pandas as pd
import numpy as np
Step 2: Import a Dataset
For this demonstration I worked with the global COVID-19 deaths time series, a reasonably small CSV. Here’s how I loaded it:
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
Step 3: Check Initial Memory Usage
I checked the initial memory usage of the DataFrame:
# memory_usage returns bytes, so convert to MB
print(deaths.memory_usage(deep=True).sum() / (1024 ** 2))  # ~58.34 MB
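Before deciding what to convert, I find it useful to break the usage down per column; this step is a small addition of my own rather than part of the original walkthrough, but it quickly shows which columns (usually the string-typed object ones) dominate:
# Per-column memory usage in MB, largest first
per_column_mb = deaths.memory_usage(deep=True) / (1024 ** 2)
print(per_column_mb.sort_values(ascending=False))
print(deaths.dtypes)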
Step 4: Downcasting Numeric Columns and Converting Strings to Categorical
To optimize memory usage, I downcasted the numeric columns and converted string columns to categorical data types:
# Reduce data types for the DataFrame
deaths['Lat'] = deaths['Lat'].astype('float32')
deaths['Long'] = deaths['Long'].astype('float32')
deaths['num_deaths'] = pd.to_numeric(deaths['num_deaths'], downcast='integer')
deaths['Country/Region'] = deaths['Country/Region'].astype('category')
Step 5: Check Memory Usage After Downcasting
After applying downcasting, I checked the memory usage again:
print(deaths.memory_usage(deep=True).sum() / (1024 ** 2))  # ~26.79 MB
Step 6: Results Comparison
Finally, I compared the initial and optimized memory usage:
initial_memory = 58.34 # in MB
# Convert to MB
optimized_memory = deaths.memory_usage(deep=True).sum() / (1024 ** 2)
print(f"Initial Memory Usage: {initial_memory} MB") # 58.34
print(f"Optimized Memory Usage: {optimized_memory:.2f} MB") #26.79
The optimized memory usage after downcasting was approximately 26.79 MB, a reduction of roughly 54% from the original 58.34 MB. Although this dataset is small, the same memory optimization principles are essential for managing larger datasets effectively in real-world applications.
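If you end up repeating these conversions across projects, the whole recipe can be wrapped in a small helper. The function below is a sketch of my own, not something shipped with pandas: it downcasts integer and float columns and converts sufficiently repetitive string columns to category.
import pandas as pd

def shrink_dataframe(df, cat_threshold=0.5):
    """Return a memory-optimized copy of df: numeric columns are downcast
    and string columns with few unique values become categoricals."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif pd.api.types.is_object_dtype(out[col]):
            # Only convert repetitive columns; mostly-unique strings gain nothing
            if out[col].nunique(dropna=True) <= cat_threshold * len(out):
                out[col] = out[col].astype('category')
    return out

# Hypothetical usage on the DataFrame from the walkthrough:
# deaths_small = shrink_dataframe(deaths)
# print(deaths_small.memory_usage(deep=True).sum() / (1024 ** 2))
Keep in mind that a float downcast to float32 stores roughly seven significant digits, which is plenty for coordinates like Lat and Long but worth double-checking for high-precision scientific data.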
Thank you for taking the time to read my blog. I hope you found the insights helpful for your own projects. If you have any questions or experiences to share, please feel free to leave a comment below. Happy coding!