Boost Your Pandas with GPUs
Awais Aslam
Practice Manager, Data Engineering & Analytics at AlphaBOLD | Data Engineering | Big Data Analytics | Cloud Data Architect
(NVIDIA's RAPIDS cuDF)
Data scientists and analysts widely use Pandas for data manipulation and analysis in Python. When datasets grow large, however, Pandas can become a bottleneck because its operations run on a single CPU thread. NVIDIA's RAPIDS cuDF addresses this by providing GPU-accelerated DataFrame operations that can speed up your workflows dramatically, often with no code changes at all.
What is cuDF?
cuDF is a GPU DataFrame library that mirrors the Pandas API, allowing seamless integration into existing workflows. It is part of NVIDIA's RAPIDS suite of libraries, which are designed to accelerate data science and analytics pipelines using GPUs.
Key Benefits
1. Significant Speed Improvements
By leveraging the parallel processing power of GPUs, cuDF can perform DataFrame operations up to 150 times faster than Pandas, per NVIDIA's benchmarks. The acceleration is most noticeable on large datasets that strain CPU resources.
2. Minimal Code Changes
One of the standout features of cuDF is its API compatibility with Pandas. In many cases, you can switch from Pandas to cuDF by simply changing your import statement:
# From
import pandas as pd
# To
import cudf as pd
3. Handle Larger Datasets
GPUs offer far higher memory bandwidth than CPUs, so cuDF can process large datasets much more efficiently, within the limits of available GPU memory. This lets you work at data sizes that would be impractical with Pandas alone.
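To put dataset sizes in perspective, here is a rough footprint check using plain Pandas (a sketch with a 1M-row sample of the same schema as the example later in this article; the 10M-row frame there is roughly ten times this):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1_000_000  # 1M rows here; the worked example below uses 10M

df = pd.DataFrame({
    "date": np.repeat(pd.date_range("2020-01-01", periods=1000).values, N // 1000),
    "store_id": rng.integers(1, 1001, size=N),
    "sales": rng.uniform(1, 1000, size=N),
})

# Three 8-byte columns -> roughly 24 bytes per row
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.0f} MB for {N:,} rows")
```

Intermediate copies created by filtering and grouping multiply this footprint, which is where GPU memory bandwidth starts to pay off.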
Getting Started with cuDF
Installation
To start using cuDF, you'll need an NVIDIA GPU and the appropriate drivers. You can install cuDF via Conda:
conda install -c rapidsai -c nvidia -c conda-forge cudf
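Not every machine has a compatible GPU. A small import-fallback sketch (a convenience pattern, not an official cuDF feature) lets the same script run on either backend:

```python
# Graceful fallback: use cuDF when a GPU build is importable, otherwise Pandas.
try:
    import cudf as xdf  # GPU backend
    backend = "cudf"
except ImportError:
    import pandas as xdf  # CPU fallback
    backend = "pandas"

# Downstream code is identical thanks to the shared API
df = xdf.DataFrame({"a": [1, 2, 3]})
print(f"backend={backend}, sum={int(df['a'].sum())}")
```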
Example Problem
Suppose you have a large dataset containing 10 million entries of hypothetical sales data. Each entry includes a transaction date, a store ID, a product ID, and a sales amount. The task: keep only transactions with sales above 500, total the sales per store, and report the top 5 stores.
Solution Using Pandas
First, we'll solve the problem using Pandas.
Step 1: Generate Synthetic Data
We'll create a large DataFrame to simulate the dataset.
import pandas as pd
import numpy as np
import time
# Number of entries
N = 10_000_000
# Generate random data
np.random.seed(0)
dates = pd.date_range('2020-01-01', periods=N//10000)
date_choices = np.random.choice(dates, N)
store_ids = np.random.randint(1, 1001, size=N)
product_ids = np.random.randint(1, 5001, size=N)
sales = np.random.uniform(1, 1000, size=N)
# Create the DataFrame
df = pd.DataFrame({
'date': date_choices,
'store_id': store_ids,
'product_id': product_ids,
'sales': sales
})
Step 2: Data Processing with Pandas
Perform the required computations and measure the time taken.
start_time = time.time()
# Filter products with sales > 500
filtered_df = df[df['sales'] > 500]
# Group by 'store_id' and sum 'sales'
grouped_df = filtered_df.groupby('store_id')['sales'].sum().reset_index()
# Sort and get top 5 stores
top_stores = grouped_df.sort_values('sales', ascending=False).head(5)
end_time = time.time()
print("Pandas execution time: {:.2f} seconds".format(end_time - start_time))
print(top_stores)
Note that the measured time will vary based on your hardware configuration.
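As a side note, the same Pandas pipeline can be written as a single method chain, and `nlargest(5, 'sales')` replaces the full sort-then-head with a partial selection (shown here on a small illustrative frame rather than the 10M-row dataset):

```python
import pandas as pd

# Small illustrative frame standing in for the 10M-row dataset
df = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "sales":    [600.0, 700.0, 550.0, 100.0, 900.0],
})

top_stores = (
    df.loc[df["sales"] > 500]                           # filter sales > 500
      .groupby("store_id", as_index=False)["sales"].sum()  # total per store
      .nlargest(5, "sales")                             # top 5 without a full sort
)
print(top_stores)
```

The same chain runs unchanged on a cuDF DataFrame, since cuDF implements `loc`, `groupby`, and `nlargest` as well.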
Solution Using cuDF
Now, let's solve the same problem using cuDF.
Step 1: Install and Import cuDF
First, ensure that cuDF is installed and import it.
import cudf
import cupy as cp
import pandas as pd  # used only to build the date range
import time
Step 2: Generate Synthetic Data on the GPU
We'll use CuPy to generate the numeric columns directly on the GPU, avoiding host-to-device transfer overhead. CuPy has no datetime dtype, so the pool of dates is built with Pandas and the random date indices are drawn on the GPU.
# Number of entries, as before
N = 10_000_000
# Generate random data using CuPy
cp.random.seed(0)
dates = pd.date_range('2020-01-01', periods=N//10000)
# Pick random date indices on the GPU, then materialize the dates on the host
date_idx = cp.random.randint(0, len(dates), size=N)
date_choices = dates.values[cp.asnumpy(date_idx)]
store_ids = cp.random.randint(1, 1001, size=N)
product_ids = cp.random.randint(1, 5001, size=N)
sales = cp.random.uniform(1, 1000, size=N)
# Create the cuDF DataFrame
gdf = cudf.DataFrame({
'date': date_choices,
'store_id': store_ids,
'product_id': product_ids,
'sales': sales
})
Step 3: Data Processing with cuDF
Perform the computations and measure the time taken.
start_time = time.time()
# Filter products with sales > 500
filtered_gdf = gdf[gdf['sales'] > 500]
# Group by 'store_id' and sum 'sales'
grouped_gdf = filtered_gdf.groupby('store_id')['sales'].sum().reset_index()
# Sort and get top 5 stores
top_stores = grouped_gdf.sort_values('sales', ascending=False).head(5)
end_time = time.time()
print("cuDF execution time: {:.2f} seconds".format(end_time - start_time))
print(top_stores)
Comparing Performance
In this test, the cuDF version finished roughly 40x faster than the Pandas version; the exact speedup depends on your GPU, driver stack, and data.
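Timings like these are easier to compare with a small helper (a sketch; `time.perf_counter` is preferable to `time.time` for benchmarking):

```python
import time

def bench(label, fn, *args, **kwargs):
    """Run fn, print the elapsed wall time, and return its result."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.3f} s")
    return result

# Usage with either backend, e.g.:
# top = bench("pandas", lambda: df[df["sales"] > 500]
#                                 .groupby("store_id")["sales"].sum())
total = bench("demo", sum, range(1_000_000))
```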
Organizations dealing with big data can see immediate benefits from this kind of drop-in acceleration: faster ETL, quicker iteration during exploratory analysis, and shorter end-to-end pipeline runtimes.