Boost Your Pandas with GPUs

(NVIDIA's RAPIDS cuDF)

Data scientists and analysts widely use Pandas for data manipulation and analysis in Python. However, when working with large datasets, Pandas can become a bottleneck due to its single-threaded processing. NVIDIA's RAPIDS cuDF offers a solution by providing GPU-accelerated DataFrame operations that can speed up your workflows dramatically—often without any code changes.

What is cuDF?

cuDF is a GPU DataFrame library that mirrors the Pandas API, allowing for seamless integration into existing workflows. It's part of NVIDIA's RAPIDS suite of libraries designed to accelerate data science and analytics pipelines on GPUs.

Key Benefits

1. Significant Speed Improvements

By leveraging the parallel processing power of GPUs, cuDF can perform DataFrame operations up to 150 times faster than Pandas. The acceleration is most noticeable on large datasets that strain CPU resources.

2. Minimal Code Changes

One of the standout features of cuDF is its API compatibility with Pandas. In many cases, you can switch from Pandas to cuDF by simply changing your import statement:

# From 
import pandas as pd 

# To 
import cudf as pd         
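Recent RAPIDS releases (23.10 and later, per the RAPIDS documentation) also ship a cudf.pandas accelerator mode that lets you run an unmodified Pandas script on the GPU, falling back to the CPU for any operation cuDF does not support:

```shell
# Run an unmodified Pandas script under the cudf.pandas accelerator;
# unsupported operations transparently fall back to the CPU
python -m cudf.pandas my_script.py
```

This is often the easiest option for existing codebases, since no import changes are needed at all.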

3. Handle Larger Datasets

GPUs have high memory bandwidth and can process large datasets far more efficiently than CPUs. cuDF enables you to work with data sizes that would be impractical with Pandas alone.

Getting Started with cuDF

Installation

To start using cuDF, you'll need an NVIDIA GPU with the appropriate drivers. You can install cuDF via Conda:

conda install -c rapidsai -c nvidia -c conda-forge cudf         
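Pip installation is also available; the package name is versioned by CUDA toolkit (the example below assumes CUDA 12—check the RAPIDS install selector for the command matching your setup):

```shell
# Install cuDF for CUDA 12 from NVIDIA's package index
pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
```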

You have a large dataset containing 10 million entries of hypothetical sales data, and you want to find the five stores with the highest total sales among transactions over 500. Each entry includes:

  • date: The date of the sale.
  • store_id: Identifier for the store.
  • product_id: Identifier for the product.
  • sales: The amount sold.

Solution Using Pandas

First, we'll solve the problem using Pandas.

Step 1: Generate Synthetic Data

We'll create a large DataFrame to simulate the dataset.

import pandas as pd
import numpy as np
import time

# Number of entries
N = 10_000_000

# Generate random data
np.random.seed(0)
dates = pd.date_range('2020-01-01', periods=N//10000)
date_choices = np.random.choice(dates, N)
store_ids = np.random.randint(1, 1001, size=N)
product_ids = np.random.randint(1, 5001, size=N)
sales = np.random.uniform(1, 1000, size=N)

# Create the DataFrame
df = pd.DataFrame({
    'date': date_choices,
    'store_id': store_ids,
    'product_id': product_ids,
    'sales': sales
})
        
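Before reaching for a GPU, it can be worth checking how much memory the frame actually occupies; a minimal sketch using `memory_usage` (shown on a smaller sample than the article's 10 million rows):

```python
import numpy as np
import pandas as pd

# Small sample for illustration; the article uses N = 10_000_000
N = 1_000
np.random.seed(0)
df = pd.DataFrame({
    'store_id': np.random.randint(1, 1001, size=N),
    'sales': np.random.uniform(1, 1000, size=N),
})

# Total in-memory size of the DataFrame in bytes,
# including object-dtype contents (deep=True)
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes / 1e6:.2f} MB")
```

Scaling this figure up to the full dataset gives a rough idea of whether it will fit in GPU memory.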

Step 2: Data Processing with Pandas

Perform the required computations and measure the time taken.

start_time = time.time()

# Filter products with sales > 500
filtered_df = df[df['sales'] > 500]

# Group by 'store_id' and sum 'sales'
grouped_df = filtered_df.groupby('store_id')['sales'].sum().reset_index()

# Sort and get top 5 stores
top_stores = grouped_df.sort_values('sales', ascending=False).head(5)

end_time = time.time()

print("Pandas execution time: {:.2f} seconds".format(end_time - start_time))
print(top_stores)
        

Note that the exact timing will vary based on your hardware configuration.
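The same filter-group-sort pipeline can also be written as a single method chain, which avoids the intermediate variables and uses `nlargest` in place of a full sort; a sketch on a small sample:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
N = 1_000
df = pd.DataFrame({
    'store_id': np.random.randint(1, 11, size=N),
    'sales': np.random.uniform(1, 1000, size=N),
})

# Filter, aggregate, and take the top 5 stores in one chain
top_stores = (
    df.loc[df['sales'] > 500]
      .groupby('store_id')['sales'].sum()
      .nlargest(5)
      .reset_index()
)
print(top_stores)
```

`nlargest(5)` only needs the top 5 values, so it can be cheaper than sorting the whole aggregate.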

Solution Using cuDF

Now, let's solve the same problem using cuDF.

Step 1: Install and Import cuDF

First, ensure that cuDF is installed and import it.

import cudf
import cupy as cp
import numpy as np
import pandas as pd
import time
        

Step 2: Generate Synthetic Data on the GPU

We'll use CuPy to generate the numeric columns directly on the GPU, avoiding most host-to-device transfer. The dates are generated on the CPU with NumPy, since CuPy does not support datetime64 arrays.

# Dates are generated on the CPU (CuPy has no datetime64 support)
np.random.seed(0)
dates = pd.date_range('2020-01-01', periods=N//10000)
date_choices = np.random.choice(dates, N)

# Numeric columns are generated directly on the GPU
cp.random.seed(0)
store_ids = cp.random.randint(1, 1001, size=N)
product_ids = cp.random.randint(1, 5001, size=N)
sales = cp.random.uniform(1, 1000, size=N)

# Create the cuDF DataFrame
gdf = cudf.DataFrame({
    'date': date_choices,
    'store_id': store_ids,
    'product_id': product_ids,
    'sales': sales
})
        

Step 3: Data Processing with cuDF

Perform the computations and measure the time taken.

start_time = time.time()

# Filter products with sales > 500
filtered_gdf = gdf[gdf['sales'] > 500]

# Group by 'store_id' and sum 'sales'
grouped_gdf = filtered_gdf.groupby('store_id')['sales'].sum().reset_index()

# Sort and get top 5 stores
top_stores = grouped_gdf.sort_values('sales', ascending=False).head(5)

end_time = time.time()

print("cuDF execution time: {:.2f} seconds".format(end_time - start_time))
print(top_stores)
        

Comparing Performance

  • Pandas execution time: ~20 seconds
  • cuDF execution time: ~0.5 seconds

That's roughly a 40x speedup with cuDF on this workload.
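When making comparisons like this, `time.perf_counter` is better suited to interval timing than `time.time`; a small reusable helper (a sketch, not part of either library):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage with any callable, e.g. a Pandas or cuDF pipeline step
result, secs = timed(sum, range(1_000_000))
print(f"{secs:.4f} s")
```

For GPU code, remember that many cuDF operations are asynchronous; timing a pipeline end to end (including producing the final result) gives the fairest comparison.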


Organizations dealing with big data can see immediate benefits:

  • Faster Data Processing: Speed up ETL processes and data preparation tasks.
  • Efficient Resource Utilization: Offload computations to GPUs, freeing up CPU resources.
  • Scalable Analytics: Handle growing data volumes without a proportional increase in processing time.

