Accelerating Data Processing: RAPIDS & Spark

Sivan Sasidharan

Driving Business Transformation | Data | Cloud | AI/ML | Gen AI

Published: March 27, 2024

RAPIDS Accelerator for Apache Spark

The RAPIDS Accelerator for Apache Spark, developed by NVIDIA and supported on Databricks, integrates GPU acceleration into Apache Spark. It lets users run Spark workloads on NVIDIA GPUs from within the Spark framework, which can significantly improve performance.

GPU Acceleration for Spark 3.0

RAPIDS is an open-source suite of software libraries and APIs built on CUDA that provides GPU acceleration for data processing, machine learning, and graph analytics workflows. The RAPIDS Accelerator for Apache Spark builds on the GPU support introduced in Spark 3.0 and lets users run Spark code on GPUs without modifications. This integration can result in faster machine learning model training, faster Spark SQL and DataFrame operations, and substantial cost savings compared to CPU-based processing.

The RAPIDS Accelerator for Apache Spark combines the RAPIDS cuDF library with the scalability of Spark's distributed computing framework. Users can run Spark batch jobs on GPUs without code changes, which can make processing faster and cheaper than on CPUs. The accelerator also includes an accelerated shuffle implementation based on UCX, which enables GPU-to-GPU communication and remote direct memory access (RDMA) within the Spark cluster. Together, these pieces let data analytics workflows use GPU acceleration to meet the growing data processing demands of big data and AI applications.

Source code: https://github.com/NVIDIA/spark-rapids

Configuring RAPIDS

To configure RAPIDS on Databricks with Spark, follow these steps:

1. Initialization Script Creation: Create an initialization script in your Databricks workspace to install the RAPIDS jars. Use the following script, replacing <SPARK_RAPIDS_VERSION> with the release to install and <RAPIDS_JAR_URL> (a placeholder) with the download URL of that jar:

   #!/bin/bash
   sudo wget -O /databricks/jars/rapids-4-spark_2.12-<SPARK_RAPIDS_VERSION>.jar <RAPIDS_JAR_URL>

2. Cluster Creation: Go to "Compute" and click "+ Create compute". Configure the cluster by selecting the Databricks Runtime Version, the number of workers (matching the number of GPUs you want to use), the worker type, and the driver type. In the "Advanced Options" section, open the "Init Scripts" tab and paste the workspace path to the initialization script: /Users/user@domain/init.sh.

3. Spark Configuration: In the "Spark" tab, paste the necessary config options into the Spark Config section, adjusting the values to suit the workers you chose.
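
The Spark Config box takes one "key value" pair per line. As a minimal sketch of what such options might look like, the helper below renders them for a GPU cluster. The four keys are settings documented for Spark and the RAPIDS Accelerator. The helper name `rapids_spark_config`, its parameters, and the default values are illustrative assumptions, not recommended settings, and should be tuned for the worker type you pick.

```python
def rapids_spark_config(cores_per_executor, concurrent_gpu_tasks=2, pinned_pool="2G"):
    """Render Spark Config lines ("key value") for a RAPIDS-enabled cluster.

    Hypothetical helper for illustration only; all values are assumptions.
    """
    if cores_per_executor < 1:
        raise ValueError("cores_per_executor must be at least 1")

    # Give each task an equal share of the executor's single GPU.
    # Round DOWN to 4 decimals so the shares never add up to more than 1.
    gpu_share = (10000 // cores_per_executor) / 10000

    settings = {
        # Load the RAPIDS Accelerator plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.task.resource.gpu.amount": f"{gpu_share:.4f}",
        # How many tasks may use the GPU at the same time.
        "spark.rapids.sql.concurrentGpuTasks": str(concurrent_gpu_tasks),
        # Pinned host memory used to speed up host/device transfers.
        "spark.rapids.memory.pinnedPool.size": pinned_pool,
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


# Example: workers whose executors have 8 cores and one GPU each.
print(rapids_spark_config(cores_per_executor=8))
```

The GPU share is rounded down on purpose: rounding 1/6 up to 0.1667 would leave room for only five tasks per GPU instead of six.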

Testing RAPIDS

Once your cluster is running, create a new notebook or open an existing one from the Workspace directory and attach it to your running cluster.
Import the cudf library and run a simple example to confirm that the RAPIDS libraries are installed and working on Databricks.
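
A notebook cell for this check might look like the sketch below. It assumes the cudf Python package is installed on the cluster; `check_cudf` is a hypothetical helper name, and the tiny DataFrame is made-up test data. The function reports a failure instead of raising, so the same cell also runs on a cluster where RAPIDS is missing.

```python
def check_cudf():
    """Return (ok, detail) describing whether cuDF can run a tiny computation."""
    try:
        import cudf  # only importable when the RAPIDS libraries are installed
    except Exception as exc:  # missing package, or CUDA failing to initialise
        return False, f"cudf could not be imported: {exc}"

    try:
        # Build a small GPU DataFrame and reduce one column on the GPU.
        df = cudf.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})
        total = int(df["a"].sum())
    except Exception as exc:  # e.g. no visible GPU or a driver mismatch
        return False, f"cudf imported but the test computation failed: {exc}"

    return total == 6, f"sum of column 'a' = {total}"


ok, detail = check_cudf()
print("RAPIDS check passed" if ok else "RAPIDS check failed", "-", detail)
```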

Additional Considerations: Databricks may change existing runtimes without notification, so regular testing is recommended. Verify the results of window frames defined by a range over DecimalTypes with precision greater than 38. Be aware of the limitations on the number of GPUs per node and of the configuration settings that Databricks overrides.

Existing ML Libraries

The RAPIDS Accelerator for Apache Spark can be used to accelerate the ETL portions (e.g., loading training data from Parquet files) of applications that use ML libraries with Spark DataFrame APIs. Examples of such libraries include the original Apache Spark MLlib, XGBoost, RAPIDS Accelerator ML, and the DL inference UDF introduced in Spark 3.4. The latter three can also use GPUs to accelerate the core ML algorithms (in the case of the DL inference UDF, indirectly via the underlying DL framework). Combined with the RAPIDS Accelerator for Apache Spark for ETL, they can further improve the cost-benefit of GPU-accelerated Spark clusters.

Key Benefits

The benefits of using RAPIDS with Spark over traditional Spark are significant and include:

1. Accelerated Data Processing: RAPIDS leverages GPUs to accelerate data processing, enabling faster and more cost-efficient processing compared to traditional CPU-based approaches.

2. Cost Savings: Running Spark batch processing on GPUs with RAPIDS can lower both cost and power consumption, which matters most for organizations running very large data jobs.

3. Efficiency and Performance: Integrating RAPIDS with Spark speeds up data jobs. In one comparison, GPU nodes completed 100 queries in 31 minutes, compared to 176 minutes on CPU nodes.

4. Power Efficiency: GPU-accelerated processing with RAPIDS is not only faster but also more power-efficient, making it a sustainable solution for organizations looking to optimize their data analytics workflows.

5. Ease of Implementation: RAPIDS seamlessly integrates with Spark, allowing users to continue working with familiar APIs like SQL, Python, R, Java, and Scala while benefiting from GPU acceleration for enhanced performance and efficiency.
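
To put the query timings quoted in item 3 in perspective, the short calculation below turns them into a speedup factor and a share of time saved. The function names are illustrative; the only inputs are the two figures from the comparison (176 minutes on CPU nodes, 31 minutes on GPU nodes).

```python
def speedup(cpu_minutes, gpu_minutes):
    """How many times faster the GPU run finished than the CPU run."""
    return cpu_minutes / gpu_minutes


def time_saved_fraction(cpu_minutes, gpu_minutes):
    """Fraction of the CPU wall-clock time eliminated by the GPU run."""
    return (cpu_minutes - gpu_minutes) / cpu_minutes


cpu, gpu = 176, 31  # minutes for 100 queries, as quoted in the comparison
print(f"speedup: {speedup(cpu, gpu):.2f}x")                # speedup: 5.68x
print(f"time saved: {time_saved_fraction(cpu, gpu):.1%}")  # time saved: 82.4%
```

Run time alone does not settle cost. Since cost is price multiplied by time, the GPU cluster is cheaper for this workload only if its hourly price is less than about 5.7 times that of the CPU cluster.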

The RAPIDS Accelerator for Apache Spark delivers GPU performance while reducing infrastructure costs. Figure: performance and cost of ETL on the Fannie Mae Mortgage Dataset (~200 GB). Costs are based on market prices for cloud T4 GPU instances.

How Businesses Benefit

The partnership between NVIDIA and Databricks benefits businesses by optimizing data processing and AI workloads, enhancing speed, efficiency, and insights. By integrating NVIDIA accelerated computing into Databricks' platforms like Photon and Databricks SQL, businesses experience improved performance in data warehousing and analytics tasks. This collaboration empowers organizations to leverage NVIDIA GPUs seamlessly for machine learning and deep learning initiatives, ensuring quick setup and consistent environments across projects. Additionally, the support for NVIDIA Tensor Core GPUs on major cloud platforms enables high-performance computing for both single-node and distributed operations, enhancing the quality and agility of AI solutions.

#NVIDIA #Databricks #Spark
