Enterprise-ready pandas for Snowflake - Doris Lee's session BUILD 2024

At #SnowflakeBUILD2024, Doris Lee (Senior Product Manager at Snowflake) delivered an inspiring session, "Enterprise-ready Pandas for Snowflake," exploring why Pandas often struggles with large datasets.

Despite being used by one in every two Python developers, Pandas has a well-known reputation for inefficiency at scale. In this session, Doris presented Snowpark Pandas and how it resolves these issues, covering research on scaling dataframes and the open-source project Modin (20M+ downloads). Snowpark Pandas aims to keep Pandas' friendly, intuitive API and semantics while addressing the challenges of deploying Pandas in production.

Would you like to see the session on demand? [Click here]


What's pandas?

Pandas is one of the most popular libraries used for data science and engineering. With its intuitive API and over 600 functions, it is the go-to library for data cleaning, transformation, and analysis. However, despite its widespread adoption—one in two Python developers uses it—Pandas has earned a reputation for struggling with large datasets, leading to challenges in enterprise scenarios.


The Power and Pitfalls of Pandas

Pandas is celebrated for its flexibility, ease of use, and quick prototyping capabilities. Developers leverage its rich API (functions like df.describe(), pd.concat(), and df.groupby()) to process data rapidly. However, when datasets grow beyond memory capacity, Pandas becomes inefficient.
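To make the contrast concrete, here is a minimal prototyping sketch in plain Pandas. The file name and column names are illustrative assumptions, not examples from the session:

import pandas as pd

# Load, inspect, and aggregate in a few lines: classic Pandas prototyping.
df = pd.read_csv("sales.csv")                    # hypothetical input file
print(df.describe())                             # summary statistics per numeric column
monthly = df.groupby("month")["revenue"].sum()   # total revenue per month
combined = pd.concat([df, df])                   # stack two frames row-wise

Everything above runs in memory on a single core, which is exactly the property that breaks down once a dataset no longer fits in RAM.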

Key issues with Pandas at scale include:

  • Memory Constraints: Pandas operates entirely in memory, which leads to out-of-memory (OOM) errors when working with large datasets.
  • Single-threaded Operations: Pandas executes operations on a single core, wasting the compute potential of modern multi-core systems.
  • Limited Distributed Processing: For enterprise-scale datasets, the lack of distributed computing support requires rewriting code for big data frameworks.


These limitations make it challenging to scale Pandas workflows for enterprise needs, leading to inefficient pipelines and significant development overhead. Doris laid out very clearly what qualifies as an enterprise need and showed how Pandas fails to meet those needs.


Enterprise Challenges and Needs

Many organizations follow this workflow:

  1. Prototyping: Analysts and data engineers develop Pandas scripts on their local machines.
  2. Testing: These scripts are rewritten to work with big data tools like Spark for scalability.
  3. Production: The rewritten code is deployed to large clusters for production workloads.

Each step often involves rewrites to adapt code for different tools, slowing development cycles and introducing inefficiencies. The iterative feedback loops between these phases can result in wasted effort and increased cost.


Scaling Pandas with Modin and Snowpark

To address the challenges of scaling Pandas, Doris Lee and a talented team of developers and experts have worked for years on the open-source project Modin, a drop-in, distributed implementation of the Pandas API.

Modin is designed to scale Pandas seamlessly across multiple cores and distributed systems, eliminating the traditional limitations of single-threaded operations. With over 26 million downloads and contributions from 100+ developers, Modin has become a key player in enabling efficient large-scale data processing while maintaining the familiar Pandas API.

Modin achieves scalability by leveraging distributed backends such as Ray, Dask, and even Snowpark, allowing Pandas operations to run on distributed infrastructure without requiring code rewrites. For example, analysts can process terabytes of data efficiently while continuing to use Pandas’ intuitive API, enabling a smooth transition from prototype to production.
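As a minimal illustration (the engine choice, file name, and column name below are assumptions, not code from the session), switching Modin onto a local Ray backend looks roughly like this:

import modin.config as cfg
cfg.Engine.put("ray")                   # or "dask"; requires `pip install "modin[ray]"`

import modin.pandas as pd               # same API as plain pandas

df = pd.read_csv("large_file.csv")      # the read itself is parallelized
result = df.groupby("key").mean()       # runs across all available cores, not just one

The only change from a plain Pandas script is the import (and, optionally, the explicit engine selection); the rest of the code stays untouched.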

By integrating Snowpark as a backend, Modin unlocks the power of Snowflake’s cloud-native processing engine. Snowpark translates Pandas operations into distributed SQL queries executed directly within Snowflake’s infrastructure, preserving data security and eliminating the need for data movement. This partnership ensures Modin and Snowpark users can scale their workloads to enterprise-level datasets effortlessly.

Why Modin?

  • Scalability: Modin distributes Pandas operations across multiple cores and backends, reducing processing times and maximizing resource utilization.
  • Seamless Integration: Modin integrates with Snowpark, leveraging Snowflake’s secure and scalable infrastructure to handle massive workloads.
  • Familiar API: Users can continue working with the Pandas API, ensuring minimal disruption to existing workflows.

How to install it?

$ pip install "snowflake-snowpark-python[modin]"

import modin.pandas as pd                    # drop-in replacement for `import pandas as pd`
import snowflake.snowpark.modin.plugin       # registers Snowflake as the Modin backend

That's it! Modin is a scalable "drop-in" replacement for Pandas: all you need to do is change a single import line, and you can keep working with large datasets using the same code, as the sketch below shows.
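For completeness, here is a hedged end-to-end sketch. The connection parameters and the table name are placeholders, not values from the session:

from snowflake.snowpark import Session
import modin.pandas as pd
import snowflake.snowpark.modin.plugin       # import after modin.pandas

# Placeholder credentials: replace with your own account details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Read a Snowflake table as a DataFrame; computation stays inside Snowflake.
df = pd.read_snowflake("MY_TABLE")           # hypothetical table name
print(df.head())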


Performance at Scale: Pandas on Snowflake

Doris showed how the combination of Snowpark Pandas and Snowflake's cloud data platform can deliver unmatched performance:

  • 30x Faster Performance: Benchmarks demonstrate up to 30x faster execution times compared to standard Pandas, especially on datasets exceeding 10GB.
  • Effortless Scaling to Terabytes: Snowpark Pandas processes terabytes of data, allowing enterprises to handle even the largest workloads seamlessly.

How does it work?

Snowpark Pandas translates Pandas operations into SQL using Snowflake’s processing engine. The process is transparent, preserving the “look-and-feel” of Pandas while leveraging the scalability of Snowflake:

  • Query Translator: Converts Pandas operations into incremental SQL queries.
  • Distributed Processing: Executes the queries in Snowflake’s cloud-native, distributed infrastructure.

This architecture ensures that users retain familiar Pandas semantics while benefiting from enterprise-grade scalability.
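To make the translation concrete, here is a conceptual sketch. The table and column names are hypothetical, and the SQL in the comments is only a rough approximation of what the query translator might generate:

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

df = pd.read_snowflake("ORDERS")                  # hypothetical table
top_customers = (
    df[df["STATUS"] == "SHIPPED"]                 # roughly: WHERE STATUS = 'SHIPPED'
    .groupby("CUSTOMER_ID")["AMOUNT"].sum()       # roughly: GROUP BY CUSTOMER_ID, SUM(AMOUNT)
    .sort_values(ascending=False)                 # roughly: ORDER BY ... DESC
    .head(10)                                     # roughly: LIMIT 10
)

The whole chain executes inside Snowflake's engine, so no raw data is pulled down to the client.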

During the session, Doris also presented a demo of a real-world application, reading financial data from the Cybersyn Finance and Economics dataset. Even from the demo alone, you can clearly see the advantages of working with Snowpark Pandas and Modin over traditional Pandas.
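A hedged sketch of that kind of workflow might look like the following; the fully qualified table name and the column names are assumptions about the Cybersyn share, not code from the demo:

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Read a (hypothetical) Cybersyn stock-price table directly from Snowflake.
prices = pd.read_snowflake("FINANCE__ECONOMICS.CYBERSYN.STOCK_PRICE_TIMESERIES")
print(prices.groupby("TICKER")["VALUE"].mean().head())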


Conclusion: The Future of Pandas in Enterprises

Snowpark Pandas and Modin redefine what is possible with Pandas in enterprise environments. Together they address memory limitations by distributing workloads across Snowflake's scalable infrastructure, eliminating out-of-memory errors. There is zero data movement: everything stays secure within Snowflake's environment. And because developers don't have to rewrite code, development accelerates, letting teams move from prototype to production faster.

If you've been hesitant to use Pandas for enterprise-scale workloads, it's time to think again. With Snowpark Pandas and Modin, you can scale your Pandas workflows seamlessly from prototype to production, without sacrificing simplicity or performance. In addition, Doris is the best guide you could ask for on this journey. If you are curious to learn more about her work, use these links:

