Enterprise-ready pandas for Snowflake - Doris Lee's session BUILD 2024
Sofia Pierini
Senior Data Engineer @EY | Snowflake Data Superhero | Founder & Chapter Leader Italy User Group
At #SnowflakeBUILD2024, Doris Lee (Senior Product Manager at Snowflake) delivered an inspiring session, "Enterprise-ready Pandas for Snowflake." The session explored how Pandas often struggles with large datasets.
Although one in every two Python developers uses it, Pandas has a well-known reputation for inefficiency at scale. In this session, Doris presented Snowpark Pandas and how it resolves these issues, drawing on research into scaling dataframes and the open-source project Modin (20M+ downloads). Snowpark Pandas aims to keep Pandas' friendly and intuitive API and semantics while addressing the challenges of deploying Pandas in production.
Would you like to see the session on demand? [Click here]
What's pandas?
Pandas is one of the most popular libraries used for data science and engineering. With its intuitive API and over 600 functions, it is the go-to library for data cleaning, transformation, and analysis. However, despite its widespread adoption—one in two Python developers uses it—Pandas has earned a reputation for struggling with large datasets, leading to challenges in enterprise scenarios.
The Power and Pitfalls of Pandas
Pandas is celebrated for its flexibility, ease of use, and quick prototyping capabilities. Developers leverage its rich API—functions like df.describe(), pd.concat(), and df.groupby()—to rapidly process data. However, when datasets grow beyond memory capacity, Pandas becomes slow and inefficient.
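As a quick illustration of that API, here is a minimal sketch using made-up sample data (the column names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical sample data for illustration
sales = pd.DataFrame({
    "region": ["EMEA", "EMEA", "AMER", "AMER"],
    "amount": [100.0, 150.0, 200.0, 50.0],
})

# df.describe(): summary statistics for numeric columns
summary = sales.describe()

# df.groupby(): aggregate amounts per region
totals = sales.groupby("region")["amount"].sum()

# pd.concat(): stack two frames vertically
doubled = pd.concat([sales, sales], ignore_index=True)

print(totals.to_dict())  # {'AMER': 250.0, 'EMEA': 250.0}
print(len(doubled))      # 8
```

This concise, chainable style is exactly what makes Pandas so productive for prototyping—and what users are reluctant to give up when data outgrows a single machine.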
Key issues with Pandas at scale include single-threaded execution, the requirement that the entire dataset fit in memory, and the code rewrites needed to move workloads onto other tools.
These limitations make it challenging to scale Pandas workflows for enterprise needs, leading to inefficient pipelines and significant development overhead. Doris laid out very clearly what counts as an enterprise need and showed how Pandas fails to meet those needs.
The Enterprise Challenges/Needs
Many organizations follow a workflow that moves from local prototyping in Pandas, to rewriting the code for a scalable framework, to deploying it in production.
Each step often involves rewrites to adapt code for different tools, slowing development cycles and introducing inefficiencies. The iterative feedback loops between these phases can result in wasted effort and increased cost.
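To make that rewrite overhead concrete, here is a hedged sketch: the same aggregation written once in Pandas for prototyping, and again as SQL for production (the table name and query are hypothetical, and the SQL is shown as a string rather than executed):

```python
import pandas as pd

# Prototype phase: quick aggregation in Pandas on a small local sample
df = pd.DataFrame({"customer": ["a", "a", "b"], "spend": [10, 20, 5]})
prototype = df.groupby("customer")["spend"].sum()

# Production phase: the same logic often must be rewritten for a
# different engine, e.g. as SQL (illustrative query, not executed here)
production_sql = """
SELECT customer, SUM(spend) AS spend
FROM transactions
GROUP BY customer
"""

print(prototype.to_dict())  # {'a': 30, 'b': 5}
```

Every such translation is a chance to introduce bugs and a barrier to iterating quickly, which is the inefficiency the session highlights.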
Scaling Pandas with Modin and Snowpark
To address the challenges of scaling Pandas, Doris Lee and an incredible team of developers and experts worked for years on the open-source project Modin, an innovative solution to this problem in many respects.
Modin is designed to scale Pandas seamlessly across multiple cores and distributed systems, eliminating the traditional limitations of single-threaded operations. With over 26 million downloads and contributions from 100+ developers, Modin has become a key player in enabling efficient large-scale data processing while maintaining the familiar Pandas API.
Modin achieves scalability by leveraging distributed backends such as Ray, Dask, and even Snowpark, allowing Pandas operations to run on distributed infrastructure without requiring code rewrites. For example, analysts can process terabytes of data efficiently while continuing to use Pandas’ intuitive API, enabling a smooth transition from prototype to production.
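Modin selects its execution backend from the MODIN_ENGINE environment variable (or the modin.config module) before the first import. A minimal sketch, assuming Dask is installed; the import itself is left commented out so the snippet runs without Modin present:

```python
import os

# Modin reads MODIN_ENGINE at import time to pick its distributed
# backend: "ray", "dask", etc. ("python" falls back to one process).
os.environ["MODIN_ENGINE"] = "dask"  # assumption: Dask is installed

# import modin.pandas as pd  # operations would now dispatch to Dask

print(os.environ["MODIN_ENGINE"])
```

Because the backend is chosen by configuration rather than by code, the same Pandas-style script can run locally or on a cluster unchanged.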
By integrating Snowpark as a backend, Modin unlocks the power of Snowflake’s cloud-native processing engine. Snowpark translates Pandas operations into distributed SQL queries executed directly within Snowflake’s infrastructure, preserving data security and eliminating the need for data movement. This partnership ensures Modin and Snowpark users can scale their workloads to enterprise-level datasets effortlessly.
Why Modin?
How to install it?
$ pip install "snowflake-snowpark-python[modin]"
import modin.pandas as pd
import snowflake.snowpark.modin.plugin
That's it! Modin is a scalable "drop-in" replacement for Pandas. Drop-in means all you need to do is change a single line of code, and you can keep working with large datasets.
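To illustrate the drop-in idea, the snippet below is ordinary Pandas code on made-up data; in a Snowpark environment only the import line would change (shown in the comment), and everything after it stays exactly the same:

```python
# Drop-in replacement: only the import line differs.
import pandas as pd  # in production: import modin.pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "symbol": ["SNOW", "SNOW", "AAPL"],
    "price": [200.0, 210.0, 180.0],
})

# Standard Pandas code, identical under either import
avg = df.groupby("symbol")["price"].mean()
print(avg.to_dict())  # {'AAPL': 180.0, 'SNOW': 205.0}
```

This is what the session means by preserving Pandas' API and semantics: the analysis logic never has to be rewritten, only redirected.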
Performance at Scale: Pandas on Snowflake
Doris showed how the combination of Snowpark Pandas and Snowflake’s cloud data platform can deliver unmatched performance.
How does it work?
Snowpark Pandas translates Pandas operations into SQL queries executed by Snowflake’s processing engine. The process is transparent, preserving the “look-and-feel” of Pandas while leveraging the scalability of Snowflake.
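The real translation layer is far more sophisticated, but a toy sketch conveys the idea of pushing a dataframe operation down as SQL. The helper function and table name below are hypothetical and purely for illustration, not Snowflake's actual implementation:

```python
# Conceptual sketch only: maps one dataframe operation (groupby + sum)
# onto the SQL query that would run inside the warehouse.
def groupby_sum_to_sql(table: str, key: str, value: str) -> str:
    return (
        f"SELECT {key}, SUM({value}) AS {value} "
        f"FROM {table} GROUP BY {key}"
    )

# The Pandas call df.groupby("REGION")["AMOUNT"].sum() would,
# conceptually, become something like:
sql = groupby_sum_to_sql("TRANSACTIONS", "REGION", "AMOUNT")
print(sql)
```

Because the computation runs where the data already lives, nothing is pulled out of Snowflake and no local memory limit applies.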
This architecture ensures that users retain familiar Pandas semantics while benefiting from enterprise-grade scalability.
During the session, Doris also presented a demo of a real-world application, reading financial data from the Cybersyn Finance and Economics dataset. Even from the demo alone, you can clearly see the advantages of working with Snowpark Pandas and Modin over traditional Pandas.
Conclusion: The Future of Pandas in Enterprises
Snowpark Pandas and Modin redefine what is possible with Pandas in enterprise environments. Their combination addresses memory limitations by distributing workloads across Snowflake's scalable infrastructure, eliminating out-of-memory errors. There is zero data movement: everything stays secure within Snowflake’s environment. And because developers don't have to rewrite code, development accelerates, enabling teams to move from prototype to production faster.
If you've been hesitant to use Pandas for enterprise-scale workloads, it's time to think again. With Snowpark Pandas and Modin, you can scale your Pandas workflows seamlessly from prototype to production, without sacrificing simplicity or performance. Doris is also the best guide you could ask for on this journey. If you are curious to learn more about her work, use these links: