A Python Data Engineer’s Journey with Snowflake: From ingestion, transformation to operationalization - Doris Lee & Manuela Wei's session BUILD 2024

At #SnowflakeBUILD2024, Doris Lee (Senior Product Manager at @Snowflake) and Manuela Wei (Senior Product Manager at @Snowflake) delivered an inspiring session, "A Python Data Engineer’s Journey with Snowflake: From ingestion, transformation to operationalization". The session explored how Snowflake addresses the daily challenges faced by data engineers and streamlines their workflows using Snowpark, Serverless Tasks, and the Python API.

Data engineers are often confronted with:

  • Fragmented tools: Multiple platforms for development, orchestration, and deployment.
  • Scalability issues: Traditional tools like Pandas struggle with large datasets.
  • Complex orchestration: Tuning and managing infrastructure for pipelines is time-consuming.
  • DevOps bottlenecks: Inefficient CI/CD processes and limited observability hinder productivity.

Snowflake’s platform solves these challenges with a unified approach built around four core pillars: Authoring, Ingestion & Transformation, Pipeline Orchestration, and DevOps.

Would you like to see the session on demand? [Click here]


Snowpark: Recap

Before diving into the four pillars, Snowpark deserves a special mention as it is the foundation of Snowflake’s Python capabilities.

Snowpark provides a unified framework for running Python (alongside Java and Scala) directly on Snowflake. Whether you’re building pipelines, machine learning models, or applications, Snowpark simplifies the process with:

  • Seamless Integration: Write code in your favorite tools—Notebooks, VS Code, or Streamlit—and run it on Snowflake’s elastic compute engine.
  • Scalable Execution: Processing happens inside Snowflake, eliminating data movement and infrastructure management.
  • Security & Governance: Snowpark inherits all of Snowflake’s robust governance policies, ensuring secure execution without trade-offs.

With Snowpark, Python developers can focus on writing code while Snowflake handles scalability, security, and performance.
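
To make this concrete, here is a minimal sketch of connecting with Snowpark and pushing a transformation down to Snowflake. The connection parameters and table name are placeholders, not values from the session:

# Minimal Snowpark sketch: connect and push a transformation to Snowflake.
# Connection parameters and table name are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

session = Session.builder.configs(connection_parameters).create()

# The filter and aggregation below run on Snowflake's compute engine;
# no data is pulled to the client until show()/collect() is called.
orders = session.table("ORDERS")
result = (
    orders.filter(col("ORDER_STATUS") == "OPEN")
          .group_by("CUSTOMER_ID")
          .count()
)
result.show()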


Authoring: Unified Development with Notebooks

The session kicked off by tackling a common challenge for data engineers—fragmented development tools. Doris introduced Snowflake Notebooks, which streamline the process of connecting to data and building workflows.

How Snowflake Notebooks Simplify Development:

  • Connect seamlessly to Snowflake and other data sources.
  • Combine Python, SQL, and Markdown in one place for unified development.
  • Integrate with Git for version control and collaboration.
  • Automate workflows by scheduling notebooks to run regularly.

Notebooks can be created directly in Snowsight or integrated into existing projects. Snowflake Notebooks unify development, collaboration, and orchestration into a single interface—eliminating tool fragmentation and making data workflows more efficient.
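
As a small illustration of this unified model, a Python cell in a Snowflake Notebook can pick up the result of an earlier SQL cell by referencing the cell’s name. A hedged sketch (the cell and table names are hypothetical):

# Python cell: reuse the output of a previous SQL cell named "cell1".
# Assume cell1 ran: SELECT * FROM RAW_EVENTS LIMIT 100
snowpark_df = cell1.to_df()     # lazy Snowpark DataFrame, executed in Snowflake
pandas_df = cell1.to_pandas()   # or materialize the result locally as pandas

snowpark_df.group_by("EVENT_TYPE").count().show()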


Ingestion & Transformation

The second core pillar focused on data ingestion and scalable transformations—two critical processes for any data pipeline.

External Access and Private Connectivity

Snowflake makes it seamless to connect to external resources via a private link. Once your external sources (e.g., Azure SQL) are connected to Snowflake, you can securely ingest the data and leverage Snowflake’s compute power for scalable transformations.
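
Once connectivity is in place, one common pattern is to land an extracted batch in a Snowflake staging table with Snowpark’s write_pandas. A minimal sketch, assuming hypothetical names and that the extraction from the external source happens upstream:

# Sketch: load a batch extracted from an external source (e.g., Azure SQL)
# into a Snowflake staging table. Data and names are illustrative.
import pandas as pd
from snowflake.snowpark import Session

session = Session.builder.configs(connection_parameters).create()  # as in the earlier sketch

batch = pd.DataFrame({"ORDER_ID": [1, 2], "AMOUNT": [10.5, 20.0]})

# Bulk-load the DataFrame into a Snowflake table, creating it if needed.
session.write_pandas(batch, "STG_AZURE_ORDERS", auto_create_table=True)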


Snowpark Pandas API: Transforming Data at Scale

We all know that Pandas is one of the most popular tools for Python data engineers. However, it comes with critical limitations when dealing with large datasets. Snowflake solves this challenge with its Snowpark Pandas API, which allows you to:

  1. Avoid Out-of-Memory (OOM) Errors: Traditional Pandas runs on a single node and can’t scale, often causing painful memory errors. With Snowflake, your Pandas code is automatically compiled and executed on Snowflake’s distributed compute engine. This ensures seamless scaling for large datasets without the need to manage or tune infrastructure.
  2. Ensure Security and Governance: Pulling sensitive data onto local machines raises security risks. With Snowflake, all computations happen securely within Snowflake. Data stays in place, inheriting Snowflake’s governance policies and ensuring zero data movement.
  3. Eliminate Code Rewrites: Prototypes built with Pandas often need to be rewritten for production. Snowflake eliminates this step. You can write your Pandas code once and run it at scale on Snowflake. This saves time and improves productivity.

How the Snowpark Pandas API Works

The Snowpark Pandas API bridges the gap between Python Pandas code and Snowflake's SQL engine. When a user writes a Pandas query, such as:

pd.concat([df1, df2], axis=1)

The API processes this in three key steps:

  1. Query Translation: The Snowpark Pandas API translates the Pandas query into an optimized SQL query.
  2. Snowflake Connector for Python: The connector ensures the query runs natively within Snowflake's environment, pushing the computation to the SQL engine.
  3. Processing Execution: The translated SQL query is executed in Snowflake’s SQL engine, enabling distributed and scalable compute without pulling data to external environments.
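
Putting these steps together, here is a minimal end-to-end sketch of the Snowpark Pandas API. The modin import plus the Snowpark plugin enable the API; the table and column names are hypothetical:

# Snowpark pandas API: pandas syntax, executed as SQL inside Snowflake.
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers Snowflake as modin's backend
from snowflake.snowpark import Session

session = Session.builder.configs(connection_parameters).create()  # as in the earlier sketch

# Read two (hypothetical) tables as Snowpark pandas DataFrames.
df1 = pd.read_snowflake("SALES_2023")
df2 = pd.read_snowflake("SALES_2024")

# Familiar pandas operations; each is translated to SQL and pushed down.
combined = pd.concat([df1, df2], axis=0)
summary = combined.groupby("REGION")["REVENUE"].sum()

# Persist the result back to Snowflake without pulling data locally.
summary.reset_index().to_snowflake("REGION_REVENUE", if_exists="replace", index=False)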


Pipeline Orchestration: Automating Workflows with Serverless Tasks

The third core pillar of the session, presented by Manuela, addressed pipeline orchestration challenges. Snowflake’s Serverless Tasks allow data engineers to automate their workflows efficiently while optimizing both cost and performance.


Key Updates Announced

  1. Serverless Tasks Flex: Snowflake introduced Serverless Tasks Flex, which offers cost-optimized execution of batch ETL pipelines. It supports flexible scheduling by allowing users to define a target completion interval instead of a fixed schedule. This is the most efficient way to execute user-defined SQL, Python, Java, or Scala functions without managing compute.
  2. Serverless Tasks for Python/JVM: Snowflake now supports secure execution of Python, Java, and Scala user-defined functions (UDFs) and stored procedures within Serverless Tasks. Code runs in a secure sandbox environment without requiring infrastructure management. How to use it:

-- Call a pre-defined Python function from a serverless task:
CREATE TASK T1
  SCHEDULE = '1 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  SELECT myPY_UDF();

-- Flexible scheduling (Serverless Tasks Flex): define the Python logic
-- inline in the task's AS clause and set a target completion interval.
-- Note: Snowflake CRON schedules require a time zone.
CREATE TASK T2
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
  SCHEDULING_MODE = 'FLEXIBLE'
  TARGET_COMPLETION_INTERVAL = '240 MINUTE'
AS
  -- my python code
;

Manuela highlighted how Serverless Tasks improve productivity by automating pipelines and dynamically right-sizing compute based on workloads. This reduces resource waste, lowers the total cost of ownership (TCO), and removes the complexity of fine-tuning performance.


DevOps and CI/CD Pipeline

The final core pillar of the session focused on DevOps practices and CI/CD pipelines in Snowflake. Manuela presented how Snowflake enables modern, agile development for data engineers by integrating tools and processes that simplify build, deployment, and monitoring workflows.


Building and Coding

Data engineers can now package and organize their projects efficiently. Snowflake supports defining configurations like databases, schemas, and tables, all while integrating with version control systems such as Git. This allows seamless collaboration and versioning within development workflows.

Deployment and Release

Using the Snowflake CLI and Python API, developers can automate deployments by triggering CI/CD pipelines directly from their remote Git repositories. Changes are managed declaratively, ensuring consistency and reducing manual errors during deployments.
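
For example, with the Snowflake Python API a task can be created or updated declaratively from a CI/CD job. A minimal sketch, assuming hypothetical database, schema, task, and procedure names:

# Sketch: declaratively deploy a serverless task via the Snowflake Python API.
from datetime import timedelta

from snowflake.core import CreateMode, Root
from snowflake.core.task import Task
from snowflake.snowpark import Session

session = Session.builder.configs(connection_parameters).create()  # as in the earlier sketch
root = Root(session)

nightly_load = Task(
    name="NIGHTLY_LOAD",
    definition="CALL LOAD_ORDERS()",  # hypothetical stored procedure
    schedule=timedelta(hours=24),     # no warehouse specified -> serverless
)

tasks = root.databases["MY_DB"].schemas["MY_SCHEMA"].tasks
tasks.create(nightly_load, mode=CreateMode.or_replace)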

Example Workflow:

  • Code changes are committed to Git.
  • The Snowflake CLI automatically triggers the pipeline, deploying changes to Snowflake environments.

Monitoring and Operating

Snowflake’s Database Change Management (DCM) handles updates to database objects automatically. It supports declarative commands such as CREATE OR ALTER and EXECUTE IMMEDIATE FROM, simplifying database lifecycle management.
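
For instance, a versioned change script kept in a Git-backed repository stage can be applied with EXECUTE IMMEDIATE FROM. A minimal sketch, assuming a hypothetical repository stage and script path:

# Sketch: apply a versioned change script from a Git repository stage.
# The stage name and script path are hypothetical placeholders.
from snowflake.snowpark import Session

session = Session.builder.configs(connection_parameters).create()  # as in the earlier sketch
session.sql(
    "EXECUTE IMMEDIATE FROM @MY_GIT_REPO/branches/main/deploy/changes.sql"
).collect()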

To ensure robust monitoring, Manuela introduced Snowflake Trail, a set of tools for monitoring pipelines, applications, and compute performance. Key capabilities include:

  • Metrics, logs, and traces for debugging and monitoring.
  • Visualizing errors and performance bottlenecks in real time.

Snowflake Trail integrates natively with Snowsight but also supports external tools via OpenTelemetry, making it easy to bring your preferred observability platform.
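
Because Snowflake captures output from Python’s standard logging module into event tables, instrumenting a pipeline requires no special SDK. A minimal sketch of a stored procedure handler, assuming an event table is already configured (the procedure logic and names are illustrative):

# Sketch: emit logs from a Python stored procedure handler; with an event
# table configured, Snowflake Trail surfaces these records in Snowsight
# or forwards them to OpenTelemetry-compatible tools.
import logging

from snowflake.snowpark import Session

logger = logging.getLogger("pipeline_logger")

def run(session: Session) -> str:
    logger.info("Pipeline step started")
    try:
        rows = session.sql("SELECT COUNT(*) FROM STG_AZURE_ORDERS").collect()
        logger.info("Staged row count: %s", rows[0][0])
    except Exception:
        logger.exception("Pipeline step failed")
        raise
    return "OK"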


Demo: End-to-End Workflow in Action

The session concluded with a demo that brought all four pillars together. The workflow demonstrated:

  1. External Access: Connect securely to external resources using Private Links.
  2. Data Ingestion: Load data into Snowflake tables.
  3. Transformation: Use Snowpark Pandas for scalable data aggregation and transformation.
  4. Orchestration: Automate the entire pipeline with Serverless Tasks.
  5. Automation: Schedule Notebooks for automated workflows and integrate with Git for version control.

The key outcome was demonstrating how Snowflake simplifies data workflows, from external access to pipeline automation, all within a secure, scalable, and Python-friendly environment. It’s important to note that there is no code, repository, or link provided to recreate the demo. It’s purely demonstrated in the video, making it a reference rather than a hands-on walkthrough.

Conclusion: Simplifying Data Engineering with Snowflake

The session showcased how Snowflake streamlines every step of the data engineering process. From seamless development with familiar tools, secure and scalable data transformations with Snowpark Pandas, to efficient workflow automation using Serverless Tasks and modern DevOps practices, Snowflake provides a unified platform for building and managing data pipelines.

By eliminating the complexities of infrastructure management and code rewrites, Snowflake allows data engineers to focus on creating value and solving business problems. It ensures scalability, security, and cost efficiency, enabling teams to work smarter and faster.

If you haven’t seen the session yet, I highly recommend watching it to discover how Snowflake can simplify and enhance your Python-based workflows.
