A Python Data Engineer’s Journey with Snowflake: From ingestion, transformation to operationalization - Doris Lee & Manuela Wei's session at BUILD 2024
Sofia Pierini
Senior Data Engineer @EY | Snowflake Data Superhero | Founder & Chapter Leader Italy User Group
At #SnowflakeBUILD2024, Doris Lee (Senior Product Manager at @Snowflake) and Manuela Wei (Senior Product Manager at @Snowflake) delivered an inspiring session, "A Python Data Engineer’s Journey with Snowflake: From ingestion, transformation to operationalization". The session explored how Snowflake addresses the daily challenges faced by data engineers and streamlines their workflows using Snowpark, Serverless Tasks, and the Python API.
Data engineers are often confronted with:
- fragmented development tools and constant context switching between environments;
- transformations that stop scaling once datasets outgrow familiar libraries like Pandas;
- pipeline orchestration that demands continuous compute tuning;
- manual, error-prone deployment and monitoring processes.
Snowflake’s platform solves these challenges with a unified approach built around four core pillars: Authoring, Ingestion & Transformation, Pipeline Orchestration, and DevOps.
Would you like to see the session on demand? [Click here]
Snowpark: Recap
Before diving into the four pillars, Snowpark deserves a special mention as it is the foundation of Snowflake’s Python capabilities.
Snowpark provides a unified framework for running Python (alongside Java and Scala) directly on Snowflake. Whether you’re building pipelines, machine learning models, or applications, Snowpark simplifies the process with:
- a familiar DataFrame API whose operations are pushed down and executed as SQL inside Snowflake;
- support for UDFs and stored procedures, so custom Python logic runs next to the data;
- no separate clusters to manage and no data movement outside the platform.
With Snowpark, Python developers can focus on writing code while Snowflake handles scalability, security, and performance.
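To make this concrete, here is a minimal Snowpark sketch (connection parameters and the ORDERS table are placeholders): the DataFrame is built lazily in Python and executed as SQL inside Snowflake.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder connection parameters - replace with your own account details
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Transformations are lazy: nothing executes until an action like show() or collect()
orders = session.table("ORDERS")
large_orders = (
    orders.filter(col("AMOUNT") > 1000)
          .group_by("CUSTOMER_ID")
          .agg({"AMOUNT": "sum"})
)
large_orders.show()  # runs as SQL on Snowflake compute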
Authoring: Unified Development with Notebooks
The session kicked off by tackling a common challenge for data engineers—fragmented development tools. Doris introduced Snowflake Notebooks, which streamline the process of connecting to data and building workflows.
How Snowflake Notebooks Simplify Development:
- Python, SQL, and Markdown cells live side by side in a single interface;
- code runs directly on Snowflake compute, with no local environment or connector setup;
- notebooks can be scheduled, versioned, and shared with the rest of the team.
Notebooks can be created directly in Snowsight or integrated into existing projects. Snowflake Notebooks unify development, collaboration, and orchestration into a single interface—eliminating tool fragmentation and making data workflows more efficient.
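Inside a notebook there is no connection boilerplate at all: a cell can grab the active session directly. A minimal sketch (ORDERS is a placeholder table):
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # the notebook's own session, no credentials required
session.table("ORDERS").limit(10).show()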
Ingestion & Transformation
The second core pillar focused on data ingestion and scalable transformations—two critical processes for any data pipeline.
External Access and Private Connectivity
Snowflake makes it seamless to connect to external resources via a private link. Once your external sources (e.g., Azure SQL) are connected to Snowflake, you can securely ingest the data and leverage Snowflake’s compute power for scalable transformations.
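As a hedged sketch of the ingestion step, assuming an existing Snowpark session and an external stage named @raw_stage (both placeholders), staged files can be copied into a table like this:
from snowflake.snowpark import Session
from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType

session = Session.builder.configs({"account": "<account>", "user": "<user>", "password": "<password>"}).create()  # placeholders

# Declare the file schema so the COPY can create and validate the target table
schema = StructType([
    StructField("ORDER_ID", StringType()),
    StructField("AMOUNT", DoubleType()),
])

# Read CSV files from the (placeholder) external stage and COPY them into a table;
# copy_into_table creates RAW_ORDERS if it does not exist yet
df = session.read.schema(schema).csv("@raw_stage/orders/")
df.copy_into_table("RAW_ORDERS")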
Snowpark Pandas API: Transforming Data at Scale
We all know that Pandas is one of the most popular tools for Python data engineers. However, it comes with critical limitations when dealing with large datasets. Snowflake solves this challenge with its Snowpark Pandas API, which allows you to:
- run existing Pandas code at scale by pushing execution down to Snowflake’s engine;
- work on data where it lives, without pulling it into local memory;
- keep the familiar Pandas syntax with minimal (often zero) code changes.
How does the Snowpark Pandas API work?
The Snowpark Pandas API bridges the gap between Python Pandas code and Snowflake's SQL engine. When a user writes a Pandas query, such as:
pd.concat([df1, df2], axis=1)
The API processes this in three key steps:
1. The Pandas operation is intercepted and captured as a logical query plan instead of executing locally.
2. The plan is translated into equivalent SQL.
3. The SQL runs on Snowflake’s engine, and the result comes back as a DataFrame.
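Put together, a minimal Snowpark pandas sketch might look like this (table names and credentials are placeholders; the plugin import is what routes Pandas calls to Snowflake):
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # activates the Snowflake backend for Modin
from snowflake.snowpark import Session

session = Session.builder.configs({"account": "<account>", "user": "<user>", "password": "<password>"}).create()  # placeholders

df1 = pd.read_snowflake("CUSTOMERS")  # backed by Snowflake tables, not local memory
df2 = pd.read_snowflake("ORDERS")

# The familiar Pandas call below is compiled to SQL and executed inside Snowflake
combined = pd.concat([df1, df2], axis=1)
print(combined.head())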
Pipeline Orchestration: Automating Workflows with Serverless Tasks
The third core pillar of the session, presented by Manuela, addressed pipeline orchestration challenges. Snowflake’s Serverless Tasks allow data engineers to automate their workflows efficiently while optimizing both cost and performance.
Key Updates Announced
Two updates stood out: serverless tasks can call Python functions directly, and a flexible scheduling mode lets Snowflake choose when to run a task within a target completion window:
-- Call a pre-defined Python function from a serverless task:
CREATE TASK T1
  SCHEDULE = '1 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  SELECT myPY_UDF();

-- Run your Python logic in the task body
-- (my_python_proc is a placeholder for your own Python stored procedure):
CREATE TASK T2
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
  SCHEDULING_MODE = 'FLEXIBLE'
  TARGET_COMPLETION_INTERVAL = '240 MINUTE'
AS
  CALL my_python_proc();
Manuela highlighted how Serverless Tasks improve productivity by automating pipelines and dynamically right-sizing compute based on workloads. This reduces resource waste, lowers the total cost of ownership (TCO), and removes the complexity of fine-tuning performance.
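Tasks can also be created and managed from Python rather than SQL. A hedged sketch using the Snowflake Python API (connection details, database, and schema names are placeholders):
from datetime import timedelta
from snowflake.core import Root
from snowflake.core.task import Task
from snowflake.snowpark import Session

session = Session.builder.configs({"account": "<account>", "user": "<user>", "password": "<password>"}).create()  # placeholders
root = Root(session)

# Python equivalent of task T1 above; with no warehouse specified,
# Snowflake manages the compute serverlessly
task = Task(
    name="T1",
    definition="SELECT myPY_UDF()",
    schedule=timedelta(minutes=1),  # equivalent to SCHEDULE = '1 MINUTE'
)
root.databases["MY_DB"].schemas["PUBLIC"].tasks.create(task)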
DevOps and CI/CD Pipeline
The final core pillar of the session focused on DevOps practices and CI/CD pipelines in Snowflake. Manuela showed how Snowflake enables modern, agile development for data engineers by integrating tools and processes that simplify build, deployment, and monitoring workflows.
Building and Coding
Data engineers can now package and organize their projects efficiently. Snowflake supports defining configurations like databases, schemas, and tables, all while integrating with version control systems such as Git. This allows seamless collaboration and versioning within development workflows.
Deployment and Release
Using the Snowflake CLI and Python API, developers can automate deployments by triggering CI/CD pipelines directly from their remote Git repositories. Changes are managed declaratively, ensuring consistency and reducing manual errors during deployments.
Example Workflow:
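For instance, a hedged sketch assuming a Snowpark session and a Git repository object MY_REPO already configured in Snowflake to point at the project repository:
from snowflake.snowpark import Session

session = Session.builder.configs({"account": "<account>", "user": "<user>", "password": "<password>"}).create()  # placeholders

# Pull the latest commits from the remote, then run the deploy script
# straight from the repository stage
session.sql("ALTER GIT REPOSITORY MY_REPO FETCH").collect()
session.sql("EXECUTE IMMEDIATE FROM @MY_REPO/branches/main/deploy/setup.sql").collect()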
Monitoring and Operating
Snowflake’s Database Change Management (DCM) handles updates to database objects automatically. It supports commands like CREATE, ALTER, and EXECUTE IMMEDIATE, simplifying database lifecycle management.
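The declarative style is easiest to see with CREATE OR ALTER: running the same statement repeatedly simply converges the object to the declared state. A minimal sketch via Snowpark (reusing the session from the previous sketch; the table is hypothetical):
# Re-runnable, declarative DDL: Snowflake diffs the declared definition
# against the live table and applies only the changes
session.sql("""
    CREATE OR ALTER TABLE RAW_ORDERS (
        ORDER_ID   STRING,
        AMOUNT     NUMBER(12, 2),
        ORDER_DATE DATE
    )
""").collect()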
To ensure robust monitoring, Manuela introduced Snowflake Trail, a set of tools for monitoring pipelines, applications, and compute performance. Key capabilities include:
- logs, metrics, and traces collected automatically in event tables;
- built-in Snowsight dashboards for tasks, pipelines, and compute;
- alerts and notifications when something fails or degrades.
Snowflake Trail integrates natively with Snowsight but also supports external tools via OpenTelemetry, making it easy to bring your preferred observability platform.
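On the instrumentation side, Python code needs nothing more than the standard logging module; once an event table is active, Snowflake Trail captures the records automatically. A minimal sketch of a stored procedure handler (procedure registration and the RAW_ORDERS table are assumed):
import logging

logger = logging.getLogger("orders_pipeline")

def run(session) -> str:
    # With an active event table, these log records flow into Snowflake Trail
    logger.info("pipeline run started")
    rows = session.table("RAW_ORDERS").count()
    logger.info("row count: %s", rows)
    return f"processed {rows} rows"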
Demo: End-to-End Workflow in Action
The session concluded with a demo that brought all four pillars together. The workflow demonstrated:
- connecting securely to an external source (Azure SQL) via private connectivity;
- ingesting and transforming the data with Snowpark and the Snowpark Pandas API in a Snowflake Notebook;
- orchestrating the pipeline with Serverless Tasks;
- deploying and monitoring it with Git integration, the Python API, and Snowflake Trail.
The key outcome was demonstrating how Snowflake simplifies data workflows, from external access to pipeline automation, all within a secure, scalable, and Python-friendly environment. It’s important to note that there is no code, repository, or link provided to recreate the demo. It’s purely demonstrated in the video, making it a reference rather than a hands-on walkthrough.
Conclusion: Simplifying Data Engineering with Snowflake
The session showcased how Snowflake streamlines every step of the data engineering process. From seamless development with familiar tools and secure, scalable data transformations with Snowpark Pandas to efficient workflow automation using Serverless Tasks and modern DevOps practices, Snowflake provides a unified platform for building and managing data pipelines.
By eliminating the complexities of infrastructure management and code rewrites, Snowflake allows data engineers to focus on creating value and solving business problems. It ensures scalability, security, and cost efficiency, enabling teams to work smarter and faster.
If you haven’t seen the session yet, I highly recommend watching it to discover how Snowflake can simplify and enhance your Python-based workflows.