Data Engineering: Snowflake vs Databricks
From a data engineering perspective, Snowflake and Databricks are quite different. Snowflake is almost pure SQL, whereas Databricks is PySpark. So for loading and transformation in Databricks you have to understand DataFrames (and often Pandas), whereas in Snowflake loading and transformation are all SQL, i.e. COPY INTO, CTAS and Snowpipe.
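To make that contrast concrete, here is a minimal sketch of a Snowflake batch load driven from Python. The account details, stage, table and column names are all made up for illustration:

import snowflake.connector  # pip install snowflake-connector-python

# Connect (credentials and object names are placeholders)
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="LOAD_WH", database="RAW", schema="SALES",
)
cur = conn.cursor()

# Bulk load staged files into a table - plain SQL, no DataFrames involved
cur.execute("""
    COPY INTO raw_orders
    FROM @landing_stage/orders/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Transform with CTAS, again plain SQL
cur.execute("""
    CREATE OR REPLACE TABLE curated_orders AS
    SELECT order_id, customer_id, amount, order_date
    FROM raw_orders
    WHERE amount > 0
""")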
That, my friends, was old news. These days, Databricks shops use Autoloader. Autoloader loads files into Delta tables as soon as they arrive. Simple, easy, flexible and scalable (see diagram below, source: link). Under the hood it uses df = spark.readStream.format(...). And for batch loads, Databricks has COPY INTO too, just like Snowflake, loading files into Delta tables. Again, simple, easy and flexible.
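Here is a minimal Autoloader sketch, followed by the batch COPY INTO equivalent; the paths and table names are illustrative, not from any real project:

# Autoloader: incrementally pick up new files as they land
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)   # or leave this out for a continuously running stream
   .toTable("bronze.events"))

# Batch alternative: COPY INTO a Delta table, much like Snowflake's COPY INTO
spark.sql("""
    COPY INTO bronze.events
    FROM '/mnt/landing/events'
    FILEFORMAT = JSON
""")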
Talking about ingestion, one has to mention streaming (as in Kafka, not files). In Snowflake it's done with the Snowpipe Streaming API (link, see below), whereas in Databricks you use spark.readStream with the Kafka source.
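The Databricks side looks roughly like this (broker, topic and table names are placeholders); the Snowpipe Streaming side goes through Snowflake's own client SDK rather than SQL, so I'll leave it at the link above:

from pyspark.sql.functions import col

# Read a Kafka topic as a stream
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "orders")
            .option("startingOffsets", "latest")
            .load())

# Kafka gives you binary key/value columns; cast them and write to a Delta table
(kafka_df
   .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
   .writeStream
   .option("checkpointLocation", "/mnt/checkpoints/orders_stream")
   .toTable("bronze.orders_stream"))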
Then we come to orchestration, a topic every data engineering project must deal with: how are you going to organise those loading and transformation jobs and tasks, as in triggering them, monitoring them, etc.? In Databricks you use Workflows to organise jobs and tasks, including PySpark and SQL notebooks, SQL queries, DLT pipelines and control-flow tasks. You can trigger them on a schedule/timer or on file arrival, or run them continuously.
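As an illustration, here is a hedged sketch of creating a two-task workflow through the Jobs REST API (version 2.1); the workspace URL, token, notebook paths and cluster id are all placeholders:

import requests

host = "https://<workspace>.azuredatabricks.net"   # placeholder workspace URL
token = "dapi..."                                   # placeholder personal access token

# Two tasks: an ingest notebook first, then a transform notebook,
# triggered every night at 02:00 UTC
job = {
    "name": "nightly_ingest_and_transform",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/de/ingest_autoloader"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/de/transform_silver"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job)
print(resp.json())   # returns the new job_id on success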
In Snowflake the only internal orchestration tool is Tasks, which we can use to chain procedures and SQL statements, but that's it. So the usual solution is external tools, such as Airflow, Prefect, dbt and ADF.
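For example, a minimal Tasks sketch might look like this; the warehouse, procedure and task names are made up:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="ETL_WH", database="RAW", schema="SALES",
)
cur = conn.cursor()

# A root task on a CRON schedule calling a stored procedure,
# and a child task that runs after it
cur.execute("""
    CREATE OR REPLACE TASK load_orders_task
      WAREHOUSE = ETL_WH
      SCHEDULE = 'USING CRON 0 2 * * * UTC'
    AS
      CALL load_orders_proc()
""")
cur.execute("""
    CREATE OR REPLACE TASK transform_orders_task
      WAREHOUSE = ETL_WH
      AFTER load_orders_task
    AS
      CALL transform_orders_proc()
""")

# Tasks are created suspended; resume the child first, then the root, to start the schedule
cur.execute("ALTER TASK transform_orders_task RESUME")
cur.execute("ALTER TASK load_orders_task RESUME")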
Of course, every company must have DevOps in data engineering, and an IaC tool. These two don't depend on whether you use Snowflake or Databricks. Either way, you can use Terraform or ARM templates for your IaC, and ADO (Azure DevOps) or Jenkins+Git for DevOps.
That’s data ingestion (batch and streaming), orchestration, DevOps and IaC. For transformation, dbt and Matillion are quite big in both worlds (Databricks and Snowflake). As both platforms use SQL for transformation, they lend themselves to SQL-based tools like dbt and Matillion. These tools can do orchestration too, but only within transformation, not ingestion. So you still need the orchestration tools I mentioned earlier (Databricks Workflows, Airflow, Prefect) to organise ingestion plus transformation, or tools like Control-M and ActiveBatch to organise ingestion and transformation as part of the wider enterprise data processes, including scheduling and monitoring.