Data Engineering: Snowflake vs Databricks

From data engineering perspective, Snowflake and Databricks are quite different. Snowflake is almost pure SQL, whereas Databricks is PySpark. So from loading and transformation in Databricks you’ve got to understand DataFrame and Pandas. Whereas in Snowflake for loading and transformation it’s all SQL, i.e. COPY INTO, CTAS, Snowpipe.

That, my friends, was old news. These days, Databricks shops use Autoloader. Autoloader loads files as soon as they arrive into Delta Live tables in Databricks. Simple, easy, flexible and scalable (see diagram below, source: link). It uses df = spark.readStream.format(...) And for batch load, Databricks uses COPY INTO too, just like Snowflake. It loads files into Delta Live tables. Again, simple, easy and flexible.


Talking about ingestion, one has to mention streaming (as in Kafka, not files). In Snowflake it’s using Snowpipe Streaming API (link, see below). Whereas Databricks uses spark.readStream.


Then we talk about orchestration. A topic which every data engineering project must have. How are you going to organize those loadings and transformations jobs and tasks? As in triggering them, monitoring them, etc. In Databricks they use Workflow to organise jobs and task, including PySpark and SQL notebooks, SQ queries, DLT pipeline and control flow tasks. You can trigger using a schedule/timer, or file arrival. Or continuously running.

In Snowflake the only internal tool is |Task, which we can use to organize procedures, but that’s it. So the solution is external tools, such as Airflow, Prefect, dbt, and ADF.

Of course, every company must have DevOps in data engineering. And IaC tool. These two are not dependent on whether you use Snowflake or Databricks. Either way, you can use Terraform or ARM for your IaC. You can use ADO or Jenkins+Git for DevOps.

That’s data ingestion (batch and streaming), orchestration, DevOps and IaC. For transformation, dbt and Matillion are quite big on both worlds (Databricks and Snowflake). As both are using SQL for transformation, it lends itself to SQL-based tools like dbt and Matillion. And they can do orchestration too, but only within transformation, not ingestion. So you need that orchestration tool I mentioned earlier (Databricks Workflow, Airflow, Prefect) to organise ingestion+transformation. Or tools like Control-M and ActiveBatch to organise ingestion + transformation as part of the whole enterprise data processes, including scheduling and monitoring.

Ralf Fischer

Leading Data, AI & Business Integration @ bridgingIT

6 个月

Hi Vincent, thanks for sharing your thoughts. Of course this is a bit high level - the adoption of Autoloader and COPY INTO by Databricks makes its ingestion process not as straightforward as Snowflake’s, mentioning the steeper learning curve associated with PySpark and managing streaming data in Databricks.

要查看或添加评论,请登录

Vincent Rainardi的更多文章

  • DQ Engineering

    DQ Engineering

    DQ stands for Data Quality. If you don't have a background in data quality, read this first: https://www.

  • Data Product

    Data Product

    For those of you who don't know what a data product and “data as a product” are, please read this first:…

    3 条评论
  • Snowflake vs SQL Server

    Snowflake vs SQL Server

    Sometimes we need to remind ourselves that Snowflake is not an OLTP database. I know today is the era of Hybrid tables…

    6 条评论
  • Data engineer becoming solution architect

    Data engineer becoming solution architect

    Are you a data engineer thinking about transitioning to a cloud solution architect? Data engineer are good with…

    2 条评论
  • Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

    Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

    If you work in banking or investment or any other sector in financial services, you might be wondering about the above.…

  • Data Warehousing Basics: Cost

    Data Warehousing Basics: Cost

    If you call yourself a data engineer you need to be aware of 2 additional things compared to a developer. The first one…

    2 条评论
  • My Linkedin post & articles

    My Linkedin post & articles

    The list below goes back to Nov 2024. For older than that see here.

    9 条评论
  • Data Warehousing Basics: Single Customer View

    Data Warehousing Basics: Single Customer View

    Imagine that you work for an insurance company who sell health insurance (HI), life insurance (LI), general insurance…

    2 条评论
  • Data Warehousing Basics: NFR

    Data Warehousing Basics: NFR

    What I’m about to tell you today failed a lot of data warehousing projects which is why it’s worth paying attention so…

    1 条评论
  • ML and AI - What's the difference?

    ML and AI - What's the difference?

    Machine Learning covers about 20-30 algorithms such as Logistic Regression, Decision Tree, Gradient Boosting, Random…

    4 条评论

社区洞察

其他会员也浏览了