Data Engineering: Snowflake vs Databricks
From a data engineering perspective, Snowflake and Databricks are quite different. Snowflake is almost pure SQL, whereas Databricks is PySpark. So for loading and transformation in Databricks you have to understand DataFrames (and often Pandas), whereas in Snowflake loading and transformation are all SQL, i.e. COPY INTO, CTAS and Snowpipe.
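To make that contrast concrete, here is a minimal sketch of a Snowflake batch load driven from Python. The account details, stage, table and column names are all made up for illustration:

import snowflake.connector  # pip install snowflake-connector-python

# Connect (credentials and object names are placeholders)
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="LOAD_WH", database="RAW", schema="SALES",
)
cur = conn.cursor()

# Bulk load staged files into a table - plain SQL, no DataFrames involved
cur.execute("""
    COPY INTO raw_orders
    FROM @landing_stage/orders/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Transform with CTAS, again plain SQL
cur.execute("""
    CREATE OR REPLACE TABLE curated_orders AS
    SELECT order_id, customer_id, amount, order_date
    FROM raw_orders
    WHERE amount > 0
""")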
That, my friends, was old news. These days, Databricks shops use Autoloader. Autoloader loads files into Delta tables as soon as they arrive. Simple, easy, flexible and scalable (see diagram below, source: link). Under the hood it uses df = spark.readStream.format(...). And for batch loads, Databricks has COPY INTO too, just like Snowflake, loading files into Delta tables. Again, simple, easy and flexible.
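Here is a minimal Autoloader sketch, followed by the batch COPY INTO equivalent; the paths and table names are illustrative, not from any real project:

# Autoloader: incrementally pick up new files as they land
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)   # or leave this out for a continuously running stream
   .toTable("bronze.events"))

# Batch alternative: COPY INTO a Delta table, much like Snowflake's COPY INTO
spark.sql("""
    COPY INTO bronze.events
    FROM '/mnt/landing/events'
    FILEFORMAT = JSON
""")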
Talking about ingestion, one has to mention streaming (as in Kafka, not files). In Snowflake it's done with the Snowpipe Streaming API (link, see below), whereas in Databricks you use spark.readStream with the Kafka source.
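The Databricks side looks roughly like this (broker, topic and table names are placeholders); the Snowpipe Streaming side goes through Snowflake's own client SDK rather than SQL, so I'll leave it at the link above:

from pyspark.sql.functions import col

# Read a Kafka topic as a stream
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "orders")
            .option("startingOffsets", "latest")
            .load())

# Kafka gives you binary key/value columns; cast them and write to a Delta table
(kafka_df
   .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
   .writeStream
   .option("checkpointLocation", "/mnt/checkpoints/orders_stream")
   .toTable("bronze.orders_stream"))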
Then we come to orchestration, a topic every data engineering project must deal with: how are you going to organise those loading and transformation jobs and tasks, as in triggering them, monitoring them, etc.? In Databricks you use Workflows to organise jobs and tasks, including PySpark and SQL notebooks, SQL queries, DLT pipelines and control-flow tasks. You can trigger them on a schedule/timer or on file arrival, or run them continuously.
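As an illustration, here is a hedged sketch of creating a two-task workflow through the Jobs REST API (version 2.1); the workspace URL, token, notebook paths and cluster id are all placeholders:

import requests

host = "https://<workspace>.azuredatabricks.net"   # placeholder workspace URL
token = "dapi..."                                   # placeholder personal access token

# Two tasks: an ingest notebook first, then a transform notebook,
# triggered every night at 02:00 UTC
job = {
    "name": "nightly_ingest_and_transform",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/de/ingest_autoloader"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/de/transform_silver"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job)
print(resp.json())   # returns the new job_id on success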
In Snowflake the only internal orchestration tool is Tasks, which we can use to chain procedures and SQL statements, but that's it. So the usual solution is external tools, such as Airflow, Prefect, dbt and ADF.
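For example, a minimal Tasks sketch might look like this; the warehouse, procedure and task names are made up:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="ETL_WH", database="RAW", schema="SALES",
)
cur = conn.cursor()

# A root task on a CRON schedule calling a stored procedure,
# and a child task that runs after it
cur.execute("""
    CREATE OR REPLACE TASK load_orders_task
      WAREHOUSE = ETL_WH
      SCHEDULE = 'USING CRON 0 2 * * * UTC'
    AS
      CALL load_orders_proc()
""")
cur.execute("""
    CREATE OR REPLACE TASK transform_orders_task
      WAREHOUSE = ETL_WH
      AFTER load_orders_task
    AS
      CALL transform_orders_proc()
""")

# Tasks are created suspended; resume the child first, then the root, to start the schedule
cur.execute("ALTER TASK transform_orders_task RESUME")
cur.execute("ALTER TASK load_orders_task RESUME")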
Of course, every company must have DevOps in data engineering, and an IaC tool. These two don't depend on whether you use Snowflake or Databricks. Either way, you can use Terraform or ARM templates for your IaC, and ADO (Azure DevOps) or Jenkins+Git for DevOps.
That’s data ingestion (batch and streaming), orchestration, DevOps and IaC. For transformation, dbt and Matillion are quite big in both worlds (Databricks and Snowflake). As both platforms use SQL for transformation, they lend themselves to SQL-based tools like dbt and Matillion. These tools can do orchestration too, but only within transformation, not ingestion. So you still need the orchestration tools I mentioned earlier (Databricks Workflows, Airflow, Prefect) to organise ingestion plus transformation, or tools like Control-M and ActiveBatch to organise ingestion and transformation as part of the wider enterprise data processes, including scheduling and monitoring.