Mapping Microsoft's Data Analytics Landscape – Comparing Databricks, Synapse and Fabric
Azure Databricks, Azure Synapse Analytics and Microsoft Fabric

Recently Microsoft announced Fabric, their new data analytics platform. With that they introduced another offering in their already extensive landscape of data and analytics tools: Azure Data Factory, Azure Synapse Analytics, Power BI, Azure Machine Learning, and Azure Databricks, just to name a few. With so many services, it can be challenging to understand how they compare. In this article I try to shed some light. I focus on Azure Databricks, Azure Synapse Analytics, and Microsoft Fabric because these platforms are the most comparable: each positions itself as a unified, end-to-end data analytics platform.

Where we stand (the current landscape)

The graphic below compares Databricks, Synapse, and Fabric on features and capabilities.

Databricks, Synapse, and Fabric – features and capabilities

Azure Databricks

First there was Databricks. Databricks has been generally available on Azure since March 2018. Databricks is an independent company whose platform runs on other clouds as well, so does it even have a place in a comparison of Microsoft analytics products? It definitely does. Azure Databricks is a "first-party" Azure service, meaning that it integrates deeply with the Azure ecosystem. Product development is a joint effort between Databricks and Microsoft engineers.

What started as something I like to call "Spark-as-a-Service" (Databricks was founded and is led by the creators of Apache Spark) has turned into a full-fledged analytics platform that goes far beyond Spark. With the creation of the open-source technologies Delta Lake (storage format) and MLflow (MLOps framework) and the invention of the Lakehouse paradigm, Databricks has shaped the industry and has been an inspiration for Azure Synapse and Microsoft Fabric (that's a friendly way to say it).

Azure Synapse

Azure Synapse Analytics was Microsoft's first "unified analytics engine" and has been generally available since December 2020. It's the successor of Azure SQL Data Warehouse, with newly added capabilities for serverless SQL, Spark, and orchestration and data movement pipelines based on the Data Factory engine. In late 2021, Synapse enabled the Lakehouse architecture on its platform by adding support for Delta in its serverless SQL engine.

The similarities between Synapse and Databricks are obvious. They are both PaaS (Platform-as-a-Service) offerings with decoupled storage and compute. They both offer Spark for big data analytics, a proprietary SQL engine for data warehousing, storage in Delta, a unified metastore that enables interoperability (to a degree) between their Spark and SQL engines, and orchestration capabilities to create and trigger data pipelines. Sure, there's much more to both products than just these features, but they are what most of the workloads rely on in my experience.

Although Synapse and Databricks have such a comparable proposition, their user experience is very different. Having worked extensively on both platforms, I can confidently say that Synapse is much less robust than Databricks. Synapse has had GA status for a couple of years now, but still behaves like a beta product in many cases. The number of "Livy errors" I have faced when working with Spark in Synapse makes me advise against the service whenever I'm asked.

Microsoft Fabric

Fabric was announced with a bang at this year's edition of the Microsoft Build event. The service is in Public Preview now and is expected to reach General Availability at the beginning of 2024 (according to a Microsoft architect I spoke with recently). Much like Synapse, Fabric's proposition is to be a unified analytics engine: manage all of your data and AI workloads, end-to-end, within a single platform. Fabric can do everything that Synapse can do, and more. It is widely regarded as the evolution of Synapse.

You may be wondering: why doesn't Microsoft just add the missing features (e.g. MLOps) to Synapse instead of introducing a new service? That is a very natural question to ask given the overlap between Synapse and Fabric: to a large extent, both services are powered by the same technologies and engines. But there are also some major differences. While Synapse is PaaS, Fabric is SaaS (Software-as-a-Service). This means there is a complete abstraction from the underlying infrastructure components. And that means, for example, that you don't have to think about provisioning an ADLS Gen2 storage account, since Fabric's storage – OneLake – is available out-of-the-box. Behind the scenes OneLake still uses ADLS Gen2 storage, but the point is that the user shouldn't have to care about that. The same goes for compute. You don't have to think about managing a Spark cluster or pool because Fabric does this for you: you simply hit run and Microsoft finds a server to execute your code and return the results.
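To make the storage difference a bit more concrete, here is a minimal sketch of how a data location is addressed in each model. It assumes the documented `abfss://` URI shapes for ADLS Gen2 and for OneLake's ADLS-compatible endpoint; the helper functions and every account, container, workspace, and item name are hypothetical and exist only for illustration.

```python
# Sketch: storage URIs for ADLS Gen2 versus OneLake.
# The helper names and all account/container/workspace/item names are hypothetical.

def adls_gen2_uri(account: str, container: str, path: str) -> str:
    """URI for a path in a storage account that you provision and name yourself."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def onelake_uri(workspace: str, item: str, item_type: str, path: str) -> str:
    """URI for a path in OneLake: there is no storage account to create or name."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{item}.{item_type}/{path.lstrip('/')}"
    )


# PaaS (Synapse, Databricks): the storage account is part of the address.
paas_path = adls_gen2_uri("mystorageacct", "lake", "sales/orders")

# SaaS (Fabric): the address is built from a workspace and an item in it.
saas_path = onelake_uri("SalesWorkspace", "SalesLakehouse", "Lakehouse", "Tables/orders")

print(paas_path)
print(saas_path)
```

In the first case the storage account is something you created, secured, and pay for as a separate Azure resource; in the second, the only things you name are the Fabric workspace and the item inside it.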

With Fabric, Microsoft aims to make data analytics more accessible for business users. The Fabric user experience builds upon the Power BI user experience, which is much closer to Office 365 than it is to Azure. Power BI users will feel right at home inside Fabric. Power BI's popularity (and Synapse's lack of it) has probably been an important driver for Microsoft's product design choices around Fabric. Fabric's abstraction from Azure infrastructure components is probably also related to its branding as Microsoft Fabric as opposed to Azure Fabric. The Azure Portal is completely bypassed – Fabric users manage everything from within the Fabric user interface.

Where it's going (the future landscape)

Microsoft is clearly betting its chips on Fabric. What does that mean for Synapse's future? The product page for Microsoft Fabric currently says the following:

"Existing Microsoft products such as Azure Synapse Analytics, Azure Data Factory, or Azure Data Explorer, will continue to provide a robust, enterprise-grade PaaS solution for data analytics. Microsoft Fabric represents an evolution of those offerings in the form of a simplified SaaS solution that can connect to existing PaaS offerings. Migration paths will soon be made available to help transition teams that are ready to switch services."

So Synapse will stay. This can be a reassurance for organizations that have invested in the service and built their analytical stack around it. That being said, there has hardly been any development on Synapse since Fabric's announcement in May. The "what's new" page tells us that not much is new. This aligns with what the Microsoft architect I spoke with said: development efforts are focused on Fabric, and limited to security and bug fixes for Synapse.

And what about Azure Databricks? Well, the development and innovation of the Databricks platform is driven by Databricks engineers, not by Microsoft engineers. As long as Microsoft allows them on the Azure cloud, Databricks will continue improving and extending Azure Databricks. And while Microsoft might prefer organizations to use Fabric rather than Azure Databricks, they would still rather have organizations run Databricks on Azure than on AWS (or GCP, for that matter). As such, Azure Databricks will stay. I see a bright future for (Azure) Databricks, and its users have little reason to be concerned about Microsoft Fabric.

I'm using Synapse, what now?

Synapse's development stagnation means organizations should think (or start thinking) about migrating away from Synapse. Many of those organizations will move to Fabric, because that's the 'natural' thing to do. It's what Microsoft recommends, and probably what most of the consultancy firms will advise. Personally, I would not recommend migrating to Fabric any time soon. Fabric is not an option as long as it's in preview. But even after it's released to GA, I don't consider it an option for at least six months. You should not want to be dealing with the inevitable teething troubles of a fresh product, especially knowing the number of issues Synapse still has, years after it was released to GA. So Fabric is not an option now, but it definitely becomes one once it has proven to be a performant, robust, and cost-effective platform. Other data platforms (Databricks, Snowflake, BigQuery, Redshift, ...) will have a hard time competing with Fabric if it lives up to its proposition.

But if you're an uncomfortable Synapse user now, and you don't want to wait for Fabric to mature, what are your options? I would argue Azure Databricks is your best shot. Databricks offers a stable and innovative platform today. Moreover, similarity between the platforms makes a migration from Synapse to Databricks relatively straightforward: your data can probably stay where it is and your code can be ported with minor modifications. Of course, this is generally speaking – it all depends on the details of your workloads.
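As a rough illustration of what "minor modifications" can look like, here is a minimal sketch of a helper that rewrites a few Synapse notebook utility calls to their Databricks counterparts and flags the ones that need manual attention. The function `port_notebook_source` and the two mapping tables are hypothetical and deliberately incomplete; the sketch assumes that common `mssparkutils.fs` and `mssparkutils.notebook` calls map onto `dbutils.fs` and `dbutils.notebook`, while secret retrieval takes different arguments on each platform and has to be reworked by hand. The storage paths and vault name in the sample are made up.

```python
# Illustrative, non-exhaustive mapping from Synapse notebook utilities to
# their closest Databricks counterparts. Verify each against your workloads.
DIRECT_RENAMES = {
    "mssparkutils.fs.": "dbutils.fs.",
    "mssparkutils.notebook.": "dbutils.notebook.",
}

# Calls whose arguments differ between the platforms and need a manual rewrite.
NEEDS_REVIEW = ["mssparkutils.credentials."]


def port_notebook_source(source: str) -> tuple[str, list[str]]:
    """Apply the simple renames and return the lines that still need manual work."""
    ported = source
    for old, new in DIRECT_RENAMES.items():
        ported = ported.replace(old, new)
    flagged = [
        line.strip()
        for line in ported.splitlines()
        if any(marker in line for marker in NEEDS_REVIEW)
    ]
    return ported, flagged


# Hypothetical Synapse notebook cell; the data itself stays in the same ADLS Gen2 path.
synapse_cell = '''files = mssparkutils.fs.ls("abfss://lake@mystorageacct.dfs.core.windows.net/sales")
secret = mssparkutils.credentials.getSecret("my-vault", "db-password")
df = spark.read.format("delta").load("abfss://lake@mystorageacct.dfs.core.windows.net/sales/orders")
'''

ported_cell, flagged_lines = port_notebook_source(synapse_cell)
print(ported_cell)
print("Needs manual review:", flagged_lines)
```

Note that the `spark.read` line passes through untouched: plain Spark code reading Delta from the same storage path is the part that tends to carry over as-is, while platform-specific utilities are where the porting effort sits.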

Wrapping up

Microsoft's data analytics landscape is diverse and will continue to be diverse. Although Microsoft aims to make Fabric the unified platform for all your analytics needs, Azure Databricks will remain alongside Fabric and compete against it with the same proposition of being a unified analytics platform. Even Microsoft's 'own' succeeded-by-Fabric analytics services (e.g. Synapse) will continue to exist for the foreseeable future. Ultimately, choice in the analytics platform market is good for its consumers, even when it can be challenging for them to fully understand the competing products.


Thomas Totter

Have you tried turning it off and on again?

1 year ago

One thing is maybe a bit misleading in the comparison: I don't see why you would attribute ADF to Synapse/Fabric only. You might as well put it on the Databricks side too. And thanks for this article, which pretty much confirmed the gut feeling I had after playing around with Fabric on the very first day it was released. The summary I gave my colleagues was something like: it's a dumbed-down and buggy (this has probably gotten a lot better) version of Databricks, trying to win back some piece of the analytics/data cake that they lost over the past years. And my take on why Microsoft didn't just update Synapse but moved the whole thing to Power BI: I see it like you do. Synapse was a good concept but never left beta, while on the other hand you had that powerhouse called Databricks. They knew they would never catch up and get back those projects and devs that are implementing Databricks today. So they just exposed the largest group of customers they had and tailored a lakehouse to their liking. Not sure if this will play out though, as I think that a lot of (new) devs who are getting into lakehouse/Spark/Delta because of Fabric now will end up in Databricks (a year later).

Roel Peters

Co-founder of Gatekeeper | Easy hard skill assessments

1 year ago

That's an *excellent* comparison. I might have been disappointed by Microsoft once too often, so I have some reflections:
- Fabric uses Livy too, and I've seen users encountering issues with it. So this certainly is not fixed.
- Since there's no more active development on Synapse and it's such an incredibly buggy product, Microsoft is basically bullying its users into adopting Fabric.
- This is a typical Microsoft Azure story: abstractions on top of abstractions. Is Fabric the end state?
- A leopard does not change its spots. I wonder how long it will take before Microsoft starts adding proprietary features to Delta Lake, Spark or MLflow, locking in its users.
- With the current availability and composability of best-of-breed data tools, I do not see why enterprise customers would trust Microsoft with their data platform, given their track record.

More articles by Jorrit Sandbrink