Choosing between Lakehouse, Data Warehouse and Real-time Analytics in Microsoft Fabric.
Jacob Rønnow Jensen
Head of Data Platform @ AP Pension | Leadership | Data Delivery | Microsoft Data Platform | Business Intelligence | Digital Transformation | IT-Strategy | Data Warehousing | Data Architecture | Collaboration
It is very easy to ingest data into Microsoft Fabric and start working with it - whether in a Lakehouse, Data Warehouse, KQL database or Datamart.
Although there is a lot of overlap between these offerings, there are also some subtle differences that I feel are important to explore - and what better time than now, when Fabric is in Public Preview and a 60-day free trial is available.
The article Lakehouse end-to-end scenario: overview and architecture describes how an entire Data Delivery Platform - from the extraction of raw data through dimensional modelling - can be built using only Lakehouses. However, it also notes that “with the flexibility offered by Fabric, you can implement either Lakehouse or Data Warehouse architectures or combine these two together to get the best of both.”
Three scenarios for choosing between Data Warehouse, Lakehouse and Datamart are described in the article Fabric decision guide - lakehouse or data warehouse. The key differentiator here (as in the rest of the mainstream narrative on this subject) seems to be the skill set of the existing developer team - with an honorary mention of the possibility of multi-table transactions in the Data Warehouse.
The newly introduced Datamarts in Power BI still exist under the Fabric license, and to muddy the waters even further, a fourth path for ingesting and transforming data in your Data Delivery Platform is Real-Time Analytics in Microsoft Fabric, which through eventstreams and a KQL database offers some of the same capabilities as the Lakehouse and Warehouse. (Below, I have lined up the suggested architectures from the Lakehouse and Real-time Analytics documentation respectively, which I feel illustrates this point.)
I feel that Microsoft Fabric - with its unified experience, the open format on OneLake, and the rise of Copilots and LLMs - offers us the ability to change the way we work with data and analysis, and that the optimal combination for ingesting, modelling and serving data in Fabric should depend on the specific use case, performance, price and governance model - not on current skills or what we did in the past.
For my own use in the Public Preview, I have compiled a comparison of capabilities between the Lakehouse, Data Warehouse, Real-time Analytics and Power BI Datamart ... and I have highlighted the points that set them apart.
When working with data on an ad-hoc or exploratory basis, it can be a good idea to match the technology to the preferences of the persona working with it. Together with the Premium Per User license, this can make the case for using the Power BI Datamart for Citizen Developers ... just as the case can be made for offering the Lakehouse to Data Scientists and the Data Warehouse to Data Analysts.
For Data Engineers, my suggestion right now is to start experimenting with the different paths and technologies and monitor the performance and limitations ... and to that end, I hope that the table above can give you an idea of where to start.
A few additional comments:
Currently, performance in the Lakehouse seems better than in the Warehouse, but this is to be expected, since performance, concurrency and scale have not been the focus of the current release.
At the time of GA, I would expect the Data Warehouse artifact to also support Direct Lake, but for now, this can only be tested against a Lakehouse.
The support for multi-table transactions in the Data Warehouse artifact could be a key factor when choosing how to implement your gold layer, if you want to ensure that related tables (e.g. in a star schema) are updated together or not updated at all. I also feel that one should take the warning from Better together - the lakehouse and warehouse seriously: "A SQL Endpoint is not scoped to data analytics in just the Fabric Lakehouse."
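To make the value of multi-table transactions concrete, here is a minimal sketch of the pattern. Since I cannot embed a live Warehouse connection here, it uses Python's standard-library sqlite3 module (with made-up table names) purely to illustrate the guarantee: related fact and dimension writes either all commit or all roll back. Against a Fabric Warehouse you would express the same batch in T-SQL (BEGIN TRANSACTION ... COMMIT) over the SQL endpoint.

```python
import sqlite3

# In-memory stand-in for a gold-layer star schema (hypothetical table names).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE fact_sales (customer_key INTEGER, amount REAL)")

# Batch 1: both inserts run in one transaction and commit together.
with con:
    con.execute("INSERT INTO dim_customer VALUES (1, 'Contoso')")
    con.execute("INSERT INTO fact_sales VALUES (1, 99.5)")

# Batch 2: the second statement fails (3 values for a 2-column table),
# so the whole transaction rolls back - including the dimension insert.
try:
    with con:
        con.execute("INSERT INTO dim_customer VALUES (2, 'Fabrikam')")
        con.execute("INSERT INTO fact_sales VALUES (2, 10.0, 'extra')")  # error
except sqlite3.Error:
    pass  # neither table was touched by batch 2

dims = con.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
facts = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(dims, facts)  # only the rows from the first, successful batch remain
```

Without this guarantee (as in engines that only offer single-table transactions), the failed batch would have left an orphaned dimension row - exactly the kind of half-updated star schema the Warehouse lets you avoid.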
The Warehouse and SQL Endpoint share the same underlying processing architecture regardless of the artifact, but the Data Warehouse is the only one that supports "ingestion separation" where ingestion jobs can run on dedicated nodes that are optimized for ETL and do not compete with other queries or applications for resources.
OneLake shortcuts can be used to create read-only replicas of tables in other workspaces to distribute load across multiple SQL engines, creating an isolation boundary.
As for security, data access has its own model depending on the engine you are accessing. Before going ahead and implementing shortcuts, you should know that when accessing them through Power BI Datasets or T-SQL, the calling user’s identity is not passed through to the shortcut target. The calling item owner’s identity is passed instead, delegating access to the calling user. Additionally, there is no automated process for cleaning up cascading shortcuts.
I feel that there is still a lot to be done to prevent access management and security from becoming total chaos, and I would love to be able to handle all data sharing, shortcuts and access control centrally in Purview - not just the ones mentioned in Administration, Security and Governance in Microsoft Fabric.
BTW: if you agree, you can give my suggestion a vote or a comment in the ideas section of the Fabric community site.
Product @ Microsoft
1y - Hi Jacob - great article. I want to highlight that KQL databases now support querying shortcuts in delta format. There are other factors, such as streaming ingestion latency and time series functionalities, that might affect the decision. Please check this updated guidance - https://learn.microsoft.com/en-us/fabric/get-started/decision-guide-warehouse-lakehouse