Deep-dive into Direct Lake

With Microsoft Fabric we have a new storage mode to choose from, Direct Lake – combining the best of both worlds between Import mode and DirectQuery.

Let me preface by thanking a few people who in different ways have helped me write this article and shaped my understanding of the topic:

Marc Lelijveld, Just Blindbæk, Lars Andersen, Benni De Jagere, Bogdan Crivat

Preface - Storage modes

As a little bit of background knowledge let’s quickly recap the previous storage modes in Power BI:

Import: Copies the data from the source, either on a schedule, on a trigger (API) or on a manual refresh. Upon refresh it loads the entire dataset into memory in the VertiPaq format, enabling blazing-fast query performance – but on “stale” data.

DirectQuery: Contrary to Import mode, DirectQuery translates and runs queries directly on the source in the source's native query language – ensuring real-time data fetching. However, since the data has to travel from the source and is not necessarily memory-cached, query performance is slower.

Dual: A storage mode that acts as either DirectQuery or Import mode (i.e. cached or not cached) depending on the context of the query.

Often datasets are either Import or DirectQuery, but composite models opened up the possibility of having multiple storage modes in one dataset, allowing for more granular tuning. In short: Import mode for fast querying, DirectQuery for real-time updates, and composite models with Dual mode if you want a compromise between the two.

Introducing Direct Lake

Direct Lake makes the choice between Import and DirectQuery a little easier by offering the performance of Import mode and the recency of DirectQuery. To make this possible, Direct Lake connects directly to Delta tables in OneLake, loading their data straight from the Parquet files into the dataset's in-memory cache.

Direct Lake: A storage mode for Lakehouse tables that combines an Import-like cache with a DirectQuery-like approach to data fetching on a column-by-column basis – ultimately offering blazing-fast queries on real-time data, cached into memory on a need-to-use basis.

Direct Lake (Storage mode)

Requirements

First of all, although Direct Lake exists as a storage mode for Power BI datasets, it is a Fabric feature and requires Microsoft Fabric. Specifically, it needs the following:

  • A Fabric capacity
  • Available Fabric Consumption Units
  • A Fabric storage artifact (currently only Lakehouse)
  • Managed Delta tables

Fabric capacity: Since Direct Lake relies on both Fabric artifacts and Fabric consumption, it requires a Fabric capacity – neither Premium Per User (PPU), Pro, nor the free tier will suffice. Furthermore, the Fabric capacity must be assigned to the workspace the dataset lives in.

Available Fabric Consumption Units: A Direct Lake dataset consumes Fabric capacity during refresh, caching and querying, so it may suffer performance degradation if it runs on a capacity that is close to its limits.

Fabric storage artifact: Direct Lake is built for Delta tables in OneLake and can only connect to managed Delta tables in a Fabric Lakehouse. Other sources are not directly compatible with Direct Lake, although they can be used indirectly by first being loaded into a Lakehouse. The whole point of Direct Lake is to prepare the modelled data (the so-called “gold layer”) inside the Lakehouse instead of inside the dataset itself.

Supported data sources for Direct Lake

Managed Delta table: The Lakehouse is divided into two top-level folders: Files and Tables. It is possible to store CSV files, Parquet files, and even Delta folders inside the Files folder – but none of them are supported by Direct Lake. Only the managed Delta tables listed under “Tables” with the Delta table icon can be included in a Direct Lake dataset.

Supported data file formats for Direct Lake
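To make the distinction concrete, here is a minimal PySpark sketch of the difference, as it could look in a Fabric notebook with a default Lakehouse attached. The file path and table name are hypothetical placeholders.

# Read some raw data from the Lakehouse Files area (hypothetical path)
df = spark.read.option("header", "true").csv("Files/raw/sales.csv")

# Saving as a managed Delta table puts it under "Tables" - this is what Direct Lake can use
df.write.format("delta").mode("overwrite").saveAsTable("sales")

# Writing plain Parquet (or CSV) into the Files area is perfectly valid for staging,
# but these files are NOT picked up by Direct Lake
df.write.mode("overwrite").parquet("Files/staging/sales_parquet")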


Inside the Direct Lake Engine

On the surface Direct Lake seems like a simple but elegant solution to a long-standing challenge in the industry: query speed vs. data freshness. In many ways it is elegant and simple in its overall concept, but it only works so well because of a number of underlying design choices in the Microsoft Fabric platform.

In this segment we will explore some of the components that make up the Direct Lake engine. We only know what is available to us from the documentation and the system logs, and cannot dig into the inner workings of exactly how Direct Lake does what it does – but we can talk about what it does and speculate on why.

File formats

A large part of the reason why Direct Lake works at all stems from the choice to standardize on Parquet and Delta across all Fabric workloads. Parquet is an open, industry-standard column-store file format optimized for bulk operations on large data volumes – but being “just” a file, it doesn't come with the robustness and integrity guarantees of an actual database.

Delta fixes that by providing a managed metadata structure (a transaction log) around the Parquet files, turning them from a group of separate files into a single table-like entity – enabling ACID compliance, delete, update and merge capabilities, time travel and more.

Standardized file formats in Fabric
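As a small, hypothetical illustration of what the Delta layer adds on top of plain Parquet (PySpark in a Fabric notebook, reusing the sales table from the earlier sketch):

# An UPDATE against the managed table - Delta runs this as an ACID transaction over the Parquet files
spark.sql("UPDATE sales SET amount = amount * 1.25 WHERE region = 'EMEA'")

# The Delta transaction log records every version of the table
spark.sql("DESCRIBE HISTORY sales").show()

# Time travel: query the table as it looked at an earlier version
spark.sql("SELECT * FROM sales VERSION AS OF 0").show()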

V-ordering

V-Order is a write-time optimization that applies special sorting, row group distribution, dictionary encoding and compression to Parquet files, improving read times by up to 50% – which directly influences the query performance of Direct Lake too. To make sure V-Order is applied, it is important that the last-step transformations or data movements are done with a Fabric compute engine such as Notebooks, Data pipelines or Dataflows Gen2.

Fabric compute artifacts supporting V-order optimization
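If the last write step runs in your own notebook, you can check and set V-Order for the Spark session before writing the gold-layer tables. A minimal sketch is below; the configuration key and writer option names reflect the Fabric documentation as I read it and may change between runtime versions, so treat them as assumptions.

# Check whether V-Order is currently enabled for the Spark session (key name may vary by runtime)
print(spark.conf.get("spark.sql.parquet.vorder.enabled", "not set"))

# Enable it explicitly for this session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# It can also be requested per write via a writer option (assumed option name)
df.write.format("delta").option("parquet.vorder.enabled", "true").mode("overwrite").saveAsTable("sales")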

A personal experiment showed a file size reduction for V-Order-enabled Delta tables of up to 86% compared to CSV files, and 64% compared to regular Parquet files.

File sizes across different file formats

Fallback

There are limitations that may prevent Direct Lake from using its standard caching operations. To avoid breaking functionality, Direct Lake uses DirectQuery through the SQL endpoint of the storage artifact as a fallback in these cases.

Known fallback triggers:

  • Large table sizes (1 billion rows on an F64 in my test at the time of writing)
  • Special data types
  • Composite models
  • Advanced security
  • Using views

Direct Lake with Fall-back (Storage mode)


Caching, Temperature & Eviction

As hinted earlier, Direct Lake achieves its Import-level performance by, well, importing (caching) the data. However, there are a few improvements in how it imports that give it the flexibility to handle the import on the fly.

First of all, it caches data at a more granular level, targeting individual columns instead of whole datasets. Secondly, columns are targeted on a query-by-query basis, meaning only the data that is actually needed is pulled into the in-memory cache. Finally, because the data comes from a source in close proximity, in a standardized, known and optimized column-store format, the caching process is much more straightforward – and thus faster – than regular Import mode, which has to handle hundreds of different data sources in all shapes and flavors.

The engine also logs some metadata to keep track of which columns are actively used and which aren't. This is referred to as the column's temperature in the metadata.

Querying a Direct Lake dataset for the first time

The next time the dataset is queried, the data doesn't have to be fetched from OneLake but can be read directly from the in-memory cache. In doing so, the second query also “reheats” the columns.

Querying cached columns in Direct Lake

If we wish to check the current temperature, cache size or the last time a column was accessed, we can query the DMV $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS – using DAX Studio or a similar tool – with the following query:

SELECT
	MEASURE_GROUP_NAME
	,ATTRIBUTE_NAME
	,DICTIONARY_SIZE
	,DICTIONARY_TEMPERATURE
	,DICTIONARY_LAST_ACCESSED
FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS
ORDER BY DICTIONARY_TEMPERATURE DESC

I wonder whether the temperatures are used, or will be used, for column-level eviction once the Fabric capacity starts reaching its limits – but I haven't been able to trigger this behavior in my experiments yet.

Framing (refreshing)

The whole premise of Direct Lake is having direct access to the data – in its newest form. However, refreshing is still an applicable concept even in Direct Lake. In the service it is still called a refresh, even though it is technically only a metadata refresh. When a Direct Lake dataset is refreshed it updates the metadata definitions – also called framing – redefining which data is included in the dataset and which is not. So a refresh doesn't actually touch any data and runs in seconds!

This means that by refreshing the dataset we get an updated view of our most recent data with no delay. It still has to be triggered, though – but don't worry, with Direct Lake we can enable automatic refreshes of the dataset with the setting “Keep your Direct Lake data up to date”.

Dataset configuration: Keep your Direct Lake data up to date

Whenever we refresh a Direct Lake dataset, two things happen:

  1. The metadata definitions are updated (aka framing)
  2. The in-memory cache is removed (aka flushed or evicted)

There are two key considerations around refreshing Direct Lake datasets. The first is that by refreshing too frequently, or asynchronously from our data pipelines, we risk partial data updates and may face data integrity problems – for example an updated fact table referencing members that don't exist yet in our dimension tables. The second is that whenever we refresh the dataset, the first user to query it will experience slower query times – and if the dataset updates constantly, that may become a recipe for a bad user experience.

A new file isn’t automatically included in the dataset without a refresh of the dataset metadata (Framing)

We may consider tactics like triggering the refresh from our pipeline, rather than automatically, and possibly running queries on the dataset immediately afterwards to “reheat” the most-used columns – as sketched below.
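Here is a rough sketch of that tactic – using the standard Power BI REST API from Python, with placeholder ids, token and DAX, and with the usual caveat that a refresh runs asynchronously so you would normally poll its status before warming the cache:

import requests

WORKSPACE_ID = "<workspace-guid>"   # placeholder
DATASET_ID = "<dataset-guid>"       # placeholder
TOKEN = "<aad-access-token>"        # placeholder, e.g. acquired via MSAL

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}
base = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}"

# 1. Trigger the (metadata-only) refresh once the pipeline has finished loading the Delta tables
requests.post(f"{base}/refreshes", headers=headers, json={"notifyOption": "NoNotification"})

# ... poll GET {base}/refreshes here until the latest refresh reports "Completed" ...

# 2. Reheat the cache with a lightweight DAX query touching the most-used columns (hypothetical model)
warmup = {"queries": [{"query": "EVALUATE SUMMARIZECOLUMNS('sales'[region], \"Amount\", SUM('sales'[amount]))"}]}
requests.post(f"{base}/executeQueries", headers=headers, json=warmup)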

Data modelling recommendations

  • Data transformations are forced upstream – there is no Power Query in Direct Lake datasets
  • Data modelling brings new considerations around handling delta loads, dataset refreshes and historical data
  • Regular data modelling best practices still apply – including the star schema

Limitations:

  • Not supported in composite models. Technically, you can add a Direct Lake dataset as a source in a composite model, but doing so makes it fall back to DirectQuery through the SQL endpoint.
  • A write operation through the XMLA endpoint will prevent the dataset from being edited in the web modelling experience afterwards. It is a point of no return.

Conclusion

Import-level performance with DirectQuery-level freshness was the inaugural promise from Microsoft when Direct Lake was released. I for one was skeptical at first, but having understood the inner workings better, I realize it works by actually running Import mode – just in a way closer to DirectQuery – so in some sense it IS an Import mode dataset and a DirectQuery dataset at the same time.

Storage modes for OneLake

Performance-wise, my initial findings are that for regular datasets below 1 GB the performance is very similar to – almost indiscernible from – Import mode, and for larger datasets (up to 30 GB tested) it is undoubtedly faster than DirectQuery, at a size where Import mode becomes unfeasible (more than 5 hours of refresh time, and potentially too large for the capacity size).

All in all, Direct Lake does seem like the best tradeoff between DirectQuery and Import mode – but if you are updating a small dataset once per night (i.e. you don't care about recency/freshness), Import mode is still the go-to option in my book.
