Some Notes on Data Lake Zoning
Before we begin, please note that everything written below reflects my personal opinion and experience and does not represent the official position of my current or previous employers.
Dear friends and colleagues, in tens of presentations from our customers and partners I keep observing the “classic” Data Lake structure like the one below.
“Classic” (pattern-driven) Data Lake Structure
Although this is a popular and widely adopted concept, I have witnessed examples where companies and teams deliberately made their data lifecycle much more complex or, on the contrary, oversimplified it, just to match this pattern. I should admit that this 3-zone concept affects my perception as well, and I assume it shapes your view too.
The idea behind 3-layer zoning is brilliant and simple. We keep our raw data for certain use cases (including compliance reporting, data science, etc.), use staging as intermediary storage for data processing and enrichment, and finally store all the processed data in the Curated (or Gold) area.
Let us take a closer look at this model.
In reality our Data Lake structure looks something like the diagram below. Complexity tends to grow from Raw to Curated, which is expected: consumption assumes that hundreds or even thousands of real people will have access to the information using a vast variety of tools, while on the ingestion side only up to hundreds of actors actually write data, even in the largest systems.
This shows a certain imbalance in the 3-layer model. Time to go through the Data Lake zones one by one.
Raw / Bronze Zone
Raw is an extremely simple storage layer used mainly by technical users and technical accounts such as applications and compute routines. It is predominantly used for copy activities and data ingestion (even in more complex situations where legacy delivery methods are involved).
Typical Raw / Bronze Zone of Data Lake
Usually, we assume that the information stored in the Raw Zone may be consumed sooner or later. We tend to think that we may use this information for correcting mistakes, for auditing, and for compliance purposes.
Of course, to achieve that we have to know what data is there and where it is located, and we need proper tools and policies in place so we can query the data once we need it. Quite often we also claim that the Raw Zone will be used by our Data Science teams “in the very nearest future”.
This is why our data teams build a layer of data governance and additional security policies for the Raw Zone right from the start. Another implication is that we tend to refrain from putting proper data lifecycle policies on the Raw Zone; we usually neither archive nor delete this data.
The truth is that (at least in my projects) this data is rarely used for the declared purposes. We hardly use raw data for corrections, as that would require us to reperform the processing and in most cases we would get the same results. Querying raw data requires an in-depth understanding of it and of the underlying transformations, as well as significant optimizations or significant compute capacity. That in turn forces us to either build a full thread of development on the Raw Zone itself (which usually does not make a lot of sense) or rely on some out-of-the-box solution (which will not necessarily be capable of scanning the entire dataset). It also requires us to expose the Raw Zone to some privileged users who can bypass the Curated Zone.
So while usage of raw data is rare, the amount of effort we need to put in to fulfill these “theoretical” tasks is pretty significant.
Moreover, overcomplicating the Raw Zone affects all the ingestion and data onboarding processes and sometimes forces data engineers to avoid it completely.
Some ways to mitigate the issues described above:
1. Rely on logs when planning compliance. Use application and processing logs as well as Data Lineage tools for most compliance checks.
2. Implement a proper data lifecycle policy. At least archive the raw data so you do not duplicate storage costs.
3. Try not to overcomplicate governance of the Raw Zone – decrease the amount of compute used (if any), simplify policies, restrict access to automated processes only.
4. If absolutely necessary, create a simple automated data extraction process without any complex calculations / transformations. You can also think of creating a “replay” functionality which reprocesses raw data stored within archives (a sketch follows this list).
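Here is a minimal sketch of such a replay routine using the azure-storage-blob Python SDK. The account, container and path names are hypothetical, and the decision of what actually needs replaying is left out; blobs already moved to the Archive tier would have to be rehydrated first.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://myrawzone.blob.core.windows.net"  # hypothetical account

service = BlobServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
raw = service.get_container_client("raw")
landing = service.get_container_client("replay-landing")  # picked up by the normal ingestion pipeline

def replay(prefix: str) -> None:
    """Re-submit every raw blob under `prefix` for reprocessing."""
    for blob in raw.list_blobs(name_starts_with=prefix):
        source = raw.get_blob_client(blob.name)
        # Download/re-upload keeps the sketch simple; for large blobs a
        # server-side copy (start_copy_from_url) would be preferable.
        data = source.download_blob().readall()
        landing.get_blob_client(blob.name).upload_blob(data, overwrite=True)

replay("sales/2022/10/01/")  # hypothetical date-based prefix
```

The point of this shape is that the replay path feeds the existing ingestion pipeline, so no parallel development thread is needed on the Raw Zone itself.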
Staging / Silver Zone
The staging area in such deployments acts as temporary storage which we create and use for our processing environments. Usually it makes little sense to retain data in the staging area, so it is mainly governed by data processing tools and access is granted mostly to technical accounts. The staging area may be a little more complex due to the necessity of isolating data areas dedicated to certain compute. There can be special folders or sub-zones for logs, failed processing, etc.
Some representation of Silver / Staging Zone
As stated above, we need a staging area to allow our compute engines to store data temporarily during its processing. Basically, whenever you start Azure Data Factory or Azure Databricks you may create one on the fly.
A few things are usually included in this zone (a sketch follows the list):
- Staging Folders. Dedicated to particular compute clusters / applications / pipelines, etc.
- Log Folders. All the processing logs are stored there.
- Lookup Data Folders. Lookups and data enrichment quite often happen at this stage. Our processing compute performs lookups against databases and APIs or collects additional data from streams and queues. In this case we may need storage to keep intermediary results of this additional sub-processing.
- Failures Folder. Failures quite often occur while processing our data. We need a place to store all the records / files that failed to be processed.
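As an illustration, here is a minimal PySpark sketch of how a pipeline run might use this layout: it derives per-run staging, log and failure paths and routes records that fail validation into the Failures folder. The account name, path convention and validation rule are all hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

RUN_ID = "pipeline-products-run-42"                          # hypothetical run id
ROOT = "abfss://staging@mystagingacct.dfs.core.windows.net"  # hypothetical account
STAGING = f"{ROOT}/staging/{RUN_ID}"
LOGS = f"{ROOT}/logs/{RUN_ID}"
FAILURES = f"{ROOT}/failures/{RUN_ID}"

incoming = spark.read.json(f"{STAGING}/incoming/")

# A record counts as "failed" here when mandatory fields are missing --
# replace with your own validation logic.
failed = incoming.filter(F.col("product_id").isNull() | F.col("price").isNull())
valid = incoming.subtract(failed)

valid.write.mode("append").parquet(f"{STAGING}/validated/")
failed.withColumn("failed_at", F.current_timestamp()).write.mode("append").json(FAILURES)

# A tiny processing-log entry; real pipelines would emit far richer metadata.
spark.createDataFrame(
    [(RUN_ID, incoming.count(), failed.count())], ["run_id", "records_in", "records_failed"]
).write.mode("append").json(LOGS)
```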
Depending on the size and shape of the system, multiple compute clusters may require access to this zone. Although we provide access to this zone to automated processes only (or at least mainly), we still need to ensure proper encapsulation of data and processing. This can be achieved either with individual folder structures within the same Storage Account or with one Storage Account per compute cluster / process / domain.
I prefer the latter as it simplifies access management and enables automation in a better way. Imagine I create a processing cluster on the fly (for instance, a parametrized ADF pipeline from a template). In this process I can also create a Storage Account and wipe it out once it is no longer required. With static processes it allows us to tailor the lifecycle rules as well as achieve better granularity of governance.
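A hedged sketch of that idea with the azure-mgmt-storage Python SDK: create a short-lived, HNS-enabled staging account for one run and delete it afterwards. The subscription, resource group, location and account name are placeholders; in practice this would more likely live in an ADF/DevOps pipeline or an IaC template.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "rg-staging"           # hypothetical resource group

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def create_staging_account(name: str) -> None:
    """Provision a temporary ADLS Gen2 account for one processing run."""
    client.storage_accounts.begin_create(
        RESOURCE_GROUP,
        name,
        StorageAccountCreateParameters(
            sku=Sku(name="Standard_LRS"),
            kind="StorageV2",
            location="westeurope",
            is_hns_enabled=True,   # hierarchical namespace for Data Lake workloads
        ),
    ).result()

def drop_staging_account(name: str) -> None:
    """Wipe the account (and everything in it) once the run is finished."""
    client.storage_accounts.delete(RESOURCE_GROUP, name)

create_staging_account("stgrun42")   # hypothetical, must be globally unique
# ... run the pipeline against this account ...
drop_staging_account("stgrun42")
```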
As these accounts are, or at least should be, purely temporary, we do not need to store data there for a long time. Once data is processed, it should be deleted. This will help your organization save money on data storage.
I would also like to note that refraining from complex governance at this stage is crucial. A significant number of developers will try to bypass a Staging area controlled by your policies if those policies are too complex. As the data is only stored there temporarily, we do not really need complex governance there.
Some things to consider when designing a Staging Area:
- Consider it as temporary technical storage – do not overcomplicate it.
- Apply a proper lifecycle (especially if you still rely on Raw data) – wipe out processed data regularly (see the lifecycle-rule sketch after this list). Watch out for processing that takes longer, such as failure handling.
- Consider a simple distribution of storage accounts. Ideally you can create them on the fly, especially if you have a dynamic processing environment.
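A sketch of such a lifecycle rule, expressed as the JSON-shaped policy document Azure Storage expects and written to a file. The prefix and the seven-day window are examples only; the policy could then be applied, for instance, with `az storage account management-policy create -g <rg> --account-name <account> --policy @staging_lifecycle.json`.

```python
import json

# Lifecycle rule: delete staging blobs 7 days after their last modification.
staging_lifecycle = {
    "rules": [
        {
            "name": "purge-processed-staging",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["staging/"],   # hypothetical container/prefix
                },
                "actions": {
                    "baseBlob": {
                        "delete": {"daysAfterModificationGreaterThan": 7}
                    }
                },
            },
        }
    ]
}

with open("staging_lifecycle.json", "w") as f:
    json.dump(staging_lifecycle, f, indent=2)
```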
Curated / Gold Zone
Simply put, this is where data lives.
The Curated area is usually the most complex of the three. It inevitably combines processing (post-processing activities, creation of derivatives, and so on) with a complex security and access model and optimizations made for specific use cases.
What can we do about that? Think a little bit differently. We do not necessarily need to assume that Curated Zone = Storage Account. We can adjust things depending on the situation we are in and what we need to achieve.
Curated / Golden Zone
I think we have all observed “gold” areas which are far from a “gold” state, and there are many reasons for that.
First and foremost, the data products we create and use in the “gold” zone may be complex and derived from each other. When all these “golden” products reside in one zone, it overcomplicates the matter.
Second, if we use the “golden” zone for querying and processing, we will inevitably face the need to create additional folders and structures (failed records, logs, etc.). If the data we operate with is under some sort of classification, these extra data points have to live in the same zone.
Third, the data volume tends to grow over time and not all of this data is equally useful in all query / processing patterns. There may also be specific restrictions for accessing archival data or data within a certain timeframe. In some cases the data is time series in nature or is structured around a very well-defined process (e.g., transactional data). Even if you use Delta Lake, the amount of data piling up in your account affects your performance sooner or later.
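As an illustration of keeping a growing, time-oriented curated table in shape: partition by a business date on write and compact small files periodically. This is a minimal PySpark sketch assuming Delta Lake 2.x; the paths, column names and retention window are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

CURATED = "abfss://curated@mydatalake.dfs.core.windows.net/transactions"   # hypothetical
incoming = spark.read.parquet(
    "abfss://staging@mystagingacct.dfs.core.windows.net/staging/pipeline-products-run-42/validated/"
)

# Partition by business date so time-bounded queries only touch the relevant folders.
(incoming
    .withColumn("event_date", F.to_date("event_timestamp"))
    .write.format("delta")
    .partitionBy("event_date")
    .mode("append")
    .save(CURATED))

# Periodic housekeeping: compact small files and drop snapshots nobody needs anymore.
table = DeltaTable.forPath(spark, CURATED)
table.optimize().executeCompaction()     # available since Delta Lake 2.0
table.vacuum(retentionHours=7 * 24)      # keep one week of time travel
```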
Sometimes tokenization or data prefiltering has to be applied. Even though some of the prefiltering can happen within the Staging area, the result will be stored in the Curated zone afterwards.
One last but important thing. Storage Accounts are extremely performant, but even they have limits. Depending on what you try to do simultaneously in a storage account (heavy writes combined with direct reads from Power BI, combined with data loads into Synapse, etc.), the performance of a particular Storage Account may not be ideal – especially when your entire “Golden Zone” sits in that one account.
We should not forget that quite often the messy and vast “golden” storage account is the place where all the security access is granted. This access should also be audited, checked and sooner or later revoked.
And now just imagine that you need to fail all of the above over to another region, or migrate it somewhere else altogether. Lots of fun, I guess.
Basically, effort-wise and complexity-wise, our 3 layers will look something like this.
One thing which always makes me smile in this model is the arrows going from one layer to another, as if some sort of magic moves, merges, validates, cleans up and aggregates the data. Compute and its complexity are usually totally ignored at this level of abstraction.
In addition, as you can see, most of the effort is linked to the Gold / Curated layer, which is quite logical. We need our data to be consumable, and this might not be that easy due to the reasons described above. In some scenarios, though, we can lean on much simpler models. Let us take a closer look at a few of them.
N.B. In lots of cases the “classic” 3-layer model still makes sense and can be used. Please consider the points raised in the first part of this article where I have described some of the challenges and ideas per zone.
Some Notes on Internal Structure of Zones within Data Lake
Before we begin with the scenarios, I would like to clarify one thing. In most of the scenarios below I use a simple box to identify “a Zone” within the Data Lake.
I would like to elaborate on what this Zone symbol encapsulates and how to map it to actual data storage on Azure.
As you may have noticed from the previous chapter, Zones have quite a complex internal structure. This is usually dictated by the consumption pattern of a particular zone and by whether it is used as part of data processing.
In the simplest case a Zone is just a Storage Account with some Storage Containers within it. However, depending on the complexity of the use case, a Zone can consist of multiple Storage Accounts with tens or even hundreds of containers and folders within each account. Again, it all depends on the use case and the complexity of the data you are serving and processing.
Usually, the structure of the zone is the following.
Structure of the Zone
Each Zone consists of one or more storage accounts. The Zone is a virtual container that defines, for all the Storage Accounts within it:
- Subscription and Resource Group where the Zone is located.
- Policies (Azure Policies, Security Baseline, Access Policies).
- Governance (Compliance, Rules and Governance Policies).
- Lifecycle applied to data in the Zone.
A Storage Account is the top-level item in the physical storage layout. It is ultimately responsible for applying, and being compliant with, the requirements defined by the Zone. Cost is also assigned at the Storage Account level. The account defines the performance tier, the type of the account, replication rules, disaster recovery and high-availability parameters, as well as some other important attributes.
Each storage account has its own performance cap as well as limits on the amount of data stored (although these limits are pretty high).
Containers are one of the available options within Storage Accounts; file shares, queues and tables are also supported, but Containers are the most frequently used. Containers enable logical grouping of data within an account. A Container can also maintain its own encryption scope.
On the next level we have so-called Logical Folders within a Storage Container. Let us say you have a Storage Container called “Products”. In some cases this will be enough, and you can just plan the internal structure of the technical folders. But in other situations you may need an additional level of logical abstraction, such as “Product Categories”, residing within the Storage Account. Maybe you even have to maintain some deeper levels such as “Product Types”, etc.
In a way, purpose-wise, Logical Folders are similar to Storage Containers, but with some important differences when it comes to access management. The point is that Containers can be used to assign RBAC privileges, whereas for Folders only ACLs are applicable – see the diagram below.
RBAC or ACL?
You may ask a good question: when should you prefer Folders over Containers?
The answer is pretty straightforward: it depends on the requirements for access management.
It is very easy to overcomplicate access management policies, but in that case chances are good that nobody will use your Data Lake structures and some shadow IT will flourish. However, in most organizations the principle of least privilege is set in stone (which is right, by the way).
So, if access can be granted to an entire Storage Container for a large group of users, I would prefer that approach and not overcomplicate access with ACLs. If that is not possible, I would move to logical folders instead.
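For the folder case, here is a hedged sketch of granting read access with ACLs via the azure-storage-file-datalake SDK; the account, container, folder and group object id are placeholders. The container-level alternative would be an RBAC role assignment (e.g. Storage Blob Data Reader) made in the portal, the CLI or via azure-mgmt-authorization.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://mydatalake.dfs.core.windows.net", credential=DefaultAzureCredential()
)
folder = service.get_file_system_client("products").get_directory_client(
    "product-categories/bikes"   # hypothetical logical folder
)

GROUP_OBJECT_ID = "<aad-group-object-id>"   # placeholder

# Grant read+execute on the folder and everything below it; the "default" entry
# makes sure items created later inherit the same permissions.
folder.update_access_control_recursive(
    acl=f"group:{GROUP_OBJECT_ID}:r-x,default:group:{GROUP_OBJECT_ID}:r-x"
)
```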
Last but not least is the level of Technical Folders, which can include:
- Dataset
- Incremental Deliveries (like CDC)
- Corrections
- Staging / Temporary Storage
- Processing Failures
- Logs
- Sample Dataset
- Filtered / Tokenized Dataset
The actual layout of the folders depends on your needs in each particular case. It can also be defined per Zone and enforced by deployment templates and policies.
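A minimal sketch of enforcing such a skeleton per dataset with the Data Lake SDK; the folder names mirror the list above, while the account, container and dataset root are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

TECHNICAL_FOLDERS = [
    "dataset",
    "incremental",   # incremental deliveries (CDC)
    "corrections",
    "staging",       # temporary storage
    "failures",      # processing failures
    "logs",
    "sample",
    "filtered",      # filtered / tokenized dataset
]

service = DataLakeServiceClient(
    "https://mydatalake.dfs.core.windows.net", credential=DefaultAzureCredential()
)
fs = service.get_file_system_client("products")

def scaffold(dataset_root: str) -> None:
    """Create the standard technical folders under a dataset root."""
    for name in TECHNICAL_FOLDERS:
        fs.create_directory(f"{dataset_root}/{name}")

scaffold("product-categories/bikes")   # hypothetical dataset
```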
Important note: in some cases, depending on responsibilities and access management for users and tools, some of these folders may be moved into separate Storage Accounts – for example, Logs residing within a dedicated security Storage Account.
Important Note on Storage Account Performance
Before we jump to the specific use cases, I would like to say a couple of words on the performance of Azure Storage Accounts and how it relates to the other options. Below is a short version of the https://albero.cloud decision tree for Storage. It describes the main differences between the storage SKUs as well as some of the differences between the performance tiers.
Some performance aspects and limits of various Storage Accounts
Please note that different SKUs (Premium and Standard) support different redundancy options. The same applies to access tiers (Hot or Cool). Different services also have different limits on blob size, the maximum number of blobs in a Storage Account, the maximum throughput of an account, as well as limits on individual blobs.
The important highlight is that we cannot always blindly trust these figures. The truth is that the performance of an individual Storage Account depends on multiple factors, so please use these numbers for reference only. They can help you spread workloads across different accounts and plan the layout of your Zones properly, but they do not exempt you from testing and planning optimizations.
Remember – always test, constantly monitor and be ready to troubleshoot and optimize!
Now we are finally ready to explore some specific Data Lake design patterns.
Various Cases of Non-Standard Data Lake Layout
Scenario 1: Simplest Case
In the simplest case your workload can fit into one single zone, especially when transformations are not required – for instance, when you read the data in the same form as it was ingested, or when you just use the Data Lake as temporary storage for loading data into other systems such as Data Warehouses.
Usually in such situations we mainly provide access to technical accounts and only a few users can access data in the Data Lake. Also, in this scenario we do not add any extensive Data Governance tools and practices, as the complexity may affect our time to market and ability to execute.
Simple One Zone Data Lake
In such scenarios we can usually rely on proper monitoring and log analysis, on properly organized access management (RBAC, or ACLs if lower-level granularity of access management is required), and on networking security with private endpoints.
Lifecycle in this scenario is supported by lifecycle rules at the level of the Azure Storage Account.
Scenario 2: Time Focused
Another case is when our Data Lake is fully time-focused. In time-centric systems it is typical to have data distributed and consumed based on the time when a specific record or data point was created.
Usually, datapoints are the most useful within a short period of time, such as an hour or a day. For such data the SLA for availability and processing latency is usually extremely tight: we have to process this data and make it available very fast.
After this period data may still be required for immediate consumption with some low-latency SLA and high availability in mind.
Archival or historical data layers may differ depending on how fast data should be retrieved, in which form, and so on.
Quite often we see real-time ingestion as part of the data ingestion process, not only periodic or batch ingest. The SLA for real-time processing is considerably tighter than for batch, so the two should be divided into different lifecycles with their own ways of processing, storing and serving data.
Time-Series Data Lake
The first lifecycle simply processes incoming streaming data as it arrives. This may be achieved using Spark Structured Streaming, Flink or similar tools (Azure Stream Analytics, for instance), which then offload data into the near real-time serving zone at a very fast pace. This near real-time serving zone is defined by the way data is served: it can be a data lake storing and serving micro-batches of the events which have just happened, it can equally be a direct push to Power BI, or serving through a database like Azure Cosmos DB (the latter, of course, is a bit outside of Data Lake scope).
Thus, the main purpose of the Near Real-Time Zone is to serve the data as soon as it is processed – ideally within a few minutes at most.
Real-time data engines can also immediately package events and offload them in micro-batches into the Operational Zone, where this data can be processed together with incoming batches and lookups.
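A hedged PySpark Structured Streaming sketch of that first lifecycle: micro-batches are flushed to a near real-time serving path roughly every minute. A `rate` source is used so the sketch is self-contained; in practice the source would be Event Hubs, Kafka or similar, and all paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

NRT_PATH = "abfss://near-real-time@mydatalake.dfs.core.windows.net/events"        # hypothetical
CHECKPOINT = "abfss://staging@mystagingacct.dfs.core.windows.net/checkpoints/nrt"  # hypothetical

# Stand-in streaming source; replace with your Event Hubs / Kafka reader.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", CHECKPOINT)   # required for exactly-once restarts
         .trigger(processingTime="1 minute")         # micro-batch cadence ~= serving latency
         .start(NRT_PATH))

query.awaitTermination()
```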
The second lifecycle refers to periodic or batch ingestion. This data arrives at the Periodic Ingestion layer and is stored in the Operational Zone after processing. The main purpose of the Operational Store in this case is to serve as the main source of truth for data with a certain period of validity.
This period of validity is purely defined by the use case and can be a day, a month, or a year, depending on the business requirements.
The main point is that once the end of validity is reached, the data is offloaded into the Historical Zone using some post-processing techniques. You may ask – why can’t we just use lifecycle policies to achieve the same behavior?
You can. However, in some cases you will need special historization post-processing tools and separate accounts. For instance:
1. Your consumption pattern for historical data is different from the consumption pattern for operational data. For example, you need to group files in a meaningful way to achieve an appropriate processing SLA for historical data. Small files are quite good (performance-wise) in the Operational Store but may significantly affect query times in Historical Storage, where large files may be more optimal – especially if you are planning to perform retrospective analysis using parallel processing engines.
2. You have requirements for aggregating historical data in some meaningful way – for instance pre-creating some reporting for faster access, or creating Search Indexes on historical data, or some other smart way of indexing and preparing it for purpose.
3. You do not want to overcomplicate access management for historical data, and different policies apply to operational and historical data (larger volume, more information or anything else).
In this sense, lifecycle rules as a feature of the Azure Storage Account can be considered the simplest version of historization compute; a more capable variant is a dedicated historization job, sketched below.
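A hedged sketch of such a historization job in PySpark: rows past their validity window are read from the Operational Zone, compacted into fewer, larger files, appended to the Historical Zone and then removed from the operational table. Paths, column names and the 30-day window are hypothetical, and Delta Lake is assumed for both zones.

```python
import datetime as dt

from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

OPERATIONAL = "abfss://operational@mydatalake.dfs.core.windows.net/measurements"  # hypothetical
HISTORICAL = "abfss://historical@mydatalake.dfs.core.windows.net/measurements"    # hypothetical

cutoff = (dt.date.today() - dt.timedelta(days=30)).isoformat()   # 30-day validity window

expired = (spark.read.format("delta").load(OPERATIONAL)
           .where(F.col("event_date") < F.lit(cutoff)))

# Compact the many small operational files into a handful of large ones,
# which suits scan-heavy retrospective queries better.
(expired.coalesce(8)
        .write.format("delta")
        .mode("append")
        .partitionBy("event_date")
        .save(HISTORICAL))

# Finally drop the offloaded rows from the Operational Zone.
DeltaTable.forPath(spark, OPERATIONAL).delete(F.col("event_date") < F.lit(cutoff))
```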
One more important note is that Staging Zone in this setup is a Temporary Storage used by compute engines to store results of processing.
Scenario 3: Process Focused
Process focus is a pretty standard approach in numerous industries such as logistics, retail and others, where the process is very well defined and not subject to frequent change, but at the same time produces quite a lot of data per processing step. For example, this can be a large-scale design or production cycle, a container ship delivery, or anything alike.
In such a situation, each step may produce large-scale files which won’t be a good fit for databases. So, we can establish a Process Focused Data Lake.
Process-Focused Data Lake
In a way such a layout reminds us of the three-layer structure, but it can also be compared with data processing in message queues. Indeed, we create structures assigned to the process stages, used by the Processing Compute to store the results of each stage. The difference between Temporary Storage and a Processing Stage is that in the Processing Stage Zones we actually store the final result of each step (which is meaningful for the business), while Temp Storage is just a technical account used by a certain compute cluster.
There is also a difference between the Process Focused and the Time Focused structure of Data Lakes. In the Process Focused Data Lake you do not necessarily need separate streams for real-time and periodic processing, as both are used in the same step of the process. Thus, the latency of processing in the worst-case scenario is limited by the speed of arrival and processing of the slowest data set / data point.
Our main process starts with reading data from the ingestion layer. Once each stage is completed, the results are stored in a separate Processing Stage Zone. This allows you to separate compute and access, especially for confidential data, as well as scale out in case your process produces extreme amounts of data. Once processing is done, the results are stored in the Completed Processing zone and may later be historized using Azure Storage Account lifecycle policies.
Theoretically, all the processing should be done by applications and the technical accounts assigned to them. Thus, users should (in an ideal unicorn-world scenario) not have access to the stages except via those applications. There is, however, often a request for some intermediary information on processing stages, so we can use a technical database or another means of informing the users about the current processing stage.
Quite often we also need separate structures for storing datasets or datapoints which failed to process. Usually, they require special treatment and special skillsets, so a separate zone in our Data Lake can be used for this purpose, together with all the power of event-driven (reactive) processing enabled by Azure Storage Accounts.
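A hedged sketch of that event-driven handling, using the Azure Functions Python v2 programming model: a function fires whenever a blob lands in the failed-processing container and flags it for remediation. The container name and connection setting are placeholders.

```python
import logging

import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(
    arg_name="failed_blob",
    path="failed-processing/{name}",   # hypothetical container for failed datasets
    connection="ProcessZoneStorage",   # app setting holding the storage connection
)
def on_failed_dataset(failed_blob: func.InputStream) -> None:
    # React as soon as a failed dataset lands: log it and hand it over to whoever
    # (or whatever) owns remediation -- a queue, a ticket, or a replay pipeline.
    logging.warning(
        "Dataset failed processing: %s (%s bytes)", failed_blob.name, failed_blob.length
    )
```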
Reconsidering the 3-Layer Model for Data Lake Zoning
In the general case, with all of the above in mind, I would propose to reconsider the 3-layer model in the following way:
1. Split the Data Lake into two main scopes: with and without access by human actors.
2. Consider two main Zones, namely the Ingestion and Consumption Zones.
3. Stop thinking of the Ingestion Zone as something accepting only batch writes – in most organizations this has not been true for quite a while.
4. Stop thinking of the Consumption Zone as something serving only batch data – in many organizations this is rapidly changing right now.
5. Transform our perception of the Staging Zone into Temporary Storage used (even if shared) by the automated data processing tools.
6. Review access of our data processing tools to enrichment sources as part of the Data Lake access management model (even if these sources are well outside of the technical or implementation scope).
7. Apply the minimum required Data Governance and Compliance policies to the Ingestion Zone and Temporary Storage – overcomplicating them makes our data engineers unhappy and our delivery processes slow and inefficient.
8. Approach the design of Consumption Zones in a pragmatic, goal-focused way, with security, performance and business continuity in mind.
9. Apply rigorous Data Governance and Access Policies on the Consumption Zone. Implement / adopt tools which can not only scan for and find discrepancies in the policies, but also notify you and help you act.
10. When planning the access model, always start with the revoke-access procedure in mind (a sketch follows below). This may be unexpectedly complex and may prevent you from enforcing security on a crucial part of your data.
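To illustrate point 10, here is a hedged sketch of a revoke routine for folder-level ACLs, again with placeholders for the account, folder and group object id. Container- or account-level RBAC role assignments would have to be removed separately (e.g. with `az role assignment delete`), which is easy to overlook.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://mydatalake.dfs.core.windows.net", credential=DefaultAzureCredential()
)
folder = service.get_file_system_client("products").get_directory_client(
    "product-categories/bikes"   # hypothetical logical folder
)

GROUP_OBJECT_ID = "<aad-group-object-id>"   # placeholder

# Strip both the direct and the default ACL entries for the group, on the folder
# and on everything that has been created under it so far.
folder.remove_access_control_recursive(
    acl=f"group:{GROUP_OBJECT_ID},default:group:{GROUP_OBJECT_ID}"
)
```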
Thank you for reading this far – I hope you find this small article helpful.
Good luck with your Data Projects!