Data Lake - Part 2: Observability, portability, security, and governance guardrails to support analytics workloads.
Observability, portability, security, and governance are crucial to the success of any data lake project. Balancing these aspects without over-engineering is key, and this article discusses how to get that balance right. An experienced data architect can help assess the requirements and the maturity level of the IT team, then define a solution and a road map that delivers business value consistently.
Note: This is Part 2 of the article on the Data Lake; please refer to Part 1 for better context.
Another key aspect that matters for data lake success is metadata and data lineage management. However, metadata management of all IT assets has to be driven at the organisational level, so it is kept out of scope for this discussion.
We can learn from solution patterns in other mature industry domains.
ABC analysis: an inventory categorisation method that groups items into three categories: A, B, and C.
In the case of the aviation industry, let us focus on fuel-related information.
The pilot in the cockpit needs real-time data about the fuel status of each tank, the fuel flow rate, and low-level warnings at sub-second granularity.
The ATC only needs to know whether an aircraft's fuel situation is at minimum or emergency level, but for all the aircraft under its control, and at a granularity of a few seconds, if not minutes.
The maintenance team needs fuel-related flight data at far finer granularity for each component (valves, pumps, etc.), including temperature, pressure, and vibration, but with an acceptable latency of days. Here the telemetry is captured at the most granular level but analysed and reported only as needed. Similarly, the data lake should collect data at this granularity, but process it and move it to the Stage or Refined area only when a requirement calls for it.
In other words, the granularity at which the various telemetry and other digital exhaust are metered, from a data operations perspective, has to balance cost against value.
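To make the analogy concrete, below is a minimal sketch (Python with pandas; the column names, sensor values, and time windows are illustrative assumptions, not from any specific system) of keeping per-second telemetry in the raw zone and aggregating it to an hourly grain only when it is promoted to the refined zone.

```python
import pandas as pd

def promote_to_refined(raw: pd.DataFrame, grain: str = "1h") -> pd.DataFrame:
    """Aggregate per-second raw telemetry to the grain the consumer actually needs.

    `raw` is assumed to have a 'timestamp' column plus numeric sensor columns
    such as 'fuel_flow_rate' and 'tank_level' (illustrative names).
    """
    return (
        raw.set_index("timestamp")
           .resample(grain)                      # e.g. hourly grain for the refined zone
           .agg({"fuel_flow_rate": "mean",       # average flow over the window
                 "tank_level": "min"})           # worst-case (lowest) level in the window
           .reset_index()
    )

# Raw zone keeps full per-second granularity; refined zone only gets what is required.
raw = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=7200, freq="s"),
    "fuel_flow_rate": 2.5,
    "tank_level": 98.0,
})
refined = promote_to_refined(raw, grain="1h")    # two hourly rows instead of 7,200 raw rows
```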
The data pipeline should be able to change the metrics collection interval from per-second to hourly. This enables the consumer to zoom into much finer detail and swiftly pinpoint issues, and to avoid noise when things are running smoothly.
Initially, the data pipeline may need to be observed every second to stabilise and optimise it. Once the pipeline has matured (i.e. is well within control limits), a coarser granularity of metric collection, say every 15 minutes or hourly, is sufficient.
The granularity of metrics collection can be changed dynamically as needed: on a failure retry, observability can be switched to per-second mode, and once the job succeeds, it can be turned back to hourly mode for the next cycle.
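As a rough sketch of this toggling (Python; `job` and `emit_metrics` are hypothetical callables standing in for whatever the orchestrator provides, so this illustrates the idea rather than any specific tool's API):

```python
PER_SECOND = 1          # fine-grained mode while stabilising, or when retrying after a failure
HOURLY = 3600           # coarse mode once the pipeline is well within control limits

def collection_interval(last_run_failed: bool, pipeline_mature: bool) -> int:
    """Pick the metrics collection interval (seconds) for the next pipeline cycle."""
    if last_run_failed or not pipeline_mature:
        return PER_SECOND       # zoom in to pinpoint issues quickly
    return HOURLY               # avoid noise when things are running smoothly

def run_cycle(job, emit_metrics, last_run_failed: bool, pipeline_mature: bool = True) -> bool:
    """Run one pipeline cycle, emitting metrics at the granularity chosen from the last outcome."""
    interval = collection_interval(last_run_failed, pipeline_mature)
    emit_metrics(interval)          # hypothetical hook: sample/report every `interval` seconds
    try:
        job()                       # hypothetical pipeline job
        return False                # success: next cycle drops back to hourly mode
    except Exception:
        return True                 # failure: next cycle observes per second

# Example: a failing run flips the next cycle to per-second observability.
failed = run_cycle(job=lambda: 1 / 0, emit_metrics=print, last_run_failed=False)
failed = run_cycle(job=lambda: None, emit_metrics=print, last_run_failed=failed)
```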
Observability stakeholders
Data lake observability has various stakeholders, and aligning on an operating rhythm with clear ownership and touchpoints between teams is critical to success.
Key factors to consider:
Great care must be taken when defining the metrics, the reporting/dashboards, and the action workflows. Here are a few factors to consider.
Design Thinking workshops.
Design Thinking (DT) workshops are an effective tool for gathering consolidated observability requirements across the various stakeholders. DT provides a curated set of tools and techniques that can be used to align all stakeholders and define a collective view of the observability requirements.
Key factors for design thinking workshop success are:
- Leadership team's commitment and clarity of the business objective, i.e. what the project is and why it is being done.
- Organisational culture and its ability to adapt to change and collaborate.
- Expertise of the DT workshop facilitator.
In other words, a design thinking workshop by itself may not ensure project success; it helps in "failing fast" and improves the odds of success. Say a project with a 50% chance of success can be improved to a 60-70% chance with design thinking workshops. If, after a couple of iterations of design thinking sessions, the observability requirements (metrics, reports, and corrective action workflows) are still not precisely defined, it is better to go back and re-assert the data lake requirement, its use case, and its value, and then return later.
A sample consolidated high-level data lake observability requirements list may look something like this.
Domain-specific metrics, like clicks per hour on a web portal or IT assets per tenant per hour for ITSM data, are good data points that help correlate business usage metrics with data lake metrics.
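Once both feeds land on a common time grain, the correlation itself is a straightforward join. The sketch below is purely illustrative (Python with pandas; the metric names, values, and the hourly grain are assumptions), showing how clicks per hour could be lined up against the lake's ingestion metrics.

```python
import pandas as pd

hours = pd.date_range("2024-01-01", periods=24, freq="h")

# Hypothetical business usage metric: clicks per hour on a web portal.
clicks = pd.DataFrame({"hour": hours, "clicks_per_hour": range(1000, 1024)})

# Hypothetical data lake metric: rows ingested per hour by the corresponding pipeline.
ingest = pd.DataFrame({"hour": hours, "rows_ingested": range(5000, 5024)})

# Join on the shared time grain and check how closely usage tracks ingestion volume;
# a sudden divergence is an early hint of a broken feed or an upstream change.
joined = clicks.merge(ingest, on="hour")
print(joined["clicks_per_hour"].corr(joined["rows_ingested"]))
```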
Portability
Portability is the ability to migrate workloads across cloud providers, or to alternative tools and solutions even within the same cloud provider, as better best-of-breed solutions are chosen.
Here, one must take a pragmatic view, balancing the value of a vendor-specific solution against using a generic, lowest-common-denominator capability.
For example, use standardised file formats such as Parquet for bulk data and Avro for record-wise data, but use vendor-provided tools such as AWS Glue or GCP Dataflow for job orchestration. Care must be taken to keep business rules in SQL-like higher-level languages as much as possible; only deviate when performance becomes more valuable than maintainability. Similarly, keep business logic and schema transformation rules separate. This may add performance overhead, such as duplicate record scans; however, the decoupling generally pays off with improved maintainability.
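A minimal sketch of that separation, assuming PySpark and illustrative paths, table, and column names (not a prescription for any particular stack): the schema transformation is plain DataFrame code, the business rule lives in portable SQL, and the output stays in an open Parquet format.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("portability-sketch").getOrCreate()

# 1. Schema transformation layer: rename/cast only, no business rules.
raw = spark.read.parquet("s3://example-bucket/raw/orders/")          # illustrative path
staged = (raw
          .withColumnRenamed("ord_amt", "order_amount")
          .withColumn("order_date", F.to_date("order_ts")))

# 2. Business rule layer: expressed in plain SQL so it ports across engines.
staged.createOrReplaceTempView("orders")
refined = spark.sql("""
    SELECT order_date, SUM(order_amount) AS daily_revenue
    FROM orders
    WHERE order_amount > 0          -- business rule: ignore refunds/corrections
    GROUP BY order_date
""")

# 3. Open file format (Parquet) keeps the bulk data portable across tools.
refined.write.mode("overwrite").parquet("s3://example-bucket/refined/daily_revenue/")
```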
Security and Governance
Data lake security is paramount, since a breach of a data lake can cause more damage than a breach of one or two source applications. On the other hand, a data lake cannot be a means to bypass the security of its source applications: a user who is not entitled to access data in the original source should not be able to access it in the data lake either. The job of the data lake is, without altering privilege levels, to compress the governance and auditing process from weeks to minutes using automation and templated guardrails.
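Below is a minimal, hand-rolled sketch of the "no new privileges via the lake" rule; the entitlement registry and function names are hypothetical, and in practice this check would be expressed in the platform's policy engine (e.g. AWS Lake Formation or Apache Ranger) as a templated guardrail rather than application code.

```python
# Hypothetical entitlement registry: what each user may see in the *source* applications.
SOURCE_ENTITLEMENTS = {
    "alice": {"crm": {"accounts", "contacts"}},
    "bob":   {"itsm": {"incidents"}},
}

def grant_lake_access(user: str, source_app: str, dataset: str) -> bool:
    """Approve a lake grant only if the user already has it in the source application.

    The data lake must never widen a user's privileges beyond the source system;
    it only automates and audits the grant, compressing the process from weeks to minutes.
    """
    allowed = SOURCE_ENTITLEMENTS.get(user, {}).get(source_app, set())
    if dataset not in allowed:
        print(f"DENY: {user} lacks '{dataset}' in source '{source_app}'")   # audit trail entry
        return False
    print(f"GRANT: {user} -> {source_app}.{dataset} (mirrors source entitlement)")
    return True

grant_lake_access("alice", "crm", "accounts")    # granted: mirrors her CRM access
grant_lake_access("bob", "crm", "accounts")      # denied: no such source entitlement
```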
Summary
For a successful data lake implementation, "getting it right" on observability, portability, and security & governance is as vital as data modeling, data cataloging, data quality, and data pipeline lineage.