Data Lake - Part 2: Observability, portability, security, and governance guardrails to support analytics workloads.
Observability, portability, security, and governance are crucial to the success of any data lake project. Balancing these aspects without over-engineering is key, and this article discusses how to get that balance right. An experienced data architect can help assess the requirements and the maturity level of the IT team, then define a solution and a road map that delivers business value consistently.
Note: This is Part 2 of the article on the Data Lake; please refer to Part 1 for better context.
Another key aspect that matters for data lake success is metadata and data lineage management. However, metadata management of all IT assets has to be driven at the organisational level, so it is kept out of scope for this discussion.
We can learn from solution patterns in other mature industry domains.
ABC analysis: an inventory categorisation method that groups items into three categories: A, B, and C.
In the case of the aviation industry, let us focus on fuel-related information.
The pilot in the cockpit needs real-time data about the fuel status of each tank, the fuel flow rate, and low-level warnings at sub-second granularity.
The ATC only needs to know whether an aircraft's fuel situation is at minimum or emergency level, but for all the aircraft under its control, and at a granularity of a few seconds, if not minutes.
The maintenance team needs fuel-related flight data at far finer granularity for each component (valves, pumps, etc.), including temperature, pressure, and vibration, but with an acceptable latency of days. Here the telemetry is captured at the most granular level but analysed and reported only as needed. Similarly, the data lake should collect data at this granularity, but process it and move it to the Stage or Refined area only when a requirement calls for it.
In other words, the granularity at which the various telemetry and other digital exhaust are metered, from a data operations perspective, has to balance cost against value.
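To make the analogy concrete, below is a minimal sketch (Python with pandas; the column names, sensor values, and time windows are illustrative assumptions, not from any specific system) of keeping per-second telemetry in the raw zone and aggregating it to an hourly grain only when it is promoted to the refined zone.

```python
import pandas as pd

def promote_to_refined(raw: pd.DataFrame, grain: str = "1h") -> pd.DataFrame:
    """Aggregate per-second raw telemetry to the grain the consumer actually needs.

    `raw` is assumed to have a 'timestamp' column plus numeric sensor columns
    such as 'fuel_flow_rate' and 'tank_level' (illustrative names).
    """
    return (
        raw.set_index("timestamp")
           .resample(grain)                      # e.g. hourly grain for the refined zone
           .agg({"fuel_flow_rate": "mean",       # average flow over the window
                 "tank_level": "min"})           # worst-case (lowest) level in the window
           .reset_index()
    )

# Raw zone keeps full per-second granularity; refined zone only gets what is required.
raw = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=7200, freq="s"),
    "fuel_flow_rate": 2.5,
    "tank_level": 98.0,
})
refined = promote_to_refined(raw, grain="1h")    # two hourly rows instead of 7,200 raw rows
```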
The data pipeline should be able to change the metrics collection interval from per-second to hourly. This enables the consumer to zoom into much finer detail and swiftly pinpoint issues, and to avoid noise when things are running smoothly.
Initially, the data pipeline may need to be observed every second to stabilise and optimise it. Once the pipeline has matured (i.e. is well within control limits), a coarser granularity of metric collection, say every 15 minutes or hourly, is sufficient.
The granularity of metrics collection can be changed dynamically as needed: on a failure retry, observability can be switched to per-second mode, and once the job succeeds, it can be turned back to hourly mode for the next cycle.
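As a rough sketch of this toggling (Python; `job` and `emit_metrics` are hypothetical callables standing in for whatever the orchestrator provides, so this illustrates the idea rather than any specific tool's API):

```python
PER_SECOND = 1          # fine-grained mode while stabilising, or when retrying after a failure
HOURLY = 3600           # coarse mode once the pipeline is well within control limits

def collection_interval(last_run_failed: bool, pipeline_mature: bool) -> int:
    """Pick the metrics collection interval (seconds) for the next pipeline cycle."""
    if last_run_failed or not pipeline_mature:
        return PER_SECOND       # zoom in to pinpoint issues quickly
    return HOURLY               # avoid noise when things are running smoothly

def run_cycle(job, emit_metrics, last_run_failed: bool, pipeline_mature: bool = True) -> bool:
    """Run one pipeline cycle, emitting metrics at the granularity chosen from the last outcome."""
    interval = collection_interval(last_run_failed, pipeline_mature)
    emit_metrics(interval)          # hypothetical hook: sample/report every `interval` seconds
    try:
        job()                       # hypothetical pipeline job
        return False                # success: next cycle drops back to hourly mode
    except Exception:
        return True                 # failure: next cycle observes per second

# Example: a failing run flips the next cycle to per-second observability.
failed = run_cycle(job=lambda: 1 / 0, emit_metrics=print, last_run_failed=False)
failed = run_cycle(job=lambda: None, emit_metrics=print, last_run_failed=failed)
```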
Observability stakeholders
Data lake observability has various stakeholders, and aligning on an operating rhythm with clear ownership and touchpoints between teams is critical to success.
Key factors to consider:
Great care must be taken when defining the metrics, the reporting/dashboards, and the action workflows. Here are a few factors to consider.
Design Thinking workshops.
Design Thinking (DT) workshops are an effective tool for gathering consolidated observability requirements across the various stakeholders. DT provides a curated set of tools and techniques that can be used to align all stakeholders and define a collective view of the observability requirements.
Key factors for design thinking workshop success are:
- Leadership team's commitment and clarity of the business objective, i.e. what the project is and why it is being done.
- Organisational culture and its ability to adapt to change and collaborate.
- Expertise of the DT workshop facilitator.
In other words, a design thinking workshop by itself may not ensure project success; it helps in "failing fast" and improves the odds of success. Say a project with a 50% chance of success can be improved to a 60-70% chance with design thinking workshops. If, after a couple of iterations of design thinking sessions, the observability requirements (metrics, reports, and corrective action workflows) are still not precisely defined, it is better to go back and re-assert the data lake requirement, its use case, and its value, and then return later.
A sample consolidated high-level data lake observability requirements list may look something like this.
Domain-specific metrics, like clicks per hour on a web portal or IT assets per tenant per hour for ITSM data, are good data points that help correlate business usage metrics with data lake metrics.
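Once both feeds land on a common time grain, the correlation itself is a straightforward join. The sketch below is purely illustrative (Python with pandas; the metric names, values, and the hourly grain are assumptions), showing how clicks per hour could be lined up against the lake's ingestion metrics.

```python
import pandas as pd

hours = pd.date_range("2024-01-01", periods=24, freq="h")

# Hypothetical business usage metric: clicks per hour on a web portal.
clicks = pd.DataFrame({"hour": hours, "clicks_per_hour": range(1000, 1024)})

# Hypothetical data lake metric: rows ingested per hour by the corresponding pipeline.
ingest = pd.DataFrame({"hour": hours, "rows_ingested": range(5000, 5024)})

# Join on the shared time grain and check how closely usage tracks ingestion volume;
# a sudden divergence is an early hint of a broken feed or an upstream change.
joined = clicks.merge(ingest, on="hour")
print(joined["clicks_per_hour"].corr(joined["rows_ingested"]))
```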
Portability
Portability is the ability to migrate workloads across cloud providers, or to alternative tools and solutions even within the same cloud provider, as better best-of-breed solutions are chosen.
Here, one must take a pragmatic view, balancing the value of a vendor-specific solution against using a generic, lowest-common-denominator capability.
For example, use standardised file formats such as Parquet for bulk data and Avro for record-wise data, but use vendor-provided tools such as AWS Glue or GCP Dataflow for job orchestration. Care must be taken to keep business rules in SQL-like higher-level languages as much as possible; only deviate when performance becomes more valuable than maintainability. Similarly, keep business logic and schema transformation rules separate. This may add performance overhead, such as duplicate record scans; however, the decoupling generally pays off with improved maintainability.
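A minimal sketch of that separation, assuming PySpark and illustrative paths, table, and column names (not a prescription for any particular stack): the schema transformation is plain DataFrame code, the business rule lives in portable SQL, and the output stays in an open Parquet format.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("portability-sketch").getOrCreate()

# 1. Schema transformation layer: rename/cast only, no business rules.
raw = spark.read.parquet("s3://example-bucket/raw/orders/")          # illustrative path
staged = (raw
          .withColumnRenamed("ord_amt", "order_amount")
          .withColumn("order_date", F.to_date("order_ts")))

# 2. Business rule layer: expressed in plain SQL so it ports across engines.
staged.createOrReplaceTempView("orders")
refined = spark.sql("""
    SELECT order_date, SUM(order_amount) AS daily_revenue
    FROM orders
    WHERE order_amount > 0          -- business rule: ignore refunds/corrections
    GROUP BY order_date
""")

# 3. Open file format (Parquet) keeps the bulk data portable across tools.
refined.write.mode("overwrite").parquet("s3://example-bucket/refined/daily_revenue/")
```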
Security and Governance
Data lake security is paramount, since a breach of a data lake can cause more damage than a breach of one or two source applications. On the other hand, a data lake cannot be a means to bypass the security of its source applications: a user who is not entitled to access data in the original source should not be able to access it in the data lake either. The job of the data lake is, without altering privilege levels, to compress the governance and auditing process from weeks to minutes using automation and templated guardrails.
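Below is a minimal, hand-rolled sketch of the "no new privileges via the lake" rule; the entitlement registry and function names are hypothetical, and in practice this check would be expressed in the platform's policy engine (e.g. AWS Lake Formation or Apache Ranger) as a templated guardrail rather than application code.

```python
# Hypothetical entitlement registry: what each user may see in the *source* applications.
SOURCE_ENTITLEMENTS = {
    "alice": {"crm": {"accounts", "contacts"}},
    "bob":   {"itsm": {"incidents"}},
}

def grant_lake_access(user: str, source_app: str, dataset: str) -> bool:
    """Approve a lake grant only if the user already has it in the source application.

    The data lake must never widen a user's privileges beyond the source system;
    it only automates and audits the grant, compressing the process from weeks to minutes.
    """
    allowed = SOURCE_ENTITLEMENTS.get(user, {}).get(source_app, set())
    if dataset not in allowed:
        print(f"DENY: {user} lacks '{dataset}' in source '{source_app}'")   # audit trail entry
        return False
    print(f"GRANT: {user} -> {source_app}.{dataset} (mirrors source entitlement)")
    return True

grant_lake_access("alice", "crm", "accounts")    # granted: mirrors her CRM access
grant_lake_access("bob", "crm", "accounts")      # denied: no such source entitlement
```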
Summary
For a successful data lake implementation, "getting it right" on observability, portability, and security & governance is as vital as data modeling, data cataloging, data quality, and data pipeline lineage.