Future-Proof Data Lakes

A Data Lake is a scalable, centralized repository that can store raw data. Data lakes differ from data warehouses as they can store both structured and unstructured data, which you can process and analyze later. This removes much of the overhead associated with traditional database architectures, which would typically involve lengthy ETL and data modeling when ingesting the data.
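In practice this is the schema-on-read idea: data lands in the lake as-is, and structure is applied only when someone reads it. The following is a minimal sketch of that idea using Python and pandas; the paths, event fields, and JSON Lines layout are invented for illustration.

```python
# Minimal schema-on-read sketch (illustrative only): raw events are stored exactly
# as they arrive, and typing/modeling happens at read time.
import json
from pathlib import Path

import pandas as pd

lake = Path("lake/raw/clicks")          # hypothetical raw zone of the lake
lake.mkdir(parents=True, exist_ok=True)

# Ingest: dump events as-is, no upfront ETL or data modeling. Note the second
# event carries an extra field; the lake accepts it without any schema change.
events = [
    {"user": "u1", "ts": "2023-01-01T10:00:00", "page": "/home"},
    {"user": "u2", "ts": "2023-01-01T10:05:00", "page": "/pricing", "ref": "ad"},
]
(lake / "events.jsonl").write_text("\n".join(json.dumps(e) for e in events))

# Analyze later: structure and types are applied only when the data is read.
df = pd.read_json(lake / "events.jsonl", lines=True)
df["ts"] = pd.to_datetime(df["ts"])
print(df.groupby("page").size())
```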

The idea of a Data Lake is to have a single store for all data in the enterprise, ranging from raw data to transformed data, used for various tasks including:

  • Reporting
  • Visualization
  • Analytics
  • Machine Learning / Artificial Intelligence

Data Lakes offer organizations a solution for collecting big, small or smart data, which can then be manipulated and mined for insights by data scientists, analysts, and developers.

Data lakes come in a variety of shapes and sizes. Many businesses have chosen to house their data lakes on the Hadoop Distributed File System (HDFS), which stores large files across a Hadoop cluster.

Although data platforms such as on-premises Hadoop are commonly used for data lakes, a shift is underway. Organizations are starting to transition to cloud-based data platforms such as Amazon Web Services (AWS), Microsoft Azure, Snowflake, Google BigQuery, etc. to meet modern data lake requirements.

Cloud-based data lakes promise easier, more affordable, and more flexible data platforms compared to on-premises Hadoop.

Implementing a Data Lake requires the following capabilities (a minimal sketch touching several of them follows the list):

  • Scalable polyglot big data storage
  • Scalable polyglot data processing
  • Batch and real time data processing
  • OLAP and OLTP processing and analytics
  • Encryption for data at rest and in motion
  • Data product versioning
  • Data product schema
  • Data product de-identification
  • Unified data access control and logging
  • Data pipeline implementation and orchestration
  • Data product discovery, catalog registration and publishing
  • Data governance and standardization
  • Data product lineage
  • Data product monitoring/alerting/logging
  • Data product quality metrics (collection and sharing)
  • In memory data caching
  • Federated identity management
  • Compute and data locality
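Several of these capabilities (versioning, schema, lineage, de-identification candidates, quality metrics, catalog registration) can be made concrete with a small descriptor. The following is a hypothetical Python sketch; the class, field names, and the in-memory catalog are invented for illustration and do not refer to any specific tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataProduct:
    """Hypothetical descriptor a data lake catalog could register for each data product."""
    name: str                                                 # discovery / catalog key
    version: str                                              # data product versioning
    schema: Dict[str, str]                                    # column name -> logical type
    owners: List[str]                                         # accountable domain team
    upstream: List[str] = field(default_factory=list)         # lineage: source products
    pii_columns: List[str] = field(default_factory=list)      # candidates for de-identification
    quality_metrics: Dict[str, float] = field(default_factory=dict)  # e.g. completeness

# An in-memory stand-in for catalog registration and publishing.
catalog: Dict[str, DataProduct] = {}

def register(product: DataProduct) -> None:
    catalog[f"{product.name}:{product.version}"] = product

register(DataProduct(
    name="shipments",
    version="1.2.0",
    schema={"shipment_id": "string", "weight_kg": "double", "customer_email": "string"},
    owners=["logistics-domain"],
    upstream=["raw.shipment_events"],
    pii_columns=["customer_email"],
    quality_metrics={"completeness": 0.997},
))
print(sorted(catalog))
```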

Challenges of building traditional in-house data lakes typically involve:

  • Complexity of building data pipelines: you commonly need to manage both the pipelines and the underlying hardware infrastructure: spinning up servers, orchestrating batch ETL jobs, and dealing with outages and downtime.
  • Maintenance Costs: aside from the upfront investment needed to purchase servers and storage equipment, there are ongoing management and operating costs when running an on-premises data lake, mostly manifesting as IT and engineering costs.
  • Scalability: if you want to scale up your data lake to support more users or bigger data, you’ll need to manually add and configure servers. You need to keep a close eye on resource utilization, and any additional servers create additional maintenance and operating costs.

Advantages of moving data lakes to the cloud are:

  • Focus on business value, not infrastructure: storing big data in the cloud eliminates the need to build and maintain infrastructure, so you can use engineering resources to develop new functionality that you can connect to business value.
  • Lower engineering costs: you can build data pipelines more efficiently with cloud-based tools. The data pipeline is often pre-integrated, so you can get a working solution without investing hundreds of hours in data engineering.
  • Use managed services to scale up: the cloud provider can manage scaling for you. Some data lake cloud services provide completely transparent scaling, so you don’t need to add machines or manage clusters.
  • Agile infrastructure: cloud services are flexible and offer on-demand infrastructure. If new use cases come up for your data lake, you can re-think, re-engineer and re-architect your data lake more easily.
  • Up-to-date technologies: cloud-based data lakes update automatically and make the latest technology available. You can also add new cloud services as they become available, without changing your architecture.
  • Reliability and availability: cloud providers work to prevent service interruptions, storing redundant copies of data on different servers. Availability spans several data centers.

Moving workloads from traditional on-premises systems to the cloud is a big decision for most enterprises. Concerns center around control over systems, their costs, availability, end-to-end performance, and overall security. Even after these are addressed, it's not practical to move every application to the cloud overnight.

Many companies adopt a hybrid approach, with some apps and data in the public cloud, and others remaining on-premises. And even if your hybrid cloud is meant to be a temporary situation, there is one good reason to stick with it for the long haul: data.

Moving data to the public cloud can be extremely risky and expensive. Furthermore, managing public cloud costs is even harder when an organization has a multi-cloud strategy.

In particular, the following are some of the key concerns that need to be addressed:

  • Data gravity: the tendency to leave data where it currently resides. The reasons are cost (you need to provision double the capacity while you move it), speed of transfer, replication during the move, and concerns about data loss. This is especially true of big data, where there are huge amounts of data at stake.
  • Regulatory requirements and local laws: these impose restrictions and requirements, and may even prohibit outright the migration of data to the cloud. Furthermore, laws vary by country.
  • Value of Data: many organizations just aren't willing to push all of their data to the cloud. The risks of data theft and loss are simply too great.
  • Game of Clouds: the complexities of cloud service migration mean that many customers stay with a provider that doesn’t meet their needs, just to avoid the cumbersome and risky process of moving.

While hybrid multi-cloud does bring more enterprises into the cloud, it should be made clear that it is not always a replacement for on-premises deployments. Companies might believe that implementing multiple clouds eliminates the need to continue running an on-premises architecture; instead, hybrid multi-cloud extends the on-premises capabilities of enterprises. Users will continue to keep their information secure on a physical architecture while taking advantage of the power that a cloud environment provides.

The biggest way hybrid multi-cloud will impact cloud computing isn’t just the ability for enterprises to run a workflow in any environment. It’s also a sign that the cloud will become more open in the future.

An open hybrid cloud gives businesses flexibility, control, and choice, and leaves them open to innovation. It makes the most of existing resources while allowing systems, applications, and data to work together seamlessly across public cloud, private cloud, and on-premises environments.

Open hybrid cloud opens doors and opportunities for enterprises to better manage, control, and decide between CAPEX and OPEX, build and buy, cloud bursting versus scaling up or out, innovation versus renovation strategies, and much more.

The next-generation DPDHL Data Lake architecture will be designed to fully leverage an open, hybrid, multi-cloud environment.

The Hub & Spoke Architecture

Traditionally, in the context of information systems, Hub & Spoke (H&S) is an approach to data integration and, more generally, to data management. In the broader context of big data, H&S is also an approach to analytics.

The spoke-hub distribution paradigm was introduced as a form of transport topology optimization in which traffic planners organize routes as a series of "spokes" that connect outlying points to a central "hub".

H&S is a conceptual model representing a strong foundation for the distribution and specialization of data and computing power while still guaranteeing centralized control and management of resources.

H&S represents a flexible model for businesses that need to embrace and deal with incremental innovation in a very liquid and polyglot world of big data and analytics.

The hub represents the centralized source for common data and services such as identity management, logging, monitoring, and others. Each spoke represents a specialized and “isolated” source of processing and analytics such as GPU computing, a data science workbench, dedicated Spark clusters, etc.

H&S also represents a flexible model for businesses that want to leverage the cloud: a dedicated Google TensorFlow cluster is “just” a spoke instance.

H&S also gives businesses a way to comply with local regulations and to cope with problems such as data gravity, the game of clouds, etc.

Multi H&S enables businesses to achieve geographic scale in a consistent, flexible and coherent way.
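As an illustration, the topology can be written down as plain configuration: one hub offering shared services and a set of specialized, isolated spokes, each placed where it makes the most sense. The Python sketch below is purely illustrative; the spoke names, regions, and services are invented and do not describe any real deployment.

```python
# Toy multi hub-and-spoke description (illustrative only).
hub = {
    "shared_services": ["identity", "central-logging", "monitoring", "data-catalog"],
}

spokes = [
    {"name": "gpu-training", "purpose": "ML model training", "runs_on": "cloud",   "region": "eu-west"},
    {"name": "spark-batch",  "purpose": "batch processing",  "runs_on": "on-prem", "region": "eu-central"},
    {"name": "ds-workbench", "purpose": "data science",      "runs_on": "cloud",   "region": "us-east"},
]

def placement_report(hub: dict, spokes: list) -> None:
    """Each spoke is specialized and isolated, yet all of them consume the same
    centrally managed services from the hub."""
    for spoke in spokes:
        print(f"{spoke['name']:<13} -> {spoke['runs_on']:>7} ({spoke['region']}), "
              f"hub services: {', '.join(hub['shared_services'])}")

placement_report(hub, spokes)
```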

Nevertheless, all that glitters is not gold: H&S-based architectures require a lot of discipline concerning automation, orchestration, and isolation.

In this context, isolation refers in particular to two fundamental concepts:

  • The ability to perform resource isolation as a technique to avoid resource contention among collocated VMs on a single server or on a group of servers (cluster).
  • The ability to co-locate and couple together products and resources by functional affinity as independent deployable units. In the microservices world this technique is also known as Architectural Quanta.

The hub and spoke model also helps organizations leverage and organize their hybrid (on-premises/cloud) infrastructure into multiple connected environments depending on their needs.

The H&S model works very well with “new” architectural principles and patterns such as Data Mesh and architectural quanta; it actually encourages them.

Data Mesh Architecture

Companies tend to fall into the trap of confusing simply moving IT systems to the cloud with the transformational strategy needed to get the full value of the cloud.

Lifting and shifting legacy applications to the cloud will not automatically yield the benefits that cloud infrastructure and systems can provide. In some cases it can result in IT architectures that are more complex, cumbersome, and costly than before.

The tradeoff between the need for integration and the need for isolation makes things even more complex and harder to accomplish especially when designing data aware systems such as data lakes.

The full value of cloud comes from approaching these options not as one-off tactical decisions but as part of a holistic strategy to pursue digital transformation.

Although the cloud is not a prerequisite, it certainly acts as a force multiplier. Companies that view cloud capabilities in this way can create a next-generation IT capable of enabling business growth and innovation in the rapidly evolving digital era.

A “new” architectural approach based on a modern, distributed architecture for analytical data management has emerged under the name of Data Mesh.

Data mesh enables end users to easily access and query data where it lives without first transporting it to a data lake or data warehouse. The decentralized strategy of data mesh distributes data ownership to domain-specific teams that manage, own, and serve the data as a product.

The main objective of data mesh is to eliminate the challenges of data availability and accessibility at scale. Data mesh allows business users and data scientists alike to access, analyze, and operationalize business insights from virtually any data source, in any location, without intervention from expert data teams.
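A hypothetical, in-memory sketch of this idea: domains publish their data as products behind an interface they own, and consumers read the data where it lives instead of copying it into a central store. The mesh class, product names, and records below are invented; no real federated query engine is implied.

```python
from typing import Callable, Dict, Iterable, List

class DataMesh:
    """Toy registry: maps a product name to a read function owned by a domain team."""

    def __init__(self) -> None:
        self._products: Dict[str, Callable[[], Iterable[dict]]] = {}

    def publish(self, product_name: str, reader: Callable[[], Iterable[dict]]) -> None:
        # A domain team publishes a product by registering the reader it owns.
        self._products[product_name] = reader

    def read(self, product_name: str) -> Iterable[dict]:
        # Consumers access the data in place, through the owning domain's interface.
        return self._products[product_name]()

mesh = DataMesh()
# The logistics domain serves its own data; here it is just an inline list.
mesh.publish("logistics.shipments", lambda: [
    {"shipment_id": "S1", "status": "delivered"},
    {"shipment_id": "S2", "status": "in_transit"},
])

delivered: List[dict] = [r for r in mesh.read("logistics.shipments") if r["status"] == "delivered"]
print(delivered)
```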

The data mesh paradigm arises from the insight that centralized, monolithic data architectures suffer from some inherent problems:

  1. A lack of business understanding in the data team
  2. The lack of flexibility of centralized data platforms
  3. Slow data provisioning and response to changes

Data mesh aims to solve these problems by making organizational units (called “domains”) responsible for managing and exposing their own data to the rest of the organization.

Data Mesh is a cultural and technical concept that distributes data ownership across product domain teams, with a centralized data infrastructure and decentralized data (as) products.

Data Mesh Core Principles

There are four main high-level foundational principles in the implementation of Data Mesh-driven architectures:

  • Domain Driven Data Ownership
  • Data As a Product
  • Federated Governance
  • Self-serve data infrastructure as a platform

Domain Driven Data Ownership

Domain-driven design promotes a standard language that ensures information can be used efficiently across the business; this standard or “ubiquitous” language is central to the idea of domain-driven design (DDD) as a means of removing barriers between developers and domain experts.

DDD is a software design approach focusing on modeling software to match a domain according to input from that domain's experts.

The DDD methodology is often associated with microservices, where each domain model’s context is very well defined and the share-nothing paradigm is a constituent principle of such implementations.

With regard to data, the analytics world is taking a different approach from the operational world. While in the operational world sharing data directly, in its raw original format, across different domains is generally a bad practice, in the analytics world it is actually a strongly recommended one.

Isolation, including data isolation, is a core architectural principle for the design and implementation of highly scalable microservices. Isolation enables engineers to independently and atomically develop, deploy, and scale functionality, reducing the risk and impact of affecting, and therefore breaking, neighboring services.

Sharing data in its raw original format across different domains breaks the data isolation principle and may have an impact on the system’s overall capacity to independently scale and evolve in the medium/long term.

Unlocking the power of data in the so-called data democratization process requires a certain degree of flexibility in the analytics world.

Software architects should find the right balance between the need to isolate and the need to share. They should therefore put in place the appropriate design techniques, for instance anti-corruption layers, data virtualization, etc., in order to better accommodate the need to easily and efficiently evolve and scale the system in the face of ever-changing and new requirements.
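As a small illustration of one of these techniques, the sketch below shows an anti-corruption layer: the consuming (analytics) domain maps the producer’s raw format into its own model at the boundary, so upstream changes stay contained. All field names and the mapping logic are invented for illustration.

```python
from dataclasses import dataclass

# Raw record shape as the producing (operational) domain happens to emit it.
raw_order = {"ord_id": "O-42", "cust": "ACME", "amt_cents": 1999, "st": "SHP"}

@dataclass
class AnalyticsOrder:
    """The consuming domain's own model, insulated from upstream representation changes."""
    order_id: str
    customer: str
    amount_eur: float
    shipped: bool

def anti_corruption_layer(raw: dict) -> AnalyticsOrder:
    """Translate the upstream representation into the consumer's model; if the
    producer renames 'st' tomorrow, only this function has to change."""
    return AnalyticsOrder(
        order_id=raw["ord_id"],
        customer=raw["cust"],
        amount_eur=raw["amt_cents"] / 100.0,
        shipped=(raw["st"] == "SHP"),
    )

print(anti_corruption_layer(raw_order))
```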

Data Products vs Data As Products

Any digital product or feature can be considered a “data product” if it uses data to facilitate a goal. For example, the home page of a digital newspaper is a data product if the news items featured on it are dynamically selected based on my previous navigation data.

One of the principles of the data mesh paradigm is to consider data as a product.

“Data as a product” is a subset of all possible data products: it is the result of applying product thinking to datasets, making sure they have a series of capabilities including discoverability, security, explorability, understandability, trustworthiness, etc.

In other words, data as a product expects the analytical data provided by the domains to be treated as a product, and the consumers of that data to be treated as customers.

Types of Data Products

In general, we can distinguish data products into three categories:

  • Raw Data: collecting and making available data as it is from source systems.
  • Consolidated Data: aggregation, compaction and cleansing of raw data.
  • Derived Data: filtering, joining, and aggregating raw or consolidated data into new data sources.

Consolidated and derived data can be further distinguished into the following three categories (a small sketch of the raw, consolidated, and derived stages follows this list):

  • lookup: integration and support data for final reports
  • algorithms: input data for ML or AI algorithms
  • decision making: final data available for decision making support
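The sketch below walks through the raw, consolidated, and derived stages with pandas on a toy dataset; the columns, the deduplication rule, and the final decision-making aggregate are invented for illustration, and a real pipeline would read from and write back to the lake rather than use inline data.

```python
import pandas as pd

# Raw: events exactly as the source system produced them (duplicates included).
raw = pd.DataFrame([
    {"shipment_id": "S1", "status": "created",   "weight_kg": 2.0},
    {"shipment_id": "S1", "status": "delivered", "weight_kg": 2.0},
    {"shipment_id": "S2", "status": "delivered", "weight_kg": 5.5},
    {"shipment_id": "S2", "status": "delivered", "weight_kg": 5.5},  # duplicate event
])

# Consolidated: cleansed and compacted raw data, one row per shipment.
consolidated = (
    raw.drop_duplicates()
       .sort_values(["shipment_id", "status"])
       .drop_duplicates(subset="shipment_id", keep="last")
)

# Derived: aggregation of consolidated data into a decision-making product.
derived = consolidated.groupby("status", as_index=False).agg(
    shipments=("shipment_id", "count"),
    total_weight_kg=("weight_kg", "sum"),
)
print(derived)
```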

Federated Governance

Data mesh implies being able to decentralize data, organizing it along domain-driven lines, with each domain owning its own data that it treats as a product that is consumed by the rest of the organization.

Federated data governance in a data mesh describes a situation in which data governance standards are defined centrally, but local domain teams have the autonomy and resources to execute these standards in whatever way is most appropriate for their particular environment.
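A hypothetical sketch of what this can look like in practice: the standards are defined once, centrally, as simple checks, and each domain team runs them locally against the metadata of its own data products. The check names and metadata fields below are invented for illustration.

```python
from typing import Callable, Dict, List

# Centrally defined governance standards (the "what").
def has_owner(meta: Dict) -> bool:
    return bool(meta.get("owner"))

def has_description(meta: Dict) -> bool:
    return len(meta.get("description", "")) >= 20

def pii_is_declared(meta: Dict) -> bool:
    return "pii_columns" in meta

CENTRAL_STANDARDS: List[Callable[[Dict], bool]] = [has_owner, has_description, pii_is_declared]

# Executed locally by each domain team, in its own pipelines (the "how").
def failed_checks(meta: Dict, standards: List[Callable[[Dict], bool]] = CENTRAL_STANDARDS) -> List[str]:
    return [check.__name__ for check in standards if not check(meta)]

shipments_meta = {
    "owner": "logistics-domain",
    "description": "One row per shipment with the latest delivery status.",
    "pii_columns": ["customer_email"],
}
print(failed_checks(shipments_meta) or "all central standards satisfied")
```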

Nevertheless, data governance is more a process than a technology, although technology can help support that process.

The term “federated governance” very often misleads organizations into believing that adopting one technology rather than another will solve a very difficult problem related to data governance.

Data governance is not about changing a technology but it is mostly about changing organizational culture.

Federated Governance is about giving data providers (the organizational units responsible for the ownership of data) the responsibility to provide data as products: discoverability, security, explorability, understandability, trustworthiness, etc.

Data governance is about establishing data management practices and processes that ensure that the data provided by each domain is of the highest quality, from a consumer perspective.

Data governance is not about technology (technology is a means) but it is about organizational processes and best practices.

The Architecture Quanta

An architectural quantum is an independently deployable component with high functional cohesion.

An example of a quantum is Hive together with HiveServer2 and the Hive Metastore, which are functionally dependent on each other.

A data lake spoke is a specific and specialized architectural quantum that can scale and evolve independently from the rest of the system.

Consumers can share or have their own dedicated spoke. Consumers can use different versions of a spoke. Consumers can run spokes on premises or in the cloud depending on their own needs or on local regulations. Portability is therefore an important requirement for spokes; it will be discussed in the Orchestration & Automation chapter.

Architects deal with architectural quanta, the parts of a system held together by hard-to-break forces.

One of the keys to building evolutionary architectures lies in determining natural component granularity, and the coupling between components, to fit the capabilities architects want to support via the software architecture.

Container-based architectures enable and facilitate the adaptability of architectural quanta. The deployment unit can be a “mini data lake”: a dedicated or specialized spoke (in H&S terms) serving a specific business purpose.
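A hypothetical descriptor for such a unit might look like the sketch below: one spoke, treated as a single architectural quantum, whose functionally dependent containers are versioned and scaled together while shared services are consumed from the hub. Names, images, and versions are invented for illustration.

```python
# Illustrative "mini data lake" spoke, deployed and scaled as one quantum.
reporting_spoke = {
    "name": "eu-reporting-spoke",
    "version": "2.4.1",
    "runs_on": "on-prem",                      # the same unit could run in a cloud region
    "containers": [                            # high functional cohesion: shipped as one unit
        {"image": "hive-metastore:3.1", "replicas": 1},
        {"image": "hiveserver2:3.1",    "replicas": 2},
        {"image": "spark-thrift:3.4",   "replicas": 2},
    ],
    "hub_services": ["identity", "central-logging", "data-catalog"],  # consumed, not owned
}

def scale(spoke: dict, factor: int) -> dict:
    """The quantum scales as a whole, independently of any other spoke."""
    for container in spoke["containers"]:
        container["replicas"] *= factor
    return spoke

print(scale(reporting_spoke, 2)["containers"])
```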
