Future-Proof Data Lakes
A Data Lake is a scalable, centralized repository that can store raw data. Data lakes differ from data warehouses as they can store both structured and unstructured data, which you can process and analyze later. This removes much of the overhead associated with traditional database architectures, which would typically involve lengthy ETL and data modeling when ingesting the data.
The idea of a Data Lake is to have a single store of all data in the enterprise, ranging from raw data to transformed data used for a wide variety of analytical tasks.
Data Lakes offer organizations a solution for collecting big, small or smart data, which can then be manipulated and mined for insights by data scientists, analysts, and developers.
Data lakes come in a variety of shapes and sizes. Many businesses have chosen to house their data lakes on the Hadoop Distributed File System (HDFS), which stores large files across a Hadoop cluster.
Although data platforms such as on-premises Hadoop are commonly used for data lakes, a shift is underway. Organizations are starting to transition to cloud-based data platforms such as Amazon Web Services (AWS), Microsoft Azure, Snowflake, Google BigQuery, etc. to meet modern data lake requirements.
Cloud-based data lakes promise to offer easier, more affordable, and more flexible data platforms than on-premises Hadoop.
Implementing a Data Lake requires the following:
Challenges of building traditional in-house data lakes typically include:
Advantages of moving data lakes to the cloud are:
Moving workloads from traditional on-premises systems to the cloud is a big decision for most enterprises. Concerns center around control over systems, their costs, availability, end-to-end performance, and overall security. Even after these are addressed, it's not practical to move every application to the cloud overnight.
Many companies adopt a hybrid approach, with some apps and data in the public cloud, and others remaining on-premises. And even if your hybrid cloud is meant to be a temporary situation, there is one good reason to stick with it for the long haul: data.
Moving data to the public cloud can be extremely risky and expensive. Furthermore, managing public cloud costs is even harder when an organization has a multi-cloud strategy.
In particular, the following are some of the key concerns that need to be addressed:
While hybrid multi-cloud does bring more enterprises into the cloud, it should be made clear that it is not always a replacement for on-premises deployments. Companies might believe that implementing multiple clouds eliminates the need to continue running an on-premises architecture; rather, hybrid multi-cloud extends the on-premises capabilities of enterprises. Users can continue to keep their information secure on a physical architecture while taking advantage of the power that a cloud environment provides.
The biggest way hybrid multi-cloud will impact cloud computing isn't just the ability for enterprises to run a workflow in any environment. It is also a sign that the cloud will become more open in the future.
An open hybrid cloud gives businesses flexibility, control, and choice, and leaves them open to innovation. It makes the most of existing resources while allowing systems, applications, and data to work together seamlessly across public cloud, private cloud, and on-premises environments.
An open hybrid cloud also opens doors and opportunities for enterprises to better manage, control, and decide on CAPEX vs. OPEX, build vs. buy, cloud bursting vs. scaling up vs. scaling out, innovate vs. renovate strategies, and much more.
The next-generation DPDHL Data Lake architecture will be designed to fully leverage an open, hybrid, multi-cloud environment.
The Hub & Spoke Architecture
Traditionally, in the context of information systems, Hub & Spoke (H&S) is an approach to data integration and, more generally, to data management. In the broader context of big data, H&S is also an approach to analytics.
The spoke-hub distribution paradigm was introduced as a form of transport topology optimization in which traffic planners organize routes as a series of "spokes" that connect outlying points to a central "hub".
H&S is a conceptual model representing a strong foundation for the distribution and specialization of data and computing power while still guaranteeing centralized control and management of resources.
H&S represents a flexible model for businesses that need to embrace and deal with incremental innovation in a very liquid and polyglot world of big data and analytics.
The hub represents the centralized source for common data and services such as identity management, logging, monitoring, and others. Each spoke represents a specialized and "isolated" source of processing and analytics such as GPU computing, a Data Science Workbench, dedicated Spark clusters, etc.
H&S also represents a flexible model for businesses that want to leverage the cloud. A dedicated Google TensorFlow cluster is "just" a spoke instance.
H&S also represents a way forward for businesses that need to comply with local regulations or deal with the problems associated with data gravity, the game of cloud, etc.
A multi-H&S setup enables businesses to achieve geographic scale in a consistent, flexible, and coherent way.
Nevertheless, all that glitters is not gold. H&S-based architectures require a lot of discipline concerning automation, orchestration, and isolation.
In this context, isolation in particular refers to two fundamental concepts:
The hub and spoke model also helps organizations leverage and organize their hybrid (on-premises/cloud) infrastructure into multiple connected environments depending on their needs.
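As an illustration only, the following Python sketch (all names are hypothetical) models a central hub that exposes shared services such as identity, logging, and monitoring, with specialized spokes registered against it across on-premises and cloud locations:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spoke:
    """A specialized, isolated processing environment (GPU cluster, Spark, workbench, ...)."""
    name: str
    workload: str   # e.g. "gpu-compute", "spark", "data-science-workbench"
    location: str   # "on-premises" or a cloud region, e.g. "aws-eu-central-1"


@dataclass
class Hub:
    """Central hub: shared services and a registry of all attached spokes."""
    shared_services: List[str] = field(
        default_factory=lambda: ["identity", "logging", "monitoring"]
    )
    spokes: List[Spoke] = field(default_factory=list)

    def register(self, spoke: Spoke) -> None:
        # Every spoke is attached to, and governed through, the central hub.
        self.spokes.append(spoke)


hub = Hub()
hub.register(Spoke(name="ds-workbench", workload="data-science-workbench", location="on-premises"))
hub.register(Spoke(name="training", workload="gpu-compute", location="gcp-europe-west4"))
print([(s.name, s.location) for s in hub.spokes])
```

The point of the sketch is only the shape of the relationship: each spoke is specialized and isolated, yet attached to a single hub that provides the common services and keeps central control.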
The H&S model works very well with "new" architectural principles and patterns such as Data Mesh and Architecture Quanta; in fact, it actively encourages them.
Data Mesh Architecture
Companies tend to fall into the trap of confusing simply moving IT systems to the cloud with the transformational strategy needed to get the full value of the cloud.
Lifting and shifting legacy applications to the cloud will not automatically yield the benefits that cloud infrastructure and systems can provide. In some cases, it can result in IT architectures that are more complex, cumbersome, and costly than before.
The tradeoff between the need for integration and the need for isolation makes things even more complex and harder to accomplish, especially when designing data-aware systems such as data lakes.
The full value of cloud comes from approaching these options not as one-off tactical decisions but as part of a holistic strategy to pursue digital transformation.
Although the cloud is not a prerequisite, it certainly acts as a force multiplier. Companies that view cloud capabilities in this way can create a next-generation IT capable of enabling business growth and innovation in the rapidly evolving digital era.
A "new" architectural approach based on a modern, distributed architecture for analytical data management has emerged under the name of Data Mesh.
Data mesh enables end users to easily access and query data where it lives without first transporting it to a data lake or data warehouse. The decentralized strategy of data mesh distributes data ownership to domain-specific teams that manage, own, and serve the data as a product.
The main objective of data mesh is to eliminate the challenges of data availability and accessibility at scale. Data mesh allows business users and data scientists alike to access, analyze, and operationalize business insights from virtually any data source, in any location, without intervention from expert data teams.
The data mesh paradigm arises from the insight that centralized, monolithic data architectures suffer from some inherent problems:
Data mesh aims to solve these problems by making organizational units (called “domains”) responsible for managing and exposing their own data to the rest of the organization.
Data Mesh is a cultural and technical concept that distributes data ownership across product domain teams, with a centralized data infrastructure and decentralized data (as) products.
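As a minimal sketch of the decentralized ownership idea (the catalog, domain names, and product names below are hypothetical), each domain could publish its own data products into a thin central catalog used for discovery only, while the data itself stays with the owning domain:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataProduct:
    """Analytical data exposed by one domain to the rest of the organization."""
    domain: str     # owning domain team, e.g. "shipments"
    name: str       # product name, e.g. "shipments.daily_volumes"
    endpoint: str   # where consumers query it (table, API, topic, ...)


class MeshCatalog:
    """Thin central catalog for discovery; ownership and serving stay with the domains."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Each domain publishes, and remains accountable for, its own products.
        self._products[product.name] = product

    def discover(self, domain: Optional[str] = None) -> List[DataProduct]:
        return [p for p in self._products.values() if domain is None or p.domain == domain]


catalog = MeshCatalog()
catalog.publish(DataProduct("shipments", "shipments.daily_volumes", "warehouse://shipments/daily_volumes"))
catalog.publish(DataProduct("billing", "billing.invoices", "warehouse://billing/invoices"))
print([p.name for p in catalog.discover()])
```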
Data Mesh Core Principles
There are mainly four high-level foundational principles in the implementation of Data Mesh driven architectures:
Domain Driven Data Ownership
In Domain-Driven Design, a standard language emerges to ensure that information can be used efficiently across the business; this standard, or "ubiquitous", language is central to the idea of domain-driven design (DDD) as a means of removing barriers between developers and domain experts.
DDD is a software design approach focusing on modeling software to match a domain according to input from that domain's experts.
The DDD methodology is often related to microservices, where each domain model's context is very well defined and the share-nothing paradigm is a constituent principle of such an implementation.
With regard to data, the analytics world is taking a different approach compared to the operational world. While in the operational world sharing data directly, in its raw original format, across different domains is generally a bad practice, in the analytics domain it is actually a strongly recommended practice.
Isolation, including data isolation, is a core architectural principle for the design and implementation of highly scalable microservices. Isolation enables engineers to independently and atomically develop, deploy, and scale functionalities, reducing the risk and the impact of affecting, and therefore breaking, neighboring services.
Sharing data in its raw original format across different domains breaks the data isolation principle, and it may have an impact on the overall capacity of the system to independently scale and evolve in the medium/long term.
Unlocking the power of data in the so-called data democratization process requires a certain degree of flexibility in the analytics world.
Software architects should find the right balance between the need to isolate and the need to share. They should therefore put in place the appropriate design patterns, for instance anti-corruption layers, data virtualization, etc., in order to better accommodate the need to easily and efficiently evolve and scale the system in the face of ever-changing and new requirements.
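A minimal Python sketch of an anti-corruption layer, assuming a hypothetical "shipments" producer domain and a consuming domain with its own model, might look like the following; it only illustrates the pattern of translating the producer's raw format at the boundary:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Raw record shape exposed by the (hypothetical) "shipments" producer domain,
# e.g. {"SHP_ID": "123", "ORIG": "AMS", "DEST": "BER", "WGT_KG": "12.5"}
RawShipment = Dict[str, str]


@dataclass(frozen=True)
class Shipment:
    """The consuming domain's own model; the producer's raw schema never leaks past the layer."""
    shipment_id: str
    origin: str
    destination: str
    weight_kg: float


class ShipmentAntiCorruptionLayer:
    """Translates the producer's raw format into the consumer's model,
    isolating the consumer from upstream schema changes."""

    def translate(self, raw: RawShipment) -> Shipment:
        return Shipment(
            shipment_id=str(raw["SHP_ID"]),
            origin=raw["ORIG"],
            destination=raw["DEST"],
            weight_kg=float(raw["WGT_KG"]),
        )

    def translate_many(self, raws: Iterable[RawShipment]) -> List[Shipment]:
        return [self.translate(r) for r in raws]


acl = ShipmentAntiCorruptionLayer()
print(acl.translate({"SHP_ID": "123", "ORIG": "AMS", "DEST": "BER", "WGT_KG": "12.5"}))
```

If the upstream domain renames or reshapes its raw fields, only the translation layer changes; the consumer's model, and everything built on it, stays stable.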
Data Products vs Data As Products
Any digital product or feature can be considered a "data product" if it uses data to facilitate a goal. For example, the home page of a digital newspaper can be a data product if the news items featured on the home page I see are dynamically selected based on my previous navigation data.
One of the principles of the data mesh paradigm is to consider data as a product.
"Data as a product" is a subset of all possible data products: it is the result of applying product thinking to datasets, making sure they have a series of capabilities including discoverability, security, explorability, understandability, trustworthiness, etc.
In other words, data as a product expects that the analytical data provided by the domains is treated as a product, and that the consumers of that data are treated as customers.
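Applying product thinking to a dataset can be made concrete as an explicit contract. The following sketch (field names and values are hypothetical) captures the kinds of capabilities listed above as metadata a consumer can rely on:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataAsProductContract:
    """Product thinking applied to a dataset: what a consumer ('customer') can rely on."""
    name: str                       # discoverability: a stable, findable name
    owner: str                      # accountability: the owning domain team
    description: str                # understandability
    schema: Dict[str, str]          # explorability: field name -> type
    access_policy: str              # security, e.g. "role:analyst"
    freshness_sla_hours: int        # trustworthiness: how stale the data may be
    quality_checks: List[str] = field(default_factory=list)


contract = DataAsProductContract(
    name="shipments.daily_volumes",
    owner="shipments-domain",
    description="Daily shipment counts per route",
    schema={"route": "string", "day": "date", "volume": "int"},
    access_policy="role:analyst",
    freshness_sla_hours=24,
    quality_checks=["no_null_route", "volume >= 0"],
)
print(contract.name, contract.owner)
```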
Types of Data Products
In general, we can distinguish data sources into three categories:
Consolidated and Derived data can be further distinguished into the following two categories:
Federated Governance
Data mesh implies being able to decentralize data, organizing it along domain-driven lines, with each domain owning its own data, which it treats as a product consumed by the rest of the organization.
Federated data governance in a data mesh describes a situation in which data governance standards are defined centrally, but local domain teams have the autonomy and resources to execute these standards in whatever way is most appropriate for their particular environment.
Nevertheless, data governance is more a process than a technology, although technology can help to support that process.
The term "federated governance" very often misleads organizations into believing that adopting one technology rather than another will solve a very difficult problem related to the governance of data.
Data governance is not about changing a technology but it is mostly about changing organizational culture.
Federated Governance is about giving data providers (the organizational units responsible for the ownership of data) the responsibility to provide data as products, with discoverability, security, explorability, understandability, trustworthiness, etc.
Data governance is about establishing data management practices and processes that ensure that the data provided by each domain is of the highest quality, from a consumer perspective.
Data governance is not about technology (technology is a means) but it is about organizational processes and best practices.
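Technology is only a means, but a small sketch can still illustrate the division of responsibilities: standards (policies) are defined centrally, while each domain declares and manages its own data products against them. All names below are hypothetical:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataProductMetadata:
    """Metadata a domain team declares for one of its own data products."""
    name: str
    owner: str
    retention_days: int = 0
    pii_fields: List[str] = field(default_factory=list)


class GovernancePolicy(ABC):
    """A standard defined centrally; every domain must satisfy it in its own way."""

    @abstractmethod
    def check(self, meta: DataProductMetadata) -> List[str]:
        """Return a list of violations (empty means compliant)."""


class OwnershipPolicy(GovernancePolicy):
    def check(self, meta: DataProductMetadata) -> List[str]:
        return [] if meta.owner else [f"{meta.name}: missing owner"]


class RetentionPolicy(GovernancePolicy):
    """Central standard: retention must be declared; each domain chooses the value."""

    def check(self, meta: DataProductMetadata) -> List[str]:
        return [] if meta.retention_days > 0 else [f"{meta.name}: no retention declared"]


CENTRAL_POLICIES: List[GovernancePolicy] = [OwnershipPolicy(), RetentionPolicy()]


def audit(products: List[DataProductMetadata]) -> List[str]:
    """Run the centrally defined checks against domain-provided metadata."""
    return [v for policy in CENTRAL_POLICIES for m in products for v in policy.check(m)]


violations = audit([DataProductMetadata(name="shipments.daily", owner="logistics-domain", retention_days=365)])
print(violations)  # [] -> compliant
```

The center owns the rules and the audit; the domains own the data, the metadata, and the way compliance is achieved.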
The Architecture Quanta
An architectural quantum is an independently deployable component with high functional cohesion.
An example of a quantum might be Hive together with HiveServer2 and the Hive Metastore, which are functionally dependent on each other.
A Data Lake spoke is a specific and specialized architectural quantum that can scale and evolve independently from the rest of the system.
Consumers can share a spoke or have their own dedicated spoke. Consumers can use different versions of a spoke. Consumers can run spokes on-premises or in the cloud depending on their own needs or on local regulations. Portability is therefore an important requirement for spokes. Portability will be discussed in the Orchestration & Automation chapter.
Architects deal with architectural quanta, the parts of a system held together by hard-to-break forces.
One of the keys to building evolutionary architectures lies in determining natural component granularity and coupling between components to fit the capabilities they want to support via the software architecture.
Container-based architectures enable and facilitate the adaptability of architectural quanta. The deployment unit can be a "mini data lake": a dedicated or specialized spoke (H&S) serving a specific business purpose.
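As a rough illustration of such a deployment unit, the following sketch uses the Docker SDK for Python to start a small "mini data lake" spoke as a set of containers on a private network. Image names, container names, and settings are purely illustrative, and the sketch assumes a local Docker daemon and the docker Python package are available:

```python
import docker  # pip install docker; requires a running Docker daemon

client = docker.from_env()

# Private network so the spoke's services only talk to each other.
client.networks.create("analytics-spoke-net", driver="bridge")

# A small database backing the spoke's catalog/metadata (illustrative image and credentials).
client.containers.run(
    "postgres:15",
    name="spoke-metastore-db",
    detach=True,
    network="analytics-spoke-net",
    environment={"POSTGRES_PASSWORD": "changeme", "POSTGRES_DB": "metastore"},
)

# A single-node Spark service acting as the spoke's processing engine
# (any Spark image could be used; this one is just an example).
client.containers.run(
    "bitnami/spark:latest",
    name="spoke-spark",
    detach=True,
    network="analytics-spoke-net",
    ports={"4040/tcp": 4040},
)

print([c.name for c in client.containers.list()])
```

Because the whole spoke is described as a set of container definitions, the same unit can be torn down, versioned, or redeployed on-premises or in a cloud environment, which is exactly the portability property required of spokes.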