Enterprise Data Lake. How not to turn it into a Data Swamp

Data Lake vs Data Warehouse

The modern Data Lake and the traditional Data Warehouse are completely different approaches to enterprise data storage and analytics, but they are not antagonistic. Depending on its goals, an organization can benefit from using both approaches together or just one of them. So read the “vs” here as a comparison, not a contradiction.

What does it all mean at a glance from the business side, without technical architecture details?

Purposes

Common Purposes of both

Main: One-point data source for Decision Management Support.

What does it mean?

  • Decision Management Support includes data gathering, data mining, analysis, reporting, modeling, prediction and planning - everything that helps an organization make and maintain its decisions.
  • One-point means that all services and domains of an organization have a single point of entry to all the data they need to support their decisions.

Why do we need it?

  • All analytics and decisions across the whole organization are based on the same data. Without this consolidated data, plans and forecasts from different departments can be based on different data entities with different content and consistency, which leads to different estimates; in the end, with these mismatched pieces of the puzzle, it is impossible to assemble the entire enterprise-level picture.

Secondary: Separate Operational (Transactional) data from data used for Analytical purposes.

Why do we need it?

  • Operational and Analytical data have different purposes, structure and life-cycles.
  • Keeping them together harms the performance of both.

Differences in approaches

(Comparison table: Data Lake vs Data Warehouse.)

Architecture

Data Warehouse Architecture

All incoming requests are processed by the Operational (Transactional) System and then exported to the Data Warehouse in a form that is ready for use by Analytics Services.


Fig 1. Data Warehouse Architecture.
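
As a purely illustrative sketch of that export step (the table names, columns and values below are hypothetical, and in-memory SQLite stands in for both the operational store and the warehouse), the operational data is aggregated into an analytics-ready table before any Analytics Service touches it:

```python
import sqlite3

# In-memory SQLite databases stand in for a real operational store and a real warehouse.
operational = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Hypothetical operational table with raw transactions.
operational.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, created_at TEXT)")
operational.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "acme", 120.0, "2024-01-10"), (2, "acme", 80.0, "2024-01-11"), (3, "globex", 50.0, "2024-01-11")],
)

# The warehouse receives data already shaped for analysis: daily revenue per customer.
warehouse.execute("CREATE TABLE daily_revenue (day TEXT, customer TEXT, revenue REAL)")
rows = operational.execute(
    "SELECT created_at, customer, SUM(amount) FROM orders GROUP BY created_at, customer"
).fetchall()
warehouse.executemany("INSERT INTO daily_revenue VALUES (?, ?, ?)", rows)

# Analytics Services query the prepared table with minimal further transformation.
print(warehouse.execute("SELECT * FROM daily_revenue ORDER BY day, customer").fetchall())
```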

Data Lake Architecture

All incoming and outgoing requests that go through the Data Lake are saved there as-is.


Fig 2. Data Lake Architecture.
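
A minimal sketch of what “saved there” can look like, assuming a local folder stands in for the object storage behind the lake and that the folder layout and file naming below are just one possible convention:

```python
import json
import pathlib
from datetime import datetime, timezone

# Hypothetical local folder standing in for the lake's object storage (S3, ADLS, GCS, ...).
LAKE_ROOT = pathlib.Path("data_lake/raw/requests")

def save_raw_request(payload: bytes, direction: str) -> pathlib.Path:
    """Persist the request body exactly as received, partitioned by direction and date."""
    now = datetime.now(timezone.utc)
    partition = LAKE_ROOT / direction / now.strftime("%Y-%m-%d")
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{now.strftime('%H%M%S%f')}.json"
    path.write_bytes(payload)  # no parsing, no cleaning, no schema applied
    return path

# Example: an incoming request passes through and is stored untouched.
incoming = json.dumps({"device": "wearable-42", "heart_rate": 71}).encode()
print(save_raw_request(incoming, direction="incoming"))
```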

When an Operational System consists of physically or logically different modules (for example, a Management Module and a Manufacturing/Production Module), it can be worthwhile to store inter-service requests in the Data Lake, too.


Fig 3. Data Lake Architecture with inter-services requests.

Data Lake + Data Warehouse Architecture

For a better analytics outcome, it is possible to use both approaches together. In this case, the raw incoming and outgoing request data in the Data Lake is complemented with the processed and prepared data that the Operational System exports into the Data Warehouse.


Fig 4. Data Lake + Data Warehouse Architecture.
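
To make the combination concrete, here is a small self-contained sketch (all names and values are made up) in which prepared warehouse data and raw lake records are combined to answer one analytics question:

```python
import sqlite3

# In-memory warehouse table with prepared, analysis-ready figures (hypothetical schema).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE daily_revenue (day TEXT, customer TEXT, revenue REAL)")
warehouse.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?, ?)",
    [("2024-01-11", "acme", 200.0), ("2024-01-11", "globex", 50.0)],
)

# Raw request records as they would be read back from the Data Lake, still untouched.
lake_records = [
    {"stored_at": "2024-01-11T09:00:00Z", "payload": {"device": "wearable-42"}},
    {"stored_at": "2024-01-11T09:05:00Z", "payload": {"device": "iot-7"}},
]

# The Analytics Service itself prepares the raw side: count requests per day.
requests_per_day = {}
for record in lake_records:
    day = record["stored_at"][:10]
    requests_per_day[day] = requests_per_day.get(day, 0) + 1

# Combine prepared warehouse figures with the freshly derived lake figures.
for day, revenue in warehouse.execute("SELECT day, SUM(revenue) FROM daily_revenue GROUP BY day"):
    print(day, "prepared revenue:", revenue, "raw requests seen:", requests_per_day.get(day, 0))
```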

Summary

As we can see from the comparison table and architecture diagrams above, the idea of the Data Warehouse is to prepare and store the data in a form convenient for analysis and then use it with minimal further transformation.

In contrast, the idea of the Data Lake is to store the data as-is and leave its preparation to the service that performs the analysis. The advantages of the Data Lake are a larger possible number of Analytics Services, fast implementation of new Analytics Services, and the possibility of using data in new Analytics Services that was never used before. It means that with a Data Lake, when you have new business goals, needs or ideas, you can invent new analytics and build new services very fast.

The main reasons why the Data Lake approach appeared:

Rise in the amount and variety of data. For enterprise systems:

  • Rise in the types of client-side requests. Where previously there were structured requests from portals and 3rd-party systems, now there are also requests from mobile devices and a huge variety of wearable and IoT devices.
  • Rise in the number of client-side requests.

Emergence and spread of NoSQL Databases and Cloud technologies as a response to the first challenge.

There is one more Data Lake advantage: high fault tolerance. Because we store the request data at the very first step of processing, even if the processing server or service is overloaded, the system remembers the request, can respond appropriately to the client, and can process the request later inside the system.
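
A minimal sketch of this store-first pattern, assuming an in-memory queue stands in for the durable raw-request store (in a real system this would be the lake storage itself or a durable queue in front of it):

```python
import uuid
from collections import deque
from datetime import datetime, timezone

# The deque stands in for durable storage of raw requests; names here are illustrative.
raw_request_store = deque()

def handle_request(payload: dict) -> dict:
    """Persist the request before any processing, then acknowledge the client immediately."""
    record = {
        "request_id": str(uuid.uuid4()),
        "stored_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    raw_request_store.append(record)  # step 1: remember the request
    return {"status": "accepted", "request_id": record["request_id"]}  # step 2: acknowledge

def process_backlog() -> None:
    """Runs later, when the processing service has capacity again."""
    while raw_request_store:
        record = raw_request_store.popleft()
        print("processing", record["request_id"], record["payload"])

# Even if the processing service is busy right now, the request is not lost.
print(handle_request({"order": 17, "qty": 3}))
process_backlog()
```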

How not to turn your Data Lake into a Data Swamp

What can go wrong with a Data Lake? It looks pretty easy: just store all the data you receive and send, and everything will be fine. Yes, but pay attention to the following points:

Don’t transform the data before storing it in the Data Lake. All transformations, for any purpose, have to be done later by the services that use the data.

  • Why would someone transform the data? - To optimize and improve the data in the Data Lake. This temptation is especially strong for people who have worked with Data Warehouses, because it is very difficult to change the paradigm.
  • What is the outcome of such an unnecessary transformation? - Different services process and transform the data in different ways, so when you use these data for analysis, you get a mess of partly-transformed records that is difficult to bring to a common form or to restore to the original form.

Attribute the data.

So it’s better not to transform the data, but to improve consistency, quality, and ultimately analysis by adding metadata that simplifies subsequent data tracking and mapping (a sketch of such an envelope follows the list below).

  • First, timestamp every request you store. Date-time handling is one of the most difficult issues for systems spread across different services, storages and devices. It is much easier to figure out when a request was issued with the exact Data Lake storage time than with only the client-side time or the service-processing time.
  • Add attributes that help identify where the request came from: service, node, etc.
  • Give every request an ID so that it can be tracked across the system, including in the Operational Data. If an outgoing request is a response to some particular incoming request, it should be annotated with the ID of that incoming request. If there is a chain of incoming, outgoing and inter-service requests, then besides their own IDs, the requests should be annotated with a chain ID.
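
A minimal sketch of such a metadata envelope, assuming the field names and the gateway/node identifiers below are purely illustrative (the payload itself stays untouched):

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional

def attribute_request(raw_body: bytes, *, service: str, node: str,
                      in_reply_to: Optional[str] = None,
                      chain_id: Optional[str] = None) -> dict:
    """Wrap the untouched request body in a metadata envelope before storing it in the lake."""
    return {
        "request_id": str(uuid.uuid4()),
        "stored_at": datetime.now(timezone.utc).isoformat(),  # exact lake storage time
        "source": {"service": service, "node": node},         # where the request came from
        "in_reply_to": in_reply_to,                            # ID of the incoming request this answers
        "chain_id": chain_id or str(uuid.uuid4()),             # shared ID for the whole request chain
        "body": raw_body.decode("utf-8", errors="replace"),    # payload stays as-is
    }

# An incoming request and the outgoing response that answers it share a chain ID.
incoming = attribute_request(b'{"sensor": "iot-7", "temp": 21.5}',
                             service="ingest-gateway", node="node-3")
outgoing = attribute_request(b'{"ack": true}', service="ingest-gateway", node="node-3",
                             in_reply_to=incoming["request_id"],
                             chain_id=incoming["chain_id"])
print(json.dumps(outgoing, indent=2))
```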
