Enterprise Data Lake. How not to turn it into a Data Swamp
Elena Makurochkina 'Mark'
Data-Driven Decisions / Data Governance / Process Improvement / Complex Systems Integration
Data Lake vs Data Warehouse
The modern Data Lake and the traditional Data Warehouse are completely different approaches to enterprise data storage and analytics, but they are not antagonistic. An organization can benefit from using both approaches together or just one of them, depending on its goals. So let’s read “VS” as a comparison, not a contradiction.
What does each approach look like at a glance from the business side, without technical architecture details?
Purposes
Common Purposes of both
Main: A single point of data access for Decision Management Support.
What does it mean?
Why do we need it?
Secondary: Separate operational (transactional) data from data used for analytical purposes.
Why do we need it?
Differences in approaches
Architecture
Data Warehouse Architecture
All incoming requests are processed by the Operational Transactional System, and the results are then exported to the Data Warehouse in a form that is ready for use by Analytics Services.
Fig 1. Data Warehouse Architecture.
Data Lake Architecture
All incoming and outgoing requests which are going through the Data Lake are saved there.
Fig 2. Data Lake Architecture.
When an Operational System consists of physically or logically separate modules (for example, a Management Module and a Manufacturing-Production Module), it can be worthwhile to store inter-service requests in the Data Lake, too.
Fig 3. Data Lake Architecture with inter-service requests.
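The capture points described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `lake` list stands in for real Data Lake storage, and `capture`, `with_lake`, `management_module`, and `manufacturing_module` are hypothetical names that do not come from the article.

```python
import json
from typing import Callable

# Hypothetical in-memory stand-in for Data Lake storage (assumption);
# a real lake would sit on object storage or a distributed file system.
lake: list[dict] = []


def capture(direction: str, raw: str) -> None:
    """Append the message exactly as it was received or sent."""
    lake.append({"direction": direction, "raw": raw})


def with_lake(handler: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an operational handler so its request and response are both saved."""
    def wrapped(raw_request: str) -> str:
        capture("incoming", raw_request)
        raw_response = handler(raw_request)
        capture("outgoing", raw_response)
        return raw_response
    return wrapped


def manufacturing_module(raw_request: str) -> str:
    """Hypothetical downstream module of the Operational System."""
    order = json.loads(raw_request)
    return json.dumps({"order_id": order["order_id"], "status": "scheduled"})


def management_module(raw_request: str) -> str:
    """Hypothetical front module; its call to manufacturing is saved as well."""
    capture("inter-service", raw_request)
    return manufacturing_module(raw_request)


handle = with_lake(management_module)
response = handle('{"order_id": 17, "item": "valve"}')
```

After the single call, the lake holds three records (incoming, inter-service, outgoing), and each keeps the original message text untouched.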
Data Lake + Data Warehouse Architecture
For a better analytics outcome, it’s possible to use both approaches together. In this case, the raw incoming and outgoing request data from the Data Lake is complemented with the processed and prepared data that the Operational System exports into the Data Warehouse.
Fig 4. Data Lake + Data Warehouse Architecture.
Summary
As we see from the comparison table and architecture diagrams above, the idea of a Data Warehouse is to prepare and store the data in a form convenient for analysis and then use it with minimal transformation.
In contrast, the idea of a Data Lake is to store the data as is and leave its preparation to the service that performs the analysis. The advantages of a Data Lake are a larger possible number of Analytics Services, fast implementation of new Analytics Services, and the possibility of using previously unused data in new Analytics Services. This means that when you have new business goals, needs, or ideas, a Data Lake lets you invent new analytics and build new services very quickly.
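A small Python sketch may make the "store as is, prepare at analysis time" idea more concrete. The records, field names, and both analytics functions below are invented for illustration; the point is only that each service parses the same raw data in its own way, and that a later service can use a field nobody analyzed before.

```python
import json

# Hypothetical raw requests, stored as is (one JSON string per request).
raw_records = [
    '{"order_id": 1, "item": "valve", "qty": 3, "region": "EU", "channel": "web"}',
    '{"order_id": 2, "item": "pump", "qty": 1, "region": "US", "channel": "phone"}',
    '{"order_id": 3, "item": "valve", "qty": 2, "region": "EU", "channel": "web"}',
]


def quantity_by_item(records):
    """First analytics service: prepares the data it needs at read time."""
    totals = {}
    for raw in records:
        rec = json.loads(raw)
        totals[rec["item"]] = totals.get(rec["item"], 0) + rec["qty"]
    return totals


def orders_by_channel(records):
    """A later service that uses a field the first service never touched."""
    counts = {}
    for raw in records:
        channel = json.loads(raw).get("channel", "unknown")
        counts[channel] = counts.get(channel, 0) + 1
    return counts
```

Because nothing was dropped or reshaped at storage time, adding `orders_by_channel` requires no change to how the data was collected.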
Main reasons why Data Lake approach appeared:
Rise in the amount and variety of data. For enterprise systems:
Emergence and spread of NoSQL Databases and Cloud technologies as a response to the first challenge.
There is one more Data Lake advantage: high fault tolerance. Because the request data is stored at the very first stage of processing, even if the processing server or service is overloaded, the system remembers the request, can respond appropriately to the client, and can process the request later inside the system.
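The store-first behavior can be sketched as follows. This is a simplified model, not a production design: the `Intake` class, its `capacity` parameter, and the in-memory list standing in for durable Data Lake storage are all assumptions made for the example.

```python
from collections import deque


class Intake:
    """Stores every request first, then processes as capacity allows."""

    def __init__(self, capacity: int):
        self.lake = []            # stand-in for durable Data Lake storage
        self.pending = deque()    # positions of stored but unprocessed requests
        self.capacity = capacity  # requests the processor can handle per cycle

    def receive(self, raw_request: str) -> str:
        # The request is saved before any processing is attempted,
        # so the client can be answered even when the processor is busy.
        self.lake.append(raw_request)
        self.pending.append(len(self.lake) - 1)
        return "accepted"

    def process_cycle(self, process) -> list:
        # Process at most `capacity` stored requests; the rest wait in the lake.
        done = []
        for _ in range(min(self.capacity, len(self.pending))):
            index = self.pending.popleft()
            done.append(process(self.lake[index]))
        return done
```

For example, with `Intake(capacity=2)` and five incoming requests, all five receive an immediate "accepted" reply, two are processed in the first cycle, and the remaining three are picked up in later cycles without being lost.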
How not to turn your Data Lake into a Data Swamp
What can go wrong with a Data Lake? It looks pretty easy: just store all the data you receive and send, and everything will be fine. Yes, just pay attention to the following points:
Don’t transform the data before storing it in the Data Lake. All transformations, for any purpose, have to be done by the services that use the data later.
Attribute the data.
So it’s better not to transform the data, but to improve consistency, quality, and ultimately analysis by adding some attributes that simplify subsequent data tracking and mapping.
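One possible way to attribute data without transforming it is to wrap each raw payload in an envelope of metadata. The sketch below is an assumption about what such attributes might be; the `attribute` function and every field name in it (`record_id`, `received_at`, `source`, `direction`, `correlation_id`, `sha256`) are hypothetical and not prescribed by the article.

```python
import hashlib
import uuid
from datetime import datetime, timezone
from typing import Optional


def attribute(raw_payload: bytes, source: str, direction: str,
              correlation_id: Optional[str] = None) -> dict:
    """Wrap a raw payload in descriptive attributes; the payload stays untouched."""
    return {
        "record_id": str(uuid.uuid4()),                         # unique per stored record
        "received_at": datetime.now(timezone.utc).isoformat(),  # when it was captured
        "source": source,                                       # which system or module sent it
        "direction": direction,                                 # e.g. "incoming" or "outgoing"
        "correlation_id": correlation_id or str(uuid.uuid4()),  # links related messages
        "sha256": hashlib.sha256(raw_payload).hexdigest(),      # lets readers verify integrity
        "payload": raw_payload,                                 # stored exactly as received
    }


request = attribute(b'{"order_id": 17}', source="web-shop", direction="incoming")
reply = attribute(b'{"status": "ok"}', source="management", direction="outgoing",
                  correlation_id=request["correlation_id"])
```

Here the shared `correlation_id` lets a later Analytics Service map a response back to its request, while the checksum makes it possible to confirm that the stored payload was never altered.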