Thinkers Lounge – Guiding Principles for a Plug-and-Play Data Lake
As referenced in Anbu’s blog, this is the first article in the Data Engineering stream of the Thinkers Lounge.
By the end of the article, readers can expect to understand why designing a Data Lake solution to operate in Plug & Play mode is essential in data engineering.
Although the Data Engineering team’s focus has always been on data and its associated operations, contemporary Data Lakes are designed primarily as centralized repositories for data storage. This has made data extraction the core of most Data Lake solutions and pushed Data Harvest down the priority list. However, this runs contrary to the business objective, which is to start small, vet the efficacy of the lake by harvesting meaningful insights, and then expand with new data sources. If data storage were the primary business objective, any expandable cloud storage such as Amazon S3 or Azure ADLS would suffice. A gap therefore exists, and bridging it requires infusing the design principles of a Data Lake solution with fresh ideas and modern technologies.
First and foremost is designing the operational model of the Data Lake. This warrants a three-tier architecture comprising Ingestion, Curation and Harvest modules. In a conventional data lake, designing the Ingestion layer is the most intricate part. Firstly, given the heterogeneous nature of the data, designing a progressive change data capture (CDC) solution is arduous. Secondly, contriving multiple data transmission channels at several junctures in a process flow is laborious. Finally, an extraction solution built at the application tier requires augmenting that tier with additional computational power to manage the extra workload; after an unplanned outage, the application tier must also pull its own weight to clear the backlog while preserving metadata and data resiliency. All these intricacies point to driving data extraction from the database tier instead. Many commercially available Replication and CDC products perform this task seamlessly: they plug their configuration into the operational database to detect and capture the insertions, updates and deletions applied to its tables, as sketched below.
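To make the idea concrete, here is a minimal sketch of how a consumer on the lake side might drain such change events. It assumes a Debezium-style envelope ("op", "before", "after") delivered over a Kafka topic; the broker address, topic name and the upsert/delete stubs are hypothetical placeholders, not part of any specific product.

```python
import json
from confluent_kafka import Consumer

# Assumed connection details; replace with the real CDC conduit for your environment.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # hypothetical broker address
    "group.id": "lake-ingestion",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.cdc"])           # hypothetical CDC topic


def upsert_row(row: dict) -> None:
    # Placeholder: a real solution would write to the NoSQL persistence layer.
    print("UPSERT", row)


def delete_row(row: dict) -> None:
    # Placeholder: a real solution would remove the row from the persistence layer.
    print("DELETE", row)


def apply_change(event: dict) -> None:
    """Route one change event captured at the database tier."""
    op = event.get("op")                     # 'c' = insert, 'u' = update, 'd' = delete
    if op in ("c", "u"):
        upsert_row(event["after"])
    elif op == "d":
        delete_row(event["before"])


try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        apply_change(json.loads(msg.value()))
finally:
    consumer.close()
```

The point of the sketch is that the application tier is untouched: changes are captured at the database tier and replayed into the lake, so backlog recovery after an outage is simply a matter of resuming from the last committed offset.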
Once the ingestion conduit is plugged in, the next task is designing the Persistence layer for data play. To achieve near real-time data fetch, the design requires millisecond read latency even at terabyte scale, so a farm of NoSQL databases forms the foundation of the persistence layer. With multiple ingestion channels pumping data into the NoSQL farm, it is the Curation layer that acts as the core of the solution. This layer is responsible for enriching and co-mingling data across applications and channels to generate data insights and provide analytical outputs. With Machine Learning adoption accelerating rapidly, feeding data to ML models demands further consideration. This calls for a robust, non-intrusive framework built on an open-source cluster-computing framework. It should be expandable to accommodate new data consumers, enabling seamless integration without changing the underlying data platform. Finally, it needs to support a wide range of data processing modes such as streaming, micro-batching, mass load and one-time load (OTL). Each tier thus operates independently while remaining loosely coupled with the others, and each must be expandable through elasticity and horizontal scaling; a sketch of such a curation job follows below.
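As an illustration only, the following sketch assumes Apache Spark Structured Streaming as the cluster-computing framework and Kafka as the ingestion conduit; the topic name, schema, micro-batch cadence and sink paths are hypothetical, and a Parquet path stands in for the NoSQL farm.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Requires the spark-sql-kafka connector package on the classpath.
spark = SparkSession.builder.appName("curation-layer").getOrCreate()

# Hypothetical schema for one ingestion channel.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read raw change events from the ingestion conduit (assumed Kafka topic).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
       .option("subscribe", "orders.cdc")                    # hypothetical topic
       .load())

# Curate: parse the payload and keep only the enriched, typed columns.
curated = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), order_schema).alias("o"))
           .select("o.*"))

# Write curated output downstream in micro-batches.
query = (curated.writeStream
         .format("parquet")
         .option("path", "/lake/curated/orders")          # hypothetical location
         .option("checkpointLocation", "/lake/_chk/orders")
         .trigger(processingTime="30 seconds")             # micro-batching cadence
         .start())

query.awaitTermination()
```

Because the curation job only subscribes to the conduit and writes to its own sink, a new consumer (an ML feature pipeline, a reporting extract, another harvest channel) can be added as a separate job without touching the ingestion or persistence tiers.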
That covers today’s design considerations for a Plug & Play Data Lake solution. Stay tuned for the next article in the series.