LakeHouse
Ajmal Karuthakantakath
Head of Developer Platforms and Tools at Toyota Financial Services Corporation.
The traditional data lake approach is too simplistic. It is built on the notion of dumping data into flat storage such as S3, HDFS, or a cloud object store, deferring the architectural decision of how to extract insight from the data. Some of the primary reasons to create an analytical platform in the first place are to serve the following key business needs:
(a) provide quick, actionable insight via real-time analytics (Snowflake, Redshift, Druid, ClickHouse, etc.)
(b) support traditional batch and near-real-time reporting (Tableau, Looker reports, etc.)
(c) run machine learning models on the data (Apache Spark, Flink ML libraries, Jupyter notebooks, SageMaker, etc.)
(d) meet regulatory, compliance, and audit needs
(e) provide data and metrics insight for production support, e.g. application observability and traceability (Grafana, Prometheus, etc.)
The data lake pattern starts to break down as this list of key capabilities grows. To satisfy these needs, the traditional data lake approach forces teams to copy data to various places through complex, custom-built ETL pipelines. Some of the issues with this approach:
(a) data consistency: the data copied to the real-time insight layer may not be the same as in the machine learning or regulatory compliance layer, and business users, customers, and developers start to lose confidence in the data
(b) developers supporting production software do not get accurate information from Prometheus/Grafana dashboards
(c) latency issues, as there are many loosely built custom pipelines
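The consistency issue in (a) comes from each downstream layer being loaded from its own snapshot of the source at its own time. A minimal sketch (plain Python; all names are made up for illustration) of two snapshot-based copies drifting apart:

```python
# Hypothetical illustration of snapshot-based ETL drift: two downstream
# layers copy the same source table at different times and end up disagreeing.
source = {}  # the "data lake" table: record_id -> value

def snapshot(table):
    """A snapshot-based ETL copy: whatever the table holds right now."""
    return dict(table)

# t0: source has one record; the real-time insight layer copies it
source["cust-1"] = {"balance": 100}
realtime_layer = snapshot(source)

# t1: the record mutates before the ML layer's nightly ETL runs
source["cust-1"] = {"balance": 250}
ml_layer = snapshot(source)

# The two downstream copies now disagree about the same customer
print(realtime_layer["cust-1"])  # {'balance': 100}
print(ml_layer["cust-1"])        # {'balance': 250}
```

Both copies were faithful to the source at the moment they ran; the inconsistency is inherent to the snapshot-copy pattern, not to any single pipeline bug.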
(d) bugs start to creep into the ETL pipelines; on top of that, a software vulnerability in a specific library forces every ETL pipeline team to patch quickly
(e) one fine day, a new GDPR requirement arrives: customer personal information in all layers must be removed on withdrawal of consent
(f) a new security requirement arrives to mask an additional personally identifiable field, e.g. date of birth, in all data items
(g) the finance team needs to consolidate data across various sources, so yet another ETL pipeline has to be spun up
This is a perfect storm where tens to hundreds of engineers are needed to address all these needs, a program management committee is needed to manage and report progress and roadblocks, and developers run around to meet timelines. These resources could instead be building a competitive edge for the business. This is an example of accidental architectural complexity, and it started with the wrong software abstraction: an innocent data lake.
Uber engineering went through the same concerns and came up with the concept called LakeHouse, where the data lake is an intelligent layer that:
(a) provides transactional guarantees on upsert operations
(b) is a low-latency platform
(c) provides a good set of integration libraries to various source data systems, and works well with Apache Debezium
(d) emits transactional streams of changes as Kafka events to downstream systems as and when data mutates in the LakeHouse, using the change data capture (CDC) concept; this lets downstream systems absorb the delta of changes almost instantaneously, rather than depending on custom-built ETL pipelines that move snapshots of data and may take hours to complete
(e) delivers consistent data to all downstream systems, which only need to listen to the change stream Kafka topic
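Point (d) can be made concrete with a minimal sketch of a downstream consumer applying a stream of change events instead of reloading snapshots. This is plain Python for illustration only; in a real deployment the events would arrive on a Kafka topic emitted by the LakeHouse, and the event shape here is made up:

```python
# Hypothetical CDC-consumer sketch: a downstream system keeps its
# materialized view current by applying upsert/delete events keyed by
# record id, instead of periodically copying a full snapshot.

def apply_change_event(view, event):
    """Apply a single change event to a downstream materialized view."""
    key = event["key"]
    if event["op"] == "upsert":
        view[key] = event["value"]   # insert new record or update in place
    elif event["op"] == "delete":
        view.pop(key, None)          # remove the record if present
    return view

change_stream = [
    {"op": "upsert", "key": "cust-1", "value": {"balance": 100}},
    {"op": "upsert", "key": "cust-2", "value": {"balance": 50}},
    {"op": "upsert", "key": "cust-1", "value": {"balance": 250}},  # mutation
    {"op": "delete", "key": "cust-2"},                             # removal
]

downstream_view = {}
for event in change_stream:
    apply_change_event(downstream_view, event)

print(downstream_view)  # {'cust-1': {'balance': 250}}
```

Because every downstream system replays the same ordered change stream, they all converge to the same state, which is the consistency property in (e).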
(f) a new GDPR requirement is easy to implement at the LakeHouse layer, and the changes propagate downstream accordingly
(g) offers schema enforcement and governance
The Uber engineering team has open sourced this effort under the name Apache Hudi. This podcast explains the concept of the LakeHouse in depth.
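The schema enforcement idea in (g) can be sketched in a few lines. This is a hypothetical illustration of the concept, not the actual Hudi API: records that do not match the declared schema are rejected at the LakeHouse layer instead of silently polluting every downstream copy:

```python
# Hypothetical schema-enforcement sketch (not the Hudi API): the
# LakeHouse layer validates incoming records against a declared schema
# before committing them.

SCHEMA = {"customer_id": str, "balance": int}

def validate(record, schema=SCHEMA):
    """True if the record has exactly the declared fields with the declared types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], typ) for field, typ in schema.items())

good = {"customer_id": "cust-1", "balance": 250}
bad = {"customer_id": "cust-2", "balance": "lots"}  # wrong type, rejected

accepted = [r for r in (good, bad) if validate(r)]
print(len(accepted))  # 1 -- only the conforming record is committed
```

Enforcing the schema once, at the write path, is what spares every downstream consumer from re-implementing the same validation.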
There is an alternate solution available from Databricks called Delta Lake; details are here.
AWS offers a solution called Lake Formation; details are here.