How Companies Are Implementing the "DATA LAKE"

The era of data processing applications started in 1959 with the launch of COBOL (Common Business Oriented Language) and evolved further thanks to the RDBMS (Relational Database Management System). The entire world started using RDBMS and eventually began facing challenges with "Big Data".

Many developer communities appreciated Google for releasing its white papers on the "Google File System" and the MapReduce framework/programming model to tackle the Big Data problem. The world adopted the same strategy and created the open-source project "Hadoop". Hadoop was the first attempt at a Data Lake implementation. Hadoop, too, is slowly losing its charm, and the world is moving toward the cloud. Modern data management now has many components, and I would like to discuss how companies are leveraging them.

Data Warehouse vs. Hadoop (First Data Lake Solution)


What is a DATA LAKE?

A Data Lake is a central repository that can store structured, unstructured, and semi-structured data in raw format and process it at scale.

How was the DATA LAKE initially imagined?

Initial DATA LAKE design

  • Just like data warehouses, collect data from different data sources and store it in distributed storage such as HDFS (Hadoop Distributed File System).
  • Process data using MapReduce (Hive & Pig) and Spark to generate business reports. Spark took over from MapReduce over a period of time (a minimal report-generation sketch follows this list).
  • Many BI tools provided connectors to connect to the Data Lake.
  • Processed data is also stored in the Data Lake, empowering Data Science (Machine Learning & AI).
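
To make that design concrete, here is a minimal PySpark sketch of the original pattern: read raw files from HDFS, aggregate them into a business report, and write the result back to the lake. The paths and column names (region, sale_date, amount) are illustrative assumptions, not a definitive layout.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("daily-sales-report")
             .getOrCreate())

    # Read raw CSV files landed in HDFS by the ingestion layer.
    raw_sales = (spark.read
                 .option("header", "true")
                 .option("inferSchema", "true")
                 .csv("hdfs:///datalake/raw/sales/"))

    # Aggregate into a business report: total revenue per region per day.
    report = (raw_sales
              .groupBy("region", "sale_date")
              .agg(F.sum("amount").alias("total_revenue")))

    # Store the processed result back in the lake (as Parquet), where both
    # BI connectors and data-science workloads can pick it up.
    report.write.mode("overwrite").parquet("hdfs:///datalake/processed/sales_report/")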

However, the Data Lake technology was missing two critical features:

  • Transaction and Consistency
  • Reporting Performance

So, Data Lakes started integrating with RDBMSs/data warehouses for reporting & BI purposes. ML/AI still works on the Data Lake, but BI/reporting shifted to warehouses. This is the current Data Lake architecture for many organizations.
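
In practice, that handoff to the warehouse is often just a Spark JDBC write. The sketch below assumes a processed Parquet dataset in the lake and a PostgreSQL warehouse with its JDBC driver on the Spark classpath; the URL, table, and credentials are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("publish-to-warehouse").getOrCreate()

    report = spark.read.parquet("hdfs:///datalake/processed/sales_report/")

    # Push the report into the warehouse, where BI/reporting gets
    # transactional, indexed access; ML/AI keeps using the lake copy.
    (report.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")
     .option("dbtable", "reporting.sales_report")
     .option("user", "report_writer")
     .option("password", "<secret>")
     .mode("overwrite")
     .save())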

Data Lake implementation with RDBMS for BI/Reporting

How Did the Data Lake Transition Happen?

In recent years, Hadoop as a platform lost its excitement, while cloud infrastructure became more economical and started seeing wider development. Present-day data lake implementations include many cloud components.

  • Many tools such as Informatica, Azure Data Factory, Talend, etc. help companies load data into the data lake from different sources (a hand-rolled equivalent is sketched below).

Data Ingestion
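
Where a managed tool is not in place, the same ingestion step can be hand-rolled in PySpark. The sketch below pulls a table from an operational database over JDBC and lands it, untouched, in the raw zone of the lake; the connection details, table, and partition column are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest-orders").getOrCreate()

    # Pull a table from an operational database over JDBC...
    orders = (spark.read
              .format("jdbc")
              .option("url", "jdbc:mysql://source-host:3306/shop")
              .option("dbtable", "orders")
              .option("user", "reader")
              .option("password", "<secret>")
              .load())

    # ...and land it as-is in the raw zone, partitioned by order date.
    (orders.write
     .mode("append")
     .partitionBy("order_date")
     .parquet("hdfs:///datalake/raw/orders/"))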

  • Data processing and transformation is handled largely by Spark (see the transformation sketch below).

Spark for Data Processing
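
A typical Spark transformation step looks like the following sketch: cleanse raw orders and enrich them with a customer dimension before publishing a curated dataset. Paths and columns are again illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-orders").getOrCreate()

    orders = spark.read.parquet("hdfs:///datalake/raw/orders/")
    customers = spark.read.parquet("hdfs:///datalake/raw/customers/")

    clean = (orders
             .dropDuplicates(["order_id"])                  # drop replayed records
             .filter(F.col("amount") > 0)                   # discard bad rows
             .withColumn("order_date", F.to_date("order_ts")))

    # Enrich with the customer dimension and publish a curated dataset.
    enriched = clean.join(customers, on="customer_id", how="left")
    enriched.write.mode("overwrite").parquet("hdfs:///datalake/curated/orders/")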

  • In the last and most critical stage, data consumption and experimentation, we need JDBC connections, APIs, etc. to access the data (a minimal JDBC sketch follows).
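
For the JDBC route, a minimal consumption sketch using PyHive against HiveServer2 (or the Spark Thrift Server) might look like this; the host, port, and table names are assumptions.

    from pyhive import hive

    # Connect to the lake's SQL endpoint (HiveServer2 / Spark Thrift Server).
    conn = hive.connect(host="thrift-server-host", port=10000, username="analyst")
    cursor = conn.cursor()

    cursor.execute(
        "SELECT region, SUM(total_revenue) FROM sales_report GROUP BY region"
    )
    for region, revenue in cursor.fetchall():
        print(region, revenue)

    cursor.close()
    conn.close()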

  • Altogether, the data lake solves many data challenges. However, we need more capabilities to complete its functionality. For example:

  1. Security and Access Controls
  2. Scheduling and Workflow management
  3. Data Catalog and Metadata
  4. Data Life Cycle and Governance
  5. Operations and Monitoring

#dataengineering #datalake #Spark


Leo Delmouly

Co-founder @Streambased

5 months ago

Great write-up! A new trend is emerging around the convergence of operational and analytical systems via Kafka, effectively turning it into a "streaming datalake." With solutions like Confluent's Tableflow and Streambased, you can now query streaming data directly at the source, bypassing complex and costly ETL/ELT processes. This approach not only ensures total consistency but also unlocks a much greater volume of data for prediction and advanced analytics.

Satish Dhawan

Data Scientist | Senior Recruiter

8 months ago

Hi Suresh, we have a community of Power BI, Tableau, and other BI technology professionals. If interested, you can join our group and share your experience with us. https://www.dhirubhai.net/groups/8164518/
