How Companies Are Implementing the "DATA LAKE"

The era of data processing applications started in 1959 with the launch of COBOL (Common Business Oriented Language) and evolved further thanks to the RDBMS (Relational Database Management System). The entire world started using RDBMS and eventually began facing challenges with "Big Data".

Many developer communities appreciated Google for releasing its white papers on the "Google File System" and the MapReduce framework/programming model to tackle the Big Data problem. The world adopted the same strategy and created the open-source project "Hadoop". Hadoop was the first attempt at a Data Lake implementation. Hadoop, too, is slowly losing its charm, and the world is moving toward the cloud. Modern data management now has many components, and I would like to discuss how companies are leveraging them.

Data Warehouse vs. Hadoop (First Data Lake Solution)


What is a DATA LAKE?

A Data Lake is a central repository that can store structured, unstructured, and semi-structured data in raw format and process it at scale.

How was the DATA LAKE initially imagined?

Initial DATA LAKE design

  • Just like data warehouses, collect data from different data sources and store it in distributed storage such as HDFS (Hadoop Distributed File System).
  • Process data using MapReduce (Hive & Pig) and Spark to generate business reports. Spark took over from MapReduce over a period of time (a minimal report-generation sketch follows this list).
  • Many BI tools provided connectors to connect to the Data Lake.
  • Processed data is also stored in the Data Lake, empowering Data Science (Machine Learning & AI).
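
To make that design concrete, here is a minimal PySpark sketch of the original pattern: read raw files from HDFS, aggregate them into a business report, and write the result back to the lake. The paths and column names (region, sale_date, amount) are illustrative assumptions, not a definitive layout.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("daily-sales-report")
             .getOrCreate())

    # Read raw CSV files landed in HDFS by the ingestion layer.
    raw_sales = (spark.read
                 .option("header", "true")
                 .option("inferSchema", "true")
                 .csv("hdfs:///datalake/raw/sales/"))

    # Aggregate into a business report: total revenue per region per day.
    report = (raw_sales
              .groupBy("region", "sale_date")
              .agg(F.sum("amount").alias("total_revenue")))

    # Store the processed result back in the lake (as Parquet), where both
    # BI connectors and data-science workloads can pick it up.
    report.write.mode("overwrite").parquet("hdfs:///datalake/processed/sales_report/")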

However, the Data Lake technology was missing two critical features:

  • Transaction and Consistency
  • Reporting Performance

So, Data Lakes started integrating with RDBMSs/data warehouses for reporting & BI purposes. ML/AI still works on the Data Lake, but BI/reporting shifted to warehouses. This is the current Data Lake architecture for many organizations.
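
In practice, that handoff to the warehouse is often just a Spark JDBC write. The sketch below assumes a processed Parquet dataset in the lake and a PostgreSQL warehouse with its JDBC driver on the Spark classpath; the URL, table, and credentials are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("publish-to-warehouse").getOrCreate()

    report = spark.read.parquet("hdfs:///datalake/processed/sales_report/")

    # Push the report into the warehouse, where BI/reporting gets
    # transactional, indexed access; ML/AI keeps using the lake copy.
    (report.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")
     .option("dbtable", "reporting.sales_report")
     .option("user", "report_writer")
     .option("password", "<secret>")
     .mode("overwrite")
     .save())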

Data Lake implementation with RDBMS for BI/Reporting

How Did the Data Lake Transition Happen?

In recent years, Hadoop as a platform lost its excitement, while cloud infrastructure became more economical and started seeing wider development. Present-day data lake implementations include many cloud components.

  • Many tools such as Informatica, Azure Data Factory, Talend, etc. help companies load data into the data lake from different sources (a hand-rolled equivalent is sketched below).

Data Ingestion
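
Where a managed tool is not in place, the same ingestion step can be hand-rolled in PySpark. The sketch below pulls a table from an operational database over JDBC and lands it, untouched, in the raw zone of the lake; the connection details, table, and partition column are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest-orders").getOrCreate()

    # Pull a table from an operational database over JDBC...
    orders = (spark.read
              .format("jdbc")
              .option("url", "jdbc:mysql://source-host:3306/shop")
              .option("dbtable", "orders")
              .option("user", "reader")
              .option("password", "<secret>")
              .load())

    # ...and land it as-is in the raw zone, partitioned by order date.
    (orders.write
     .mode("append")
     .partitionBy("order_date")
     .parquet("hdfs:///datalake/raw/orders/"))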

  • Data processing and transformation is handled largely by Spark (see the transformation sketch below).

Spark for Data Processing
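
A typical Spark transformation step looks like the following sketch: cleanse raw orders and enrich them with a customer dimension before publishing a curated dataset. Paths and columns are again illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-orders").getOrCreate()

    orders = spark.read.parquet("hdfs:///datalake/raw/orders/")
    customers = spark.read.parquet("hdfs:///datalake/raw/customers/")

    clean = (orders
             .dropDuplicates(["order_id"])                  # drop replayed records
             .filter(F.col("amount") > 0)                   # discard bad rows
             .withColumn("order_date", F.to_date("order_ts")))

    # Enrich with the customer dimension and publish a curated dataset.
    enriched = clean.join(customers, on="customer_id", how="left")
    enriched.write.mode("overwrite").parquet("hdfs:///datalake/curated/orders/")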

  • In the last and most critical stage, data consumption and experimentation, we need JDBC connections, APIs, etc. to access the data (a minimal JDBC sketch follows).
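
For the JDBC route, a minimal consumption sketch using PyHive against HiveServer2 (or the Spark Thrift Server) might look like this; the host, port, and table names are assumptions.

    from pyhive import hive

    # Connect to the lake's SQL endpoint (HiveServer2 / Spark Thrift Server).
    conn = hive.connect(host="thrift-server-host", port=10000, username="analyst")
    cursor = conn.cursor()

    cursor.execute(
        "SELECT region, SUM(total_revenue) FROM sales_report GROUP BY region"
    )
    for region, revenue in cursor.fetchall():
        print(region, revenue)

    cursor.close()
    conn.close()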

  • Altogether, the data lake solves many data challenges. However, we need more capabilities to complete its functionality. For example:

  1. Security and Access Controls
  2. Scheduling and Workflow management
  3. Data Catalog and Metadata
  4. Data Life Cycle and Governance
  5. Operations and Monitoring

#dataengineering #datalake #Spark


Leo Delmouly

Co-founder @Streambased

5 months ago

Great write-up! A new trend is emerging around the convergence of operational and analytical systems via Kafka, effectively turning it into a "streaming datalake." With solutions like Confluent's Tableflow and Streambased, you can now query streaming data directly at the source, bypassing complex and costly ETL/ELT processes. This approach not only ensures total consistency but also unlocks a much greater volume of data for prediction and advanced analytics.

Satish Dhawan

Data Scientist | Senior Recruiter

8 months ago

Hi Suresh, we have a community of Power BI, Tableau, and other BI technology professionals. If interested, you can join our group and share your experience with us. https://www.dhirubhai.net/groups/8164518/
