What is a Data Lake?

Prof. Ahmed Banafa

No.1 Tech Voice to Follow & Influencer on LinkedIn|Award Winning Author|AI-IoT-Blockchain-Cybersecurity|Speaker|56k+

Published: May 7, 2021

A “data lake” is a massive, easily accessible data repository for storing “big data”. Unlike traditional data warehouses, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know the scope of the data or how it will be used.

Data Lake vs. Data Warehouse

Data warehouses are large storage locations for data that you accumulate from a wide range of sources. For decades, the foundation for business intelligence and data discovery/storage rested on data warehouses. Their specific, static structures dictate what data analysis you can perform. Data warehouses are popular with mid- and large-size businesses as a way of sharing data and content across team- or department-siloed databases. Data warehouses help organizations become more efficient. Organizations that use data warehouses often do so to guide management decisions: all those “data-driven” decisions you always hear about.

A data lake holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
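The flat architecture described above can be illustrated with a minimal in-memory sketch. It is not how any particular product works: `FlatLake`, `LakeObject`, and the tag names (`source`, `kind`) are hypothetical, chosen only to show raw payloads kept in their native format, each with a unique identifier and metadata tags, and a query that narrows the lake to a smaller, relevant set.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: payload kept in native format, plus tags."""
    payload: bytes
    tags: dict[str, str]
    # Every object gets its own unique identifier on creation.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """Flat (non-hierarchical) store: all objects share a single namespace."""

    def __init__(self) -> None:
        self._objects: dict[str, LakeObject] = {}

    def put(self, payload: bytes, **tags: str) -> str:
        """Store a raw payload with its metadata tags and return its identifier."""
        obj = LakeObject(payload=payload, tags=dict(tags))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> LakeObject:
        return self._objects[object_id]

    def query(self, **wanted: str) -> list[LakeObject]:
        """Return objects whose tags match every requested key/value pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]


lake = FlatLake()
lake.put(b'{"user": 1, "page": "/home"}', source="web", kind="clickstream")
lake.put(b"2021-05-07 12:00:01 ERROR disk full", source="server", kind="log")
lake.put(b'{"user": 2, "page": "/cart"}', source="mobile", kind="clickstream")

# A business question arrives: pull only the relevant subset for analysis.
clicks = lake.query(kind="clickstream")
```

There are no folders here: the only way to find data is by identifier or by tags, which is why the quality of the metadata matters so much in a lake.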

Now that data storage and technology are cheap, information is vast, and newer database technologies don’t require an agreed-upon schema up front, discovery analytics is finally possible. With data lakes, companies employ data scientists who are capable of making sense of untamed data as they trek through it. They can find correlations and insights within the data as they get to know it.
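Working without an up-front schema is often called schema-on-read. The sketch below is one simple way to picture it, not a definitive implementation: the sample records, the `read_with_schema` helper, and the `revenue_schema` mapping are all made up for illustration. The records are kept exactly as they arrived, and a schema is applied only when a specific question is asked.

```python
import json

# Raw records are stored as they arrived; no schema was agreed up front.
raw_records = [
    '{"user": "a1", "amount": "19.99", "country": "US"}',
    '{"user": "b2", "amount": 5, "coupon": "SPRING"}',
    '{"user": "c3"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at query time: select fields, cast types, None if missing."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# The schema is chosen only now, for one question: total revenue.
revenue_schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_records, revenue_schema))
total = sum(row["amount"] or 0.0 for row in rows)
```

A different question could apply a different schema, for example `{"user": str, "coupon": str}`, to the same raw records without reloading or reshaping them.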

Five key components of a data lake architecture:

1. Data Ingestion: A highly scalable ingestion layer is required to extract data from various sources, such as websites, mobile apps, social media, IoT devices, and existing data management systems. It should be flexible enough to run in batch, one-time, or real-time modes, and it should support all types of data along with new data sources.

2. Data Storage: A highly scalable data storage system should be able to store and process raw data and support encryption and compression while remaining cost-effective.

3. Data Security: Regardless of the type of data processed, data lakes should be highly secure, using multi-factor authentication, authorization, role-based access, data protection, and similar controls.

4. Data Analytics: After data is ingested, it should be quickly and efficiently analyzed using data analytics and machine learning tools to derive valuable insights and move vetted data into a data warehouse.

5. Data Governance: The entire process of data ingestion, preparation, cataloging, integration, and query acceleration should be streamlined to produce enterprise-level data quality. It is also important to track changes to key data elements for a data audit.
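Two of the components above, role-based access (security) and change tracking for audits (governance), can be sketched together in a few lines. This is only an illustration under stated assumptions: the roles, the `ROLE_PERMISSIONS` mapping, the `update_element` helper, and the sample key are hypothetical, and a real data lake would normally delegate these checks to an identity and access management service and a data catalog.

```python
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping (an assumption for this sketch).
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}

audit_log = []  # append-only record of changes to key data elements


def is_allowed(role, action):
    """Role-based access check: the role must explicitly grant the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def update_element(store, key, value, user, role):
    """Write a value if permitted, recording old and new values for auditing."""
    if not is_allowed(role, "write"):
        raise PermissionError(f"role {role!r} may not write {key!r}")
    audit_log.append({
        "key": key,
        "old": store.get(key),
        "new": value,
        "user": user,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    store[key] = value


store = {"customer_42.email": "old@example.com"}
update_element(store, "customer_42.email", "new@example.com", "dana", "engineer")

# An analyst has read-only access, so this write is rejected and not logged.
try:
    update_element(store, "customer_42.email", "x@example.com", "sam", "analyst")
except PermissionError as exc:
    denied = str(exc)
```

Because every permitted write appends the old and new values along with who made the change and when, the log can later answer the audit question of how a key data element reached its current value.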

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports it. However, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

The data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes.

Data Lake Advantages

A data lake gives business users immediate access to all data
Data in the lake is not limited to relational or transactional data
With a data lake, you never need to move the data
A data lake empowers business users and liberates them from the bonds of IT domination
A data lake speeds delivery by enabling business units to stand up applications quickly
Helps with productionizing and advanced analytics
Offers cost-effective scalability and flexibility
Offers value from unlimited data types
Reduces long-term cost of ownership
Allows economic storage of files
Quickly adaptable to changes
The main advantage of a data lake is the centralization of different content sources
Users from various departments, who may be scattered around the globe, can have flexible access to the data

Data Lake Disadvantages

Unknown area of Data Processing
Data governance
Dealing with Chaos
Privacy issues
Complexity of Legacy Data
Metadata Lifecycle Management
Desolate Data Islands
The Issue of Integration
Unstructured data may lead to ungoverned and unusable data, and to disparate and complex tools
Increases storage and compute costs
There is no way to get insights from others who have worked with the data, because there is no account of the lineage of findings by previous analysts
The biggest risk of data lakes is security and access control. Data can be placed into a lake without any oversight, even though some of it may be subject to privacy and regulatory requirements

The Future

Many organizations are making this approach a reality. The internal infrastructures developed at Google, Amazon, and Facebook provide their developers with the advantages and agility of the data lake dream. For each of these companies, the data lake created a value chain through which new types of business value emerged:

Using data lakes for web data increased the speed and quality of web search
Using data lakes for clickstream data supported more effective methods of web advertising
Using data lakes for cross-channel analysis of customer interactions and behaviors provided a more complete view of the customer
Data lakes can give retailers profitable insights from raw data, such as log files, streaming audio and video, text files, and social media content, among other sources, to quickly identify real-time consumer behavior and convert actions into sales. Such 360-degree profile views allow stores to better interact with customers and push on-the-spot, customized offers to retain business or acquire new sales.
Data lakes can help companies improve their R&D performance by allowing researchers to make more informed decisions regarding the wealth of highly complex data assets that feed advanced predictive and prescriptive analytics.
Companies can use data lakes to centralize disparate data generated from a variety of sources and run analytics and ML algorithms to be the first to identify business opportunities. For instance, a biotechnology company can implement a data lake that receives manufacturing data, research data, customer support data, and public data sets, and provides real-time visibility into the research process for various user communities via different user interfaces.

Regardless of where you are now, take some time to look to the future. We’re on a journey towards connecting enterprise data. As business becomes increasingly digital, access to data will become a critical priority, as will speed of development and deployment. The data lake is a dream that can match those demands. The global data lake market was valued at $7.9 billion in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 20.6 percent to reach $20.1 billion by 2024.
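The market figures quoted above can be checked with the standard compound-growth formula. This is a quick sanity check rather than anything taken from the market report itself; it assumes five compounding years (2019 to 2024), and the function names are illustrative.

```python
def project(value, rate, years):
    """Compound growth: value * (1 + rate) ** years."""
    return value * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# $7.9 billion in 2019, growing 20.6 percent per year for five years.
projected_2024 = project(7.9, 0.206, 5)

# Growth rate implied by the two quoted endpoints, $7.9 and $20.1 billion.
rate_from_endpoints = implied_cagr(7.9, 20.1, 5)
```

The projection comes to roughly $20.15 billion and the implied rate to roughly 20.5 percent, so the quoted numbers are consistent with each other up to rounding.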

Ahmed Banafa, author of the books:

Secure and Smart Internet of Things (IoT) Using Blockchain and AI

Blockchain Technology and Applications

Read more articles at: Prof. Banafa website

Gervais Johnson, AI Innovation and Neurodiversity Advocate

Product Strategy, AI, and Agile Leader

3 years ago

Very good. I think we are seeing a paradigm shift in data storage and usage. Centralized data storage and management, including DBMSs such as relational frameworks, will be replaced by data distributed across the cloud / iron, like the WWW today. This will require new ways of doing data analytics (AI/ML/tools) and insight creation. With the advent of more powerful machines like quantum computers, we can perform highly sophisticated and complex data algorithms and analyses that were not possible earlier.

Joe Rounceville

Experienced Enterprise and Solution Architect (Fortune 100 - Innovation Focused)

3 years ago

What are your thoughts on having an event-based core for your company (like Kafka or AWS Kinesis)? It seems like it gives you the benefits of a lake, but offers more, in that you can treat it like a real-time pub/sub pipeline if you want, a curated warehouse if you want, an analytics hub, a place to integrate API endpoints, or batch jobs, a way to offload processing without losing track of who is the system of record, etc. It seems like it's kind of like the data lake concept applied to *all* integration problems, not just the warehouse (schema on write) problem. All data (in the form of "events") goes into the streams, but you only pull out what you want to augment / add value to, and if there's value for others to consume the augmented data, you can push *that* back into the streams too.

Margaret Rouse

Explaining the value of IT one definition at a time...

3 years ago

Love this explanation! I tend to think of data lakes as giant junk drawers, but now you've got me thinking about the need to govern them!

Srini K.

President AI, Technology & Sustainability @ Rackspace (FAIR) - Lifelong Learner - Advocate for Responsible AI - Sustainability

3 years ago

Love this post. One of the key advantages of a data lake is the ability to extract, load, and transform when needed. This is a huge advantage, and it helps implement a more flexible supply chain of data.
