Big Data & Data Lakes

BBI

Where AI Meets Innovation

Published: November 30, 2023

Data is generated in vast quantities by automated systems such as those that run factories, manage vehicles, monitor home energy usage, and control vending machines, as well as by e-commerce and other sources. Most organizations have a wealth of data at their disposal that they did not consider useful in the past.

This is called Big Data: data that is available in large volumes and in many different formats, and that may be generated over different time periods.

The terms Big Data and Data Lake are often used together, even interchangeably, but they are not the same. Big Data is a technology concept, while Data Lake is a business concept.

In this article, we explain both terms in detail, along with their strengths, weaknesses, and use cases.

Big Data

Big Data refers to large, diverse sets of information that grow at ever-increasing rates. It is characterized by the "three V's": volume, the ever-increasing amount of data; velocity, the increasing speed at which data is received and processed; and variety, the different types of data and data formats.

Data Lake

Big data is often stored in a Data Lake. While data warehouses are commonly built on relational databases and contain structured data only, data lakes can support various data types and are typically based on Hadoop clusters, cloud object storage services, NoSQL databases, or other big data platforms.

A Data Lake is a scalable data storage repository that can quickly ingest large amounts of raw data and make it available on demand. Users accessing the data lake can explore, transform, and analyze subsets of the data as needed to meet their specific requirements. You can store your data as-is, without transformation, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

One of the most important characteristics of a data lake is its ability to store all types of data from any source:

· Structured data is clearly defined and formatted, such as the data found in relational databases.

· Unstructured data adheres to no specific format, such as social media content or data generated by IoT devices.

· Semi-structured data falls somewhere between structured and unstructured data, such as CSV and JSON files.
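
The idea of storing any of these data types as-is can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `raw/<source>/<date>` folder layout, the function names `land_raw` and `categorize`, and the extension-to-category mapping are all assumptions made for this example, and a local directory stands in for whatever storage the lake actually uses.

```python
from datetime import date
from pathlib import Path

# Hypothetical mapping from file extension to the three categories listed above.
# Anything not listed is treated as unstructured.
CATEGORY_BY_SUFFIX = {
    ".db": "structured",         # e.g. a relational database file
    ".csv": "semi-structured",
    ".json": "semi-structured",
}


def categorize(filename: str) -> str:
    """Return the assumed category for a file, based only on its extension."""
    return CATEGORY_BY_SUFFIX.get(Path(filename).suffix.lower(), "unstructured")


def land_raw(lake_root: Path, source: str, filename: str,
             payload: bytes, ingest_date: date) -> Path:
    """Write the payload unchanged under raw/<source>/<ingest date>/<filename>."""
    target_dir = lake_root / "raw" / source / ingest_date.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no transformation
    return target
```

The bytes are never parsed on the way in, so a JSON sensor reading, a CSV export, and an image can all be landed by the same call; interpretation is deferred until someone reads the data.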

Figure: Reference logical architecture of a data lake

Data lakes and data warehouses are both widely used to store data for analytics, but the terms are not interchangeable. A data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, whose results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust.
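
To make "schema defined in advance" concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for a warehouse. The `sales` table, its columns, and the helper names are hypothetical and chosen only for illustration; a real warehouse would use a different engine and a far richer model.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory database whose schema is fixed before any data arrives."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sales (
            order_id INTEGER PRIMARY KEY,
            region   TEXT NOT NULL,
            amount   REAL NOT NULL CHECK (amount >= 0)
        )
        """
    )
    return conn


def load_rows(conn: sqlite3.Connection, rows):
    """Insert rows one by one and return those the schema rejects."""
    rejected = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)", row
            )
        except sqlite3.IntegrityError:
            # The row violates a constraint declared in the schema.
            rejected.append(row)
    return rejected


def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """A typical reporting query against the predefined structure."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return dict(conn.execute(query))
```

Rows that do not fit the declared structure are refused at load time, which is what lets reporting queries trust the contents afterwards.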

A data lake is different: it stores relational data from line-of-business applications as well as non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not defined when the data is captured. This means you can store all of your data without careful design and without knowing in advance what questions you might need answered. Different types of analytics, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, can then be used to uncover insights.
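
Leaving the schema undefined at capture time is often described as "schema on read". The sketch below illustrates the idea with plain Python and the standard `json` module; the sample events, field names, and the `read_with_schema` helper are invented for this example.

```python
import json

# Raw events stored exactly as they arrived; their shapes differ by source.
RAW_EVENTS = [
    '{"device": "meter-1", "kwh": 3.2}',
    '{"user": "ana", "text": "great service", "likes": 4}',
    '{"device": "meter-2", "kwh": "1.8", "firmware": "2.1"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) only when the data is read.

    Records that lack any of the requested fields are skipped; extra fields
    are ignored. A value the converter cannot handle raises ValueError.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: convert(record[field]) for field, convert in schema.items()}


# Two different questions asked of the same raw data, each with its own schema.
energy_readings = list(read_with_schema(RAW_EVENTS, {"device": str, "kwh": float}))
social_posts = list(read_with_schema(RAW_EVENTS, {"user": str, "likes": int}))
```

The same stored lines serve both questions, and a third schema could be added later without re-ingesting anything.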

Data lakes do not improve everything for AI and analytics teams; they also require these teams to take on new types of tasks. The teams are typically responsible for data cleansing and for ensuring consistency. They have to perform an initial analysis and write transformation code. For repetitive analytics tasks, such as those that support operational processes, the AI and analytics teams must also maintain that code on an ongoing basis. If feeds change, they have to update the cleansing and consistency code.
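
As a rough illustration of what such cleansing and consistency code can look like, the sketch below normalizes field names, fixes types, and drops duplicates. Everything in it is hypothetical: the field names, the `FIELD_ALIASES` table, and the rules themselves are assumptions for this example, and real feeds will need their own.

```python
# Hypothetical aliases: when an upstream feed renames a field, this table is
# the part of the cleansing code the team has to update.
FIELD_ALIASES = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "amt": "amount",
}


def cleanse(records):
    """Return consistent, de-duplicated records from raw feed dictionaries."""
    seen_orders = set()
    cleaned = []
    for raw in records:
        # Map every known alias onto one canonical field name.
        row = {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}

        order_id = row.get("order_id")
        customer = str(row.get("customer_id", "")).strip()
        if order_id is None or not customer:
            continue  # incomplete record

        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # amount missing or not numeric

        if order_id in seen_orders:
            continue  # duplicate delivery of the same order
        seen_orders.add(order_id)

        cleaned.append(
            {"order_id": order_id, "customer_id": customer, "amount": amount}
        )
    return cleaned
```

Keeping the renaming rules in one table means a changed feed usually requires a one-line update rather than a rewrite, though silently skipped records should be counted or logged in practice.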

However, data lakes are only useful in the right hands. Companies that want to benefit from data lakes must have access to people (often in the quickly emerging roles of data scientist and data engineer) who can apply an understanding of how the business works to the data when answers are needed. While data scientists and analysts extract meaning and insights from data, data engineers support them by diving into the raw data and making it accessible.

Pros of Data Lake

· Allows an organization to store any (structured and unstructured) data

· Any imaginable data is readily available when needed

· Ad hoc questions can be asked at any time

· Based on low-cost distributed platforms

· Data first, ask any questions later

· Low entry investment

Cons of Data Lake

· Requires new in-house talent to deliver continuous value.

· They can become a data graveyard.

· Experts are needed to extract information.

· Raw data is easy to misinterpret.

Conclusion

In conclusion, Big Data and Data Lake are two interrelated terms with distinct meanings.

Big Data is simply data that is very large in size; data on the order of petabytes or more is considered Big Data. Size is not the only parameter, however. The sources that generate the data, its different formats, and the speed at which it is generated all combine to define Big Data. In the simplest terms, Big Data is huge amounts of data.

A Data Lake, on the other hand, is a repository for Big Data. It stores data of all types (structured, unstructured, and semi-structured) generated from different sources, and it stores that data in its rawest form. A data lake is different from a data warehouse, which stores data in a well-structured form. Data in a data lake may or may not be used in the future, whereas data in a data warehouse is meant to be used.

Ultimately, Big Data is huge data, and a Data Lake is the storehouse for it. Data lakes play an important role in business growth, but an organization should weigh factors such as data volume, frequency of changes, reporting needs, and the structure of its sources before deciding to use one.

Monthly TechTalk

Mostafa Ibrahem Ali Abdelhafez
