Big Data & Data Lakes
Automated systems generate data in vast quantities.
This is called Big Data: data that is available in large volumes and in many different formats, and that may be generated over different time periods.
The terms Big Data and Data Lake are often used together, even interchangeably, but they are not the same. Big Data is a technology concept, while Data Lake is a business concept.
In this article, we'll explain both terms in detail, along with their strengths, weaknesses, and use cases.
Big Data
Big Data refers to large, diverse sets of information that grow at ever-increasing rates. It is characterized by the "three V's" of big data: volume (the ever-increasing amount of data), velocity (the increased speed at which data is received and processed), and variety (the different types of data and data formats).
Data Lake
Big data is often stored in a Data Lake. While data warehouses are commonly built on relational databases and contain structured data only, data lakes can support various data types and are typically based on Hadoop clusters, cloud object storage services, NoSQL databases, or other big data platforms.
A Data Lake is a scalable data storage repository that can quickly ingest large amounts of raw data and make it available on demand. Users accessing the data lake can explore, transform, and analyze subsets of data as they need them to meet their specific requirements. You can store your data as-is, without transformation, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
One of the most important characteristics of a data lake is its ability to store all types of data from any source:
· Structured data is clearly defined and formatted, such as the data found in relational databases.
· Unstructured data adheres to no specific format, such as social media content or data generated by IoT devices.
· Semi-structured data falls somewhere between structured and unstructured data, such as CSV and JSON files.
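As a rough illustration of the three categories above, the following Python sketch handles one tiny example of each. The table name, field names, and sample values are all made up for the example; they are assumptions, not part of any particular data lake product.

```python
import json
import sqlite3

# Structured: a fixed schema is declared up front (hypothetical "orders" table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'alice', 19.99)")
structured_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = 1"
).fetchone()

# Semi-structured: self-describing, but fields can vary from record to record.
semi_structured = json.loads(
    '{"device": "sensor-7", "readings": [21.5, 21.7], "meta": {"unit": "C"}}'
)

# Unstructured: free text with no fields; any structure has to be inferred.
unstructured = "Loving the new phone, battery lasts all day! #happy"
hashtags = [word for word in unstructured.split() if word.startswith("#")]

print(structured_row)
print(semi_structured["meta"]["unit"])
print(hashtags)
```

The point of the sketch is only that each category needs a different access pattern: a query for structured data, key navigation for semi-structured data, and ad hoc parsing for unstructured data.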
Data Lakes and Data Warehouses are both widely used to store data for analytics, but the terms are not interchangeable. A Data Warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, and the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the "single source of truth" that users can trust.
A Data Lake is different because it stores both relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not defined when the data is captured. This means you can store all of your data without careful design and without knowing in advance which questions you might need answered in the future. Different types of analytics, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, can then be used to uncover insights.
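The difference between defining a schema in advance and leaving it undefined at capture time can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the raw records, the `sales` table, and the `read_with_schema` helper are hypothetical, an in-memory list stands in for lake storage, and SQLite stands in for a warehouse.

```python
import json
import sqlite3

# Raw events as they might land in a lake: one JSON document per line,
# stored as-is, with fields that differ between sources (all hypothetical).
raw_lines = [
    '{"source": "crm", "customer": "alice", "amount": 19.99}',
    '{"source": "mobile", "user": "bob", "event": "login"}',
    '{"source": "iot", "device": "sensor-7", "temp_c": 21.5}',
]


def read_with_schema(lines, fields):
    """Lake style: project each raw record onto the fields a query needs."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# The question is chosen at read time; no record was rejected at capture time.
sources_and_amounts = list(read_with_schema(raw_lines, ["source", "amount"]))

# Warehouse style: the schema exists before any data is loaded,
# so records that do not fit it are rejected at load time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
loaded, rejected = 0, 0
for line in raw_lines:
    record = json.loads(line)
    try:
        conn.execute(
            "INSERT INTO sales VALUES (?, ?)",
            (record.get("customer"), record.get("amount")),
        )
        loaded += 1
    except sqlite3.IntegrityError:
        # Missing fields become NULL and violate the NOT NULL constraints.
        rejected += 1

print(sources_and_amounts)
print(loaded, rejected)
```

In this sketch the lake keeps all three records and tolerates missing fields when reading, while the warehouse table accepts only the one record that matches its predefined schema.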
Data lakes do not improve everything for AI and analytics teams. They also require these teams to take over new types of tasks; the teams are typically responsible for data cleansing.
However, data lakes are only useful if put into the right hands. Companies wanting to benefit from data lakes must have access to the necessary skills, which are often held by the quickly emerging roles of data scientists and data engineers.
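To give a sense of the data cleansing work that falls to these teams, here is a small Python sketch. The cleansing rules (trim and lowercase names, drop rows with unparseable amounts or empty names, remove duplicates), the `clean_records` function, and the sample records are illustrative assumptions; real pipelines define their own rules.

```python
def clean_records(raw_records):
    """Normalize, validate, and deduplicate raw records (illustrative rules only)."""
    seen = set()
    cleaned = []
    for record in raw_records:
        # Normalize the customer name: trim whitespace and lowercase it.
        name = str(record.get("customer") or "").strip().lower()
        try:
            amount = float(record.get("amount"))
        except (TypeError, ValueError):
            continue  # drop rows whose amount is missing or cannot be parsed
        if not name:
            continue  # drop rows without a customer name
        key = (name, amount)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append({"customer": name, "amount": amount})
    return cleaned


# Hypothetical raw records as they might sit in a lake.
raw = [
    {"customer": " Alice ", "amount": "19.99"},
    {"customer": "alice", "amount": 19.99},
    {"customer": "Bob", "amount": "n/a"},
    {"customer": "", "amount": "5"},
    {"customer": "Carol", "amount": None},
]
cleaned = clean_records(raw)
print(cleaned)
```

Of the five raw records, only one survives: the two "alice" rows collapse into a single record and the other three fail validation.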
Pros of Data Lake
· Allows an organization to store any data, structured or unstructured
· Any imaginable data is readily available when needed
· Ad hoc questions can be asked at any time
· Based on low-cost distributed platforms
· Data first, ask any questions later
· Low entry investment
Cons of Data Lake
· Requires new in-house talent to deliver continuous value.
· They can become a data graveyard.
· Experts are needed to extract information.
· Raw data is easy to misinterpret.
Conclusion
In conclusion, Big Data and Data Lake are two interrelated terms with different meanings.
Big Data is simply data that is enormous in size; data on the order of petabytes or more is generally considered Big Data. Size is not the only defining parameter, though. The sources that generate the data, its different formats, and the speed at which it is generated all combine to define Big Data. In the simplest terms, Big Data is huge amounts of data.
A Data Lake, on the other hand, is a repository for Big Data. It stores data of all types (structured, unstructured, and semi-structured) generated from different sources, and it stores that data in its rawest form. A data lake differs from a data warehouse, which stores data in a well-structured form. Data in a data lake may or may not be used in the future, whereas data in a data warehouse is meant to be used.
Ultimately, Big Data is huge data and a Data Lake is the storehouse for it. Data lakes can play an important role in business growth, but before adopting one an organization should consider factors such as data volume, frequency of changes, reporting needs, and the structure of its sources.
Mostafa Ibrahem Ali Abdelhafez