Database Vs Data Warehouse Vs Data Lake

In this article, we discuss the differences between databases, data warehouses, and data lakes. To understand how these three kinds of data store differ, one first needs to know the difference between structured and unstructured data.

In simple words, structured data has a known schema and a fixed, neat structure and, most importantly, fits into a table of fixed fields; data stored in Excel files is an example. Unstructured data, on the other hand, has no fixed schema or structure. Take a newsletter that contains images along with text: a traditional DBMS finds it difficult to accommodate such data in a fixed schema.
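
As a rough illustration of this distinction, the Python sketch below contrasts a structured record with an unstructured newsletter. The `Order` fields, the helper `as_row`, and the sample values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, fields


# Structured data: every record has the same named, typed fields.
@dataclass
class Order:
    order_id: int
    customer: str
    amount: float


def as_row(order: Order) -> tuple:
    """Flatten a structured record into one table row, in column order."""
    return tuple(getattr(order, f.name) for f in fields(order))


orders = [Order(1, "Asha", 250.0), Order(2, "Ben", 99.5)]
rows = [as_row(o) for o in orders]  # fits a fixed-field table directly

# Unstructured data: a newsletter mixing free text and image bytes.
# There is no fixed set of columns that every part would fill.
newsletter = [
    "Welcome to the March issue!",
    b"\x89PNG\r\n\x1a\n",  # first bytes of an embedded PNG image
    "More text after the image.",
]
part_types = {type(part).__name__ for part in newsletter}
```

Each `Order` maps cleanly onto a row, while the newsletter is a mix of text and binary parts that has no natural column layout.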

So, what is a database? A database is typically a store for structured data with a defined schema. Items in a database are organized as a set of tables with columns and rows, where a column represents one attribute of an object and a row contains the entire property set of one object. Examples of databases are MySQL, Oracle, and PostgreSQL. Databases are designed to store transactional data, which may or may not have any analytical importance, and they suit organizations that only need to store frequent transactional data. A data warehouse, in contrast, is designed for analytical purposes. It sits on top of several databases, draws data from all of them, and creates a unified schema on which to perform data analytics.
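
To make the database side of this concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a server database such as MySQL or PostgreSQL. The `orders` table and its columns are assumptions made for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for a transactional DBMS.
conn = sqlite3.connect(":memory:")

# The schema is defined up front: each column is one attribute of an order.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# Each inserted row is the full property set of one order.
# Using the connection as a context manager commits on success
# and rolls back if an error is raised.
with conn:
    conn.executemany(
        "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
        [(1, "Asha", 250.0), (2, "Ben", 99.5)],
    )

# A typical transactional lookup: fetch one record by its key.
row = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()
```

The workload here is many small reads and writes of individual rows, which is what a transactional database is tuned for.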

A data warehouse transforms the data collected from several databases and keeps only the information that is crucial for data analysis. Its design revolves around facilitating management decision-making. Data in a data warehouse is carefully related to all of the other data in the warehouse and tends to be highly standardized and cleaned.
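
The flow described above (collect from several databases, clean and standardize, load into one schema, then analyze) can be sketched as follows. This is a toy illustration with in-memory SQLite databases; the table names, columns, and the `clean_name` helper are hypothetical, and a real warehouse would use dedicated tooling.

```python
import sqlite3

# Two hypothetical operational databases with different layouts.
shop = sqlite3.connect(":memory:")
shop.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_usd REAL)")
shop.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " asha ", 250.0), (2, "Ben", 99.5)],
)

app = sqlite3.connect(":memory:")
app.execute("CREATE TABLE purchases (ref TEXT, buyer TEXT, cents INTEGER, device TEXT)")
app.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [("a-1", "ASHA", 1000, "ios"), ("a-2", "Chen", 4550, "android")],
)

# The warehouse has one unified schema for analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (source TEXT, customer TEXT, amount_usd REAL)")


def clean_name(name: str) -> str:
    """Standardize customer names so the same person matches across sources."""
    return name.strip().title()


# Extract and transform: keep only the fields needed for analysis
# (the device column is dropped) and convert cents to dollars.
rows = [
    ("shop", clean_name(customer), amount)
    for customer, amount in shop.execute("SELECT customer, amount_usd FROM orders")
]
rows += [
    ("app", clean_name(buyer), cents / 100)
    for buyer, cents in app.execute("SELECT buyer, cents FROM purchases")
]

# Load into the unified table, then run an analytical query over all sources.
with warehouse:
    warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

totals = warehouse.execute(
    "SELECT customer, SUM(amount_usd) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

Because names and units are standardized before loading, the two records for the same customer from different sources are added together in a single total.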

A data lake is a centralized repository for storing both structured and unstructured data. Data lakes came into use mainly because big data applications generate ever more unstructured data. We can't store unstructured data in a data warehouse, because a warehouse needs a unified structure for efficient data analysis. A data lake keeps data in its raw format until it is required for use, so there is no need to transform the data before storing it. Processing is done when the data is read out, so the schema is defined on read.
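
A minimal sketch of this schema-on-read idea is shown below. A temporary directory stands in for the lake's storage, and the file names, record fields, and the `read_sales` helper are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for the data lake's storage.
lake = Path(tempfile.mkdtemp())

# Write: raw payloads are stored untouched, whatever their shape.
(lake / "events.jsonl").write_text(
    '{"user": "asha", "amount": "250"}\n'
    '{"user": "ben", "amount": 99.5, "note": "gift"}\n'
)
(lake / "newsletter.bin").write_bytes(b"\x89PNG\r\n\x1a\n")


def read_sales(path: Path) -> list[tuple[str, float]]:
    """Apply a schema (user: str, amount: float) only when the data is read."""
    records = (json.loads(line) for line in path.read_text().splitlines() if line)
    # Extra fields such as "note" are ignored; amounts are coerced to float.
    return [(str(r["user"]), float(r["amount"])) for r in records]


sales = read_sales(lake / "events.jsonl")
```

Nothing was validated or reshaped at write time; the inconsistent `amount` types are only reconciled by the reader, and a different reader could apply a different schema to the same raw files.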

So, the decision about which service to use depends entirely on your data storage needs. If you just need to store daily transactional data with little analysis, go for a DBMS. If your purpose is purely analytical, opt for a data warehouse. And if you need to perform analytical operations on unstructured data, your solution is a data lake.
