Data Lake & Data Mesh

Global data creation is projected to exceed 180 zettabytes in the next five years.

Creating a single source of truth for analysis has always been a struggle. Having the data in one central location may help us answer business questions quickly and easily. Business Intelligence can give you deep insights into the data, but to get there you need a unified and standardized view of it. This is where the data warehouse comes to the rescue.

A data warehouse can store huge amounts of data from different sources, and it solves the problem as long as the structure of the data is well defined.

As data grows, a variety of sources generate enterprise data. This data does not always have a well-defined schema; it can be structured, semi-structured or unstructured. That poses a problem for the existing solutions we just discussed. This is where the data lake comes in.

Data Lake

A data lake is a huge data store holding a variety of data from different sources, such as Salesforce, IoT devices, the web or REST endpoints, in any format: pictures, videos, XML, CSV, JSON or, for that matter, any other sort of data. The data lake works on the concept of 'store first and think later', which is what sets it apart from a data warehouse. Another way to see this is that a data lake is ELT (extract, load, transform) while a data warehouse is ETL (extract, transform, load). In a data lake you store the data first, without thinking too much about format and transformation, and later you transform it based on the business needs.
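The 'store first and think later' idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list raw_zone stands in for cheap object storage, and the source names, field names and payloads are made up for the example.

```python
import csv
import io
import json

# Hypothetical stand-in for the lake's raw storage (in practice, object storage).
raw_zone = []


def land(source, fmt, payload):
    """Load step: keep the payload exactly as it arrived, plus minimal metadata."""
    raw_zone.append({"source": source, "format": fmt, "payload": payload})


def transform_orders():
    """Transform step, written later, once a business question exists."""
    rows = []
    for item in raw_zone:
        if item["format"] == "json":
            record = json.loads(item["payload"])
            rows.append({"order_id": str(record["id"]),
                         "amount": float(record["total"])})
        elif item["format"] == "csv":
            for record in csv.DictReader(io.StringIO(item["payload"])):
                rows.append({"order_id": record["order_id"],
                             "amount": float(record["amount"])})
    return rows


# Data is landed first, in whatever shape each source produces.
land("web", "json", '{"id": 1, "total": "19.99"}')
land("salesforce", "csv", "order_id,amount\nA-7,5.00\n")

# Only when someone asks about orders do we impose a common shape.
orders = transform_orders()
```

In an ETL flow, the logic in transform_orders would have to exist, and agree with the warehouse schema, before anything could be stored at all.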

Since a data lake does not enforce a standard schema, the quality of its data is generally lower than in a data warehouse. A data lake is built with quantity in mind, whereas a data warehouse is centered around quality.

Data Lake Architecture


With data lakes, we build pipelines that bring all the data to one central location. This can be combined with a "Delta Lake" architecture, which organizes the data into layers and helps with rewinding the data after a failure.
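To make the 'rewinding' idea concrete, here is a toy sketch of a versioned table in plain Python. It is not the Delta Lake API; the class VersionedTable and its methods are hypothetical, and the sketch only assumes that every write produces a new version that can be read or restored later.

```python
class VersionedTable:
    """Toy table where every commit creates a new, readable version."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def commit(self, rows):
        """Append rows on top of the latest version and return the new version number."""
        self._versions.append(self._versions[-1] + list(rows))
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest version, or any earlier one by number."""
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])

    def restore(self, version):
        """Re-commit an old snapshot as the newest version; history is kept."""
        self._versions.append(list(self._versions[version]))
        return len(self._versions) - 1


table = VersionedTable()
good = table.commit([{"id": 1}, {"id": 2}])  # a healthy pipeline run
bad = table.commit([{"id": None}])           # a faulty run slips in a broken row
restored = table.restore(good)               # rewind to the last good version
```

After the restore, table.read() returns only the two healthy rows, while the faulty version is still available through table.read(bad) for debugging.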

So we have solved the problem of storing huge volumes of multi-structured and unstructured data. But that raises another problem :)

The data lake approach brings a few other major challenges:

  • #1 : To centralize the data, you need to bring it in from various sources and keep it in one large storage location. Moving all this data to a central location is itself a big and expensive task.
  • #2 : As the number of sources increases, querying the central data store becomes slow, and it fails to scale.
  • #3 : Moving data across regions or countries can have an impact from a data privacy standpoint.

Data Mesh

Global data creation is projected to exceed 180 zettabytes in the next five years. It is very difficult to imagine storing all of that data in one location: it would be hard to process quickly when needed and very costly to store. Data Mesh, a term coined by @Zhamak, comes to the rescue. Data Mesh is a modern, distributed way of storing and owning data. It aims to make data more accessible, secure, discoverable and interoperable.


@Zhamak defines four principles of the data mesh:

  1. Domain-driven ownership : The first principle is about giving ownership of the data to the domain teams. They are responsible for governing their data: who can access it and how it should be accessed.
  2. Data as a product : The domain teams are also responsible for the products/views created out of the data, including maintaining and updating the resulting data products.
  3. Self-service infrastructure : The third principle is about the ease of using and maintaining the data products. For domain teams, the infrastructure should be easy to use and maintain (using common tools and infrastructure).
  4. Federated governance : Last but not least, there needs to be a defined policy around the accessibility and privacy of the data. This covers who can access the data and what can be accessed, from the schema and table level down to the column and property level. You can define different privileges, permissions and roles to achieve this.
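The principles above can be sketched as a small Python model. Everything here is hypothetical and for illustration only: the DataProduct class, the role names and the PII rule are assumptions, not part of any data mesh tool. The domain team owns the product and decides on grants, while one mesh-wide rule applies to every domain.

```python
from dataclasses import dataclass, field

# Federated governance: a rule shared by the whole mesh (hypothetical values).
# Columns tagged as PII may only be granted to roles on the allow-list.
PII_COLUMNS = {"email"}
ROLES_ALLOWED_PII = {"support"}


@dataclass
class DataProduct:
    name: str
    owner_domain: str  # domain-driven ownership
    columns: tuple
    grants: dict = field(default_factory=dict)  # role -> set of readable columns

    def grant(self, role, columns):
        """The owning domain grants column-level read access to a role."""
        requested = set(columns)
        unknown = requested - set(self.columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        if requested & PII_COLUMNS and role not in ROLES_ALLOWED_PII:
            raise PermissionError(f"role {role!r} may not read PII columns")
        self.grants.setdefault(role, set()).update(requested)

    def read(self, role, row):
        """Return only the columns this role is allowed to see."""
        allowed = self.grants.get(role, set())
        return {col: value for col, value in row.items() if col in allowed}


orders = DataProduct("orders", "sales", ("order_id", "amount", "email"))
orders.grant("analyst", ["order_id", "amount"])
orders.grant("support", ["order_id", "email"])

row = {"order_id": "A-7", "amount": 5.0, "email": "a@example.com"}
analyst_view = orders.read("analyst", row)  # no email column
support_view = orders.read("support", row)  # no amount column
```

Here the sales domain decides which roles see which columns, but it cannot grant the email column to the analyst role, because the shared PII rule sits above any single domain.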

With the principles explained above, we can address the issues posed by the data lake architecture.

#1 : The data mesh defines a distributed approach to data architecture. Ownership of the data is distributed and decentralized, which lets the respective teams access their data quickly and easily.

#2 : With decentralized ownership, the data can scale and respond to business needs.

#3 : With decentralized data ownership, the individual domains are responsible for data security and quality.

As data grows exponentially, we need a modern way of addressing data storage, governance and security, and of getting meaningful insights from the data quickly and with ease. Data Mesh is a great step towards achieving that.

Thanks,

Raja Saurabh Tiwari

