Data Platforms in an Enterprise

In any modern enterprise, data platforms play a crucial role because they are the basic infrastructure required to extract value from data. In a broader context, data platforms can serve both operational and analytical use cases. In this article I will discuss data platforms from the perspective of analytical use cases.

High level overview of the data platform

Any enterprise does business in various markets or countries and is organized into domains or functions such as Finance, Supply Chain and Marketing. In addition, there are global systems such as ERP and CRM. Taking all of this into account, how can we come up with a data platform that efficiently serves analytical use cases while satisfying both functional and non-functional requirements? To begin with, we will look at each building block without any implementation details.

Data Sources:

Data sources can be classified as global or local. Global sources are the ERP, CRM and master data systems that are recognized across the enterprise, along with third-party data from external providers. Local, market-specific data comes from distributors and points of sale in specific markets; it helps regional business leaders make data-driven decisions for their regions or business units.

Global Data Lake:

I use the term data lake here only as a convention. Ideally these would be lakehouses built on one of the open table formats (Delta Lake, Apache Iceberg or Apache Hudi). A global, or universal, data lake provides a centralized structure at a time when decentralized architectures such as "Data Mesh" are getting a lot of visibility. Everything in this high-level architecture can still be realized using the data mesh principles: data as a product, domain-driven ownership, self-serve data infrastructure and federated governance.

Why do we still have a centralized global data lake? It can bring benefits in scalability, cost savings and time to market. When a platform team centralizes ingestion from all the global data sources, new sources can be onboarded quickly. The platform team can build connectors on reusable frameworks that ingest from a variety of data sources, which reduces the cognitive load on many other teams.
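As a rough illustration of what a reusable connector framework could look like, here is a minimal Python sketch. The names SourceConnector, CONNECTORS, register, InMemoryConnector and ingest are all hypothetical and are not taken from any specific tool; the in-memory connector only stands in for a real ERP or CRM connector.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Type


class SourceConnector(ABC):
    """Common contract that every source connector implements (hypothetical)."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        """Yield records from the source, one dict per record."""


# Registry that maps a source type name to its connector class.
CONNECTORS: Dict[str, Type[SourceConnector]] = {}


def register(source_type: str):
    """Class decorator that adds a connector class to the registry."""
    def wrap(cls):
        CONNECTORS[source_type] = cls
        return cls
    return wrap


@register("in_memory")
class InMemoryConnector(SourceConnector):
    """Stand-in for a real ERP/CRM connector; reads from a Python list."""

    def __init__(self, rows: List[dict]):
        self.rows = rows

    def read(self) -> Iterator[dict]:
        yield from self.rows


def ingest(source_type: str, **config) -> List[dict]:
    """Look up the connector for a source type and pull all of its records."""
    connector = CONNECTORS[source_type](**config)
    return list(connector.read())


records = ingest("in_memory", rows=[{"order_id": 1}, {"order_id": 2}])
```

With this shape, onboarding a new source means writing one more connector class and registering it; the ingest path that other teams call stays the same.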

If we zoom in on the global data lake a little:

The raw layer contains the data as it was available in the source, preserving the same format (possibly a row-oriented one), and it is append-only. The staging layer is derived from the raw layer by applying data quality rules, mostly technical validation. In a data product world, the staging layer can be called a "Source Oriented Data Product". It can also be called the "Bronze Layer" in the medallion architecture vocabulary.
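The relationship between the two layers can be sketched in a few lines of Python. This is only an illustration under assumptions: the record fields (order_id, amount) and the two technical rules are made up for the example, and a plain list stands in for the raw storage.

```python
from datetime import datetime, timezone

# Append-only raw layer: records are only ever added, never updated in place.
raw_layer = []


def land_in_raw(record: dict, source: str) -> None:
    """Append the record exactly as received, plus load metadata."""
    raw_layer.append({
        "payload": dict(record),
        "source": source,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    })


def passes_technical_checks(payload: dict) -> bool:
    """Hypothetical technical rules: key is present and amount is numeric."""
    if payload.get("order_id") is None:
        return False
    try:
        float(payload.get("amount"))
    except (TypeError, ValueError):
        return False
    return True


def build_staging() -> list:
    """Derive the staging layer from raw, keeping only valid, typed records."""
    return [
        {
            "order_id": row["payload"]["order_id"],
            "amount": float(row["payload"]["amount"]),
            "source": row["source"],
        }
        for row in raw_layer
        if passes_technical_checks(row["payload"])
    ]


land_in_raw({"order_id": 1, "amount": "10.5"}, source="erp")
land_in_raw({"order_id": None, "amount": "3"}, source="erp")   # missing key
land_in_raw({"order_id": 2, "amount": "n/a"}, source="erp")    # bad number
staging = build_staging()
```

All three records stay in the raw layer, so a rejected record can still be inspected or reprocessed later; only the first one reaches staging.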

Domain Specific Data Lakes:

Each enterprise organizes itself into functions or domains according to its business capabilities. Each of these functions has data associated with it, and that data is best understood by the people who work in that function.

If we zoom in a little on the domain-oriented data lakes:

The domain-oriented data lakes source their data from the staging layer of the global data lake. Business validation rules and data modelling are applied to the staged data to create derived, or curated, data. The data modelling approach is up to you and depends on the access patterns; it can be either normalized or denormalized. In the data product world this can be called a derived data product, and in the medallion architecture vocabulary it is the "Silver Layer".
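As a small sketch of this step, the Python below applies two made-up business rules to staged orders and denormalizes them with customer data. The datasets, field names and rules are assumptions chosen for illustration; a real domain would define its own.

```python
# Staged data as it might arrive from the global data lake (illustrative).
staging_orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C2", "amount": -5.0},  # fails amount rule
    {"order_id": 3, "customer_id": "C9", "amount": 40.0},  # unknown customer
]
staging_customers = [
    {"customer_id": "C1", "country": "IN"},
    {"customer_id": "C2", "country": "DE"},
]


def curate_orders(orders: list, customers: list) -> list:
    """Apply business rules, then denormalize orders with customer country."""
    country_by_customer = {c["customer_id"]: c["country"] for c in customers}
    curated = []
    for order in orders:
        # Business rule 1 (assumed): an order must have a positive amount.
        if order["amount"] <= 0:
            continue
        # Business rule 2 (assumed): the customer must exist in master data.
        country = country_by_customer.get(order["customer_id"])
        if country is None:
            continue
        curated.append({**order, "country": country})
    return curated


curated_orders = curate_orders(staging_orders, staging_customers)
```

Here the curated model is denormalized because the assumed access pattern is analysis by country; a normalized model would keep orders and customers as separate curated tables.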

Market Specific Data Lakes:

Every market has some local data that is useful only for analysis of that particular market. This is where market-specific data lakes come into play, with two paths for bringing data in.

If we look at the market-specific data lake a little deeper:

Data from the local sources of a specific market goes through the same transition from the raw to the curated layer. In addition, data from the domain-specific data lakes is merged into the curated layer to produce further derived data. The idea is to avoid duplicate copies of data: the merge can happen without moving the domain data.
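One way to picture merging without moving data is a view that joins the domain table with the local table at query time. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for a lakehouse query engine; the table names, columns and the IN market are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Curated table owned by the (hypothetical) sales domain lake.
    CREATE TABLE domain_sales (product_id TEXT, market TEXT, units_shipped INTEGER);
    INSERT INTO domain_sales VALUES ('P1', 'IN', 100), ('P1', 'DE', 80), ('P2', 'IN', 50);

    -- Curated table built from local point-of-sale data in the IN market.
    CREATE TABLE in_pos_sales (product_id TEXT, units_sold INTEGER);
    INSERT INTO in_pos_sales VALUES ('P1', 70), ('P2', 45);

    -- A view stores only the query, so the domain rows are not copied.
    CREATE VIEW in_sell_through AS
    SELECT d.product_id, d.units_shipped, p.units_sold
    FROM domain_sales AS d
    JOIN in_pos_sales AS p ON p.product_id = d.product_id
    WHERE d.market = 'IN';
""")

rows = conn.execute(
    "SELECT product_id, units_shipped, units_sold "
    "FROM in_sell_through ORDER BY product_id"
).fetchall()
```

The market team reads in_sell_through like any other curated dataset, while the domain team remains the single owner of domain_sales.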

Global Data Products:

Global data products are built on the enterprise-wide data. They are application-specific and carry logic for a focused audience.

These products are optimized for their use cases through appropriate aggregation. In the data products world they can be termed consumer-focused data products. In the medallion architecture this is the "Gold Layer".
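To make the aggregation step concrete, here is a minimal Python sketch that rolls curated rows up to revenue per market and month. The dataset, field names and grain are assumptions for the example; the right aggregation depends on the consuming application.

```python
from collections import defaultdict

# Curated (silver) rows, illustrative only.
curated_sales = [
    {"market": "IN", "month": "2024-01", "revenue": 120.0},
    {"market": "IN", "month": "2024-01", "revenue": 80.0},
    {"market": "DE", "month": "2024-01", "revenue": 200.0},
    {"market": "IN", "month": "2024-02", "revenue": 50.0},
]


def revenue_by_market_month(rows: list) -> list:
    """Aggregate curated rows to one row per (market, month)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["market"], row["month"])] += row["revenue"]
    return [
        {"market": market, "month": month, "revenue": total}
        for (market, month), total in sorted(totals.items())
    ]


gold_revenue = revenue_by_market_month(curated_sales)
```

A dashboard for regional leaders can then read a handful of pre-aggregated rows instead of scanning every curated record.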

Putting it all together

Though I have presented an implementation-agnostic view of the platform, one thing I would suggest is to register the data in a technical data catalog as it moves across the layers and lakes. This makes the data easier to discover and to work with as it goes downstream.
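A toy version of such a catalog might look like the following Python sketch. CatalogEntry and DataCatalog are hypothetical names for illustration and do not correspond to any particular catalog product; the dataset names are made up as well.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    name: str    # dataset name, unique within the catalog
    lake: str    # e.g. "global", "finance", "market_in"
    layer: str   # e.g. "raw", "staging", "curated"
    upstream: List[str] = field(default_factory=list)


class DataCatalog:
    """In-memory stand-in for a technical data catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, layer: str) -> List[str]:
        """Discover datasets by layer."""
        return sorted(e.name for e in self._entries.values() if e.layer == layer)

    def lineage(self, name: str) -> List[str]:
        """Return all upstream dataset names, nearest first."""
        result: List[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in result:
                continue
            result.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return result


catalog = DataCatalog()
catalog.register(CatalogEntry("raw.erp_orders", "global", "raw"))
catalog.register(CatalogEntry("staging.erp_orders", "global", "staging",
                              upstream=["raw.erp_orders"]))
catalog.register(CatalogEntry("finance.orders", "finance", "curated",
                              upstream=["staging.erp_orders"]))
```

Registering each dataset together with its upstream names at the moment it is written is what lets a downstream user trace finance.orders back to the ERP source.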

This is just a high-level overview of, or a way of thinking about, a data platform that can support analytical use cases. Many improvements and variations are possible. It can also be seen as a hybrid data mesh.




