Data Platforms in an Enterprise
In any modern enterprise, data platforms play a crucial role because they are the basic infrastructure required to extract value from data. In a broader context, data platforms can support both operational and analytical data use cases. In this article I will discuss data platforms from the perspective of analytical use cases.
High-level overview of the data platform
If you look at any enterprise, it will be doing business in various markets or countries and will be organized into domains or functions such as Finance, Supply Chain and Marketing. In addition, there will be global systems such as ERP and CRM. Taking all of this into account, how can we come up with a data platform solution that efficiently serves the analytical use cases while satisfying both the functional and non-functional requirements? To begin with, we will dive deeper into each block without any implementation details.
Data Sources:
Data sources can be classified as global or local. Global sources include ERP, CRM and master data that are universally recognized across the enterprise, along with third-party data from external data providers. Local, market-specific data comes from distributors and point-of-sale systems in specific markets, and it helps regional business leaders make data-driven decisions for their regions or business units.
Global Data Lake:
I have used the term data lake just as a convention. Ideally these would be lakehouses that leverage one of the table formats (Delta Lake, Iceberg or Hudi). A global, or universal, data lake gives you the notion of a centralized structure at a time when decentralized architectures such as "Data Mesh" are getting much visibility. Whatever I have discussed in the high-level architecture can still be realized using the data mesh principles: data as a product, domain-driven ownership, self-serve data infrastructure and federated governance.
Why do we still have a centralized global data lake? It can bring benefits in terms of scalability, cost savings and quicker time to market. Centralizing the sourcing of data from all the global data sources under a platform team helps in onboarding new data sources quickly. The platform team can build connectors using reusable frameworks that ingest from a variety of data sources. This reduces the cognitive load on many other teams.
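As a rough illustration of what a reusable connector framework could look like, here is a minimal sketch in Python. The names SourceConnector, register, ingest and the two example connectors are hypothetical and only stand in for real extraction logic. The point is that every source implements the same small interface, so onboarding a new source means adding one class.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Type


class SourceConnector(ABC):
    """Common interface every source connector implements (hypothetical)."""

    @abstractmethod
    def read(self) -> Iterable[dict]:
        """Yield records from the source as plain dictionaries."""


# Registry mapping a source type name to its connector class.
CONNECTORS: Dict[str, Type[SourceConnector]] = {}


def register(source_type: str):
    """Class decorator that adds a connector to the registry."""
    def wrap(cls: Type[SourceConnector]) -> Type[SourceConnector]:
        CONNECTORS[source_type] = cls
        return cls
    return wrap


@register("erp")
class ErpConnector(SourceConnector):
    def read(self) -> Iterable[dict]:
        # Stand-in for a real extract from an ERP system.
        yield {"order_id": 1, "amount": 120.0}


@register("crm")
class CrmConnector(SourceConnector):
    def read(self) -> Iterable[dict]:
        # Stand-in for a real extract from a CRM system.
        yield {"customer_id": "C-7", "segment": "retail"}


def ingest(source_type: str) -> List[dict]:
    """Look up the connector for a source type and pull its records."""
    connector = CONNECTORS[source_type]()
    return list(connector.read())
```

A team that needs a new source would only write another decorated class; the ingest call stays the same for every source.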
If we zoom in on the global data lake a little:
The raw layer contains the data as it was available in the source, preserving the same format (possibly a row-oriented one), and it is an append-only layer. The staging layer is derived from the raw layer by applying data quality rules (mostly technical validation). In a data product world, the staging layer can be called a "Source Oriented Data Product". It can also be called the "Bronze Layer" in the medallion architecture vocabulary.
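The relationship between the two layers can be sketched as follows. This is only an illustration under assumed names: RawLayer, REQUIRED_FIELDS and the order_id and amount fields are hypothetical, and the technical checks shown (required fields present, amount parseable as a number) are just examples of what technical validation might cover.

```python
from typing import List, Tuple

REQUIRED_FIELDS = ("order_id", "amount")  # hypothetical schema


class RawLayer:
    """Append-only store: records are added as received, never updated."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def append(self, record: dict) -> None:
        self._records.append(dict(record))

    def read(self) -> List[dict]:
        # Return copies so callers cannot modify stored records.
        return [dict(r) for r in self._records]


def is_technically_valid(record: dict) -> bool:
    """Technical checks only: required fields present and amount numeric."""
    if any(record.get(f) is None for f in REQUIRED_FIELDS):
        return False
    try:
        float(record["amount"])
    except (TypeError, ValueError):
        return False
    return True


def build_staging(raw: RawLayer) -> Tuple[List[dict], List[dict]]:
    """Split raw records into staged (valid) and rejected rows."""
    staged, rejected = [], []
    for record in raw.read():
        (staged if is_technically_valid(record) else rejected).append(record)
    return staged, rejected
```

Note that the raw layer keeps every record, including the ones that fail validation, so the staging layer can always be rebuilt from it.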
Domain specific data lakes:
Each enterprise organizes itself into functions or domains according to its business capabilities. Each of these functions has data associated with it, and that data is most familiar to the people who work in that function.
If we zoom in a little on the domain-oriented data lakes:
The domain-oriented data lakes source their data from the staging layer of the global data lake. Business validation rules and data modelling are applied to the staged data to create derived, or curated, data. The data modelling approach is up to you to decide based on the access patterns, and it can be either normalized or denormalized. In the data product world this can be called a derived data product, and in the medallion architecture vocabulary it is the "Silver Layer".
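To make the step from staged to curated data concrete, here is a small sketch. The sample records, the two business rules (positive amount, known customer) and the denormalized output shape are all assumptions chosen for illustration, not rules from any particular domain.

```python
from typing import Dict, List

# Hypothetical staged data sourced from the global data lake.
staged_orders = [
    {"order_id": 1, "customer_id": "C-7", "amount": 120.0},
    {"order_id": 2, "customer_id": "C-9", "amount": -5.0},
]
staged_customers = {"C-7": {"name": "Acme", "country": "DE"}}


def passes_business_rules(order: dict, customers: Dict[str, dict]) -> bool:
    """Example business rules: positive amount and a known customer."""
    return order["amount"] > 0 and order["customer_id"] in customers


def curate(orders: List[dict], customers: Dict[str, dict]) -> List[dict]:
    """Build a denormalized curated table: order plus customer attributes."""
    curated = []
    for order in orders:
        if passes_business_rules(order, customers):
            curated.append({**order, **customers[order["customer_id"]]})
    return curated
```

Here the curated output is denormalized because the assumed access pattern is reporting by customer attributes; a normalized model would keep orders and customers as separate curated tables.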
Market Specific Data Lakes:
Every market has some local data that is useful only for analysis of that particular market. This is where the construct of market-specific data lakes comes into play, with two paths for getting data in.
If we look at the market-specific data lake a little deeper:
The data from local data sources for a specific market goes through the same transition from the raw layer to the curated layer. In addition, data from the domain-specific data lakes is merged into the curated layer to produce further derived data. The idea behind this is to avoid duplicate copies of data, and the merge can happen without moving the data.
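One way to picture merging without moving data is a read-time view that holds a reference to the domain data instead of a copy. The sketch below assumes hypothetical names (CuratedView, domain_sales, local_pos) and uses in-memory lists to stand in for tables; in a real platform the same idea would typically be a view or a shared table rather than a Python object.

```python
from typing import Callable, Iterator, List


class CuratedView:
    """Read-time view over another lake's records; stores no copy."""

    def __init__(self, source: List[dict], predicate: Callable[[dict], bool]):
        self._source = source          # reference, not a copy
        self._predicate = predicate

    def __iter__(self) -> Iterator[dict]:
        return (row for row in self._source if self._predicate(row))


# Hypothetical curated data owned by a domain data lake.
domain_sales = [
    {"market": "IN", "sku": "A1", "revenue": 100.0},
    {"market": "DE", "sku": "A1", "revenue": 80.0},
]
# Hypothetical local curated data owned by one market.
local_pos = [{"market": "IN", "sku": "A1", "units_sold": 40}]

india_domain_view = CuratedView(domain_sales, lambda r: r["market"] == "IN")


def merge_market_data() -> List[dict]:
    """Join local point-of-sale rows with the domain view on SKU."""
    pos_by_sku = {row["sku"]: row for row in local_pos}
    return [
        {**row, "units_sold": pos_by_sku[row["sku"]]["units_sold"]}
        for row in india_domain_view
        if row["sku"] in pos_by_sku
    ]
```

Because the view only references the domain data, any update made by the owning domain is visible to the market on the next read, and no second copy has to be kept in sync.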
Global Data Products:
Global data products are the ones that leverage enterprise-wide data. They are application specific and carry logic for a focused audience.
These products are optimized for their use cases through appropriate aggregation. In the data products world these can be termed consumer-focused data products. In the medallion architecture this is the "Gold Layer".
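As a small example of such an aggregation, the sketch below rolls curated rows up to revenue per country. The function name and the country and amount fields are hypothetical; a real consumer-focused product would aggregate along whatever dimensions its audience needs.

```python
from collections import defaultdict
from typing import Dict, List


def revenue_by_country(curated_rows: List[dict]) -> Dict[str, float]:
    """Aggregate curated rows into total revenue per country."""
    totals: Dict[str, float] = defaultdict(float)
    for row in curated_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)
```

For instance, three curated rows for two countries collapse into two totals, which is the shape a dashboard or report would consume directly.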
Putting it all together
Though I have presented an implementation-agnostic view of the platform, one consideration I would suggest is to register the data in a technical data catalog as it traverses the layers and lakes. This helps people discover the data and work with it as it goes downstream.
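A catalog can be as simple as a registry of datasets with their lake, layer and upstream sources. The sketch below is a toy in-memory version with hypothetical names (CatalogEntry, DataCatalog); it is not the API of any real catalog product, but it shows the two things registration enables: discovery by layer and lineage lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    name: str
    lake: str    # for example "global", "finance" or "market-in"
    layer: str   # for example "raw", "staging" or "curated"
    upstream: List[str] = field(default_factory=list)


class DataCatalog:
    """Toy in-memory technical catalog (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, layer: str) -> List[str]:
        """Discover dataset names registered for a given layer."""
        return sorted(e.name for e in self._entries.values() if e.layer == layer)

    def lineage(self, name: str) -> List[str]:
        """Return every upstream dataset name, nearest first."""
        result: List[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current not in result:
                result.append(current)
                if current in self._entries:
                    queue.extend(self._entries[current].upstream)
        return result
```

Each pipeline step would call register when it writes a dataset, so that downstream teams can find it and trace where it came from.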
This is just a high-level overview, or thought process, for a data platform that can support analytical use cases. There can be many improvements and variations on it. It can also be seen as a hybrid data mesh.