Where Should You Deploy Your Analytical Data Model?


More often than not, you are required to build data models for reporting purposes. These include, but are not limited to, enterprise data models and fit-for-purpose models. With everything moving to the cloud, data models are increasingly important in business intelligence. Let's talk about where these models theoretically belong in the cloud versus where an organization actually prefers to keep them.

I have been lucky enough to work with various solutions for this over years of practice, and I would like to share that experience here. Before I go further, a caveat: nobody, myself included, knows every cloud platform inside out, so a certain level of bias is unavoidable. Listing every single option would also be confusing, so let's focus on one mainstream stack: Microsoft Azure and the Power Platform. In reality, AWS and GCP follow a similar roadmap.

To give you a better idea, let me start with how things have evolved over the years. Back in the day, when data storage was expensive and computing power was nowhere near what it is now, you had little choice but to create the data model in an OLAP tool and refresh it on a regular schedule. A typical tool of this kind is TM1, where the model is saved as multi-dimensional cubes held in memory for consumption. This is an ideal setup for an analytical layer, but it is costly, so its use was largely confined to planning and analysis tasks. In that case, it made more sense to deploy the data model inside TM1 itself and leverage its in-memory analytical engine.

Then came the tabular model built on a star schema, which found many more use cases because it is lightweight and agile. You now have more choices: the data model no longer has to live in the data source, such as the data warehouse. And here comes the confusion: where is the best place to deploy it? The general rule of thumb is to keep the analytical data model wherever the analytical engine handles it best. Specifically, there are several factors you should consider.

  1. Performance. Evaluate which platform will give you the best performance for analytical purposes.
  2. Cost. Analyze the computing cost associated with each platform.
  3. Integration. Think about how well each tool integrates with your existing systems and workflows.
  4. Scalability. Ensure the platform can scale with your growing data needs.
  5. Security and accessibility. Balance ease of consumption of the data model against data security.

Since building enterprise or fit-for-purpose tabular models is the more common case, let's focus on several common deployment scenarios.

Data model deployed to a database

This applies to both relational and non-relational databases, although the latter requires a different data modelling approach to account for the difference in data structure. In the common case, a SQL database can expose the data model as a view that joins multiple tables together. Depending on the capability of the platform, the view can be a plain view or a materialized view. Snowflake, for example, is primarily known as a cloud data warehouse designed for data storage, management, and analytics at scale; you can scale storage and compute independently to suit your analytical needs. Even so, a database is usually a pricier place to serve the analytical workload, and user access control is harder to achieve because it involves separating schemas and carefully managing user roles.


Data model in a data warehouse created using a view
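To make the view-based pattern above concrete, here is a minimal Python sketch that publishes a reporting view over a couple of tables using SQLAlchemy. The connection string, schema, and table and column names are hypothetical placeholders rather than anything from a real system; a warehouse such as Snowflake would need its own SQLAlchemy dialect, and could use a materialized view where supported.

```python
# Minimal sketch: publishing an analytical data model as a view in a SQL database.
# The DSN, schema, and table/column names below are hypothetical placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@host:5432/analytics")

create_view_sql = """
CREATE OR REPLACE VIEW reporting.vw_sales_model AS
SELECT
    f.order_id,
    d.calendar_date,
    c.customer_segment,
    f.sales_amount
FROM fact_sales   AS f
JOIN dim_date     AS d ON f.date_key = d.date_key
JOIN dim_customer AS c ON f.customer_key = c.customer_key
"""

# The DDL runs inside a transaction and is committed on success.
with engine.begin() as conn:
    conn.execute(text(create_view_sql))
```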


Data model deployed to a data preparation tool

Behind the scenes, data preparation is what transforms raw source data into cleansed, consumable data. This is where computing power takes the heaviest load because of the transformations involved. You might think this should be the place to deploy the analytical model. Let's hold that thought and first look at what data preparation actually does.

In both batch and stream processing scenarios, data is ingested from the source and transformed through aggregation, normalization, indexing and so on. Compute usage typically spikes for a period and then settles back to normal, which is why distributed processing engines such as Apache Spark rely on in-memory, parallel processing to handle big data workloads like batch processing. Popular tools here are Databricks and Azure Synapse.
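To make this concrete, here is a minimal PySpark sketch of the kind of batch transformation described above: ingest raw files, aggregate, and persist a curated output. The input path, column names, and output table name are purely illustrative assumptions.

```python
# Minimal PySpark sketch of a batch transformation: ingest, aggregate, write out.
# The input path, column names, and output table name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-prep-example").getOrCreate()

raw = spark.read.parquet("/landing/sales/2024/*.parquet")   # raw ingested data

daily_sales = (
    raw.withColumn("order_date", F.to_date("order_timestamp"))
       .groupBy("order_date", "store_id")
       .agg(
           F.sum("sales_amount").alias("total_sales"),
           F.countDistinct("order_id").alias("order_count"),
       )
)

# Persist the cleansed, aggregated output for downstream consumption.
daily_sales.write.mode("overwrite").saveAsTable("curated.daily_sales")
```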

There is nothing wrong with using a parallel processing engine to serve an analytical data model, except for its cost. It may suit small or ad hoc data models created on the fly with a data pipeline or a notebook. However, consumption of a data model is irregular in both data volume and timing, which makes it hard to scale the compute up and down the way you can around a predictable data preparation spike. The usual result is a high consumption cost.


Data model deployed to a data preparation tool

Data model deployed to BI tools

This seems a more popular option, given how far BI tools on the market have advanced. They specialize in the last leg of the journey: data modelling and visualization. Be mindful, though, that not all of them handle data models well. Some BI tools, such as Tableau, have little to no data modelling capability, while Power BI and Looker offer sophisticated data modelling functions, provided you are willing to learn their respective languages (DAX and LookML).

Despite the benefits, some argue that data modelling in a BI tool is not enterprise-level modelling. No doubt that is right. But remember how bulky desktop PCs were replaced by slim laptops; the same shift is happening in analytical data modelling. As more developers adopt agile development frameworks, the need to build massive, complicated data models in the cloud gives way to smaller, rapidly iterated ones. Thanks to cloud computing, long gone are the days when you had to install something heavy in your local environment just to get things working.

In fact, companies like Microsoft are smart enough to provide dedicated products for this. You used to rely on Azure Analysis Services for an enterprise-grade analytical layer; that role is now largely taken over by Power BI Premium, or even Fabric. Data models can also be put under Git, which makes source control and collaboration easier.



Data model deployed to Fabric Lakehouse

The latest trend focuses on integrating front-end development with back-end data warehousing and preparation. That leads to a new architecture for data model deployment using Fabric.


Fabric architecture

Because Fabric covers end-to-end tasks from data preparation and warehousing through to consumption, it blurs the boundaries between these three areas by bringing them under one platform. Fabric allows the data model to be built and kept within that platform without worrying about upstream or downstream integration. Balancing those integrations is the typical act an architect performs to ensure cloud resources are properly utilized: no over-provisioning at any step of the process, and no bottleneck caused by under-provisioning, which is a constant challenge as an organization's data grows and requirements change. Some organizations have the good initiative to build an enterprise data platform for consumption, but fail to realize that few service providers offer a fully integrated stack, leaving them to juggle resource costs across different SaaS solutions. Let me explain a bit more below.

An important attribute of an analytical data model is that it can connect to different data sources, whether cleansed or raw. A lot of organizations start with legacy tools for storage, preparation or consumption, and are then introduced, in a fragmented way, to a "better" tool for each of these processes. Gradually they end up with some processes far better optimized than others, and it becomes an architect's nightmare to fix. Yet arguably hardly any vendor on the market is significantly more cost-effective than the rest, so the only way out is to let these tools integrate with one another seamlessly.

Thanks to OneLake in Fabric, you can now connect different data sources to the data model in a flexible way: you can reach an Amazon S3 bucket through a copy activity, or a Databricks catalog through a SQL endpoint as a shortcut, to name a couple, which satisfies the multi-source requirement of a data model. Another sheer beauty of it is that there is no more juggling of resource costs across various SaaS providers; you have the flexibility to migrate those workloads into Fabric, since it combines data warehousing, transformation and visualization in one tool. Its simplified pricing model reflects its intention to integrate the upstream and downstream processes.
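As a rough sketch of what consuming such sources can look like in a Fabric notebook, the PySpark snippet below joins a table surfaced through a shortcut with a table native to the lakehouse and writes the result back. The lakehouse and table names are hypothetical, and `spark` is assumed to be the session a Fabric notebook provides by default.

```python
# Rough sketch, assuming a Fabric notebook attached to a lakehouse where
# "s3_orders" is data surfaced via a shortcut and "dim_customer" is a local table.
# All names here are hypothetical.
from pyspark.sql import functions as F

orders = spark.read.table("my_lakehouse.s3_orders")        # external data via a shortcut
customers = spark.read.table("my_lakehouse.dim_customer")  # table native to the lakehouse

model = (
    orders.join(customers, on="customer_key", how="left")
          .groupBy("customer_segment")
          .agg(F.sum("sales_amount").alias("total_sales"))
)

# Persist the modelled output back into the lakehouse for downstream consumption.
model.write.mode("overwrite").saveAsTable("my_lakehouse.sales_by_segment")
```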

Before the introduction of Fabric, Microsoft had already done a lot of good work with Azure and the Power Platform separately. Over the years it realized that the Power Platform, and Power BI in particular, was outpacing Azure in popularity thanks to a stream of new features. It finally recognized that requirements determine everything, and consolidated the Azure data stack and Power BI under the Fabric brand. Now back to the ultimate question of where the analytical data model for consumption should be kept: as you already know, as close to downstream as possible, sitting inside the analytical engine. With Fabric, the barriers between these layers are blurred behind a single pricing model, even though they still exist behind the scenes. So for developers and architects there is no more juggling of resource and cost allocation, which is a good call for the business.

I hope this article dispels some myths around deploying analytical data models for consumption. As always, let me know if you have different ideas you would like to share.
