Data requirements and evolving implementations
Shyam Singhal (Ph.D.)
Digital & Agile Transformation | Product Innovation and Development | Practice Head | Delivery Head | Excellence Head | Ex-Microsoft, Hewlett-Packard, Accenture
I have been thinking about this for some time now, wondering what has changed over time to produce so many solutions in the ‘data’ space.
We have had business transaction (operational) data for a long time. To derive insights from that collected data came analytical processing, and to make data friendly to analytical processing came the need to de-normalize it. However, ‘normalization’ and ‘de-normalization’ were technical devices to facilitate data recording and maintenance; they had nothing to do with business or end-user needs.
If I look at the business / end-user needs, they can be summarized as:
a) The transactional / operational data is managed and maintained
b) Insights can be derived from the collected data to improve business value and / or customer experience, i.e., some positive impact on some part(s) of the value chain
c) As far as possible, insights / data management can be self-served
With operations spreading across geographies and moving online came the need for:
d) Distributed data access, and
e) Scalability
As far as ‘Data Integrity’, ‘Security’, and ‘Governance’ are concerned, these attributes have been present from the day data collection started; however, they became more pronounced once online transactions began. These attributes or requirements might not have been stated explicitly, and perhaps still are not; nevertheless, they are an integral part of any data.
Therefore, any data solution today must fulfill points #a-#e, along with the attributes mentioned in the paragraph above. However, segregating transactional / operational data from analytical data is not a business requirement; rather, it is a technical or solution-implementation choice made by the solution architects of the time (or of today).
I understand that #c was not in the original list and was added later (I don’t know exactly when); however, it is an obvious extension of the basic requirements.
Over decades of this journey, businesses and people engaged in data collection and management have seen Data Warehouses, Big Data, Data Lakes, Data Lakehouses, Multi-Modal Data Lakes (thanks to the cloud), Data Fabric, Data Mesh, Data Virtualization…the list goes on, and in future might grow even longer.
The point is, whichever technology / solution you take, the fundamental requirements / needs remain the same (points #a-#e, plus the three attributes mentioned in the paragraph following those points). However, with each subsequent iteration of data’s evolution, in terms of size / volume, types of data, accessibility of data, and representation of data, the solutions have also evolved. It is worth mentioning, though, that no single data solution qualifies as a panacea.
Over time, data creation and collection have increased exponentially, thanks to devices emitting data, web / app analytics needs, GPS, and so on. The fact of the matter is that data is collected to generate insights from it, apart from serving as a system of record or for compliance. If that is true, then how we process data to serve the needs stated above (#a-#e) is simply a set of solution requirements, which have evolved because of the volume and variety of data, further compounded by geographical spread and statutory requirements enforced by various governments and related agencies.
For example, we have had data ingested by devices directly into the cloud, leading to the need for powerful compute resources and storage. Over time this gained an element of edge computing, to ease the load on the cloud and make it pocket-friendly for its users. Heterogeneous data led to the need for a ‘store’ that could capture the data quickly and manage it efficiently. And so on.
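To make the edge-computing point concrete, here is a minimal, purely illustrative Python sketch of a hypothetical edge node that aggregates raw device readings locally and forwards only a compact summary to the cloud; the EdgeAggregator class, batch size, and summary fields are my assumptions, not any particular product’s design.

```python
from statistics import mean

class EdgeAggregator:
    """Hypothetical edge node: buffer raw readings, emit only a summary."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []

    def ingest(self, reading):
        """Buffer a raw reading; return a compact summary once the batch is full."""
        self.buffer.append(reading)
        if len(self.buffer) < self.batch_size:
            return None
        summary = {
            "count": len(self.buffer),
            "min": min(self.buffer),
            "max": max(self.buffer),
            "mean": round(mean(self.buffer), 2),
        }
        self.buffer.clear()
        return summary  # only this small summary travels to the cloud


edge = EdgeAggregator(batch_size=5)
result = None
for value in [21.0, 21.4, 22.1, 20.9, 21.7]:
    result = edge.ingest(value)
print(result)  # {'count': 5, 'min': 20.9, 'max': 22.1, 'mean': 21.42}
```

The point of the sketch is only that the heavy lifting happens before the data ever leaves the device, which is what eases the load on central compute and storage.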
Over time, we have found that easy and secure access to raw or processed data, directly for end users or for those generating insights, is becoming the norm, or at least makes more sense, as any intermediary results in a loss of agility. It has also meant that data should not be replicated, to avoid silos and data-integrity issues; though, again, the latter is a technical requirement, not a business requirement.
As is evident, these complications arise from a solutioning perspective, while the fundamental requirements have not changed (points #a-#e). Therefore, discussions about whether we go for Virtualization, Data Fabric, Data Mesh, Data Lake…are something an end user should not have to bother about. Whether it is one technique or some combination of available techniques and technologies that makes sense from a solutioning perspective, why should it matter to me as an end user? As an end user I am least bothered about how my requirements are fulfilled (provided they are fulfilled ethically).
The problem arises when solution experts try to hide behind buzzwords / jargon to fulfill their own objectives. I believe that is where the ‘Ethical’ part comes into play. Now, without getting sucked into ethics and its implications, let’s briefly see what each of these technology trends means to us.
Broadly, there are three parts to any data journey:
1) Creation / Ingestion or Accessibility of Data,
2) Data Management, or Governance, and
3) Data Consumption / Exposure
Therefore, a technology / tool / framework could address one part or all parts of this journey; nonetheless, all of them revolve around these three aspects of the data journey.
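As a purely illustrative sketch (assuming nothing about any specific framework), the three parts of the journey could be pictured as independent interfaces that a given tool may implement in part or in full; the interface and method names below are hypothetical.

```python
from abc import ABC, abstractmethod

# 1) Creation / Ingestion or Accessibility of Data
class Ingestion(ABC):
    @abstractmethod
    def ingest(self, records):
        """Accept new records from devices, apps, files, etc."""

# 2) Data Management, or Governance
class Governance(ABC):
    @abstractmethod
    def is_allowed(self, user, dataset, action):
        """Decide whether a user may perform an action on a dataset."""

# 3) Data Consumption / Exposure
class Consumption(ABC):
    @abstractmethod
    def query(self, dataset, filters):
        """Expose data (raw or processed) to end users and tools."""
```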
A Data Fabric based solution would need to define how multiple (possibly disparate) data stores are integrated, such that they can fulfill the data requirements in an integrated manner, transparently. Whether, behind the scenes, it runs ETL processes, enforces policies, or creates batch jobs to provide the required data view / access, all of that remains transparent to the end user.
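A minimal sketch of that transparency, using a toy ‘fabric’ facade of my own naming (DataFabric, SqlStore, ObjectStore) rather than any vendor’s actual API: the consumer asks for a dataset by name and never learns which underlying store, ETL process, or batch job serves it.

```python
class SqlStore:
    def query(self, dataset, filters):
        return [{"source": "warehouse", "dataset": dataset, **filters}]

class ObjectStore:
    def query(self, dataset, filters):
        return [{"source": "data_lake", "dataset": dataset, **filters}]

class DataFabric:
    """Routes a request to whichever registered store owns the dataset."""

    def __init__(self):
        self._routes = {}  # dataset name -> backing store

    def register(self, dataset, store):
        self._routes[dataset] = store

    def query(self, dataset, **filters):
        # The end user never sees which store (or ETL / batch job) answers.
        return self._routes[dataset].query(dataset, filters)


fabric = DataFabric()
fabric.register("orders", SqlStore())
fabric.register("clickstream", ObjectStore())
print(fabric.query("orders", region="EU"))             # served by the warehouse
print(fabric.query("clickstream", day="2023-01-01"))   # served by the lake
```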
Similarly, it may enforce policies for data access, rendering, and compliance with statutory requirements such as GDPR, HIPAA, and FCRA; the point is that the data is secured at rest and in motion, and is accessible only to authorized users as per their assigned privileges. Of course, it would also involve policies around retention, archival, deletion, and so on; once again, those are transparent to the end user.
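For illustration only, here is a simplistic policy check of the kind such a layer might enforce behind the scenes; the role names, dataset names, and compliance tags are assumptions made for the sketch.

```python
# Hypothetical per-dataset policies: allowed roles plus compliance tags.
POLICIES = {
    "customer_pii":  {"allowed_roles": {"data_steward"}, "tags": {"GDPR"}},
    "sales_summary": {"allowed_roles": {"analyst", "data_steward"}, "tags": set()},
}

def can_access(role, dataset):
    """Return True only if the role carries the privilege for this dataset."""
    policy = POLICIES.get(dataset)
    return policy is not None and role in policy["allowed_roles"]

assert can_access("analyst", "sales_summary") is True
assert can_access("analyst", "customer_pii") is False   # PII stays restricted
```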
Lastly, the data needs to be consumed to generate insights. Perhaps these are provisioned based on roles and granularity, taking into account support for the underlying technology and the extensibility of the interface. For example, it may offer canned views, insights, and catalogs per role, such as Business Analyst, Developer, or Data Scientist, and support Low / No Code technologies.
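A hedged sketch of that role-based provisioning: each role sees only its canned views or catalog entries. The roles and view names below are invented for illustration.

```python
# Hypothetical role -> catalog mapping; real catalogs would be far richer.
CATALOG = {
    "business_analyst": ["monthly_revenue_view", "churn_dashboard"],
    "developer":        ["raw_events", "api_usage_view"],
    "data_scientist":   ["feature_store", "raw_events"],
}

def views_for(role):
    """Return the canned views a given role is entitled to see."""
    return CATALOG.get(role, [])

print(views_for("business_analyst"))  # ['monthly_revenue_view', 'churn_dashboard']
```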
As is evident, it addresses all three parts of the ‘Data Journey’; however, it is not very prescriptive about how each of those is managed. ISVs can operate in one or all three parts and may provide either a partial or a complete solution through the suggested framework.
The problem with earlier solution approaches, and even with this approach, is that we depend on technologists to fulfill business requirements. Invariably those technologists are not domain experts and lack the functional knowledge to understand the workings, challenges, dependencies, risks, and problems that the business is trying to address.
Now, the obvious alternative to this approach would be to attach accountability for data to its origin / function / domain, so that each domain serves its customers by anticipating and gathering their needs and providing ready-to-consume data (‘Data as a Product’ in Data Mesh parlance) through consolidation, summarization, augmentation, and so on. In a nutshell, they ‘process’ the data for easy consumption by various users. Therefore, there seems to be a shift in data ownership, and its processing, from ‘Technologists’ to its ‘Owners’.
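As a rough illustration of ‘Data as a Product’ (the field names are my assumptions, not a standard Data Mesh schema), a domain team might publish something like the following descriptor alongside the data itself, making ownership, schema, and refresh expectations explicit to its consumers.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes with its data product."""
    name: str            # e.g. "orders.daily_summary"
    owner_domain: str    # the accountable business domain
    schema: dict         # column name -> type, agreed with consumers
    refresh_sla: str     # e.g. "daily by 06:00 UTC"
    endpoints: list = field(default_factory=list)  # how consumers pull it


orders_summary = DataProduct(
    name="orders.daily_summary",
    owner_domain="Sales",
    schema={"order_date": "date", "region": "string", "revenue": "decimal"},
    refresh_sla="daily by 06:00 UTC",
    endpoints=["s3://sales-products/daily_summary/", "sales_api/v1/summary"],
)
print(orders_summary.owner_domain)  # accountability sits with the domain, not IT
```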
Interestingly, this is the approach that ‘Data Mesh’ suggests. I believe it is a good one: even if the actual processing is still done by technologists (for the time being, until some low / no code solution enables functional owners to do it themselves), it is done in consultation and collaboration with the functional experts. And, over time, even those technologists could gain that functional expertise / knowledge.
Segregation of data by domain has its own challenges, chiefly the risk of creating silos; however, it is assumed that domains will collaborate and work seamlessly through internal integrations and/or automations. That is a big assumption to fulfill, given human psychology, but it is required for the approach to work efficiently. Then again, even ‘Data Fabric’ works on the principle of distributed databases with seamless integration, which can easily be adapted for Data Mesh.
The ‘Self-Serve Infrastructure as a Platform’ and ‘Federated Governance’ pillars of ‘Data Mesh’ cater to point #3 and (in part) point #2 of the ‘Data Journey’ detailed above. Therefore, the real change ‘Data Mesh’ proposes lies in ‘Data Ownership’ and ‘Data Formulation and Packaging’. But then, ‘Data Fabric’ did not prescribe a centralized data store either. Both approaches require transparent and seamless integration of distributed and potentially disparate databases to serve the needs of their customers.
The notion of a ‘Single Source of Truth’ has also been challenged, thanks to the affordable, high-performance compute power of the modern cloud. Now the focus is to replicate / localize and refresh data as per system requirements, breaking away from the shackles of a centralized data store and its associated constraints.
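A small, assumption-laden sketch of that replicate / localize-and-refresh idea: each regional replica refreshes on its own schedule rather than every consumer hitting one central store. The region names and intervals below are invented for illustration.

```python
# Hypothetical per-region replica configuration for a single source dataset.
REPLICA_CONFIG = {
    "eu-west":  {"source": "orders", "refresh_interval_min": 15},
    "us-east":  {"source": "orders", "refresh_interval_min": 5},
    "ap-south": {"source": "orders", "refresh_interval_min": 60},
}

def due_for_refresh(region, minutes_since_last):
    """Each replica refreshes on its own schedule, per local system needs."""
    return minutes_since_last >= REPLICA_CONFIG[region]["refresh_interval_min"]

print(due_for_refresh("us-east", 7))    # True  -> refresh this replica now
print(due_for_refresh("ap-south", 7))   # False -> local copy is fresh enough
```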
Shyam, you took me through the evolution of data technologies from the Data Warehouse to Data Mesh and the many other terms introduced by various firms. But the moot question remains: are they addressing the user requirements effectively?