If it isn't in Purview - it doesn't exist!
How CluedIn and Microsoft Purview deliver a single source of truth
Microsoft Purview makes more and more sense every time I use it.
You have to walk before you can run. While it might take longer to get to the finish line, the end result is almost always worth it. Just ask the determined Tortoise and the hasty Hare!
There are many facets to becoming a data-driven business, and more often than not it starts with knowing what data you actually have in the first place. Here at CluedIn, the tool we use to discover the data we have is Microsoft Purview. Our mantra these days is “If it isn't in Purview, it doesn’t exist”.
Before Purview can do anything useful, you have to register the systems you have. If those happen to be systems that Purview supports out of the box, it provides a good catalogue of assets, which is then enhanced by automated scans that look for new tables and assets in your lakes, databases and other supported systems. Microsoft will no doubt add support for more systems over time, but for now, our rule at CluedIn is that our Azure Data Lake is where we drop our data. This means that we don't need support for HubSpot, Zendesk and other systems in Purview, as we have a 100% push model and use CluedIn as the source system to push data to the lake.
So on a daily basis I get into work and Purview has discovered new files, new tables and more. From here, the journey begins.
CluedIn has a native integration with Microsoft Purview. This integration, amongst other benefits, has a real-time sync with the assets in Purview - meaning the moment that Purview picks up a table, CluedIn knows about it and is ready to ingest the data into its Master Data hub. Purview scans for metadata, provides me with a schema and attempts to detect the types of the data. Sometimes it gets it right, sometimes it needs help - but it is a good start. The beauty of the CluedIn to Purview integration is that CluedIn can pick up the lineage of the data from Purview and take it to the next step, such as taking an asset or a file and turning it into a number of records in the Master Data Management (MDM) platform.
Why is this important? Because once we have data at the record level, we can start to solve some of the challenges that can only be answered at this level of granularity, such as building a single view of a record, cleaning data, measuring data quality and enriching data.
It is important to establish that CluedIn is not the source of data; it is the source of truth. In fact, we at CluedIn believe that having an MDM as the source of data is the wrong approach entirely and will lead to issues down the path. Is Purview the source of truth? Is it the source of data? In our opinion, the answer to both is no.
Operational systems are the source of data. Purview is a metadata-driven, indexed view of your systems, marked up with helpful metadata to provide an asset catalogue.
Should Purview be the place where you find a true source of all your customers? Well... maybe. But can it do it alone? No. At CluedIn, we think that Purview should essentially be the catalogue of every asset that you have, even after they have been processed by CluedIn. With this in place, Purview can then (by proxy) maintain pointers to the locations that hold the source of truth.
CluedIn obtains the source of truth thanks to platforms like Purview. After all, CluedIn is not the place to register ALL the assets you have; its purpose is to provide a view of the data that needs to be addressed by an MDM platform. You can easily have files in Purview that won't be in CluedIn but, in our opinion, you shouldn't have data in CluedIn that is not registered in Purview, just like you shouldn't have data in Azure Databricks that is not registered in Purview.
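To make that rule concrete, here is a minimal sketch of the check it implies. The asset paths and the function name are made up for illustration, and the two sets stand in for whatever listing each platform gives you; nothing here calls a real CluedIn or Purview API.

```python
def unregistered(cluedin_sources, purview_assets):
    """Return sources present in CluedIn but missing from the Purview catalogue, sorted."""
    return sorted(set(cluedin_sources) - set(purview_assets))


# Hypothetical listings: Purview may hold assets CluedIn never touches,
# but anything in CluedIn should also appear in Purview.
purview_assets = {"lake/customers.csv", "lake/orders.csv", "lake/archive/old.csv"}
cluedin_sources = {"lake/customers.csv", "lake/leads.csv"}

missing = unregistered(cluedin_sources, purview_assets)
for path in missing:
    print(f"Not registered in Purview: {path}")
```

An empty result means the mantra holds. The check is deliberately one-directional, since files in Purview that never reach CluedIn are perfectly fine.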
As you have probably guessed, Purview is a piece of the puzzle; it can't go it alone. CluedIn is exactly the same: it needs other pieces to "complete" the story. The example I always like to use is the business glossary. The Purview glossary allows you to tag assets with business terms. This provides an answer to the question "what assets do I need to use?". If you’re lucky, this will all be in one asset and you’re good to go.
In our experience, however, business terms require you to look at the data at the record level to get the real answer. For example, if you want a list of banking customers, you’re most likely to get it from a master list of your customers that has been filtered down to the records which have the industry set to Banking. Don’t forget that this also means Data Stewards will have to reconcile all the different ways of representing banking - e.g. "Banks", "Banking", "Banking Services".
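As a rough sketch of that record-level step, assume a small in-memory master list and a hand-maintained synonym map. Both are hypothetical and only illustrate the idea; this is not how CluedIn implements cleaning.

```python
# Hypothetical synonym map a Data Steward might maintain; not a CluedIn or Purview feature.
INDUSTRY_SYNONYMS = {
    "banks": "Banking",
    "banking": "Banking",
    "banking services": "Banking",
}


def normalise_industry(raw):
    """Map a raw industry value to its canonical form; unknown values are only trimmed."""
    cleaned = (raw or "").strip()
    return INDUSTRY_SYNONYMS.get(cleaned.lower(), cleaned)


def banking_customers(records):
    """Return the records whose normalised industry is Banking."""
    return [r for r in records if normalise_industry(r.get("industry")) == "Banking"]


# Illustrative master list (made-up records).
master_list = [
    {"name": "Acme Bank", "industry": "Banks"},
    {"name": "Northwind", "industry": "Retail"},
    {"name": "Contoso Finance", "industry": " banking services "},
    {"name": "Fabrikam", "industry": "Banking"},
]

banking = banking_customers(master_list)
```

Filtering on the raw value "Banking" alone would have returned only one of the three banking customers here, which is why the standardisation has to happen before the filter.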
This is where CluedIn and Purview really complement each other. CluedIn needs to know what files/assets to bring together in order to answer the question. This can be achieved by the owners of the Purview catalogue tagging the assets with a glossary term from Purview as a hint to the CluedIn users that the answer to this question lies within the tagged files. It is then the job of the CluedIn user to integrate, clean, standardise and get the data down to the record level so that it can act as the source of truth.
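The "hint" idea can be sketched as a simple lookup: given a catalogue whose assets carry glossary terms, pick out the ones tagged with the term in question. The `Asset` class, the file names and the term below are all hypothetical stand-ins, not the Purview data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical stand-in for a catalogue entry; not the Purview data model."""
    name: str
    glossary_terms: set = field(default_factory=set)


def assets_for_term(assets, term):
    """Return the assets tagged with the given glossary term, ignoring case."""
    wanted = term.casefold()
    return [a for a in assets if wanted in {t.casefold() for t in a.glossary_terms}]


# Made-up catalogue: two assets carry the term, one carries none.
catalogue = [
    Asset("crm_accounts.csv", {"Banking Customer", "Customer"}),
    Asset("web_logs.parquet"),
    Asset("support_tickets.csv", {"banking customer"}),
]

hints = assets_for_term(catalogue, "Banking Customer")
```

The result is only the starting list of files to bring together; the integrating, cleaning and standardising described above still has to happen before those files yield a trustworthy answer.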
But the journey is not over yet. There is nothing wrong (quite the opposite) with now taking those records in CluedIn and publishing them back out to some type of "sink" such as Synapse, SQL Server or SAP so that Microsoft Purview can register and scan those systems. In this way, CluedIn now knows the answer AND Purview users can know it too.
There is so much more to Microsoft Purview and we’re discovering it on a daily basis. For me, it gives further credence to our decision to build a native integration to Purview, and makes me excited for the opportunities this offers in the future.