How we gave our team access to data that was ready for insight with CluedIn and Azure Purview
It is a dream of most companies today to provide their business with data products. These data products are often described as data that is easy and ready to consume, trustworthy and prepared for insight. At CluedIn, we eat our own dog food: we build our internal CluedIn solution the same way we guide our partners and customers to build theirs. We often say at CluedIn that "if you don't know that CluedIn is there, that is a good thing". Why? Because I don't want our team to have to learn a new tool. I want them to stay in the tools they are comfortable with and have easy access to data from within those tools. I want them to feel as if the data was perfect in the first place (which is far from the truth).
The challenge was that we wanted to market, internally, what data we had at CluedIn that was ready for insight. CluedIn has its Glossary, which is about describing data, not data assets. The main reason for this is that CluedIn is used to bring data assets together: it doesn't think in assets, it thinks in records. This doesn't take away from the fact that teams are comfortable working with data assets, e.g. "Can you send me our customers in Excel?".
At CluedIn it has always been easy to send data to downstream consumers such as our Synapse cluster, SQL Server databases or even directly to Azure Data Factory, which can push data to well over 30 different sinks. But how do our different teams know which datasets are actually available for them to consume? Do they just check their Synapse cluster every day? No.
We needed a centralised repository of datasets that had been curated, integrated, enriched, governed and cleansed through CluedIn, but that was not CluedIn itself.
Enter Azure Purview. CluedIn has a native integration with Azure Purview in which we synchronise the different Glossaries (Purview for the assets, CluedIn for the record level), and we also register all data that is placed into CluedIn and moved out of CluedIn. Essentially, if you upload a file from your Data Lake directly into CluedIn, we register in Purview that a file came from the Data Lake and was placed into CluedIn. We also write the smarts and value that CluedIn provides, such as Data Quality scoring, Sensitive Records and more, back to Purview for its visual Data Lineage.
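To make the lineage idea a little more concrete, here is a minimal sketch of what a "file came from the Data Lake and was placed into CluedIn" record could look like. Purview's catalog is built on Apache Atlas, where lineage is modelled as a Process entity with inputs and outputs. This is not CluedIn's actual integration code: the function name, the qualified names, the generic `DataSet` type and the idea of carrying the quality score in the description are all assumptions made for illustration, and nothing here is sent to Purview.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRef:
    """Reference to a catalogued asset by type and qualified name."""
    type_name: str
    qualified_name: str

    def to_atlas(self) -> dict:
        # Atlas-style object reference: a type plus its unique attributes.
        return {
            "typeName": self.type_name,
            "uniqueAttributes": {"qualifiedName": self.qualified_name},
        }


def build_lineage_process(name: str, source: AssetRef, sink: AssetRef,
                          quality_score: float) -> dict:
    """Build an Atlas-style Process entity linking a source to a sink.

    Hypothetical helper: the qualifiedName scheme and the use of the
    description to carry the quality score are illustrative assumptions.
    """
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be between 0 and 1")
    return {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": f"cluedin://processes/{name}",  # assumed scheme
            "name": name,
            "inputs": [source.to_atlas()],
            "outputs": [sink.to_atlas()],
            "description": (
                f"Processed by CluedIn; data quality score {quality_score:.2f}"
            ),
        },
    }


# Example values below are made up for illustration.
lake_file = AssetRef("DataSet", "https://example.dfs.core.windows.net/raw/customers.csv")
cluedin_set = AssetRef("DataSet", "cluedin://datasets/customers")

payload = build_lineage_process("ingest-customers", lake_file, cluedin_set, 0.87)
print(json.dumps(payload, indent=2))
```

The same shape works in the other direction: when data leaves CluedIn for a sink, the CluedIn dataset becomes the input and the sink becomes the output, which is what lets the lineage view show both sides of the journey.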
From here, we had a central governing body for all data movement and data assets, whether we placed that data through CluedIn or not. This is where we enabled the inbuilt Azure Data Share capabilities of Microsoft Azure and Purview so that all teams had a tool-agnostic and public repository of data assets that they could discover. What's even more interesting for us is that this gives us a mechanism to share datasets outside of our business with other organizations as well. This is all managed through Azure Active Directory, so we can control our sharing policies with other tenants with ease, and can retract this access with the same ease.
It also allows us to easily expose these datasets through REST APIs so that different parts of the business can consume them as streams of live data, or living and breathing data products. Naturally, our instinct here was to use the native GraphQL layer that comes with CluedIn, but once again this would require people within our company to know that CluedIn exists within the business. We think it is important that MDM is transparent but abstracted within a business and is part of the data pipeline instead of an afterthought. Hence the move to the REST APIs that are exposed natively by Azure Data Shares in Microsoft Azure was the right approach. The datasets are pulled and hosted on Azure Data Lake Gen 2, SQL Server, Blob Storage, Synapse and others, so that is where CluedIn (the company) exposes its datasets.
You can offer datasets as snapshots, or you can grant consumers access directly to the dataset in the source. The latter requires an instance of Azure Data Explorer, but we use this on a daily basis anyway, so we were happy that this was supported.
The beautiful part of this is that the consumer of the data receives a lovely email from Microsoft telling them which datasets they have access to. In this way, we actually push and advertise to our teams which datasets are available for them to consume, for either business intelligence or something a bit more sophisticated, like generating a net promoter score or predicting customer churn.
With this small addition of Azure Data Shares at CluedIn, all team members have a bucket of Data Shares in their Azure subscription that is synced hourly. If those teams need to know how this data came to be, its data quality scores and more, they are one click away from Purview, which gives them this information. Naturally, CluedIn has been the workhorse in the background, transparently making sure that the data we deliver to the Data Shares is high quality, governed, owned and ready for insight.