Merging data can be simple if you have the right toolkit...
Blending data is hard. Let's start with that realisation. Without trying to sound too facetious, too many companies take too simplistic an approach to solving duplicate data. I wanted to take the time to explain how our merging engine works, to show you how seriously we take merging and the big role it plays in processes like de-duplication.
The first process of merging data starts long before the actual merge.
Data doesn't blend naturally, not within a single system and definitely not across systems. If, from day one, there had been a perfect universal ID propagated throughout all systems, in both structured and unstructured data, then we would not be in this predicament - but we are. With this fact in place, we need to prepare the data so that it can blend as confidently as possible, knowing that it will never blend perfectly; there will always be outliers. Your job is to have a system in place that identifies what you need to do to make it blend properly.
So, the first stage of blending data is realising that you have a data quality issue.
There will be many cases where datasets join perfectly. There will be many where you think they join perfectly but, in fact, the data was entered incorrectly. And finally, there will be plenty of cases where you need the data from 3, 5, or 25 different systems available before you can properly merge data from other systems.
With this in mind, we now know that the first stage of merging data is knowing what quality of data you are working with. CluedIn does this by measuring 12 different data quality metrics. This allows us to set a benchmark for "how ready is this data to blend?". Only with 100% data quality could we confidently automate an engine that blends data - and 100% data quality doesn't exist. Even one row of meticulously crafted data will not be 100% quality. So imagine 25 years of bad data quality - yes, that is what you are basing your core company decisions on.
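To make that benchmark idea concrete, here is a minimal sketch in Python of scoring a few illustrative metrics - completeness, validity and uniqueness - and averaging them into a single "blend-readiness" number. The metric names, rules and equal weighting are my own assumptions for the example, not CluedIn's actual 12 metrics.

```python
import re

# Hypothetical quality checks -- illustrative only, not CluedIn's 12 metrics.
def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    return sum(1 for r in records if r.get(field)) / len(records)

def validity(records, field, pattern):
    """Share of records whose field matches a simple format rule."""
    rx = re.compile(pattern)
    return sum(1 for r in records if rx.fullmatch(r.get(field, "") or "")) / len(records)

def uniqueness(records, field):
    """Share of non-empty values that are unique."""
    values = [r.get(field) for r in records if r.get(field)]
    return len(set(values)) / len(values) if values else 0.0

records = [
    {"name": "John Smith", "email": "john@lego.com"},
    {"name": "Jon Smith",  "email": "john@lego.com"},
    {"name": "",           "email": "not-an-email"},
]

score = (
    completeness(records, "name")
    + validity(records, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+")
    + uniqueness(records, "email")
) / 3
print(f"Blend-readiness benchmark: {score:.0%}")  # well below 100%
```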
The reason these benchmarks are important is that every decision we make from here on is tracked, with full lineage back to the quality metrics that were in place at the time of the decision. This traceability allows a data auditor to trace back that two records were merged based on, say, 70% data quality - and it is very easy for an auditor to poke holes in that process. A good merging engine, however, sets high thresholds before it will confidently merge two records, especially in an automatic way.
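As a rough illustration of what that lineage could capture, here is a hypothetical merge-decision record that stores the quality score, the engine's confidence and the threshold at the moment the decision was made. The field names and values are assumptions, not CluedIn's audit schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical lineage record -- a sketch of what "full lineage" could capture,
# not CluedIn's actual audit schema.
@dataclass
class MergeDecision:
    left_id: str
    right_id: str
    quality_score: float        # data quality benchmark at the time of the decision
    merge_confidence: float     # how sure the engine was that the records match
    threshold: float            # minimum confidence required to auto-merge
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def auto_merged(self) -> bool:
        return self.merge_confidence >= self.threshold

decision = MergeDecision("crm:123", "erp:987",
                         quality_score=0.70, merge_confidence=0.82, threshold=0.95)
print(decision.auto_merged)  # False: below the high bar, so it goes to review instead
```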
Let’s start with the first process in the merging engine once data is at an acceptable quality level.
The first, simple and naive, approach is to take two records side by side and compare them. The amazing thing is that this alone is how most companies solve the problem today. In fact, there are very popular and quite expensive products on the market, ranked as leaders by analyst firms, that take only this approach to fuzzy merging. I reiterate: amazing. But we can't forget that this approach is a critical part of the overall process. String distance functions play an important role in establishing an ensemble of techniques that will yield overall precision in a merging engine. Imagine for a moment that we had two records, John Smith and Jon Smith. This technique can tell us that these names are very close, but in reality Jon is a popular derivative of John. This technique would say that Denmark and Danmark are very close; however, a native Dane would know that Danmark is, in fact, the Danish spelling of Denmark and hence should be treated as the same value. These techniques can have holes poked in them from every angle, yet companies are using this "logic" as the main fuzzy merging engine in their data quality pipeline.
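Here is a minimal sketch of that side-by-side comparison, using Python's standard difflib as a stand-in for whatever string distance function (Levenshtein, Jaro-Winkler, etc.) a real engine would use, plus a small synonym list to show why distance alone can't know that Danmark and Denmark are the same value. The threshold and synonym table are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1] -- a stand-in for Levenshtein,
    Jaro-Winkler or any other string distance function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Known equivalences that pure distance cannot discover on its own.
SYNONYMS = {("danmark", "denmark"), ("jon", "john")}

def same_value(a: str, b: str, threshold: float = 0.85) -> bool:
    key = tuple(sorted((a.lower(), b.lower())))
    if key in SYNONYMS:
        return True
    return similarity(a, b) >= threshold

print(similarity("John Smith", "Jon Smith"))  # very high, but is Jon a different person?
print(same_value("Denmark", "Danmark"))       # True, via the synonym list, not distance
```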
The next approach is a clustering approach, i.e. let the machine do the work.
This approach yields really good results and also scales, both conceptually and in practice. The approach is simple: let history tell you how you should merge data. As a business, we can put hundreds if not thousands of rules in place defining what would cause two records to merge, but this doesn't and will never scale to the amount of data that companies are dealing with today. Enter "Machine Learning".
The clustering approach identifies "features" - simply, properties of entities - that would usually cause two records to fuzzily merge. As an example, imagine you have two records where the birthday is the same, the location is the same and the name is the same - there is a certain threshold at which you accept that these are the same person. Of course, you would also have to make sure this was not a common name. Maintaining that by hand does not and will never scale. Hence the clustering approach helps, because it simply uses the existing data you have to form these "rules".
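As a toy sketch of the clustering idea, assuming scikit-learn is available: turn a few hand-picked features (name similarity, same birthday, same location) into a pairwise distance and let a clustering algorithm group the candidate duplicates. The features, weights and DBSCAN parameters are assumptions for the example, not what CluedIn's engine actually uses.

```python
from difflib import SequenceMatcher
import numpy as np
from sklearn.cluster import DBSCAN

records = [
    {"name": "John Smith", "birthday": "1980-03-01", "city": "Copenhagen"},
    {"name": "Jon Smith",  "birthday": "1980-03-01", "city": "Copenhagen"},
    {"name": "Jane Doe",   "birthday": "1975-07-12", "city": "Aarhus"},
]

def pair_distance(a, b):
    """Combine a few illustrative features into one distance in [0, 1]."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    birthday = 1.0 if a["birthday"] == b["birthday"] else 0.0
    city     = 1.0 if a["city"].lower() == b["city"].lower() else 0.0
    similarity = 0.5 * name_sim + 0.3 * birthday + 0.2 * city   # assumed weights
    return 1.0 - similarity

n = len(records)
dist = np.array([[pair_distance(records[i], records[j]) for j in range(n)]
                 for i in range(n)])

# Records closer than eps end up in the same cluster -> candidate duplicates.
labels = DBSCAN(eps=0.15, min_samples=2, metric="precomputed").fit(dist).labels_
print(labels)  # e.g. [0, 0, -1]: the two Smiths cluster together, Jane Doe is left alone
```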
Let's not forget that "rules" are great. In fact, they will typically solve the large majority of duplicated data, and they should be put in place. What we are trying to solve here is the situation where standard rules don't scale: unstructured data and data that was entered incorrectly.
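For completeness, here is a sketch of the kind of deterministic rules that catch that easy majority - an exact match on a normalised strong identifier such as an email address or phone number. The specific rules are illustrative assumptions.

```python
import re

# Illustrative deterministic rules -- these catch the easy majority of duplicates.
def normalise_email(email: str) -> str:
    return email.strip().lower()

def normalise_phone(phone: str) -> str:
    return re.sub(r"\D", "", phone)   # keep digits only

def rules_say_merge(a: dict, b: dict) -> bool:
    """Hard rules: an exact match on a strong identifier is enough to merge."""
    if a.get("email") and normalise_email(a["email"]) == normalise_email(b.get("email", "")):
        return True
    if a.get("phone") and normalise_phone(a["phone"]) == normalise_phone(b.get("phone", "")):
        return True
    return False

a = {"name": "John Smith", "email": "John.Smith@lego.com", "phone": "+45 12 34 56 78"}
b = {"name": "Jon Smith",  "email": "john.smith@lego.com", "phone": "+45 12345678"}
print(rules_say_merge(a, b))  # True -- no fuzzy logic needed for this pair
```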
We are not done yet.
Enter the graph: a database family that runs some of the biggest companies in the world. The concept is simple: nodes connected via relationships. For example, on LinkedIn, you are a node, the manager you are connected to is another node, and your relationship would be "Worked For". Now imagine the two records above in a large network that contains not only people but documents, companies, tasks, emails and more. One of the first steps is to look at the data "around" the two records, i.e. their direct relationships. On that surrounding data, we also run the string-distance process from above, e.g. one record works for a company called Lego, the other for Lego LLC. This allows us to measure whether the data around these two records seems similar or not.
Context is very important in data merging, and the graph allows us to explore beyond the metadata view and into connected data. In the next part of our merging engine, context grows even further: we start to explore indirect relationships, i.e. nodes that are many "hops" away from the two records. This is typically where our merging engine becomes so statistically confident that the records are the same that it puts the nail in the coffin.
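Here is a minimal sketch of that graph idea using a plain adjacency dictionary: compare the neighbourhoods of two records at one hop and then two hops, and measure how much of their surrounding context overlaps. A real engine would run this on a graph database at scale; the toy data and the Jaccard scoring are assumptions for the example.

```python
from collections import deque

# Toy graph: nodes connected by relationships (people, companies, documents, emails, tasks).
GRAPH = {
    "person:john-smith":    {"company:lego", "doc:q3-report", "email:offsite-invite"},
    "person:jon-smith":     {"company:lego-llc", "doc:q3-report", "task:budget-review"},
    "company:lego":         {"person:john-smith", "company:lego-llc"},
    "company:lego-llc":     {"company:lego", "person:jon-smith"},
    "doc:q3-report":        {"person:john-smith", "person:jon-smith"},
    "email:offsite-invite": {"person:john-smith"},
    "task:budget-review":   {"person:jon-smith"},
}

def neighbourhood(node: str, hops: int) -> set:
    """All nodes reachable within `hops` relationships, excluding the node itself."""
    seen, frontier = {node}, deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbour in GRAPH.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen - {node}

def context_overlap(a: str, b: str, hops: int) -> float:
    """Jaccard overlap of the two nodes' surrounding context."""
    na, nb = neighbourhood(a, hops) - {b}, neighbourhood(b, hops) - {a}
    return len(na & nb) / len(na | nb) if na | nb else 0.0

print(context_overlap("person:john-smith", "person:jon-smith", hops=1))  # direct relationships
print(context_overlap("person:john-smith", "person:jon-smith", hops=2))  # more context, more confidence
```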
We are still not done yet.
All data is not created equal. In fact, as we move from customer to customer, we realise that a lot of companies have the same software, but their data looks very different. It is because of this that we introduce CluedIn Clean and CluedIn Train. CluedIn Clean allows a data engineer to normalise data at a scale that can't be achieved in a purely automatic way. CluedIn Train is available to Data Stewards to label or "train" the data. Simple and blunt questions like "Is John Smith a Person?" help CluedIn learn, over time, to identify certain objects in structured and unstructured data. I mentioned the clustering function above; this clustering function needs to be trained over time to get better. There is absolutely no way that generic clustering models will solve this problem - rather, they will give a generic solution.
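As an illustration of the cleaning side, here is a sketch of the sort of normalisation mapping a data engineer might maintain - mapping known variants of a value to one canonical spelling. This is only meant to show the idea; it is not CluedIn Clean's actual interface.

```python
# Illustrative normalisation rules a data engineer might maintain -- a sketch of
# the kind of cleaning step being described, not CluedIn Clean's actual interface.
CANONICAL_VALUES = {
    "country": {"danmark": "Denmark", "dk": "Denmark", "denmark": "Denmark"},
    "company": {"lego llc": "Lego", "lego a/s": "Lego", "lego": "Lego"},
}

def clean(record: dict) -> dict:
    """Map known variants of a field's value to one canonical spelling."""
    cleaned = dict(record)
    for field, mapping in CANONICAL_VALUES.items():
        value = record.get(field)
        if value:
            cleaned[field] = mapping.get(value.strip().lower(), value)
    return cleaned

print(clean({"name": "Jon Smith", "country": "Danmark", "company": "Lego LLC"}))
# {'name': 'Jon Smith', 'country': 'Denmark', 'company': 'Lego'}
```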
As an absolute final fallback, we get into manual merging of records.
This covers all the records for which we were not statistically confident enough to merge through the processes detailed above. It is at this point that the final smarts of the merging process kick in: everything that is manually merged by a human also helps train the merging models for your specific company, i.e. the clustering starts to learn, over time, why certain records merge.
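To show how those human decisions could feed back into the models, here is a sketch, assuming scikit-learn, where each steward decision becomes a labelled pair of feature values used to train a simple classifier that scores future candidate merges. The features, labels and model choice are assumptions for the example, not a description of CluedIn's internals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each manual decision becomes a labelled example: pair features -> merged or not.
# Assumed features: [name similarity, same birthday, neighbourhood overlap]
labelled_pairs = [
    ([0.95, 1.0, 0.60], 1),   # steward merged John Smith / Jon Smith
    ([0.40, 0.0, 0.05], 0),   # steward rejected an unrelated pair
    ([0.88, 1.0, 0.40], 1),
    ([0.70, 0.0, 0.10], 0),
]
X = np.array([features for features, _ in labelled_pairs])
y = np.array([label for _, label in labelled_pairs])

model = LogisticRegression().fit(X, y)

# Score a new candidate pair; low-confidence pairs go back to the steward, whose
# decision is appended to labelled_pairs and the model is refit over time.
candidate = np.array([[0.90, 1.0, 0.55]])
print(model.predict_proba(candidate)[0, 1])  # probability the records should merge
```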
I will argue that this approach can take time to get used to. I can see the comfort blanket that comes from having a set of rules that govern data merging - but tell me, when was the last time you thought that was doing enough? It is hard for a business to accept that, to get better precision in data, we need to rely on statistical confidence levels rather than hard rules. As we often say to our customers, we are living in a data world of precision. Nothing is 0% or 100%; it lies somewhere in between. Use cases like Advanced Analytics and Machine Learning, in particular, are all built on precision.
For the sake of brevity, I have left some parts of the overall merging process out, simply because we could be here all day reading about different string distance functions and the value they bring, or the output of an SVM-based model versus a Random Forest. The value is in the output. We have helped customers identify and cleanse 14% of duplicated data across their systems - data that was being used to fuel Business Intelligence, Machine Learning and more. We have helped customers put processes in place to improve data quality and deduplication over time. Without a doubt, new merging techniques will evolve over time, and CluedIn will be there to increase our precision with them.
Please comment with feedback. Thank you.