The idea of a Universal ID?

It is the dream of the data architect that there is a ubiquitous and universal ID that stitches its way through all different systems. There is no denying that this would help with our current data qualms but it is not “the” answer.

Let’s start with the obvious statement: “I wish we had thought of this earlier”. The first problem we see at most companies is that the idea of a universal ID only surfaced recently, so there is no easy way to go back through history and retrofit existing records with a new universal ID. The second is that not all of the systems you use will support it. The third comes from a principle that I have learnt from the engineering industry, “separation of concerns”: make sure each part of the system is responsible for one thing and one thing only. Many could argue we have already broken this, e.g. CRM systems will let you manage leads, customers, documents, marketing and more. That is not so much the problem; the point is that the CRM should stay the CRM, and we should not put business-specific objects into it that would span other systems. In engineering, we would have another system that composes the data from many services into a coherent structure. If this sounds familiar in the data world, it should: this is exactly what a Data Hub solves.

Let's imagine that you are starting a company today.

You are choosing the systems to run the business and you are savvy enough to think “let’s have a universal record Id that propagates throughout the business whenever records are created.” Something like RPA would be perfect for this type of situation (as long as it supports all of your stack, which rarely happens). Let’s play one example through: a customer visits the website and is tracked in Google Analytics. They ask for a demo by filling out a form with their email. The first point of entry is a centralised Id factory that checks the email for uniqueness and, if it is unique, generates a GUID or UUID; the Id and the form details are then saved into the CRM. So far, so good.
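The Id factory step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `IdFactory` class, its in-memory dictionary and the email normalisation rule are hypothetical, and a real factory would sit behind a shared, persistent store.

```python
import uuid


class IdFactory:
    """Hypothetical centralised Id factory keyed on a normalised email."""

    def __init__(self):
        # Assumption: an in-memory map stands in for a shared database.
        self._ids_by_email = {}

    def register(self, email):
        """Return (record_id, is_new) for the given email address."""
        key = email.strip().lower()
        if key in self._ids_by_email:
            # The email has been seen before: reuse the existing Id.
            return self._ids_by_email[key], False
        record_id = str(uuid.uuid4())
        self._ids_by_email[key] = record_id
        return record_id, True


factory = IdFactory()
first_id, first_is_new = factory.register("jane@example.com")
second_id, second_is_new = factory.register(" Jane@Example.com ")
print(first_is_new, second_is_new, first_id == second_id)
```

The second registration reuses the first Id because both inputs normalise to the same email. Note that the factory only helps systems that actually call it.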

We then use the CRM, which is connected to your Office 365 account, to send them an email. The CRM tracks that you have sent the email; however, in Office 365 you realise that the UUID/GUID was not part of the transaction at all. It used the email address instead. We would now have to set up some type of flow that says “When I have a new email in my mailbox, register a UUID for the Mail and the Contact”. Starting to see how this falls down?

The bottom line is that everything already has a unique Id; we are just thinking about the process of blending data across systems the wrong way. The reason we want a Universal Id is that we always want a point-to-point way to join two systems. Instead, we should follow the engineering analogy above and give the responsibility for the universal Id, or at least for identifying that two records are the same, to a centralised system, i.e. the Data Hub.
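One way to picture that centralised responsibility is an index that keeps every system's own Id and only records which Ids refer to the same thing. The sketch below is an assumption for illustration, not how any particular Data Hub is built: `IdentityIndex`, the `(system, local_id)` key shape and the sample Ids are all hypothetical, and the grouping uses a plain union-find structure.

```python
class IdentityIndex:
    """Hypothetical hub-side index linking (system, local_id) pairs."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Register unseen keys as their own group, then walk to the root.
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            # Path halving keeps later lookups short.
            self._parent[key] = self._parent[self._parent[key]]
            key = self._parent[key]
        return key

    def link(self, a, b):
        """Record that two system-specific Ids refer to the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)

    def aliases(self, key):
        """All known Ids, in any system, for the entity behind `key`."""
        root = self._find(key)
        return sorted(k for k in list(self._parent) if self._find(k) == root)


hub = IdentityIndex()
hub.link(("crm", "c-1001"), ("office365", "jane@example.com"))
hub.link(("office365", "jane@example.com"), ("analytics", "ga-77"))
print(hub.same_entity(("crm", "c-1001"), ("analytics", "ga-77")))
print(hub.aliases(("analytics", "ga-77")))
```

No source system had to learn a new Id here. The CRM and the analytics record were never linked directly, yet the hub can answer that they belong together.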

The “downside” of the Data Hub and ELT approach that we take at CluedIn is that merging of records is not done at data entry time, but rather “at some time in the future”. Is this acceptable for all use cases? No. Imagine that you searched the Data Hub and saw two records referring to you. You know they are the same, and you could yell at the screen for the system being silly, but essentially there is no way to join these two records (automatically) without more data input.

I also think that many people give some systems too much credit for how well they have been built. I can name many systems we have worked with in the financial industry that are well-established platforms but don’t have any referential integrity, i.e. you could easily create records with the same Id.

But does everything have a universal way to detect duplicates?

Kind of. With a certain precision, we can say yes. Let’s take a document. It exists in SharePoint with 44 versions, then someone downloads the file and moves it to their Dropbox. We now have 3 instances of the same file, but each has a different Id: in SharePoint it is a Uri or the ItemId, on the file system it is the Path, and Dropbox also uses the Path as the Id. We can’t merge these records on their Ids at all. That is why, in certain situations, the content becomes the Id and, in turn, the debate starts. If the content of two documents is the same, are they the same document? CluedIn (by default) answers with a decided “Maybe”. We make this “Maybe” more precise by hashing the content with many different hashing algorithms and combining that with the lineage that we track in our system.
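To make “the content becomes the Id” concrete, here is a minimal sketch using Python's standard `hashlib`. It is not CluedIn's implementation: the choice of SHA-256 and SHA-512, the function names and the sample Ids are assumptions, and lineage is left out entirely. A matching fingerprint is evidence for the “Maybe”, not proof that two records are the same document.

```python
import hashlib

# Assumption: two standard algorithms stand in for "many different" ones.
ALGORITHMS = ("sha256", "sha512")


def fingerprint(content: bytes) -> dict:
    """Hash the raw bytes with each algorithm and return the hex digests."""
    return {name: hashlib.new(name, content).hexdigest() for name in ALGORITHMS}


def possibly_same_document(a: bytes, b: bytes) -> bool:
    """True when every digest matches, i.e. the bytes are almost surely equal."""
    return fingerprint(a) == fingerprint(b)


original = b"Q1 board report, final version"

# Hypothetical Ids: each system identifies the same bytes differently.
copies = {
    ("sharepoint", "ItemId-44"): original,
    ("filesystem", "/home/me/report.docx"): original,
    ("dropbox", "/Reports/report.docx"): original,
}

# Group the records by content fingerprint instead of by system Id.
groups = {}
for record_id, content in copies.items():
    key = fingerprint(content)["sha256"]
    groups.setdefault(key, []).append(record_id)

print(len(groups))
print(possibly_same_document(original, original + b"!"))
```

All three records land in one group because their bytes are identical, while a single added byte produces a different fingerprint.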

There is an edge case.

If we have the file in Dropbox but have not yet added it into CluedIn, and in Dropbox we have modified the file slightly, then CluedIn will treat this as a separate document. It will, however, advise a human that there are documents that look very similar and ask whether they would like to mark these as the same file.
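A slightly modified file no longer shares a hash with the original, so catching this edge case needs a similarity measure rather than an equality check. The sketch below uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in; the 0.9 threshold and the function names are assumptions, and a production system would likely use something that scales better to large files.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, a, b).ratio()


def review_suggestion(a: str, b: str, threshold: float = 0.9) -> str:
    """Decide whether to merge, ask a human, or keep documents separate."""
    score = similarity(a, b)
    if score == 1.0:
        return "identical"
    if score >= threshold:
        # Close but not equal: surface it to a person instead of merging.
        return "ask a human"
    return "different"


original = "Quarterly report: revenue grew 4% in Q1."
edited = "Quarterly report: revenue grew 5% in Q1."
unrelated = "Meeting notes from the offsite."

print(review_suggestion(original, original))
print(review_suggestion(original, edited))
print(review_suggestion(original, unrelated))
```

The middle outcome is the important one: the system does not merge on its own, it only raises the question.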

If we take the “Person” record, this is also up for debate. An email is usually treated as a unique reference to a person (or group, mailbox etc.), but it in no way guarantees one. For example, imagine that you were [email protected]. This is unique. Then imagine that your friend Sarah signs you up for a dating service with that email address. If she were to set you up on a date, in no way is John Smith the author of this record, even though it was done under his account. But we can put a certain statistical confidence level on it having been John Smith. It is only after processing this data that we can start to gain or lose confidence in it.

A phone number is not a unique reference to a person, but it is for a certain range of time. Phone numbers can go back into a pool of phone numbers, e.g. it doesn’t happen much today, but remember getting those phone calls: “Hi, is John there? Oh, you must have his number now”. So once again, it is unique with a certain level of confidence. At this point, you should really be doubting your ETL processes and the confidence you have in your data. What about a username online, or a nickname? Once again, it is not unique at all, but rather a good indication. In today's data world, you need to be using a platform that works with a confidence level that caters for these situations.
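The idea that an identifier is unique only for a range of time, and only with some confidence, can be sketched as a small data structure. Everything here is hypothetical: the `Assignment` record, the confidence numbers, the names and the sample phone number are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Assignment:
    identifier: str            # e.g. a phone number or username
    person: str
    valid_from: date
    valid_to: Optional[date]   # None means the assignment is still current
    confidence: float          # assumed score between 0 and 1


def owner_on(assignments: List[Assignment], identifier: str,
             day: date) -> Optional[Assignment]:
    """Return the most confident assignment covering `day`, if any."""
    candidates = [
        a for a in assignments
        if a.identifier == identifier
        and a.valid_from <= day
        and (a.valid_to is None or day < a.valid_to)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.confidence)


history = [
    Assignment("+1-202-555-0143", "John", date(2010, 1, 1), date(2015, 6, 1), 0.9),
    Assignment("+1-202-555-0143", "Maria", date(2016, 1, 1), None, 0.8),
]

print(owner_on(history, "+1-202-555-0143", date(2014, 3, 1)).person)
print(owner_on(history, "+1-202-555-0143", date(2020, 3, 1)).person)
print(owner_on(history, "+1-202-555-0143", date(2015, 9, 1)))
```

The same number resolves to different people depending on the date, and to nobody while it sat in the pool, which is exactly why a plain equality join on phone numbers is risky.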

In summary, there is no need for a Universal ID, because most systems cannot make it a reality. If you start to look into a centralised system like a Data Hub to act as the universal brain, one that can look up a record by any Id in any system and get you back a result, you are on the right path.

Please leave a comment, good or bad.

More articles by Dominic Rebello