The idea of a Universal ID?

Dominic Rebello

Director | Data & AI

Published: 13 June 2019

It is the dream of every data architect to have a ubiquitous, universal ID that stitches its way through every system. There is no denying that this would help with our current data qualms, but it is not “the” answer.

Let’s start with the obvious statement: “I wish we had thought of this earlier”. The first problem, which we see at most companies, is that the idea of a universal ID only surfaced recently, so there is no easy way to go back through history and retrofit existing records with a new universal ID. The second is that not every system you use will support it. The third comes from a principle I learnt in the engineering industry, “separation of concerns”, i.e. make sure each part of the system is responsible for one thing and one thing only. Many could argue we have already broken this principle, e.g. CRM systems let you manage leads, customers, documents, marketing and more. That is not so much the problem; rather, the CRM should stay the CRM, and we should not put business-specific objects into it that span other systems. In engineering, we would have another system that composes the data from many services into a coherent structure. If this sounds familiar in the data world, it should: it is exactly what a Data Hub solves.

Let's imagine that you are starting a company today.

You are choosing the systems to run your business and you are savvy enough to think “let’s have a universal record Id that propagates throughout the business whenever records are created.” Something like RPA would be perfect for this type of situation (as long as it supports all of your stack, which rarely happens). If we play one example through, a customer visits the website and is tracked in Google Analytics. They ask for a demo by filling out a form with their email. The first point of entry is a centralised Id factory that checks the email for uniqueness and, if it is unique, generates a GUID or UUID; the Id and the form details are then saved into the CRM. So far, so good.
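To make the Id factory concrete, here is a minimal sketch of what that first point of entry could look like. The class name `IdFactory`, the `register` method and the in-memory dictionary are all assumptions for illustration; a real factory would sit in front of a persistent, shared store.

```python
import uuid


class IdFactory:
    """Hypothetical central Id factory: one UUID per unique email address."""

    def __init__(self):
        # In-memory stand-in for a persistent store, keyed by normalised email.
        self._ids_by_email = {}

    def register(self, email):
        """Return (record_id, created) for the given email address."""
        key = email.strip().lower()
        if key in self._ids_by_email:
            # The email is already known, so hand back the existing Id.
            return self._ids_by_email[key], False
        record_id = str(uuid.uuid4())
        self._ids_by_email[key] = record_id
        return record_id, True


factory = IdFactory()
record_id, created = factory.register("jane@example.com")
# The Id and the form details would now be saved into the CRM together.
```

Note that the factory only helps the systems that actually call it, which is exactly where the rest of this example runs into trouble.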

We then use the CRM, which is connected to your Office 365 account, to send them an email. The CRM tracks that you have sent the email; however, in Office 365 you realise that the UUID/GUID was not part of the transaction at all. It used the email address instead. We would now have to set up some type of flow that says “When a new email arrives in my mailbox, register a UUID for the Mail and the Contact”. Starting to see how this falls down?

The bottom line is that everything already has a unique Id; we are just thinking about the process of blending data across systems in the wrong way. The reason we want a Universal Id is that we always want a point-to-point way to join 2 systems. Instead, we should follow the engineering analogy above and give the responsibility for the universal Id, or at least for identifying that two records are the same, to a centralised system, i.e. the Data Hub.
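As a rough illustration of that responsibility, the sketch below keeps each system's own Id and lets a central index record which of them refer to the same thing. The class `DataHub` and its `add`, `link` and `lookup` methods are hypothetical names for this example only, not CluedIn's API.

```python
class DataHub:
    """Hypothetical index that maps (system, local Id) pairs to one entity."""

    def __init__(self):
        self._entity_of = {}  # (system, local_id) -> internal entity key
        self._members = {}    # internal entity key -> set of (system, local_id)
        self._next_key = 0

    def add(self, system, local_id):
        """Register a record under the Id its own system gave it."""
        ref = (system, local_id)
        if ref not in self._entity_of:
            self._entity_of[ref] = self._next_key
            self._members[self._next_key] = {ref}
            self._next_key += 1
        return self._entity_of[ref]

    def link(self, ref_a, ref_b):
        """Record that two system-local records describe the same thing."""
        key_a = self.add(*ref_a)
        key_b = self.add(*ref_b)
        if key_a == key_b:
            return
        # Fold every member of the second entity into the first.
        for ref in self._members.pop(key_b):
            self._entity_of[ref] = key_a
            self._members[key_a].add(ref)

    def lookup(self, system, local_id):
        """Return every known reference for the record, across all systems."""
        key = self._entity_of.get((system, local_id))
        return set() if key is None else set(self._members[key])


hub = DataHub()
hub.link(("crm", "42"), ("office365", "jane@example.com"))
print(hub.lookup("office365", "jane@example.com"))
```

No system has to adopt a new Id here: the CRM keeps its record Id, the mailbox keeps the email address, and only the hub knows that the two belong together.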

The “downside” of the Data Hub and ELT approach that we take at CluedIn is that merging of records is not done at data entry time, but rather “at some time in the future”. Is this acceptable for all use cases? No. Imagine that you looked yourself up in the Data Hub and saw two records referring to you. You know they are the same and you could yell at the screen for the system being silly, but there is essentially no way to join these two records (automatically) without more data input.

I also think that many people give some systems too much credit for how well they have been built. I can name many systems we have worked with in the financial industry that are well-established platforms but don’t have any referential integrity, i.e. you could easily create records with the same Id.

But does everything have a universal way to detect duplicates?

Kind of. Well, with a certain precision we can say yes. Let’s take a document. It exists in SharePoint with 44 versions, and then someone downloads the file and moves it to their Dropbox. We now have 3 instances of the same file, but each has a different Id. In SharePoint the Id will be a Uri or the ItemId, on the file system it will be the Path, and Dropbox will also use the Path for the Id. We can’t merge these records on their Ids at all. That is why, in certain situations, the content becomes the Id and, in turn, the debate starts. If the content of two documents is the same, are they the same document? CluedIn (by default) has decided “Maybe”. The way we can be more precise with our “Maybe” is by using many different hashing algorithms to hash the content, combined with the lineage that we track in our system.
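A minimal sketch of the content-as-Id idea, using Python's standard `hashlib`. The choice of algorithms and the function names `content_fingerprint` and `same_content` are assumptions for illustration; which algorithms CluedIn actually uses is not stated here.

```python
import hashlib

# Assumed set of algorithms; all three are always available in hashlib.
ALGORITHMS = ("sha1", "sha256", "blake2b")


def content_fingerprint(data: bytes) -> dict:
    """Hash the same content with several algorithms."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


def same_content(a: bytes, b: bytes) -> bool:
    """True when every digest matches, i.e. the bytes are almost certainly equal."""
    return content_fingerprint(a) == content_fingerprint(b)


sharepoint_copy = b"Q2 board report, final"
dropbox_copy = b"Q2 board report, final"
print(same_content(sharepoint_copy, dropbox_copy))
```

Matching digests only tell us the content is the same. Whether that makes it the same document is still the “Maybe”, which is why the answer has to be combined with lineage.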

There is an edge case.

If we have the file in Dropbox but have not yet added it to CluedIn, and we have modified the file slightly in Dropbox, then CluedIn will treat it as a separate document. It will, however, advise a human that there are documents that look very similar and ask whether they would like to mark them as the same file.
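One way to picture this edge case is a similarity score with a review threshold. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.9 threshold and the function name `suggest_review` are assumptions for illustration, and this is not a description of how CluedIn scores similarity.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.9  # assumed cut-off; a real system would tune this


def suggest_review(text_a: str, text_b: str) -> str:
    """Classify two documents as 'identical', 'review' or 'distinct'."""
    if text_a == text_b:
        return "identical"
    ratio = SequenceMatcher(None, text_a, text_b).ratio()
    # Similar but not equal: leave the final decision to a human.
    return "review" if ratio >= REVIEW_THRESHOLD else "distinct"


original = "The quarterly report shows revenue grew by ten percent."
edited = "The quarterly report shows revenue grew by eleven percent."
print(suggest_review(original, edited))
```

The slightly edited copy stays a separate record, but it is flagged so that a person can decide whether the two should be marked as the same file.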

If we take the “Person” record, this is also up for debate. An email address is usually a unique reference to a person (or group, mailbox etc.), but it can in no way guarantee that. For example, imagine you were [email protected]. This is unique. Then imagine that your friend Sarah signs you up for a dating service with that email address. If she were to set you up on a date, John Smith is in no way the author of that record, even though it was created under his account. We can, however, put a certain statistical confidence level on it having been John Smith. It is only after processing this data that we can start to gain or lose confidence in it.

A phone number is not a unique reference to a person, but it is for a certain range of time. Phone numbers can go back into a pool of phone numbers. It doesn’t happen much today, but remember getting those phone calls: “Hi, is John there? Oh, you must have his number now”. So once again, it is unique with a certain level of confidence. At this point, you should really be doubting your ETL processes and the confidence you have in your data. What about a username online, or a nickname? Once again, it is not unique at all, but rather a good indication. In today's data world, you need to be using a platform that works with a confidence level that caters for these situations.
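The “unique for a certain range of time” idea can be sketched by storing an identifier together with the period in which it was valid. The `Assignment` record and the `owner_on` function are hypothetical names for this example, and the phone number and dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    phone: str
    person: str
    start: date
    end: Optional[date]  # None means the number is still assigned


def owner_on(assignments, phone, day):
    """Return who held the phone number on the given day, if anyone."""
    for a in assignments:
        if a.phone == phone and a.start <= day and (a.end is None or day < a.end):
            return a.person
    return None


history = [
    Assignment("555-0100", "John", date(2010, 1, 1), date(2015, 6, 1)),
    Assignment("555-0100", "New owner", date(2016, 1, 1), None),
]
print(owner_on(history, "555-0100", date(2012, 3, 1)))
```

A call log from 2012 and one from 2018 share the same number but resolve to different people, and a call made in the gap between the two assignments resolves to nobody at all.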

In summary, there is no need for a Universal ID, because most systems cannot make it a reality. If you start to look into a centralised system like a Data Hub to act as the universal brain, one that can look up a record by any Id in any system and get you back a result, you are on the right path.

Please leave a comment, good or bad.
