What's my name?
Barthélémy Leconte
Strategic Account Manager at LexisNexis Risk Solutions | Helping businesses prevent Financial Crime
What's my name?
Well, looking at the heading of this article may have helped you: Barthélémy Leconte. That is, unless you have received emails from me, in which case I usually sign as Bart Leconte. And if you're part of the French Administration, to you I am Barthélémy Philippe Etienne Leconte.
Is my question stupid?
That's a silly question but it's an important one if you are building a tool based on name matching. And that's exactly what the company I work for has been doing for 30 years and what has been my job for the past 7 years.
We are talking about regulatory screening here, but you can apply the ideas below to any part of your business where you have to deal with names, from making the user interface more culturally inclusive to deduplicating your database.
For the sake of simplicity, I'll be focussing on the screening of individuals and solely on the name-matching part.
If you are not familiar with Regulatory Screening: Financial Institutions (FIs) and some non-FIs have an obligation to screen their customer database daily to uncover customers who have been added to national or international lists, and to freeze their assets. Moreover, they need to screen their customers against lists of millions of names to detect persons with political exposure or adverse information in the media.
My names?
If you were to write software to find me in your customer database, the safest bet would be to design it to look for the three variations that I go by (Bart Leconte, Barthélémy Leconte and Barthélémy Philippe Etienne Leconte).
However, this is a really French-centric approach as we tend to disregard middle names. And you're not going to be working solely with French customers, are you?
Picture this: Pablo, Diego, José, Francisco de Paula, Juan Nepomuceno, María de los Remedios, Crispín Cipriano de la Santísima Trinidad Ruiz y Picasso has been added to the Sanctions List! Is your tool just going to look for Pablo Picasso? Or Pablo Ruiz Picasso? If you want to look for any of the combinations of 2, 3 or 4 of his 15 names (ignoring word order), you are looking at 1,925 combinations...
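The 1,925 figure can be reproduced with a quick count. This is a minimal sketch, assuming the name is split into 15 individual name tokens (particles such as "de", "los", "la" and "y" dropped) and that word order is ignored. The token list and the function name are mine, for illustration only.

```python
from math import comb

# Assumption: particles ("de", "los", "la", "y") are dropped, leaving 15 name tokens.
PICASSO_TOKENS = [
    "Pablo", "Diego", "José", "Francisco", "Paula", "Juan", "Nepomuceno",
    "María", "Remedios", "Crispín", "Cipriano", "Santísima", "Trinidad",
    "Ruiz", "Picasso",
]

def count_subsets(n_tokens: int, sizes=(2, 3, 4)) -> int:
    """Number of unordered selections of the given sizes from n_tokens names."""
    return sum(comb(n_tokens, k) for k in sizes)

total = count_subsets(len(PICASSO_TOKENS))
print(total)  # 105 + 455 + 1365 = 1925
```

If word order mattered as well, the count would be far higher, which is why brute-force enumeration is rarely a practical design.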
If you have no idea where my name comes from or the cultural background surrounding it, you're in for a treat and you may want to throw Barthélémy Etienne Leconte, Barthélémy Philippe, Philippe Etienne, Philippe Leconte & Etienne Leconte into the mix.
Now, code that.
Enough with the French accent
You have built the first block of your software, and now you start to wonder about the é. Should you also expand the search to Barthélemy, Barthelémy, and Barthelemy? Well, Barthélemy and Barthélémy are not the same name. But online, I often sign up as Barthelemy, as accents are not accepted by all systems. So you can take the hit on precision and code your software to remove them.
You move on to è and transform it to e, à to a, ü to... wait. According to this article, the German ü should be transformed to ue. OK, let's do that. However, keep in mind that the French NGO Emmaüs should be transformed to Emmaus, not Emmaues.
Now, you just have to map ?, ?, ?...
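To make the accent problem concrete, here is a minimal sketch using Python's standard `unicodedata` module. The function names and the small German mapping table are my own assumptions for illustration, not a complete or authoritative transliteration scheme.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def german_fold(text: str) -> str:
    """Hypothetical German-style rule: ü -> ue, ö -> oe, ä -> ae, ß -> ss."""
    table = str.maketrans({"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"})
    # Normalise to NFC first so that "u" + combining diaeresis becomes "ü".
    composed = unicodedata.normalize("NFC", text)
    return strip_accents(composed.translate(table))

print(strip_accents("Barthélémy"))  # Barthelemy
print(german_fold("Müller"))        # Mueller
print(german_fold("Emmaüs"))        # Emmaues, although Emmaus is what we want
print(strip_accents("Søren"))       # Søren: ø has no decomposition, so it survives
```

The last line shows why accent stripping alone is not enough: letters such as ø or ł are not "a base letter plus an accent" in Unicode, so they need an explicit mapping, and the right mapping depends on the language the name comes from.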
Ok, nothing more to worry about right? No names with punctuation marks or anything?
I mean, we know what the Irish names are typically like, but we can safely rename Mister O'Hara to Mister Ohara. There is no chance this is a popular surname in Japan. Right? Ok, it is.
Let's split it in this case: Mister O Hara. It's not like anyone is named Mister O and this would mess up the system? Well, you are wrong: let me introduce you to the ex-French Secretary of State, Mister O.
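One way to picture the dilemma is to generate every candidate spelling instead of picking a single one. This is only a sketch: `apostrophe_variants` is a hypothetical helper, and a real tool would have to decide how to score each variant.

```python
def apostrophe_variants(surname: str) -> set[str]:
    """Return the spellings a screening tool might need to try for one surname."""
    normalised = surname.replace("’", "'")  # curly apostrophe -> straight
    if "'" not in normalised:
        return {normalised}
    head, tail = normalised.split("'", 1)
    return {
        normalised,        # O'Hara
        head + tail,       # OHara: equals the Japanese surname Ohara once case is ignored
        f"{head} {tail}",  # O Hara: "O" now looks like a standalone name
    }

print(sorted(apostrophe_variants("O'Hara")))  # ['O Hara', "O'Hara", 'OHara']
```

Each extra variant widens the net, so the price of not missing Mister O'Hara is more false hits on Mister Ohara and Mister O.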
Now, code that.
Barthélémy Leconte PhD?
To prove a point, I was thinking of getting a PhD for my first LinkedIn article. But I've learned that I'll need more than one weekend to make that happen so I'll take another example.
Let's rock'n'roll and assume Brian May has been added to the Sanctions List. Sorry, I mean, Sir Brian May. I mean, Sir Brian May CBE (Commander of the Most Excellent Order of the British Empire). I mean, Sir Brian May CBE PhD.
Ok, for sure, we can remove some parts of his name. Let's ignore Sir. No one is named Sir, right? Like, it's not a first name. As this is becoming a recurring joke, I'll let you search for Sir Carter (King & Prince are also quite popular).
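A naive title stripper shows how quickly this goes wrong. The sketch below assumes two small, deliberately incomplete lists of prefixes and suffixes (both hypothetical) and a crude guard that never strips a name down to fewer than two tokens.

```python
# Hypothetical, deliberately incomplete lists; real ones would be far longer.
PREFIXES = {"sir", "dame", "dr", "mr", "mrs", "ms"}
SUFFIXES = {"cbe", "obe", "mbe", "phd", "jr", "sr"}

def strip_titles(full_name: str) -> str:
    """Remove leading honorifics and trailing post-nominals, keeping at least two tokens."""
    tokens = full_name.split()
    while len(tokens) > 2 and tokens[0].rstrip(".").lower() in PREFIXES:
        tokens.pop(0)
    while len(tokens) > 2 and tokens[-1].rstrip(".").lower() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

print(strip_titles("Sir Brian May CBE PhD"))  # Brian May
print(strip_titles("Sir Carter"))             # Sir Carter (kept: only two tokens)
print(strip_titles("Sir Carter Jr"))          # Carter Jr (the first name is gone)
```

The guard saves "Sir Carter" but not "Sir Carter Jr", which is the point: a rule that is right for Sir Brian May is wrong for someone whose first name really is Sir.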
I'm not even going to discuss Jr.
Now, code that.
Enough with the cultural centrism
Joko Widodo (the Indonesian President), Guðni Th. Jóhannesson (the Icelandic President) and Xi Jinping (the Chinese President) enter the United Nations bar.
You are sitting at the bar, and the bartender leans toward you and asks:
"Can you please remind me of their last names? I want to greet them properly!"
Your safest bet would be to advise the bartender to just call them "Mister President"!
Joko Widodo does not have a last name; many Indonesians do not use family names at all. Guðni goes by his given name, as is the norm in Iceland (if you really want to, you can go for Mister Thorlacius and not Mister Jóhannesson). Finally, it's Mister Xi, not Mister Jinping.
You manage to get all their names right, but the entire UN assembly is coming in and boy, you are sweating.
Now, code that.
Who dares not use the Latin alphabet?
One good thing about humanity is that we tend to agree on stuff. No one could imagine living in a world with 293 different writing systems!
Joking aside - 293? Seriously guys?
Let's use one of the most famous writing systems as an example here: Cyrillic. Its romanisation (= the switch from Cyrillic characters to Latin ones) must be standardised by now. Probably just one standard, two tops. Right..?
Wrong again:
To complicate things even further, some languages have to romanise Cyrillic differently to preserve the original sound of their letters. Let's have a look at the Владимир Путин Wikipedia page, shall we?
And remember, this is for one of the easy romanisations out there (Russian Cyrillic alphabet). It gets a bit trickier with Arabic, or Chinese. As a challenge, please transliterate this: 彁.
Now, code that.
No one is perfect
You have now perfectly understood the different naming conventions, your system can accept all kinds of letters and symbols, but still, your software is not returning one of the customers you are looking for.
There's only one explanation for this: there's a mistake in your database.
According to research, top-notch OCR tools are up to 99.3% accurate. In a database of millions of customers, the remaining errors can add up quickly. Even if we assume that OCR on passports is more accurate, a 0.001% error rate on 1 million customers still gives you ten incorrect names. If you are not using OCR, we can also assume a lot of typos. It's harder to quantify, but this great article explains why it happens all the time in newspapers.
Keep in mind that the mistake can come directly from the source data and not from your database.
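The back-of-the-envelope numbers above are easy to check. This sketch assumes, purely for illustration, that each error rate applies per customer record; `expected_errors` is a hypothetical helper.

```python
def expected_errors(customers: int, error_rate_percent: float) -> int:
    """Expected number of wrong records for a given per-record error rate."""
    return round(customers * error_rate_percent / 100)

print(expected_errors(1_000_000, 0.001))  # 10
print(expected_errors(1_000_000, 0.7))    # 7000, if 99.3% accuracy applied per record
```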
OK, no problem. Let's build this software to be error-tolerant. We have multiple algorithms at our disposal. Levenshtein distance is the most popular, but we could also use Jaro-Winkler or Hamming distance.
I won't go into the details of these algorithms; just know that, roughly speaking, they look at the number of character substitutions, deletions or insertions that separate two names.
Great! But when are we compensating too much for mistakes? Using Levenshtein distance, Mark Wahlberg and Mark Walberg are a 91% match. And yet, they are two completely different people. Greg Clark and Greg Clarke would also give the same score (this is a real spelling mistake that happened in The Guardian).
Depending on the number of customers and lists you want to scan, those false matches can be manageable. If not, you may need to dive deep into phonetic matching algorithms, which deserve an article of their own.
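The near-miss problem above can be reproduced with a few lines. Below is a minimal sketch of the classic dynamic-programming Levenshtein distance. The `similarity` normalisation (one minus distance over the longer length) is an assumption on my part: tools normalise differently, so the exact percentage you get will vary.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """One possible normalisation (an assumption): 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

for left, right in [("Mark Wahlberg", "Mark Walberg"), ("Greg Clark", "Greg Clarke")]:
    print(left, "|", right, "| distance:", levenshtein(left, right))
```

Both pairs are a single edit apart, so any threshold loose enough to catch the Greg Clarke typo will also merge the two Marks. The distance alone cannot tell a typo from a different person.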
Now, code that.
I'm unique
I am, I truly am. I am a special snowflake. But my name is not. At least "Barthelemy Leconte" is not borne by me alone. As far as I know, two other people have my name.
So, that's only three people in total with an identical name. We can work with that. But what do you do if you find a match on James Smith? If I ask you to look at his Wikipedia page, you have the choice between 158 different entries. This name carries close to zero information and you will need to have other PII to know who your customer is.
This could be an article on its own, which it may be one day. But for now:
Code that.
Your first step of a long journey
You have guessed it, this is not the end. But it's the end of this article. I will not bore you with the need for name equivalence (e.g. Alexander & Sacha), or the detection of names in long strings of text. No time to get into the subject of how people can change names throughout their lives either, or that not all aliases are equal.
Checking names in a database sounds like an easy task, and it could be achievable for a very limited, culturally similar set of names. But which business deals with such a set nowadays?
And when you are done writing the perfect software, you will always find someone naming their child X Æ A-Xii to ruin your day!
If you want to continue to learn about naming conventions, I suggest you read this excellent article by Patrick McKenzie, the inspiration for this post. Some of the examples are taken from the response to the original article posted here.
And of course, follow me!
About me:
I've spent more than 6 years configuring software for companies to detect Sanctioned Entities, Politically Exposed Persons (PEPs) and Adverse Media. I have had the opportunity to tune models for customers in Europe, the Middle East, Asia and Africa. I got the chance to lead a team of consultants doing this same job for CEE & CIS, and I am now a Fircosoft Enterprise Expert for French-speaking Europe and Africa.
HR Coordinator
1 month ago: Very interesting (and well written) read. Makes me think of my name in whole new ways.
Solutions Architect | Helping transform AML technology & processes
7 months ago: Great article. I've spent a lot of time explaining all of that to developers who want to build their own screening tools.
Strategic Account Manager at LexisNexis Risk Solutions | Helping businesses prevent Financial Crime
7 months ago: If you want to see why it's more and more important to have powerful name-matching software for Sanctions Screening, you can check our pulse on the latest Sanctions list additions https://risk.lexisnexis.com/global/en/insights-resources/infographic/sanctions-pulse
Financial Crime Compliance Expert, LexisNexis Risk Solutions - Helping Clients Navigate Complex Regulations
7 months ago: Well worth a read! Thanks for sharing Barthélémy! (LinkedIn decided your accents are to be kept ;-))