Unique IDs in programming

Christoph Jahn

CEO & Founder | Tools & Consulting for webMethods

Published: March 15, 2022

Most people have probably come across what is usually called a UUID (universally unique identifier) while using software. UUIDs are typically a cryptic combination of alphanumeric characters and do not make any sense to the human brain. But why are they such a critical aspect of most computer programs?

Their purpose is pretty obvious: be able to identify a set of data (money transfer, customer, product, order, etc.) on a low technical level. The human brain, for most scenarios, does not need such an artificial construct but works nicely with the underlying “real” data. We identify a customer by looking at first name and surname. And if we have multiple customers “Mike Smith”, we add the date of birth. If that is still not enough, then there is the current address. And so on.

For the purpose of this discussion, a customer’s UUID is not to be confused with the customer number but exists in addition to it. This may seem like overhead, but think about what happens when an organization buys a competitor. With a bit of “luck” there will be overlap between the customer numbers. Without a UUID already in place, all sorts of ugly workarounds need to be implemented under great time pressure to be able to merge the customer lists. If that happens, there is considerable risk of something going wrong, resulting in the loss of customers.

It would of course be possible to replicate the human brain’s approach of looking at data in their individual context. But that would make things unnecessarily complex, plus require a different approach for each type of data. So we help ourselves with a technical ID that is guaranteed to be unique. Generating such an ID is surprisingly complex, once you realize what the algorithm needs to accomplish:

Be fast: There are many scenarios where you need to create tens of thousands of UUIDs per second (e.g. high-frequency trading, payments processing, telco billing, etc.). But “randomness” usually requires the use of cryptographic functions, which are notoriously expensive operations. In recent years this has become less of a concern, though, since many CPUs now offer dedicated support here.
Be unique across all computers that are involved with the application: While it is probably rarely a problem if two identical IDs are issued for two completely disparate organizations (ignoring scenarios like EDI), there are many cases where it is still highly relevant. Most critical applications run on more than one computer for high-availability and load-balancing purposes. So obviously there must never be a case where IDs clash. Also, it would likely cause problems if the same ID existed not only on the production system but also on a development or test system.
Be relatively short: Many UUIDs are between 30 and 40 characters long (the standard text form has 36: 32 hexadecimal digits and four hyphens), which is really not long, given that a clash is ruled out for all practical purposes.
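
As a rough illustration of these three requirements, the following Python sketch uses the standard library's uuid module to generate random (version 4) UUIDs, then checks their length and uniqueness and measures throughput. The function name generate_ids and the batch size of 100,000 are arbitrary choices for this example, and the measured throughput depends entirely on the machine.

```python
import time
import uuid


def generate_ids(count):
    """Return `count` random (version 4) UUIDs in canonical text form."""
    return [str(uuid.uuid4()) for _ in range(count)]


if __name__ == "__main__":
    start = time.perf_counter()
    ids = generate_ids(100_000)
    elapsed = time.perf_counter() - start

    print(f"example ID:  {ids[0]}")
    print(f"length:      {len(ids[0])} characters")
    print(f"all unique:  {len(set(ids)) == len(ids)}")
    print(f"throughput:  {len(ids) / elapsed:,.0f} IDs per second")
```

Because the IDs are drawn from random bits, no coordination between machines is needed. The uniqueness rests on a vanishingly small collision probability, not on a central registry.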

Let’s now look into the use of UUIDs. Apart from pretty obvious things like the aforementioned customer etc., they are used in very many systems for internal purposes. A good example is relational database management systems, where each record (aka row) has its own ID. The same is true for messaging systems (think JMS or MQTT).

The two core use-cases I see for those internal IDs are fault diagnosis and linking data. In today’s world most systems are highly distributed, even without the use of a micro-services architecture (which increases the level of distribution by orders of magnitude). To track a business transaction across multiple systems, you need to be able to identify these sub-transactions and the means for this are UUIDs. Ideally you have an operations console that automatically connects things between systems. In reality, though, there is often a lot of manual work to be done.
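
To make the tracking idea more concrete, here is a minimal sketch in which one UUID is created at the entry point of a business transaction and handed to every sub-transaction. The three "services", the in-memory LOG list and all function names are hypothetical stand-ins for separate systems and their log files. In a real landscape the ID would travel in message headers or similar.

```python
import uuid

# Stand-in for the log output of several independent systems.
LOG = []


def log(system, correlation_id, message):
    LOG.append({"system": system, "correlation_id": correlation_id, "message": message})


def order_service(correlation_id):
    log("orders", correlation_id, "order accepted")
    payment_service(correlation_id)


def payment_service(correlation_id):
    log("payments", correlation_id, "payment booked")
    shipping_service(correlation_id)


def shipping_service(correlation_id):
    log("shipping", correlation_id, "parcel scheduled")


def handle_request():
    """Entry point: create the ID once, then pass it to every sub-transaction."""
    correlation_id = str(uuid.uuid4())
    order_service(correlation_id)
    return correlation_id


def trace(correlation_id):
    """Collect all log entries that belong to one business transaction."""
    return [entry for entry in LOG if entry["correlation_id"] == correlation_id]


if __name__ == "__main__":
    first = handle_request()
    handle_request()  # a second, unrelated transaction
    for entry in trace(first):
        print(entry["system"], "-", entry["message"])
```

The trace function plays the role of the operations console: given one ID, it connects the entries from all systems without any manual matching.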

Another example of linking data together is master data management (MDM). Many organizations have done something in that area and most have failed. The core reason in my view is the approach. It is a business problem that is very closely linked with many technical challenges. And most organizations are bad at dealing with such a combination. There are more aspects, but I will cover those in a separate article.

Back to UUIDs. It might be tempting to leverage internal IDs (e.g. from a database system) for your application. But be warned, this is a very dangerous road. Those IDs are guaranteed to be unique only in the context for which they are created, but not outside. Even more critical is using just a part of the IDs, because the rest seems to be a fixed value. I have seen a business-critical end-user application where part of the database’s row ID (Oracle Database v7) was used. Later the database was migrated to a higher version (Oracle Database v8) where the algorithm for generating row IDs had been changed. So the sub-string of the row ID was suddenly not unique anymore. The end-user application did not expect duplicates and crashed immediately after starting.

While on the subject of databases, there are people who like to use sequences as UUIDs. Sequences are numbers that the database auto-increments, and they seem a convenient and efficient way to obtain a unique ID. But there are various problems with that approach. Firstly, the ID is only unique within a single database instance. This typically creates all sorts of problems for testing the code, and also when moving it to production. Secondly, this kind of feature, while available in many database systems, differs in syntax and behavior between them and is often implemented as a vendor-specific extension of SQL. So you create unnecessary problems for yourself when using different systems. Many organizations have standardized on one database system for production use. Having to use this also for DEV, CI, SIT, UAT, etc. may make things more difficult than necessary. More importantly, though, it increases the vendor lock-in with all the associated issues.
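
The first problem can be reproduced in a few lines. The sketch below uses SQLite (via Python's sqlite3 module) purely as a convenient stand-in for two separate database instances, for example the customer databases of two merging companies. The table layout, the sample names and the helper functions are invented for this example.

```python
import sqlite3
import uuid


def create_customer_db(names, use_uuid):
    """Create an in-memory database and insert one customer per name."""
    db = sqlite3.connect(":memory:")
    if use_uuid:
        db.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT)")
        for name in names:
            db.execute("INSERT INTO customer VALUES (?, ?)", (str(uuid.uuid4()), name))
    else:
        # The database assigns 1, 2, 3, ... independently in every instance.
        db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
        for name in names:
            db.execute("INSERT INTO customer (name) VALUES (?)", (name,))
    return db


def overlapping_ids(db_a, db_b):
    """Return the IDs that exist in both databases."""
    ids_a = {row[0] for row in db_a.execute("SELECT id FROM customer")}
    ids_b = {row[0] for row in db_b.execute("SELECT id FROM customer")}
    return ids_a & ids_b


if __name__ == "__main__":
    company_a = ["Mike Smith", "Jane Doe"]
    company_b = ["Mike Smith", "John Miller"]

    for use_uuid in (False, True):
        kind = "UUID" if use_uuid else "sequence"
        clashes = overlapping_ids(
            create_customer_db(company_a, use_uuid),
            create_customer_db(company_b, use_uuid),
        )
        print(f"{kind} IDs present in both databases: {len(clashes)}")
```

With auto-incremented keys both instances hand out the same numbers, so every row clashes on a merge. With UUID keys the two tables can simply be combined.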

Let me finish with timestamps. They are the original sin of UUIDs. Really. People like them because they are human-readable, allow easy sorting of transactions into the order of processing, and just seem to be THE obvious way to go. But they are not unique! If your development machine is slow enough, relative to the transaction’s processing time, you may indeed not have issues. But that is only because at least one millisecond (you don’t use a resolution of seconds, do you?) goes by between transactions. A production machine, however, will likely be much faster. Yes, the risk decreases with nanoseconds. But what if multiple machines are working in parallel?
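
How quickly millisecond timestamps repeat is easy to check. The following sketch issues IDs from the wall clock in a tight loop on a single machine and counts how many of them had already been handed out. The function names and the loop size are arbitrary choices for illustration, and the exact count varies from run to run and from machine to machine.

```python
import time


def timestamp_ids(count):
    """Generate `count` IDs from the wall clock at millisecond resolution."""
    return [int(time.time() * 1000) for _ in range(count)]


def count_duplicates(ids):
    """Number of IDs that repeat an ID that was already issued."""
    return len(ids) - len(set(ids))


if __name__ == "__main__":
    ids = timestamp_ids(100_000)
    print(f"issued:     {len(ids)}")
    print(f"duplicates: {count_duplicates(ids)}")
```

On any reasonably current hardware the loop finishes within a fraction of a second, so the vast majority of the issued IDs are duplicates.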

In one case I have seen there was considerable data loss, because someone had been clever enough to use a timestamp with a resolution of only seconds as the filename for writing PDFs into a directory. From there an archiving solution then picked them up for storage to fulfill a legal requirement. This guy’s notebook had been slow enough (it was in the early 2000s) that all files had been several seconds “apart”. But the production machine was a beefy server and it took several weeks until someone realized what had happened. Tens of thousands of documents were lost forever.
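
The failure mode can be sketched as follows. This is not the original code from that project, only an assumed reconstruction of the pattern: file names are derived from a timestamp with a resolution of seconds, and a second file with the same name silently replaces the first. The function names, the name format and the number of documents are made up for this example.

```python
import tempfile
import time
import uuid
from pathlib import Path


def timestamp_name():
    """File name from the wall clock at a resolution of one second."""
    return time.strftime("%Y%m%d-%H%M%S") + ".pdf"


def uuid_name():
    """File name from a random (version 4) UUID."""
    return f"{uuid.uuid4()}.pdf"


def write_documents(directory, count, make_name):
    """Write `count` documents and return how many files exist afterwards."""
    for number in range(count):
        # write_bytes silently replaces an existing file of the same name.
        (Path(directory) / make_name()).write_bytes(f"document {number}".encode())
    return len(list(Path(directory).iterdir()))


if __name__ == "__main__":
    for make_name in (timestamp_name, uuid_name):
        with tempfile.TemporaryDirectory() as directory:
            remaining = write_documents(directory, 50, make_name)
            print(f"{make_name.__name__}: wrote 50 documents, {remaining} files exist")
```

Besides switching to UUID-based names, opening the file in exclusive-creation mode (mode "x" in Python) would at least have turned the silent overwrite into an immediate error.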

I hope this quick overview provided some value to you and will help you in the next discussion on why you really need a proper UUID.

(This text was first published on my personal blog in June 2020.)

Christoph Jahn

CEO & Founder | Tools & Consulting for webMethods

1 year ago

Here is another related article: https://medium.com/@thomasjay200/stop-using-integer-ids-in-your-database-5e5126a25dbe .

Christoph Jahn

CEO & Founder | Tools & Consulting for webMethods

1 year ago

For additional thoughts on the subject please have a look at https://vladmihalcea.com/uuid-database-primary-key/ . Thanks, Vlad Mihalcea, for posting this!

Manuel Alejandro Gamboa Jimenez

Software Developer

2 years ago

A second comment: timestamps can be used with an abstraction that ensures uniqueness and ordering, such as hybrid logical clocks (HLC).

Manuel Alejandro Gamboa Jimenez

Software Developer

2 years ago

Nice article. An important example of the limitations of integer IDs is an offline-first app. If you can't ensure network connectivity 100% of the time, will you block the user and make him/her wait until the network comes back? If you want to offer offline functionality, you will have a hard time making it work with integer IDs or any kind of ID other than a UUID, because you can't ensure sequence or uniqueness in a disconnected node: there is no central server that ensures that uniqueness.

Christoph F. Strnadl, Ph.D., CBPP

Defining & realizing the technical infrastructure and standards for digital ecosystems and dataspaces | CTO | Gaia-X AISBL

2 years ago

Amazingly insightful and equally well formulated and fun to read!!
