The Document Versioning Pattern in Azure Cosmos DB
In highly regulated industries, such as finance, healthcare, and insurance, tracking the history of some portion of the data is paramount. This may be for auditing, reporting, or simply for comparison and statistical analysis.
For instance, one of my current customers needs to keep track of "amendments" to premiums and claims for each specific client. This is very typical.
One of the key features of Azure Cosmos DB is the change feed. Cosmos DB exposes, through an API, the underlying log of changes to the documents in a collection. The changes are persisted, can be processed asynchronously and incrementally, and the output can be distributed across one or more consumers for parallel processing. This enables a variety of applications, such as serving a microservices architecture, alerting in real time, or triggering functions that execute a piece of business logic. At the moment, the change feed covers inserts and updates (deletions are on the roadmap) and exposes only the most recent change to each item, which means that intermediate changes are not visible.
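To make the "only the most recent change" behaviour concrete, here is a toy in-memory model. It is not the Cosmos DB SDK: `ToyContainer`, `upsert`, and `read_change_feed` are hypothetical names, and the integer continuation token is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyContainer:
    """In-memory stand-in for a container; NOT the Cosmos DB SDK."""
    _items: Dict[str, dict] = field(default_factory=dict)
    _modified_at: Dict[str, int] = field(default_factory=dict)
    _clock: int = 0  # logical sequence number, bumped on every write

    def upsert(self, item: dict) -> None:
        """Insert or overwrite an item and remember when it last changed."""
        self._clock += 1
        self._items[item["id"]] = dict(item)
        self._modified_at[item["id"]] = self._clock

    def read_change_feed(self, continuation: int = 0) -> Tuple[List[dict], int]:
        """Return items written after `continuation`, in modification order.

        Only the latest state of each item is returned, so a write that was
        overwritten before this read is not visible to the consumer.
        """
        changed = sorted(
            (seq, item_id)
            for item_id, seq in self._modified_at.items()
            if seq > continuation
        )
        docs = [dict(self._items[item_id]) for _, item_id in changed]
        return docs, self._clock


container = ToyContainer()
container.upsert({"id": "claim-1", "amount": 100})
container.upsert({"id": "claim-1", "amount": 120})  # overwrites before anyone reads
container.upsert({"id": "claim-2", "amount": 50})

# claim-1 appears once, with amount 120; the intermediate amount 100 is not visible.
first_batch, token = container.read_change_feed()

container.upsert({"id": "claim-2", "amount": 75})

# Reading from the saved token returns only what changed since the first read.
second_batch, token = container.read_change_feed(token)
```

In this model, a consumer that reads after every write sees every version, while a consumer that reads less often misses the versions overwritten in between.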
As you may have already guessed, the change feed cannot cover the whole versioning requirement by itself, but it plays an important role in the overall solution.
SOLUTION
In short, for each collection whose items are subject to versioning, you can set up a secondary, or shadow, collection. The result is one collection that holds the latest (and most queried) data and another that holds all of the revisions of that data, connected to the first one.
Enter the Document Versioning Pattern. Let's follow the well-known GoF pattern structure to make things easier.
- Intent. Ensure that each entity in a collection, when updated, maintains the history of its changes.
- Motivation. It is important to track the history of entities throughout their lifecycle.
- Applicability. As mentioned in the intro to this post, many companies need to track document changes for auditing, reporting, and statistical purposes.
- Structure. The following diagram shows the simple structure for this pattern. Essentially, the key point is that, in order to keep the state of objects over time, every update has to be turned into an "append" operation. Secondly, a "shadow" container has to be set up to keep the log of the changes. In the example below, the "change" is represented by the whole new version of the document.
- Participants. The whole mechanism is realised through the change feed to implement a "Materialized View".
- Consequences. Typically, this pattern works well when there are not many revisions per document (the history stays short) and most queries target the current version of the document. If these criteria are not met, the pattern might not be the right fit and may suffer from performance degradation. In fact, if data changes frequently, keeping versions becomes very write-intensive, given that there are two collections to keep in sync.
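The structure described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not an implementation against Cosmos DB: the two dictionaries stand in for the main and shadow containers, and `save`, `on_changes`, `history_of`, the `version` counter, the `entityId` field, and the `id:vN` key format are all hypothetical choices.

```python
import copy
from typing import Dict, List

current: Dict[str, dict] = {}  # stand-in for the main container (latest versions)
history: Dict[str, dict] = {}  # stand-in for the shadow container (all revisions)


def save(doc: dict) -> dict:
    """Write a new revision to the main container, bumping a version counter."""
    previous = current.get(doc["id"])
    revision = copy.deepcopy(doc)
    revision["version"] = previous["version"] + 1 if previous else 1
    current[doc["id"]] = revision
    return revision


def on_changes(changed_docs: List[dict]) -> None:
    """Change feed handler: append each changed document to the shadow container.

    The composite id turns the write into an append, and keeps the handler
    idempotent if the same change is delivered more than once.
    """
    for doc in changed_docs:
        entry = copy.deepcopy(doc)
        entry["entityId"] = doc["id"]
        entry["id"] = f'{doc["id"]}:v{doc["version"]}'
        history[entry["id"]] = entry


def history_of(entity_id: str) -> List[dict]:
    """All stored revisions of one entity, oldest first."""
    revisions = [d for d in history.values() if d["entityId"] == entity_id]
    return sorted(revisions, key=lambda d: d["version"])


# Here each write is followed at once by a feed delivery; a real processor
# runs asynchronously and may see several writes in one batch.
on_changes([save({"id": "policy-7", "premium": 900})])
on_changes([save({"id": "policy-7", "premium": 950})])
on_changes([save({"id": "policy-7", "premium": 1000})] * 2)  # duplicate delivery

premiums = [d["premium"] for d in history_of("policy-7")]
```

The sketch also shows the cost named under Consequences: every logical update becomes two writes. If the processor falls behind, revisions overwritten between reads would be missing from the shadow container; the version counter at least makes such gaps detectable.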
A few parting thoughts
Three quick considerations.
First off, there are a few nuances that need to be factored in. For example, what happens on deletion? Should we delete the whole history? Or can we promote an old version to be the current one? These, and a few others, are all questions fundamentally tied to the business outcome, so the answer must be driven by it.
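As an illustration only, two possible answers to those questions can be expressed as ordinary appends. Everything here is an assumption: the `deleted` marker, the `soft_delete` and `promote` helpers, and the simplification that `save` writes the history directly rather than through the change feed. Which policy is right remains a business decision.

```python
import copy
from typing import Dict, List

current: Dict[str, dict] = {}        # latest version per entity
history: Dict[str, List[dict]] = {}  # every revision per entity, oldest first


def save(doc: dict) -> dict:
    """Store a new revision and append a copy of it to the history."""
    revisions = history.setdefault(doc["id"], [])
    revision = copy.deepcopy(doc)
    revision["version"] = len(revisions) + 1
    current[doc["id"]] = revision
    revisions.append(copy.deepcopy(revision))
    return revision


def soft_delete(entity_id: str) -> dict:
    """Policy A: record the deletion as one more revision and keep the history."""
    tombstone = {k: v for k, v in current[entity_id].items() if k != "version"}
    tombstone["deleted"] = True
    return save(tombstone)


def promote(entity_id: str, version: int) -> dict:
    """Policy B: make an old revision current again by appending it as a new one."""
    old = history[entity_id][version - 1]
    restored = {k: v for k, v in old.items() if k not in ("version", "deleted")}
    return save(restored)


save({"id": "claim-3", "amount": 200})
save({"id": "claim-3", "amount": 260})
soft_delete("claim-3")            # becomes version 3, flagged as deleted
restored = promote("claim-3", 1)  # becomes version 4, with the original amount
```

Because both operations are plain writes, they would show up in a change feed that only reports inserts and updates, and nothing in the history is ever rewritten.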
Secondly, I have intentionally left out any implementation details and performance considerations. A GitHub repo would do the job. Hopefully soon ;-)
Thirdly, there is the die-hard myth about reusability. It deserves a separate post, and I've already started writing it. In the meantime, I'd say design patterns are a great level of granularity for reuse, striking a balance between "too generic" and "too detailed".
Last but not least - check out this great post by Andrei who has initiated what I believe is the most meaningful way to "talk business" about NoSQL patterns.
I welcome any ideas about creating a solid, referenceable catalogue of NoSQL patterns. A new literature is possible.
Even AI Agents need a memory | Principal Product Manager | Azure Cosmos DB | Advisor, Mentor & Coach
4 years ago: Love the description Michele! I would argue our versioning is even a bit better than the approach in the picture!
Data Engineer at Everest Reinsurance Company
4 years ago: Excellent article. Are the "premiums and claims" applications OLTP applications?