A referential integrity solution pattern for microservices and document databases

References in document databases

Applications using document databases typically have multiple collections (a MongoDB term, similar to tables in an RDBMS), each containing documents (a document is equivalent to a row in an RDBMS). Application logic often requires documents in one collection to refer to documents in various other collections. Let me take OpenNESS-EMCO as an example. It has many collections, but to keep things simple I will not cover every collection and reference; I will give only the details relevant to this article.

Some of its collections:

- "Projects": a Project document is created in this collection for every tenant.
- "Composite Applications": every onboarded network service or complex application appears as a document; the ID field of the document is a fully qualified identity ("project-name", "composite-application-name").
- "Composite Profiles": one can create multiple profiles for each composite application, and each profile is a document whose ID is a fully qualified identity ("project-name", "composite-application-name", "composite-profile-name").
- "Logical Clusters": information about logical clusters in the system. Since a logical cluster belongs to a project, its ID field consists of "project-name" and "logical-cluster-name".
- "Cluster Providers": each cluster provider registered with EMCO is saved as a document with ID "cluster-provider-name".
- "Clusters": all clusters that are registered with OpenNESS-EMCO, each cluster being a document with ID "cluster-provider-name" and "cluster-name".
- "Deployment Intent Groups": each DIG is a document whose ID is a fully qualified identity ("project-name", "composite-application-name", "composite-profile-name", "DIG-name").

There are many more collections in EMCO, but I will not list them all here; these are enough to discuss the problem statement and a possible solution pattern.

As you can see, documents embed references to documents in other collections as part of their ID field. There are other kinds of references too, which are not part of the ID field. For example, a "deployment-intent-group" document has references to "clusters" and a "logical-cluster"; these references are part of the body of the DIG document (that is, they are not part of the _id field). Hence, references can be classified into two types: 'child-reference' and 'arbitrary-reference'.
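To make the two reference types concrete, here is a minimal sketch of the document shapes. The field names are illustrative, not the actual EMCO schema: the point is only that a child reference lives inside the compound _id, while an arbitrary reference lives in the document body.

```python
# Child references: the parent identity is embedded in the compound _id.
composite_profile = {
    "_id": {
        "project": "proj1",          # refers to a Projects document
        "composite_app": "app1",     # refers to a Composite Applications document
        "profile": "profile1",
    },
    "description": "default profile",
}

# Arbitrary references: the DIG refers to clusters in its body, not in _id.
deployment_intent_group = {
    "_id": {
        "project": "proj1",
        "composite_app": "app1",
        "profile": "profile1",
        "dig": "dig1",
    },
    "spec": {
        "logical_cluster": {"project": "proj1", "name": "lc1"},
        "clusters": [{"provider": "provider1", "name": "cluster1"}],
    },
}

# The child reference of the DIG is recoverable from its _id alone.
parent_profile_id = {k: deployment_intent_group["_id"][k]
                     for k in ("project", "composite_app", "profile")}
```

Note how deleting the composite profile would silently break the DIG's _id-embedded parent reference unless the application intervenes.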

Document databases have no knowledge of these references; it is up to the applications to take care of any integrity challenges.

Requirements

No broken references: One requirement is to ensure that there are no broken references. That is, if a document is referred to by other documents (either in the same collection or in different collections), it shall not be deleted from the collection even if the administrator issues a 'Delete' API command. Two solutions are possible: deny the deletion while the document is referred to by other documents, or mark it for deletion and delete it at the right time (for example, during garbage collection, when it is safe to remove).
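The two strategies can be sketched side by side, using an in-memory dict per collection in place of a real document database (field names such as "refs" are invented for illustration):

```python
def is_referenced(doc_id, collections):
    """Return True if any document in any collection refers to doc_id."""
    for docs in collections.values():
        for doc in docs.values():
            if doc_id in doc.get("refs", []):
                return True
    return False

def delete_deny(coll, doc_id, collections):
    """Strategy 1: refuse the delete while references exist."""
    if is_referenced(doc_id, collections):
        raise ValueError(f"{doc_id} is still referenced; deletion denied")
    del collections[coll][doc_id]

def delete_mark(coll, doc_id, collections):
    """Strategy 2: soft-delete now; garbage collection removes it later."""
    collections[coll][doc_id]["marked_for_deletion"] = True
```

The rest of the article argues for the second strategy, since denying admin actions outright is a poor experience.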

Only use approved content: Some industries have an "approval" step. In the case of OpenNESS-EMCO, once all the entries are made (profiles, intents, and so on), the "deployment-intent-group" document is approved before the instantiation operation is executed. The expectation is that any modifications to the DIG, or to documents directly or indirectly referred to from the DIG, are not used during instantiation: only the content that was approved shall be used at instantiation time. If modified information is to be used, one is required to reset the approval status of the DIG and approve it again after verifying that the content meets the organization's requirements.

Maintaining the history of documents: Some industries require that nothing gets deleted and that history is maintained across modifications, at least for a period of time.

Regular cleanup: If history is maintained, storage requirements can grow over time, so one shall keep them in check. This can be done by cleanup (hard delete), with or without archiving. Note that cleanup shall not happen while the documents are referred to from other documents. In case of archiving, one shall also keep privacy regulations (such as GDPR) in mind: if a record is personal in nature, it shall not be archived.

Also, cleanup of documents shall only happen after the marked-for-deletion records have existed for a certain amount of time (some organizations might want to keep the records for months).

Considerations

Micro-service architecture: One needs to keep in mind that multiple micro-services, or sometimes multiple replicas of a given micro-service, operate on the database. Any time one performs multiple operations on the database, they all need to execute atomically; hence, one needs to think about distributed locks. It is always good to use a lock timeout, to ensure that a container killed while inside a critical section does not render the entire system inoperable because the lock was never released. Also explore fencing tokens, in case the lock can expire for other reasons, such as long latencies.
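The lock-timeout-plus-fencing-token idea can be sketched as follows. This is a toy single-process stand-in for a distributed lock (a real one would be built on Redis, etcd, or similar): leases expire after a TTL, and every acquisition returns a monotonically increasing fencing token that downstream resources can use to reject writes from a stale, formerly valid lock holder.

```python
import threading
import time

class FencedLock:
    """Toy lease-based lock with fencing tokens (illustrative only)."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._mutex = threading.Lock()
        self._holder = None       # (owner, lease_expiry) or None
        self._token = 0           # grows on every successful grant

    def acquire(self, owner, now=None):
        """Return a fencing token, or None if the lock is held."""
        now = time.monotonic() if now is None else now
        with self._mutex:
            if self._holder and self._holder[1] > now:
                return None       # lease still valid: acquisition denied
            self._token += 1
            self._holder = (owner, now + self._ttl)
            return self._token

    def release(self, owner):
        with self._mutex:
            if self._holder and self._holder[0] == owner:
                self._holder = None
```

Because the token only grows, the database layer can refuse any write carrying a token smaller than the largest one it has seen, neutralizing a holder whose lease expired mid-operation.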

Note that some believe that maintaining multiple databases, one per micro-service, solves the referential integrity problem. That is not true: a micro-service can be replicated multiple times, which means multiple containers act on the same database. I liked reading this post and I agree with the author: https://hackernoon.com/is-shared-database-in-microservices-actually-anti-pattern-8cc2536adfe4. Maintaining a dedicated database and populating the duplicate information via the CQRS pattern with event sourcing appears to be a possible solution, but in my view it still only works if there is a single micro-service instance (I need to confirm this myself, but that is the impression I get). So, a shared database is not the cause of referential integrity challenges. I feel that a dedicated database per service is a solution for something else, not for maintaining referential integrity; that needs to be taken care of at the application level.

Abstraction: Since multiple developers work on an application, it is necessary to abstract the referential integrity solution so that it does not require too many additions to the code. It shall be simple; the best solution is one that requires no specific additions (or changes, if it is existing code that works on collections and documents).

Don't deny deletions and modifications: Denying deletion and modification operations at the RESTful API level is not preferred. As described above, it is okay to mark the corresponding documents for deletion and delete them once there are no references to them. At the same time, the system shall provide facilities to discover where all the references are, so that admins can take appropriate dereferencing actions.

Safe cleanup shall be possible: The system, as part of cleanup (garbage collection), shall ensure that it is safe to clean up, that is, there are no references to the documents it is removing (documents marked for deletion). At the same time, it shall be possible for a document to become unreferenced; if that is not possible, the resources will never get cleaned up. What this means is that once a document is marked for deletion, no new references to it shall be created. But existing entities that already refer to the marked-for-deletion document shall be able to keep referring to it and using it.

Solution Pattern

A solution pattern that satisfies the requirements, under the considerations above, involves an abstraction layer between the application logic and the document database.

One thing to note is that in many micro-services there is a one-to-one correspondence between RESTful API resources and database resources, which makes the solution a little easier. RESTful APIs are normally called by a GUI or by external modules that use this application as a producer. These RESTful API calls are referred to as 'admin actions' in the rest of the article.

The key aspects of the solution pattern are as follows.

On deletion: When there is an admin action to delete a resource, mark that resource for deletion. Deny any admin action on a different resource that tries to refer to this marked-for-deletion resource. At the same time, if some entity already refers to the marked-for-deletion resource, it continues to be used during any non-admin runtime operations. Note that references to marked-for-deletion documents can be put in a 'to-be-deleted' collection for easier lookup during cleanup.
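This delete rule can be sketched as follows: soft-delete the target and record it in a 'to-be-deleted' collection, deny new admin references to it, but keep serving existing runtime reads. The collection and field names are illustrative, not EMCO's actual ones.

```python
import time

# In-memory stand-in for the database (illustrative).
db = {
    "clusters": {"c1": {"marked_for_deletion": False}},
    "digs": {},
    "to_be_deleted": {},
}

def admin_delete(coll, doc_id):
    """Soft delete: mark the document and record it for the garbage collector."""
    db[coll][doc_id]["marked_for_deletion"] = True
    db["to_be_deleted"][(coll, doc_id)] = {"marked_at": time.time()}

def admin_create_dig(dig_id, cluster_ref):
    """Admin actions may not create new references to marked documents."""
    if db["clusters"][cluster_ref]["marked_for_deletion"]:
        raise ValueError("cannot reference a marked-for-deletion cluster")
    db["digs"][dig_id] = {"cluster": cluster_ref}

def runtime_read(coll, doc_id):
    """Non-admin runtime operations still see and use the document."""
    return db[coll][doc_id]
```

The asymmetry is the whole point: admin-facing writes see the mark, runtime reads do not.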

On modification: Always have a companion collection to maintain the history of documents. Every time a resource is modified, copy the current document into that companion collection, then modify the document in the original collection. The system shall give the document a new version; this can be an integer, incremented by 1 from the previous version. The system shall also add a timestamp to the document. The timestamp is important for the 'approval' cases described above: any action on an approved meta resource shall only use the copies of the referred resources (direct or indirect) as they were at the time of approval. Timestamping allows fetching the right copy when queried; the system shall return the copy of the resource that is latest with respect to the time of approval. For example, if there are copies of a resource at t1, t1+1, and t1+3, and the approval of a meta resource happened at t1+2, the system shall return the t1+1 copy of the referred resource.
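The companion-collection versioning and the approval-time lookup can be sketched like this (structures and field names are illustrative):

```python
history = {}   # companion collection: doc_id -> list of superseded copies
current = {}   # main collection: doc_id -> current copy

def modify(doc_id, new_body, now):
    """Archive the current copy, then install the new one with a fresh
    version number and timestamp."""
    old = current.get(doc_id)
    if old is not None:
        history.setdefault(doc_id, []).append(old)
    version = (old["version"] + 1) if old else 1
    current[doc_id] = {"version": version, "ts": now, "body": new_body}

def copy_as_of(doc_id, approval_ts):
    """Return the latest copy whose timestamp is <= the approval time,
    or None if the document did not exist yet."""
    copies = history.get(doc_id, []) + [current[doc_id]]
    eligible = [c for c in copies if c["ts"] <= approval_ts]
    return max(eligible, key=lambda c: c["ts"]) if eligible else None
```

This reproduces the t1/t1+1/t1+3 example from the text: an approval at t1+2 resolves to the t1+1 copy.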

On garbage collection of 'marked-for-deletion' resources: Since it is important to clean up safely, the system shall not delete any resource that is referred to elsewhere. Figuring this out shall not require complex logic from application API developers. It is good if the relationship information is captured as a 'relationship schema' and this schema is used to discover references. In the OpenNESS-EMCO example above, DIG developers can create a schema indicating that a DIG refers to "logical-cluster" and "clusters" with the 'arbitrary-reference' type. "Composite profile" developers create a schema indicating that DIGs are referred to with the 'child-reference' type. Similarly, project developers are expected to create a schema indicating that projects have two child references: "composite profiles" and "logical-clusters".

The cleanup algorithm is expected to use these schemas when navigating the documents, to ensure that 'marked-for-deletion' records are safe to remove. For example, if the marked-for-deletion record is a child, the algorithm shall ensure that the parent itself is not referred to by anything else as an 'arbitrary-reference' (and this requires some recursion). If the record can be referred to as an 'arbitrary-reference', the algorithm shall ensure that it is not referred to by any of the entities that, per the schema, may hold references to it. Note that a given resource kind (collection) can be a child of one collection and also be arbitrarily referred to by other collections. It is safe to delete the document only if nothing holds an 'arbitrary-reference' to it or to its parents.
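A schema-driven safety check along these lines might look as follows. The schema records, per collection, which collections may hold arbitrary references to it and which collection is its parent; all names and the "refs"/"parent_id" fields are invented for illustration.

```python
# Illustrative relationship schema, loosely modeled on the EMCO example.
schema = {
    "clusters":           {"referrers": ["digs"], "parent": None},
    "composite_profiles": {"referrers": [],       "parent": "projects"},
    "digs":               {"referrers": [],       "parent": "composite_profiles"},
    "projects":           {"referrers": [],       "parent": None},
}

def safe_to_delete(coll, doc_id, db):
    """A document is safe to remove only if no arbitrary reference points
    at it and, recursively, none points at any of its parents."""
    for ref_coll in schema[coll]["referrers"]:
        for doc in db.get(ref_coll, {}).values():
            if (coll, doc_id) in doc.get("refs", []):
                return False
    parent_coll = schema[coll]["parent"]
    if parent_coll is not None:
        parent_id = db[coll][doc_id]["parent_id"]
        return safe_to_delete(parent_coll, parent_id, db)
    return True
```

API developers only contribute schema entries; the recursion is owned by the abstraction layer.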

As described in one of the requirements above, some organizations may at times want to keep deleted data for a while (months). The system shall keep the deleted records until that time expires.

On garbage collection of modification history: This garbage collection is expected to prune the history. It shall go through every companion collection and remove old versions that meet the criteria (older than a timeout given by the admin, or beyond a certain number of versions).
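Both pruning criteria can be combined in one small routine (parameter names are illustrative):

```python
def prune_history(versions, now, max_age, max_versions):
    """Drop versions older than the retention window, then cap the count.

    versions: list of {"version": int, "ts": float}, oldest first.
    Returns the surviving versions, still oldest first.
    """
    kept = [v for v in versions if now - v["ts"] <= max_age]
    if len(kept) > max_versions:
        kept = kept[-max_versions:]   # keep only the newest ones
    return kept
```

The garbage collector would run this per document over each companion collection, on the admin-configured schedule.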

Modularity considerations

Since API resources can span multiple micro-services, it is good to express the schema as a text document (it could be JSON). The system shall interpret this text file and create internal data structures that help the cleanup system. The schema can carry additional details on where to find the references in different collections. The schema may also contain a set of variables in the document schema that can be used to identify the identities.
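A JSON relationship-schema file contributed per micro-service might look like the sketch below. All field names ("parent", "arbitrary_references", "path") are invented for illustration; an actual system would define its own format.

```python
import json

schema_text = """
{
  "collections": [
    {
      "name": "deployment_intent_groups",
      "parent": "composite_profiles",
      "arbitrary_references": [
        {"collection": "clusters",         "path": "spec.clusters"},
        {"collection": "logical_clusters", "path": "spec.logical_cluster"}
      ]
    },
    {
      "name": "composite_profiles",
      "parent": "composite_applications",
      "arbitrary_references": []
    }
  ]
}
"""

# The system parses the text file once at startup and builds an internal
# index that the cleanup algorithm can consult.
schema = json.loads(schema_text)
by_name = {c["name"]: c for c in schema["collections"]}
```

The "path" fields tell the cleanup system where inside a document body to look for each arbitrary reference.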

Some suggest using graph databases to keep this metadata (the relationships) while keeping the documents in document databases. But this needs to be weighed, as it increases system complexity. Graph databases are good when additional functionality is required, such as different types of queries. If the purpose is only maintaining relationships for cleanup, a graph database could be overkill.

Since a given API operation makes multiple database calls, distributed locks are required, but it is good to confine them to the abstraction layer. That is, no API developer shall need to worry about the locks.

Summary

Distributed systems programming involves many concepts, and it can be challenging to educate all developers and maintainers on best practices. With micro-service architecture, this becomes even more complex. Hence my belief that the complexities of referential integrity in document databases and micro-service architectures shall be abstracted away and hidden from the majority of application developers. This post provided one solution pattern; I appreciate any feedback or corrections.
