Why are document databases not sufficient?


A shift is underway

A major shift is underway in how data is generated and consumed. Devices, log files, machines, and applications now stream data from many disparate sources. This data has little predefined structure and comes in all shapes and sizes, yet it is connected in meaningful ways. Much of it is real-time data whose value is perishable unless it is extracted and analysed as it arrives. Context matters: we need to link different pieces of data to get the right insight, and the need to act in real time has never been greater.

All of this is changing the requirements, and even for a traditionally humble use case we now need to address these demands to solve the problem satisfactorily. Pattern and anomaly detection, continuous real-time analysis, and event and workflow management for automated actions are expected out of the box, on top of efficient data storage, query, and management at large scale.



Rise of Real-Time Data


The trend is disruptive and unfolding at unprecedented speed, putting enormous pressure on organisations, users, and developers to meet the requirements of their use cases. Yet the majority of databases and systems in the market were created decades ago. This creates an ever-increasing impedance mismatch that cannot be addressed with the older, traditional approach.

What’s the major change?

Modern apps require a fusion of different systems to process data, even for traditional use cases. In other words, we need to bring in many different tools and technologies, stitch them together, and only then solve the problem.


Need to handle many different types of data at the same time, in the same place

A document database allows us to process unstructured data, but other components still need to be brought in.

Why is a document or graph database alone not sufficient?

We can ask: why is a document database not sufficient to handle all types of data? Similarly, why might a graph database not be enough to deal with different kinds of data? Let's look at why each alone is inadequate and why it is not recommended to force-fit one.


  • A document database is not a stream processing engine. This means we cannot, in any true sense, ingest time-series data and expect the database to do event processing and take actions in real time. For example, the following are some of the stream processing requirements such a database would be expected to fulfil (a minimal sketch follows this list):

Real-time and continuous event/time-series data ingestion and processing

Need to find patterns and anomalies in streaming data

Need to compute running aggregates and statistics for queries

Need to train models and predict on streaming data

State-based continuous CEP and actions in real time

Map-update instead of Map-reduce

Just having a timestamp as the primary key is not enough to treat a document database as a stream processing database. (You probably know which document database I'm referring to.)
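
To make the map-update point concrete, here is a minimal Python sketch of a running aggregate with a simple stateful anomaly rule, updated per event rather than recomputed over stored documents. It is purely illustrative: the names (RunningStats, process_stream, anomaly_factor) are hypothetical and are not tied to any particular database or stream engine.

```python
# Minimal sketch: map-update style running aggregate over an event stream.
# Hypothetical names; not tied to any specific product or API.
from dataclasses import dataclass

@dataclass
class RunningStats:
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        # Incremental (map-update) mean: each event updates state in place,
        # instead of re-scanning all stored documents (map-reduce style).
        self.count += 1
        self.mean += (value - self.mean) / self.count

def process_stream(events, anomaly_factor: float = 3.0):
    """events: iterable of (timestamp, numeric reading) pairs."""
    stats = RunningStats()
    for ts, value in events:
        if stats.count > 10 and value > anomaly_factor * stats.mean:
            # A simple stateful rule acting in real time, the kind of
            # CEP-style behaviour a plain document store will not run for us.
            print(f"{ts}: anomalous reading {value:.2f} (running mean {stats.mean:.2f})")
        stats.update(value)
    return stats

if __name__ == "__main__":
    sample = [(f"t{i}", 1.0 + 0.1 * i) for i in range(20)] + [("t20", 50.0)]
    process_stream(sample)
```

The point is not the arithmetic but where it runs: the state lives next to the incoming events and is updated per event, instead of being rebuilt by periodic batch queries over a collection keyed by timestamp.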


  • A document database is not a graph processing engine. This one is a bit more obvious. How do we force a triple (subject, predicate, object) into a document store with its structure preserved for efficient queries and future tasks? Further, a graph database can leverage AI in a largely implicit manner thanks to its natural layout, provided that layout is properly preserved. Some of the requirements are as follows (a small layout sketch follows this list):

Graph-native integration with stream processing

Storing nodes and triples (subject, predicate, object) in an efficient manner

Graph-native algorithm support, which requires a different storage model

Leveraging the relationships for queries over the network of nodes

Ontology and property-graph support requires traversing links

Avoiding run-time, relation-based joins done in the traditional manner

Just having a "refid" embedded in the document is not enough to treat a document database as a graph processing database. (You probably know which document database I'm referring to.)
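
To illustrate the layout difference, the hypothetical Python sketch below holds the same links twice: once in a graph-native adjacency index, where a traversal follows edges directly, and once as refids embedded in documents, where each hop becomes another join-like lookup against the collection. All names (triples, adj, docs, follow) are made up for illustration and do not refer to any specific product.

```python
# Minimal sketch: triples in a graph-native adjacency index vs. "refid" links
# embedded in documents. Hypothetical data and names, for illustration only.
from collections import defaultdict

# (subject, predicate, object) triples with their structure preserved.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "works_at", "acme"),
]

# Graph-native layout: adjacency index, so traversal follows edges directly.
adj = defaultdict(list)
for s, p, o in triples:
    adj[s].append((p, o))

def neighbours(node, hops=2):
    """Breadth-first expansion over the adjacency index (no joins)."""
    frontier, seen = {node}, set()
    for _ in range(hops):
        frontier = {o for n in frontier for _, o in adj[n]} - seen
        seen |= frontier
    return seen

# Document-style layout: the same links as embedded refids. Each hop is one
# more lookup (a join-like round trip) against the collection.
docs = {
    "alice": {"knows_refid": "bob"},
    "bob": {"knows_refid": "carol"},
    "carol": {"works_at_refid": "acme"},
}

def follow(doc_id, hops=2):
    path = []
    for _ in range(hops):
        refids = [v for k, v in docs.get(doc_id, {}).items() if k.endswith("_refid")]
        if not refids:
            break
        doc_id = refids[0]        # one more lookup per hop
        path.append(doc_id)
    return path

print(neighbours("alice"))        # {'bob', 'carol'} via the adjacency index
print(follow("alice"))            # ['bob', 'carol'] via per-hop refid lookups
```

Both reach the same nodes here, but the refid version pays one query per hop and loses the predicate structure, which is exactly what makes traversals, ontologies, and graph algorithms awkward to bolt onto a document store.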


  • A document database does not have AI right where the data is. This means we need to export data to an AI layer, deal with ML ops, train a model, and then import the model back into the system or use it at the application level. This is clearly not a scalable model for many cases. It also makes the constant, continual updating of ML models inefficient, time-consuming, and error-prone (a sketch of this manual loop follows the list below).

Training models remains largely a manual task, even when the metadata is largely fixed

Versioning and deployment of models remain a challenge or a manual job

AutoML is very hard from an infrastructure perspective

Metadata, data, and models are mostly not in sync
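
As a rough illustration of the manual loop described above, the sketch below keeps the export, train, and publish steps entirely outside the database. It assumes scikit-learn purely as an example library; the function names (export_collection, train, publish) and the file layout are hypothetical, not the workflow of any particular system.

```python
# Minimal sketch of the manual loop: export documents, train a model outside
# the database, then version and ship it back by hand. Hypothetical names.
import json
import pathlib
import pickle

from sklearn.linear_model import LogisticRegression  # assumes scikit-learn is installed

MODEL_DIR = pathlib.Path("models")

def export_collection(path="export.json"):
    """Step 1: dump documents from the store to a file (a batch export job)."""
    return json.loads(pathlib.Path(path).read_text())

def train(records):
    """Step 2: train outside the database; features and labels are hand-picked."""
    X = [[r["f1"], r["f2"]] for r in records]
    y = [r["label"] for r in records]
    return LogisticRegression().fit(X, y)

def publish(model, version):
    """Step 3: version and deploy manually; nothing keeps this model in sync
    with the data that keeps arriving in the store after the export."""
    MODEL_DIR.mkdir(exist_ok=True)
    (MODEL_DIR / f"model_{version}.pkl").write_bytes(pickle.dumps(model))
```

Every retrain repeats all three steps, and the drift between data, metadata, and the deployed model is exactly the sync problem listed above.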


Need for convergence

It becomes obvious from these descriptions and challenges that, because of the current data trend, the problem space is converging many requirements into a single space: dealing with multiple kinds of data, stream processing with graph linking and storage, AI as part of query and ETL, workflow, CEP, and so on.

Therefore, to counter this convergence in the problem space, we must also converge at the solution level. If we can do this, it becomes rather straightforward to deal with the challenges and enable use cases in an efficient manner.

  1. Check out a case study comparing the performance of a converged database vs MongoDB
  2. Performance benchmark for a converged database

Benefits of convergence


Benefits of converged database

Summary

While a document database is needed at a bare minimum for many scenarios, we may find it difficult to handle upcoming use cases, or even existing ones in a modern context, with a traditional database such as a document database alone. It takes huge effort, cost, resources, and time to manually stitch multiple different systems into a modern platform. Complexity increases in a non-linear manner, and in the end the result often becomes brittle, where even a simple use case invariably takes a long time to implement. We must align our thinking and our model of data processing with the changing times.

We have worked with organisations solving similar problems, where it took them forever to add features and use cases. A converged database is one model that naturally addresses these challenges in a simple and scalable manner.
