Why are document databases not sufficient?
Sachin Sinha
A shift is underway
A major shift is underway in the way data is generated and consumed. Devices, log files, machines, apps and so on now stream data from many disparate sources. This data has a less defined structure and comes in all sizes and shapes, yet it is connected in some sense. It is real-time data whose value perishes if it is not extracted and analysed in real time. Context is important, and we need to link different data to get the right insight. The need to act in real time has never been emphasised more than it is today.
All of this is changing the requirements; even for a humble, traditional use case, we need to address them to solve the problem satisfactorily. Pattern and anomaly detection, continuous real-time analysis, and event and workflow management for automated actions are now expected out of the box, in addition to efficient data storage, querying and management at large scale.
The trend is disruptive, is happening at unprecedented speed, and is putting a lot of pressure on organisations, users and developers to meet the requirements of their use cases. On the other hand, the majority of the databases and systems on the market were created decades ago. This is creating an ever-increasing impedance mismatch that cannot be addressed with the older, traditional approaches.
What’s the major change?
Modern apps require a fusion of different systems to process data, even for traditional use cases. This means we need to bring in all kinds of tools and technologies, stitch them together, and then solve our problem in some way.
While a document database allows us to process unstructured data, other elements need to be brought in as well.
Why is a document or graph database alone not sufficient?
One can ask why a document database is not sufficient to handle all types of data. Similarly, why might a graph database not be enough to deal with different kinds of data? Let's see why neither is adequate on its own, and why force-fitting them is not recommended.
Real-time and continuous event/time-series data ingestion and processing
Need to find patterns and anomalies in streaming data
Need to compute running aggregates and statistics for queries
Need to train models and predict on streaming data
State-based continuous CEP and actions in real time
Map-update instead of map-reduce
Just having a timestamp as the primary key is not enough to treat a document database as a stream processing database (you probably know which document database I'm referring to).
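To make the "map-update" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular database implements it: the names RunningStats and process_stream, the (key, value) event shape and the three-standard-deviation threshold are all assumptions made for this example. Each incoming event updates a small per-key state in place (using Welford's online algorithm for mean and variance), so running aggregates and a simple anomaly check are available at any moment without re-scanning stored documents.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Per-key state that is updated in place for every event (hypothetical helper)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's online update: O(1) work per event, no stored history.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        # Sample standard deviation; defined only once we have two values.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

    def is_anomaly(self, value: float, threshold: float = 3.0) -> bool:
        # Flag a value that lies more than `threshold` standard deviations
        # from the running mean seen so far (an assumed, simplistic rule).
        sd = self.stddev
        return self.count > 1 and sd > 0 and abs(value - self.mean) > threshold * sd


def process_stream(events, threshold: float = 3.0):
    """Consume (key, value) events; return the per-key stats and flagged anomalies."""
    stats = {}
    anomalies = []
    for key, value in events:
        state = stats.setdefault(key, RunningStats())
        # Check against the state *before* folding the new value in.
        if state.is_anomaly(value, threshold):
            anomalies.append((key, value))
        state.update(value)
    return stats, anomalies


events = [("s1", 10.0), ("s1", 11.0), ("s1", 9.0), ("s1", 10.0), ("s1", 50.0)]
stats, anomalies = process_stream(events)
print(anomalies)
print(stats["s1"].count, stats["s1"].mean)
```

A timestamp-keyed document collection would have to re-query and re-aggregate stored documents to answer the same question, which is the gap the list above points at.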
Graph-native integration with stream processing
Storing nodes and triples (subject, predicate and object) in an efficient manner
Graph-native algorithm support requires a different storage model
Leveraging relations for queries over the network
Ontologies and property graphs require traversing links
Avoiding run-time relation-based joins in the traditional manner
Just having a "refid" embedded in the document is not enough to treat a document database as a graph processing database (you probably know which document database I'm referring to).
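The difference between following links and joining on an embedded reference can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: TripleStore, its methods and the sample data are hypothetical names invented here, and a real graph store would use a far more compact layout. The point is only that triples are indexed by subject, so a multi-hop query walks adjacency lists directly instead of issuing one lookup-and-join per hop.

```python
from collections import defaultdict, deque


class TripleStore:
    """Toy in-memory store for (subject, predicate, object) triples (hypothetical)."""

    def __init__(self):
        # Adjacency index: subject -> list of (predicate, object) pairs.
        self._out = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._out[subject].append((predicate, obj))

    def neighbours(self, subject, predicate=None):
        # Direct index lookup; no scan over unrelated documents.
        return [o for p, o in self._out.get(subject, [])
                if predicate is None or p == predicate]

    def reachable(self, start, predicate=None, max_hops=3):
        """Breadth-first traversal; returns (node, hops) pairs in visit order."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for nxt in self.neighbours(node, predicate):
                if nxt not in seen:
                    seen.add(nxt)
                    found.append((nxt, depth + 1))
                    queue.append((nxt, depth + 1))
        return found


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("carol", "knows", "dave")
store.add("alice", "owns", "device-1")

# Who can alice reach through "knows" links within two hops?
print(store.reachable("alice", predicate="knows", max_hops=2))
```

With only a "refid" inside each document, every hop in this query would be a separate lookup plus an application-side join, and the cost grows with each additional hop.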
Training models remains largely a manual task (even when the metadata is largely fixed)
Versioning and deployment of models remain a challenge or a manual job
AutoML is very hard from an infrastructure perspective
Metadata, data and models are mostly not in sync
Need for convergence
It becomes obvious from the problem descriptions and challenges that, due to the current data trend, too many requirements are converging in a single problem space: for example, dealing with multiple kinds of data, stream processing with graph linking and storage, AI as part of query and ETL, workflow, CEP and so on.
Therefore, to counter this convergence in the problem space, we must also converge at the solution level. If we can do this, it may be rather straightforward to deal with the challenges and enable use cases efficiently.
Benefits of convergence
Summary
While a document database is needed at a bare minimum for many scenarios, we may find it difficult to handle upcoming use cases, or even existing use cases in a modern context, with a traditional database such as a document database. It requires huge effort, cost, resources and time to manually stitch multiple different systems together to create a modern platform. Complexity increases in a non-linear manner, and in the end the platform often becomes brittle, so that even a simple use case invariably takes a long time to implement. We must align our thinking and our model of data processing with the changing times.
We have worked with organisations solving similar problems, where it took them forever to add features and use cases. A converged database is one such model, which naturally addresses these challenges in a simple and scalable manner.