The Case for an Elasticsearch Join API
Elasticsearch is the leading search engine today - it has become the first choice for thousands of companies worldwide.
It is used in a huge variety of cases - from traditional event logging to security alerts and incidents.
Its dominance is backed by a large and vibrant community that drives development at a fantastic pace.
The product portfolio has grown by an order of magnitude - from the humble ELK stack to a broad offering that includes Beats, APM, SIEM, Search, Cloud and more.
The company behind Elasticsearch has grown tremendously and is hiring many engineers to meet the community's ongoing demand for new features and solutions.
My personal story with Elasticsearch started a couple of years ago, when we began using it to index social network interactions and needed fast, complex search and aggregation capabilities.
Modelling the Data
Very early on we mapped our domain entity model in a simple way that resembles standard SQL tables.
Soon enough we realized that we would not be able to correlate (join) different entities that share common fields - Elasticsearch has no support for joins!
After some basic research on the internet we came up with the following solutions:
* Add multiple types to a single index mapping (single index - many entities) - RIP: multi-type indices are no longer supported
* Add nested documents (object arrays) to each index - the nested documents represent the entity's outgoing relations
* Add a parent-child relationship within the same index
* Denormalize the data to store materialized (and predefined) join indices
None of these solutions is perfect - each has its pros and cons, depending on the data volume, the speed required of the actual queries and the number of joins needed in practice.
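To make two of the options above concrete, here is a minimal sketch, not our production code. The mapping uses Elasticsearch's `join` field type for the parent-child option; the field name `relation`, the `user`/`post` entities and the `denormalize` helper are hypothetical names invented for illustration.

```python
# Hypothetical index mapping for the parent-child option.
# "relation", "user" and "post" are illustrative names, not from a real schema.
PARENT_CHILD_MAPPING = {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"user": "post"}},
        }
    }
}


def denormalize(users, posts):
    """Materialize the user/post join at index time by copying user fields into each post."""
    users_by_id = {u["id"]: u for u in users}
    docs = []
    for post in posts:
        user = users_by_id.get(post["user_id"])
        if user is None:
            continue  # inner-join semantics: drop posts whose user is unknown
        docs.append({**post, "user_name": user["name"]})
    return docs


users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
posts = [
    {"id": 10, "user_id": 1, "text": "hello"},
    {"id": 11, "user_id": 3, "text": "orphan"},
]
denormalized = denormalize(users, posts)
```

The trade-off is visible even in this toy: the denormalized documents are fast to query, but every change to a user's name means re-indexing all of that user's posts.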
Modern solutions
Keep in mind that at this stage (6 years ago) the modern solutions that exist today were not mature (or didn't exist):
* Apache Spark joins, using the elasticsearch-hadoop connector to hand off the join task to Spark
* Presto with its Elasticsearch connector, again handing off the join task to Presto (the connector offers minimal support for Elasticsearch's rich DSL)
* Apache Calcite - a cost-based SQL execution plan optimizer with an adapter for Elasticsearch (again - very basic support)
Elasticsearch to the rescue?
At this point I started thinking - why doesn't Elasticsearch offer this functionality?
When searching the internet for clues I came across the following answers:
- Denormalize your data according to the joins your customers will need...
- Since Elasticsearch is a distributed NoSQL datastore, it makes no sense to try to "force" relational 'algebraic behavior' on it...
Hmmm....
These all sound like good excuses, but unfortunately real day-to-day needs exceed such purist thinking.
Let's review what Elasticsearch has done in the past few years that contradicts these declarations and shows that it does follow the community's needs, even when they clash with the 'NoSQL way of thinking':
* Adding transforms support - materializing group-by aggregations around a pivot
This feature had been circulating inside the community for some time and was implemented independently on the application side, until Elasticsearch accepted the challenge and created the appropriate API
* Adding SQL query language support - again, SQL is a standard even for NoSQL databases
The community had requested this feature for a long time and offered independent solutions for translating SQL to Elasticsearch's DSL, until the Elasticsearch engineering group finally accepted the requests and implemented it as a built-in API.
I could go on arguing that many additional features were added to Elasticsearch under the direct influence of the community and are not always aligned with a purist engineering agenda - but this is the beauty of our community.
I would also argue that if we consider Kibana to be a BI tool - could you imagine a BI product that has no join operations?!
So - why isn't join already part of the Elasticsearch API?
See the discussions related to this subject:
* https://github.com/elastic/elasticsearch/pull/3278
* https://github.com/elastic/elasticsearch/issues/28639
* https://github.com/elastic/elasticsearch/issues/27315
* https://github.com/rmagen/elastic-gremlin/issues/42
* https://github.com/sirensolutions/siren-join (obsolete ...)
This clearly shows that the community has been interested in such an API for some time...
The Inherent complexity of a Join API
I would like to dive into some of the complexities that I assume are a major obstacle to such a development.
A join between two or more indices requires the following capabilities:
* an execution plan that optimizes the join direction
* statistics on indices and fields so the optimizer can plan the join order
* pushing down the join predicate to each index specifically
* paging capability to allow scrolling through the data
* strong caching for the hash-join temp tables
* the ability to sort / boost the join results accordingly
* streaming (lots of) data between nodes across the cluster
We could keep adding engineering concerns here, but I think I've made the point regarding the expected complexity... Nevertheless, you can find other NoSQL databases that offer a similar join API in spite of the huge engineering effort.
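Three of the capabilities listed above (join direction, predicate push-down and the hash-join temp table) can be illustrated with a small in-memory sketch. This is only a toy under stated assumptions: documents are plain Python dicts, both inputs fit in memory, and `hash_join` with its parameters is a hypothetical helper, not an Elasticsearch API.

```python
def hash_join(left, right, key, left_filter=None, right_filter=None):
    """Inner hash join of two lists of dicts on a shared key field."""
    # Predicate push-down: filter each side before any join work happens.
    if left_filter is not None:
        left = [doc for doc in left if left_filter(doc)]
    if right_filter is not None:
        right = [doc for doc in right if right_filter(doc)]

    # Join direction: build the hash table on the smaller side, probe with the larger.
    if len(left) <= len(right):
        build, probe, build_is_left = left, right, True
    else:
        build, probe, build_is_left = right, left, False

    # The "temp table" a real engine would want to cache.
    table = {}
    for doc in build:
        table.setdefault(doc[key], []).append(doc)

    for doc in probe:
        for match in table.get(doc[key], []):
            l, r = (match, doc) if build_is_left else (doc, match)
            yield {"left": l, "right": r}


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
events = [
    {"user_id": 1, "type": "like"},
    {"user_id": 1, "type": "share"},
    {"user_id": 2, "type": "like"},
]
likes = list(hash_join(users, events, "user_id",
                       right_filter=lambda e: e["type"] == "like"))
```

In a cluster the hard parts are exactly what this sketch hides: the build table may not fit on one node, and the probe side has to be streamed across shards.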
Existing Assets we already have today
Taking all the above into consideration, it's worth mentioning that there are engineering assets that would simplify the development process.
Modern Elasticsearch (version >7) has added many capabilities and libraries that would assist such an engineering effort - let's name a few:
* Async search:
The async search API lets you asynchronously execute a search request, monitor its progress, and retrieve partial results as they become available.
This async capability is very useful for long join operations, which require long-running, multi-part background queries.
* Checkpoints (appeared in transforms API)
Each time a transform examines the source indices and creates or updates the destination index, it generates a checkpoint.
This capability to preserve the point in time the index was examined is also an important part of the ability to perform a paged join over large indices.
* Sorted Indices
When creating a new index in Elasticsearch it is possible to configure how the segments inside each shard will be sorted.
This is also a super important capability that was not available a couple of years ago - it keeps the index sorted, which helps perform the join more efficiently using a sort-merge join.
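As a rough illustration of why sorted input matters, here is a minimal sort-merge join sketch. It assumes both inputs are already fully sorted by the join key; in Elasticsearch, index sorting orders documents within each segment, so a real implementation would still need to merge across segments and shards. The settings dict uses the `index.sort.field` / `index.sort.order` settings, while the field name `user_id` and the `merge_join` helper are hypothetical.

```python
# Index settings that sort each segment by the join key at index creation time.
# "user_id" is an illustrative field name.
INDEX_SORT_SETTINGS = {
    "settings": {"index": {"sort.field": "user_id", "sort.order": "asc"}}
}


def merge_join(left, right, key):
    """Inner join of two lists of dicts that are already sorted ascending by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side once...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ...and pair it with every left row carrying the same key.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 2, "a": "z"}, {"k": 4, "a": "w"}]
right = [{"k": 2, "b": "p"}, {"k": 2, "b": "q"}, {"k": 3, "b": "r"}, {"k": 4, "b": "s"}]
pairs = [(l["a"], r["b"]) for l, r in merge_join(left, right, "k")]
```

Unlike the hash join, this approach needs no temp table in memory: both sides are consumed in a single forward pass, which fits well with paging through large indices.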
Additional Capabilities needed
Join execution planner - this functionality is necessary since the join order is crucial in determining execution time and space.
Most databases maintain live statistics about their tables, including the cardinality, ordinality and variance of fields and the distribution of their values.
Elasticsearch has great capabilities for doing just that:
* HyperLogLog-based cardinality counting
* A rich aggregations API
I would love to dive into additional engineering and cluster resource constraints (such as data transfer between nodes) - but I think the general complexity is understood...
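To sketch how those statistics could feed a planner: the first function below builds a search body that uses the `cardinality` aggregation to estimate the number of distinct join-key values, and the second applies a simple greedy heuristic that joins the smallest inputs first. The function names, the index names and the numbers are all hypothetical, and a real cost-based optimizer would weigh far more than a single estimate per index.

```python
def cardinality_request(field):
    """Search body asking for an approximate distinct count of `field`."""
    return {"size": 0, "aggs": {"distinct": {"cardinality": {"field": field}}}}


def plan_join_order(estimates):
    """Order indices from fewest to most estimated join-key values (greedy heuristic)."""
    return sorted(estimates, key=lambda name: estimates[name])


# Made-up estimates, as if collected by running the request above per index.
estimates = {"events": 5_000_000, "users": 120_000, "campaigns": 300}
order = plan_join_order(estimates)
body = cardinality_request("user_id")
```

Starting from the smallest side keeps the intermediate results, and therefore the data shipped between nodes, as small as possible.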
TL;DR - What exactly do you want?
To summarize - this is an opinionated article about the advantages of adding a join API to Elasticsearch.
We reviewed why this feature is needed even if it doesn't exactly fit the NoSQL purist agenda.
We reviewed the existing Elasticsearch alternatives for making joins (mainly denormalizing your data).
We discussed other open source projects that are able to do this join - mainly Spark and Presto.
I argued that the new versions of Elasticsearch are now in a great position to allow such an engineering adventure.
Lastly, we described some of the missing capabilities needed to complete this task and showed that the existing Elasticsearch abilities position it perfectly for doing so.
Conclusion
Elasticsearch is a great open source database with advanced capabilities and a lively, vibrant community.
The community has a real need for join operations between indices and does not always accept data denormalization as the alternative.
Elasticsearch's engineering group has grown in both size and maturity and has a great opportunity to implement exciting new features.
Good Luck Elasticsearch
Lior Perry
The writer is a loving Elasticsearch user and the author of YangDb - an open source knowledge-graph database that uses Elasticsearch as its storage layer
Data and AI Product Lead | Microsoft
4 years ago - Lior Perry, overall I think it's a great direction. I would argue that this is a special case of infrastructure augmentation to implement the general Data Mesh approach (see links below if not familiar), whereby data products are agnostic to, or rather abstract away, the storage layer. The crux of the problem in developing domain-specific data products often lies in the difficulty of cataloguing and identifying data. The document-oriented approach and discoverability mechanisms, as well as the collection of connected tools (e.g. Beats), make Elasticsearch a great platform to start evolving your Data Mesh. https://martinfowler.com/articles/data-monolith-to-mesh.html https://martinfowler.com/articles/data-mesh-principles.html
System Architect | CKA CKAD Certified
4 years ago - Sounds like adding OOP capabilities to Go, or making Scala a non-functional programming language, or converting Redis into a queue engine... Those are two different paradigms, not just a matter of a missing API. In NoSQL it's OK to duplicate data and to use inner documents instead of joins, and there are dozens of other architectural decisions aimed at keeping the technology doing what it was built for. If you need complex joins, mathematical and data transformations, or relational modeling, just pick the right tool for that: SQL Server, Vertica, etc.