The case for Elasticsearch Join API

Elasticsearch is the leading search engine today - it has become the first choice for thousands of companies worldwide.

It is used in a huge variety of cases - from traditional log analytics to the cyber world of security alerts and incidents.

Its dominance is backed by a huge and vibrant community that pushes and drives development at a fantastic pace.

Elasticsearch has expanded its portfolio enormously - from the humble ELK stack to a vast product offering including Beats, APM, SIEM, Search, Cloud and more...

Elastic, the company behind it, has grown tremendously and is hiring many engineers to fulfill the community's ongoing demand for new features and solutions.

My personal story with Elasticsearch started a few years ago, when we began using it to index social network interactions and needed fast, complex search and aggregation capabilities.

Modelling the Data

Very early on we mapped our domain entity model in a simple way that resembles standard SQL tables.

Soon enough we realized that we would not be able to correlate (join) different entities that share common fields - Elasticsearch has no support for joins!

After some basic research on the internet we came up with the following options:

* Add multiple mapping types to a single index (single index, many entities) - RIP, multi-type indices are no longer supported

* Add nested documents (object arrays) to each index - the nested documents represent the entity's outgoing relations

* Add a parent-child relationship within the same index

* Denormalize the data into materialized (and predefined) join indices

None of these solutions is perfect - each has its pros and cons, depending on data volume, query speed requirements and how many joins are actually needed. The two most common in-index options, nested documents and parent-child, are sketched below.
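To make those two options concrete, here is a minimal sketch (Python with the requests library against a local cluster; index and field names are hypothetical, not from the original article) of how each relationship is declared and queried:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Option: nested documents - the relation is embedded inside the parent document.
requests.put(f"{ES}/orders_nested", json={
    "mappings": {
        "properties": {
            "customer": {"type": "keyword"},
            "items": {                      # each order embeds its line items
                "type": "nested",
                "properties": {
                    "sku":   {"type": "keyword"},
                    "price": {"type": "double"},
                },
            },
        }
    }
})

# Querying the embedded relation requires a nested query.
requests.post(f"{ES}/orders_nested/_search", json={
    "query": {
        "nested": {
            "path": "items",
            "query": {"term": {"items.sku": "ABC-123"}},
        }
    }
})

# Option: parent-child - both entities live in the same index, linked by a join field.
requests.put(f"{ES}/customers_orders", json={
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"customer": "order"},   # customer is the parent of order
            }
        }
    }
})

# has_child returns customers that have at least one order matching the inner query.
requests.post(f"{ES}/customers_orders/_search", json={
    "query": {
        "has_child": {
            "type": "order",
            "query": {"range": {"total": {"gte": 100}}},
        }
    }
})
```

Note that with the join field, child documents must be indexed with routing set to the parent's id so that parent and children land on the same shard - part of the reason this option is confined to a single index.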

Modern solutions

Keep in mind that at that stage (about six years ago) the modern solutions that exist today were not mature, or did not exist at all:

* Apache Spark joins, using the elasticsearch-hadoop connector to hand the join off to Spark (see the sketch after this list)

* Apache Presto with its Elasticsearch connector, again handing the join off to Presto (the connector offers only minimal support for Elasticsearch's rich DSL)

* Apache Calcite - a cost-based SQL execution-plan optimizer with an adapter for Elasticsearch (again, very basic support)
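As an illustration of the hand-off approach, here is a minimal PySpark sketch (assuming the elasticsearch-hadoop / elasticsearch-spark jar is on the Spark classpath; index and column names are hypothetical):

```python
from pyspark.sql import SparkSession

# Assumes the elasticsearch-hadoop (elasticsearch-spark) jar is on the classpath.
spark = SparkSession.builder.appName("es-join-handoff").getOrCreate()

# Each index is loaded as a DataFrame through the connector's data source.
customers = (spark.read
             .format("org.elasticsearch.spark.sql")
             .option("es.nodes", "localhost")
             .option("es.port", "9200")
             .load("customers"))

orders = (spark.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "localhost")
          .option("es.port", "9200")
          .load("orders"))

# The join itself is executed by Spark, not by Elasticsearch - Elasticsearch
# only serves the per-index scans (with whatever filters get pushed down).
joined = customers.join(orders, on="customer_id", how="inner")

joined.select("customer_id", "name", "order_id", "total").show()
```

The price of this hand-off is that the data has to leave the cluster: Spark pulls both sides of the join out of Elasticsearch before joining them.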


Elasticsearch to the rescue?

At this point I started thinking - why doesn't Elasticsearch offer this functionality?

When searching the internet for some clues I came up with the following answers: 

- Denormalize your data according to the joins your customers will actually need...

- Since Elasticsearch is a distributed NoSQL datastore, it makes no sense to "force" a relational 'algebraic behavior' onto it...

Hmmm....

These all sound like good excuses, but the nature of things is that real day-to-day needs exceed such purist thinking.

Let's review what Elasticsearch has done in the past few years that contradicts these declarations and proves they do follow community needs, even when it clashes with the 'NoSQL way of thinking':

* Adding transform support - materializing group-by aggregations according to some pivot

This is a feature that circulated in the community for some time and was independently implemented on the application side, until Elasticsearch accepted the challenge and created the appropriate API.

* Adding SQL query language support - again, SQL has become a standard even for NoSQL databases

The community requested this feature for a long time and offered independent solutions for translating SQL into Elastic's DSL, until Elasticsearch's engineering team finally accepted the requests and implemented it as a built-in API (a short sketch of both features follows below).
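Here is a minimal sketch of both features (Python with requests; index, transform and field names are hypothetical): defining a pivot transform, running a SQL query, and translating a SQL statement into the native DSL:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Transforms API (7.5+): materialize a group-by aggregation into a destination index.
requests.put(f"{ES}/_transform/orders_by_customer", json={
    "source": {"index": "orders"},
    "dest": {"index": "orders_by_customer"},
    "pivot": {
        "group_by": {
            "customer_id": {"terms": {"field": "customer_id"}}
        },
        "aggregations": {
            "total_spent": {"sum": {"field": "total"}}
        },
    },
})
requests.post(f"{ES}/_transform/orders_by_customer/_start")

# SQL API: query indices with SQL ...
requests.post(f"{ES}/_sql?format=txt", json={
    "query": "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"
})

# ... or just translate a SQL statement into the equivalent query DSL.
requests.post(f"{ES}/_sql/translate", json={
    "query": "SELECT customer_id, total FROM orders WHERE total > 100"
})
```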

I could keep listing features that were added to Elasticsearch under the direct influence of the community and are not always aligned with a purist engineering agenda - but that is the beauty of our community.

I would also argue that if we consider Kibana to be a BI tool - could you imagine a BI product with no join operations?!

So - why isn't join already part of the Elasticsearch API?

See the discussions related to this subject:

* https://github.com/elastic/elasticsearch/pull/3278

* https://github.com/elastic/elasticsearch/issues/28639

* https://github.com/elastic/elasticsearch/issues/27315

* https://github.com/rmagen/elastic-gremlin/issues/42

* https://github.com/sirensolutions/siren-join (obsolete ...)

This definitely shows that the community has been interested in such an API for quite some time...

The Inherent complexity of a Join API

I would like to dive into some of the complexities that I assume are a major obstacle to such a development taking place.

A join between two or more indices requires the following capabilities:

* an execution plan to optimize the join direction

* statistics on indices and fields so the optimizer can plan the join order

* pushing the join predicate down to each index individually

* paging capability to allow scrolling through the data

* strong caching for the hash-join temporary tables

* the ability to sort / boost the join results accordingly

* streaming (lots of) data between nodes across the cluster

We could keep adding engineering concerns here, but I think I've made the point regarding the expected complexities... nevertheless, other NoSQL databases do offer a similar join API in spite of the huge engineering effort.
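To get a feel for what is involved even in the simplest case, here is a minimal sketch of the two-phase, application-side semi-join that users typically implement today - essentially what projects like siren-join automated (Python with requests; index and field names are hypothetical):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Phase 1: collect the join keys from the "driving" index.
# A real implementation would page through results (search_after / scroll)
# and worry about the size limit of the resulting terms query.
resp = requests.post(f"{ES}/customers/_search", json={
    "size": 1000,
    "_source": ["customer_id"],
    "query": {"term": {"country": "US"}},
}).json()

customer_ids = [hit["_source"]["customer_id"] for hit in resp["hits"]["hits"]]

# Phase 2: push the collected keys into the second index as a terms filter.
orders = requests.post(f"{ES}/orders/_search", json={
    "query": {
        "bool": {
            "filter": [
                {"terms": {"customer_id": customer_ids}},
                {"range": {"total": {"gte": 100}}},
            ]
        }
    },
}).json()

# Everything a real join API would handle - join order, paging, caching,
# predicate push-down, sorting of the joined results - is left to the
# application in this approach.
```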


Existing Assets we already have today 

Taking all of the above into consideration, it's worth mentioning that there are existing engineering assets that would simplify the development process.

The modern Elasticsearch database (version 7 and above) has added many capabilities and libraries that would assist such an engineering effort - let's name a few:

* Async search:

The async search API lets you execute a search request asynchronously, monitor its progress, and retrieve partial results as they become available.

This capability is very useful for long join operations, which require long-running, multi-part background queries.

* Checkpoints (introduced in the transforms API)

Each time a transform examines the source indices and creates or updates the destination index, it generates a checkpoint.

This ability to preserve the point in time at which the index was examined is also an important building block for performing a paged join over large indices.

* Sorted Indices

When creating a new index in Elasticsearch it is possible to configure how the segments inside each shard are sorted.

This is another super-important capability that was not available a couple of years ago - it keeps the index sorted, which helps perform the join more efficiently using a sort-merge join (a short sketch follows below).
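As a quick illustration of two of these assets, here is a minimal sketch (Python with requests; index and field names are hypothetical) of creating an index sorted on the join key and submitting an async search against it:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Sorted index: segments are kept sorted on the (hypothetical) join key,
# which is exactly what a sort-merge join would want to exploit.
requests.put(f"{ES}/orders_sorted", json={
    "settings": {
        "index": {
            "sort.field": "customer_id",
            "sort.order": "asc",
        }
    },
    "mappings": {
        "properties": {
            "customer_id": {"type": "keyword"},
            "total": {"type": "double"},
        }
    },
})

# Async search: submit a (potentially long) search and poll for results later.
submit = requests.post(
    f"{ES}/orders_sorted/_async_search?wait_for_completion_timeout=1s",
    json={"query": {"range": {"total": {"gte": 100}}}},
).json()

if submit.get("is_running", False):
    # Not done yet - poll with the returned id until the search finishes.
    results = requests.get(f"{ES}/_async_search/{submit['id']}").json()
else:
    results = submit  # completed within the timeout
```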

Additional Capabilities needed 

Join execution planner - this functionality is necessary since the join order is crucial in determining execution time and space.

Most databases maintain live statistical information about their tables, including cardinality, ordinality and variance of fields and the distribution of their values.

Elasticsearch already has great capabilities for doing just that:

* HyperLogLog counting functionality (the cardinality aggregation)

* A great aggregations API (sketched below)
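For example, a planner could estimate how many distinct join keys each side holds before choosing a join order. A minimal sketch (Python with requests; names hypothetical):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# The cardinality aggregation returns an approximate (HyperLogLog++) distinct
# count - exactly the kind of statistic a join planner would consult.
resp = requests.post(f"{ES}/orders/_search", json={
    "size": 0,
    "aggs": {
        "distinct_customers": {
            "cardinality": {"field": "customer_id"}
        }
    },
}).json()

print(resp["aggregations"]["distinct_customers"]["value"])
```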

I would love to dive into additional engineering and cluster resource constraints (such as data transfer between nodes) - but I think the general complexity is understood...


TL;DR - what exactly do you want?

To summarize - this is an opinionated article about the advantages of adding a join API to Elasticsearch.

We reviewed why this feature is needed even if it doesn't exactly fit the NoSQL purist agenda.

We reviewed the existing Elasticsearch alternatives for making joins (mainly denormalizing your data).

We discussed other open-source projects that are able to do this join - mainly Spark and Presto.

I argued that the new versions of Elasticsearch are now in a great position to allow such an engineering adventure.

Lastly, we described some of the missing capabilities needed to complete this task and showed that Elasticsearch's existing abilities position it perfectly to do so.

Conclusion 

Elasticsearch is a great open-source database with advanced capabilities and a lively, vibrant community.

The community has a real need to perform join operations between indices and does not always accept the data-denormalization alternative.

Elasticsearch's engineering group has grown in both size and maturity, and it now has a great opportunity to implement exciting new features.

Good Luck Elasticsearch


Lior Perry

The writer is a loving Elasticsearch user and the author of YangDb - an open-source knowledge-graph database that uses Elasticsearch as its storage layer.

Ilya Venger

Data and AI Product Lead | Microsoft

4 years ago

Lior Perry Overall, I think it's a great direction. I would argue that this is a special case of infrastructure augmentation to implement the general Data Mesh approach (see links below if not familiar). Whereby data products are agnostic, or rather abstract away, the storage layer. The crux of the problem in developing domain-specific data products often lies in the difficulty to catalogue and identify data. The document oriented approach and discoverability mechanisms as well as the collection of connected tools (e.g. Beats) make Elasticsearch a great platform to start evolving your Data Mesh. https://martinfowler.com/articles/data-monolith-to-mesh.html https://martinfowler.com/articles/data-mesh-principles.html

Tomer Shaiman

System Architect | CKA CKAD Certified

4 years ago

Sounds like adding OOP capabilities to Go, or making Scala a non-functional programming language, or converting Redis into a queue engine... those are different paradigms, not just a flavor of missing API. In NoSQL it's OK to duplicate data and to use inner documents instead of joins, and there are dozens of other architecture decisions aimed at keeping the technology doing what it was built for. If you need complex joins, mathematical and data transformations, or relational-style modeling, just pick the right tool for that: SQL Server, Vertica, etc.
