Introduction to Elasticsearch
Aneshka Goyal
AWS Certified Solutions Architect | Software Development Engineer III at Egencia, An American Express Global Business Travel Company
What is Elasticsearch?
Elasticsearch is a RESTful, distributed search and analytics engine built on top of Apache Lucene, an open-source Java library that provides indexing and search features (we will see how important indexes are to Elasticsearch). One of the most common use cases, which has also become an identity of Elasticsearch, is being the heart of the ELK (Elastic) stack, which is used to take data from various sources and formats, then store, aggregate, visualize and analyze it, all in real time. The Elastic stack comprises Elasticsearch, Logstash, Kibana and Beats. Beats and Logstash collect and aggregate data from various sources and store it in Elasticsearch. Kibana provides a UI that lets users interact with the data, create visualizations, and perform analysis and search activities. All the storage, analysis and searching takes place in Elasticsearch. One common example of such data is the logs collected from the various applications in an organization. The image below gives an overview of the process.
As the core of the Elastic stack, Elasticsearch deals with Big Data and is able to provide insights in real time (milliseconds). This powers security and business analysis, aided by machine learning (ML) capabilities. With its strong search capabilities it can also serve applications that rely on searching data, such as e-commerce applications that want to search product catalogs and respond in a fraction of a second. Thus the USP of Elasticsearch is its ability to consume data and provide results in near real time.
Let's try to understand how Elasticsearch functions to provide such lightning-fast response times.
Elasticsearch works on the concepts of Document, Index and Inverted Index.
Document
Elasticsearch allows us to store information as documents, the basic units of information, represented in JSON. We can think of documents as records or rows of a table in a relational database. Each document can have fields representing strings, numbers, dates, etc. We can represent both structured and unstructured data as documents. Each document belongs to an index, which defines what type of document it is.
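For instance, a department document of the kind we will build later in this article could look like the following (the values here are purely illustrative):

```json
{
  "id": "Dept-2",
  "name": "Tech",
  "desc": "a technology dept",
  "category": "tech",
  "maxCapacity": "100",
  "employees": [
    { "id": "emp-3", "name": "smone1", "age": 22, "desc": "sde I" }
  ]
}
```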
Index
An index can be thought of as the highest level at which searches and queries are performed. In relational-database terms, an index is similar to a table. An index groups related documents together. Whenever we want to perform any operation on Elasticsearch we need to specify the index we are operating on. In the example below, we will have a dept-index that groups department documents, and we will perform operations on it.
Inverted Index
An inverted index is a mechanism leveraged by Elasticsearch and various other popular search engines. As discussed above, Elasticsearch stores a group of related documents under an index; each of these documents holds information as key-value pairs (since documents are JSON objects, the values can be of different data types such as strings and dates). By default, Elasticsearch indexes the values of every field of the document. For a textual field, we can picture this as building a map in which each term becomes a key and the value is the set of documents in which that term appears: this is the inverted index. Internally, a different data structure is used depending on the data type of the field; for example, geo data fields are indexed using BKD trees. These indexes are what enable the real-time response times of the searches we perform on Elasticsearch, such as fetching all documents that contain a particular term.
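As a rough illustration (not Elasticsearch's actual implementation, which lives in Lucene and also tracks term positions, frequencies and analyzed tokens), an inverted index over document descriptions can be sketched as a term-to-document-ids map:

```java
import java.util.*;

// Toy inverted index: maps each lowercased, whitespace-separated term to the
// set of document ids whose text contains it. Real analyzers do far more
// (tokenization rules, stemming, stop words), but the lookup idea is the same.
class InvertedIndexSketch {
    static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }
}
```

Searching for a term is then a single map lookup followed by fetching the listed documents, which is why term lookups stay fast regardless of how long each document is.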
Thus we saw above how Elasticsearch efficiently stores and indexes data to provide lightning-fast response times. In the beginning, while introducing what Elasticsearch is, we said it is a distributed search and analytics engine; let's try to understand what distributed means here, how it works, and what the advantages are.
When we talk about the term distributed, we think of a cluster of nodes, each node serving a set of operations. The same goes for Elasticsearch being distributed in nature, i.e. there is a cluster of nodes, and each node houses some shards. Each shard is a self-sufficient index (it does not depend on any other shard). Basically, each index (housing a group of related documents) can be broken into one or more physical shards, and these shards can then be placed on different nodes. Elasticsearch is smart enough to redistribute these shards when the number of nodes in the cluster changes. There are two types of shards: primaries and replicas. Each document in an index belongs to a primary shard and, if configured, to replica shards. Distribution and replication keep us safe in case of hardware issues on a node: a replica can take over without any impact. Being distributed in nature thus gives Elasticsearch scalability and fault tolerance, and therefore high availability. Elasticsearch also allows Cross-Cluster Replication (CCR), which helps when an entire cluster goes down in a disaster at a single location: the secondary cluster can take charge. CCR works in active-passive replication mode, and this in turn prevents a single datacenter from being a single point of failure.
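Conceptually, a document is routed to a primary shard with the formula shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document id. The sketch below illustrates only the idea; Elasticsearch actually uses a murmur3 hash, not Java's String.hashCode(), which stands in here purely for illustration:

```java
// Illustrative shard routing: the same key always lands on the same shard for a
// fixed number of primary shards. This also hints at why the primary shard
// count is fixed at index creation time: changing the modulus would re-map
// every existing document.
class ShardRouter {
    static int route(String routingKey, int numPrimaryShards) {
        return Math.floorMod(routingKey.hashCode(), numPrimaryShards);
    }
}
```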
Elasticsearch allows us to store and query the stored data. Queries can be written using the Elasticsearch DSL (Domain Specific Language) or an SQL-like query language. Though Elasticsearch is a NoSQL database, it still provides an SQL-like wrapper to help us easily query the underlying indexes. One can think of Elasticsearch SQL as a translator, one that understands both SQL and Elasticsearch and makes it easy to read and process data in real time, at scale, by leveraging Elasticsearch's capabilities.
We will talk more about the Elasticsearch DSL, which represents queries in JSON format. It consists of two types of clauses: leaf and compound. Leaf clauses such as match and term act on individual fields of a document, while compound clauses wrap other compound or leaf clauses, like the bool clause that we will visit below.
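For example, a bool compound clause can wrap a match leaf clause and a term leaf clause; the field names below anticipate the dept-index used later in this article:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "desc": "technology" } }
      ],
      "filter": [
        { "term": { "category": "tech" } }
      ]
    }
  }
}
```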
Let's set up Elasticsearch and Kibana as the first step. We will be using Docker images of both and will spin up containers from them. Below is the set of commands we need to execute. Please note that we could also download the two directly onto our machines and skip the Docker setup.
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.2
The above command pulls a specific version of the Elasticsearch image.
Next we will create a network called elastic and run a single-node container named es01-test from the above image. We will map the Elasticsearch container ports to the same ports on our local machine.
docker network create elastic
docker run --name es01-test --net elastic -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.2
We created a network so that we can spin up a Kibana container on the same network; Kibana can then reach the Elasticsearch container by its name (es01-test) and provide a user interface over the Elasticsearch instance we just spun up.
docker pull docker.elastic.co/kibana/kibana:7.6.2
docker run --name kib01-test --net elastic -p 127.0.0.1:5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.6.2
We will be able to connect to kibana on localhost port 5601.
Next let's create a spring boot application. This application will let us perform various operations like define an index mapping, save documents and perform various search and aggregation operations.
We will use the Spring initializr to initialize a spring project. The pom looks something like the one below.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="https://maven.apache.org/POM/4.0.0" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.7.7</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.example</groupId>
<artifactId>elasticsearch</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>elasticsearch</name>
<description>Demo project for Spring Boot with elastic search</description>
<properties>
<java.version>11</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>
spring-boot-starter-data-elasticsearch is all that we need for our application to be able to connect to Elasticsearch. Apart from this, the web dependency allows us to expose REST endpoints, as we will be creating some endpoints to interact with our application.
Before creating the APIs, let's define an index.
@Document(indexName = "dept-index")
public class Dept {
@Id
String id;
@Field(type = FieldType.Keyword)
String name;
@Field(type = FieldType.Text)
String desc;
@Field(type = FieldType.Keyword)
String category;
// Declared as String but mapped as Integer; Elasticsearch coerces numeric strings on indexing
@Field(type = FieldType.Integer)
String maxCapacity;
@Field(type = FieldType.Nested)
List<Employee> employees;
public String getId() {
return id;
}
public String getName() {
return name;
}
public String getDesc() {
return desc;
}
public String getCategory() {
return category;
}
public String getMaxCapacity() {
return maxCapacity;
}
public List<Employee> getEmployees() {
return employees;
}
public void setId(String id) {
this.id = id;
}
public void setName(String name) {
this.name = name;
}
public void setDesc(String desc) {
this.desc = desc;
}
public void setEmployees(List<Employee> employees) {
this.employees = employees;
}
public void setCategory(String category) {
this.category = category;
}
public void setMaxCapacity(String maxCapacity) {
this.maxCapacity = maxCapacity;
}
}
We created an index named dept-index that is intended to store information about departments: name, description, category, max capacity, and a list of employees, each employee itself being an object with fields like id, name, age and description. Here we created this index and defined the data types of the fields. If we do not define the mappings as we did above, Elasticsearch generates a mapping automatically via dynamic mapping (reflecting its NoSQL nature).
public class Employee {
String id;
String name;
Integer age;
String desc;
public String getId() {
return id;
}
public String getName() {
return name;
}
public Integer getAge() {
return age;
}
public String getDesc() {
return desc;
}
public void setId(String id) {
this.id = id;
}
public void setName(String name) {
this.name = name;
}
public void setAge(Integer age) {
this.age = age;
}
public void setDesc(String desc) {
this.desc = desc;
}
}
Let's now define the connection to our Elasticsearch endpoint so that once we start the application, the same index is available in our Elasticsearch instance.
@Configuration
@EnableElasticsearchRepositories(basePackages = "*")
@ComponentScan(basePackages = { "com.example.elasticsearch" })
public class Config extends AbstractElasticsearchConfiguration {
@Bean
@Override
public RestHighLevelClient elasticsearchClient() {
ClientConfiguration clientConfiguration = ClientConfiguration.builder()
.connectedTo("localhost:9200")
.build();
return RestClients.create(clientConfiguration).rest();
}
}
This configuration just tells Spring the host and port for connecting to Elasticsearch. We will also be using the Elasticsearch repository, so we have enabled that as well.
Now once we start the application and go to Kibana at localhost:5601, we should be able to query the mappings present in the dept-index that we created. Also, when we hit localhost:9200 for Elasticsearch, we should see a response like the one below.
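If the setup worked, hitting localhost:9200 returns a small JSON summary of the node. Its typical shape is sketched below; the name, uuid and version detail fields will differ on your machine:

```json
{
  "name" : "es01-test",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "...",
  "version" : {
    "number" : "7.6.2"
  },
  "tagline" : "You Know, for Search"
}
```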
Let's query the mappings for dept-index in Kibana using the Dev Tools console with the following query:
GET dept-index/_mapping
The response looks like below. This would help us confirm that the index is created as expected.
{
  "dept-index" : {
    "mappings" : {
      "properties" : {
        "_class" : {
          "type" : "keyword",
          "index" : false,
          "doc_values" : false
        },
        "category" : {
          "type" : "keyword"
        },
        "desc" : {
          "type" : "text"
        },
        "employees" : {
          "type" : "nested",
          "properties" : {
            "_class" : {
              "type" : "keyword",
              "index" : false,
              "doc_values" : false
            },
            "age" : {
              "type" : "long"
            },
            "desc" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        },
        "id" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "maxCapacity" : {
          "type" : "integer"
        },
        "name" : {
          "type" : "keyword"
        }
      }
    }
  }
}
Next let's take a look at the controller layer to see what type of operations we can perform in our application.
@RestController
@RequestMapping("v1/departments")
public class ElasticSearchController {
private final ElasticService service;
public ElasticSearchController(ElasticService service) {
this.service = service;
}
@PostMapping
public void createDept(@RequestBody Dept dept){
service.save(dept);
}
@GetMapping
public SearchHits<Dept> getDeptsWithDesc (@RequestParam("desc") String desc){
return service.getDepartmentWithDesc(desc);
}
@GetMapping ("/with-filter")
public SearchHits<Dept> getDeptsWithDescAndCategoryFilter (@RequestParam("desc") String desc, @RequestParam("filter") String filter) {
return service.getDepartmentWithDescAndCategoryFilter(desc, filter);
}
@GetMapping ("/with-aggregator")
public void getDeptsWithDescPlusAggregator (@RequestParam("desc") String desc) {
service.getDepartmentWithDescPlusAggregator(desc);
}
}
Here we get a hint of the different functionalities: we can create a department document; we can search departments by description; and we can apply a filter on category and then search on description (if we want to limit our search results). Finally, we will take a look at aggregations, i.e. grouping our documents into different buckets and then performing max/avg analysis (as we will see).
Before jumping into the service layer, let's take a look at our no-code repository layer.
@Repository
public interface DeptRepo extends ElasticsearchRepository<Dept, String> {
}
This interface extends ElasticsearchRepository, which provides us with implementations for basic operations such as save, findById, saveAll, delete, etc. We can even use the @Query annotation to specify a custom query if needed. We use this repository to save our documents; the other operations will be performed with a second mechanism that Spring Data Elasticsearch provides.
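As an aside, a custom query on the repository could look like the hypothetical methods below; the method names and the query body are illustrative, not part of this article's code:

```java
@Repository
public interface DeptRepo extends ElasticsearchRepository<Dept, String> {
    // Derived query: Spring Data builds the Elasticsearch query from the method name
    List<Dept> findByCategory(String category);

    // Custom query via the @Query annotation; ?0 refers to the first method argument
    @Query("{\"match\": {\"desc\": \"?0\"}}")
    List<Dept> findByDescMatching(String desc);
}
```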
So let's take a look at the two ways to manage indexes and perform normal as well as bulk operations on them.
The repository way is the one described above. For the way that leverages ElasticsearchOperations, we have three ways to build search queries: native query, string query and criteria query. NativeQuery provides the maximum flexibility for building a query using objects representing Elasticsearch constructs; there is a direct one-to-one mapping between the Elasticsearch DSL and the terms present in a native query. StringQuery allows us to represent the same query as a JSON string instead of building it term by term. CriteriaQuery serves the opposite purpose of a native query: it hides the Elasticsearch-specific terms and does not force us to use them while building our queries. We will take a look at the implementation of each in our service layer code.
@Service
public class ElasticService {
private final DeptRepo repo;
private final ElasticsearchOperations elasticsearchOperations;
public ElasticService(DeptRepo repo, ElasticsearchOperations elasticsearchOperations) {
this.repo = repo;
this.elasticsearchOperations = elasticsearchOperations;
}
public void save(Dept dept) {
repo.save(dept);
}
public SearchHits<Dept> getDepartmentWithDesc (String desc) {
Criteria criteria = new Criteria("desc").contains(desc);
CriteriaQuery query = new CriteriaQuery(criteria);
return elasticsearchOperations.search(query, Dept.class, IndexCoordinates.of("dept-index"));
}
public SearchHits<Dept> getDepartmentWithDescAndCategoryFilter (String desc, String filter) {
StringQuery query = new StringQuery("{\"bool\": { \"must\": [{\"match_phrase\" : {\"desc\" : {\"query\" : " + "\""+ desc + "\", \"slop\" : 2} } } ], \"filter\": [ {\"term\": {\"category\": " +"\"" + filter + "\" }}]}}");
return elasticsearchOperations.search(query, Dept.class, IndexCoordinates.of("dept-index"));
}
public void getDepartmentWithDescPlusAggregator (String desc) {
NativeSearchQuery query = new NativeSearchQueryBuilder()
.withQuery(new MatchPhraseQueryBuilder("desc", desc).slop(2))
.withAggregations(AggregationBuilders.terms("term-agg").field("category").subAggregation(new MaxAggregationBuilder("agg-maxCapacity").field("maxCapacity")))
.build();
SearchHits<Dept> ans = elasticsearchOperations.search(query, Dept.class, IndexCoordinates.of("dept-index"));
ElasticsearchAggregations aggregations = (ElasticsearchAggregations) ans.getAggregations();
List<Aggregation> aggregations1 = aggregations.aggregations().asList();
ParsedStringTerms t = (ParsedStringTerms) aggregations1.get(0);
System.out.println("Total buckets in term agg: " + t.getBuckets().size());
ParsedStringTerms.ParsedBucket bucket = (ParsedStringTerms.ParsedBucket) t.getBuckets().get(0);
ParsedStringTerms.ParsedBucket bucket1 = (ParsedStringTerms.ParsedBucket) t.getBuckets().get(1);
System.out.println("Bucket 1 total docs: " + bucket.getDocCount());
ParsedMax parsedMax = (ParsedMax) bucket.getAggregations().asList().get(0);
System.out.println("Bucket 1's max capacity is: " + parsedMax.getValue());
ParsedMax parsedMax2 = (ParsedMax) bucket1.getAggregations().asList().get(0);
System.out.println("Bucket 2 total doc count: " + bucket1.getDocCount());
System.out.println("Bucket 2's max capacity is: " + parsedMax2.getValue());
}
}
We will take a look at the code above step by step as we perform the operations exposed in the controller layer. First we save 3 documents, or departments, with the following details. This is done by hitting the POST endpoint of our application, which leverages the repository internally to perform the save operation.
Now that we have some data (as shown in the image above), we are all set to execute some searches, filters and aggregations.
First let's try a simple use case where we want to search based on description terms, so any department whose description contains those terms should come up in our response. This is a typical use case for search engines or e-commerce applications that depend on search capabilities.
Making a GET call to the endpoint below
http://localhost:8080/v1/departments?desc=dept
gives us the following response:
{
"totalHits": 3,
"totalHitsRelation": "EQUAL_TO",
"maxScore": 1.0,
"scrollId": null,
"searchHits": [
{
"index": "dept-index",
"id": "Dept-1",
"score": 1.0,
"sortValues": [],
"content": {
"id": "Dept-1",
"name": "Operations",
"desc": "a op dept",
"category": "non tech",
"maxCapacity": "30",
"employees": [
{
"id": "emp-5",
"name": "smone4",
"age": 22,
"desc": "analyst I"
},
{
"id": "emp-6",
"name": "someone4",
"age": 24,
"desc": "analyst II"
}
]
},
"highlightFields": {},
"innerHits": {},
"nestedMetaData": null,
"routing": null,
"explanation": null,
"matchedQueries": []
},
{
"index": "dept-index",
"id": "Dept-2",
"score": 1.0,
"sortValues": [],
"content": {
"id": "Dept-2",
"name": "Tech",
"desc": "a technology dept",
"category": "tech",
"maxCapacity": "100",
"employees": [
{
"id": "emp-3",
"name": "smone1",
"age": 22,
"desc": "sde I"
},
{
"id": "emp-4",
"name": "someone2",
"age": 24,
"desc": "sde II"
}
]
},
"highlightFields": {},
"innerHits": {},
"nestedMetaData": null,
"routing": null,
"explanation": null,
"matchedQueries": []
},
{
"index": "dept-index",
"id": "Dept-3",
"score": 1.0,
"sortValues": [],
"content": {
"id": "Dept-3",
"name": "HR",
"desc": "a resource dept",
"category": "non tech",
"maxCapacity": "45",
"employees": [
{
"id": "emp-5",
"name": "smone2",
"age": 22,
"desc": "hr"
},
{
"id": "emp-6",
"name": "someone2",
"age": 35,
"desc": "hr"
}
]
},
"highlightFields": {},
"innerHits": {},
"nestedMetaData": null,
"routing": null,
"explanation": null,
"matchedQueries": []
}
],
"aggregations": null,
"suggest": null,
"empty": false
}
The service logic below executes a contains operation on our description field. While executing the query, we also include information about the target index.
public SearchHits<Dept> getDepartmentWithDesc (String desc) {
Criteria criteria = new Criteria("desc").contains(desc);
CriteriaQuery query = new CriteriaQuery(criteria);
return elasticsearchOperations.search(query, Dept.class, IndexCoordinates.of("dept-index"));
}
Let's now take a look at filtering as well. Say we want to search on description but filter on a category to reduce our search space. For this we hit the second endpoint:
http://localhost:8080/v1/departments/with-filter?desc=a dept&filter=tech
Here we search for "a dept" but filter by the category tech. While creating departments we had saved two non tech and one tech category department.
{
"totalHits": 1,
"totalHitsRelation": "EQUAL_TO",
"maxScore": 0.17280531,
"scrollId": null,
"searchHits": [
{
"index": "dept-index",
"id": "Dept-2",
"score": 0.17280531,
"sortValues": [],
"content": {
"id": "Dept-2",
"name": "Tech",
"desc": "a technology dept",
"category": "tech",
"maxCapacity": "100",
"employees": [
{
"id": "emp-3",
"name": "smone1",
"age": 22,
"desc": "sde I"
},
{
"id": "emp-4",
"name": "someone2",
"age": 24,
"desc": "sde II"
}
]
},
"highlightFields": {},
"innerHits": {},
"nestedMetaData": null,
"routing": null,
"explanation": null,
"matchedQueries": []
}
],
"aggregations": null,
"suggest": null,
"empty": false
}
One field to note here is score: Elasticsearch calculates a match (relevance) score and returns it with each document. This is the basic difference between a match query and a filter clause: a match query can also return imperfect matches with a lower relevance score, while a filter is binary (a document either matches or it does not) and no score calculations are made.
So here, instead of the 3 results we got above with search alone, we got just 1 response, as the others were filtered out.
Let's dive into the service layer code that helped us achieve the use case of filtering while searching.
public SearchHits<Dept> getDepartmentWithDescAndCategoryFilter (String desc, String filter) {
StringQuery query = new StringQuery("{\"bool\": { \"must\": [{\"match_phrase\" : {\"desc\" : {\"query\" : " + "\""+ desc + "\", \"slop\" : 2} } } ], \"filter\": [ {\"term\": {\"category\": " +"\"" + filter + "\" }}]}}");
return elasticsearchOperations.search(query, Dept.class, IndexCoordinates.of("dept-index"));
}
This is the exact query in string format, which we can even copy and execute in the Kibana Dev Tools by enclosing it in an outermost query clause. Let's zoom into the exact query that we can execute on Elasticsearch.
GET dept-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "desc": {
              "query": "a dept",
              "slop": 2
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "category": "tech"
          }
        }
      ]
    }
  }
}
This Elasticsearch query performs a search on an index, hence the request line starting with GET. Next, let's dive into the individual terms.
These are just some of the query clauses we implemented and discussed here; the DSL offers a vast range of clauses to help us achieve the desired results.
Now we have seen searching and searching combined with filtering. Let's now take a look at how we can work with aggregations.
Aggregations allow us to summarise our data and answer basic questions like max, min, average value, etc. Elasticsearch supports three types of aggregations: metric aggregations (e.g. max, min, avg), bucket aggregations (which group documents into buckets, e.g. terms), and pipeline aggregations (which operate on the output of other aggregations).
Let's start by hitting the endpoint below:
http://localhost:8080/v1/departments/with-aggregator?desc=a dept
With this endpoint we want to aggregate the search results obtained for a simple match query for matching a dept in description field in our entire data set that consists of 2 non tech and 1 tech category department. We want to aggregate the results obtained into buckets based on the category they fall into and then calculate the max capacity of each bucket (i.e if there are n departments in a bucket, we want to return the maximum capacity among those n departments. We want to do this for all the buckets that we create as part of this aggregation).
Service layer code looks like this.
public void getDepartmentWithDescPlusAggregator (String desc) {
NativeSearchQuery query = new NativeSearchQueryBuilder()
.withQuery(new MatchPhraseQueryBuilder("desc", desc).slop(2))
.withAggregations(AggregationBuilders.terms("term-agg").field("category").subAggregation(new MaxAggregationBuilder("agg-maxCapacity").field("maxCapacity")))
.build();
SearchHits<Dept> ans = elasticsearchOperations.search(query, Dept.class, IndexCoordinates.of("dept-index"));
ElasticsearchAggregations aggregations = (ElasticsearchAggregations) ans.getAggregations();
List<Aggregation> aggregations1 = aggregations.aggregations().asList();
ParsedStringTerms t = (ParsedStringTerms) aggregations1.get(0);
System.out.println("Total buckets in term agg: " + t.getBuckets().size());
ParsedStringTerms.ParsedBucket bucket = (ParsedStringTerms.ParsedBucket) t.getBuckets().get(0);
ParsedStringTerms.ParsedBucket bucket1 = (ParsedStringTerms.ParsedBucket) t.getBuckets().get(1);
System.out.println("Bucket 1 total docs: " + bucket.getDocCount());
ParsedMax parsedMax = (ParsedMax) bucket.getAggregations().asList().get(0);
System.out.println("Bucket 1's max capacity is: " + parsedMax.getValue());
ParsedMax parsedMax2 = (ParsedMax) bucket1.getAggregations().asList().get(0);
System.out.println("Bucket 2 total doc count: " + bucket1.getDocCount());
System.out.println("Bucket 2's max capacity is: " + parsedMax2.getValue());
}
Here we use a native query to fetch all departments that have the string "a dept" in their description. We also specify a terms bucket aggregation on category to group results by department category; here the buckets are based on the two category values, tech and non tech. We then calculate the max capacity as a sub-aggregation within each bucket.
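The native query above corresponds roughly to the following DSL, which we can run directly in the Kibana Dev Tools to cross-check the results:

```json
GET dept-index/_search
{
  "query": {
    "match_phrase": { "desc": { "query": "a dept", "slop": 2 } }
  },
  "aggs": {
    "term-agg": {
      "terms": { "field": "category" },
      "aggs": {
        "agg-maxCapacity": { "max": { "field": "maxCapacity" } }
      }
    }
  }
}
```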
A point to note: ElasticsearchAggregations objects do not get directly serialised to JSON, hence we need to wrap them into custom serialisable objects after parsing.
In subsequent lines we parse the results so obtained and print these values. The console output looks like below.
Let's also have a look at what we have in elasticsearch to validate the above max values for each category.
As depicted, the red ovals represent the category of each department document we saved. For the non tech category, the green boxes mark the max capacity values of the two non tech departments; of 30 and 45, 45 is the maximum for this bucket, hence the sub-aggregation output. For the tech category there is just one value, 100, and hence that is the maximum shown in our console for bucket number 2.
In this article we learnt about Elasticsearch and how to create a Spring Boot application that uses Elasticsearch and performs a set of operations on the data housed in it. We also worked with Docker to spin up Elasticsearch and Kibana containers for easy data visualisation and interaction.
Sources of Knowledge