What is Elasticsearch and why is it so fast?

Anuj Yadav

Technology Leader | Ixigo | Microsoft | MakeMyTrip | Expedia

Published: August 16, 2024

Originally posted at https://anujyadav.substack.com/p/high-speed-information-retrieval. I post often at https://anujyadav.substack.com/, follow along.

Elasticsearch

Elasticsearch is a powerful, distributed search and analytics engine designed for handling large volumes of structured and unstructured data. It’s built on top of Apache Lucene, which is a highly efficient search library. Elasticsearch is widely used for full-text search, log and event data analysis, real-time application monitoring, and more.

How Indexes Are Maintained

Indexes in Elasticsearch are the fundamental building blocks that store and organize data. An index is a collection of documents, and each document is a JSON object that contains fields.

1. Shards and Replicas:

Shards: An index is divided into one or more primary shards, which are distributed across the nodes in an Elasticsearch cluster. Each shard is a self-contained Lucene index that can be hosted on any node in the cluster.
Replicas: To ensure fault tolerance and high availability, each primary shard can have one or more replicas. A replica is a copy of its primary shard, is never placed on the same node as that primary, and can serve read requests, which helps distribute the load.
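
To make the shard idea concrete, here is a minimal sketch of how a document could be assigned to a primary shard by hashing a routing key (by default, the document ID) and taking the remainder modulo the number of primary shards. The function name `pick_shard`, the shard count, and the use of `zlib.crc32` are assumptions for illustration; Elasticsearch uses its own hash function and a more involved formula.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document ID) to a primary shard number."""
    # zlib.crc32 is a stand-in for a stable hash; it is NOT the hash Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


NUM_PRIMARY_SHARDS = 3  # hypothetical index setting

# The same ID always maps to the same shard, so reads know where to look.
placement = {
    doc_id: pick_shard(doc_id, NUM_PRIMARY_SHARDS)
    for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4"]
}
for doc_id, shard in placement.items():
    print(f"{doc_id} -> shard {shard}")
```

Because the result depends on the number of primary shards, changing that number would send existing documents to different shards. This is one reason the primary shard count is chosen when an index is created, while the replica count can be adjusted later.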

2. Inverted Index:

Elasticsearch uses an inverted index for full-text search. Rather than mapping each document to the terms it contains, an inverted index maps each term (for example, a word produced by analyzing a text field) to the list of documents that contain it. A search therefore looks up the term directly instead of scanning every document, which makes full-text queries extremely efficient.
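
The term-to-documents mapping can be sketched in a few lines. This is a toy illustration, not Lucene's on-disk format: the `tokenize` function is a deliberately naive stand-in for a real analyzer, and the sample documents are made up.

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    # Naive analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map every term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene is a search library",
    3: "Elasticsearch scales horizontally",
}
index = build_inverted_index(docs)

print(sorted(index["lucene"]))         # [1, 2]
print(sorted(index["elasticsearch"]))  # [1, 3]
```

Answering "which documents contain lucene?" is a single dictionary lookup, regardless of how many documents were indexed.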

3. Segments:

Each shard consists of multiple segments, which are immutable data structures that each contain a subset of the shard’s data. When new data is added, it is written to a new segment rather than into an existing one. A background merge process periodically consolidates smaller segments into larger ones, which reduces the number of segments a search has to visit and improves storage efficiency and search speed.

Patterns of Updating Indexes

Updating indexes in Elasticsearch can involve adding, deleting, or updating documents. Since segments are immutable, updates are handled as follows:

1. Document Addition:

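When a new document is added, it is first held in an in-memory indexing buffer and recorded in the transaction log (translog) for durability. A periodic refresh writes the buffered documents into a new segment and makes them searchable; a later flush commits the segments to disk. This process is fast because it never needs to modify existing segments.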

2. Document Deletion:

When a document is deleted, Elasticsearch marks the document as deleted in the segment. The document is not physically removed until the segment is merged. This lazy deletion approach keeps deletes cheap: searches skip documents that are marked as deleted, and the space is reclaimed later, during a merge.

3. Document Update:

Updating a document is internally treated as a combination of delete and add operations. The old document is marked as deleted, and a new document version is added to a new segment. Like deletion, the actual removal happens during segment merging.

4. Segment Merging:

As segments accumulate, Elasticsearch merges them in the background. This process compacts smaller segments into larger ones, physically removes documents that were marked as deleted, and optimizes the index for faster search operations.
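
The four patterns above can be sketched with a toy in-memory model. The class and method names (`Segment`, `ToyShard`, and so on) are hypothetical and chosen for illustration; a real engine batches many documents per segment and merges selectively, whereas this sketch creates one segment per document and merges everything at once.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    docs: dict                                 # doc_id -> document; never modified after creation
    deleted: set = field(default_factory=set)  # tombstones: IDs marked as deleted


class ToyShard:
    def __init__(self):
        self.segments = []

    def add(self, doc_id, doc):
        # New data always goes into a new segment; existing segments are untouched.
        self.segments.append(Segment(docs={doc_id: doc}))

    def delete(self, doc_id):
        # Only mark the document as deleted; its data stays in the segment.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id, doc):
        # An update is a delete of the old version plus an add of the new one.
        self.delete(doc_id)
        self.add(doc_id, doc)

    def get(self, doc_id):
        # Reads skip tombstoned documents.
        for seg in self.segments:
            if doc_id in seg.docs and doc_id not in seg.deleted:
                return seg.docs[doc_id]
        return None

    def merge(self):
        # Copy only live documents into one new segment; deleted data is dropped here.
        live = {}
        for seg in self.segments:
            for doc_id, doc in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = doc
        self.segments = [Segment(docs=live)]


shard = ToyShard()
shard.add(1, {"title": "v1"})
shard.add(2, {"title": "other"})
shard.update(1, {"title": "v2"})
shard.delete(2)
print(len(shard.segments), shard.get(1))  # 3 {'title': 'v2'}

shard.merge()
print(len(shard.segments), shard.get(2))  # 1 None
```

Before the merge, the old version of document 1 and all of document 2 still occupy space in their segments; only the merge reclaims it.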

Why Elasticsearch Is So Fast

Several factors contribute to Elasticsearch’s speed:

1. Inverted Index:

The use of an inverted index allows Elasticsearch to quickly locate documents based on search terms. Instead of scanning all documents, it only needs to look up the terms in the index.
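
For a query with several terms that must all match, the engine can intersect the per-term document lists rather than scan documents. The sketch below assumes each term's list of document IDs (its posting list) is kept sorted and walks two lists with a two-pointer merge. The sample postings are made up, and Lucene's real implementation uses more sophisticated structures to skip ahead.

```python
def intersect(postings_a: list[int], postings_b: list[int]) -> list[int]:
    """Return the doc IDs present in both sorted posting lists."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result


# Hypothetical posting lists for two terms.
postings = {"fast": [1, 4, 7, 9], "search": [2, 4, 9, 12]}
print(intersect(postings["fast"], postings["search"]))  # [4, 9]
```

The work is proportional to the length of the two posting lists, not to the total number of documents in the index.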

2. Distributed Architecture:

Elasticsearch’s distributed nature allows it to scale horizontally by adding more nodes to the cluster. Each node can host multiple shards, enabling parallel processing of search queries across the cluster, which significantly reduces search time.
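
The parallel search described above follows a scatter-gather pattern: every shard computes its own top results, and a coordinating step merges them. The sketch below simulates this with threads and made-up `(score, doc_id)` pairs; the names `SHARDS`, `search_shard`, and `search` are hypothetical, and real shards live on different nodes rather than in one process.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard hits as (score, doc_id) pairs.
SHARDS = [
    [(0.9, "a"), (0.2, "b"), (0.5, "c")],
    [(0.7, "d"), (0.95, "e")],
    [(0.1, "f"), (0.6, "g"), (0.3, "h")],
]


def search_shard(hits, k):
    # Scatter phase: each shard returns only its local top k.
    return heapq.nlargest(k, hits)


def search(shards, k):
    with ThreadPoolExecutor() as pool:
        per_shard = list(pool.map(lambda hits: search_shard(hits, k), shards))
    # Gather phase: merge the partial results into the global top k.
    return heapq.nlargest(k, (hit for partial in per_shard for hit in partial))


print(search(SHARDS, 3))  # [(0.95, 'e'), (0.9, 'a'), (0.7, 'd')]
```

Each shard only has to rank its own slice of the data, and the coordinator merges at most k hits per shard.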

3. Memory and Caching:

Elasticsearch reduces disk reads by caching frequently accessed data at several levels. The operating system’s file system cache keeps recently used segment files in memory, while internal caches such as the node query cache and the shard request cache store the results of frequently used filters and requests, so that commonly searched terms and queries are quickly retrievable.

4. Near Real-Time Search:

Elasticsearch supports near real-time search, meaning that new data becomes searchable shortly after being indexed, typically within about a second. This is possible because new documents are first written to an in-memory buffer and then made searchable by a periodic refresh, without waiting for a full commit to disk.
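
The gap between "indexed" and "searchable" can be sketched with a toy class. The class name and its methods are hypothetical; in Elasticsearch, the step that makes buffered documents visible is called a refresh, and by default it runs about once per second.

```python
class ToyNearRealTimeIndex:
    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, doc: str):
        # Writes land in the in-memory buffer first.
        self._buffer.append(doc)

    def refresh(self):
        # Make buffered documents searchable, as a refresh does by opening a new segment.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term: str):
        return [d for d in self._searchable if term in d.lower().split()]


idx = ToyNearRealTimeIndex()
idx.index("near real-time search")
print(idx.search("search"))  # []

idx.refresh()
print(idx.search("search"))  # ['near real-time search']
```

The document exists after `index()` returns, but searches only see it after the next refresh, which is why the behavior is called near real-time rather than real-time.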

5. Segment Optimization:

The segment merging process optimizes search performance by reducing the number of segments and eliminating deleted documents. This ensures that the index remains compact and fast to search through.

6. Efficient Query Execution:

Elasticsearch is optimized for executing complex queries, including full-text search, aggregations, and filtering. For example, clauses that only include or exclude documents can run in filter context, which skips relevance scoring and makes their results eligible for caching, and queries are rewritten into simpler forms before they are executed.
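
As an illustration of combining scoring and filtering, here is a sketch of a `bool` query body built as a Python dictionary. The field names (`title`, `status`, `published_at`) and their values are assumptions; the `bool`, `must`, `filter`, `match`, `term`, and `range` keywords are part of the Elasticsearch Query DSL. A body like this would be sent to an index's `_search` endpoint.

```python
import json

query = {
    "query": {
        "bool": {
            # Scored clause: contributes to the relevance ranking.
            "must": [{"match": {"title": "distributed search"}}],
            # Filter clauses: yes/no matching only, no scoring, cacheable.
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"published_at": {"gte": "2024-01-01"}}},
            ],
        }
    }
}

print(json.dumps(query, indent=2))
```

Keeping exact-match and range conditions in `filter` rather than `must` lets the engine skip score computation for those clauses.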
