What is Elasticsearch and why is it so fast?
Originally posted at https://anujyadav.substack.com/p/high-speed-information-retrieval. I post often at https://anujyadav.substack.com/, follow along.
Elasticsearch
Elasticsearch is a powerful, distributed search and analytics engine designed for handling large volumes of structured and unstructured data. It’s built on top of Apache Lucene, which is a highly efficient search library. Elasticsearch is widely used for full-text search, log and event data analysis, real-time application monitoring, and more.
How Indexes Are Maintained
Indexes in Elasticsearch are the fundamental building blocks that store and organize data. An index is a collection of documents, and each document is a JSON object that contains fields.
1. Shards and Replicas:
An index is split into shards, and each shard is a self-contained Lucene index. Primary shards hold the data, while replica shards are copies of the primaries that provide redundancy and can also serve search requests.
2. Inverted Index:
Elasticsearch uses an inverted index to make text searchable. Rather than mapping each document to the terms it contains, an inverted index maps each term (a word, a number, and so on) to the list of documents that contain it. A full-text search then becomes a lookup of the query terms instead of a scan over every document.
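To make the term-to-document mapping concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Lucene stores its postings; the `tokenize` function is a simplified stand-in for a real analyzer, and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "Elasticsearch is built on Apache Lucene",
    2: "Lucene is a search library",
    3: "Logs and metrics are analyzed in real time",
}
index = build_inverted_index(docs)
print(index["lucene"])                           # [1, 2]
print(search_all_terms(index, "lucene search"))  # [2]
```

The search never touches document 3: it only reads the posting lists for "lucene" and "search" and intersects them.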
3. Segments:
Each shard consists of multiple segments, which are immutable data structures that contain a subset of the shard’s data. When new data is added, it is written to a new segment. Over time, segments are merged to optimize storage and search efficiency. Merging is a background process that consolidates smaller segments into larger ones, reducing the number of segments and improving search speed.
Patterns of Updating Indexes
Updating indexes in Elasticsearch can involve adding, deleting, or updating documents. Since segments are immutable, updates are handled as follows:
1. Document Addition:
When a new document is added, it first goes into an in-memory buffer. A periodic refresh turns the buffered documents into a new segment, which is later flushed to durable storage on disk. This process is fast because it never modifies existing segments.
2. Document Deletion:
When a document is deleted, Elasticsearch marks it as deleted in its segment, and searches skip marked documents. The document is not physically removed until the segment is merged. This lazy approach keeps deletes cheap, because no segment has to be rewritten at delete time.
3. Document Update:
Updating a document is internally treated as a combination of delete and add operations. The old document is marked as deleted, and a new document version is added to a new segment. Like deletion, the actual removal happens during segment merging.
4. Segment Merging:
As segments accumulate, Elasticsearch merges them in the background. Merging combines smaller segments into larger ones and drops the documents that were marked as deleted, which reclaims space and leaves fewer segments to search.
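The four patterns above can be sketched with a toy model. This is an illustration under simplifying assumptions, not Elasticsearch's implementation: the `Segment` and `Shard` classes are hypothetical, every write creates its own segment (real systems batch many documents per segment), and merging combines everything into a single segment.

```python
class Segment:
    """A batch of documents that is never modified, plus a set of delete markers."""

    def __init__(self, docs):
        self.docs = dict(docs)  # written once, never changed afterwards
        self.deleted = set()    # IDs marked as deleted

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}


class Shard:
    def __init__(self):
        self.segments = []

    def add(self, doc_id, doc):
        # A new document always lands in a new segment.
        self.segments.append(Segment({doc_id: doc}))

    def delete(self, doc_id):
        # Mark only; the stored document stays where it is.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id, doc):
        # Update = mark the old version as deleted + add the new version.
        self.delete(doc_id)
        self.add(doc_id, doc)

    def get(self, doc_id):
        for seg in reversed(self.segments):
            live = seg.live_docs()
            if doc_id in live:
                return live[doc_id]
        return None

    def merge(self):
        # Copy live documents into one new segment; marked documents are dropped here.
        merged = {}
        for seg in self.segments:
            merged.update(seg.live_docs())
        self.segments = [Segment(merged)]


shard = Shard()
shard.add(1, {"title": "a"})
shard.add(2, {"title": "b"})
shard.update(1, {"title": "a2"})
shard.delete(2)

segments_before = len(shard.segments)  # 3 segments, two of them holding only deleted docs
shard.merge()
segments_after = len(shard.segments)   # 1 segment holding only the live document
print(segments_before, segments_after, shard.get(1), shard.get(2))
```

Until `merge()` runs, the old version of document 1 and all of document 2 still occupy space; the merge is the only step that physically removes them.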
Why Elasticsearch Is So Fast
Several factors contribute to Elasticsearch’s speed:
1. Inverted Index:
The use of an inverted index allows Elasticsearch to quickly locate documents based on search terms. Instead of scanning all documents, it only needs to look up the terms in the index.
2. Distributed Architecture:
Elasticsearch’s distributed nature allows it to scale horizontally by adding more nodes to the cluster. Each node can host multiple shards, enabling parallel processing of search queries across the cluster, which significantly reduces search time.
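As a rough sketch of that parallelism, the example below routes documents to shards by hashing their ID, searches every shard concurrently, and merges the per-shard top hits. Several things here are assumptions for illustration: `zlib.crc32` stands in for the real routing hash, threads stand in for separate nodes, and the score is a plain term count rather than a real relevance score.

```python
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3


def shard_for(doc_id):
    """Pick a shard from a hash of the document ID (crc32 is a stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % NUM_SHARDS


def search_shard(shard_docs, term, k):
    """Return this shard's local top-k hits as (score, doc_id); score = term count."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search(shards, term, k=2):
    """Query all shards in parallel, then merge the partial results into a global top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# Hypothetical sample documents.
docs = {
    "a": "fast search fast results",
    "b": "search engine",
    "c": "distributed fast fast fast storage",
    "d": "unrelated text",
}
shards = [{} for _ in range(NUM_SHARDS)]
for doc_id, text in docs.items():
    shards[shard_for(doc_id)][doc_id] = text

top = search(shards, "fast")
print(top)  # [(3, 'c'), (2, 'a')]
```

Each shard only scans its own slice of the data, so adding shards (and nodes to host them) shrinks the work done per shard for the same query.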
3. Memory and Caching:
Elasticsearch uses memory efficiently by caching frequently accessed data, which reduces the need to read from disk. It relies on the operating system’s file system cache to keep hot segment files in memory, and on internal caches such as the node query cache and the shard request cache to reuse the results of frequently run filters and requests.
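4. Near Real-Time Search:
Elasticsearch supports near real-time search, meaning that new data becomes searchable shortly after being indexed, typically within about a second. New documents are first written to an in-memory buffer, and a periodic refresh (every second by default) turns that buffer into a new searchable segment. Flushing segments to durable storage on disk happens separately and less often, so search visibility does not have to wait for it.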
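A small sketch of why there is a short delay between indexing and searching: documents sit in a buffer that searches cannot see until a refresh turns it into a segment. The `NearRealTimeIndex` class is hypothetical and leaves out everything else (durability, the periodic timer, scoring).

```python
class NearRealTimeIndex:
    def __init__(self):
        self.buffer = []    # indexed, but not yet visible to searches
        self.segments = []  # searchable; each segment is an immutable tuple of docs

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Turn the buffered documents into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word):
        return [doc for seg in self.segments for doc in seg if word in doc.split()]


idx = NearRealTimeIndex()
idx.index("error in payment service")
before = idx.search("error")  # [] because the buffer is not searchable yet
idx.refresh()                 # in Elasticsearch this runs periodically in the background
after = idx.search("error")   # ['error in payment service']
print(before, after)
```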
5. Segment Optimization:
The segment merging process optimizes search performance by reducing the number of segments and eliminating deleted documents. This ensures that the index remains compact and fast to search through.
6. Efficient Query Execution:
Elasticsearch is optimized for executing complex queries, including full-text search, aggregations, and filtering. Techniques such as query rewriting, and running yes/no conditions as filters that skip relevance scoring and can be cached, help it avoid unnecessary work at query time.
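As an illustration of the filtering point, here is how a search request body might separate scored full-text matching from unscored filters using a `bool` query. The field names (`message`, `level`, `@timestamp`) and the `build_log_query` helper are hypothetical; the sketch only builds the JSON body and does not send it to a cluster.

```python
import json


def build_log_query(text, level, since):
    """Build a search body: `must` clauses are scored, `filter` clauses are not."""
    return {
        "query": {
            "bool": {
                # Full-text match: contributes to the relevance score.
                "must": [{"match": {"message": text}}],
                # Exact-value and range conditions: yes/no checks, no scoring.
                "filter": [
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        }
    }


body = build_log_query("connection timeout", "error", "now-1h")
print(json.dumps(body, indent=2))
```

Keeping exact-match and range conditions in `filter` means the engine only has to score the documents that survive the filters.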