TT#11: "Tech Talk on Elasticsearch"

Elasticsearch: The Definitive Guide to Data Navigation

Unraveling the Mystery: Elasticsearch, a sophisticated and distributed search engine, is designed to navigate and analyze vast volumes of data at breakneck speed. Initially crafted for full-text search capabilities, it has evolved into a versatile tool for diverse data analytics, spanning from log files to intricate business metrics.

Analogous Understanding: Picture Elasticsearch as a seasoned navigator in the boundless oceans of data. Much like a navigator charting an efficient course, Elasticsearch adeptly traverses and retrieves valuable information from extensive datasets, providing insights akin to a seasoned guide leading you through unexplored territories.

Technical Insights:

  1. Distributed Nature: Elasticsearch thrives in a distributed environment, where data is dispersed across multiple nodes, ensuring both speed and reliability.
  2. JSON-based Documents: Elasticsearch stores data as JSON documents. Each document is a collection of key-value pairs, allowing for flexibility and straightforward representation of complex data structures.
  3. Indexing: Much like an index in a book, Elasticsearch creates an index to organize and expedite the retrieval of data. It employs a refined indexing mechanism for efficient search operations.
  4. Full-Text Search: Originally designed for full-text search, Elasticsearch excels in matching and retrieving documents based on textual content. It transcends simple keyword matching, incorporating advanced search features.
  5. Scalability: Elasticsearch is highly scalable, enabling organizations to seamlessly expand their data infrastructure as their needs grow. It adapts to increasing data volumes without compromising performance.
  6. Real-time Analytics: Elasticsearch supports near-real-time analytics, making it suitable for applications where insights need to be derived promptly from constantly evolving data.
  7. Query DSL: It employs a robust and expressive query language known as the Query DSL (Domain-Specific Language). This language allows users to formulate complex queries to extract specific information.
  8. Aggregations: Elasticsearch provides aggregations, enabling users to summarize and analyze data, similar to how statistical analyses are performed on datasets.
  9. Open-Source Foundation: Elasticsearch is an open-source project, fostering a collaborative community that continually enhances its features and capabilities.
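
To make the aggregation idea a little more concrete, here is a minimal in-memory sketch of what a terms aggregation computes: one count per distinct field value. It does not call Elasticsearch at all; the documents, the "level" field, and the class name are hypothetical and exist only for this illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class TermsAggregationSketch {

    // Counts documents per distinct "level" value, similar to the
    // bucket counts a terms aggregation would return for that field.
    static Map<String, Long> countByLevel(List<Map<String, String>> docs) {
        return docs.stream()
                .collect(Collectors.groupingBy(
                        d -> d.get("level"), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical log documents, each a flat set of key-value pairs.
        List<Map<String, String>> docs = List.of(
                Map.of("level", "ERROR", "message", "disk full"),
                Map.of("level", "INFO", "message", "started"),
                Map.of("level", "ERROR", "message", "timeout"));
        System.out.println(countByLevel(docs)); // prints {ERROR=2, INFO=1}
    }
}
```

In a real cluster this grouping happens on each shard and the partial counts are merged, which is why aggregations scale with the data.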

In essence, Elasticsearch is the virtuoso navigator of data realms, adept at efficiently steering through massive datasets, empowering businesses and users with the insights needed to make informed decisions in the dynamic landscape of information.


How does Elasticsearch work internally?

Data Arrival: Documents, representing products (websites, logs, articles), stream into Elasticsearch like crates arriving at a bustling marketplace.

Indexing:

  • Prepping the Goods: Documents are unpacked, analyzed, and transformed into searchable units like keywords and phrases - akin to merchants sorting and categorizing their wares.
  • Building the Map: An intricate map is constructed, listing every keyword and the “crates” (documents) it appears in - much like a meticulous market map revealing which stalls sell specific items.
  • Sharding for Speed: The index is divided into smaller, manageable sections (shards), each holding the map for its own share of the documents, so lookups can run in parallel - similar to dividing the market into specialized districts.
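
As a rough illustration of how a document ends up on a particular shard, the sketch below hashes the document ID and takes the remainder by the number of primary shards. This is a simplification under stated assumptions: Elasticsearch hashes the routing value (the document ID by default) with Murmur3, whereas this sketch substitutes String.hashCode(), and the class and method names are hypothetical.

```java
final class ShardRoutingSketch {

    // Picks a shard for a document ID. Assumption: String.hashCode() stands in
    // for the Murmur3 hash of the routing value that Elasticsearch uses.
    static int shardFor(String documentId, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(documentId.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 3;
        for (String id : new String[] {"book-1", "book-2", "book-3", "book-4"}) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Because the shard is derived from the ID and the shard count, the same document always lands on the same shard, which is also why the number of primary shards cannot simply be changed on an existing index.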

Searching:

  • Customer Inquiry: A query arrives, searching for a specific item (think a customer asking for “red shoes”).
  • Navigating the Map: Elasticsearch swiftly consults the map, locating all “crates” containing “red” and “shoes” across the marketplace (shards).
  • Assessing Value: Each matching document is then scored for relevance. By default this uses the BM25 algorithm, which weighs how often the query terms appear in the document, how rare they are across the index, and how long the document is - akin to a merchant judging how closely each pair of shoes fits the request. Business signals such as price or popularity only count if the query explicitly asks for them.
  • Presenting the Best: Finally, the top contenders are presented to the customer, ensuring they find the perfect pair (the most relevant documents returned).
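
For readers curious about what relevance scoring involves, here is a minimal sketch of the textbook BM25 formula for a single term, using the commonly cited defaults k1 = 1.2 and b = 0.75. It is an illustration, not Lucene's implementation (which differs in constant factors and in how document length is stored); the class and parameter names are hypothetical.

```java
final class Bm25Sketch {

    static final double K1 = 1.2;  // controls how quickly repeated terms stop adding score
    static final double B = 0.75;  // controls how strongly long documents are penalized

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log(1.0 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    // Textbook BM25 score of one term for one document.
    static double score(int termFreq, int docLength, double avgDocLength,
                        long totalDocs, long docsWithTerm) {
        double lengthNorm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return idf(totalDocs, docsWithTerm) * (termFreq * (K1 + 1.0)) / (termFreq + lengthNorm);
    }

    public static void main(String[] args) {
        // Same document length and corpus; only the term frequency differs.
        System.out.println(score(1, 10, 10.0, 100, 5));
        System.out.println(score(2, 10, 10.0, 100, 5));
    }
}
```

The score for a multi-term query such as “red shoes” is the sum of the per-term scores, so a document containing both terms generally outranks one containing only one of them.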

Key Features:

  • Inverted Index: The intricate map listing keywords to documents - the core technology enabling rapid search.
  • Sharding & Replication: Dividing the information for faster searching and ensuring redundancy in case of “market disruptions.”
  • Scalability: As more products (data) arrive, the market (Elasticsearch cluster) can expand by adding more stalls (nodes).

Internal Workings:

  • Lucene Engine: At the heart lies a powerful search engine library called Lucene, responsible for indexing and scoring documents.
  • Distributed Architecture: Multiple nodes work together to handle requests and store data, ensuring smooth operation even when the marketplace is bustling.
  • Flexible Data: Documents can hold diverse information, making Elasticsearch adaptable to various needs.

In essence, Elasticsearch is the market wizard of data realms, adeptly navigating through massive datasets, and empowering businesses and users with the insights needed to make informed decisions in the dynamic landscape of information.


Inverted Index Unveiled: Charting the Word Map

Explanation: An inverted index is a savvy data structure utilized by search engines like Elasticsearch to accelerate the process of locating information within a colossal dataset. It flips the conventional approach on its head, mapping each unique word to a list of documents where that word appears, rather than listing documents and their corresponding words. Picture it as a potent roadmap that directs you to the precise locations of words within an extensive library of documents.

Analogy: Envision yourself as an intrepid explorer in a vast library teeming with books on diverse topics. Traditionally, you might have a list of books with their respective contents. However, an inverted index would be akin to having a catalog at the end of each aisle, alphabetically listing every unique word and pinpointing exactly which books (documents) contain that word. It’s as if the library itself morphs into a guide, streamlining your quest for specific information.

Technical Details:

  1. Mapping Words to Documents: In an inverted index, each unique word in the dataset morphs into a key. The corresponding value is a list of documents (or locations) where that word appears. It’s akin to the index at the back of a book, where words (keys) guide you to the pages (documents) that mention them.
  2. Fast Retrieval: When you search for a word, the inverted index facilitates lightning-fast retrieval. Instead of scanning every document, you consult the index to obtain a precise list of documents containing the word. This efficiency mirrors using a well-organized map to swiftly locate destinations.
  3. Flexibility in Searches: An inverted index enables versatile searches. You can hunt for specific words, phrases, or even intricate combinations. It’s like possessing a versatile tool that empowers you to find not only individual books but also entire sections or themes within the library.
  4. Efficient for Large Datasets: In scenarios with vast amounts of data, an inverted index shines. It drastically reduces the search space, rendering it manageable even in expansive collections. It’s like having a guide specifically tailored for navigating extensive bookshelves.
  5. Support for Partial Matches: An inverted index matches whole terms, but because its term dictionary is sorted, it can also answer prefix lookups efficiently; combined with techniques such as n-grams or fuzzy queries, it can find documents even when you only know a fragment of the word. This feature is like a friendly guide who comprehends your intent, even when you’re not entirely certain.
  6. Update Flexibility: When new documents are added or existing ones change, the index adapts. Under the hood, Lucene writes new immutable segments rather than editing existing ones and merges them in the background, so changes become searchable in near real time (after a refresh, which happens every second by default). It’s like possessing a map that is reissued regularly as new landmarks are erected or existing ones undergo renovations.

In essence, an inverted index is your reliable guide in the library of data, making the search for specific words or information a breeze. It optimizes the process, offering speed, flexibility, and adaptability, much like a well-crafted map that transforms exploration into an efficient and delightful journey.
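
The point about partial matches can be sketched with a sorted term dictionary: all terms sharing a prefix sit next to each other, so a prefix lookup is a range scan rather than a full scan. The snippet below is a minimal illustration using a TreeMap, not how Lucene stores its dictionary internally; the class name and the sample terms are hypothetical.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class PrefixLookupSketch {

    // Collects the IDs of all documents containing at least one term that starts with the prefix.
    static SortedSet<Integer> docsWithPrefix(NavigableMap<String, SortedSet<Integer>> index,
                                             String prefix) {
        SortedSet<Integer> result = new TreeSet<>();
        // In a sorted dictionary, every term in [prefix, prefix + '\uffff') starts with the prefix.
        String upperBound = prefix + Character.MAX_VALUE;
        for (SortedSet<Integer> postings : index.subMap(prefix, true, upperBound, false).values()) {
            result.addAll(postings);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, SortedSet<Integer>> index = new TreeMap<>();
        index.put("analysis", new TreeSet<>(List.of(1)));
        index.put("analytics", new TreeSet<>(List.of(2)));
        index.put("data", new TreeSet<>(List.of(1, 2)));
        System.out.println(docsWithPrefix(index, "anal")); // prints [1, 2]
        System.out.println(docsWithPrefix(index, "dat"));  // prints [1, 2]
        System.out.println(docsWithPrefix(index, "xyz"));  // prints []
    }
}
```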

Below is a detailed example of an inverted index:

Inverted Index Example with Two Documents: Unveiling Elasticsearch's Power

Documents:

Let's consider two simple documents:

Document ID: 1
Text: "Elasticsearch is a powerful search engine that enables efficient data retrieval and analysis."

Document ID: 2
Text: "Data analytics with Elasticsearch provides valuable insights for informed decision-making."        

Inverted Index: Explained

Now, let's delve into how the inverted index works with these two documents.

  1. Tokenization: Tokenization breaks down the text into individual terms or tokens.
     Document 1 Tokens: "Elasticsearch", "is", "a", "powerful", "search", "engine", "that", "enables", "efficient", "data", "retrieval", "and", "analysis"
     Document 2 Tokens: "Data", "analytics", "with", "Elasticsearch", "provides", "valuable", "insights", "for", "informed", "decision-making"
  2. Lowercasing: Convert all tokens to lowercase for case-insensitive searches.
     Document 1 Tokens (Lowercased): "elasticsearch", "is", "a", "powerful", "search", "engine", "that", "enables", "efficient", "data", "retrieval", "and", "analysis"
     Document 2 Tokens (Lowercased): "data", "analytics", "with", "elasticsearch", "provides", "valuable", "insights", "for", "informed", "decision-making"
  3. Stop Words Removal: Remove common words that don't contribute significantly to search relevance. (Elasticsearch's default standard analyzer does not remove stop words unless configured to; the step is shown here for illustration.)
     Document 1 Tokens (Stop Words Removed): "elasticsearch", "powerful", "search", "engine", "enables", "efficient", "data", "retrieval", "analysis"
     Document 2 Tokens (Stop Words Removed): "data", "analytics", "elasticsearch", "provides", "valuable", "insights", "informed", "decision-making"
  4. Indexing: Index each unique token with a reference to the document ID.

{
  "elasticsearch": [1, 2],
  "powerful": [1],
  "search": [1],
  "engine": [1],
  "enables": [1],
  "efficient": [1],
  "data": [1, 2],
  "retrieval": [1],
  "analysis": [1],
  "analytics": [2],
  "provides": [2],
  "valuable": [2],
  "insights": [2],
  "informed": [2],
  "decision-making": [2]
}
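
The four steps can be sketched in a few lines of plain Java. This is a minimal illustration rather than Elasticsearch's actual analysis chain: the stop-word list is chosen just for these two documents, hyphenated words are kept whole to mirror the example (a real analyzer may split them), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexBuilderSketch {

    // Stop-word list chosen for this example only.
    static final Set<String> STOP_WORDS = Set.of("is", "a", "that", "and", "with", "for");

    // Steps 1-3: tokenize, lowercase, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter, digit, or hyphen.
        for (String raw : text.split("[^\\p{Alnum}-]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 4: map each unique token to the sorted set of document IDs containing it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        docs.forEach((id, text) -> {
            for (String token : analyze(text)) {
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        });
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(Map.of(
                1, "Elasticsearch is a powerful search engine that enables efficient data retrieval and analysis.",
                2, "Data analytics with Elasticsearch provides valuable insights for informed decision-making."));
        index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```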

Search Operation:

Now, if a user performs a search like "data insights," Elasticsearch can efficiently locate the relevant documents:

  1. The query is analyzed into the terms "data" and "insights", and each term is looked up in the index.
  2. The term "data" points to Document IDs 1 and 2, while "insights" points to Document ID 2 only.

As a result, Elasticsearch swiftly retrieves both documents, since a match query by default returns documents containing any of the terms, and ranks Document 2 first because it contains both. This shows how the inverted index facilitates rapid and accurate search operations across multiple documents. This fundamental mechanism is at the core of Elasticsearch's capability to handle diverse datasets with speed and precision.
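
The lookup for "data insights" can be sketched as follows. This is a simplified illustration: documents are ranked here by how many query terms they contain, which only stands in for Elasticsearch's real relevance scoring, and the class name and the hard-coded index excerpt are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PostingListSearchSketch {

    // Returns the IDs of documents matching any query term, ordered by the
    // number of matched terms (highest first); ties keep ascending ID order.
    static List<Integer> search(Map<String, List<Integer>> index, String... terms) {
        Map<Integer, Integer> matchCounts = new TreeMap<>();
        for (String term : terms) {
            for (int docId : index.getOrDefault(term, List.of())) {
                matchCounts.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(matchCounts.keySet());
        ranked.sort(Comparator.comparing((Integer id) -> matchCounts.get(id)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Excerpt of the inverted index for the two example documents.
        Map<String, List<Integer>> index = Map.of(
                "data", List.of(1, 2),
                "insights", List.of(2),
                "search", List.of(1));
        System.out.println(search(index, "data", "insights")); // prints [2, 1]
    }
}
```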


Elasticsearch: Powering Giants in the Digital Landscape

Technology Titans:

  • Netflix: Powers their search and recommendation algorithms, catering to millions of subscribers.
  • eBay: Drives their product search and discovery features, enhancing the shopping experience.
  • Uber: Manages real-time location tracking and ride dispatch across their global network.
  • Slack: Enables efficient search within their communication platform, fostering seamless collaboration.
  • Spotify: Orchestrates their music discovery and personalized recommendations, tuning into user preferences.

E-commerce Empires:

  • Shopify: Powers product search and personalized recommendations for millions of online stores.
  • Instacart: Streamlines efficient grocery delivery by optimizing routes and processing orders.
  • Zalando: Delivers personalized fashion recommendations and product filtering, tailoring to individual styles.

Media & Entertainment Moguls:

  • The New York Times: Powers their archive search and content discovery, connecting readers with stories that matter.
  • The Guardian: Enables readers to navigate their vast news repository, broadening perspectives.
  • BBC: Handles content search and recommendation across their diverse platforms, catering to varied interests.

Travel & Hospitality Hotshots:

  • Airbnb: Powers their accommodation search and recommendation engine, making travel feel like home.
  • Expedia Group: Enables efficient hotel and flight search for their global user base, simplifying travel planning.
  • Marriott International: Powers their hotel information search and guest experience personalization, enhancing the hospitality experience.

Finance & Banking Behemoths:

  • Capital One: Enables fraud detection and risk management through comprehensive data analysis, safeguarding customer interests.
  • HSBC: Powers their internal search for documents and customer information, streamlining banking operations.
  • Barclays: Utilizes Elasticsearch for security analytics and anomaly detection, fortifying financial security.

These are just a few examples, and the list continues. Elasticsearch’s versatility and scalability make it a sought-after tool for organizations of all sizes across diverse industries, truly embodying the spirit of digital transformation.


Here is a list of some commonly used queries in Elasticsearch:

Match Query:

Description: Performs a full-text search on the analyzed text.

Example:

{
  "match": {
    "field_name": "search text"
  }
}

Match Phrase Query:

Description: Matches the entire input phrase, preserving the order of terms.

Example:

{
  "match_phrase": {
    "field_name": "search text"
  }
}
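
The difference between a match query and a match_phrase query comes down to whether term order and adjacency matter. The sketch below shows that contrast on already-analyzed tokens. It is only an illustration with hypothetical names: Lucene answers phrase queries from term positions stored in the index, not by scanning token lists like this.

```java
import java.util.Collections;
import java.util.List;

final class PhraseMatchSketch {

    // match-style check: true if the document contains any of the query terms, in any order.
    static boolean matchesAnyTerm(List<String> docTokens, List<String> queryTokens) {
        return queryTokens.stream().anyMatch(docTokens::contains);
    }

    // match_phrase-style check with slop 0: the terms must appear consecutively and in order.
    static boolean containsPhrase(List<String> docTokens, List<String> phraseTokens) {
        if (phraseTokens.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("powerful", "search", "engine", "for", "text");
        System.out.println(matchesAnyTerm(doc, List.of("text", "search")));   // prints true
        System.out.println(containsPhrase(doc, List.of("search", "engine"))); // prints true
        System.out.println(containsPhrase(doc, List.of("search", "text")));   // prints false
    }
}
```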

Match Phrase Prefix Query:

Description: Matches the terms of the phrase in order, treating the last term as a prefix, so "search tex" matches "search text".

Example:

{
  "match_phrase_prefix": {
    "field_name": "search tex"
  }
}

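Bool Query:

Description: Combines multiple query clauses with boolean logic: must (AND), should (OR), and must_not (NOT). This is distinct from the match_bool_prefix query, which analyzes its input and treats the final term as a prefix.

Example:

{
  "bool": {
    "must": [
      { "match": { "field1": "value1" } },
      { "match": { "field2": "value2" } }
    ]
  }
}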
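
Conceptually, a bool query is set algebra over the document lists each clause produces: must clauses intersect, must_not clauses subtract. The sketch below illustrates only that idea on plain sets of document IDs; scoring and should clauses are left out, and all names are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class BoolQuerySketch {

    // Keeps documents present in every "must" set and absent from every "mustNot" set.
    static SortedSet<Integer> evaluate(List<Set<Integer>> must, List<Set<Integer>> mustNot) {
        if (must.isEmpty()) {
            return new TreeSet<>(); // this sketch only handles queries with at least one must clause
        }
        SortedSet<Integer> result = new TreeSet<>(must.get(0));
        for (Set<Integer> postings : must) {
            result.retainAll(postings);   // AND: intersection
        }
        for (Set<Integer> postings : mustNot) {
            result.removeAll(postings);   // NOT: difference
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> field1Matches = Set.of(1, 2, 3);
        Set<Integer> field2Matches = Set.of(2, 3, 4);
        Set<Integer> excluded = Set.of(3);
        System.out.println(evaluate(List.of(field1Matches, field2Matches), List.of(excluded))); // prints [2]
    }
}
```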

Match All Query:

Description: Matches all documents in the index.

Example:

{
  "match_all": {}
}        

Match None Query:

Description: Matches no documents in the index.

Example:

{
  "match_none": {}
}        

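Common Terms Query:

Description: Gives less weight to very common terms in the input text, much like stop words, without dropping them entirely. This query was deprecated in Elasticsearch 7.3 and removed in 8.0, so it applies only to older clusters.

Example:

{
  "common": {
    "field_name": {
      "query": "search text",
      "cutoff_frequency": 0.001
    }
  }
}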

Fuzzy Match Query:

Description: Allows approximate matching with a specified fuzziness level.

Example:

{
  "match": {
    "field_name": {
      "query": "search text",
      "fuzziness": "AUTO"
    }
  }
}
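
To show what a fuzziness level means in practice, here is a minimal sketch of edit-distance matching with the length-based rule that "AUTO" applies: terms of up to 2 characters must match exactly, terms of 3 to 5 characters allow one edit, and longer terms allow two. The sketch uses plain Levenshtein distance, whereas Lucene by default also counts a transposition of two adjacent characters as a single edit; the class and method names are hypothetical.

```java
final class FuzzinessSketch {

    // Number of edits "AUTO" permits for a term of the given length.
    static int allowedEdits(String term) {
        int length = term.length();
        if (length <= 2) {
            return 0;
        }
        return length <= 5 ? 1 : 2;
    }

    // Classic Levenshtein distance (insertions, deletions, substitutions) using two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if the indexed term is within the edit budget of the query term.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm) {
        return editDistance(queryTerm, indexedTerm) <= allowedEdits(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // prints 3
        System.out.println(fuzzyMatches("serch", "search"));   // prints true
        System.out.println(fuzzyMatches("ab", "ac"));          // prints false
    }
}
```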

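Term Query:

Description: Matches documents that contain an exact term in a specified field. The term is not analyzed, so this query is best suited to keyword, numeric, or date fields rather than analyzed text fields.

Example:

{
  "term": {
    "field_name": "exact_term"
  }
}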

Terms Query:

Description: Matches documents that contain any of the specified terms in a field.

Example:

{
  "terms": {
    "field_name": ["term1", "term2"]
  }
}

These queries cover a range of search scenarios, allowing you to tailor your Elasticsearch queries based on your specific use case.


Below is the code snippet with a detailed explanation.

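import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.elasticsearch.core.bulk.BulkResponseItem;
import co.elastic.clients.elasticsearch.core.search.Hit;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Book is the application's own document class (not shown here).
@Service
@Slf4j
public class BookService {

    @Autowired
    private ElasticsearchClient elasticsearchClient;

    private static final String BOOK_INDEX = "books";

    // Indexes all given books in a single bulk request.
    public void saveBook(List<Book> bookList) throws IOException {
        if (bookList.isEmpty()) {
            return; // nothing to send; a bulk request needs at least one operation
        }

        BulkRequest.Builder br = new BulkRequest.Builder();

        for (Book book : bookList) {
            // "create" fails for an item if a document with the same ID already exists.
            br.operations(op -> op
                    .create(idx -> idx
                            .index(BOOK_INDEX)
                            .id(book.getId())
                            .document(book)
                    )
            );
        }

        BulkResponse result = elasticsearchClient.bulk(br.build());

        // A bulk request can partially succeed, so inspect each item.
        if (result.errors()) {
            log.error("Bulk had errors");
            for (BulkResponseItem item : result.items()) {
                if (item.error() != null) {
                    log.error(item.error().reason());
                }
            }
        }
    }

    // Full-text search: the query text is analyzed and any matching term counts.
    public List<Book> searchBooksByMatchQuery(String field, String queryText) throws IOException {
        SearchResponse<Book> response = elasticsearchClient.search(searchRequestBuilder -> searchRequestBuilder
                        .index(BOOK_INDEX)
                        .query(q -> q
                                .match(t -> t
                                        .field(field)
                                        .query(queryText)
                                )
                        ),
                Book.class
        );

        return convertHitsToBooks(response.hits().hits());
    }

    // Phrase search: the analyzed terms must appear in order.
    public List<Book> searchBookByMatchPhraseQuery(String field, String queryText) throws IOException {
        SearchResponse<Book> response = elasticsearchClient.search(searchRequestBuilder -> searchRequestBuilder
                        .index(BOOK_INDEX)
                        .query(q -> q
                                .matchPhrase(t -> t
                                        .field(field)
                                        .query(queryText)
                                )
                        ),
                Book.class
        );

        return convertHitsToBooks(response.hits().hits());
    }

    // Term-level fuzzy search: queryText is compared to indexed terms without being analyzed.
    public List<Book> searchBookByFuzzyMatchQuery(String field, String queryText) throws IOException {
        SearchResponse<Book> response = elasticsearchClient.search(searchRequestBuilder -> searchRequestBuilder
                        .index(BOOK_INDEX)
                        .query(q -> q
                                .fuzzy(t -> t
                                        .field(field)
                                        .fuzziness("AUTO")
                                        .value(queryText)
                                )
                        ),
                Book.class
        );

        return convertHitsToBooks(response.hits().hits());
    }

    // Extracts the deserialized documents from the search hits.
    private List<Book> convertHitsToBooks(List<Hit<Book>> hitBookList) {
        List<Book> books = new ArrayList<>();
        for (Hit<Book> hit : hitBookList) {
            if (hit.source() != null) { // source can be absent, e.g. when _source is disabled
                books.add(hit.source());
            }
        }
        return books;
    }
}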

Explanation:

  1. BulkRequest: This is a type of request in Elasticsearch that allows for multiple operations to be performed in a single request. This is useful when you want to index, update, or delete many documents in a single operation.
  2. BulkRequest.Builder: This is a builder pattern in Java, which is used to build complex objects step by step. It provides a clear and flexible way to construct an object. In this case, it’s used to build a BulkRequest.
  3. Operations: These are the actions that will be performed on the documents. In this case, the operation is create, which indexes a new document and fails for that item if a document with the same ID already exists. The index method sets the index where the document will be stored, the id method sets the ID of the document, and the document method sets the content of the document.
  4. Elasticsearch Client: This is the client that communicates with the Elasticsearch server. It executes the BulkRequest and returns a BulkResponse.
  5. BulkResponse: This is the response from the Elasticsearch server. It contains information about the executed operations, such as whether they were successful or not.
  6. Error Handling: If there are any errors during the execution of the BulkRequest, they are logged for debugging purposes.
  7. Match query: The search method is called on the elasticsearchClient with a builder lambda that specifies the index to search (BOOK_INDEX) and the query to execute. The query is a match query, the standard query for full-text search, which also offers options for fuzzy matching. The field method specifies the field in the document to search, and the query method specifies the text to search for. The second argument, Book.class, tells the client which type to deserialize each hit into.
  8. Match phrase query: The match_phrase query is a full-text query used when you want to find documents containing a particular phrase. The match query analyzes the query text and builds a boolean query from the resulting terms, so they may appear anywhere and in any order. The match_phrase query also analyzes the query text but builds a phrase query, so the terms must appear in the same order and next to each other, unless the optional slop parameter (0 by default) allows some distance between them.
  9. Fuzzy query: The fuzzy query matches documents using similarity based on Levenshtein edit distance. The field method specifies the field in the document to search, the fuzziness method sets the level of fuzziness to use, and the value method specifies the term to search for. The fuzziness parameter AUTO lets Elasticsearch determine the allowed number of edits from the length of the term. Note that fuzzy is a term-level query, so the value is not analyzed before matching; this differs from the Fuzzy Match Query shown earlier, which is a match query with the fuzziness option set.


Explore the exciting world of Elasticsearch with my latest hello world project!

Dive into the details: Elasticsearch Example


Join the Conversation: Share this post with your friends and colleagues who are passionate about web development and tech innovation. Let's learn and grow together. Your network will thank you!



More articles by Satyam Barsainya
