Search Concepts Cheatsheet - Elastic Oriented
Han Xiang Choong
Senior Customer Architect - APJ @ Elastic | Applied ML | Use-case Delivery | Search Experiences
Decided to write an overview of key search concepts, just to refresh and crystallize my understanding. This is far from comprehensive but I think it captures most of the essentials. This broadly reflects my understanding of search at the present moment - I hope you, mysterious reader, will find it useful as well.
Tokenization
Decomposes text into individual units, which may be words, phrases, symbols, n-grams, or other lexical structures. Tokenization creates searchable terms from input text, facilitating matching of query terms against documents. Common rules include splitting on whitespace, punctuation, and special characters, while preserving compounds such as apostrophes and abbreviations. Influences how documents are stored and how queries are processed.
Customized tokenization allows search to be tailored to specific use-case conditions, languages, user expectations, and domain terminology, and also facilitates methods such as partial or fuzzy matching.
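As a minimal sketch of these rules, here is a regex-based tokenizer in Python. The pattern is purely illustrative (it is not what Elastic's standard tokenizer does): it splits on whitespace and punctuation while keeping apostrophized forms and dotted abbreviations intact.

```python
import re

def tokenize(text: str) -> list[str]:
    """Illustrative tokenizer: split on whitespace/punctuation, but keep
    apostrophized forms (don't) and dotted abbreviations (U.S.A.) together."""
    pattern = r"(?:[A-Za-z]\.){2,}|[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?"
    return [t.lower() for t in re.findall(pattern, text)]

print(tokenize("Don't split the U.S.A. into 2.5 pieces!"))
# ["don't", 'split', 'the', 'u.s.a.', 'into', '2.5', 'pieces']
```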
Chunking
Split long documents into shorter segments (chunks). Common strategies include word and sentence level chunking. Optimal chunk size is context and use-case dependent, though lengths of 256 to 512 words/tokens have been recommended on the basis of empirical experiments. Smaller chunks are more precise but risk losing essential context. Larger chunks capture more context but introduce irrelevant information, which can reduce the quality of a generated answer when passed to an LLM.
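A minimal word-window chunker illustrating this trade-off, assuming simple whitespace tokenization (the size and overlap values are illustrative starting points, not recommendations):

```python
def chunk_words(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    """Fixed-size word chunking with overlap between consecutive chunks."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```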
In Elastic 8.15, chunking is automatically handled using the semantic_text datatype, though custom chunking strategies can be implemented via script ingest processors written in Painless.
Intuitively, we aim to target the specific portions of a document relevant to a query, in order to maximize search relevance and reduce interference from irrelevant chunks of text. Chunks should ideally be independent units of information, with reliance on additional context minimized as much as possible.
Difficult to achieve in practice, especially in domains where a considerable quantity of background information is necessary to interpret the material. This is where RAG in fields such as legal and healthcare tends to fall short.
Partial Matching
Partial matching allows discovery of results containing only a part of the search query. This assists iterative discovery - i.e. when you are not fully aware of what you are searching for, or when searching for complex compound terms. Matching on prefixes (heads), suffixes (tails), infixes (bodies), and n-grams is possible. Elastic offers wildcard queries and n-gram analyzers for this purpose.
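Hedged examples of what this looks like in Elastic's query DSL, expressed here as Python dicts (the title field is hypothetical):

```python
# Infix (body) match via a wildcard query.
wildcard_query = {
    "query": {
        "wildcard": {
            "title": {"value": "*search*"}
        }
    }
}

# Prefix (head) match via a prefix query.
prefix_query = {
    "query": {
        "prefix": {
            "title": {"value": "tok"}
        }
    }
}
```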
Fuzzy Matching
Fuzzy matching allows for approximate searches rather than exact matches. Leverages Levenshtein edit distance (which measures the minimum number of single-character edits required to change one word into another) to create a similarity measure based purely on either lexical features or phonetics.
Fuzzy matches provide a means to handle spelling errors and variations gracefully. Elastic offers a fuzzy query type with a configurable ‘fuzziness’ threshold.
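An illustrative fuzzy query as a Python dict; the field name and search term are hypothetical, and AUTO scales the allowed edit distance with term length:

```python
fuzzy_query = {
    "query": {
        "fuzzy": {
            "title": {
                "value": "serch",       # tolerates the missing 'a'
                "fuzziness": "AUTO"
            }
        }
    }
}
```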
Analyzer
Analyzers in Elastic are composed of character filters, a tokenizer, and token filters, applied in that order. Provides a customizable framework for text processing. Unlike an ingest pipeline, the same analyzer is typically applied at both index time and query time, which helps maintain consistency between indexing and search. Convenient tool for organizing context-specific data-processing procedures.
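A sketch of index settings defining a custom analyzer; the analyzer name and the body field are hypothetical, while the char filter, tokenizer, and token filters shown are standard built-ins:

```python
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_html": {"type": "html_strip"}        # character filter
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["strip_html"],
                    "tokenizer": "standard",                  # tokenizer
                    "filter": ["lowercase", "asciifolding"]   # token filters
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "my_analyzer"}
        }
    }
}
```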
Normalization
Converts text into canonical forms, ensuring that multiple variations can be treated equivalently by search. Examples include lowercasing, accent removal, punctuation removal, whitespace standardization, Unicode and spelling standardization, acronym expansion, numerical standardization, linguistic inter-conversion (e.g. Katakana to Hiragana), and stemming/lemmatization.
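A minimal Python sketch of a few of these steps (lowercasing, accent removal, punctuation removal, whitespace standardization); a real pipeline would add stemming, acronym expansion, and so on:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))  # strip accents
    text = re.sub(r"[^\w\s]", " ", text)                             # drop punctuation
    return re.sub(r"\s+", " ", text).strip()                         # collapse whitespace

print(normalize("  Café-au-LAIT!! "))   # "cafe au lait"
```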
Lexical Search
Term-based search centered on exact matches of keywords or phrases, leveraging the inverted index for speed. Text fields go through tokenization and normalization to improve consistency and predictability of search behavior. In Elastic, can be either case-sensitive or insensitive.
BM-25 & TF-IDF
Ranking functions used to determine the importance of search terms within a document. TF-IDF multiplies term frequency with the log of the ratio of total documents to the number of documents containing a particular term.
BM-25 penalizes document length and dampens term frequency past a certain point, accounting for saturation effects (diminishing returns for repeated terms, whereas TF-IDF increases linearly).
BM-25 has been the default scoring function in Elastic since version 5.0.
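A worked sketch of both per-term scores in Python, using the Lucene-style BM-25 IDF and the common defaults k1 = 1.2 and b = 0.75 (this is a reference illustration, not Elastic's internal implementation):

```python
import math

def tf_idf(tf: float, df: int, n_docs: int) -> float:
    """Term frequency times log(total docs / docs containing the term)."""
    return tf * math.log(n_docs / df)

def bm25(tf: float, df: int, n_docs: int, doc_len: float, avg_doc_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """IDF times a saturating TF component with document-length normalization.
    k1 controls saturation; b controls the length penalty."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```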
Embedding
Vector representation of words, phrases, sentences, or documents in n-dimensional space. Learned as a product of text-based machine learning tasks spanning the full spectrum of supervision. There is artfulness involved in choosing embeddings based on model training data and learning tasks.
Captures semantic relationships between textual segments in geometric space, allowing textual similarity to be conceived of in terms of vector similarity. May serve as a form of dimensionality reduction from an unbounded space (language) to a comparatively smaller one (typically 384-8000 floating point or binary dimensions) [Calling language unbounded is potentially controversial but the question is practically philosophical].
Provides a standard format which tensor-based ML algorithms can efficiently interpret and process. Can be interpreted as a representation of co-occurrence patterns and correlational relationships between textual elements. May be construed as an incomplete representation of the statistical properties of the originating model.
Vectors derived from later layers of deep learning models have undergone a series of non-linear transformations; thus they are not direct projections of underlying lexical distributions.
Note: Embedding models have a length limitation, beyond which information is not encoded or preserved - chunk sizes should be chosen such that each chunk falls within this limit.
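An illustrative chunk-embedding step using sentence-transformers; the model name is only an example (it produces 384-dimensional vectors), and anything beyond its sequence limit is silently truncated:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model, 384 dimensions
chunks = ["Chunk one of the document...", "Chunk two of the document..."]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)   # (2, 384)
```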
Vector Distance
Geometric distance between two vectors, most commonly calculated via Euclidean or Manhattan distance. May be taken as a measure of semantic closeness in the context of vector search. Suffers from the curse of dimensionality, where distances converge to a uniform value as dimensionality increases.
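Minimal reference implementations of the two distances:

```python
def euclidean(a: list[float], b: list[float]) -> float:
    """Square root of the sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a: list[float], b: list[float]) -> float:
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```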
Vector Similarity
The alikeness of two vectors, typically calculated as the cosine of the angle between them. Does not depend on the magnitude of the vectors, which is generally preferred in NLP. This is because magnitude correlates with document length/word frequency, which may not be relevant for semantic comparison. Retains interpretability and efficacy in high-dimensional spaces.
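And cosine similarity for comparison, showing the magnitude-independence explicitly:

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; dividing by both norms
    removes any dependence on vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)
```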
Relevance
Somewhat nebulous concept. Measures how well a search system aligns with user expectations and use-case requirements. Includes such considerations as topical, situational, cognitive, and affective relevance. Altered by perceptions of document authoritativeness, popularity, and freshness, as well as user behavior. May be considered the ‘absolute’ metric of search engine success, and is affected by every design and tuning decision made during construction. May be measured by ‘soft’ metrics, such as user acceptance and satisfaction ratings, expert judgment, and user activity trends, or by harder metrics such as F1 score and mean average precision.
Hybrid Search
Search procedure which combines keyword matching mechanisms with measures of vector similarity. The query is embedded; the embedding is used for vector search while the original query text is used for lexical search, and the two result sets are combined.
Attempts to ground search with an easily-interpretable and explicit component, while enhancing it with nuanced statistical interpretation through correlational relationships derived from deep learning, which are often difficult to interpret.
Lexical matching scores and vector similarity scores must be normalized to ensure fair comparison by removing biases due to scale imbalances.
Simple normalization such as z-score or min-max can be applied to ensure that scores exist within comparable scales. Weighting may be used to bias search towards one method over the other, depending on use-case context. Alternatively, scores may be combined using reciprocal rank fusion (RRF).
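A hedged sketch of min-max normalization followed by a weighted combination; the document IDs and the alpha weighting are illustrative, and RRF itself is sketched in the next section:

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so lexical and vector scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(lexical: dict[str, float], vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Weighted sum of normalized scores; alpha biases towards lexical (1.0)
    or vector (0.0) search depending on use-case context."""
    lex, vec = min_max(lexical), min_max(vector)
    docs = set(lex) | set(vec)
    return {d: alpha * lex.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0) for d in docs}
```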
Reciprocal Rank Fusion
A method of combining multiple ranked lists into a single ranking. Deterministic, non-machine-learning algorithm. Balances competing influences from diverse ranking systems. For each item, sum 1/(k + r_i) over every list in which it appears, where k is a constant and r_i is the rank of the item within that list. k is a tunable parameter; smaller values favor consistently top-ranked items over lower-ranked items. Specific values of k are typically assessed and chosen empirically.
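A minimal implementation matching the formula above (k = 60 is a commonly cited default, not a recommendation):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) over the lists it appears in;
    smaller k favours consistently top-ranked documents."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists and comes out on top.
print(reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]]))   # ['b', 'a', 'c']
```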
Lower vs Higher Floating Point Precision
Choice of precision in the current-day ML context attempts to balance semantic richness with resource intensity. Higher-precision floating point numbers offer greater numerical stability and accuracy, stabilizing training by minimizing the effect of rounding errors. Lower-precision floating point numbers save resources at the cost of semantic richness. Note that there may be diminishing returns from a semantic richness vs resource intensity perspective. In certain domains, lower precision may be sufficient to capture most of the available information.
Reducing the resource footprint of models opens up a range of operational possibilities. For example, deploying more model replicas to enhance robustness and increase throughput, or deploying models on edge devices and consumer-grade hardware. A good strategy is to train on server-grade hardware at high precision, then reduce precision for subsequent deployment.
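An illustrative numpy sketch of the memory saving from dropping embeddings to half precision; the array shape is arbitrary:

```python
import numpy as np

embeddings = np.random.rand(1000, 384).astype(np.float32)   # illustrative vectors
half = embeddings.astype(np.float16)                         # halves the memory footprint
print(embeddings.nbytes, half.nbytes)                        # 1536000 vs 768000 bytes
```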
Sparse Vectors vs Dense Vectors
Sparse vectors contain relatively few non-zero elements despite higher dimensionality than dense vectors (10^4 - 10^6). This makes them storage-efficient and highly performant, despite lacking the depth or richness of dense vectors. On the other hand, sparse vectors may be less susceptible to statistical noise, having been stripped to their essential elements. Potentially more interpretable, as each dimension might correspond to a specific feature or element of text.
Dense vectors are mostly non-zero, generally lower dimensional (10^2 - 10^4), and more resource intensive. Semantically richer as a result, and the default choice for current-day machine learning applications (possibly owing to a flagrant disregard for resource efficiency in our industry).
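Purely illustrative representations of the two forms; the terms, weights, and components are made up:

```python
# Sparse: only the non-zero dimensions are stored, as {term: weight} pairs.
sparse_vector = {"search": 1.8, "elastic": 1.2, "ranking": 0.7}

# Dense: every dimension is stored explicitly (shown truncated).
dense_vector = [0.021, -0.134, 0.562, 0.008]
```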
ELSER (Sparse Embedding)
Elastic's own model, which bridges the gap between sparse and dense representation. Produces high-dimensional sparse embeddings learned through training on a text corpus. Balances the semantic richness of dense vectors with the performance characteristics of sparse vectors. Feasible to run at scale on CPU nodes, enabling vector search with acceptable performance in lower-resource contexts.
Learning to Rank
Aims to improve relevance by building a custom ranking function, more effectively aligning search engine performance with user expectations. In Elastic, this takes the form of a machine learning model utilizing a list of queries, documents, and judgements as training data.
Judgements may be manually generated or derived from user behavior with some manual fine-tuning. Data may be further enhanced with features drawn from documents and queries. Training optimizes ranking metrics such as discounted cumulative gain or mean average precision.
Elastic itself does not support model training. XGBoost is recommended due to eland support.
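A heavily hedged sketch of training a ranking model with XGBoost's XGBRanker; the feature matrix, judgements, and query groupings are placeholders, and exporting the trained model to Elastic via eland is not shown:

```python
import numpy as np
from xgboost import XGBRanker

X = np.random.rand(6, 3)           # placeholder per query-document features (e.g. BM25 score, recency)
y = np.array([3, 2, 0, 2, 1, 0])   # placeholder relevance judgements
groups = [3, 3]                    # documents per query, in order

ranker = XGBRanker(objective="rank:ndcg", n_estimators=50)
ranker.fit(X, y, group=groups)
scores = ranker.predict(X[:3])     # relevance scores for the first query's documents
```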
Semantic Reranking
Takes a set of retrieved documents and outputs a revised ranking. The query is included to make the ranking query-aware. Uses a more powerful model than could be afforded at the initial search and retrieval stage.
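A hedged sketch of cross-encoder reranking using sentence-transformers; the model name is only an example, and this is not Elastic's native reranking integration:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example reranking model
query = "how does chunking affect retrieval quality"
candidates = [
    "Chunk size trades precision against context...",
    "BM25 has been the default similarity since 5.0...",
]
# Score each (query, document) pair jointly, then sort candidates by score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```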