Search Concepts Cheatsheet - Elastic Oriented
Han Xiang Choong
Senior Customer Architect - APJ @ Elastic | Applied ML | Use-case Delivery | Search Experiences
Decided to write an overview of key search concepts, just to refresh and crystallize my understanding. This is far from comprehensive but I think it captures most of the essentials. This broadly reflects my understanding of search at the present moment - I hope you, mysterious reader, will find it useful as well.
Tokenization
Decomposes text into individual units, which may be words, phrases, symbols, n-grams, or other lexical structures. Tokenization creates searchable terms from input text, facilitating matching of query terms against documents. Common rules include splitting on whitespace, punctuation, and special characters, while preserving compounds such as apostrophes and abbreviations. Influences how documents are stored and how queries are processed.
Customized tokenization allows search to be tailored to specific use-case conditions, languages, user expectations, and domain terminology, and also facilitates methods such as partial or fuzzy matching.
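As a minimal sketch of these rules, here is a regex-based tokenizer in Python. The pattern is purely illustrative (it is not what Elastic's standard tokenizer does): it splits on whitespace and punctuation while keeping apostrophized forms and dotted abbreviations intact.

```python
import re

def tokenize(text: str) -> list[str]:
    """Illustrative tokenizer: split on whitespace/punctuation, but keep
    apostrophized forms (don't) and dotted abbreviations (U.S.A.) together."""
    pattern = r"(?:[A-Za-z]\.){2,}|[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?"
    return [t.lower() for t in re.findall(pattern, text)]

print(tokenize("Don't split the U.S.A. into 2.5 pieces!"))
# ["don't", 'split', 'the', 'u.s.a.', 'into', '2.5', 'pieces']
```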
Chunking
Split long documents into shorter segments (chunks). Common strategies include word and sentence level chunking. Optimal chunk size is context and use-case dependent, though lengths of 256 to 512 words/tokens have been recommended on the basis of empirical experiments. Smaller chunks are more precise but risk losing essential context. Larger chunks capture more context but introduce irrelevant information, which can reduce the quality of a generated answer when passed to an LLM.
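A minimal word-window chunker illustrating this trade-off, assuming simple whitespace tokenization (the size and overlap values are illustrative starting points, not recommendations):

```python
def chunk_words(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    """Fixed-size word chunking with overlap between consecutive chunks."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```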
In Elastic 8.15, chunking is automatically handled using the semantic_text datatype, though custom chunking strategies can be implemented via script ingest processors written in Painless.
Intuitively, we aim to target the specific portions of a document relevant to a query, in order to maximize search relevance and reduce interference from irrelevant chunks of text. Chunks should ideally be independent units of information, with reliance on additional context minimized as much as possible.
Difficult to achieve in practice, especially in domains where a considerable quantity of background information is necessary to interpret the material. This is where RAG in fields such as legal and healthcare tends to fall short.
Partial Matching
Partial matching allows discovery of results containing only a part of the search query. This assists iterative discovery - i.e. when you are not fully aware of what you are searching for, or when searching for complex compound terms. Matching on prefixes (heads), suffixes (tails), infixes (bodies), and n-grams is possible. Elastic offers wildcard queries and n-gram analyzers for this purpose.
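Hedged examples of what this looks like in Elastic's query DSL, expressed here as Python dicts (the title field is hypothetical):

```python
# Infix (body) match via a wildcard query.
wildcard_query = {
    "query": {
        "wildcard": {
            "title": {"value": "*search*"}
        }
    }
}

# Prefix (head) match via a prefix query.
prefix_query = {
    "query": {
        "prefix": {
            "title": {"value": "tok"}
        }
    }
}
```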
Fuzzy Matching
Fuzzy matching allows for approximate searches rather than exact matches. Leverages Levenshtein edit distance (which measures the minimum number of single-character edits required to change one word into another) to create a similarity measure based purely on either lexical features or phonetics.
Fuzzy matches provide a means to handle spelling errors and variations gracefully. Elastic offers a fuzzy query type with a configurable ‘fuzziness’ threshold.
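An illustrative fuzzy query as a Python dict; the field name and search term are hypothetical, and AUTO scales the allowed edit distance with term length:

```python
fuzzy_query = {
    "query": {
        "fuzzy": {
            "title": {
                "value": "serch",       # tolerates the missing 'a'
                "fuzziness": "AUTO"
            }
        }
    }
}
```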
Analyzer
Analyzers in Elastic are composed of character filters, a tokenizer, and token filters, applied in that order. Provides a customizable framework for text processing. Unlike an ingest pipeline, the same analyzer is typically applied at both index time and query time, which helps maintain consistency between indexing and search. Convenient tool for organizing context-specific data-processing procedures.
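A sketch of index settings defining a custom analyzer; the analyzer name and the body field are hypothetical, while the char filter, tokenizer, and token filters shown are standard built-ins:

```python
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_html": {"type": "html_strip"}        # character filter
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["strip_html"],
                    "tokenizer": "standard",                  # tokenizer
                    "filter": ["lowercase", "asciifolding"]   # token filters
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "my_analyzer"}
        }
    }
}
```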
Normalization
Converts text into canonical forms, ensuring that multiple variations can be treated equivalently by search. Examples include lowercasing, accent removal, punctuation removal, whitespace standardization, Unicode and spelling standardization, acronym expansion, numerical standardization, linguistic inter-conversion (e.g. Katakana to Hiragana), and stemming/lemmatization.
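A minimal Python sketch of a few of these steps (lowercasing, accent removal, punctuation removal, whitespace standardization); a real pipeline would add stemming, acronym expansion, and so on:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))  # strip accents
    text = re.sub(r"[^\w\s]", " ", text)                             # drop punctuation
    return re.sub(r"\s+", " ", text).strip()                         # collapse whitespace

print(normalize("  Café-au-LAIT!! "))   # "cafe au lait"
```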
Lexical Search
Term-based search centered on exact matches of keywords or phrases, leveraging the inverted index for speed. Text fields go through tokenization and normalization to improve consistency and predictability of search behavior. In Elastic, can be either case-sensitive or insensitive.
BM-25 & TF-IDF
Ranking functions used to determine the importance of search terms within a document. TF-IDF multiplies term frequency with the log of the ratio of total documents to the number of documents containing a particular term.
BM-25 penalizes document length and dampens term frequency past a certain point, accounting for saturation effects (diminishing returns for repeated terms, whereas TF-IDF increases linearly).
BM-25 has been the default scoring function in Elastic since version 5.0.
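A worked sketch of both per-term scores in Python, using the Lucene-style BM-25 IDF and the common defaults k1 = 1.2 and b = 0.75 (this is a reference illustration, not Elastic's internal implementation):

```python
import math

def tf_idf(tf: float, df: int, n_docs: int) -> float:
    """Term frequency times log(total docs / docs containing the term)."""
    return tf * math.log(n_docs / df)

def bm25(tf: float, df: int, n_docs: int, doc_len: float, avg_doc_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """IDF times a saturating TF component with document-length normalization.
    k1 controls saturation; b controls the length penalty."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```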
Embedding
Vector representation of words, phrases, sentences, or documents in n-dimensional space. Learned as a product of text-based machine learning tasks spanning the full spectrum of supervision. There is artfulness involved in choosing embeddings based on model training data and learning tasks.
Captures semantic relationships between textual segments in geometric space, allowing textual similarity to be conceived of in terms of vector similarity. May serve as a form of dimensionality reduction from an unbounded space (language) to a comparatively smaller one (typically 384-8000 floating point or binary dimensions) [Calling language unbounded is potentially controversial but the question is practically philosophical].
Provides a standard format which tensor-based ML algorithms can efficiently interpret and process. Can be interpreted as a representation of co-occurrence patterns and correlational relationships between textual elements. May be construed as an incomplete representation of the statistical properties of the originating model.
Vectors derived from later layers of deep learning models have undergone a series of non-linear transformations; thus they are not direct projections of underlying lexical distributions.
Note: Embedding models have a length limitation, beyond which information is not encoded or preserved - chunk sizes should be chosen such that each chunk falls within this limit.
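An illustrative chunk-embedding step using sentence-transformers; the model name is only an example (it produces 384-dimensional vectors), and anything beyond its sequence limit is silently truncated:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example model, 384 dimensions
chunks = ["Chunk one of the document...", "Chunk two of the document..."]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)   # (2, 384)
```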
Vector Distance
Geometric distance between two vectors, most commonly calculated via Euclidean or Manhattan distance. May be taken as a measure of semantic closeness in the context of vector search. Suffers from the curse of dimensionality, where distances converge to a uniform value as dimensionality increases.
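Minimal reference implementations of the two distances:

```python
def euclidean(a: list[float], b: list[float]) -> float:
    """Square root of the sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a: list[float], b: list[float]) -> float:
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```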
Vector Similarity
The alikeness of two vectors, typically calculated as the cosine of the angle between them. Does not depend on the magnitude of the vectors, which is generally preferred in NLP. This is because magnitude correlates with document length/word frequency, which may not be relevant for semantic comparison. Retains interpretability and efficacy in high-dimensional spaces.
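And cosine similarity for comparison, showing the magnitude-independence explicitly:

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; dividing by both norms
    removes any dependence on vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)
```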
Relevance
Somewhat nebulous concept. Measures how well a search system aligns with user expectations and use-case requirements. Includes such considerations as topical, situational, cognitive, and affective relevance. Altered by perceptions of document authoritativeness, popularity, and freshness, as well as user behavior. May be considered the ‘absolute’ metric of search engine success, and is affected by every design and tuning decision made during construction. May be measured by ‘soft’ metrics, such as user acceptance and satisfaction ratings, expert judgment, and user activity trends, or by harder metrics such as F1 score and mean average precision.
Hybrid Search
Search procedure which combines keyword matching mechanisms with measures of vector similarity. The query is embedded; the embedding is used for vector search while the original query text is used for lexical search, and the two result sets are combined.
Attempts to ground search with an easily-interpretable and explicit component, while enhancing it with nuanced statistical interpretation through correlational relationships derived from deep learning, which are often difficult to interpret.
Lexical matching scores and vector similarity scores must be normalized to ensure fair comparison by removing biases due to scale imbalances.
Simple normalization such as z-score or min-max can be applied to ensure that scores exist within comparable scales. Weighting may be used to bias search towards one method over the other, depending on use-case context. Alternatively, scores may be combined using reciprocal rank fusion (RRF).
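A hedged sketch of min-max normalization followed by a weighted combination; the document IDs and the alpha weighting are illustrative, and RRF itself is sketched in the next section:

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so lexical and vector scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(lexical: dict[str, float], vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Weighted sum of normalized scores; alpha biases towards lexical (1.0)
    or vector (0.0) search depending on use-case context."""
    lex, vec = min_max(lexical), min_max(vector)
    docs = set(lex) | set(vec)
    return {d: alpha * lex.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0) for d in docs}
```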
Reciprocal Rank Fusion
A method of combining multiple ranked lists into a single ranking. Deterministic, non-machine-learning algorithm. Balances competing influences from diverse ranking systems. For each item, sum 1/(k + r_i) over every list in which it appears, where k is a constant and r_i is the rank of the item within that list. k is a tunable parameter; smaller values favor consistently top-ranked items over lower-ranked items. Specific values of k are typically assessed and chosen empirically.
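A minimal implementation matching the formula above (k = 60 is a commonly cited default, not a recommendation):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) over the lists it appears in;
    smaller k favours consistently top-ranked documents."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists and comes out on top.
print(reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]]))   # ['b', 'a', 'c']
```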
Lower vs Higher Floating Point Precision
Choice of precision in the current-day ML context attempts to balance semantic richness with resource intensity. Higher-precision floating point numbers offer greater numerical stability and accuracy, stabilizing training by minimizing the effect of rounding errors. Lower-precision floating point numbers save resources at the cost of semantic richness. Note that there may be diminishing returns from a semantic richness vs resource intensity perspective. In certain domains, lower precision may be sufficient to capture most of the available information.
Reducing the resource footprint of models opens up a range of operational possibilities. For example, deploying more model replicas to enhance robustness and increase throughput, or deploying models on edge devices and consumer-grade hardware. A good strategy is to train on server-grade hardware at high precision, then reduce precision for subsequent deployment.
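An illustrative numpy sketch of the memory saving from dropping embeddings to half precision; the array shape is arbitrary:

```python
import numpy as np

embeddings = np.random.rand(1000, 384).astype(np.float32)   # illustrative vectors
half = embeddings.astype(np.float16)                         # halves the memory footprint
print(embeddings.nbytes, half.nbytes)                        # 1536000 vs 768000 bytes
```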
Sparse Vectors vs Dense Vectors
Sparse vectors contain relatively few non-zero elements despite higher dimensionality than dense vectors (10^4 - 10^6). This makes them storage-efficient and highly performant, despite lacking the depth or richness of dense vectors. On the other hand, sparse vectors may be less susceptible to statistical noise, having been stripped to their essential elements. Potentially more interpretable, as each dimension might correspond to a specific feature or element of text.
Dense vectors are mostly non-zero, generally lower dimensional (10^2 - 10^4), and more resource intensive. Semantically richer as a result, and the default choice for current-day machine learning applications (possibly owing to a flagrant disregard for resource efficiency in our industry).
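Purely illustrative representations of the two forms; the terms, weights, and components are made up:

```python
# Sparse: only the non-zero dimensions are stored, as {term: weight} pairs.
sparse_vector = {"search": 1.8, "elastic": 1.2, "ranking": 0.7}

# Dense: every dimension is stored explicitly (shown truncated).
dense_vector = [0.021, -0.134, 0.562, 0.008]
```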
ELSER (Sparse Embedding)
Elastic's own model, which bridges the gap between sparse and dense representation. Produces high-dimensional sparse embeddings learned through training on a text corpus. Balances the semantic richness of dense vectors with the performance characteristics of sparse vectors. Feasible to run at scale on CPU nodes, enabling vector search with acceptable performance in lower-resource contexts.
Learning to Rank
Aims to improve relevance by building a custom ranking function, more effectively aligning search engine performance with user expectations. In Elastic, this takes the form of a machine learning model utilizing a list of queries, documents, and judgements as training data.
Judgements may be manually generated or derived from user behavior with some manual fine-tuning. Data may be further enhanced with features drawn from documents and queries. Training optimizes ranking metrics such as discounted cumulative gain or mean average precision.
Elastic itself does not support model training. XGBoost is recommended due to eland support.
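A heavily hedged sketch of training a ranking model with XGBoost's XGBRanker; the feature matrix, judgements, and query groupings are placeholders, and exporting the trained model to Elastic via eland is not shown:

```python
import numpy as np
from xgboost import XGBRanker

X = np.random.rand(6, 3)           # placeholder per query-document features (e.g. BM25 score, recency)
y = np.array([3, 2, 0, 2, 1, 0])   # placeholder relevance judgements
groups = [3, 3]                    # documents per query, in order

ranker = XGBRanker(objective="rank:ndcg", n_estimators=50)
ranker.fit(X, y, group=groups)
scores = ranker.predict(X[:3])     # relevance scores for the first query's documents
```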
Semantic Reranking
Takes a set of retrieved documents and outputs a revised ranking. The query is included to make the ranking query-aware. Uses a more powerful model than could be afforded at the initial search and retrieval stage.
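A hedged sketch of cross-encoder reranking using sentence-transformers; the model name is only an example, and this is not Elastic's native reranking integration:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example reranking model
query = "how does chunking affect retrieval quality"
candidates = [
    "Chunk size trades precision against context...",
    "BM25 has been the default similarity since 5.0...",
]
# Score each (query, document) pair jointly, then sort candidates by score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```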