Text Similarity

Upendra Sharma, Arun Ayachitula


1. Motivation

While adept at storing factual knowledge and excelling at many NLP tasks, large pre-trained language models remain limited in their ability to precisely manipulate knowledge and to provide provenance for their answers. Retrieval-Augmented Generation (RAG) [5] addresses this by combining parametric memory (a seq2seq model) with non-parametric memory (a dense vector index of a knowledge base) to enhance language generation. In both of its formulations, RAG outperforms existing models on open-domain QA tasks and generates more specific, diverse, and factual language than traditional parametric-only models, marking a significant advancement in knowledge-intensive NLP. In simpler words, RAG is a natural language processing technique that merges an information retrieval component with a text generation model: the generative model produces its answer within the context of the retrieved information.

Figure 1: RAG Overview

Its essential function is to retrieve documents or texts pertinent to a specific task or question, enhancing the effectiveness and relevance of the generated text. As shown in Figure 1, the RAG use case can be thought of as three essential steps: i) generate a searchable knowledge base; ii) search the knowledge base for the most relevant documents; iii) finally, summarize the retrieved documents to generate a response to the user query.
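The pseudocode below sketches these three steps end to end. It is only an illustrative outline: embed, vector_store, and llm_summarize are hypothetical placeholders for whatever embedding model, vector database, and language model are used, not a specific library API.

# An illustrative sketch of the three RAG steps. The objects passed in
# (embed, vector_store, llm_summarize) are hypothetical placeholders.

def build_knowledge_base(documents, embed, vector_store):
    """Step i: embed every document and index it in a searchable store."""
    for doc_id, text in documents.items():
        vector_store.add(doc_id, embed(text), text)

def retrieve(query, embed, vector_store, top_k=3):
    """Step ii: embed the query and fetch the most relevant documents."""
    return vector_store.search(embed(query), top_k=top_k)

def answer(query, embed, vector_store, llm_summarize):
    """Step iii: summarize the retrieved documents into a response."""
    context = retrieve(query, embed, vector_store)
    return llm_summarize(query=query, context=context)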

Figure 2: RAG example from AIOps

Figure 2 shows the output from our implementation of the RAG solution for retrieving root-cause documents, which helps site reliability engineers perform their tasks more quickly and effectively.

The rest of the article is organized as follows: In Section 2, we discuss the background of semantics and semantic distance, summarize various techniques used to compute the semantic distance between words, and introduce cosine similarity/distance. Next, we discuss the idea of sentence embeddings and approaches for generating them. Finally, we briefly discuss computing distances using embeddings.

2. Introduction/Background

Finding semantically similar text has been one of the critical problems in natural language processing (NLP) research. A text could be a word, a sentence, or a document, each posing an increasing level of difficulty, but in essence the idea is to find semantically similar text by defining a measure of similarity between texts; this measure is called semantic distance. In this section, we first briefly discuss word embeddings and then sentence embeddings, which are used to perform the first two tasks mentioned in Section 1.

2.1 Word to vector

Humans are good at estimating semantic distance, but an automatic method of quantitatively computing semantic distance for processing data in large quantities has been an important topic of NLP research [7]. Traditional NLP regarded words as discrete symbols (a localist representation); in other words, they were typically represented as one-hot vectors with each vector having the size of the whole of the vocabulary, for instance:

CPU : [0, 0, 1, 0, . . . , 0]  (a 1 × N vector)

CORE : [1, 0, 0, 0, . . . , 0]  (a 1 × N vector)

These vectors can then be used in many machine-learning tasks, such as classification, sentiment analysis, clustering, etc.

This approach has several problems. First, N is as large as 250,000 for the English language. But the fundamental problem with this representation (besides its size) is that every word is assumed to be orthogonal to every other word – there is no notion of similarity between words. A large body of research (up to roughly 2010) tried to address this by using resources like WordNet and Wiktionary to capture meaning overlaps and similarities and by inventing different representations (Mohammad and Hirst [7] call these knowledge-rich measures). Still, that effort did not yield a complete solution: it could not fully capture similarity/relatedness while remaining compact enough for speed and efficiency.
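As a minimal illustration of this orthogonality problem, consider a toy three-word vocabulary (N = 3 instead of roughly 250,000); the dot product, and hence the cosine similarity, of any two distinct one-hot vectors is zero, so the representation treats every pair of words as unrelated:

import numpy as np

# Toy vocabulary and one-hot vectors (N = 3 here instead of ~250,000).
vocab = {"cpu": 0, "core": 1, "disk": 2}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

cpu, core = one_hot("cpu"), one_hot("core")
# Distinct one-hot vectors are orthogonal: their dot product is zero,
# so "cpu" and "core" look completely unrelated under this representation.
print(np.dot(cpu, core))  # 0.0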

One of the most successful ideas of modern statistical NLP is distributional semantics, inspired by the maxim "You shall know a word by the company it keeps" [4]. This means that a word's meaning is largely indicated by its surrounding words, usually taken from a fixed-size window. For instance, consider the word w, say processor, in the following sentences:

"The software development team optimized their code to run efficiently on the latest multi-core processors."

"In cloud computing, the processor’s speed and capacity are crucial for handling large-scale data processing tasks."

"The new algorithm was designed to reduce processor load, enhancing the system’s overall performance."

The words around w, i.e., processor, are its context. The distributional semantics approach uses many such contexts to build up the vector representation of w. The vector thus generated is a dense real-valued vector that is similar (in its values along various dimensions) to the vectors of words that appear in similar contexts – the dimension of the vector space is typically N ≥ 300. These word vectors are also called word embeddings or neural word representations. This is a distributed representation of the word w because the meaning of processor is distributed across the dimensions of the vector – and an embedding because each word is embedded in the N-dimensional vector space. In other words, word embedding is an NLP technique in which words from a corpus are mapped to vectors, and good embeddings cluster similar words in the embedding space. Naseem et al. have surveyed the latest techniques [10]; a complete survey is out of the scope of this article. Figure 3 (from a blog by Fabio [16]) shows a small taxonomy, and in this article we summarize the advanced word-vector representation approaches in two essential categories: i) continuous word representations (context-independent) and ii) contextual word representations (context-dependent).

2.1.1 Continuous word representation (context-independent)

Continuous word representation techniques, such as Word2Vec [6], GloVe [11], and FastText [2], convert words into vectors by using neural networks to analyze words in their context. These methods involve training a model on a text corpus to learn word associations based on how often words appear near each other. The resulting vectors capture semantic similarities, with words that appear in similar contexts having similar vector representations.

Figure 3: A small taxonomy of word embedding technologies

This technique allows for a more nuanced understanding of word meaning than older methods that treat words as isolated entities.
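A minimal sketch of training such embeddings with gensim's Word2Vec implementation [6] is shown below; the tiny corpus and the parameter values are purely illustrative assumptions, not recommendations for a real system.

# A sketch using gensim's Word2Vec (gensim 4.x API assumed). Real models
# are trained on corpora with millions of sentences.
from gensim.models import Word2Vec

corpus = [
    ["the", "team", "optimized", "code", "for", "multi-core", "processors"],
    ["the", "processor", "speed", "is", "crucial", "for", "data", "processing"],
    ["the", "new", "algorithm", "reduces", "processor", "load"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

vec = model.wv["processor"]                        # a dense 100-dimensional vector
print(model.wv.most_similar("processor", topn=3))  # nearest words in the space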

2.1.2 Contextual Word Representations

Contextual word representation techniques create word vectors that take into account the context in which a word appears rather than treating each word in isolation. These techniques use models like BERT, ELMo, and GPT, which generate word embeddings based on the surrounding words in a sentence. This allows the model to capture nuances in meaning that change with context, making the word representations more dynamic and accurate for natural language understanding tasks. The resulting vectors capture the subtleties of language use more effectively than traditional, context-independent embeddings.

The Transformer [15] has proven to be more efficient and faster than LSTMs or CNNs for language modeling and is thus the preferred architecture for the advances in this domain.

GPT (OpenAI Transformer): Generative Pre-Training [12], a pioneering transformer-based pre-trained language model, adapts word semantics contextually using the transformer's decoder. It operates as an auto-regressive model, predicting each word from the preceding context. While GPT demonstrates impressive results in many applications, it is limited by being unidirectional, modeling context only in a left-to-right sequence.

2.2 Sentence to vectors

A large number of modern transformers can be thought of as variants of BERT. They enhance word embeddings by using deep networks and an attention mechanism. Unlike simpler models like word2vec, which generate the same vector for a word regardless of context, BERT's attention mechanism adjusts embeddings based on surrounding words. For example, the word "space" has different embeddings in "very low disk space" versus "space is the final frontier," illustrating BERT's ability to capture context-specific meanings. But in a RAG setup we want to compare sentences, not words, and BERT produces an embedding for each token; we need a single vector representing a sentence or paragraph. Reimers and Gurevych built the first transformer explicitly designed for this purpose: Sentence-BERT (SBERT) [13], a modified version of BERT. This adaptation addresses BERT's limitations in generating sentence-level embeddings for semantic similarity assessment, which was computationally intensive with the original BERT. Sentence-BERT employs Siamese and triplet network structures to derive embeddings that can be compared using cosine similarity, significantly improving performance in sentence-pair tasks like clustering and semantic textual similarity.

BERT (and SBERT) use a WordPiece tokenizer — meaning that every word maps to one or more tokens. SBERT lets us create a single vector embedding for sequences of up to 128 tokens; anything beyond this limit is truncated. This limit isn't ideal for long pieces of text but is reasonable when comparing sentences or short paragraphs. Many of the latest models allow for longer sequence lengths, too.
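Below is a minimal sketch of producing sentence embeddings and comparing them with cosine similarity using the sentence-transformers library that accompanies SBERT [13]; the model name all-MiniLM-L6-v2 and the example sentences are illustrative choices, not the configuration used in our system.

# A sketch using the sentence-transformers library [13].
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # an example model choice

sentences = [
    "very low disk space on the database server",
    "the database host is running out of storage",
    "space is the final frontier",
]

# Each sentence becomes a single fixed-size vector; inputs longer than the
# model's maximum sequence length are truncated.
embeddings = model.encode(sentences)

# Pairwise cosine similarities between the sentence embeddings.
print(util.cos_sim(embeddings, embeddings))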

2.3 Challenges

Word and sentence embeddings, though crucial in natural language processing, face challenges such as polysemy and homonymy (words with multiple, context-dependent meanings), which affect their ability to represent meaning accurately. Additionally, biases in the training data can lead to skewed associations, and there is a general lack of interpretability in understanding how and why words are represented the way they are in these models. These issues can lead to errors and biased outcomes in downstream applications.

3. Embeddings

As we can see from Figure 1, a RAG system relies on two models: i) the language model that answers the question and ii) the embedding model that picks the source material from the knowledge base. We use GPT-3.5 Turbo to answer the questions. For embeddings, however, there are many more options, and unlike the GPT models, OpenAI's embedding model is not clearly superior: the MTEB benchmark [9] shows that other models score higher than ada-002; in particular, the Instructor models (XL and large) do very well.

In our work, we have used two models for generating embeddings for our sentences/text: OpenAI's text-embedding-ada-002 and Instructor-large [14].
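A minimal sketch of generating ada-002 embeddings with the openai Python client (v1-style API) follows; it assumes an OPENAI_API_KEY in the environment and omits batching and error handling.

# A sketch of calling OpenAI's embedding endpoint for text-embedding-ada-002.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts,
    )
    return [item.embedding for item in response.data]

vectors = embed(["very low disk space on the database server"])
print(len(vectors[0]))  # ada-002 embeddings have 1536 dimensions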

We use cosine similarity to measure the semantic relatedness/similarity between two embedding vectors. Cosine similarity is cos(θ), where θ is the angle between the vector representations of the two words/documents; if the cosine of the angle is one (or very close to one), the texts are semantically similar.

4. Cosine Similarity

Cosine similarity is a mathematical way to measure how similar two sets of information are. It’s calculated by taking the cosine of the angle between two vectors (as shown in Figure 4). The cosine similarity does not depend on the magnitudes of the vectors but only on their angle.


Figure 4: cosine of the angle between vectors

In general, cosine similarity is bounded between −1 and 1; for typical text embeddings it usually falls between 0 and 1. The closer the value is to 0, the closer the two vectors are to orthogonal (perpendicular); the closer the value is to 1, the smaller the angle and the more similar the two texts. When picking a threshold for similarity of text/documents, a value higher than 0.5 usually indicates strong similarity.

There are many corpus-based measures of distributional distance, for instance cosine, Manhattan, Euclidean, Hindle, Lin, Kullback-Leibler, α-skew, Jensen-Shannon, etc. We chose cosine similarity for the following reasons: i) Hindle, Lin, Kullback-Leibler, α-skew, and Jensen-Shannon are non-symmetric measures, in the sense that the distance from w1 to w2 is not the same as the distance from w2 to w1, and their range is 0 to ∞; the Manhattan and Euclidean distances are symmetric, but they too range from 0 to ∞. ii) Cosine is the most commonly used measure of similarity in the literature.

Thus, we found cosine similarity to be the most suitable for our use case, as it is symmetric and bounded between −1 and 1. Cosine similarity can easily be converted into a distance (i.e., a value that is always positive) by subtracting it from a constant, say 1.
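A minimal NumPy sketch of cosine similarity and the derived cosine distance (the vectors a and b below are illustrative stand-ins for any two embedding vectors):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||), bounded between -1 and 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: always non-negative, bounded by 2."""
    return 1.0 - cosine_similarity(a, b)

# Illustrative vectors standing in for two embeddings.
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.25, 0.55])
print(cosine_similarity(a, b), cosine_distance(a, b))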

5. How to use embeddings in search?

So now we know that we can compute embeddings and use them to compute the semantic similarity between sentences. To make this practical, we need to store the embeddings efficiently in some kind of database and use it to find/retrieve documents from the store. This task is done by specialized databases called vector databases.

Vector databases store data as high-dimensional vectors (embeddings). Each vector has a certain number of dimensions, ranging from tens to thousands, depending on the complexity and granularity of the data. Vector databases store and manage unstructured data such as text, images, and audio. They are used to:

  • Let machine learning models remember previous inputs.
  • Power search, recommendations, and text-generation use cases.
  • Complement generative AI models.
  • Provide an external knowledge base for generative AI chatbots.

Vector databases use similarity measures to compare the vectors stored in the database and find the ones most similar to a given query vector. Commonly used measures are Euclidean distance, Manhattan distance, and cosine similarity. Some examples of vector databases include Redis Stack, Datastax Astra DB, and Elasticsearch's vector store; some open-source ones are Chroma DB, Weaviate, Faiss, Milvus, etc.
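A minimal sketch of indexing and querying documents with Chroma DB follows; the collection name and documents are illustrative, and for brevity the sketch lets Chroma embed the texts with its default embedding function rather than the models discussed in Section 3.

# A sketch of storing and querying documents with Chroma. The "hnsw:space"
# setting asks Chroma to use cosine distance for this collection.
import chromadb

client = chromadb.Client()  # in-memory client; use a persistent client in practice
collection = client.create_collection(
    name="root_cause_docs",
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Database latency spiked because the disk was full.",
        "The web tier scaled out after a traffic surge.",
    ],
)

results = collection.query(
    query_texts=["why is the database slow?"],
    n_results=2,
    include=["documents", "distances"],  # distances are cosine distances here
)
print(results["documents"][0], results["distances"][0])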

6. How to compute similarity labels?

Often, it is convenient to know the similarity measure between the query, the retrieved documents, and the final response. We are using Chroma DB, which uses the following (cosine) distance between two vectors A and B:

distance(A, B) = 1 − cos(θ) = 1 − (A · B) / (‖A‖ ‖B‖)


6.1 Computing similarity labels

A similarity label is a word that assigns a distance to a class. We use three labels: LOW, MEDIUM, and HIGH. If we can figure out the probability distribution of the distance, we can choose reasonable thresholds and assign a similarity label to each distance.

Let X be the random variable that represents the angle between any two embeddings and D the random variable that represents the cosine distance between any two embeddings. We assume that X is uniformly distributed between 0 and π – a reasonable assumption. Our objective is to compute the distribution of D given that X is uniformly distributed; in the current case, we solve a much simpler problem and only find the range of the cosine distance defined above.

Ideally, X should be distributed between 0 and π. However, it has been observed, both in our experiments and in a discussion on the OpenAI embeddings forum [1], that for ada-002 the value of cos(θ), where θ is the angle between embeddings, does not go below roughly 0.7 (embeddings generated by OpenAI's ada-002 are not isotropic [8]). This means that for ada-002 the embeddings satisfy 0.7 ≤ cos(θ) ≤ 1, i.e., 0 ≤ θ ≤ arccos(0.7), or approximately 0 ≤ θ ≤ π/4.

So, we modify our assumption: X is uniformly distributed over the range [0, π/4]. With this assumption, the θ range can be split into three equal portions: i) HIGH similarity when θ ∈ [0, π/12); ii) MEDIUM when θ ∈ [π/12, π/6); and iii) LOW when θ ∈ [π/6, π/4]. With d = 1 − cos(θ), the corresponding ranges for d are: HIGH for d ∈ [0, 1 − cos(π/12)) ≈ [0, 0.034); MEDIUM for d ∈ [0.034, 1 − cos(π/6)) ≈ [0.034, 0.134); and LOW for d ∈ [0.134, 1 − cos(π/4)] ≈ [0.134, 0.293].
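A minimal sketch of mapping a cosine distance to one of these labels, with the thresholds derived from the θ ranges above under the d = 1 − cos(θ) assumption:

import math

# Thresholds on cosine distance d = 1 - cos(theta), obtained by splitting
# theta uniformly over [0, pi/4] into three equal bands.
HIGH_MAX = 1 - math.cos(math.pi / 12)   # ~0.034
MEDIUM_MAX = 1 - math.cos(math.pi / 6)  # ~0.134

def similarity_label(distance: float) -> str:
    """Map a cosine distance to a LOW / MEDIUM / HIGH similarity label."""
    if distance < HIGH_MAX:
        return "HIGH"
    if distance < MEDIUM_MAX:
        return "MEDIUM"
    return "LOW"

print(similarity_label(0.02), similarity_label(0.10), similarity_label(0.25))
# HIGH MEDIUM LOW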

6.2 How to handle the non-isotropic behavior

There are two options:

  • Explore other embedding models.
  • Post-process the embeddings to make them more isotropic (see Appendix A).

References

[1] "Embedding results scale seems off". OpenAI community forum discussion.

[2] Piotr Bojanowski et al. Enriching Word Vectors with Subword Information. 2017. arXiv: 1607.04606 [cs.CL].

[3] Curt Kennedy. Simple and effective post-processing for word representations.

[4] J. R. Firth. "A synopsis of linguistic theory 1930–55". In: Studies in Linguistic Analysis (1957), pp. 1–32.

[5] Patrick Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021. arXiv: 2005.11401 [cs.CL].

[6] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv: 1301.3781 [cs.CL].

[7] Saif M. Mohammad and Graeme Hirst. Distributional Measures of Semantic Distance: A Survey. 2012. arXiv: 1203.1858 [cs.CL].

[8] Jiaqi Mu, Suma Bhat, and Pramod Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. 2018. arXiv: 1702.01417 [cs.CL].

[9] Niklas Muennighoff et al. MTEB: Massive Text Embedding Benchmark. 2023. arXiv: 2210.07316 [cs.CL].

[10] Usman Naseem et al. A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models. 2020. arXiv: 2010.15036 [cs.CL].

[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation". In: EMNLP. Vol. 14. 2014, pp. 1532–1543.

[12] Alec Radford et al. "Language Models are Unsupervised Multitask Learners". OpenAI, 2019.

[13] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019. arXiv: 1908.10084 [cs.CL].

[14] Hongjin Su et al. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. 2023. arXiv: 2212.09741 [cs.CL].

[15] Ashish Vaswani et al. "Attention Is All You Need". In: Advances in Neural Information Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017.

[16] "11 word embedding models you should know". Blog post by Fabio.

Appendix A: Postprocessing after ada-002

The non-isotropy of the ada-002 embeddings can be mitigated with the post-processing algorithm described in [8], which centers the embeddings and then removes their top principal components. A possible implementation has been described by Curt Kennedy [3].
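A minimal NumPy sketch of that idea, centering the embeddings and removing their top principal components, is shown below; the number of removed components is a hyperparameter, and the value used here is only an illustrative assumption.

import numpy as np

def all_but_the_top(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Post-process a (num_vectors, dim) embedding matrix as in [8]:
    subtract the mean, then remove the top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                  # (n_components, dim)
    # Remove the projections onto the top principal directions.
    return centered - centered @ top.T @ top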

Acknowledgments

Thanks to Girish Mohite, Krishna Sumanth Gummadi, Subbareddy Paturu, Amar Mollakantalla, Murali Batthena, Vamshi Gummalla, Nitik Kumar, Kishore Maalae, Saravanan Kumarasamy, Divakar Reddy Doma, Rainy Moona, Muhammad Danish, Suryakanth Barathi, Shakuntala Prabhu, Pradeep Soundarajan, Hyder Khan, Godwin Dsouza, Prameela S, Vipin Sreedhar, Abhishek Gurav, Santosh Kumar Panigrahi, Diwakar Natarajan, Shivam Choudhary for their contributions to AIOps development.





