What is text embedding?

Did you know the Cambridge Dictionary works a bit like a language model? Each time you look up a word, it is as if you are obtaining a vector that represents the essence of that word. Let us dive into the fascinating world of semantic search.

Computers and programming languages are deterministic, which is why we store known values in databases: the IF statement relies on known values to determine which logic to execute next. In semantic search and natural language processing, however, this deterministic nature is absent.
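To make the contrast concrete, here is a minimal sketch (the question strings are invented for illustration): a deterministic IF statement only succeeds on an exact, known value, so a paraphrase with the same meaning falls straight through.

```python
# Deterministic lookup: the IF statement matches only the exact string.
def answer(question: str) -> str:
    if question == "What is the capital of France?":
        return "Paris"
    return "I don't know"

print(answer("What is the capital of France?"))  # exact match succeeds
print(answer("France's capital city is?"))       # same meaning, but no match
```

Semantic search sidesteps this limitation by comparing meanings (as vectors) instead of comparing raw strings.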

Words do not exist in isolation. The meaning of "I buy an Apple" depends on the context established by previous statements. To comprehend the meaning of words, it is crucial to understand their relationships with one another. To capture these relationships, we use word embeddings, which are multi-dimensional vector representations of words and their associations with other words.

Text embeddings (or text vectors) are generated by trained language models, which are neural networks developed from extensive text corpora. To obtain a text vector, a sentence or text is input into a neural network model, which then outputs the text vector. This vector captures the relationships of words in the corpus. Different models produce distinct text vectors because the training corpora and mechanisms vary.
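A common way to compare two text vectors is cosine similarity: vectors pointing in nearly the same direction score close to 1, unrelated ones score much lower. The sketch below uses tiny hand-made 4-dimensional vectors purely for illustration; a real model would output hundreds or thousands of dimensions.

```python
import math

# Toy "embeddings", invented for illustration only.
vectors = {
    "cat":    [0.90, 0.80, 0.10, 0.00],
    "kitten": [0.85, 0.75, 0.15, 0.05],
    "car":    [0.10, 0.00, 0.90, 0.80],
}

def cosine(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["cat"], vectors["kitten"]))  # close to 1.0: similar meaning
print(cosine(vectors["cat"], vectors["car"]))     # much lower: unrelated
```

The same calculation works unchanged on real model output; only the vectors themselves come from the trained network.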

When using text embeddings, we aim to understand the meaning of words, which is expressed through the use of other English words. This process is similar to looking up a word in a dictionary; when you search for "cat" in the Cambridge Dictionary, it uses other English words to define "cat."

Cat (noun)
a small animal with fur, four legs, a tail, and claws,
usually kept as a pet or for catching mice
(from Cambridge online)

If you consult the Collins Dictionary, the definition is similar, but it uses a different sequence of English words. In this example, the models are Cambridge and Collins, and the vocabulary definitions are the text vectors, demonstrating the relationship of "cat" with other English words.

Kids learn about the world through the language (i.e., word relationships) their parents use, and computer scientists train AI models to understand our world by letting neural networks discover those relationships across billions of words. That may, in some way, explain the reasoning power of AI models. My personal view is that the secret of AI will likely lie in combining linguistics with mathematics: when we can use mathematical symbols to describe all linguistic theories, AI models will be able to comprehend human-centric reality.

OpenAI offers several models and charges a fee for generating text embeddings. There are also open-source models available, with some providing more accurate representations of word meanings and relationships than OpenAI's Ada model.

https://iamnotarobot.substack.com/p/should-you-use-openais-embeddings
