A topic that often comes up on the discussions forum is spaCy's Vocab object and its vectors. So let's go over a few properties of vectors currently found in medium (md) and large (lg) models.
One of the main features of the Vocab in spaCy is the vector store. This is the single place where pre-trained word-embeddings can be found. Having a single place for these vectors saves on a lot of memory!
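To make this concrete, here's a minimal sketch (assuming the en_core_web_md model has been downloaded) showing that a token's .vector is served from the shared table on the Vocab:

```python
import spacy

# A small sketch, assuming en_core_web_md has been downloaded.
nlp = spacy.load("en_core_web_md")
doc = nlp("I drink coffee")

# Each token's .vector is fetched from the single shared table on the Vocab.
print(doc[2].vector.shape)      # the width of one word vector, e.g. (300,)
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width) for the whole table
```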
It's *not* always the case that each word stored in the vector lookup has a unique vector though! spaCy allows for some pruning in order to save on disk/memory. When two strings have similar vectors, they may get merged together. The medium (md) models typically do this.
You can inspect the meta information of the spaCy models to get an impression of how much pruning has been done. The large/medium models currently always have the same number of keys, but they differ in the number of vectors.
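As a rough sketch (assuming both the md and lg models are downloaded), you can compare the counts reported in the pipeline meta:

```python
import spacy

# A sketch comparing pruning stats; assumes both models are downloaded.
for name in ["en_core_web_md", "en_core_web_lg"]:
    nlp = spacy.load(name)
    vector_meta = nlp.meta["vectors"]
    # "keys" counts the strings that map to a vector, "vectors" counts the rows kept.
    print(name, "keys:", vector_meta["keys"], "vectors:", vector_meta["vectors"])
```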
If you're curious: you can actually look for pruned vectors by looping over the vectors table.
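Here's one way you might do that (a sketch, assuming en_core_web_md): the key2row mapping reveals which hashes ended up sharing a row after pruning.

```python
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_md")

# Vectors.key2row maps each string hash to a row in the vectors table.
# After pruning, several keys can point to the same row.
row_to_words = defaultdict(list)
for key, row in nlp.vocab.vectors.key2row.items():
    row_to_words[row].append(nlp.vocab.strings[key])

# Show a few rows that serve more than one string.
shared = [words for words in row_to_words.values() if len(words) > 1]
for words in shared[:5]:
    print(words)
```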
The small (sm) spaCy models don't ship with vectors. When you call .vector on tokens from these models, you still get a numeric vector, but it's a fallback to the internal Tok2Vec tensor. More details on the difference are discussed in the spaCy documentation.
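A quick sketch (assuming en_core_web_sm is installed) makes the difference visible:

```python
import spacy

# A sketch, assuming en_core_web_sm has been downloaded.
nlp = spacy.load("en_core_web_sm")
doc = nlp("I drink coffee")

print(len(nlp.vocab.vectors))  # 0: the small model ships without a vectors table
print(doc[2].vector[:5])       # still numeric, but it falls back to the Tok2Vec tensor
```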
It's also important to understand that these vectors do *not* carry any context. The same string in a sentence may have multiple meanings and the .vector property does not catch this!
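For example (a sketch, assuming en_core_web_md), the word "bank" gets the exact same vector in two very different sentences:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I deposited money at the bank")
doc2 = nlp("We sat on the river bank")

# Both occurrences of "bank" are looked up in the same static table,
# so the vectors are identical even though the meanings differ.
print(np.array_equal(doc1[-1].vector, doc2[-1].vector))  # True
```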
These vector tables can be used to calculate similarity scores, but they are also used to determine if a word is "out of vocabulary" via the .is_oov property. If a string does not appear in the vectors table, the .is_oov property returns True.
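As a quick sketch (again assuming en_core_web_md, with a made-up word for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("coffee blorpington")  # "blorpington" is a made-up word

for token in doc:
    # Tokens without an entry in the vectors table report is_oov=True.
    print(token.text, token.is_oov, token.has_vector)
```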
You might think that the vectors table is a dictionary that maps strings to vectors. In practice it behaves like one, but that's not how it's implemented internally! If you ask for the .keys() you get hash values instead of strings.
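You can see this for yourself with a small sketch (assuming en_core_web_md):

```python
import spacy

nlp = spacy.load("en_core_web_md")

# The keys of the vectors table are 64-bit hashes, not strings.
some_keys = list(nlp.vocab.vectors.keys())[:3]
print(some_keys)                                  # three large integers
print([nlp.vocab.strings[k] for k in some_keys])  # the strings they stand for
```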
This is where the StringStore makes an appearance. This object handles all the translation from hash to string and from string to hash. Using hashes makes everything much faster and lighter, so we need an object to handle the translation.
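A minimal sketch of the round trip:

```python
import spacy

nlp = spacy.load("en_core_web_md")

# The StringStore translates in both directions.
h = nlp.vocab.strings["coffee"]  # string -> 64-bit hash
print(h)
print(nlp.vocab.strings[h])      # hash -> "coffee"
```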
You typically won't interact with this StringStore yourself because it's more of an implementation detail, but it's good to understand that there's a mechanism that deals with the translation between hash and string.
It deserves repeating: the StringStore does *not* determine if a word is OOV! There can be strings in the StringStore that don't have vectors. The StringStore is really just an object that looks up strings by 64-bit hashes.
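Here's a sketch that demonstrates the difference (the token is made up, and en_core_web_md is assumed):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("xyzzyblorp")  # a made-up token

# Processing the text adds the string to the StringStore...
print(doc[0].text in nlp.vocab.strings)  # True
# ...but it has no row in the vectors table, so it is still out of vocabulary.
print(doc[0].is_oov)                     # True
```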
With the Vocab's StringStore and Vectors, Token objects can fetch lexical properties from a single place in memory. This helps keep things lightweight/fast.
We hope this thread helped explain some internal details! We might be interested in doing more of these long threads in the future. So if there are general topics that you'd like to see explored in more detail, let us know!
If you need help with an NLP pipeline that utilizes spaCy, we are happy to help you with our new services offering, spaCy Tailored Pipelines. The spaCy team will build you a custom natural language processing pipeline, delivered in a standardized format using spaCy's projects system.