A topic that often comes up on the discussions forum is spaCy's Vocab object and its vectors. So let's go over a few properties of vectors currently found in medium (md) and large (lg) models.
One of the main features of the Vocab in spaCy is the vector store. This is the single place where pre-trained word-embeddings can be found. Having a single place for these vectors saves on a lot of memory!
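To make this concrete, here's a minimal sketch (assuming the en_core_web_md model has been downloaded) showing that a token's .vector is served from the shared table on the Vocab:

```python
import spacy

# A small sketch, assuming en_core_web_md has been downloaded.
nlp = spacy.load("en_core_web_md")
doc = nlp("I drink coffee")

# Each token's .vector is fetched from the single shared table on the Vocab.
print(doc[2].vector.shape)      # the width of one word vector, e.g. (300,)
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width) for the whole table
```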
It's *not* always the case that each word stored in the vector lookup has a unique vector though! spaCy allows for some pruning in order to save on disk/memory. When two strings have similar vectors, they may get merged together. The medium (md) models typically do this.
You can inspect the meta information of the spaCy models to get an impression of how much pruning has been done. The large/medium models currently always have the same number of keys, but they differ in the number of vectors.
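As a rough sketch (assuming both the md and lg models are downloaded), you can compare the counts reported in the pipeline meta:

```python
import spacy

# A sketch comparing pruning stats; assumes both models are downloaded.
for name in ["en_core_web_md", "en_core_web_lg"]:
    nlp = spacy.load(name)
    vector_meta = nlp.meta["vectors"]
    # "keys" counts the strings that map to a vector, "vectors" counts the rows kept.
    print(name, "keys:", vector_meta["keys"], "vectors:", vector_meta["vectors"])
```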
If you're curious: you can actually look for pruned vectors by looping over the vectors table.
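Here's one way you might do that (a sketch, assuming en_core_web_md): the key2row mapping reveals which hashes ended up sharing a row after pruning.

```python
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_md")

# Vectors.key2row maps each string hash to a row in the vectors table.
# After pruning, several keys can point to the same row.
row_to_words = defaultdict(list)
for key, row in nlp.vocab.vectors.key2row.items():
    row_to_words[row].append(nlp.vocab.strings[key])

# Show a few rows that serve more than one string.
shared = [words for words in row_to_words.values() if len(words) > 1]
for words in shared[:5]:
    print(words)
```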
The small (sm) spaCy models don't ship with vectors. When you call .vector on tokens from these models, you still get a numeric vector, but it's a fallback to the internal Tok2Vec tensor. More details on the difference are discussed in the spaCy documentation.
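A quick sketch (assuming en_core_web_sm is installed) makes the difference visible:

```python
import spacy

# A sketch, assuming en_core_web_sm has been downloaded.
nlp = spacy.load("en_core_web_sm")
doc = nlp("I drink coffee")

print(len(nlp.vocab.vectors))  # 0: the small model ships without a vectors table
print(doc[2].vector[:5])       # still numeric, but it falls back to the Tok2Vec tensor
```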
It's also important to understand that these vectors do *not* carry any context. The same string in a sentence may have multiple meanings and the .vector property does not catch this!
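For example (a sketch, assuming en_core_web_md), the word "bank" gets the exact same vector in two very different sentences:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I deposited money at the bank")
doc2 = nlp("We sat on the river bank")

# Both occurrences of "bank" are looked up in the same static table,
# so the vectors are identical even though the meanings differ.
print(np.array_equal(doc1[-1].vector, doc2[-1].vector))  # True
```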
These vector tables can be used to calculate similarity scores, but they are also used to determine if a word is "out of vocabulary" via the .is_oov property. If a string does not appear in the vectors table, the .is_oov property returns True.
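As a quick sketch (again assuming en_core_web_md, with a made-up word for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("coffee blorpington")  # "blorpington" is a made-up word

for token in doc:
    # Tokens without an entry in the vectors table report is_oov=True.
    print(token.text, token.is_oov, token.has_vector)
```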
You might think that the vectors table is a dictionary that maps strings to vectors. In practice it behaves like one, but that's not how it's implemented internally! If you ask for the .keys() you get hash values instead of strings.
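You can see this for yourself with a small sketch (assuming en_core_web_md):

```python
import spacy

nlp = spacy.load("en_core_web_md")

# The keys of the vectors table are 64-bit hashes, not strings.
some_keys = list(nlp.vocab.vectors.keys())[:3]
print(some_keys)                                  # three large integers
print([nlp.vocab.strings[k] for k in some_keys])  # the strings they stand for
```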
This is where the StringStore makes an appearance. This object handles all the translation from hash to string and from string to hash. Using hashes makes everything much faster and lighter, so we need an object to handle the translation.
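A minimal sketch of the round trip:

```python
import spacy

nlp = spacy.load("en_core_web_md")

# The StringStore translates in both directions.
h = nlp.vocab.strings["coffee"]  # string -> 64-bit hash
print(h)
print(nlp.vocab.strings[h])      # hash -> "coffee"
```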
You typically won't interact with this StringStore yourself because it's more of an implementation detail, but it's good to understand that there's a mechanism that deals with the translation between hash and string.
It deserves repeating: the StringStore does *not* determine if a word is OOV! There can be strings in the StringStore that don't have vectors. The StringStore is really just an object that looks up strings by 64-bit hashes.
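Here's a sketch that demonstrates the difference (the token is made up, and en_core_web_md is assumed):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("xyzzyblorp")  # a made-up token

# Processing the text adds the string to the StringStore...
print(doc[0].text in nlp.vocab.strings)  # True
# ...but it has no row in the vectors table, so it is still out of vocabulary.
print(doc[0].is_oov)                     # True
```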
With the Vocab's StringStore and Vectors, Token objects can fetch lexical properties from a single place in memory. This helps keep things lightweight/fast.
We hope this thread helped explain some internal details! We might be interested in doing more of these long threads in the future. So if there are general topics that you'd like to see explored in more detail, let us know!
If you need help with an NLP pipeline that utilizes spaCy, we are happy to help you with our new services offering, spaCy Tailored Pipelines. The spaCy team will build you a custom natural language processing pipeline, delivered in a standardized format using spaCy's projects system.