Ignore out-of-vocabulary words when averaging vectors in Spacy

问题

I would like to use a pre-trained word2vec model in Spacy to encode titles by (1) mapping words to their vector embeddings and (2) perform the mean of word embeddings.

To do this I use the following code:

import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector

Where nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, (2) vectorizes each word according to the dictionary provided and (3) averages the word vectors within a sentence to provide a single output vector. That's fast and cool.

However, in this process, out-of-vocabulary (OOV) terms are mapped to n-dimensional 0 vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when performing the average. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.

I have been able to do this with a post-processing step (see code below), but this becomes too slow for my purposes, so I was wondering if there is a way to tell the nlp pipeline to ignore OOV terms beforehand? So when calling nlp(sentence).vector it does not include OOV-term vectors when computing the mean

import numpy as np
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)

Approaches tried

In both cases documents is a list with 200 string elements with ≈ 400 words each.

Without dealing with OOV terms:

import spacy
import time
nlp = spacy.load('myspacy.bioword2vec.model')
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in list(nlp.pipe(documents))]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s

Ignoring OOV terms in output vector. Note that in this case we need to add an extra 'if' statment for those cases in which all words are OOV (if this happens the output vector is r_vec):

r_vec = np.random.rand(200) # Random vector for empty text
# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector

times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s

In this example the mean difference time in vectorizing 200 documents was 0.34s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents full of OOV terms, which might slightly increase computational time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.

I could always look for extra computational resources to apply the second piece of code, but I was wondering if there is any way of applying the nlp.pipe(documents) ignoring the OOV terms in the output. Any suggestion will be very much welcome.

回答1:

see this post by the author of Spacy which says:

The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.

Try this for example:

import spacy
nlp = spacy.load('en_core_web_md')
import numpy as np

sentence = "I love Stack Overflow butitsalsodistractive"

print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(clean)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])


np.array_equal(tokens.vector, tokensClean.vector)
#False

If you want to speed things up, disable the pipeline components in spacy with you don't use (such as NER, dependency parse, etc ..)

来源：https://stackoverflow.com/questions/57672043/ignore-out-of-vocabulary-words-when-averaging-vectors-in-spacy

标签

python

nlp

word2vec

spacy