Question
So right now I have a really simple program that takes a sentence, finds the most semantically similar sentence in a given book, and prints that sentence along with the next few sentences.
import spacy
nlp = spacy.load('en_core_web_lg')

# load Alice in Wonderland from Project Gutenberg (etext 11)
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

text = strip_headers(load_etext(11)).strip()
alice = nlp(text)
sentences = list(alice.sents)

mysent = nlp("example sentence, could be whatever")

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print(sentences[sentences.index(best_match):sentences.index(best_match) + 10])
I want to get better results by telling spaCy to ignore stop words during this process, but I don't know the best way to go about it. I could create a new blank list and append each word that isn't a stop word:
for sentence in sentences:
    for word in sentence:
        if not word.is_stop:  # is_stop is a bool; comparing it to the string 'False' is always false
            newlist.append(word)
but I would have to make it more complicated than the code above, because I would need to keep the integrity of the original list of sentences (the indexes would have to match if I wanted to print out the full sentences later). Plus, if I did it this way, I would have to run this new list of lists back through spaCy in order to use the .similarity method.
I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating a list of lists so that the indexes will be identical to the original "sentences" variable.
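For what it's worth, the index-alignment part of that idea can be sketched with plain strings standing in for spaCy tokens (toy data and a hand-rolled stop list, just to illustrate keeping one filtered entry per original sentence so positions line up):

```python
# Toy stand-ins for spaCy sentences and stop words
sentences = [["this", "is", "a", "cat"], ["dogs", "bark"]]
stop_words = {"this", "is", "a"}

# One filtered list per original sentence: filtered[i] always
# corresponds to sentences[i], so indexes stay identical.
filtered = [[w for w in sent if w not in stop_words] for sent in sentences]

print(filtered)  # [['cat'], ['dogs', 'bark']]
```

With real spaCy tokens, the inner filter would test `not word.is_stop` instead of membership in a hand-rolled set.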
Thanks so much!
Answer 1:
What you need to do is overwrite the way spaCy computes similarity.
For similarity computation, spaCy first computes a vector for each Doc by averaging the vectors of its tokens (the token.vector attribute), and then performs cosine similarity by doing:
return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
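To see that formula in isolation, here is the same cosine computation on two small hand-picked vectors (the 45-degree example is mine, not from the answer):

```python
import numpy as np

def cosine(v1, v2):
    # Dot product of the vectors divided by the product of their lengths
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # 45 degrees from a
print(round(cosine(a, b), 4))  # 0.7071
```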
You have to tweak this a bit so that the vectors of stop words are not taken into account.
The following code should work for you:
import spacy
import numpy as np
from spacy.lang.en import STOP_WORDS

nlp = spacy.load('en_core_web_lg')

doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # Average the vectors of the non-stop-word tokens in each doc,
    # then return the cosine similarity of the two averages.
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    for token in doc1:
        if token.text not in STOP_WORDS:
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if token.text not in STOP_WORDS:
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))
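To plug this back into the search over the book, the question's loop can be factored around any scoring function such as compute_similarity. A sketch of that loop (find_best_match is a name I made up; the word-overlap scorer is a stub so the example runs without the model):

```python
def find_best_match(sentences, query, score):
    """Return (index, sentence) with the highest score(sentence, query).

    In the real program, `score` would be compute_similarity and the
    arguments would be spaCy Docs; any float-returning callable works.
    """
    best_i, best_val = 0, float("-inf")
    for i, sent in enumerate(sentences):
        val = score(sent, query)
        if val > best_val:
            best_i, best_val = i, val
    return best_i, sentences[best_i]

# Stub scorer: count of shared words (stand-in for compute_similarity)
def overlap(a, b):
    return len(set(a.split()) & set(b.split()))

i, match = find_best_match(["the cat sat", "dogs bark loudly"],
                           "a cat sat down", overlap)
print(i, match)  # 0 the cat sat
```

Returning the index also avoids the sentences.index(best_match) lookup from the question: printing the match plus the next few sentences becomes sentences[i:i + 10].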
Hope it helps!
Source: https://stackoverflow.com/questions/52807080/is-there-a-simple-way-to-tell-spacy-to-ignore-stop-words-when-using-similarity