Question
I am trying to remove the words that spaCy recognizes as named entities from a document, i.e. to remove "Sweden" and "Nokia" from the example string below. I could not find a way around the fact that entities are stored as spans, so comparing them with the single tokens of a spaCy doc raises an error.
In a later step, this process should become a function applied to several text documents stored in a pandas data frame.
I would appreciate any help, as well as advice on how to post better questions, as this is my first one here.
import spacy

nlp = spacy.load('en')
text_data = u'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)
text_no_namedentities = []
for word in document:
    # document.ents holds Span objects, so this Token-vs-Span comparison raises the TypeError
    if word not in document.ents:
        text_no_namedentities.append(word)
print(" ".join(text_no_namedentities))
It creates the following error:
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)
Answer 1:
This will get you the result you're asking for. Reviewing spaCy's Named Entity Recognition documentation should help you going forward.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)
text_no_namedentities = []
# Collect the text of every entity span so it can be compared with token text
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output:
This is a text document that speaks about entities like and
Answer 2:
The approach in the first answer does not handle entities that span multiple tokens.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    # "New" and "York" are separate tokens, so neither equals the entity text "New York"
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output
'New York is in'
Here USA is correctly removed, but New York is not: the entity text is "New York", while the loop compares it with the individual tokens "New" and "York".
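The mismatch can be made visible by printing both lists side by side. This is only a minimal sketch: it assumes a blank English pipeline with the two entities set by hand (the labels and token offsets are chosen for illustration), so that it runs without downloading a model. With `en_core_web_sm` the entities would come from the model instead.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline (tokenizer only) with hand-set entities,
# used here so the sketch does not depend on a trained model.
nlp = spacy.blank("en")
doc = nlp("New York is in USA")
doc.ents = [Span(doc, 0, 2, label="GPE"), Span(doc, 4, 5, label="GPE")]

entity_texts = [ent.text for ent in doc.ents]
token_texts = [token.text for token in doc]
print(entity_texts)
print(token_texts)

# Only a token whose text equals a whole entity text is matched,
# which is why a multi-token entity slips through.
matched = [text for text in token_texts if text in entity_texts]
print(matched)

# Each token also records the label of the entity it belongs to
# (an empty string when it is outside any entity).
token_labels = [(token.text, token.ent_type_) for token in doc]
print(token_labels)
```

The last list shows that both "New" and "York" carry an entity label on the token itself, so a per-token check does not need the span texts at all.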
Solution
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
# token.ent_type_ is an empty string for tokens that are not part of an entity
print(" ".join([token.text for token in document if not token.ent_type_]))
Output
'is in'
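Since the question mentions wrapping this in a function for several documents, the same `ent_type_` check could be packaged roughly as follows. The function name `remove_named_entities` is hypothetical, and the sketch again assumes a blank pipeline with hand-set entities so that it runs without a model. It joins `token.text_with_ws` rather than using `" ".join`, which keeps the original spacing around punctuation.

```python
import spacy
from spacy.tokens import Span


def remove_named_entities(doc):
    """Return the text of `doc` without the tokens that belong to an entity."""
    # text_with_ws is the token text plus its trailing whitespace, so the
    # remaining tokens keep the spacing they had in the original text.
    kept = [token.text_with_ws for token in doc if not token.ent_type_]
    return "".join(kept).strip()


# Assumption: entities are set by hand on a blank pipeline for illustration;
# a trained pipeline such as en_core_web_sm would predict them instead.
nlp = spacy.blank("en")

doc_a = nlp("New York is in USA")
doc_a.ents = [Span(doc_a, 0, 2, label="GPE"), Span(doc_a, 4, 5, label="GPE")]
result_a = remove_named_entities(doc_a)
print(result_a)

doc_b = nlp("This is a text document that speaks about entities like Sweden and Nokia")
doc_b.ents = [Span(doc_b, 10, 11, label="GPE"), Span(doc_b, 12, 13, label="ORG")]
result_b = remove_named_entities(doc_b)
print(result_b)
```

With a trained pipeline, several texts could be processed with `nlp.pipe(texts)` and each resulting doc passed to the function. For a pandas column, something like `df["text"].apply(lambda t: remove_named_entities(nlp(t)))` should work, where the column name `text` is an assumption.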
Source: https://stackoverflow.com/questions/59313461/removing-named-entities-from-a-document-using-spacy