Repeated entity labels when replacing entities with their labels using spaCy

南楼画角 submitted on 2021-01-01 09:26:08

Question


Code:

import spacy
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

#substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt","w") as f:
    f.write("\n".join(out))

Contents of input file:

Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion

Contents of output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY

I need to avoid this repetition in the sentences above. For example, in sentence 2, 'PERSON PERSON PERSON' should become a single PERSON. Thanks.


Answer 1:


Let's try:

import spacy
# spaCy v2 import; in spaCy v3 the equivalent is `from spacy.training import offsets_to_biluo_tags`
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

docs = nlp.pipe(texts)
out_text = ""
for doc in docs:
    offsets = []
    for ent in doc.ents:
        offsets.append((ent.start_char, ent.end_char, ent.label_))
    tags = biluo_tags_from_offsets(doc, offsets)
    out = []
    for tok, tag in zip(doc, tags):
        prefix = tag.split("-")[0]
        if prefix == "O":
            # not part of an entity: keep the token as-is
            out.append(tok.text + tok.whitespace_)
        elif prefix in ("U", "L"):
            # single-token entity (U) or last token of a multi-token entity (L):
            # emit the label once; B and I tokens are skipped
            out.append(tok.ent_type_ + tok.whitespace_)
    out_text += "".join(out)+"\n"

with open("out_try.txt","w") as f:
    f.write(out_text)

Contents of the output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON is here with PERSON and PERSON.
ORG is looking at buying GPE startup for MONEY
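
A different way to get the same result, shown here only as a rough sketch, is to skip the per-token tags and replace each entity by its character offsets. The helper name `replace_spans` is hypothetical, and the offsets below are hand-written stand-ins for what an NER model might return, so the example runs without spaCy:

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in `text` with its label.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # one label for the whole entity
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last entity
    return "".join(pieces)


# Hand-written offsets (an assumption, not model output)
sentence = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
spans = [(11, 27, "PERSON"), (41, 52, "PERSON"), (57, 61, "PERSON")]
print(replace_spans(sentence, spans))
```

With spaCy, the spans could be built from `doc.ents` as `(ent.start_char, ent.end_char, ent.label_)` and applied to `doc.text`. Because each entity is replaced as a whole, a multi-token entity such as `$1 billion` yields a single `MONEY`.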


Source: https://stackoverflow.com/questions/65408563/repeating-entity-in-replacing-entity-with-their-entity-label-using-spacy
