How to import text from CoNLL format with named entities into spaCy, infer entities with my model, and write them to the same dataset (with Python)?

Question

I have a dataset in CoNLL NER format, which is basically a TSV file with two fields. The first field contains the tokens of some text, one token per line (each punctuation symbol is also treated as a token), and the second field contains the named entity tags for those tokens in BIO format.

I would like to load this dataset into spaCy, infer new named entity tags for the text with my model and write these tags into the same TSV file as the new third column. All I know is that I can infer named entities with something like this:

import spacy

nlp = spacy.load("some_spacy_ner_model")
text = "text from conll dataset"
doc = nlp(text)

I also managed to convert the CoNLL dataset into spaCy's JSON format with this CLI command:

python -m spacy convert conll_dataset.tsv /Users/user/docs -t json -c ner

But I don't know where to go from here. I could not find how to load this JSON file into a spaCy Doc. I tried this piece of code (found in spaCy's documentation):

from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab()).from_disk("sample.json")

but it throws an error saying ExtraData: unpack(b) received extra data.

I also don't know how to write the NER tags from the doc object back into the same TSV file so that tokens and tags stay aligned with the existing lines of the file.

And here's an extract from the TSV file as an example of the data I am dealing with:

The	O
epidermal	B-Protein
growth	I-Protein
factor	I-Protein
precursor	O
.	O

Answer 1:

There is a bit of a gap in the spaCy API here, since this format is usually only used for training models. It's possible, but not obvious. You have to load the corpus the way it would be loaded for training, as a GoldCorpus, which gives you tokenized but otherwise unannotated Docs, plus GoldParses that hold the annotation in a raw format.

Then you can convert the raw GoldParse annotations to the right format and add them to the Doc by hand. Here's a sketch for entities:

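import spacy
from spacy import displacy
from spacy.gold import GoldCorpus, spans_from_biluo_tags

nlp = spacy.load('en')
# GoldCorpus needs a train path and a dev path; only the dev data (second argument) is used below
gc = GoldCorpus("file.json", "file.json")
for doc, gold in gc.dev_docs(nlp, gold_preproc=True):
    # gold.ner holds the raw tags; convert them to Span objects and attach them to the Doc
    doc.ents = spans_from_biluo_tags(doc, gold.ner)
    # serve() blocks until the server is stopped, so the Docs are shown one at a time
    displacy.serve(doc, style='ent')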

dev_docs() is used here because it loads the docs without any further shuffling, augmenting, etc., as might happen for training, and because it loads the file given as the second argument to GoldCorpus. GoldCorpus requires a training file and a dev file, so the first argument is necessary, but we do nothing further with the data loaded from it.
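To make the conversion step in the sketch above more concrete, here is a minimal plain-Python illustration of what turning BILUO tags into entity spans involves. It is not spaCy's implementation: the function name biluo_to_spans is hypothetical, it assumes the raw tags are BILUO strings such as B-Protein, and it returns token-index tuples rather than Span objects.

```python
def biluo_to_spans(tags):
    """Convert BILUO tags to (start, end, label) token spans, end exclusive.

    Hypothetical helper for illustration only; spaCy's own
    spans_from_biluo_tags returns Span objects tied to a Doc.
    """
    spans = []
    start = None  # index of the token that opened the current entity
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":  # entity begins here
            start = i
        elif prefix == "L" and start is not None:  # entity ends here
            spans.append((start, i + 1, label))
            start = None
        elif prefix == "O":  # outside any entity
            start = None
    return spans


# The example sentence from the question, with its BIO tags rewritten as BILUO
tags = ["O", "B-Protein", "I-Protein", "L-Protein", "O", "O"]
spans = biluo_to_spans(tags)
```

With the example sentence, the three tokens "epidermal growth factor" come out as one Protein span covering token indices 1 to 3.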

For now, use spaCy 2.1.8 for this, since there's a bug in the gold_preproc option in 2.2.1. gold_preproc preserves your original tokenization rather than retokenizing with spaCy. If you don't care about preserving the tokenization, you can set gold_preproc=False, and spaCy's provided models will then work slightly better because the text is tokenized by spaCy itself, matching what the models expect.
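The remaining part of the question, writing predictions back as a third column, can be sketched without GoldCorpus at all. The following is a rough sketch under stated assumptions, not a tested recipe for any particular model: the names add_prediction_column and make_spacy_tagger are hypothetical, sentences are assumed to be separated by blank lines, and the spaCy wrapper assumes the 2.x convention of building a Doc from existing tokens and passing it through each pipeline component.

```python
def add_prediction_column(lines, predict_tags):
    """Append one predicted tag to every token line, keeping line order.

    lines: iterable of TSV lines ("token<TAB>gold_tag"); blank lines
           are assumed to separate sentences and are kept as-is.
    predict_tags: callable taking a list of tokens and returning a list
                  of tags of the same length.
    """
    out = []
    block = []  # token lines of the sentence currently being collected

    def flush():
        if not block:
            return
        tokens = [row.split()[0] for row in block]
        tags = predict_tags(tokens)
        if len(tags) != len(tokens):
            raise ValueError("tagger must return exactly one tag per token")
        out.extend(f"{row}\t{tag}" for row, tag in zip(block, tags))
        block.clear()

    for line in lines:
        row = line.rstrip("\n")
        if row.strip():
            block.append(row)
        else:
            flush()
            out.append(row)
    flush()
    return out


def make_spacy_tagger(nlp):
    """Wrap a loaded spaCy pipeline as a predict_tags callable (assumes spaCy 2.x)."""
    from spacy.tokens import Doc  # imported here so the rest runs without spaCy

    def predict_tags(tokens):
        doc = Doc(nlp.vocab, words=tokens)  # keep the dataset's own tokenization
        for _, component in nlp.pipeline:
            doc = component(doc)
        return [
            f"{t.ent_iob_}-{t.ent_type_}" if t.ent_iob_ in ("B", "I") else "O"
            for t in doc
        ]

    return predict_tags


def dummy_tagger(tokens):
    """Stand-in tagger used only to demonstrate the column alignment."""
    return ["O"] * len(tokens)


sample = ["The\tO", "epidermal\tB-Protein", "", "growth\tI-Protein"]
result = add_prediction_column(sample, dummy_tagger)
```

To use a real model, one would pass make_spacy_tagger(spacy.load("some_spacy_ner_model")) instead of dummy_tagger, read the lines from the TSV file, and write the returned lines back out. Because each prediction is attached to the line its token came from, the new column stays aligned with the existing rows.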

Source: https://stackoverflow.com/questions/58299682/how-to-import-text-from-connl-format-with-named-entities-into-spacy-infer-entit

Tags: python, json, spacy, ner, conll