Train spaCy's existing POS tagger with my own training examples

I am trying to train the existing POS tagger on my own lexicon rather than starting from scratch (I do not want to create an "empty model"). spaCy's documentation says "Load the model you want to start with", and the next step is "Add the tag map to the tagger using the add_label method". However, when I load the small English model and add the tag map, it throws this error:

ValueError: [T003] Resizing pre-trained Tagger models is not currently supported.

I was wondering how it can be fixed.

I have also seen "Implementing custom POS Tagger in Spacy over existing english model : NLP - Python", but it suggests creating an "empty model", which is not what I want.

Also, spaCy's documentation does not make clear whether we need a mapping dictionary (TAG_MAP) even if the tags in our training examples are the same as the Universal Dependencies tags. Any thoughts?

from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

TAG_MAP = {"noun": {"pos": "NOUN"}, "verb": {"pos": "VERB"}, "adj": {"pos": "ADJ"}, "adv": {"pos": "ADV"}}

TRAIN_DATA = [
    ('Afrotropical', {'tags': ['adj']}), ('Afrocentricity', {'tags': ['noun']}),
    ('Afrocentric', {'tags': ['adj']}), ('Afrocentrism', {'tags': ['noun']}),
    ('Anglomania', {'tags': ['noun']}), ('Anglocentric', {'tags': ['adj']}),
    ('apraxic', {'tags': ['adj']}), ('aglycosuric', {'tags': ['adj']}),
    ('asecretory', {'tags': ['adj']}), ('aleukaemic', {'tags': ['adj']}),
    ('agrin', {'tags': ['adj']}), ('Eurotransplant', {'tags': ['noun']}),
    ('Euromarket', {'tags': ['noun']}), ('Eurocentrism', {'tags': ['noun']}),
    ('adendritic', {'tags': ['adj']}), ('asynaptic', {'tags': ['adj']}),
    ('Asynapsis', {'tags': ['noun']}), ('ametabolic', {'tags': ['adj']})
]


@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(lang="en", output_dir=None, n_iter=25):
    nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
    tagger = nlp.get_pipe('tagger')
    for tag, values in TAG_MAP.items():
        tagger.add_label(tag, values)
    nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print("Losses", losses)

    # test the trained model
    test_text = "I like Afrotropical apraxic blue eggs and Afrocentricity. A Eurotransplant is cool too. The agnathostomatous Euromarket and asypnapsis is even cooler. What about Eurocentrism?"
    doc = nlp(test_text)
    print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])


if __name__ == "__main__":
    plac.call(main)

The English model is trained on PTB (Penn Treebank) tags, not UD (Universal Dependencies) tags. spaCy's tag map gives you a pretty good idea of the correspondences, but the PTB tagset is more fine-grained than the UD tagset:

https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tag_map.py

Skip the tag_map-related code (the PTB -> UD mapping is already in the model), change the tags in your data to PTB tags (NN, NNS, JJ, etc.), and then this script should run. (You'll still have to check whether it performs well, of course.)
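As a rough sketch of the relabeling step, the coarse labels from the question can be rewritten as PTB tags before training. The CUSTOM_TO_PTB dictionary and the to_ptb helper below are hypothetical names of mine, not part of spaCy, and the mapping assumes every noun is a singular common noun and every adjective is in its base form, which real data will not always satisfy.

```python
# Hypothetical mapping from the question's coarse labels to PTB tags.
# Assumption: nouns are singular common nouns (NN) and adjectives are in
# base form (JJ); real data would need NNS, NNP, JJR, JJS, etc. as well.
CUSTOM_TO_PTB = {"noun": "NN", "verb": "VB", "adj": "JJ", "adv": "RB"}

# A few of the examples from the question, in their original format.
TRAIN_DATA = [
    ("Afrotropical", {"tags": ["adj"]}),
    ("Afrocentricity", {"tags": ["noun"]}),
    ("apraxic", {"tags": ["adj"]}),
    ("Eurotransplant", {"tags": ["noun"]}),
]


def to_ptb(train_data, mapping=CUSTOM_TO_PTB):
    """Return a copy of train_data with every tag replaced by its PTB tag."""
    converted = []
    for text, annotations in train_data:
        # A KeyError here means the data uses a label the mapping lacks.
        ptb_tags = [mapping[tag] for tag in annotations["tags"]]
        converted.append((text, {"tags": ptb_tags}))
    return converted


PTB_TRAIN_DATA = to_ptb(TRAIN_DATA)
print(PTB_TRAIN_DATA[0])  # ('Afrotropical', {'tags': ['JJ']})
```

With the data in this form, the add_label loop can simply be dropped from the script. If the goal is to keep the pretrained weights, it may also be worth checking the nlp.begin_training() call: as far as I know, in spaCy v2 it re-initializes the weights, and nlp.resume_training() is the call intended for updating an existing model.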

In general, it's better to provide training examples with full phrases or sentences, since that's what spaCy will be tagging in real usage, as in your test sentence.
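To illustrate, sentence-level examples need one PTB tag per token. The sentences and tags below are my own illustrative annotations, and check_alignment is a hypothetical helper. It splits on whitespace, which only approximates spaCy's tokenizer for punctuation-free text like this; a real check would compare against the tokens produced by nlp.make_doc(text).

```python
# Illustrative sentence-level examples with one PTB tag per token.
# The sentences and their tags are assumptions made for this sketch.
SENTENCE_TRAIN_DATA = [
    ("The Afrotropical region is large",
     {"tags": ["DT", "JJ", "NN", "VBZ", "JJ"]}),
    ("Eurocentrism shaped the debate",
     {"tags": ["NN", "VBD", "DT", "NN"]}),
]


def check_alignment(train_data):
    """Raise ValueError unless every example has one tag per token.

    Tokens are approximated with str.split(); this matches spaCy's
    tokenization only for simple text without punctuation.
    Returns the number of examples checked.
    """
    for text, annotations in train_data:
        tokens = text.split()
        tags = annotations["tags"]
        if len(tokens) != len(tags):
            raise ValueError(
                "%r has %d tokens but %d tags" % (text, len(tokens), len(tags))
            )
    return len(train_data)


print(check_alignment(SENTENCE_TRAIN_DATA), "examples are aligned")
```

Running a check like this before training catches off-by-one annotation mistakes early, before they surface as a less obvious error inside nlp.update.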

If you intend to create your own TAG_MAP, you should also disable the model's existing tagger. That way, its training on the original tags won't get in the way of the new learning.

This means you will have to create your own tagger, just as in the blank-model example, and then add it to the pipeline. I'm doing the same with the pt model; here's the relevant code:

nlp = spacy.load('pt_core_news_sm', disable=['parser', 'ner', 'tagger'])

tagger = nlp.create_pipe("tagger")
for tag, values in TAG_MAP_alternate.items():
    tagger.add_label(tag, values)
nlp.add_pipe(tagger)
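TAG_MAP_alternate is not shown above. A minimal sketch of what it could look like, assuming the same shape as the TAG_MAP in the question, together with a hypothetical missing_tags helper that checks every tag used in the training data has an entry before the labels are added:

```python
# Assumed contents: same structure as the question's TAG_MAP, mapping each
# custom tag to a dict with its coarse-grained "pos" value.
TAG_MAP_alternate = {
    "noun": {"pos": "NOUN"},
    "verb": {"pos": "VERB"},
    "adj": {"pos": "ADJ"},
    "adv": {"pos": "ADV"},
}

# Small sample in the (text, {"tags": [...]}) format used in the question.
TRAIN_DATA = [
    ("Afrotropical", {"tags": ["adj"]}),
    ("Anglomania", {"tags": ["noun"]}),
]


def missing_tags(train_data, tag_map):
    """Return the sorted tags that occur in train_data but not in tag_map."""
    used = set()
    for _text, annotations in train_data:
        used.update(annotations["tags"])
    return sorted(used - set(tag_map))


print(missing_tags(TRAIN_DATA, TAG_MAP_alternate))  # []
```

An empty list means every tag in the data is covered by the map, so the new tagger will know all the labels it is asked to learn.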

Source: https://stackoverflow.com/questions/56779217/train-spacys-existing-pos-tagger-with-my-own-training-examples
