How do I create gold data for TextCategorizer training?

Submitted by 老子叫甜甜 on 2019-11-30 04:50:49

Question


I want to train a TextCategorizer model with the following (text, label) pairs.

Label COLOR:

  • The door is brown.
  • The barn is red.
  • The flower is yellow.

Label ANIMAL:

  • The horse is running.
  • The fish is jumping.
  • The chicken is asleep.

I am copying the example code in the documentation for TextCategorizer.

textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

The doc variables will presumably be just nlp("The door is brown.") and so on. What should be in gold1 and gold2? I'm guessing they should be GoldParse objects, but I don't see how you represent text categorization information in those.


Answer 1:


According to the train_textcat.py example, the gold data should be a dict such as {'cats': {'ANIMAL': 0, 'COLOR': 1}} if you want to train a multi-label model. If you have only two classes, you can instead use a single label: {'cats': {'ANIMAL': 1}} for ANIMAL texts and {'cats': {'ANIMAL': 0}} for COLOR texts.
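
As a rough sketch, the six sentences from the question could be turned into this format as follows. The variable names `examples`, `labels` and `train_data` are only illustrative and are not part of spaCy.

```python
# Sentences from the question, grouped by label.
examples = {
    "COLOR": ["The door is brown.", "The barn is red.", "The flower is yellow."],
    "ANIMAL": ["The horse is running.", "The fish is jumping.", "The chicken is asleep."],
}

labels = sorted(examples)  # ['ANIMAL', 'COLOR']

# One (text, annotations) pair per sentence: the matching label gets 1, the other 0.
train_data = [
    (text, {"cats": {name: int(name == label) for name in labels}})
    for label, texts in examples.items()
    for text in texts
]

for text, annotations in train_data:
    print(text, annotations)
```

With data in this shape, both labels would need to be registered with textcat.add_label before training.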

The following is a minimal working example for text classification with a single category:

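import spacy

# spaCy v2 API; 'en' is the v2 shortcut for an installed English model
nlp = spacy.load('en')

# Each item is (text, annotations); "cats" maps a label to 1 (applies) or 0
train_data = [
    (u"That was very bad", {"cats": {"POSITIVE": 0}}),
    (u"it is so bad", {"cats": {"POSITIVE": 0}}),
    (u"so terrible", {"cats": {"POSITIVE": 0}}),
    (u"I like it", {"cats": {"POSITIVE": 1}}),
    (u"It is very good.", {"cats": {"POSITIVE": 1}}),
    (u"That was great!", {"cats": {"POSITIVE": 1}}),
]

# Create the text categorizer, add it to the pipeline, and register the label
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
textcat.add_label('POSITIVE')

# Disable the other pipes so that only textcat is initialized and trained
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(100):
        losses = {}
        for text, annotations in train_data:
            # nlp.update accepts raw texts and annotation dicts
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)

# doc.cats maps each label to a score between 0 and 1
doc = nlp(u'It is good.')
print(doc.cats)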
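
Since doc.cats is a plain dict mapping each label to a score, picking a predicted label takes one more step. Here is a minimal sketch, assuming a hypothetical helper `top_label` (not a spaCy function) and made-up scores rather than real model output:

```python
def top_label(cats, threshold=0.5):
    """Return the highest-scoring label, or None if no score reaches the threshold."""
    if not cats:
        return None
    label = max(cats, key=cats.get)
    return label if cats[label] >= threshold else None


# Made-up scores for illustration only; real values would come from doc.cats.
sample_cats = {"ANIMAL": 0.12, "COLOR": 0.88}
print(top_label(sample_cats))
```

For the single-label POSITIVE model above, a result of None would mean the text is treated as not positive. The 0.5 threshold is an arbitrary default and may need tuning.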


Source: https://stackoverflow.com/questions/48834832/how-do-i-create-gold-data-for-textcategorizer-training
