NLTK tag Dutch sentence [duplicate]


情话喂你

2 Answers

情话喂你

2020-12-15 15:11

The default nltk.pos_tag was trained on English text, so it will not tag Dutch correctly. To roll your own Dutch tagger, you have to train a new tagger on a Dutch corpus such as Alpino.

But note that the model will only be as good as:

the data it is trained on
the algorithm it is trained with

An example with UnigramTagger and BigramTagger:

>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
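The None tags above come from words that never appeared in the training data. One common way to avoid them is to put a DefaultTagger at the end of the backoff chain. The following is a minimal sketch rather than part of the original answer: it uses a tiny hand-made corpus (hypothetical data) in place of Alpino so that it runs without downloading anything, and the choice of 'noun' as the fallback tag is an assumption.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-made training set (hypothetical data) using the same coarse
# tag names that appear in the Alpino output above.
toy_corpus = [
    [("dit", "det"), ("is", "verb"), ("een", "det"), ("taal", "noun")],
    [("een", "det"), ("boek", "noun"), ("is", "verb"), ("goed", "adj")],
]

# Assumption: unknown words are most likely nouns, so use that as the fallback.
fallback = DefaultTagger("noun")

# The unigram tagger asks the fallback whenever it has not seen a word.
unitagger = UnigramTagger(toy_corpus, backoff=fallback)

tagged = unitagger.tag("NLTK is een taal".split())
print(tagged)
```

With the real corpus, the same chain would be BigramTagger, then UnigramTagger, then DefaultTagger, so every token receives some tag instead of None.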

With PerceptronTagger:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> training_corpus = list(alp.tagged_sents())
>>> tagger = PerceptronTagger(load=False)  # start empty; no need to load the pretrained English model
>>> tagger.train(training_corpus)
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> tagger.tag(sent)
[('NLTK', u'noun'), ('is', u'verb'), ('een', u'det'), ('goeda', u'adj'), ('taal', u'noun'), ('voor', u'prep'), ('het', u'det'), ('leren', u'noun'), ('over', u'prep'), ('NLP', u'noun')]

As @WasiAhmed noted, this is another good example: https://github.com/evanmiltenburg/Dutch-tagger. As @evanmiltenburg stated on GitHub, try to use a faster tagger in production.

EDITED

To evaluate a tagger, you can hold out a test set like this:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> alp_tagged_sents = list(alp.tagged_sents())
>>> len(alp_tagged_sents)
7136
>>> last_train_sent = int(len(alp_tagged_sents) / 10 * 9)
>>> train_set = alp_tagged_sents[:last_train_sent]
>>> test_set = alp_tagged_sents[last_train_sent:]
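The split above takes the last 10% of the corpus as test data. If the corpus happens to be ordered by source or topic, that tail may not be representative, so a shuffled split is a common alternative. The helper below is a sketch under that assumption; split_train_test, test_fraction and seed are names made up for illustration and are not part of NLTK.

```python
import random


def split_train_test(tagged_sents, test_fraction=0.1, seed=0):
    """Shuffle the sentences reproducibly and hold out a fraction for testing."""
    sents = list(tagged_sents)
    # A dedicated Random instance keeps the split repeatable without
    # touching the global random state.
    random.Random(seed).shuffle(sents)
    n_test = int(len(sents) * test_fraction)
    return sents[n_test:], sents[:n_test]


# Stand-in data: 20 one-token "sentences" (hypothetical).
fake_sents = [[("woord%d" % i, "noun")] for i in range(20)]
train_set, test_set = split_train_test(fake_sents)
print(len(train_set), len(test_set))
```

With the real data, the call would be split_train_test(alp.tagged_sents()).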

Then call tagger.evaluate() to get the accuracy. Its input has the same format as the input to .train(): a list of sentences, where each sentence is a list of ('word', 'tag') tuples. (Recent NLTK releases deprecate evaluate() in favour of tagger.accuracy(), which takes the same argument.)

>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(train_set)
>>> tagger.evaluate(test_set)
0.927672285043738
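The number returned is a token-level accuracy: the fraction of words in the held-out sentences whose predicted tag matches the gold tag. To make that concrete, here is a hand-rolled sketch of the same calculation. token_accuracy and always_noun are hypothetical helpers written for this illustration, and the gold sentences are made-up data.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        # Strip the gold tags and let the tagger predict them again.
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            if predicted_tag == gold_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0


def always_noun(words):
    """Dummy tagger that labels every word as a noun."""
    return [(word, "noun") for word in words]


gold_sents = [
    [("een", "det"), ("taal", "noun")],
    [("boek", "noun"), ("is", "verb")],
]
score = token_accuracy(always_noun, gold_sents)
print(score)
```

For a trained tagger, the equivalent call would be token_accuracy(tagger.tag, test_set).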


不知归路

2020-12-15 15:15

You can use this tagger (https://github.com/evanmiltenburg/Dutch-tagger) to tag Dutch sentences. Its reported accuracy is 97%.

Example (Using PerceptronTagger)

from nltk.tag.perceptron import PerceptronTagger

# This may take a few minutes. (But once loaded, the tagger is really fast!)
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_small.pickle')

# Tag a sentence.
tagger.tag('Alle vogels zijn nesten begonnen , behalve ik en jij .'.split())

Output

[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
