How to avoid NLTK's sentence tokenizer splitting on abbreviations?

北荒 2020-12-16 00:10

I'm currently using NLTK for language processing, but I've run into a problem with sentence tokenization.

Here's the problem: assume I have the sentence "Fig. 2 shows a U.S.A. map." The sentence tokenizer splits it after "Fig.", treating the abbreviation's period as the end of a sentence. How can I keep the tokenizer from splitting on abbreviations like this?
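
For context, a minimal sketch of the failure with NLTK's default tokenizer (assuming the punkt model has been downloaded via nltk.download('punkt')):

    from nltk.tokenize import sent_tokenize

    # The pretrained Punkt model does not know 'Fig.' as an abbreviation,
    # so it treats the period after it as a sentence boundary.
    print(sent_tokenize('Fig. 2 shows a U.S.A. map.'))
    # Likely output: ['Fig.', '2 shows a U.S.A. map.']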

1 Answer
  • 2020-12-16 01:00

    I think lowercase 'u.s.a' in the abbreviations list will work fine for you; Punkt stores abbreviation types in lowercase, without the trailing period. Try this:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    # Abbreviations are given in lowercase, with no trailing period.
    punkt_param = PunktParameters()
    abbreviation = ['u.s.a', 'fig']
    punkt_param.abbrev_types = set(abbreviation)

    # Build a tokenizer that uses these custom parameters.
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
    

    It returns this for me:

    ['Fig. 2 shows a U.S.A. map.']
    
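    If you'd rather keep the pretrained English model and just extend its abbreviation list, a commonly used variant is the sketch below. It relies on the tokenizer's internal _params attribute, so treat that as an assumption about NLTK's internals rather than a documented API:

    import nltk

    # Load the pretrained English Punkt model (assuming the standard path
    # shipped with the punkt data package).
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    # Extend its existing abbreviation set (lowercase, no trailing period).
    tokenizer._params.abbrev_types.update(['u.s.a', 'fig'])

    print(tokenizer.tokenize('Fig. 2 shows a U.S.A. map.'))
    # Expected: ['Fig. 2 shows a U.S.A. map.']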