How to avoid NLTK's sentence tokenizer splitting on abbreviations?

北荒 2020-12-16 00:10

I'm currently using NLTK for language processing, but I've run into a problem with sentence tokenization.

Here's the problem: assume I have the sentence "Fig. 2 shows a U.S.A. map." The sentence tokenizer splits it after "Fig.", treating the abbreviation's period as the end of a sentence. How can I keep the tokenizer from splitting on abbreviations like this?
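
For context, a minimal sketch of the failure with NLTK's default tokenizer (assuming the punkt model has been downloaded via nltk.download('punkt')):

    from nltk.tokenize import sent_tokenize

    # The pretrained Punkt model does not know 'Fig.' as an abbreviation,
    # so it treats the period after it as a sentence boundary.
    print(sent_tokenize('Fig. 2 shows a U.S.A. map.'))
    # Likely output: ['Fig.', '2 shows a U.S.A. map.']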

1 Answer
  • 2020-12-16 01:00

    I think lowercase 'u.s.a' in the abbreviations list will work fine for you; Punkt stores abbreviation types in lowercase, without the trailing period. Try this:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    # Abbreviations are given in lowercase, with no trailing period.
    punkt_param = PunktParameters()
    abbreviation = ['u.s.a', 'fig']
    punkt_param.abbrev_types = set(abbreviation)

    # Build a tokenizer that uses these custom parameters.
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
    

    It returns this for me:

    ['Fig. 2 shows a U.S.A. map.']
    
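    If you'd rather keep the pretrained English model and just extend its abbreviation list, a commonly used variant is the sketch below. It relies on the tokenizer's internal _params attribute, so treat that as an assumption about NLTK's internals rather than a documented API:

    import nltk

    # Load the pretrained English Punkt model (assuming the standard path
    # shipped with the punkt data package).
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    # Extend its existing abbreviation set (lowercase, no trailing period).
    tokenizer._params.abbrev_types.update(['u.s.a', 'fig'])

    print(tokenizer.tokenize('Fig. 2 shows a U.S.A. map.'))
    # Expected: ['Fig. 2 shows a U.S.A. map.']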