I'm currently using NLTK for language processing, but I've run into a problem with sentence tokenization.
Here's the problem: assume I have the sentence "Fig. 2 shows a U.S.A. map." The default sentence tokenizer splits it at the abbreviation periods ("Fig." and "U.S.A."), so I get more than one sentence back. How can I keep it as a single sentence?
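For reference, this is roughly how to reproduce the issue with the default tokenizer (a sketch; the exact split depends on your NLTK version and installed punkt data):
from nltk.tokenize import sent_tokenize
# The pretrained Punkt model may treat the periods in "Fig." and
# "U.S.A." as sentence boundaries, producing more than one sentence.
print(sent_tokenize('Fig. 2 shows a U.S.A. map.'))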
I think using lowercase 'u.s.a' in the abbreviations list will work fine for you (Punkt stores abbreviation types lowercased and without the trailing period). Try this:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
# Tell Punkt which tokens are abbreviations (lowercase, no trailing period)
punkt_param = PunktParameters()
abbreviation = ['u.s.a', 'fig']
punkt_param.abbrev_types = set(abbreviation)
# Build a tokenizer that will not split sentences at these abbreviations
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
This returns:
['Fig. 2 shows a U.S.A. map.']
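If you would rather keep everything the pretrained English model already knows instead of starting from empty parameters, you can update its abbreviation set. This is a sketch that assumes the punkt data is installed (nltk.download('punkt')) and relies on the tokenizer's internal _params attribute, which may change between versions:
import nltk
# Load the pretrained English Punkt tokenizer shipped with the punkt data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Add our abbreviations to the model's existing set (internal attribute)
tokenizer._params.abbrev_types.update(['u.s.a', 'fig'])
print(tokenizer.tokenize('Fig. 2 shows a U.S.A. map.'))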