Use of PunktSentenceTokenizer in NLTK

不思量自难忘° 2020-12-07 16:55

I am learning Natural Language Processing with NLTK. I came across some code that uses PunktSentenceTokenizer, but I cannot understand what it actually does in that code.

4 Answers
  •  自闭症患者
    2020-12-07 17:38

    PunktSentenceTokenizer is the class behind the default sentence tokenizer in NLTK, i.e. sent_tokenize(). It is an implementation of the algorithm described in "Unsupervised Multilingual Sentence Boundary Detection" (Kiss and Strunk, 2006). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
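
To make the relationship concrete, here is a minimal sketch that calls the class directly instead of going through sent_tokenize(). It assumes an untrained tokenizer (constructed with no arguments), so it knows no abbreviations and needs no nltk_data download; the sample text is made up for illustration.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# An untrained tokenizer: no learned abbreviations, collocations,
# or sentence starters, and no nltk_data download required.
tokenizer = PunktSentenceTokenizer()

# Illustrative text (an assumption, not taken from the corpus).
text = ("Two weeks ago, I stood on the steps of this Capitol. "
        "This evening I will set forth policies.")

# tokenize() returns the sentences as a list of strings.
sentences = tokenizer.tokenize(text)
for sent in sentences:
    print(sent)
    print("--------")

# span_tokenize() yields (start, end) character offsets instead.
spans = list(tokenizer.span_tokenize(text))
```

Because this instance has no trained parameters, it will also split after abbreviations such as "Dr."; the pre-trained models used by sent_tokenize() are what avoid that.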

    Given a paragraph with multiple sentences, e.g.:

    >>> from nltk.corpus import state_union
    >>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
    >>> train_text[11]
    'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
    

    You can split it into sentences with sent_tokenize():

    >>> from nltk import sent_tokenize
    >>> sent_tokenize(train_text[11])
    ['Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', 'This evening I will set forth policies to advance that ideal at home and around the world. ']
    >>> for sent in sent_tokenize(train_text[11]):
    ...     print(sent)
    ...     print('--------')
    ... 
    Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
    --------
    This evening I will set forth policies to advance that ideal at home and around the world. 
    --------
    

    sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the languages with pre-trained models in NLTK are:

    alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
    czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
    danish.pickle    french.pickle   polish.pickle      spanish.pickle
    dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
    english.pickle   greek.pickle    PY3                turkish.pickle
    estonian.pickle  italian.pickle  README
    

    Given a text in another language, pass the language argument:

    >>> german_text = "Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
    
    >>> for sent in sent_tokenize(german_text, language='german'):
    ...     print(sent)
    ...     print('---------')
    ... 
    Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
    ---------
    Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
    ---------
    

    To train your own Punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and "training data format for nltk punkt".
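
As a rough sketch of what training can look like: Punkt learns its parameters unsupervised from raw text, so the training input is plain text with no annotations. The tiny train_text below is a hypothetical stand-in; a real model would need a far larger corpus (for example, a full speech from state_union), and the splits produced by a model trained on so little text should not be relied on.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical stand-in for a real training corpus; far too small
# to learn reliable abbreviations or sentence starters.
train_text = (
    "Mr. Speaker, Vice President Cheney, members of Congress. "
    "Two weeks ago, I stood on the steps of this Capitol. "
    "This evening I will set forth policies to advance that ideal. "
    "Mr. Speaker, the state of our union is confident and strong. "
    "Our generation has been blessed. We have known prosperity."
)

# Option 1: passing raw text to the constructor trains a model on it.
custom_tokenizer = PunktSentenceTokenizer(train_text)

# Option 2: train explicitly, then build a tokenizer from the
# learned parameters.
trainer = PunktTrainer()
trainer.train(train_text)
trained_tokenizer = PunktSentenceTokenizer(trainer.get_params())

sample = "Mr. Speaker, thank you. We will set forth new policies."
custom_sentences = custom_tokenizer.tokenize(sample)
trained_sentences = trained_tokenizer.tokenize(sample)
print(custom_sentences)
print(trained_sentences)
```

Option 1 is the shorter form and is probably what code such as PunktSentenceTokenizer(train_text) in tutorials is doing; option 2 is useful when you want to inspect or adjust the learned parameters before building the tokenizer.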
