What to download in order to make nltk.tokenize.word_tokenize work?

温柔的废话 2020-12-30 02:23

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by a space quota. At home, I downloaded all the NLTK resources, but they take far more space than my quota allows.

2 Answers
  • 2020-12-30 03:00

    In short:

    nltk.download('punkt')
    

    would suffice.
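
    If disk space is the concern, you can also point the download at a directory of your choice and add it to NLTK's search path. A minimal sketch (the path /path/on/cluster/nltk_data is just a placeholder):

    import nltk

    target = '/path/on/cluster/nltk_data'        # placeholder: any directory within your quota
    nltk.download('punkt', download_dir=target)  # fetches only the Punkt models (~13 MB)
    nltk.data.path.append(target)                # tell NLTK to search the custom directory

    from nltk import word_tokenize
    print(word_tokenize('This is a sentence.'))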


    In long:

    You don't necessarily need to download all the models and corpora available in NLTK if you're just going to use NLTK for tokenization.

    Actually, if you're just using word_tokenize(), then you won't really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize() is basically the TreebankWordTokenizer, which shouldn't need any additional resources:

    alvas@ubi:~$ ls nltk_data/
    chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
    alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
    alvas@ubi:~$ python
    Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
    [GCC 5.3.1 20160413] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from nltk import word_tokenize
    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> tokenizer = TreebankWordTokenizer()
    >>> tokenizer.tokenize('This is a sentence.')
    ['This', 'is', 'a', 'sentence', '.']
    

    But:

    alvas@ubi:~$ ls nltk_data/
    chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
    alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
    alvas@ubi:~$ python
    Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
    [GCC 5.3.1 20160413] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from nltk import sent_tokenize
    >>> sent_tokenize('This is a sentence. This is another.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
        tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
        opened_resource = _open(resource_url)
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
        return find(path_, path + ['']).open()
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
        raise LookupError(resource_not_found)
    LookupError: 
    **********************************************************************
      Resource u'tokenizers/punkt/english.pickle' not found.  Please
      use the NLTK Downloader to obtain the resource:  >>>
      nltk.download()
      Searched in:
        - '/home/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - u''
    **********************************************************************
    
    >>> from nltk import word_tokenize
    >>> word_tokenize('This is a sentence.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
        return [token for sent in sent_tokenize(text, language)
      File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
        tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
        opened_resource = _open(resource_url)
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
        return find(path_, path + ['']).open()
      File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
        raise LookupError(resource_not_found)
    LookupError: 
    **********************************************************************
      Resource u'tokenizers/punkt/english.pickle' not found.  Please
      use the NLTK Downloader to obtain the resource:  >>>
      nltk.download()
      Searched in:
        - '/home/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - u''
    **********************************************************************
    

    But it looks like that's not the case, if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93. It seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
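
    For reference, the relevant lines of nltk/tokenize/__init__.py boil down to roughly the following (a paraphrase of the linked source, matching the traceback above, not the exact implementation), which is why word_tokenize() needs punkt even though the Treebank word tokenizer itself is purely regex-based:

    # word_tokenize() first splits the text into sentences with the punkt-based
    # sent_tokenize(), then runs the Treebank word tokenizer on each sentence.
    def word_tokenize(text, language='english'):
        return [token
                for sent in sent_tokenize(text, language)       # needs 'punkt'
                for token in _treebank_word_tokenize(sent)]     # regex-based, no model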

    I am not sure whether this is a bug or a feature, but it seems that the old idiom might be outdated given the current code:

    >>> from nltk import sent_tokenize, word_tokenize
    >>> sentences = 'This is a foo bar sentence. This is another sentence.'
    >>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
    >>> tokenized_sents
    [['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]
    

    It can simply be:

    >>> word_tokenize(sentences)
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']
    

    But note that word_tokenize() flattens the list of lists of strings into a single list of strings.


    Alternatively, you can try the new ToktokTokenizer that was added to NLTK (nltk/tokenize/toktok.py), based on https://github.com/jonsafari/tok-tok, which requires no pre-trained models.
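
    A minimal sketch of using it (the expected output is my assumption, not taken from the original answer):

    from nltk.tokenize.toktok import ToktokTokenizer

    toktok = ToktokTokenizer()
    print(toktok.tokenize(u'This is a sentence.'))
    # Expected: ['This', 'is', 'a', 'sentence', '.'] -- works with no nltk_data at all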

  • 2020-12-30 03:20

    You are right: you need the Punkt Tokenizer Models. They are about 13 MB, and nltk.download('punkt') should do the trick.
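
    A quick check (sketch) that this single download is all word_tokenize() needs:

    import nltk

    nltk.download('punkt')                       # Punkt sentence tokenizer models only (~13 MB)

    from nltk import word_tokenize
    print(word_tokenize('This is a sentence.'))  # ['This', 'is', 'a', 'sentence', '.']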
