Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

盖世英雄少女心 2020-12-12 15:56

My project uses the NLTK. How can I list the project's corpus & model requirements so they can be automatically installed? I don't want to click through the nltk.download() GUI.

4 Answers
  • 2020-12-12 16:31

    To install all NLTK corpora & models:

    python -m nltk.downloader all
    

    Alternatively, on Linux, you can use:

    sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
    

    Replace all with popular if you just want the most popular corpora & models.
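
    If you prefer to stay inside Python, the same idea works with nltk.download(); a minimal sketch, assuming the same Linux data directory as above:

    import nltk

    # Fetch the "popular" collection into a shared directory;
    # the path is only an example, any writable directory works
    nltk.download('popular', download_dir='/usr/local/share/nltk_data')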


    You may also browse the corpora & models through the command line:

    mlee@server:/scratch/jjylee/tests$ sudo python -m nltk.downloader
    [sudo] password for jjylee:
    NLTK Downloader
    ---------------------------------------------------------------------------
        d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> d
    
    Download which package (l=list; x=cancel)?
      Identifier> l
    Packages:
      [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
      [ ] basque_grammars..... Grammars for Basque
      [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
      [ ] book_grammars....... Grammars from NLTK Book
      [ ] cess_esp............ CESS-ESP Treebank
      [ ] chat80.............. Chat-80 Data Files
      [ ] city_database....... City Database
      [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
      [ ] comparative_sentences Comparative Sentence Dataset
      [ ] comtrans............ ComTrans Corpus Sample
      [ ] conll2000........... CONLL 2000 Chunking Corpus
      [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
      [ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
                               and Basque Subset)
      [ ] crubadan............ Crubadan Corpus
      [ ] dependency_treebank. Dependency Parsed Treebank
      [ ] europarl_raw........ Sample European Parliament Proceedings Parallel
                               Corpus
      [ ] floresta............ Portuguese Treebank
      [ ] framenet_v15........ FrameNet 1.5
    Hit Enter to continue: 
      [ ] framenet_v17........ FrameNet 1.7
      [ ] gazetteers.......... Gazeteer Lists
      [ ] genesis............. Genesis Corpus
      [ ] gutenberg........... Project Gutenberg Selections
      [ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
      [ ] ieer................ NIST IE-ER DATA SAMPLE
      [ ] inaugural........... C-Span Inaugural Address Corpus
      [ ] indian.............. Indian Language POS-Tagged Corpus
      [ ] jeita............... JEITA Public Morphologically Tagged Corpus (in
                               ChaSen format)
      [ ] kimmo............... PC-KIMMO Data Files
      [ ] knbc................ KNB Corpus (Annotated blog corpus)
      [ ] large_grammars...... Large context-free and feature-based grammars
                               for parser comparison
      [ ] lin_thesaurus....... Lin's Dependency Thesaurus
      [ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
                               part-of-speech tags
      [ ] machado............. Machado de Assis -- Obra Completa
      [ ] masc_tagged......... MASC Tagged Corpus
      [ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
      [ ] moses_sample........ Moses Sample Models
    Hit Enter to continue: x
    
    
    Download which package (l=list; x=cancel)?
      Identifier> conll2002
        Downloading package conll2002 to
            /afs/mit.edu/u/m/mlee/nltk_data...
          Unzipping corpora/conll2002.zip.
    
    ---------------------------------------------------------------------------
        d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader>
    
  • 2020-12-12 16:32

    In addition to the command line option already mentioned, you can programmatically install NLTK data in your Python script by adding an argument to the download() function.

    See the help(nltk.download) text, specifically:

    Individual packages can be downloaded by calling the ``download()``
    function with a single argument, giving the package identifier for the
    package that should be downloaded:
    
        >>> download('treebank') # doctest: +SKIP
        [nltk_data] Downloading package 'treebank'...
        [nltk_data]   Unzipping corpora/treebank.zip.
    

    I can confirm that this works for downloading one package at a time, or when passed a list or tuple.

    >>> import nltk
    >>> nltk.download('wordnet')
    [nltk_data] Downloading package 'wordnet' to
    [nltk_data]     C:\Users\_my-username_\AppData\Roaming\nltk_data...
    [nltk_data]   Unzipping corpora\wordnet.zip.
    True
    
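    For instance, a minimal sketch passing a list (the package names are arbitrary examples, not a required set):

    import nltk

    # Several packages in a single call
    nltk.download(['punkt', 'wordnet'])
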

    You can also re-download an already installed package without problems:

    >>> nltk.download('wordnet')
    [nltk_data] Downloading package 'wordnet' to
    [nltk_data]     C:\Users\_my-username_\AppData\Roaming\nltk_data...
    [nltk_data]   Package wordnet is already up-to-date!
    True
    

    Also, the function returns a boolean value that you can use to check whether the download succeeded:

    >>> nltk.download('not-a-real-name')
    [nltk_data] Error loading not-a-real-name: Package 'not-a-real-name'
    [nltk_data]     not found in index
    False
    
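    That boolean makes it easy to fail fast in a setup script; a minimal sketch, with a hypothetical package list:

    import nltk

    REQUIRED = ['punkt', 'wordnet', 'stopwords']

    # quiet=True suppresses the [nltk_data] progress messages
    missing = [pkg for pkg in REQUIRED if not nltk.download(pkg, quiet=True)]
    if missing:
        raise RuntimeError('Could not download NLTK packages: %s' % ', '.join(missing))
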
  • 2020-12-12 16:37

    The NLTK site lists a command line interface for downloading packages and collections at the bottom of this page:

    http://www.nltk.org/data

    The command line usage varies by which version of Python you are using, but on my Python 2.6 install I noticed I was missing the 'spanish_grammars' model, and this worked fine:

    python -m nltk.downloader spanish_grammars
    

    You mention listing the project's corpus and model requirements; while I'm not sure of a way to do that automatically, I figured I would at least share this. One rough approach is sketched below.
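
    A hedged sketch: keep the identifiers in a plain-text file (the filename nltk.txt and its contents, one identifier per line, are assumptions) and install them at setup time:

    import nltk

    # Install every identifier listed in the requirements file
    with open('nltk.txt') as f:
        for line in f:
            identifier = line.strip()
            if identifier:
                nltk.download(identifier)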

  • 2020-12-12 16:53

    I've managed to install the corpora and models inside a custom directory using the following code:

    import nltk
    nltk.download(info_or_id="popular", download_dir="/path/to/dir")
    nltk.data.path.append("/path/to/dir")
    

    This will install the "popular" collection of corpora/models inside /path/to/dir and, via data.path.append, tell NLTK where to look for them.

    You can't "freeze" the data in a requirements file, but you could add this code to your __init__, alongside some code that checks whether the files are already there, as sketched below.
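
    A minimal sketch of such a check, assuming the same /path/to/dir and probing one example resource ('corpora/wordnet') from the popular collection:

    import nltk

    NLTK_DATA_DIR = '/path/to/dir'

    def ensure_nltk_data():
        # Make sure NLTK searches the custom directory
        if NLTK_DATA_DIR not in nltk.data.path:
            nltk.data.path.append(NLTK_DATA_DIR)
        try:
            # LookupError means the data has not been downloaded yet
            nltk.data.find('corpora/wordnet')
        except LookupError:
            nltk.download(info_or_id='popular', download_dir=NLTK_DATA_DIR)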
