Using NLTK corpora with AWS Lambda functions in Python

Asked by 遥遥无期 on 2021-01-02 02:20

I'm encountering a difficulty when using NLTK corpora (in particular stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded, and have done so with NLTK's downloader.

4 Answers
  • 2021-01-02 02:44

    I had the same problem before, but I solved it using an environment variable.

    1. Run "nltk.download()" locally and copy the downloaded data into the root folder of your AWS Lambda application. (The folder should be named "nltk_data".)
    2. In the Lambda function's configuration (in the AWS console), add an environment variable "NLTK_DATA" with the value "./nltk_data". A minimal handler that relies on this setup is sketched below.
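
    As a rough illustration (my sketch, not part of the original answer), a handler built on this setup might look like the following; the event field "text" is hypothetical:

    from nltk.corpus import stopwords
    
    # With NLTK_DATA="./nltk_data", NLTK finds the bundled corpora automatically.
    def lambda_handler(event, context):
        # "text" is a hypothetical event field used only for illustration.
        words = event.get("text", "").split()
        stops = set(stopwords.words("english"))
        return {"filtered": [w for w in words if w.lower() not in stops]}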
  • 2021-01-02 02:48

    If your stopwords corpus is under /nltk_data (at the filesystem root, not under your home directory), you need to tell NLTK where to look before you try to access the corpus:

    import nltk
    from nltk.corpus import stopwords
    
    # Add /nltk_data to the list of directories NLTK searches for data.
    nltk.data.path.append("/nltk_data")
    
    stop_words = stopwords.words('english')
    
  • 2021-01-02 02:51

    Another solution is to use Lambda's ephemeral storage, which is mounted at /tmp.

    So, you would have something like this:

    import nltk
    from nltk.tokenize import word_tokenize
    
    # /tmp is the only writable path in the Lambda execution environment.
    nltk.data.path.append("/tmp")
    
    # Fetch the punkt tokenizer models at runtime.
    nltk.download("punkt", download_dir="/tmp")
    

    At runtime, punkt will be downloaded to the /tmp directory, which is writable. However, this isn't a great solution under high concurrency, since every cold start repeats the download.
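    A rough way to soften that (my sketch, not part of the original answer) is to guard the download so containers that already have the data in /tmp skip it:

    import nltk
    
    nltk.data.path.append("/tmp")
    
    try:
        # Succeeds if punkt is already available on any search path.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        # Cold start: /tmp is still empty, so fetch the models once.
        nltk.download("punkt", download_dir="/tmp")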

  • 2021-01-02 02:56

    On AWS Lambda you need to include the nltk Python package in your deployment and modify its data.py, changing:

    path += [
        str('/usr/share/nltk_data'),
        str('/usr/local/share/nltk_data'),
        str('/usr/lib/nltk_data'),
        str('/usr/local/lib/nltk_data')
    ]
    

    to

    path += [
        str('/var/task/nltk_data')
        #str('/usr/share/nltk_data'),
        #str('/usr/local/share/nltk_data'),
        #str('/usr/lib/nltk_data'),
        #str('/usr/local/lib/nltk_data')
    ]
    

    You can't include the entire nltk_data directory, so delete all the zip files, and if you only need stopwords, keep nltk_data -> corpora -> stopwords and dump the rest. If you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run

    nltk.download()

    or download an individual package directly from

    https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip

    or

    python -m nltk.downloader all
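
    If you only need a couple of packages, a smaller alternative (my sketch; the package names are examples) is to download just those into a local folder and bundle that folder as /var/task/nltk_data:

    import nltk
    
    # Download only the required packages into a local "nltk_data" folder,
    # then ship that folder inside the deployment package.
    # Remember to delete the leftover .zip archives, as noted above.
    nltk.download("stopwords", download_dir="nltk_data")
    nltk.download("punkt", download_dir="nltk_data")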
    