I'm encountering a difficulty when using NLTK corpora (in particular stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded, and have done so with NLTK's downloader.
I had the same problem before, and solved it by setting the data path (NLTK also honors the NLTK_DATA environment variable).
If your stopwords corpus is under /nltk_data (relative to the filesystem root, not under your home directory), you need to tell NLTK where to look before you try to access the corpus:
import nltk
nltk.data.path.append("/nltk_data")

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
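For context, here is a minimal sketch of how that might sit in a handler module (the event shape is illustrative). Doing the path append and corpus load at module scope means they run once per container rather than on every invocation:

import nltk
nltk.data.path.append("/nltk_data")
from nltk.corpus import stopwords

# loaded once at import time, reused across warm invocations
STOP_WORDS = set(stopwords.words("english"))

def lambda_handler(event, context):
    # illustrative event shape: {"text": "some input string"}
    words = event.get("text", "").split()
    return {"filtered": [w for w in words if w.lower() not in STOP_WORDS]}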
Another solution is to use Lambda's ephemeral storage at /tmp. You would have something like this:
import nltk
import json
from nltk.tokenize import word_tokenize

# /tmp is the only writable path in the Lambda filesystem
nltk.data.path.append("/tmp")
nltk.download("punkt", download_dir="/tmp")
At runtime, punkt will download to the /tmp directory, which is writable. However, this likely isn't a great solution under heavy concurrency, since every cold-started container repeats the download.
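One way to soften that is to skip the download when a warm container still has the data. A sketch, assuming a /tmp/nltk_data subdirectory:

import os
import nltk

NLTK_DATA_DIR = "/tmp/nltk_data"
nltk.data.path.append(NLTK_DATA_DIR)

# warm containers keep /tmp between invocations, so only hit the
# network when the tokenizer models aren't already there
if not os.path.exists(os.path.join(NLTK_DATA_DIR, "tokenizers", "punkt")):
    nltk.download("punkt", download_dir=NLTK_DATA_DIR)

from nltk.tokenize import word_tokenize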
On AWS Lambda you need to bundle the nltk Python package with your Lambda deployment and modify its data.py:
path += [
    str('/usr/share/nltk_data'),
    str('/usr/local/share/nltk_data'),
    str('/usr/lib/nltk_data'),
    str('/usr/local/lib/nltk_data')
]

to

path += [
    str('/var/task/nltk_data')
    # str('/usr/share/nltk_data'),
    # str('/usr/local/share/nltk_data'),
    # str('/usr/lib/nltk_data'),
    # str('/usr/local/lib/nltk_data')
]
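If you'd rather not patch the installed library, the same effect can be had at runtime. /var/task is where Lambda unpacks your deployment package, so appending it mirrors the edit above:

import nltk

# equivalent to the data.py edit, without touching library source
nltk.data.path.append("/var/task/nltk_data")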
You can't include the entire nltk_data directory (the full tree is far larger than Lambda's deployment package limit), so delete all the zip files, and if you only need stopwords, keep nltk_data -> corpora -> stopwords and dump the rest. If you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run
nltk.download()
or
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip
or
python -m nltk.downloader all
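If you'd rather script the trimming than do it by hand, here's a sketch: it fetches only stopwords and punkt, then drops the zip archives, which are dead weight once the folders are extracted:

import pathlib
import nltk

# fetch just the packages this answer needs into a local nltk_data folder
nltk.download("stopwords", download_dir="nltk_data")
nltk.download("punkt", download_dir="nltk_data")

# nltk.download leaves both the .zip archives and the extracted folders;
# only the folders are needed in the deployment package
for archive in pathlib.Path("nltk_data").rglob("*.zip"):
    archive.unlink()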