I'm encountering a difficulty when using NLTK corpora (in particular stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded, and have done so with NLTK's downloader.
I had the same problem before, and solved it by setting the data path (NLTK also honors the NLTK_DATA environment variable).
If your stopwords corpus is under /nltk_data (relative to the filesystem root, not under your home directory), you need to tell NLTK where to look before you try to access the corpus:
import nltk
nltk.data.path.append("/nltk_data")

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
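For context, here is a minimal sketch of how that might sit in a handler module (the event shape is illustrative). Doing the path append and corpus load at module scope means they run once per container rather than on every invocation:

import nltk
nltk.data.path.append("/nltk_data")
from nltk.corpus import stopwords

# loaded once at import time, reused across warm invocations
STOP_WORDS = set(stopwords.words("english"))

def lambda_handler(event, context):
    # illustrative event shape: {"text": "some input string"}
    words = event.get("text", "").split()
    return {"filtered": [w for w in words if w.lower() not in STOP_WORDS]}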
Another solution is to use Lambda's ephemeral storage at /tmp. You would have something like this:
import nltk
import json
from nltk.tokenize import word_tokenize

# /tmp is the only writable path in the Lambda filesystem
nltk.data.path.append("/tmp")
nltk.download("punkt", download_dir="/tmp")
At runtime, punkt will download to the /tmp directory, which is writable. However, this likely isn't a great solution under heavy concurrency, since every cold-started container repeats the download.
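One way to soften that is to skip the download when a warm container still has the data. A sketch, assuming a /tmp/nltk_data subdirectory:

import os
import nltk

NLTK_DATA_DIR = "/tmp/nltk_data"
nltk.data.path.append(NLTK_DATA_DIR)

# warm containers keep /tmp between invocations, so only hit the
# network when the tokenizer models aren't already there
if not os.path.exists(os.path.join(NLTK_DATA_DIR, "tokenizers", "punkt")):
    nltk.download("punkt", download_dir=NLTK_DATA_DIR)

from nltk.tokenize import word_tokenize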
On AWS Lambda you need to bundle the nltk Python package with your Lambda deployment and modify its data.py:
path += [
    str('/usr/share/nltk_data'),
    str('/usr/local/share/nltk_data'),
    str('/usr/lib/nltk_data'),
    str('/usr/local/lib/nltk_data')
]

to

path += [
    str('/var/task/nltk_data')
    # str('/usr/share/nltk_data'),
    # str('/usr/local/share/nltk_data'),
    # str('/usr/lib/nltk_data'),
    # str('/usr/local/lib/nltk_data')
]
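If you'd rather not patch the installed library, the same effect can be had at runtime. /var/task is where Lambda unpacks your deployment package, so appending it mirrors the edit above:

import nltk

# equivalent to the data.py edit, without touching library source
nltk.data.path.append("/var/task/nltk_data")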
You can't include the entire nltk_data directory (the full tree is far larger than Lambda's deployment package limit), so delete all the zip files, and if you only need stopwords, keep nltk_data -> corpora -> stopwords and dump the rest. If you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run
nltk.download()
or
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip
or
python -m nltk.downloader all
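If you'd rather script the trimming than do it by hand, here's a sketch: it fetches only stopwords and punkt, then drops the zip archives, which are dead weight once the folders are extracted:

import pathlib
import nltk

# fetch just the packages this answer needs into a local nltk_data folder
nltk.download("stopwords", download_dir="nltk_data")
nltk.download("punkt", download_dir="nltk_data")

# nltk.download leaves both the .zip archives and the extracted folders;
# only the folders are needed in the deployment package
for archive in pathlib.Path("nltk_data").rglob("*.zip"):
    archive.unlink()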