I'm encountering a difficulty when using NLTK corpora (in particular stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded and have done so with NLTK
On AWS Lambda you need to bundle the nltk Python package with your Lambda deployment and modify nltk's data.py, changing:
path += [
str('/usr/share/nltk_data'),
str('/usr/local/share/nltk_data'),
str('/usr/lib/nltk_data'),
str('/usr/local/lib/nltk_data')
]
to
path += [
str('/var/task/nltk_data')
#str('/usr/share/nltk_data'),
#str('/usr/local/share/nltk_data'),
#str('/usr/lib/nltk_data'),
#str('/usr/local/lib/nltk_data')
]
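A minimal alternative sketch, assuming your nltk_data folder is packaged at the root of the deployment (which Lambda mounts at /var/task): instead of patching data.py, append the path at runtime before loading the corpus.

import nltk

# Assumption: nltk_data sits at the root of the Lambda package,
# which Lambda mounts at /var/task; adjust if yours differs.
nltk.data.path.append("/var/task/nltk_data")

from nltk.corpus import stopwords

def lambda_handler(event, context):
    # Example use: strip English stop words from the incoming text.
    words = event.get("text", "").split()
    stops = set(stopwords.words("english"))
    filtered = [w for w in words if w.lower() not in stops]
    return {"filtered": filtered}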
You can't include the entire nltk_data directory; delete all the zip files, and if you only need stop words, keep only nltk_data -> corpora -> stopwords and drop the rest. If you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook (or any Python environment) and run one of the options below (a lighter-weight alternative is sketched after them):
nltk.download()
or
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip
or
python -m nltk.downloader all
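If you only need a couple of packages, a smaller-footprint sketch (assuming the download_dir parameter of nltk.download, and using stopwords and punkt as example packages) is to fetch just those into a local nltk_data folder that you then zip with your function:

import nltk

# Download only what the function actually needs into a local folder
# that will be packaged with the Lambda deployment; remember to delete
# the leftover .zip files afterwards, as noted above.
nltk.download("stopwords", download_dir="./nltk_data")
nltk.download("punkt", download_dir="./nltk_data")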