Any way to import Python's nltk.download('punkt') into Google Cloud Functions?

Question

Is there any way to make the data from Python's nltk.download('punkt') available in Google Cloud Functions? I've found that adding the call directly to my code in main.py slows my function down significantly, since punkt has to be downloaded every time it runs. Is there a way to avoid this by providing punkt some other way?

EDIT #1: I changed my code and project structure to match what Barak suggested, but I keep getting the same error:

Error: function terminated. Recommended action: inspect logs for termination reason. Details:

**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/tmp/nltk_data'
    - '/env/nltk_data'
    - '/env/share/nltk_data'
    - '/env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Answer 1:

Take a look at the instructions for uploading files with your Cloud Function. Since you can upload files alongside your code, you can configure NLTK to use those files instead of downloading them:

Following the official NLTK documentation, you can "Set your NLTK_DATA environment variable to point to your top level nltk_data folder."

Combining these together, you'd get:

Download the data (on your computer) with python -m nltk.downloader punkt
Upload the NLTK data directory (find its path on your computer in the above documentation) as an nltk_data directory, created at the root of your function environment

Configure the code to find that folder:

import os
root = os.path.dirname(os.path.abspath(__file__))
nltk_dir = os.path.join(root, 'nltk_data')  # Your folder name here
os.environ['NLTK_DATA'] = nltk_dir
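
A detail that may matter here, stated as an assumption about the setup rather than a diagnosis: NLTK builds its search path from NLTK_DATA when it is first imported, so the variable needs to be set before any `import nltk` line runs. A minimal sketch of that ordering, where `configure_nltk_env` is a hypothetical helper and `nltk_data` is the folder name used above:

```python
import os


def configure_nltk_env(module_file, folder="nltk_data"):
    """Point NLTK_DATA at a folder that sits next to module_file.

    Hypothetical helper: call it before the first `import nltk`,
    because NLTK reads the variable when it is imported.
    """
    root = os.path.dirname(os.path.abspath(module_file))
    data_dir = os.path.join(root, folder)
    os.environ["NLTK_DATA"] = data_dir
    return data_dir


# In main.py this would be called as configure_nltk_env(__file__),
# with `import nltk` placed on a later line.
```

The function takes the file path as an argument so the same code can be tried outside a deployed function.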

EDIT: It seems that setting the path through the environment variable doesn't have the desired effect, so let's make the path explicit in the code.

On your computer, download the data:

import os
import nltk

download_dir = os.path.abspath('my_nltk_dir')
os.makedirs(download_dir, exist_ok=True)  # don't fail if the folder already exists
nltk.download('punkt', download_dir=download_dir)

Place the my_nltk_dir directory in the same folder as your Python script. The layout would be:
```
PROJECT_ROOT/
|-- my_code.py
|-- my_nltk_dir/
    |-- ...
```

In your code, refer to the data using:

import os
import nltk.data

# Build an absolute path to the bundled data, relative to this file
root = os.path.dirname(os.path.abspath(__file__))
download_dir = os.path.join(root, 'my_nltk_dir')
tokenizer = nltk.data.load(
    os.path.join(download_dir, 'tokenizers/punkt/english.pickle')
)
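
Since the error in the question comes from a file NLTK could not find, it may help to check that the bundled folder was actually deployed before handing the path to NLTK. A minimal sketch, assuming the layout shown above; `find_punkt_pickle` is a hypothetical helper, not part of NLTK, and the relative path inside the folder is taken from this answer's own code:

```python
import os


def find_punkt_pickle(data_dir, language="english"):
    """Return the path of the bundled Punkt pickle, or raise if it is absent."""
    candidate = os.path.join(
        data_dir, "tokenizers", "punkt", language + ".pickle"
    )
    if not os.path.isfile(candidate):
        # A clear message here is easier to act on than a generic lookup error
        raise FileNotFoundError(
            "Punkt data not found at %s; was the folder deployed?" % candidate
        )
    return candidate
```

The returned path could then be passed to nltk.data.load in place of the hand-built one.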

Answer 2:

Add nltk to your requirements.txt.

Install nltk on your local machine, if you haven't already:

pip install nltk

Then download the nltk_data files. In my case for tokenizers, I needed the Punkt tokenizer module:

python -m nltk.downloader punkt

Copy them (on Windows they're inside Roaming/) to your root folder (i.e. together with your functions):

cp -r C:\Users\<USER>\AppData\Roaming\nltk_data\* YOUR\ROOT\FOLDER\nltk_data\

At the beginning of your main Python function, or just before using nltk, add the following code. It works out the path where nltk_data lives and tells nltk to look inside this folder:

import os
import nltk
# Locate the bundled nltk_data folder next to this file
root = os.path.dirname(os.path.abspath(__file__))
download_dir = os.path.join(root, 'nltk_data')
os.chdir(download_dir)
nltk.data.path.append(download_dir)
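
The `os.chdir` call is probably not required for NLTK to find the data, and placing the folder at the front of the search path puts it ahead of the default locations listed in the error message. A sketch under those assumptions; `register_data_dir` is a hypothetical helper that works on any list, so in a real function it would be called with `nltk.data.path`:

```python
import os


def register_data_dir(search_path, data_dir):
    """Put data_dir at the front of search_path, at most once.

    search_path is meant to be nltk.data.path (a plain list of
    directories); it is passed in so the helper has no NLTK dependency.
    """
    data_dir = os.path.abspath(data_dir)
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path


# Intended use at module scope in main.py, so it runs when the
# instance starts rather than on every request:
#   import nltk
#   here = os.path.dirname(os.path.abspath(__file__))
#   register_data_dir(nltk.data.path, os.path.join(here, "nltk_data"))
```

The membership check keeps the list from growing if the code path is hit more than once.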

Finally, after committing/pushing (if you're using Cloud Source Repos), (re)deploy your function!

Source: https://stackoverflow.com/questions/62209018/any-way-to-import-pythons-nltk-downloadpunkt-into-google-cloud-functions

Tags

python

google-cloud-platform

google-cloud-functions

nltk