Any way to import Python's nltk.download('punkt') into Google Cloud Functions?

筅森魡賤 submitted on 2020-12-15 05:02:01

Question


Is there any way to make the data from Python's nltk.download('punkt') available in Google Cloud Functions? Calling nltk.download('punkt') directly in main.py significantly slows down my function, because punkt is downloaded again every time the function runs. Is there a way to avoid this by loading punkt some other way?

EDIT #1: I changed my code and project structure to match what Barak suggested, but I keep getting the same error:

Error: function terminated. Recommended action: inspect logs for termination reason. Details:

**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/tmp/nltk_data'
    - '/env/nltk_data'
    - '/env/share/nltk_data'
    - '/env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Answer 1:


Take a look at the instructions for uploading files with your Cloud Function. Since you can upload files alongside your code, you can configure NLTK to use those files instead of downloading them:

Following the official NLTK documentation, you can "Set your NLTK_DATA environment variable to point to your top level nltk_data folder."

Combining these together, you'd get:

  1. Download the data (on your computer) with python -m nltk.downloader punkt
  2. Upload the NLTK data directory (the documentation above explains how to find its path on your computer) as an nltk_data directory at the root of your function's source
  3. Configure the code to find that folder:

    import os
    # Directory that contains this source file
    root = os.path.dirname(os.path.abspath(__file__))
    nltk_dir = os.path.join(root, 'nltk_data')  # Your folder name here
    os.environ['NLTK_DATA'] = nltk_dir
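
One detail worth checking with this approach: as far as I know, NLTK builds its search path from NLTK_DATA when the package is first imported, so the variable has to be set before `import nltk` runs. Here is a minimal sketch of that ordering. `configure_nltk_data` is a hypothetical helper name, and the layout (an nltk_data folder next to main.py) is an assumption taken from the steps above.

```python
import os


def configure_nltk_data(base_dir, folder="nltk_data"):
    """Point NLTK_DATA at <base_dir>/<folder>. Call this before importing nltk."""
    nltk_dir = os.path.join(os.path.abspath(base_dir), folder)
    # Fail early with a clear message if the folder was not deployed.
    if not os.path.isdir(nltk_dir):
        raise FileNotFoundError(f"bundled NLTK data not found at {nltk_dir}")
    os.environ["NLTK_DATA"] = nltk_dir
    return nltk_dir


# In main.py (assumed layout: nltk_data/ sits next to main.py):
# configure_nltk_data(os.path.dirname(os.path.abspath(__file__)))
# import nltk  # imported only after the variable is set
```

If nltk was already imported somewhere earlier (for example by another module), changing the variable afterwards will probably have no effect, which may be why the approach appeared not to work.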
    

EDIT: It seems that setting the path through the environment variable doesn't have the desired effect, so let's make the path explicit in the code:

  1. On your computer, download the data:

    import os
    import nltk

    download_dir = os.path.abspath('my_nltk_dir')
    os.makedirs(download_dir, exist_ok=True)  # don't fail if the folder already exists
    nltk.download('punkt', download_dir=download_dir)
    
  2. Place the my_nltk_dir directory in the same folder as your Python script, so the layout is:

    PROJECT_ROOT/
    |-- my_code.py
    |-- my_nltk_dir/
        |-- ...
    
  3. In your code, load the data from that folder:

    import os
    import nltk.data

    root = os.path.dirname(os.path.abspath(__file__))
    download_dir = os.path.join(root, 'my_nltk_dir')
    # Load the tokenizer directly from the bundled file
    tokenizer = nltk.data.load(
        os.path.join(download_dir, 'tokenizers/punkt/english.pickle')
    )
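
Since Cloud Functions instances are typically reused between invocations, it may help to load the tokenizer only once per instance rather than on every request. The following is a rough sketch, not a definitive implementation: `punkt_pickle_path` and `get_tokenizer` are hypothetical helper names, and it assumes an NLTK version that still ships punkt as a pickle file, as in the steps above.

```python
import functools
import os


def punkt_pickle_path(root, data_dir="my_nltk_dir"):
    """Build the path to the bundled punkt pickle under <root>/<data_dir>."""
    return os.path.join(root, data_dir, "tokenizers", "punkt", "english.pickle")


@functools.lru_cache(maxsize=None)
def get_tokenizer(pickle_path):
    """Load the tokenizer on first use; later calls return the cached object."""
    import nltk.data  # imported lazily so the cost is paid on first use only
    return nltk.data.load(pickle_path)


# Inside the function's entry point (names are illustrative):
# root = os.path.dirname(os.path.abspath(__file__))
# sentences = get_tokenizer(punkt_pickle_path(root)).tokenize(text)
```

The cache lives in the process, so it is rebuilt on each cold start but reused while the instance stays warm.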
    



Answer 2:


Add nltk to your requirements.txt.

Install nltk on your local machine, if you haven't already:

pip install nltk

Then download the nltk_data files. In my case for tokenizers, I needed the Punkt tokenizer module:

python -m nltk.downloader punkt  

Copy them (on Windows they're inside AppData\Roaming) to your root folder (i.e. next to your function code):

cp -r C:\Users\<USER>\AppData\Roaming\nltk_data\* YOUR\ROOT\FOLDER\nltk_data\       
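
If you'd rather not depend on a particular shell for the copy, the same step can be sketched in Python. `bundle_nltk_data` is a hypothetical helper name, and both paths in the example call are placeholders for your own machine.

```python
import os
import shutil


def bundle_nltk_data(source_dir, project_root, folder="nltk_data"):
    """Copy a local nltk_data directory into the function's source folder."""
    target = os.path.join(project_root, folder)
    # dirs_exist_ok (Python 3.8+) lets the copy be re-run before each deploy.
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target


# Example call; adjust both paths to your setup:
# bundle_nltk_data(os.path.expandvars(r"%APPDATA%\nltk_data"), ".")
```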

At the beginning of your main Python function, or just before using nltk, add the following code. It finds the path where nltk_data is located and tells nltk to look inside that folder:

  import os
  import nltk

  root = os.path.dirname(os.path.abspath(__file__))
  download_dir = os.path.join(root, 'nltk_data')
  os.chdir(download_dir)
  nltk.data.path.append(download_dir)  # add the bundled folder to NLTK's search path
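
The os.chdir call changes the working directory for the whole process, and I believe appending the absolute path is enough on its own. Because nltk.data.path is a plain Python list, the registration can also be made safe to repeat across invocations. This is only a sketch; `register_data_dir` is a hypothetical helper name.

```python
import os


def register_data_dir(search_path, data_dir):
    """Append data_dir to search_path (e.g. nltk.data.path) once.

    Returns True if the directory was added, False if it was already listed.
    """
    data_dir = os.path.abspath(data_dir)
    if data_dir in search_path:
        return False
    search_path.append(data_dir)
    return True


# Intended use in main.py (requires nltk to be installed):
# import nltk
# root = os.path.dirname(os.path.abspath(__file__))
# register_data_dir(nltk.data.path, os.path.join(root, "nltk_data"))
```

Calling it on every request then leaves the search path unchanged after the first call on a warm instance.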

Finally, after committing/pushing (if you're using Cloud Source Repos), (re)deploy your function!



Source: https://stackoverflow.com/questions/62209018/any-way-to-import-pythons-nltk-downloadpunkt-into-google-cloud-functions
