Use AWS Glue Python with NumPy and Pandas Python Packages

狂风中的少年 submitted on 2019-12-17 16:28:08

Question


What is the easiest way to use packages such as NumPy and Pandas within AWS Glue, the new ETL tool on AWS? I have a completed Python script that uses NumPy and Pandas, and I would like to run it in AWS Glue.


Answer 1:


I think the current answer is that you cannot. According to the AWS Glue documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

But even when I tried to include a library written in plain Python via S3, the Glue job failed because of an HDFS permission problem. If you find a way to solve this, please let me know as well.




Answer 2:


If the libraries you need are not pure Python and you still want to use them, you can install them from within your Glue code with the script below:

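import os
import site
from setuptools.command import easy_install

try:
    from importlib import reload  # Python 3; on Python 2 reload() is a builtin
except ImportError:
    pass

# Install directory taken from the GLUE_INSTALLATION environment variable
install_path = os.environ['GLUE_INSTALLATION']
easy_install.main(["--install-dir", install_path, "<library-name>"])
reload(site)  # reload site so the newly installed package can be found

import <installed library>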



Answer 3:


When you click Run job, there is a section called "Job parameters (optional)" that is collapsed by default. Expanding it shows the following options, which you can use to point the job at libraries stored in S3. This works for me:

Python library path

s3://bucket-name/folder-name/file-name

Dependent jars path

s3://bucket-name/folder-name/file-name

Referenced files path

s3://bucket-name/folder-name/file-name
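The three console fields above correspond, as far as I know, to the Glue special job parameters `--extra-py-files`, `--extra-jars`, and `--extra-files`. Below is a minimal sketch of setting them when a job is created programmatically. `build_job_definition` is a hypothetical helper (not part of any AWS API), and the job name, role, and script path are placeholders.

```python
def build_job_definition(name, role, script_location,
                         py_files=None, jars=None, files=None):
    """Build keyword arguments for boto3's glue.create_job().

    Hypothetical helper: each list of S3 paths is joined with commas
    before being stored under the matching special parameter.
    """
    default_args = {}
    if py_files:
        default_args["--extra-py-files"] = ",".join(py_files)
    if jars:
        default_args["--extra-jars"] = ",".join(jars)
    if files:
        default_args["--extra-files"] = ",".join(files)
    return {
        "Name": name,
        "Role": role,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "DefaultArguments": default_args,
    }


job = build_job_definition(
    name="example-job",                                        # placeholder
    role="ExampleGlueRole",                                    # placeholder
    script_location="s3://bucket-name/folder-name/script.py",  # placeholder
    py_files=["s3://bucket-name/folder-name/file-name"],
)

# With AWS credentials configured, the job could then be created with:
#   import boto3
#   boto3.client("glue").create_job(**job)
```

Keeping the dictionary construction separate from the boto3 call makes it easy to inspect the arguments before anything is sent to AWS.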




Answer 4:


If you go to edit a job (or when you create a new one) there is an optional section that is collapsed called "Script libraries and job parameters (optional)". In there, you can specify an S3 bucket for Python libraries (as well as other things). I haven't tried it out myself for that part yet, but I think that's what you are looking for.




Answer 5:


There is an update:

...You can now use Python shell jobs... ...Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others.

https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
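To see which of these libraries a given Python shell job actually provides, a small check script can help. This is only a sketch: `report_versions` is a name made up for this example, and the library list simply repeats the ones named in the announcement.

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string, or None if the import fails}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    libraries = ["boto3", "numpy", "scipy", "pandas"]
    for name, version in sorted(report_versions(libraries).items()):
        print("%s: %s" % (name, version or "not installed"))
```

Running this as the job script prints one line per library, which is a quick way to confirm the environment before porting a larger script.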




Answer 6:


If you want to integrate Python modules into your AWS Glue ETL job, you can. You can use whatever Python module you want, because Glue is essentially a serverless Python runtime environment. All you need to do is package the modules that your script requires using pip install -t /path/to/your/directory, and then upload them to your S3 bucket. When creating the AWS Glue job, after pointing it to the S3 script and temp location, open the advanced job parameters option, where you will see a python_libraries option. Point that at the Python module packages that you uploaded to S3.
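The packaging step can be sketched as follows. This assumes the target directory was already populated with pip install -t, and that the library is uploaded as a single .zip archive (to my knowledge Glue expects an archive for multi-file libraries, but treat that as an assumption). `zip_directory` is a hypothetical helper name.

```python
import os
import zipfile


def zip_directory(source_dir, zip_path):
    """Zip the contents of source_dir so packages sit at the archive root.

    zip_path should be outside source_dir, otherwise the archive
    would try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for filename in files:
                full_path = os.path.join(root, filename)
                # Store each file under its path relative to source_dir.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path


# Example usage (paths are placeholders):
#   zip_directory("/path/to/your/directory", "/tmp/libs.zip")
# The archive could then be uploaded with boto3, for instance:
#   import boto3
#   boto3.client("s3").upload_file("/tmp/libs.zip", "bucket-name", "folder-name/libs.zip")
```

Storing paths relative to the install directory means `import package_name` resolves directly from the archive root instead of from a nested folder.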




Answer 7:


As of now, you can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.
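One rough way to tell whether an installed package is pure Python is to look for compiled files inside it. The sketch below is a heuristic, not an official check: the suffix list is my own assumption about what compiled extensions usually look like, and `find_compiled_files` is a hypothetical helper.

```python
import os

# File suffixes that usually indicate a compiled extension module (assumed list).
COMPILED_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def find_compiled_files(package_dir):
    """Return sorted relative paths under package_dir that look compiled."""
    found = []
    for root, _dirs, files in os.walk(package_dir):
        for filename in files:
            if filename.endswith(COMPILED_SUFFIXES):
                full_path = os.path.join(root, filename)
                found.append(os.path.relpath(full_path, package_dir))
    return sorted(found)


# Example usage (path is a placeholder):
#   compiled = find_compiled_files("/path/to/your/directory/some_package")
#   print("pure Python" if not compiled else compiled)
```

An empty result suggests the package is pure Python. A non-empty one means it ships compiled code and would fall under the restriction quoted above.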




Answer 8:


To install a specific version of a package (for instance, for an AWS Glue Python job), go to the package's page on the Python Package Index, for example the page for the package "pg8000": https://pypi.org/project/pg8000/1.12.5/#files

Then select an appropriate version, copy the link to the file, and paste it into the snippet below:

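import os
import site
from setuptools.command import easy_install

try:
    from importlib import reload  # Python 3; on Python 2 reload() is a builtin
except ImportError:
    pass

# Install directory taken from the GLUE_INSTALLATION environment variable
install_path = os.environ['GLUE_INSTALLATION']

easy_install.main( ["--install-dir", install_path, "https://files.pythonhosted.org/packages/83/03/10902758730d5cc705c0d1dd47072b6216edc652bc2e63a078b58c0b32e6/pg8000-1.12.5.tar.gz"] )
reload(site)  # reload site so the newly installed package can be found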


Source: https://stackoverflow.com/questions/46329561/use-aws-glue-python-with-numpy-and-pandas-python-packages
