Can I test AWS Glue code locally?

Submitted by 隐身守侯 on 2019-12-03 08:50:19

Question


After reading the Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code consists of multiple files and packages, everything except the main script needs to be zipped. All this gives me the feeling that Glue is not suitable for any complex ETL task, as development and testing are cumbersome. I would like to be able to test my Spark code locally without uploading the code to S3 every time, and to verify the tests on a CI server without paying for a development Glue endpoint.


Answer 1:


I spoke to an AWS sales engineer, and they said no, you can only test Glue code by running a Glue transform (in the cloud). They mentioned they were testing something called Outpost to allow on-prem operations, but that it wasn't publicly available yet. So this seems like a solid "no", which is a shame, because Glue otherwise seems pretty nice. But without unit tests, it's a no-go for me.




Answer 2:


Eventually, as of Aug 28, 2019, Amazon allows you to download the binaries and

develop, compile, debug, and single-step Glue ETL scripts and complex Spark applications in Scala and Python locally.

Check out this link: https://aws.amazon.com/about-aws/whats-new/2019/08/aws-glue-releases-binaries-of-glue-etl-libraries-for-glue-jobs/
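The binaries from that announcement live in the awslabs/aws-glue-libs repository. A local setup might look like the following sketch; the branch name, helper scripts, and `SPARK_HOME` layout are assumptions based on that repository's Glue 1.0 instructions, so check its README for current details:

```shell
# Sketch: set up the released Glue ETL libraries for local development.
# Assumes Java 8 and Apache Maven are installed.
git clone https://github.com/awslabs/aws-glue-libs.git
cd aws-glue-libs
git checkout glue-1.0

# Point SPARK_HOME at the Glue-patched Spark distribution (download link
# is in the aws-glue-libs README); the path below is only a placeholder.
export SPARK_HOME=/path/to/spark-2.4.3-bin-hadoop2.8

./bin/gluepyspark               # interactive PySpark shell with Glue libs
./bin/gluesparksubmit my_job.py # run a job script locally
./bin/gluepytest                # run pytest against local Glue code
```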




Answer 3:


You can keep the Glue and PySpark code in separate files and unit-test the PySpark code locally. For zipping dependency files, we wrote a shell script that zips the files, uploads them to an S3 location, and then applies a CloudFormation template to deploy the Glue job. For detecting dependencies, we created a (glue job)_dependency.txt file.
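A minimal sketch of that separation, with plain Python rows standing in for a Spark DataFrame purely for illustration (the module layout and function names are hypothetical, not from the answer above):

```python
# transform.py (sketch) -- pure business logic with no awsglue imports,
# so it can be unit-tested locally without a dev endpoint. The same
# separation works when these functions operate on pyspark DataFrames.

def normalize_record(record):
    """Lowercase keys and strip whitespace from string values."""
    return {
        key.lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }

def filter_active(records):
    """Keep only records whose status is 'active'."""
    return [r for r in records if r.get("status") == "active"]

# glue_job.py (sketch) -- the thin entry point is the only file that
# needs the awsglue library; it just wires the catalog to the logic
# above, e.g.:
#
#   from awsglue.context import GlueContext
#   df = glue_context.create_dynamic_frame.from_catalog(...)
#   ...apply the functions from transform.py...

if __name__ == "__main__":
    rows = [
        {"Name": " Alice ", "status": "active"},
        {"Name": "Bob", "status": "inactive"},
    ]
    cleaned = [normalize_record(r) for r in rows]
    print(filter_active(cleaned))  # only Alice's record survives
```

Because `transform.py` has no Glue dependency, it can run under pytest on any CI server, which addresses the question's concern about paying for a dev endpoint just to test.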




Answer 4:


Not that I know of, and if you have a lot of remote assets it will be tricky. On Windows, I normally run a development endpoint and a local Zeppelin notebook while I am authoring my job, and I shut the endpoint down each day.

You could use the job editor > script editor to edit, save, and run the job. I'm not sure of the cost difference.




Answer 5:


Adding to CedricB's answer:

For development and testing purposes, it's not necessary to upload the code to S3. You can set up a Zeppelin notebook locally and establish an SSH connection to the dev endpoint, so that you have access to the data catalog, crawlers, etc., as well as the S3 bucket where your data resides.
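The SSH connection mentioned here is a port-forwarding tunnel from the local Zeppelin interpreter to the dev endpoint. A hedged sketch, where the key file and endpoint DNS are placeholders and the port/link-local address follow the AWS dev-endpoint tunneling docs:

```shell
# Sketch: forward the local Zeppelin interpreter port (9007) to a Glue
# dev endpoint. Replace the key path and endpoint DNS with your own.
ssh -i ~/.ssh/glue-dev-key.pem -NTL \
    9007:169.254.76.1:9007 \
    glue@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```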

After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in that S3 bucket, so that the job can be run and scheduled as well. Once all development and testing are complete, make sure to delete the dev endpoint, as you are charged even for idle time.

Regards




Answer 6:


You can do this as follows:

  1. Install PySpark using

    >> pip install pyspark==2.4.3
    
  2. Download the prebuilt AWS Glue 1.0 JAR with Python dependencies: Download_Prebuild_Glue_Jar

  3. Copy the awsglue folder and JAR file from GitHub into your PyCharm project

  4. Copy the Python code from my Git repository

  5. Run the following on your console; make sure to enter your own path:

    >> python com/mypackage/pack/glue-spark-pycharm-example.py
    

From my own blog



Source: https://stackoverflow.com/questions/48314268/can-i-test-aws-glue-code-locally
