How to Trigger a Glue ETL PySpark Job through S3 Events or AWS Lambda?

Submitted by 前提是你 on 2019-12-11 15:07:36

Question


I'm planning to write certain jobs in AWS Glue ETL using PySpark, and I want them to be triggered when a new file is dropped in an S3 location, just like AWS Lambda functions can be triggered by S3 events.

But I see only very limited options for triggering a Glue ETL script. Any help on this would be highly appreciated.


Answer 1:


The following should work to trigger a Glue job from AWS Lambda. Configure an S3 event notification on the appropriate bucket to invoke the Lambda function, and attach an IAM role/permissions to the function so that it can start the AWS Glue job on the user's behalf; sketches of both pieces follow the handler code.

import boto3

print('Loading function')

def lambda_handler(event, context):
    # Bucket that fired the S3 event (kept for logging/diagnostics)
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    glue = boto3.client('glue')
    gluejobname = "YOUR GLUE JOB NAME"

    try:
        # Start the Glue job, then report its initial run state
        run = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=run['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        print('Error starting Glue job {} for bucket {}. Make sure the job '
              'exists and this function has permission to start '
              'it.'.format(gluejobname, source_bucket))
        raise

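For the permissions piece, here is a minimal sketch of an inline policy the Lambda execution role would need. The role name, policy name, and the wildcard Resource are assumptions for illustration; in practice, scope Resource down to the Glue job's ARN.

import json
import boto3

iam = boto3.client('iam')

# Minimal policy: allow starting the Glue job and polling its run state.
# Resource is left broad here; scope it to the job ARN in practice.
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["glue:StartJobRun", "glue:GetJobRun"],
        "Resource": "*"
    }]
}

# 'my-lambda-glue-role' and 'StartGlueJobPolicy' are hypothetical names.
iam.put_role_policy(
    RoleName='my-lambda-glue-role',
    PolicyName='StartGlueJobPolicy',
    PolicyDocument=json.dumps(glue_policy)
)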

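To wire up the S3 event itself, the bucket needs permission to invoke the function, and a notification configuration pointing at it. A sketch with boto3 follows, where the bucket name, function name, and ARN are hypothetical placeholders:

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

bucket = 'my-input-bucket'  # hypothetical bucket name
function_arn = ('arn:aws:lambda:us-east-1:123456789012:'
                'function:start-glue-job')  # hypothetical ARN

# Allow S3 to invoke the Lambda function...
lambda_client.add_permission(
    FunctionName='start-glue-job',
    StatementId='s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::{}'.format(bucket)
)

# ...then have the bucket send ObjectCreated events to it.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*']
        }]
    }
)

The same wiring can also be done by hand in the S3 console under the bucket's event notification settings.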
Source: https://stackoverflow.com/questions/57652485/how-to-trigger-glue-etl-pyspark-job-through-s3-events-or-aws-lambda
