How do you automate PySpark jobs on EMR using boto3 (or otherwise)?


I am creating a job to parse massive amounts of server data, and then upload it into a Redshift database.

My job flow is as follows:

  • Grab the
4 Answers

    Take a look at the boto3 EMR docs to create the cluster. You essentially have to call run_job_flow and create steps that run the program you want.

    import boto3    
    
    client = boto3.client('emr', region_name='us-east-1')
    
    S3_BUCKET = 'MyS3Bucket'
    S3_KEY = 'spark/main.py'
    S3_URI = 's3://{bucket}/{key}'.format(bucket=S3_BUCKET, key=S3_KEY)
    
    # upload file to an S3 bucket
    s3 = boto3.resource('s3')
    s3.meta.client.upload_file("myfile.py", S3_BUCKET, S3_KEY)
    
    # create the cluster, install Spark, and run the steps defined below
    response = client.run_job_flow(
        Name="My Spark Cluster",
        ReleaseLabel='emr-4.6.0',
        Instances={
            'MasterInstanceType': 'm4.xlarge',
            'SlaveInstanceType': 'm4.xlarge',
            'InstanceCount': 4,
            'KeepJobFlowAliveWhenNoSteps': True,
            'TerminationProtected': False,
        },
        Applications=[
            {
                'Name': 'Spark'
            }
        ],
        BootstrapActions=[
            {
                'Name': 'Maximize Spark Default Config',
                'ScriptBootstrapAction': {
                    'Path': 's3://support.elasticmapreduce/spark/maximize-spark-default-config',
                }
            },
        ],
        # steps run in order: enable EMR debugging, copy the script from S3, then spark-submit it
        Steps=[
        {
            'Name': 'Setup Debugging',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['state-pusher-script']
            }
        },
        {
            'Name': 'setup - copy files',
            'ActionOnFailure': 'CANCEL_AND_WAIT',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['aws', 's3', 'cp', S3_URI, '/home/hadoop/']
            }
        },
        {
            'Name': 'Run Spark',
            'ActionOnFailure': 'CANCEL_AND_WAIT',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', '/home/hadoop/main.py']
            }
        }
        ],
        VisibleToAllUsers=True,
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole'
    )
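
    The cluster takes a few minutes to start before it will accept steps. As a rough sketch (assuming the standard boto3 EMR waiters; check the docs for your boto3 version), you can block until it is ready:

    # optional: wait until the cluster is up and ready to accept steps
    cluster_id = response['JobFlowId']
    waiter = client.get_waiter('cluster_running')
    waiter.wait(
        ClusterId=cluster_id,
        WaiterConfig={'Delay': 30, 'MaxAttempts': 60}  # poll every 30s, up to 30 minutes
    )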
    

    You can also add steps to a running cluster if you know the job flow id:

    job_flow_id = response['JobFlowId']
    print("Job flow ID:", job_flow_id)
    
    step_response = client.add_job_flow_steps(JobFlowId=job_flow_id, Steps=SomeMoreSteps)
    
    step_ids = step_response['StepIds']
    
    print("Step IDs:", step_ids)
    

    For more configuration options, check out sparksteps.
