Add streaming step to MR job in boto3 running on AWS EMR 5.0

逝去的感伤 2021-01-17 23:30

I'm trying to migrate a couple of MR jobs that I have written in Python from AWS EMR 2.4 to AWS EMR 5.0. Until now I was using boto 2.4, but it doesn't support EMR 5.0, so I'm switching to boto3. How do I add a Hadoop streaming step to an EMR 5.0 job flow with boto3?

1 Answer
  • 2021-01-18 00:06

    It's unfortunate that boto3 and the EMR API are rather poorly documented. Minimally, the word-count example would look as follows:

    import boto3
    
    emr = boto3.client('emr')
    
    resp = emr.run_job_flow(
        Name='myjob',
        ReleaseLabel='emr-5.0.0',
        Instances={
            'InstanceGroups': [
                {'Name': 'master',
                 'InstanceRole': 'MASTER',
                 'InstanceType': 'c1.medium',
                 'InstanceCount': 1,
                 'Configurations': [
                     {'Classification': 'yarn-site',
                      'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
                {'Name': 'core',
                 'InstanceRole': 'CORE',
                 'InstanceType': 'c1.medium',
                 'InstanceCount': 1,
                 'Configurations': [
                     {'Classification': 'yarn-site',
                      'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
            ]},
        Steps=[
            {'Name': 'My word count example',
             'HadoopJarStep': {
                 'Jar': 'command-runner.jar',
                 'Args': [
                     'hadoop-streaming',
                     '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                     '-mapper', 'python2.7 wordSplitter.py',
                     '-input', 's3://mybucket/input/',
                     '-output', 's3://mybucket/output/',
                     '-reducer', 'aggregate']}
             }
        ],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
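
    If the cluster is already running, the same streaming step can be appended with add_job_flow_steps instead of run_job_flow. A minimal sketch, assuming a placeholder cluster id j-XXXXXXXXXXXXX (use the JobFlowId returned by run_job_flow) and the same S3 paths as above:

    import boto3

    emr = boto3.client('emr')

    # Append the streaming step to a cluster that is already running.
    resp = emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
        Steps=[
            {'Name': 'My word count example',
             'ActionOnFailure': 'CONTINUE',
             'HadoopJarStep': {
                 'Jar': 'command-runner.jar',
                 'Args': [
                     'hadoop-streaming',
                     '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                     '-mapper', 'python2.7 wordSplitter.py',
                     '-input', 's3://mybucket/input/',
                     '-output', 's3://mybucket/output/',
                     '-reducer', 'aggregate']}
             }
        ],
    )

    # Optionally block until the step finishes before inspecting the output.
    emr.get_waiter('step_complete').wait(
        ClusterId='j-XXXXXXXXXXXXX',
        StepId=resp['StepIds'][0],
    )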
    

    I don't remember needing this with boto, but I have had issues getting even a simple streaming job to run properly without disabling yarn.nodemanager.vmem-check-enabled, which is what the yarn-site entries in the run_job_flow call above do.
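
    On EMR 4.x and later, the same yarn-site override can also be set once for the whole cluster through run_job_flow's top-level Configurations parameter rather than being repeated inside each instance group. A minimal sketch of that variant, reusing the names and instance types from the example above:

    import boto3

    emr = boto3.client('emr')

    resp = emr.run_job_flow(
        Name='myjob',
        ReleaseLabel='emr-5.0.0',
        # Cluster-wide configuration: applies to every node.
        Configurations=[
            {'Classification': 'yarn-site',
             'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}},
        ],
        Instances={
            'InstanceGroups': [
                {'Name': 'master', 'InstanceRole': 'MASTER',
                 'InstanceType': 'c1.medium', 'InstanceCount': 1},
                {'Name': 'core', 'InstanceRole': 'CORE',
                 'InstanceType': 'c1.medium', 'InstanceCount': 1},
            ]},
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )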

    Also, if your script is located somewhere on S3, download it using -files (appending #filename to the argument makes the downloaded file available as filename on the cluster).
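
    For completeness, a minimal sketch of what a wordSplitter.py-style mapper could look like (not the exact AWS sample script): with -reducer aggregate, the mapper emits keys prefixed with LongValueSum: and the aggregate reducer sums the values per key.

    #!/usr/bin/env python2.7
    # Minimal streaming mapper sketch (not the exact AWS wordSplitter.py).
    # Emits LongValueSum:<word>\t1 so the 'aggregate' reducer sums the
    # counts per word.
    import sys

    def main():
        for line in sys.stdin:
            for word in line.split():
                sys.stdout.write('LongValueSum:%s\t1\n' % word.lower())

    if __name__ == '__main__':
        main()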
