How to launch and configure an EMR cluster using boto

Submitted by *爱你&永不变心* on 2019-11-30 03:26:26

Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I treat them as synonyms.

You create a new cluster by calling the run_jobflow() method of a boto.emr.connection.EmrConnection object. It returns the cluster ID that EMR generates for you.

First, the mandatory setup:

#!/usr/bin/env python

import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')

Then we specify instance groups, including the spot price we want to pay for the TASK nodes:

instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))

Finally we start a new cluster:

cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")

We can also print the cluster ID if we care about that:

print("Starting cluster %s" % cluster_id)

I believe the minimum amount of Python that will launch an EMR cluster with boto3 is:

import boto3

client = boto3.client('emr', region_name='us-east-1')

response = client.run_job_flow(
    Name="Boto3 test cluster",
    ReleaseLabel='emr-5.12.0',
    Instances={
        'MasterInstanceType': 'm4.xlarge',
        'SlaveInstanceType': 'm4.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': 'my-subnet-id',
        'Ec2KeyName': 'my-key',
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)

Notes: you'll have to create the EMR_EC2_DefaultRole and EMR_DefaultRole roles first (the AWS CLI command aws emr create-default-roles does this). The Amazon documentation claims that JobFlowRole and ServiceRole are optional, but omitting them did not work for me. That could be because my subnet is a VPC subnet, but I'm not sure.
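run_job_flow returns as soon as the request is accepted, not when the cluster is ready. Below is a minimal sketch of reading the cluster ID from the response and blocking until the cluster is up. It assumes the response dict carries a JobFlowId key and that the EMR client offers a cluster_running waiter, which is my understanding of the boto3 API. The helper name wait_for_cluster is hypothetical.

```python
def wait_for_cluster(client, response):
    """Return the cluster ID once the cluster reports that it is running.

    `client` is expected to be a boto3 EMR client and `response` the dict
    returned by client.run_job_flow(). The helper name is hypothetical.
    """
    cluster_id = response['JobFlowId']
    # Assumption: 'cluster_running' is one of the waiters on the EMR client.
    waiter = client.get_waiter('cluster_running')
    # Blocks while polling; raises if the waiter gives up.
    waiter.wait(ClusterId=cluster_id)
    return cluster_id
```

With the snippet above this would be called as cluster_id = wait_for_cluster(client, response). Taking the client as a parameter keeps the helper free of any region or credential handling.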

I use the following code to create an EMR cluster with Flink installed and three instance groups. Reference documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow

import boto3

masterInstanceType = 'm4.large'
coreInstanceType = 'c3.xlarge'
taskInstanceType = 'm4.large'
coreInstanceNum = 2
taskInstanceNum = 2
clusterName = 'my-emr-name'

emrClient = boto3.client('emr')

logUri = 's3://bucket/xxxxxx/'
releaseLabel = 'emr-5.17.0'  # EMR release version
instances = {
    'Ec2KeyName': 'my_keyxxxxxx',
    'Ec2SubnetId': 'subnet-xxxxxx',
    'ServiceAccessSecurityGroup': 'sg-xxxxxx',
    'EmrManagedMasterSecurityGroup': 'sg-xxxxxx',
    'EmrManagedSlaveSecurityGroup': 'sg-xxxxxx',
    'KeepJobFlowAliveWhenNoSteps': True,
    'TerminationProtected': False,
    'InstanceGroups': [{
        'InstanceRole': 'MASTER',
        'InstanceCount': 1,
        'InstanceType': masterInstanceType,
        'Market': 'SPOT',
        'Name': 'Master',
    }, {
        'InstanceRole': 'CORE',
        'InstanceCount': coreInstanceNum,
        'InstanceType': coreInstanceType,
        'Market': 'SPOT',
        'Name': 'Core',
    }, {
        'InstanceRole': 'TASK',
        'InstanceCount': taskInstanceNum,
        'InstanceType': taskInstanceType,
        'Market': 'SPOT',
        'Name': 'Task',
    }]
}
bootstrapActions = [{
    'Name': 'Log to Cloudwatch Logs',
    'ScriptBootstrapAction': {
        'Path': 's3://mybucket/bootstrap_cwl.sh'
    }
}, {
    'Name': 'Custom action',
    'ScriptBootstrapAction': {
        'Path': 's3://mybucket/install.sh'
    }
}]
applications = [{'Name': 'Flink'}]
serviceRole = 'EMR_DefaultRole'
jobFlowRole = 'EMR_EC2_DefaultRole'
tags = [{'Key': 'keyxxxxxx', 'Value': 'valuexxxxxx'},
        {'Key': 'key2xxxxxx', 'Value': 'value2xxxxxx'}
        ]
steps = [
    {
        'Name': 'Run Flink',
        'ActionOnFailure': 'TERMINATE_JOB_FLOW',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['flink', 'run',
                     '-m', 'yarn-cluster',
                     '-p', str(taskInstanceNum),
                     '-yjm', '1024',
                     '-ytm', '1024',
                     '/home/hadoop/test-1.0-SNAPSHOT.jar'
                     ]
        }
    },
]
response = emrClient.run_job_flow(
    Name=clusterName,
    LogUri=logUri,
    ReleaseLabel=releaseLabel,
    Instances=instances,
    Steps=steps,
    # Configurations=configurations,  # 'configurations' is not defined in this snippet; supply your own list to use it
    BootstrapActions=bootstrapActions,
    Applications=applications,
    ServiceRole=serviceRole,
    JobFlowRole=jobFlowRole,
    Tags=tags
)
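The InstanceGroups entries differ only in role, type, count and market, so a small builder can cut the repetition. The sketch below is pure Python and makes no AWS calls. The helper name instance_group and the bid price value are hypothetical. BidPrice is, as far as I know, the boto3 counterpart of the bidprice argument in the older boto example, and it is passed as a string.

```python
def instance_group(role, instance_type, count, market='ON_DEMAND',
                   bid_price=None, name=None):
    """Build one entry for Instances['InstanceGroups'] (hypothetical helper)."""
    if role not in ('MASTER', 'CORE', 'TASK'):
        raise ValueError('unknown instance role: %r' % role)
    group = {
        'Name': name or role.capitalize(),
        'InstanceRole': role,
        'InstanceType': instance_type,
        'InstanceCount': count,
        'Market': market,
    }
    # A bid price only makes sense for SPOT groups; it is sent as a string.
    if bid_price is not None:
        if market != 'SPOT':
            raise ValueError('bid_price only applies to SPOT groups')
        group['BidPrice'] = str(bid_price)
    return group


groups = [
    instance_group('MASTER', 'm4.large', 1),
    instance_group('CORE', 'c3.xlarge', 2),
    # '0.05' is a placeholder bid, not a recommendation.
    instance_group('TASK', 'm4.large', 2, market='SPOT', bid_price='0.05'),
]
```

The resulting list can be used as the value of the 'InstanceGroups' key in the Instances dict.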

My step arguments are: bash -c /usr/bin/flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar

I am trying to execute the same run_job_flow call, but I get this error:

Cannot run program "/usr/bin/flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar" (in directory "."): error=2, No such file or directory

Running the same command on the master node works fine, but it fails when submitted from Python with boto3.

The issue seems to be the quotation marks that EMR or boto3 adds around the arguments: the whole command is treated as a single program name.

UPDATE:

Split ALL your arguments on whitespace. That is, if you need to execute "flink run myflinkjob.jar", pass your arguments as this list:

['flink','run','myflinkjob.jar']
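Splitting by hand is easy to get wrong once an argument contains quotes or spaces. One option, sketched here on the assumption that the command is written the way you would type it in a shell, is to let the standard library's shlex.split do the tokenizing. The helper name command_step is hypothetical. The step layout mirrors the command-runner.jar step shown earlier.

```python
import shlex


def command_step(name, command, action_on_failure='CONTINUE'):
    """Build a step dict whose Args are the command split into tokens."""
    return {
        'Name': name,
        'ActionOnFailure': action_on_failure,
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            # shlex.split honours shell-style quoting, so a quoted argument
            # containing spaces stays a single list element.
            'Args': shlex.split(command),
        },
    }


step = command_step(
    'Run Flink',
    'flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar')
print(step['HadoopJarStep']['Args'])
```

The returned dict can go straight into the Steps list passed to run_job_flow.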
