amazon-emr

Security-Configuration Field For AWS Data Pipeline EmrCluster

不羁岁月 submitted on 2019-12-11 06:24:28
Question: I created an AWS EMR cluster through the regular EMR cluster wizard on the AWS Management Console and was able to select a security configuration; e.g., when you export the CLI command it shows --security-configuration 'mySecurityConfigurationValue'. I now need to create a similar EMR cluster through AWS Data Pipeline, but I don't see any option where I can specify this security-configuration field. The only similar fields I see are EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup,
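
One hedged workaround, if Data Pipeline's EmrCluster object really does not expose the field: create the cluster directly with boto3, whose RunJobFlow call does accept a SecurityConfiguration name. A minimal sketch; everything except the security-configuration value is a placeholder, not taken from the question.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    response = emr.run_job_flow(
        Name="pipeline-replacement-cluster",                    # placeholder name
        ReleaseLabel="emr-5.16.0",                              # placeholder release
        SecurityConfiguration="mySecurityConfigurationValue",   # the value the CLI export showed
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    )
    print(response["JobFlowId"])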

How to include jars in Hive (Amazon Hadoop env)

我是研究僧i submitted on 2019-12-11 05:23:23
Question: I need to include a newer protobuf JAR (newer than 2.5.0) in Hive. Somehow, no matter where I put the JAR, it ends up at the end of the classpath. How can I make sure the JAR is at the beginning of Hive's classpath? Answer 1: To add your own JAR to the Hive classpath so that it is placed at the beginning of the classpath and not shadowed by one of the bundled Hadoop JARs, you need to set the following environment variable: export HADOOP_USER_CLASSPATH_FIRST=true This indicates that the HADOOP_CLASSPATH
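
A minimal sketch of applying that variable cluster-wide at launch time through EMR's configuration classifications, assuming the documented hive-env/export classification behaves as described; the JAR path is a placeholder.

    import boto3

    # Assumption: the "hive-env" classification with a nested "export" block sets
    # environment variables for Hive, as documented for EMR configuration objects.
    hive_env_config = [
        {
            "Classification": "hive-env",
            "Properties": {},
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "HADOOP_USER_CLASSPATH_FIRST": "true",
                        # placeholder path to the newer protobuf JAR
                        "HADOOP_CLASSPATH": "/home/hadoop/lib/protobuf-java-3.x.jar:$HADOOP_CLASSPATH",
                    },
                }
            ],
        }
    ]

    emr = boto3.client("emr")
    # Pass this list as the Configurations= argument of run_job_flow when creating
    # the cluster (or attach it when cloning the cluster in the console).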

Getting “file does not exist” error when running an Amazon EMR job

穿精又带淫゛_ submitted on 2019-12-11 04:36:06
Question: I have uploaded my data files genotype1_large_ind_large.txt and phenotype1_large_ind_large_1.txt to S3, and in the EMR UI I set the arguments as below: RunDear.run s3n://scalability/genotype1_large_ind_large.txt s3n://scalability/phenotype1_large_ind_large_1.txt s3n://scalability/output_1phe 33 10 4 In my class RunDear.run I distribute the files genotype1_large_ind_large.txt and phenotype1_large_ind_large_1.txt to the cache. However, after running the EMR job, I get the following error:
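
Since the failure is a "file does not exist" error, one hedged first check is simply to confirm that the exact S3 keys the step references are present, assuming the bucket is literally named scalability:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    # Verify the objects referenced by the step arguments exist under these keys.
    for key in ["genotype1_large_ind_large.txt", "phenotype1_large_ind_large_1.txt"]:
        try:
            s3.head_object(Bucket="scalability", Key=key)
            print("found:", key)
        except ClientError:
            print("missing:", key)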

Write 100 million files to s3

江枫思渺然 submitted on 2019-12-11 02:32:36
Question: My main aim is to split records into files according to each record's id; there are over 15 billion records right now, and the count will certainly grow. I need a scalable solution using Amazon EMR. I have already done this for a smaller dataset of around 900 million records. Input files are in CSV format, and one of the fields needs to become the file name in the output. So say there are the following input records: awesomeId1, somedetail1, somedetail2 awesomeID1,
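
A minimal PySpark sketch of the usual approach, partitioning the output by the id column so each id lands under its own S3 prefix; the column names and paths are assumptions based on the sample rows.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-by-id").getOrCreate()

    # Column names are assumptions inferred from the sample records.
    df = (
        spark.read
        .csv("s3://my-input-bucket/input/")          # placeholder input path
        .toDF("id", "detail1", "detail2")
    )

    # Each distinct id becomes its own prefix, e.g. .../id=awesomeId1/part-*.csv
    (
        df.write
        .partitionBy("id")
        .mode("overwrite")
        .csv("s3://my-output-bucket/split-by-id/")   # placeholder output path
    )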

Import external libraries in a Hadoop MapReduce script

孤街醉人 submitted on 2019-12-11 02:30:42
Question: I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main scripts, I get item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so that each item bucket contains a list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the post-processing step. How do I import external (Python) libraries into Hadoop, so that they can be used in a
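
A hedged sketch of the common Hadoop Streaming pattern: ship a zipped copy of the library alongside the job (e.g. with the streaming -file option) and put the archive on sys.path inside the reducer. The archive name boto.zip is an assumption for illustration.

    # Inside the reducer script shipped to Hadoop Streaming.
    import os
    import sys

    # Zip archives on sys.path are importable via Python's zipimport machinery,
    # so a pure-Python library zipped as boto.zip (an assumed name) can be shipped
    # with the job and imported locally on each node.
    if os.path.exists("boto.zip"):
        sys.path.insert(0, "boto.zip")

    import boto  # resolved from the shipped archive

    # ... reducer logic that uses boto to write per-item output to S3 ...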

About Livy sessions for JupyterHub on AWS EMR Spark

為{幸葍}努か submitted on 2019-12-11 02:26:13
Question: My customer has an AD connector configured for JupyterHub installed on AWS EMR, so that different users are authenticated on JupyterHub via AD. My current understanding is that when different users submit their Spark jobs through Jupyter notebooks on JupyterHub to the shared underlying EMR Spark engine, each job is submitted via Livy to the Spark engine. Each Livy session has a related Spark session mapped to it (that is my current understanding; correct me if I am wrong). The question
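
One hedged way to inspect that mapping is to query Livy's REST API on the EMR master for active sessions. A minimal sketch; the endpoint is a placeholder, and the response fields follow Livy's documented GET /sessions payload but should be treated as assumptions here.

    import requests

    LIVY_URL = "http://<emr-master-dns>:8998"   # placeholder Livy endpoint

    resp = requests.get(f"{LIVY_URL}/sessions")
    resp.raise_for_status()

    # Print one line per active Livy session: id, the user it is proxied as, and its state.
    for session in resp.json().get("sessions", []):
        print(session.get("id"), session.get("proxyUser"), session.get("state"))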

AWS Athena: does `msck repair table` incur costs?

懵懂的女人 submitted on 2019-12-10 21:48:02
Question: I have ORC data in S3 that looks like this: s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/ s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/ s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/ Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the path-partition convention above for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions. I have
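
Whatever the billing answer turns out to be, a hedged alternative is to register only the partition the hourly job just wrote via ALTER TABLE ... ADD PARTITION through boto3's Athena client, instead of letting msck repair table rescan the whole prefix. A sketch; the table, database, result-location names and partition column types are placeholders/assumptions.

    import boto3

    athena = boto3.client("athena")

    # Register just the newly written hourly partition (values taken from the
    # example paths above; table and database names are placeholders).
    query = """
    ALTER TABLE my_orc_table ADD IF NOT EXISTS
    PARTITION (clientId='client-1', year=2017, month=3, day=16, hour=20)
    LOCATION 's3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/'
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},                            # placeholder
        ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},  # placeholder
    )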

AWS EMR Spark: Error: Cannot load main class from JAR

▼魔方 西西 submitted on 2019-12-10 20:44:07
Question: I am trying to submit a Spark job to an AWS EMR cluster using the AWS console, but it fails with: Cannot load main class from JAR. The job runs successfully when I specify the main class via --class in the Arguments option in the AWS EMR console's Add Step dialog. On my local machine, the job works perfectly fine even when no main class is specified, as below: ./spark-submit /home/astro/spark-programs/SpotEMR/MyJob.jar I have set the main class on the JAR using a run configuration. The main reason to avoid passing the main class
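
For reference, a hedged sketch of adding the step programmatically with an explicit --class, mirroring the console workaround that already works (spark-submit can also usually pick the class up from a Main-Class entry in the JAR manifest, which would avoid passing it at all). The cluster id, class name and S3 JAR path are placeholders.

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster id
        Steps=[
            {
                "Name": "SpotEMR job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "--class", "com.example.MyJobMain",   # placeholder main class
                        "s3://my-bucket/SpotEMR/MyJob.jar",   # placeholder S3 copy of the JAR
                    ],
                },
            }
        ],
    )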

Creating AWS EMR cluster with spark step using lambda function fails with “Local file does not exist”

纵饮孤独 submitted on 2019-12-10 12:15:52
Question: I'm trying to spin up an EMR cluster with a Spark step using a Lambda function. Here is my Lambda function (Python 2.7):

    import boto3

    def lambda_handler(event, context):
        conn = boto3.client("emr")
        cluster_id = conn.run_job_flow(
            Name='LSR Batch Testrun',
            ServiceRole='EMR_DefaultRole',
            JobFlowRole='EMR_EC2_DefaultRole',
            VisibleToAllUsers=True,
            LogUri='s3n://aws-logs-171256445476-ap-southeast-2/elasticmapreduce/',
            ReleaseLabel='emr-5.16.0',
            Instances={
                "Ec2SubnetId": "<my-subnet>",
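
The excerpt cuts off before the Steps definition, but one commonly reported cause of "Local file does not exist" is pointing spark-submit at an application file the cluster cannot see locally. A hedged sketch of a Steps block that runs spark-submit through command-runner.jar in cluster deploy mode, pulling the script from S3; the step name and script path are placeholders, and this is an assumption about the missing part of the question's code.

    # Passed to conn.run_job_flow(..., Steps=[spark_step]) alongside the arguments above.
    spark_step = {
        "Name": "LSR Batch spark step",            # placeholder step name
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/lsr_batch.py",   # placeholder application file on S3
            ],
        },
    }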

list S3 folder on EMR

不想你离开。 submitted on 2019-12-10 10:42:13
Question: I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job. I wanted to do the following:

    Configuration conf = spark.sparkContext().hadoopConfiguration();
    FileSystem s3 = S3FileSystem.get(conf);
    List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false))

This always fails with the following error:

    java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020

in the
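
The error itself comes from asking the default (HDFS) filesystem for an s3:// path; on the Java side the usual fix is to obtain the filesystem for the s3 URI rather than the default one. As a hedged alternative, the listing can also be done with the AWS SDK directly; "mybucket" comes from the question and the prefix is a placeholder.

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Page through every object under the bucket (optionally narrowed by Prefix).
    for page in paginator.paginate(Bucket="mybucket", Prefix=""):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])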