emr

WARN ReliableDeliverySupervisor: Association with remote system has failed, address is now gated for [5000] ms. Reason: [Disassociated]

Posted by 我是研究僧i on 2019-12-12 10:56:56

Question: I am running the following code on AWS Spark:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    case class Wiki(project: String, title: String, count: Int, byte_size: String)
    val data = sc.textFile("s3n://+++/")
      .map(_.split(" "))
      .filter(_.size == 4)
      .map(p => Wiki(p(0), p(1), p(2).trim.toInt, p(3)))
    val df = data.toDF()
    df.printSchema()
    val en_agg_df = df.filter("project = 'en'").select("title", "count").groupBy("title").sum().collect()

can after about 2 …

Spark 1.6 on EMR writing to S3 as Parquet hangs and fails

Posted by 拥有回忆 on 2019-12-12 07:38:06

Question: I'm creating an uber-jar Spark application that I spark-submit to an EMR 4.3 cluster. I'm provisioning 4 r3.xlarge instances, one as the master and the other three as core nodes, with Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console. I'm running the following command:

    spark-submit \
      --deploy-mode cluster \
      --executor-memory 4g \
      --executor-cores 2 \
      --num-executors 4 \
      --driver-memory 4g \
      --driver-cores 2 \
      --conf "spark.driver.maxResultSize=2g" \
      --conf …
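For reference, a command like this is often submitted to a running EMR cluster as a step rather than over SSH. The following is a minimal boto3 sketch, not the poster's setup: the region, cluster ID, and jar location are placeholder assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# On EMR 4.x and later, command-runner.jar runs spark-submit on the master.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "uber-jar-spark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "--executor-memory", "4g", "--executor-cores", "2",
                "--num-executors", "4",
                "--driver-memory", "4g", "--driver-cores", "2",
                "--conf", "spark.driver.maxResultSize=2g",
                "s3://my-bucket/app/uber.jar",  # hypothetical jar location
            ],
        },
    }],
)
```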

What's the aws cli command to create the default EMR-managed security groups?

Posted by 只谈情不闲聊 on 2019-12-12 04:45:30

Question: When using the EMR web console, you can create a cluster and AWS automatically creates the EMR-managed security groups named "ElasticMapReduce-master" and "ElasticMapReduce-slave". How do you create those via the AWS CLI? I found aws emr create-default-roles, but there is no aws emr create-default-security-groups.

Answer 1: As of right now, it looks like you can't. See http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-man-sec-groups.html, section "To specify Amazon EMR–managed security groups …
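Since the groups can't be created directly, the practical workaround is to launch a throwaway cluster and let EMR create them as a side effect: when no security groups are specified, EMR creates (or reuses) ElasticMapReduce-master and ElasticMapReduce-slave. A minimal boto3 sketch under that assumption; the name, region, and instance settings are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Launching any cluster without explicit security groups makes EMR
# create its default managed groups if they don't already exist.
emr.run_job_flow(
    Name="create-default-security-groups",  # hypothetical name
    ReleaseLabel="emr-4.3.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "InstanceCount": 1,  # master-only keeps the throwaway cluster cheap
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # assumes create-default-roles was run
    ServiceRole="EMR_DefaultRole",
)
```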

Getting “Existing lock /var/run/yum.pid: another copy is running as pid …” during bootstrapping in EMR

Posted by 二次信任 on 2019-12-12 03:07:21

Question: I need to install Python 3 on my EMR cluster (AMI 3.1.1) as part of a bootstrap step, so I added the following command:

    sudo yum install -y python3

But every time I get an error saying the following:

    Existing lock /var/run/yum.pid: another copy is running as pid 1829.
    Another app is currently holding the yum lock; waiting for it to exit...
    The other application is: yum

How can I avoid this error? Or is there a way to install Python 3 without going through this route?

Answer 1: The issue is that …
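The truncated answer points at lock contention: another yum process (a background update kicked off at boot) still holds the lock when the bootstrap action runs. One common workaround is to wait for the lock to be released before installing; a hedged sketch of such a bootstrap script in Python, assuming the lock file path shown in the error message:

```python
#!/usr/bin/env python
# Hypothetical bootstrap helper: wait out the background yum run,
# then install python3.
import os
import subprocess
import time

YUM_LOCK = "/var/run/yum.pid"

# Poll until the process holding the yum lock exits and removes the file.
while os.path.exists(YUM_LOCK):
    time.sleep(5)

subprocess.check_call(["sudo", "yum", "install", "-y", "python3"])
```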

Unable to paginate EMR clusters using boto

Posted by Deadly on 2019-12-11 12:34:57

Question: I have about 55 EMR clusters (all of them terminated) and have been trying to retrieve all 55 of them using the list_clusters method in boto. I've been searching for examples of paginating the result set from boto but couldn't find any. Given this statement:

    emr_object.list_clusters(cluster_states=["TERMINATED"], marker="what_should_i_use_here").clusters

I kept getting an InvalidRequestException error:

    boto.exception.EmrResponseError: EmrResponseError: 400 …
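The marker is not something you make up: the first call is issued without one, and each response carries the marker for the next page (no marker means no more pages). A sketch of that loop with boto 2, assuming the response object exposes clusters and marker attributes and that credentials and region are configured:

```python
import boto.emr

conn = boto.emr.connect_to_region("us-east-1")  # region is an assumption

clusters = []
marker = None
while True:
    # Omit the marker on the first call; afterwards pass back the one
    # the previous response returned.
    page = conn.list_clusters(cluster_states=["TERMINATED"], marker=marker)
    clusters.extend(page.clusters)
    marker = getattr(page, "marker", None)
    if not marker:
        break

print("retrieved %d clusters" % len(clusters))
```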

AWS SSE-KMS Encryption from Spark / Dataframes

Posted by 拈花ヽ惹草 on 2019-12-11 08:15:38

Question: I have configured an encryption-enabled EMR cluster (properties in emrfs-site.xml). I am using the DataFrame SaveMode.Append to write to s3n://my-bucket/path/ in S3, but I do not see the objects getting AWS KMS encrypted. However, when I do a simple insert from Hive on EMR, the objects do get AWS KMS encrypted. How can I encrypt files written from a DataFrame to S3 using SSE-KMS?

Answer 1: The problem was we were using s3a to save the files from the Spark program to EMR. AWS …
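The gist of the (truncated) answer is that the SSE-KMS properties in emrfs-site.xml are honored by EMRFS, which backs s3:// URIs on EMR, while the s3a/s3n Hadoop connectors bypass it. A minimal PySpark sketch under that assumption; df stands in for the poster's existing DataFrame and the bucket path is a placeholder:

```python
# Write through EMRFS (s3://) so the cluster's emrfs-site.xml
# SSE-KMS settings apply; s3a:// and s3n:// would bypass EMRFS.
df.write.mode("append").parquet("s3://my-bucket/path/")
```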

Problems using distcp and s3distcp with my EMR job that outputs to HDFS

Posted by 老子叫甜甜 on 2019-12-11 08:12:13

Question: I've run a job on AWS EMR and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly to S3 is the (currently unresolved) problem I describe in "Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?".) For distcp, I run (following this post's recommendation):

    elastic-mapreduce --jobflow <MY …
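For reference, on clusters of that era s3distcp was usually run as a jar step rather than from a shell. A hedged boto 2 sketch; the jobflow ID, the HDFS and S3 paths, and the s3distcp jar location are all assumptions:

```python
import boto.emr
from boto.emr.step import JarStep

conn = boto.emr.connect_to_region("us-east-1")  # region is an assumption

# Hypothetical s3distcp step copying the job's HDFS output to S3.
step = JarStep(
    name="hdfs-to-s3",
    jar="s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar",
    step_args=[
        "--src", "hdfs:///output/",          # placeholder HDFS path
        "--dest", "s3://my-bucket/output/",  # placeholder S3 path
    ],
)
conn.add_jobflow_steps("j-XXXXXXXXXXXXX", [step])  # placeholder jobflow ID
```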

Getting “file does not exist” error when running an Amazon EMR job

Posted by 穿精又带淫゛_ on 2019-12-11 04:36:06

Question: I have uploaded my data files genotype1_large_ind_large.txt and phenotype1_large_ind_large_1.txt to S3, and in the EMR UI I set the parameters as below:

    RunDear.run s3n://scalability/genotype1_large_ind_large.txt s3n://scalability/phenotype1_large_ind_large_1.txt s3n://scalability/output_1phe 33 10 4

In my class RunDear.run I distribute the files genotype1_large_ind_large.txt and phenotype1_large_ind_large_1.txt to the cache. However, after running the EMR job, I get the following error: …

How to restart HDFS on Amazon EMR

Posted by 半腔热情 on 2019-12-11 03:56:28

Question: I have made some changes to the HDFS settings on an Amazon EMR cluster, and I want to restart the namenode and the datanodes for the changes to take effect. I cannot find any start or stop scripts to do so on either the namenode (master) or the datanodes. What is the right way to restart the cluster?

Answer 1: On EMR 4.x, run the following on the master host:

    sudo /sbin/start hadoop-hdfs-namenode
    ssh -i <key.pem> <slave-hostname1> "sudo /sbin/restart hadoop-hdfs-datanode"
    ssh -i <key.pem> <slave …

Write 100 million files to S3

Posted by 江枫思渺然 on 2019-12-11 02:32:36

Question: My main aim is to split records into files according to the id of each record; there are over 15 billion records right now, and that count will certainly grow. I need a scalable solution using Amazon EMR. I have already done this for a smaller dataset of around 900 million records. The input files are in CSV format, and one of the fields needs to become the file name in the output. So say there are the following input records:

    awesomeId1, somedetail1, somedetail2
    awesomeID1, …
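The usual scalable way to do this kind of split on EMR is to let Spark partition the output by the id column instead of managing the files by hand. A hedged PySpark sketch on Spark 2.x (the question predates it); the paths and column names are placeholder assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-by-id").getOrCreate()

# Placeholder schema: the first CSV field is the id the files split on.
df = (spark.read.csv("s3://my-bucket/input/")
      .toDF("id", "detail1", "detail2"))

# Writes one directory per distinct id, e.g. .../id=awesomeId1/part-*.csv
df.write.partitionBy("id").csv("s3://my-bucket/output/", mode="overwrite")
```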