EMR

Spark 1.6 on EMR writing to S3 as Parquet hangs and fails

Submitted by 有些话、适合烂在心里 on 2019-12-03 09:14:12
I'm creating an uber-jar Spark application that I submit with spark-submit to an EMR 4.3 cluster. I'm provisioning four r3.xlarge instances, one as the master and the other three as core nodes, with Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console. I'm running the following command:

spark-submit \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 4 \
  --driver-memory 4g \
  --driver-cores 2 \
  --conf "spark.driver.maxResultSize=2g" \
  --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet
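For orientation, here is a minimal sketch of what the completed command might look like, assuming the truncated committer class is Spark 1.6's org.apache.spark.sql.parquet.DirectParquetOutputCommitter (a committer often reached for to avoid slow S3 renames in this version). That class name, the plain spark.sql.parquet.output.committer.class key, and the jar path are my assumptions, not confirmed by the excerpt above:

# hedged sketch; the committer class and jar path are assumptions
spark-submit \
  --deploy-mode cluster \
  --executor-memory 4g --executor-cores 2 --num-executors 4 \
  --driver-memory 4g --driver-cores 2 \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter \
  s3://my-bucket/my-uber-assembly.jar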

How to set up Zeppelin to work with remote EMR Yarn cluster

Submitted by 守給你的承諾、 on 2019-12-03 08:30:25
I have an Amazon EMR cluster with Hadoop 2.6 and Spark 1.4.1, using the YARN resource manager. I want to deploy Zeppelin on a separate machine so I can shut down the EMR cluster when no jobs are running. I tried following the instructions at https://zeppelin.incubator.apache.org/docs/install/yarn_install.html without much success. Can somebody walk me through the steps Zeppelin needs to connect to an existing YARN cluster from a different machine? [1] Install Zeppelin with the proper build parameters:

git clone https://github.com/apache/incubator-zeppelin.git ~/zeppelin
cd ~/zeppelin
mvn clean package -Pspark-1.4 -Dhadoop
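For what it's worth, the usual way to point a standalone Zeppelin at an existing YARN cluster is to hand it the cluster's Hadoop client configuration and set the Spark master to YARN. A minimal sketch, assuming the build above succeeded, that the EMR master is reachable at the placeholder hostname emr-master-dns, and that the configs live under /etc/hadoop/conf (older AMIs keep them under /home/hadoop/conf instead):

# copy the Hadoop client configs from the EMR master (hostname is a placeholder)
mkdir -p ~/zeppelin/hadoop-conf
scp hadoop@emr-master-dns:/etc/hadoop/conf/core-site.xml ~/zeppelin/hadoop-conf/
scp hadoop@emr-master-dns:/etc/hadoop/conf/yarn-site.xml ~/zeppelin/hadoop-conf/
# point Zeppelin at them and run Spark against YARN, via conf/zeppelin-env.sh
echo 'export HADOOP_CONF_DIR=$HOME/zeppelin/hadoop-conf' >> ~/zeppelin/conf/zeppelin-env.sh
echo 'export MASTER=yarn-client' >> ~/zeppelin/conf/zeppelin-env.sh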

Spark Job error: YarnAllocator: Exit status: -100. Diagnostics: Container released on a *lost* node

Submitted by *爱你&永不变心* on 2019-12-03 06:30:13
I am running a job on AWS EMR 4.1 with Spark 1.5, using the following configuration:

spark-submit --deploy-mode cluster --master yarn-cluster \
  --driver-memory 200g --driver-cores 30 \
  --executor-memory 70g --executor-cores 8 --num-executors 90 \
  --conf spark.storage.memoryFraction=0.45 \
  --conf spark.shuffle.memoryFraction=0.75 \
  --conf spark.task.maxFailures=1 \
  --conf spark.network.timeout=1800s

Then I got the error below. Where can I find out what "Exit status: -100" means, and how might I be able to fix this problem? Thanks!

15/12/05 05:54:24 INFO TaskSetManager: Finished task 176.0 in stage 957.0 (TID 128408) in
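As the diagnostics in the title suggest, exit status -100 marks a container that was released because YARN considers its node lost, which on EMR usually means the instance itself went away (a reclaimed spot instance, an out-of-disk node, and so on). A short diagnostic sketch, not a fix, assuming SSH access to the master node; the application id is a placeholder:

# look for nodes YARN now considers LOST or UNHEALTHY
yarn node -list -all
# pull the aggregated container logs for the failed application (placeholder id)
yarn logs -applicationId application_1449000000000_0001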

YARN: What is the difference between number-of-executors and executor-cores in Spark?

Submitted by 不问归期 on 2019-12-03 06:24:30
I am learning Spark on AWS EMR. In the process I am trying to understand the difference between the number of executors (--num-executors) and executor cores (--executor-cores). Can anyone explain this? Also, when I try to submit the following job, I get an error:

spark-submit --deploy-mode cluster --master yarn \
  --num-executors 1 --executor-cores 5 --executor-memory 1g \
  -–conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py s3://test/spark-example/input/input.txt s3://test/spark-example/output21

Error: Unrecognized option: -–conf

Number of executors is the number of distinct
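A hedged sketch of a submission that should parse: the "Unrecognized option" failure above comes from the second dash in "-–conf" being an en dash rather than two ASCII hyphens, so retyping the flag fixes it. Briefly, --num-executors is how many executor JVM processes YARN launches for the application, while --executor-cores is how many tasks each of those executors can run concurrently:

# same job and S3 paths as above, with --conf retyped in plain ASCII
spark-submit --deploy-mode cluster --master yarn \
  --num-executors 1 --executor-cores 5 --executor-memory 1g \
  --conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py s3://test/spark-example/input/input.txt s3://test/spark-example/output21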

Boosting spark.yarn.executor.memoryOverhead

Submitted by £可爱£侵袭症+ on 2019-12-03 06:10:38
I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message:

Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

So I googled how to do this and found that I should pass the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it this way:

aws emr add-steps \
  --cluster-id %s \
  --profile EMR \
  --region us-west-2 \
  --steps Name=Spark,Jar=command-runner.jar,\
Args=[\
/usr/lib/spark/bin
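A hedged sketch of one way the step arguments might carry that setting through command-runner.jar (the cluster id and script path are placeholders); the detail that usually trips people up is that --conf and its value have to be separate comma-delimited items inside Args=[...]:

# placeholder cluster id and S3 script path
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Name=Spark,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--conf,spark.yarn.executor.memoryOverhead=2048,s3://my-bucket/my_job.py]'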

Any Scala SDK or interface for AWS?

Submitted by 房东的猫 on 2019-12-03 05:01:34
Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in EMR jobs.

Take a look at AWScala (a simple wrapper on top of the AWS SDK for Java): https://github.com/seratch/AWScala [UPDATE from 04/07/2015]: Another very promising library from @dwhjames, Asynchronous Scala Clients for Amazon Web Services: https://dwhjames.github.io/aws-wrap/

You could use the standard Java SDK directly from Scala without any problems; however, I'm not aware of any Scala-specific SDKs.

Atlassian's aws-scala is quite good. P.S. Currently the library has basic support for S3, DynamoDB

How to suppress INFO messages for spark-sql running on EMR?

Submitted by 半城伤御伤魂 on 2019-12-03 04:28:28
Question: I'm running Spark on EMR as described in Run Spark and Spark SQL on Amazon Elastic MapReduce: "This tutorial walks you through installing and operating Spark, a fast and general engine for large-scale data processing, on an Amazon EMR cluster. You will also create and query a dataset in Amazon S3 using Spark SQL, and learn how to monitor Spark on an Amazon EMR cluster with Amazon CloudWatch." I'm trying to suppress the INFO logs by editing $HOME/spark/conf/log4j.properties, to no avail. Output
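A hedged sketch of the kind of change that usually quiets things down, assuming the tutorial's install layout under $HOME/spark (other EMR images keep the Spark conf dir elsewhere, e.g. /usr/lib/spark/conf). Lowering the root category and then pointing spark-sql at the file explicitly covers the case where the default classpath picks up a different log4j.properties:

# drop the root log level from INFO to WARN (path assumes the tutorial's layout)
sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' \
  $HOME/spark/conf/log4j.properties
# pass the file explicitly to the driver JVM, in case another copy wins otherwise
spark-sql --driver-java-options \
  "-Dlog4j.configuration=file://$HOME/spark/conf/log4j.properties"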

How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

Submitted by 耗尽温柔 on 2019-12-03 03:24:28
I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar. When we run with external scripting languages like Ruby or Python, we can specify these configurations as follows:

ruby elastic-mapreduce -j --stream --step-name "mystream" \
  --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 \
  --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb \
  --input s3://somepath/input --output s3://somepath/output

I tried the following ways, but none of them worked:
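One approach worth sketching (the cluster id, bucket, and jar name are placeholders): if the custom jar's main class runs through Hadoop's ToolRunner/GenericOptionsParser, generic -D key=value pairs placed before the jar's own arguments are absorbed as job configuration. With the current aws CLI that might look like:

# placeholder cluster id, jar, and paths; assumes the jar honors generic -D options
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=myjob,Jar=s3://somepath/myjob.jar,Args=[-D,mapred.task.timeout=0,-D,mapred.min.split.size=52880,s3://somepath/input,s3://somepath/output]'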

How to restart yarn on AWS EMR

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-03 03:10:28
Question: I am using Hadoop 2.6.0 (the emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart YARN to bring the changes into effect. Is there a command I can use to do this?

Answer 1: Edit (10/26/2017): A more detailed Knowledge Center article on how to do this has been published officially by AWS: https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/. You can SSH into the master node of your EMR cluster and run "sudo /sbin/stop hadoop-yarn
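As a concrete sketch for an emr-4.x image, which manages Hadoop daemons with upstart: restart the ResourceManager on the master and the NodeManagers on the core/task nodes. Exact service names can differ between release labels, so it is worth confirming them first with initctl list | grep yarn:

# on the master node
sudo /sbin/stop hadoop-yarn-resourcemanager
sudo /sbin/start hadoop-yarn-resourcemanager
# on each core/task node
sudo /sbin/stop hadoop-yarn-nodemanager
sudo /sbin/start hadoop-yarn-nodemanager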

Terminating a Spark step in AWS

Submitted by 扶醉桌前 on 2019-12-03 02:58:16
I want to set up a series of Spark steps on an EMR Spark cluster and terminate the current step if it's taking too long. However, when I SSH into the master node and run hadoop jobs -list, the master node seems to believe that no jobs are running. I don't want to terminate the cluster, because doing so would force me to buy a whole new hour of whatever cluster I'm running. Can anyone please help me terminate a Spark step in EMR without terminating the entire cluster?

That's easy:

yarn application -kill [application id]

You can list your running applications with:

yarn application -list
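A small sketch chaining the two commands above, in case you want to script it; it assumes exactly one application is in the RUNNING state, otherwise pick the id by hand from the list output:

# grab the first RUNNING application's id and kill it (run on the master node)
app_id=$(yarn application -list -appStates RUNNING 2>/dev/null | awk '/^application_/{print $1; exit}')
yarn application -kill "$app_id"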