EMR

Spark 1.6 on EMR writing to S3 as Parquet hangs and fails

Submitted by 有些话、适合烂在心里 on 2019-12-03 09:14:12
I'm creating an uber-jar Spark application that I submit with spark-submit to an EMR 4.3 cluster. I'm provisioning four r3.xlarge instances, one as the master and the other three as core nodes, with Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console. I'm running the following command:

spark-submit \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 4 \
  --driver-memory 4g \
  --driver-cores 2 \
  --conf "spark.driver.maxResultSize=2g" \
  --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet
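For orientation, here is a minimal sketch of what the completed command might look like, assuming the truncated committer class is Spark 1.6's org.apache.spark.sql.parquet.DirectParquetOutputCommitter (a committer often reached for to avoid slow S3 renames in this version). That class name, the plain spark.sql.parquet.output.committer.class key, and the jar path are my assumptions, not confirmed by the excerpt above:

# hedged sketch; the committer class and jar path are assumptions
spark-submit \
  --deploy-mode cluster \
  --executor-memory 4g --executor-cores 2 --num-executors 4 \
  --driver-memory 4g --driver-cores 2 \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter \
  s3://my-bucket/my-uber-assembly.jar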

How to set up Zeppelin to work with remote EMR Yarn cluster

Submitted by 守給你的承諾、 on 2019-12-03 08:30:25
I have an Amazon EMR cluster with Hadoop 2.6 and Spark 1.4.1, using the YARN resource manager. I want to deploy Zeppelin on a separate machine so I can shut down the EMR cluster when no jobs are running. I tried following the instructions at https://zeppelin.incubator.apache.org/docs/install/yarn_install.html without much success. Can somebody walk me through the steps Zeppelin needs to connect to an existing YARN cluster from a different machine? [1] Install Zeppelin with the proper build parameters:

git clone https://github.com/apache/incubator-zeppelin.git ~/zeppelin
cd ~/zeppelin
mvn clean package -Pspark-1.4 -Dhadoop
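For what it's worth, the usual way to point a standalone Zeppelin at an existing YARN cluster is to hand it the cluster's Hadoop client configuration and set the Spark master to YARN. A minimal sketch, assuming the build above succeeded, that the EMR master is reachable at the placeholder hostname emr-master-dns, and that the configs live under /etc/hadoop/conf (older AMIs keep them under /home/hadoop/conf instead):

# copy the Hadoop client configs from the EMR master (hostname is a placeholder)
mkdir -p ~/zeppelin/hadoop-conf
scp hadoop@emr-master-dns:/etc/hadoop/conf/core-site.xml ~/zeppelin/hadoop-conf/
scp hadoop@emr-master-dns:/etc/hadoop/conf/yarn-site.xml ~/zeppelin/hadoop-conf/
# point Zeppelin at them and run Spark against YARN, via conf/zeppelin-env.sh
echo 'export HADOOP_CONF_DIR=$HOME/zeppelin/hadoop-conf' >> ~/zeppelin/conf/zeppelin-env.sh
echo 'export MASTER=yarn-client' >> ~/zeppelin/conf/zeppelin-env.sh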

Spark Job error: YarnAllocator: Exit status: -100. Diagnostics: Container released on a *lost* node

Submitted by *爱你&永不变心* on 2019-12-03 06:30:13
I am running a job on AWS EMR 4.1 with Spark 1.5, using the following configuration:

spark-submit --deploy-mode cluster --master yarn-cluster \
  --driver-memory 200g --driver-cores 30 \
  --executor-memory 70g --executor-cores 8 --num-executors 90 \
  --conf spark.storage.memoryFraction=0.45 \
  --conf spark.shuffle.memoryFraction=0.75 \
  --conf spark.task.maxFailures=1 \
  --conf spark.network.timeout=1800s

Then I got the error below. Where can I find out what "Exit status: -100" means, and how might I be able to fix this problem? Thanks!

15/12/05 05:54:24 INFO TaskSetManager: Finished task 176.0 in stage 957.0 (TID 128408) in
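As the diagnostics in the title suggest, exit status -100 marks a container that was released because YARN considers its node lost, which on EMR usually means the instance itself went away (a reclaimed spot instance, an out-of-disk node, and so on). A short diagnostic sketch, not a fix, assuming SSH access to the master node; the application id is a placeholder:

# look for nodes YARN now considers LOST or UNHEALTHY
yarn node -list -all
# pull the aggregated container logs for the failed application (placeholder id)
yarn logs -applicationId application_1449000000000_0001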

YARN: What is the difference between number-of-executors and executor-cores in Spark?

Submitted by 不问归期 on 2019-12-03 06:24:30
I am learning Spark on AWS EMR. In the process I am trying to understand the difference between the number of executors (--num-executors) and executor cores (--executor-cores). Can anyone explain this? Also, when I try to submit the following job, I get an error:

spark-submit --deploy-mode cluster --master yarn \
  --num-executors 1 --executor-cores 5 --executor-memory 1g \
  -–conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py s3://test/spark-example/input/input.txt s3://test/spark-example/output21

Error: Unrecognized option: -–conf

Number of executors is the number of distinct
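A hedged sketch of a submission that should parse: the "Unrecognized option" failure above comes from the second dash in "-–conf" being an en dash rather than two ASCII hyphens, so retyping the flag fixes it. Briefly, --num-executors is how many executor JVM processes YARN launches for the application, while --executor-cores is how many tasks each of those executors can run concurrently:

# same job and S3 paths as above, with --conf retyped in plain ASCII
spark-submit --deploy-mode cluster --master yarn \
  --num-executors 1 --executor-cores 5 --executor-memory 1g \
  --conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py s3://test/spark-example/input/input.txt s3://test/spark-example/output21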

Boosting spark.yarn.executor.memoryOverhead

Submitted by £可爱£侵袭症+ on 2019-12-03 06:10:38
I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message:

Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

So I googled how to do this and found that I should pass the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it this way:

aws emr add-steps \
  --cluster-id %s \
  --profile EMR \
  --region us-west-2 \
  --steps Name=Spark,Jar=command-runner.jar,\
Args=[\
/usr/lib/spark/bin
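A hedged sketch of one way the step arguments might carry that setting through command-runner.jar (the cluster id and script path are placeholders); the detail that usually trips people up is that --conf and its value have to be separate comma-delimited items inside Args=[...]:

# placeholder cluster id and S3 script path
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Name=Spark,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--conf,spark.yarn.executor.memoryOverhead=2048,s3://my-bucket/my_job.py]'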

Any Scala SDK or interface for AWS?

Submitted by 房东的猫 on 2019-12-03 05:01:34
Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in EMR jobs.

Take a look at AWScala (a simple wrapper on top of the AWS SDK for Java): https://github.com/seratch/AWScala [UPDATE from 04/07/2015]: Another very promising library from @dwhjames, Asynchronous Scala Clients for Amazon Web Services: https://dwhjames.github.io/aws-wrap/

You could use the standard Java SDK directly from Scala without any problems; however, I'm not aware of any Scala-specific SDKs.

Atlassian's aws-scala is quite good. P.S. Currently the library has basic support for S3, DynamoDB

How to suppress INFO messages for spark-sql running on EMR?

Submitted by 半城伤御伤魂 on 2019-12-03 04:28:28
Question: I'm running Spark on EMR as described in Run Spark and Spark SQL on Amazon Elastic MapReduce: "This tutorial walks you through installing and operating Spark, a fast and general engine for large-scale data processing, on an Amazon EMR cluster. You will also create and query a dataset in Amazon S3 using Spark SQL, and learn how to monitor Spark on an Amazon EMR cluster with Amazon CloudWatch." I'm trying to suppress the INFO logs by editing $HOME/spark/conf/log4j.properties, to no avail. Output
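A hedged sketch of the kind of change that usually quiets things down, assuming the tutorial's install layout under $HOME/spark (other EMR images keep the Spark conf dir elsewhere, e.g. /usr/lib/spark/conf). Lowering the root category and then pointing spark-sql at the file explicitly covers the case where the default classpath picks up a different log4j.properties:

# drop the root log level from INFO to WARN (path assumes the tutorial's layout)
sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' \
  $HOME/spark/conf/log4j.properties
# pass the file explicitly to the driver JVM, in case another copy wins otherwise
spark-sql --driver-java-options \
  "-Dlog4j.configuration=file://$HOME/spark/conf/log4j.properties"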

How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

Submitted by 耗尽温柔 on 2019-12-03 03:24:28
I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar. When we run with external scripting languages like Ruby or Python, we can specify these configurations as follows:

ruby elastic-mapreduce -j --stream --step-name "mystream" \
  --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 \
  --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb \
  --input s3://somepath/input --output s3://somepath/output

I tried the following ways, but none of them worked:
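One approach worth sketching (the cluster id, bucket, and jar name are placeholders): if the custom jar's main class runs through Hadoop's ToolRunner/GenericOptionsParser, generic -D key=value pairs placed before the jar's own arguments are absorbed as job configuration. With the current aws CLI that might look like:

# placeholder cluster id, jar, and paths; assumes the jar honors generic -D options
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=myjob,Jar=s3://somepath/myjob.jar,Args=[-D,mapred.task.timeout=0,-D,mapred.min.split.size=52880,s3://somepath/input,s3://somepath/output]'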

How to restart yarn on AWS EMR

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-03 03:10:28
Question: I am using Hadoop 2.6.0 (the emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart YARN to bring the changes into effect. Is there a command I can use to do this?

Answer 1: Edit (10/26/2017): A more detailed Knowledge Center article on how to do this has been published officially by AWS: https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/. You can SSH into the master node of your EMR cluster and run "sudo /sbin/stop hadoop-yarn
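As a concrete sketch for an emr-4.x image, which manages Hadoop daemons with upstart: restart the ResourceManager on the master and the NodeManagers on the core/task nodes. Exact service names can differ between release labels, so it is worth confirming them first with initctl list | grep yarn:

# on the master node
sudo /sbin/stop hadoop-yarn-resourcemanager
sudo /sbin/start hadoop-yarn-resourcemanager
# on each core/task node
sudo /sbin/stop hadoop-yarn-nodemanager
sudo /sbin/start hadoop-yarn-nodemanager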

Terminating a Spark step in AWS

Submitted by 扶醉桌前 on 2019-12-03 02:58:16
I want to set up a series of Spark steps on an EMR Spark cluster and terminate the current step if it's taking too long. However, when I SSH into the master node and run hadoop jobs -list, the master node seems to believe that no jobs are running. I don't want to terminate the cluster, because doing so would force me to buy a whole new hour of whatever cluster I'm running. Can anyone please help me terminate a Spark step in EMR without terminating the entire cluster?

That's easy:

yarn application -kill [application id]

You can list your running applications with:

yarn application -list
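A small sketch chaining the two commands above, in case you want to script it; it assumes exactly one application is in the RUNNING state, otherwise pick the id by hand from the list output:

# grab the first RUNNING application's id and kill it (run on the master node)
app_id=$(yarn application -list -appStates RUNNING 2>/dev/null | awk '/^application_/{print $1; exit}')
yarn application -kill "$app_id"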