emr

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

Question: I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly allocates all the worker nodes to the Spark job (with one for the driver, of course). I have the magic "maximizeResourceAllocation" property set to "true", and the Spark property "spark.dynamicAllocation.enabled" also set to "true". However, if I resize the EMR cluster by adding nodes to the CORE…
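
The two properties named in the question are supplied to EMR as configuration classifications at cluster-creation time. A minimal boto3 sketch of doing that programmatically (cluster name, instance types, and counts are placeholders, not the asker's setup):

```python
import boto3

# Hypothetical cluster request showing where the two properties from the question
# live; everything other than the two Properties blocks is a placeholder.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-dynamic-allocation-demo",
    ReleaseLabel="emr-4.1.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {"Classification": "spark",
         "Properties": {"maximizeResourceAllocation": "true"}},
        {"Classification": "spark-defaults",
         "Properties": {"spark.dynamicAllocation.enabled": "true"}},
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```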

collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one machine ("c3.8xlarge"). After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark, so I have configured the cluster as follows: one executor with spark.executor.memory 6g, spark.executor.cores 10, spark.yarn.executor.memoryOverhead 4096; driver with spark.driver.memory 21g. When I cache() the DataFrame it takes about 3.6 GB of memory. Now when I call collect() or toPandas() on the DataFrame, the process crashes. I know that I am bringing a large amount of data into the driver, but I think that it is not that…
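
One way to sidestep the crash is to avoid materializing the whole DataFrame on the driver at all. A sketch (the maxResultSize value is an illustrative assumption, not a recommendation, and the `spark.range` DataFrame stands in for the asker's cached data):

```python
from pyspark.sql import SparkSession

# Illustrative session; spark.driver.maxResultSize defaults to 1g and collect()
# aborts once the serialized results exceed it.
spark = (
    SparkSession.builder
    .appName("collect-alternatives")
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)

df = spark.range(0, 10000000)   # stand-in for the cached DataFrame in the question
df.cache()

# Stream partitions to the driver one at a time instead of collect()/toPandas(),
# so only a single partition is ever held in driver memory at once.
for row in df.toLocalIterator():
    pass  # process row by row

# Or bring back only a bounded sample for local pandas work.
sample_pdf = df.sample(False, 0.01).toPandas()
```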

Spark on YARN mode ends with "Exit status: -100. Diagnostics: Container released on a *lost* node"

Question: I am trying to load a database with 1 TB of data into Spark on AWS using the latest EMR. The running time is so long that it doesn't finish even in 6 hours; after running for about 6h30m, I get an error announcing that the container was released on a lost node, and then the job fails. The logs look like this: 16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks)…
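
A mitigation often tried for lost-container failures under memory pressure (a sketch, not the asker's solution) is to give each executor more off-heap headroom and to split the load into more, smaller tasks:

```python
from pyspark.sql import SparkSession

# Settings frequently adjusted when YARN reports lost containers on EMR;
# the values below are placeholders, not tuned recommendations.
spark = (
    SparkSession.builder
    .appName("large-load")
    .config("spark.yarn.executor.memoryOverhead", "3072")  # MB of off-heap headroom per executor
    .config("spark.sql.shuffle.partitions", "2000")        # more, smaller shuffle tasks
    .getOrCreate()
)

# Hypothetical partitioned JDBC read, so no single task holds too large a slice of
# the 1 TB input; connection string, table, and bounds are all placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/sourcedb")
    .option("dbtable", "big_table")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "1000000000")
    .option("numPartitions", "2000")
    .load()
)
df.write.parquet("s3://my-bucket/big_table/")
```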

spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0

Question: I run Spark 1.4.1 on Amazon AWS EMR 4.0.0. For some reason, saveAsTextFile is very slow on emr-4.0.0 compared to emr-3.8 (it was 5 sec, now 95 sec). Actually, saveAsTextFile reports that it's done in 4.356 sec, but after that I see lots of INFO messages with 404 errors from the com.amazonaws.latency logger for the next 90 sec: spark> sc.parallelize(List.range(0, 1600000),160).map(x => x + "\t" + "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20") 2015-09-01 21:16:17,637 INFO [dag-scheduler-event-loop…
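
One change commonly tried on emr-4.x (not necessarily the asker's eventual fix) is writing through the EMRFS s3:// scheme instead of the older s3n:// connector. A PySpark rendering of the same write, reusing the question's placeholder bucket and path:

```python
from pyspark import SparkConf, SparkContext

# PySpark equivalent of the spark-shell one-liner above, targeting "s3://" (EMRFS)
# rather than "s3n://"; bucket and path come straight from the question.
conf = SparkConf().setAppName("saveAsTextFile-s3-test")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(0, 1600000), 160).map(lambda x: str(x) + "\t" + "A" * 100)
rdd.saveAsTextFile("s3://foo-bar/tmp/test40_20")
```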

How to bootstrap installation of Python modules on Amazon EMR?

Question: I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this? Answer 1: The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script. Here's an example I'm using in production: s3://mybucket/bootstrap/install_python…
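
The console route described in the answer can also be expressed programmatically. A minimal boto3 sketch that attaches such a bootstrap action at launch (bucket, script name, release label, and instance details are placeholders, not the answer's actual production script):

```python
import boto3

# Hypothetical launch request: the bootstrap script containing the pip install
# commands is assumed to already live in S3 under a placeholder path.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-with-python-deps",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    BootstrapActions=[
        {
            "Name": "install python modules",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install_python_modules.sh",
                "Args": [],
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```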