emr

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

Question: I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly allocates all the worker nodes to the Spark job (with one for the driver, of course). I have the magic "maximizeResourceAllocation" property set to "true", and the Spark property "spark.dynamicAllocation.enabled" also set to "true". However, if I resize the EMR cluster by adding nodes to the CORE…
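
The two properties named in the question are supplied to EMR as configuration classifications at cluster-creation time. A minimal boto3 sketch of doing that programmatically (cluster name, instance types, and counts are placeholders, not the asker's setup):

```python
import boto3

# Hypothetical cluster request showing where the two properties from the question
# live; everything other than the two Properties blocks is a placeholder.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-dynamic-allocation-demo",
    ReleaseLabel="emr-4.1.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {"Classification": "spark",
         "Properties": {"maximizeResourceAllocation": "true"}},
        {"Classification": "spark-defaults",
         "Properties": {"spark.dynamicAllocation.enabled": "true"}},
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```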

collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one machine ("c3.8xlarge"). After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark, so I have configured the cluster as follows: one executor with spark.executor.memory 6g, spark.executor.cores 10, spark.yarn.executor.memoryOverhead 4096; driver with spark.driver.memory 21g. When I cache() the DataFrame it takes about 3.6 GB of memory. Now when I call collect() or toPandas() on the DataFrame, the process crashes. I know that I am bringing a large amount of data into the driver, but I think that it is not that…
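
One way to sidestep the crash is to avoid materializing the whole DataFrame on the driver at all. A sketch (the maxResultSize value is an illustrative assumption, not a recommendation, and the `spark.range` DataFrame stands in for the asker's cached data):

```python
from pyspark.sql import SparkSession

# Illustrative session; spark.driver.maxResultSize defaults to 1g and collect()
# aborts once the serialized results exceed it.
spark = (
    SparkSession.builder
    .appName("collect-alternatives")
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)

df = spark.range(0, 10000000)   # stand-in for the cached DataFrame in the question
df.cache()

# Stream partitions to the driver one at a time instead of collect()/toPandas(),
# so only a single partition is ever held in driver memory at once.
for row in df.toLocalIterator():
    pass  # process row by row

# Or bring back only a bounded sample for local pandas work.
sample_pdf = df.sample(False, 0.01).toPandas()
```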

Spark on YARN mode ends with "Exit status: -100. Diagnostics: Container released on a *lost* node"

Question: I am trying to load a database with 1 TB of data into Spark on AWS using the latest EMR. The running time is so long that it doesn't finish even in 6 hours; after running for about 6h30m, I get an error announcing that the container was released on a lost node, and then the job fails. The logs look like this: 16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks)…
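
A mitigation often tried for lost-container failures under memory pressure (a sketch, not the asker's solution) is to give each executor more off-heap headroom and to split the load into more, smaller tasks:

```python
from pyspark.sql import SparkSession

# Settings frequently adjusted when YARN reports lost containers on EMR;
# the values below are placeholders, not tuned recommendations.
spark = (
    SparkSession.builder
    .appName("large-load")
    .config("spark.yarn.executor.memoryOverhead", "3072")  # MB of off-heap headroom per executor
    .config("spark.sql.shuffle.partitions", "2000")        # more, smaller shuffle tasks
    .getOrCreate()
)

# Hypothetical partitioned JDBC read, so no single task holds too large a slice of
# the 1 TB input; connection string, table, and bounds are all placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/sourcedb")
    .option("dbtable", "big_table")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "1000000000")
    .option("numPartitions", "2000")
    .load()
)
df.write.parquet("s3://my-bucket/big_table/")
```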

spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0

Question: I run Spark 1.4.1 on Amazon AWS EMR 4.0.0. For some reason, saveAsTextFile is very slow on emr-4.0.0 compared to emr-3.8 (it was 5 sec, now 95 sec). Actually, saveAsTextFile reports that it's done in 4.356 sec, but after that I see lots of INFO messages with 404 errors from the com.amazonaws.latency logger for the next 90 sec: spark> sc.parallelize(List.range(0, 1600000),160).map(x => x + "\t" + "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20") 2015-09-01 21:16:17,637 INFO [dag-scheduler-event-loop…
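
One change commonly tried on emr-4.x (not necessarily the asker's eventual fix) is writing through the EMRFS s3:// scheme instead of the older s3n:// connector. A PySpark rendering of the same write, reusing the question's placeholder bucket and path:

```python
from pyspark import SparkConf, SparkContext

# PySpark equivalent of the spark-shell one-liner above, targeting "s3://" (EMRFS)
# rather than "s3n://"; bucket and path come straight from the question.
conf = SparkConf().setAppName("saveAsTextFile-s3-test")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(0, 1600000), 160).map(lambda x: str(x) + "\t" + "A" * 100)
rdd.saveAsTextFile("s3://foo-bar/tmp/test40_20")
```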

How to bootstrap installation of Python modules on Amazon EMR?

Question: I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this? Answer 1: The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script. Here's an example I'm using in production: s3://mybucket/bootstrap/install_python…
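
The console route described in the answer can also be expressed programmatically. A minimal boto3 sketch that attaches such a bootstrap action at launch (bucket, script name, release label, and instance details are placeholders, not the answer's actual production script):

```python
import boto3

# Hypothetical launch request: the bootstrap script containing the pip install
# commands is assumed to already live in S3 under a placeholder path.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-with-python-deps",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    BootstrapActions=[
        {
            "Name": "install python modules",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install_python_modules.sh",
                "Args": [],
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```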