emr

“Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used” on an EMR cluster with 75GB of memory

十年热恋 posted on 2019-12-17 21:25:13
Question: I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on this cluster, but I'm receiving this error:

    16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container…
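
The excerpt ends before any answer, but the error itself indicates that YARN killed the container because heap plus off-heap usage exceeded the container size. One commonly suggested mitigation is to raise spark.yarn.executor.memoryOverhead and trim the executor heap and core count so that heap plus overhead stays inside the container limit. A minimal pyspark sketch with illustrative values only (the right numbers depend on the instance type and workload):

    # Sketch only: leave more off-heap headroom per YARN container.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .set("spark.executor.memory", "8g")                 # heap per executor (example value)
        .set("spark.yarn.executor.memoryOverhead", "2048")  # off-heap headroom in MB (example value)
        .set("spark.executor.cores", "4")                   # fewer concurrent tasks per executor
    )
    spark = SparkSession.builder.config(conf=conf).appName("bzip2-csv-aggregation").getOrCreate()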

collect() or toPandas() on a large DataFrame in pyspark/EMR

徘徊边缘 posted on 2019-12-17 14:53:28
Question: I have a single-machine EMR cluster (c3.8xlarge). After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark, so I configured the cluster as follows:

    One executor:
    spark.executor.memory 6g
    spark.executor.cores 10
    spark.yarn.executor.memoryOverhead 4096
    Driver:
    spark.driver.memory 21g

When I cache() the DataFrame it takes about 3.6 GB of memory. Now when I call collect() or toPandas() on the DataFrame, the process crashes.
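
collect() and toPandas() materialize the entire DataFrame in the driver process, so even a 21g driver heap can be exhausted (toPandas() additionally builds a full pandas copy). A sketch of two commonly used alternatives; the paths and the processing function are placeholders, not taken from the original post:

    # Sketch only: avoid pulling the whole DataFrame onto the driver at once.
    # `df` is assumed to be the cached DataFrame from the question.

    # 1) Stream rows to the driver one partition at a time instead of all at once.
    for row in df.toLocalIterator():
        process(row)  # `process` is a placeholder for the caller's logic

    # 2) Persist the full result to storage and bring back only a sample as pandas.
    df.write.mode("overwrite").parquet("s3://my-bucket/result/")   # hypothetical path
    small_pdf = df.sample(False, 0.01).toPandas()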

NoSuchMethodError for Scala Seq line in Spark

风格不统一 posted on 2019-12-14 03:02:31
Question: I am getting an error when trying to run plain Scala code in Spark, similar to these posts: this and this. Their problem was that they were using the wrong Scala version to compile their Spark project. However, mine is the correct version. I have Spark 1.6.0 installed on an AWS EMR cluster to run the program. The project is compiled on my local machine with Scala 2.11 installed, with 2.11 listed in all dependencies and build files and without any references to 2.10. This is the exact line that throws…

How to force Hadoop to unzip inputs regardless of their extension?

独自空忆成欢 posted on 2019-12-14 02:02:18
Question: I'm running map-reduce and my inputs are gzipped, but do not have a .gz (file name) extension. Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper. However, without the extension it doesn't do so. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them even though they do not have the .gz extension. I tried passing the following flags to Hadoop: step_args=[ "-jobconf", "stream.recordreader…

Amazon EMR cluster matplotlib error

感情迁移 posted on 2019-12-13 12:28:38
Question: I'm using an AWS EMR 5.3.1 cluster with Hadoop + Spark + Hive + Zeppelin. When I use Zeppelin and type the command:

    %python
    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3])

I get the error: ImportError: Gtk3 backend requires pygobject to be installed. How do I solve it?

Answer 1: Before importing the pyplot module you need to change matplotlib's backend to Agg:

    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    plt.plot([1,2,3])

Source: https://stackoverflow.com/questions/42481911/amazon-emr

Number of executors and cores

北城以北 posted on 2019-12-13 07:57:52
Question: I am new to Spark and would like to know how many cores and executors should be used in a Spark job on AWS if we have 2 c4.8xlarge slave nodes and 1 c4.8xlarge master node. I have tried different combinations but am not able to understand the concept. Thank you.

Answer 1: The Cloudera guys gave a good explanation of this: https://www.youtube.com/watch?v=vfiJQ7wg81Y If, let's say, you have 16 cores on your node (I think this is exactly your case), then you give 1 to YARN to manage this node, then you…
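
As a worked illustration of the guideline the answer starts to describe (reserve one core per node for YARN/OS daemons and cap each executor at roughly five cores), here is a rough sizing calculation; the c4.8xlarge figures of 36 vCPUs and 60 GB RAM are assumptions to verify, not numbers taken from the thread:

    # Rough sizing sketch following the commonly cited "about 5 cores per executor" guideline.
    nodes = 2                      # slave nodes running executors
    vcores_per_node = 36           # assumed for c4.8xlarge
    mem_per_node_gb = 60           # assumed for c4.8xlarge

    usable_cores = vcores_per_node - 1                  # leave 1 core per node for YARN/OS daemons
    cores_per_executor = 5                              # keeps per-executor I/O throughput reasonable
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node - 1    # reserve one slot for the driver / AM

    mem_per_executor = mem_per_node_gb // executors_per_node
    heap_per_executor = int(mem_per_executor * 0.9)     # roughly 10% goes to memoryOverhead

    print(total_executors, cores_per_executor, f"{heap_per_executor}g")
    # With the assumptions above: 13 executors, 5 cores each, about 7g heap per executor.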

Spark on EMR: processing time did not decrease when the number of nodes increased

倾然丶 夕夏残阳落幕 posted on 2019-12-13 06:44:56
Question: My Spark program takes a large number of zip files containing JSON data from S3. It performs some cleaning on the data in the form of Spark transforms. After that, I save it as parquet files. When I run my program with 1 GB of data on 10 nodes with an 8 GB configuration in AWS, it takes about 11 minutes. I changed it to 20 nodes with a 32 GB configuration. It still takes about 10 minutes, a reduction of only around 1 minute. Why this kind of behavior?

Answer 1: Because adding more machines isn't always the solution; adding more…
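
The answer is cut off, but a frequent cause of this behavior is that zip archives are not splittable, so the number of input partitions, and therefore the usable parallelism, stays the same no matter how many nodes are added. A pyspark sketch of checking and then raising the partition count after the read; the paths and the partition target are illustrative, not from the original post:

    # Sketch: inspect parallelism and repartition after reading from non-splittable archives.
    df = spark.read.json("s3://my-bucket/extracted-json/")    # hypothetical input path
    print(df.rdd.getNumPartitions())                          # if this is small, extra nodes sit idle

    df = df.repartition(200)                                  # illustrative target; ~2-3x total cores
    cleaned = df.dropna()                                     # stand-in for the cleaning transforms
    cleaned.write.mode("overwrite").parquet("s3://my-bucket/parquet-out/")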

jar containing org.apache.hadoop.hive.dynamodb

不羁岁月 posted on 2019-12-13 01:44:18
Question: I was trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online on how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse-engineer the process. Unfortunately, I couldn't find the file either :(. Could someone answer the following questions for me (listed in order of priority): a Java example that loads a DynamoDB table into HDFS (that can be passed to a mapper as a table input format); the…
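
For orientation, these classes are part of the EMR DynamoDB connector (open-sourced as awslabs/emr-dynamodb-connector; on EMR nodes the jars typically sit under /usr/share/aws/emr/ddb/lib/). The thread asks for Java, but as a hedged illustration of the same input format, here is a pyspark sketch; the class names and configuration keys come from that connector and should be verified against your EMR release, and a key/value converter may be needed depending on the connector version:

    # Sketch only: read a DynamoDB table as a Hadoop RDD via the EMR DynamoDB connector.
    # Assumes emr-ddb-hadoop.jar (org.apache.hadoop.dynamodb.*) is on the executor classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ddb-to-hdfs").getOrCreate()
    sc = spark.sparkContext

    ddb_conf = {
        "dynamodb.input.tableName": "my-table",        # hypothetical table name
        "dynamodb.servicename": "dynamodb",
        "dynamodb.regionid": "us-east-1",
    }
    rows = sc.hadoopRDD(
        inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
        conf=ddb_conf,
    )
    rows.saveAsTextFile("hdfs:///tmp/my-table-dump/")  # illustrative HDFS destination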

Create EMR Cluster with No Public IP Addresses

夙愿已清 posted on 2019-12-12 15:04:03
Question: I wish to create an EMR cluster where none of the instances are assigned a public IP address, for security reasons. I have been able to launch the cluster in my VPC using my own custom security group, but for some reason all the nodes are assigned a public IP address by default. I can't find anything in the EMR CLI documentation about how to disable this: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html Any ideas? Is there some EMR-specific reason why…
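
No answer is shown in the excerpt, but the usual approach is to launch the cluster into a private VPC subnet (one whose auto-assign-public-IP setting is off and that reaches AWS services through a NAT or VPC endpoints), since EMR follows the subnet's addressing behavior. A boto3 sketch rather than the legacy EMR CLI referenced in the question; every identifier below is a placeholder:

    # Sketch: launch an EMR cluster into a private subnet so nodes receive no public IPs.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(
        Name="private-subnet-cluster",
        ReleaseLabel="emr-5.3.1",
        Instances={
            "InstanceCount": 3,
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "Ec2SubnetId": "subnet-0123456789abcdef0",   # a private subnet (auto-assign public IP disabled)
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])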