amazon-emr

Folder won't delete on Amazon S3

Submitted by 两盒软妹~` on 2019-12-03 10:51:06
I'm trying to delete a folder created as a result of a MapReduce job. Other files in the bucket delete just fine, but this folder won't delete. When I try to delete it from the console, the progress bar next to its status just stays at 0. I have made multiple attempts, including logging out and back in between them. Steffen Opel (answer): First and foremost, Amazon S3 doesn't actually have a native concept of folders/directories; rather, it is a flat storage architecture consisting of buckets and objects/keys only. The directory-style presentation seen in most tools for S3 (including the AWS Management Console itself…
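Since an S3 "folder" is only a shared key prefix, deleting everything under the prefix removes the folder. A minimal sketch using boto3; the bucket name and prefix below are placeholders:

    import boto3

    # The "folder" left behind by the MapReduce job is really just a key
    # prefix; deleting every object under it removes the folder itself.
    s3 = boto3.resource("s3")
    bucket = s3.Bucket("my-bucket")                      # placeholder bucket
    bucket.objects.filter(Prefix="output/my-job/").delete()  # placeholder prefix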

Boosting spark.yarn.executor.memoryOverhead

Submitted by £可爱£侵袭症+ on 2019-12-03 06:10:38
I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message: Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. So I googled how to do this, and found that I should pass the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it this way: aws emr add-steps --cluster-id %s --profile EMR --region us-west-2 --steps Name=Spark,Jar=command-runner.jar,Args=[/usr/lib/spark/bin…
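One detail that commonly trips people up here: spark-submit only honors --conf options that appear before the application script; anything after the script is treated as an application argument. A hedged sketch of the same step submitted via boto3 instead of the CLI (the cluster id, script path, and overhead value are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-west-2")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster id
        Steps=[{
            "Name": "Spark",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # --conf must precede the application script, or spark-submit
                # passes it to the application as an ordinary argument.
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--conf", "spark.yarn.executor.memoryOverhead=2048",
                    "s3://my-bucket/my_job.py",   # placeholder script
                ],
            },
        }],
    )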

Can't get a SparkContext in new AWS EMR Cluster

Submitted by 北慕城南 on 2019-12-03 06:09:36
I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). I ssh into the master machine and run spark-shell or pyspark and get the following error:

    $ spark-shell
    log4j:ERROR setFile(null,true) call failed.
    java.io.FileNotFoundException: /stderr (Permission denied)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
        at org.apache.log4j…

Any Scala SDK or interface for AWS?

Submitted by 房东的猫 on 2019-12-03 05:01:34
Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in EMR jobs. Take a look at AWScala (a simple wrapper on top of the AWS SDK for Java): https://github.com/seratch/AWScala [UPDATE from 04/07/2015]: Another very promising library from @dwhjames: Asynchronous Scala Clients for Amazon Web Services, https://dwhjames.github.io/aws-wrap/ You could also use the standard Java SDK directly from Scala without any problems; however, I'm not aware of any Scala-specific SDKs. Atlassian's aws-scala is quite good. P.S. Currently the library has basic support for S3, DynamoDB…

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Submitted by 北城余情 on 2019-12-03 04:25:20
Question: Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled), and we realized that Spark 'saveAsTable' (parquet format) writes to S3 were ~4x slower compared to HDFS, but we found a workaround of using DirectParquetOutputCommitter [1] with Spark 1.6. Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary location and renames it later, and the rename operation on S3 is very expensive. Also we…
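A minimal sketch of one commonly cited mitigation after DirectParquetOutputCommitter was removed: FileOutputCommitter algorithm version 2, which commits task output straight to the destination and skips the job-level rename that is expensive on S3. App name and paths are placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("parquet-to-s3")   # placeholder app name
        # Version 2 promotes task output directly to the final location,
        # avoiding the second (job-level) rename pass.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate()
    )

    df = spark.range(1000)          # placeholder data
    df.write.mode("overwrite").parquet("s3://my-bucket/out/")  # placeholder path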

Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?

Submitted by 感情迁移 on 2019-12-03 03:58:38
I am new to Amazon services and facing some issues. Suppose I am running a job flow on Amazon Elastic MapReduce with 3 instances in total. While running my job flow I found that my job is taking a long time to execute, so I need to add more instances to the cluster so that the job will finish faster. My question is: how do I add instances to an existing cluster? Terminating the existing instances and creating new ones with a higher count is time consuming. Is there any way to do it? If yes, please suggest how. I am doing…
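A hedged sketch of resizing a running cluster in place by raising the instance count of its core instance group via boto3 (cluster id, region, and target count are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-west-2")

    # Look up the core instance group's id on the running cluster.
    groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

    # Raise the target count; EMR provisions the extra nodes in place,
    # with no need to terminate and recreate the cluster.
    emr.modify_instance_groups(
        ClusterId="j-XXXXXXXXXXXXX",
        InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 6}],
    )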

Extremely slow S3 write times from EMR/ Spark

Submitted by 家住魔仙堡 on 2019-12-03 03:13:53
Question: I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR. My Spark job takes over 4 hours to complete, yet the cluster is only under load during the first 1.5 hours. I was curious what Spark was doing all this time. I looked at the logs and found many s3 mv commands, one for each file. Looking directly at S3, I see all my files are in a _temporary directory. Secondly, I'm concerned about my cluster cost; it appears I need to buy 2 hours…
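A hedged sketch of the usual workaround for the _temporary rename tax: write to the cluster's local HDFS first (where renames are cheap metadata operations), then bulk-copy to S3 with s3-dist-cp. All paths below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/input/")   # placeholder input

    # On HDFS, promoting files out of _temporary is a metadata-only
    # rename, so the commit phase finishes quickly.
    df.write.mode("overwrite").parquet("hdfs:///tmp/job-output")

    # Then bulk-copy the result to S3, e.g. from the master node or as a
    # separate EMR step:
    #   s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/job-output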

How to make EMR to keep running [duplicate]

Submitted by 大憨熊 on 2019-12-02 21:17:36
Question: This question already has answers here (closed 7 years ago). Possible duplicate: Re-use Amazon Elastic MapReduce instance. Can I keep a launched EMR cluster running and keep submitting new jobs to it until I am done (say, after a couple of days) and then shut down the cluster, or do I have to launch my own cluster in EC2 to do so? Answer 1: Yes. In particular, I use the CLI client. Here is a snippet from one of my scripts: JOBFLOW_ID=`elastic-mapreduce --create --alive --name cluster --num-instances…
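The old elastic-mapreduce CLI's --alive flag maps to KeepJobFlowAliveWhenNoSteps in today's APIs. A hedged modern equivalent via boto3; names, instance types, counts, and the release label are placeholders:

    import boto3

    emr = boto3.client("emr", region_name="us-west-2")
    resp = emr.run_job_flow(
        Name="cluster",
        ReleaseLabel="emr-5.9.0",           # placeholder release
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            # Keep the cluster up between steps so new jobs can be
            # submitted until you terminate it yourself.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(resp["JobFlowId"])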

How to tune spark job on EMR to write huge data quickly on S3

Submitted by 浪尽此生 on 2019-12-02 19:16:37
I have a Spark job where I am doing an outer join between two data frames. The first data frame is 260 GB of text files split into 2,200 files, and the second data frame is 2 GB. Writing the data frame output, which is about 260 GB, to S3 takes a very long time; after more than 2 hours I cancelled it because I was being charged heavily on EMR. Here is my cluster info: emr-5.9.0; Master: m3.2xlarge; Core: r4.16xlarge, 10 machines (each machine has 64 vCores, 488 GiB memory, EBS storage: 100 GiB). This is the cluster config that I am setting: capacity-scheduler yarn…
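A back-of-envelope sketch of executor sizing for the cluster above (10 x r4.16xlarge core nodes); the 5-cores-per-executor rule of thumb and the 10% reservation for YARN overhead are assumptions, not EMR defaults:

    vcores_per_node = 64
    mem_per_node_gb = 488

    cores_per_executor = 5                                  # rule of thumb for healthy HDFS/S3 throughput
    executors_per_node = (vcores_per_node - 1) // cores_per_executor   # leave 1 vCore for OS/daemons
    mem_per_executor_gb = int(mem_per_node_gb * 0.9) // executors_per_node  # reserve ~10% for overhead

    print(executors_per_node, mem_per_executor_gb)  # -> 12 executors/node, ~36 GB each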

AWS EMR Parallel Mappers?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-02 18:17:07
Question: I am trying to determine how many nodes I need for my EMR cluster. As part of best practices the recommendation is: (total mappers needed for your job × time taken per mapper) / (mapper capacity per instance × desired completion time), as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013, page 89. The question is how to determine how many parallel mappers the instance will support, since AWS doesn't publish this. https://aws…
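A back-of-envelope sketch: under YARN, the number of parallel mappers per node is bounded by the node's container memory divided by the per-mapper container size. Both figures below are illustrative assumptions, not published AWS values:

    yarn_node_memory_mb = 24576   # yarn.nodemanager.resource.memory-mb (example value)
    map_container_mb = 1536       # mapreduce.map.memory.mb (example value)

    mappers_per_node = yarn_node_memory_mb // map_container_mb
    total_parallel_mappers = mappers_per_node * 10   # e.g. a 10-node cluster
    print(mappers_per_node, total_parallel_mappers)  # -> 16 per node, 160 total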