amazon-emr

How to execute spark submit on amazon EMR from Lambda function?

核能气质少年 submitted on 2019-11-28 18:29:11
I want to execute a spark-submit job on an AWS EMR cluster based on a file upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to submit a spark-submit job to the EMR cluster from the Lambda function. Most of the answers that I found talked about adding a step to the EMR cluster, but I do not know whether I can add a step that fires "spark-submit --with args". You can, I had to do the same thing last week! Using boto3 for Python (other languages would definitely have a similar solution) you can either start a cluster with the defined step, or attach a
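
For illustration, here is a minimal sketch of what that Lambda-to-EMR hand-off can look like with boto3's add_job_flow_steps; the region, cluster ID, bucket, and script path are placeholders, not values from the question.

```python
# A minimal sketch of submitting a spark-submit step to an existing EMR
# cluster from a Lambda handler using boto3. The cluster ID, bucket, and
# script path below are hypothetical placeholders.
import boto3

def lambda_handler(event, context):
    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    # The S3 object that triggered the Lambda (shape per the S3 event schema).
    record = event["Records"][0]["s3"]
    input_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[{
            "Name": "spark-submit-from-lambda",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar lets an EMR step run spark-submit directly.
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/jobs/process.py",  # placeholder script
                    input_path,                        # pass the new file as an argument
                ],
            },
        }],
    )
    return response["StepIds"]
```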

Strange spark ERROR on AWS EMR

ぐ巨炮叔叔 submitted on 2019-11-28 18:19:13
I have a really simple PySpark script that creates a dataframe from some parquet data on S3, then calls the count() method and prints out the number of records. I run the script on an AWS EMR cluster and I'm seeing the following strange WARN information: 17/12/04 14:20:26 WARN ServletHandler: javax.servlet.ServletException: java.util.NoSuchElementException: None.get at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489) at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
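
For context, a minimal sketch of the kind of script being described (the S3 path is a placeholder); the count itself succeeds, and the ServletException only shows up as a WARN in the logs.

```python
# A minimal sketch of the script described above; the S3 prefix is a
# placeholder. The count completes normally; the None.get ServletException
# appears only as a WARN while the job runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-count").getOrCreate()

df = spark.read.parquet("s3://my-bucket/some/parquet/prefix/")  # placeholder path
print(df.count())

spark.stop()
```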

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

老子叫甜甜 submitted on 2019-11-28 17:38:46
I'm running a job on Apache Spark on Amazon Elastic Map Reduce (EMR). Currently I'm running on emr-4.1.0 which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly has allocated all the worker nodes to the spark job (with one for the driver, of course). I have the magic "maximizeResourceAllocation" property set to "true", and the spark property "spark.dynamicAllocation.enabled" also set to "true". However, if I resize the emr cluster by adding nodes to the CORE pool of worker machines, YARN only adds some of the new nodes to the spark job. For example, this
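
For reference, a hedged sketch of how those two settings are typically supplied when the cluster is created with boto3; the region, instance types, counts, and roles below are placeholders rather than the poster's actual setup.

```python
# A rough sketch of passing maximizeResourceAllocation and dynamic allocation
# when creating the cluster; all concrete values are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

emr.run_job_flow(
    Name="spark-cluster",
    ReleaseLabel="emr-4.1.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        # EMR-specific switch that sizes executors from the node hardware.
        {"Classification": "spark",
         "Properties": {"maximizeResourceAllocation": "true"}},
        # Plain Spark property, as in the question.
        {"Classification": "spark-defaults",
         "Properties": {"spark.dynamicAllocation.enabled": "true"}},
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```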

“Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used” on an EMR cluster with 75GB of memory

半腔热情 submitted on 2019-11-28 15:18:20
I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on this cluster, but I'm receiving this error: 16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting
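
The mitigation usually discussed for this error is giving each YARN container more off-heap headroom and keeping the executor heap within what YARN can grant; a hedged sketch with illustrative values (not a recommendation for this specific cluster):

```python
# A hedged sketch of the memory settings commonly tuned for this error; the
# values are illustrative only. In cluster deploy mode these are usually passed
# as --conf flags to spark-submit instead, since they must be known before the
# executor containers are requested.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bzip2-csv-aggregation")
    # Extra off-heap headroom per executor container, in MB.
    .config("spark.yarn.executor.memoryOverhead", "2048")
    # Keep heap + overhead within the container size YARN can allocate per node.
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Placeholder input and aggregation, standing in for the ~5GB bzip2 CSV job.
df = spark.read.csv("s3://my-bucket/data/big-file.csv.bz2", header=True)
df.groupBy(df.columns[0]).count().show()
```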

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

若如初见. submitted on 2019-11-28 13:10:54
Question: I am running into this problem with the Apache Arrow Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh: export PYSPARK_PYTHON=python3 export PYSPARK_PYTHON_DRIVER=python3 and confirmed this in the Spark shell: spark.version 2.4.3 sc.pythonExec python3 sc.pythonVer python3 Running a basic pandas_udf with Apache Arrow integration results in an error: from pyspark.sql.functions
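
For reference, a minimal pandas_udf of the kind described; it assumes pyarrow has been installed for the python3 interpreter on every node (for example via a bootstrap action), since the executors' Python must be able to import it, not just the driver's.

```python
# A minimal pandas_udf example of the kind described; it only works once
# pyarrow is installed for python3 on every node (driver *and* workers),
# e.g. `sudo python3 -m pip install pyarrow` in a bootstrap action.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("arrow-check").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

@pandas_udf(LongType())
def plus_one(v):
    # v arrives as a pandas Series; Arrow ships it to and from the JVM.
    return v + 1

df = spark.range(10)
df.select(plus_one(df["id"]).alias("id_plus_one")).show()
```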

Amazon EC2 vs. Amazon EMR [closed]

ⅰ亾dé卋堺 submitted on 2019-11-28 06:28:32
I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS, but I don't know anything about AWS. If I deploy it, should I choose Amazon EC2 or Amazon EMR? I want to improve the performance of my task. Which one is better and more reliable for me, and how should I approach them? I have heard that we can also register our VM setup as-is on AWS. Is that possible? Please advise as soon as possible. Many thanks. EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured

hadoop copying from hdfs to S3

别说谁变了你拦得住时间么 submitted on 2019-11-28 06:24:18
Question: I've successfully completed a Mahout vectorizing job on Amazon EMR (using Mahout on Elastic MapReduce as a reference). Now I want to copy the results from HDFS to S3 (to use them in future clustering). For that I've used hadoop distcp: den@aws:~$ elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \ > --arg hdfs://my.bucket/prj1/seqfiles \ > --arg s3n://ACCESS_KEY:SECRET_KEY@my.bucket/prj1/seqfiles \ > -j $JOBID Failed. I found this suggestion: use s3distcp. Tried it also: elastic
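
As an aside, on more recent EMR releases the same copy can be submitted as an s3-dist-cp step with boto3 instead of the old elastic-mapreduce CLI; a hedged sketch, with placeholder cluster ID and paths, and with credentials left to the instance role rather than embedded in the S3 URI:

```python
# A hedged sketch of running s3-dist-cp as a step on a running cluster via
# boto3; the cluster ID and paths are placeholders. Credentials come from the
# cluster's instance role, so no ACCESS_KEY/SECRET_KEY appears in the URI.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "copy-seqfiles-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///prj1/seqfiles",
                "--dest", "s3://my.bucket/prj1/seqfiles",
            ],
        },
    }],
)
```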

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

岁酱吖の submitted on 2019-11-27 18:42:05
I am running the Kinesis plus Spark application https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html I am running the command below on an EC2 instance: ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /home/hadoop/test.jar I have installed Spark on EMR. EMR details: Master instance group - 1 Running MASTER m1.medium, Core instance group - 2 Running CORE m1.medium. I am getting the INFO below and it never ends. 15/06/14 11:33:23 INFO yarn.Client: Requesting a
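
An application stuck in ACCEPTED usually means YARN cannot find room for the requested application master and executor containers (m1.medium nodes leave very little memory for YARN). A hedged way to check what the cluster can actually offer is the ResourceManager REST API; the host below is a placeholder and the port is the YARN default:

```python
# A hedged sketch: compare what spark-submit asked for with what YARN actually
# has available, via the ResourceManager REST API. The hostname is a
# placeholder and the port (8088) is the YARN default, which may differ.
import requests

rm = "http://<master-private-dns>:8088"  # placeholder ResourceManager address
metrics = requests.get(f"{rm}/ws/v1/cluster/metrics").json()["clusterMetrics"]

print("total memory (MB):    ", metrics["totalMB"])
print("available memory (MB):", metrics["availableMB"])
print("available vcores:     ", metrics["availableVirtualCores"])
```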

Does an EMR master node know its cluster ID?

删除回忆录丶 submitted on 2019-11-27 17:23:34
Question: I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in its messages so that the recipient knows which cluster the message is about. Does the master node know its ID ( j-************* )? If not, is there some other piece of identifying information that could allow the message recipient to
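
One commonly cited answer is that EMR writes instance and cluster metadata to local JSON files during provisioning, so an agent on the master can read the cluster ID without calling the EMR API. A hedged sketch, using the path found on stock EMR AMIs (the file name and key are assumptions if your AMI differs):

```python
# A hedged sketch: on standard EMR nodes the cluster ID (jobFlowId) is written
# to a local JSON file at provisioning time. Path and key are as found on
# stock EMR AMIs and may differ on customized images.
import json

with open("/mnt/var/lib/info/job-flow.json") as f:
    cluster_id = json.load(f)["jobFlowId"]

print(cluster_id)  # e.g. j-XXXXXXXXXXXXX
```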