emr

AWS Athena: does `msck repair table` incur costs?

懵懂的女人 submitted on 2019-12-10 21:48:02
Question: I have ORC data in S3 that looks like this:

s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/

Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the path partition convention above for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions. I have
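
A minimal sketch of how the hourly partition refresh could be automated, assuming boto3; the database, table, and results-bucket names below are hypothetical placeholders, not from the question:

import boto3

# Hypothetical names; replace with your own database, table, and results location.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_orc_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
print(response["QueryExecutionId"])

Since the EMR job knows exactly which clientId/year/month/day/hour it just wrote, an ALTER TABLE ... ADD PARTITION statement covering only the new partitions is a lighter-weight alternative to having MSCK REPAIR TABLE scan the whole prefix.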

How to prevent EMR Spark step from retrying?

穿精又带淫゛_ submitted on 2019-12-10 15:59:51
Question: I have an AWS EMR cluster (emr-4.2.0, Spark 1.5.2) where I am submitting steps from the AWS CLI. My problem is that if the Spark application fails, YARN tries to run the application again (under the same EMR step). How can I prevent this? I tried setting --conf spark.yarn.maxAppAttempts=1, which is correctly set in Environment/Spark Properties, but it doesn't prevent YARN from restarting the application.

Answer 1: You should try to set spark.task.maxFailures to 1 (4 by default).
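
A sketch of how both settings could be attached to an EMR step submitted programmatically, assuming boto3; the cluster ID, step name, and application jar below are hypothetical placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# The two --conf flags carry the settings discussed above:
# no extra YARN application attempts and no task retries.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # hypothetical cluster ID
    Steps=[{
        "Name": "my-spark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--conf", "spark.yarn.maxAppAttempts=1",
                "--conf", "spark.task.maxFailures=1",
                "s3://my-bucket/jars/my-app.jar",   # hypothetical application jar
            ],
        },
    }],
)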

KryoSerializer cannot find my SparkKryoRegistrator

[亡魂溺海] submitted on 2019-12-10 10:24:53
Question: I am using Spark 2.0.2 on Amazon emr-5.2.1 in client mode. I use Kryo serialisation and register our classes in our own KryoRegistrator:

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[de.gaf.ric.workflow.RicKryoRegistrator].getName)
  .set("spark.kryo.registrationRequired", "true")
  .set("spark.kryoserializer.buffer.max", "512m")
implicit val sc = new SparkContext(sparkConf)

The process starts fine,

Can't instantiate SparkSession on EMR 5.0 HUE

旧城冷巷雨未停 submitted on 2019-12-10 10:16:32
Question: I'm running an EMR 5.0 cluster and I'm using HUE to create an Oozie workflow to submit a Spark 2.0 job. I have run the job with spark-submit directly on YARN and as a step on the same cluster. No problem. But when I do it through HUE I get the following error:

java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.internal.SessionState':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:949)
  at org.apache.spark

Adding postgresql jar through spark-submit on Amazon EMR

别来无恙 submitted on 2019-12-09 13:54:30
Question: I've tried spark-submit with --driver-class-path and with --jars, and also tried this method: https://petz2000.wordpress.com/2015/08/18/get-blas-working-with-spark-on-amazon-emr/ When using SPARK_CLASSPATH on the command line, as in SPARK_CLASSPATH=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar pyspark, I get this error: Found both spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former. But I'm not able to add it. How do I add the postgresql JDBC jar file so I can use it from pyspark? I'm
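
One common way to wire this up (a sketch, not taken from the thread) is to supply the driver jar at launch, e.g. pyspark --jars /home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar --driver-class-path /home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar, instead of SPARK_CLASSPATH, and then read over JDBC. The connection details below are hypothetical placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Assumes the JDBC driver jar was supplied at launch via --jars and
# --driver-class-path, so both the driver and the executors can load it.
sc = SparkContext(appName="postgres-example")
sqlContext = SQLContext(sc)

# Hypothetical connection details for illustration only.
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://my-host:5432/my_db")
      .option("dbtable", "public.my_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .load())
df.show()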

Airflow - Task Instance in EMR operator

别等时光非礼了梦想. submitted on 2019-12-09 11:32:49
Question: In Airflow, I'm facing the issue that I need to pass the job_flow_id to one of my EMR steps. I can retrieve the job_flow_id from the operator, but when I go to create the steps to submit to the cluster, the task_instance value is not right. I have the following code:

def issue_step(name, args):
    return [
        {
            "Name": name,
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": "s3://....", "Args": args},
        }
    ]

dag = DAG('example', description='My dag', schedule_interval='0 8 * *
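
A sketch of the pattern commonly used for this, reusing the question's issue_step helper and dag (assuming the Airflow 1.x contrib EMR operators; the task IDs, connection IDs, and step arguments are hypothetical): job_flow_id is a templated field, so it can pull the cluster ID from XCom at run time instead of at DAG-definition time.

from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

create_cluster = EmrCreateJobFlowOperator(
    task_id="create_cluster",
    aws_conn_id="aws_default",
    emr_conn_id="emr_default",
    dag=dag,
)

add_step = EmrAddStepsOperator(
    task_id="add_step",
    # Rendered at run time: pulls the job_flow_id pushed to XCom by create_cluster.
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
    steps=issue_step("my-step", ["arg1", "arg2"]),
    aws_conn_id="aws_default",
    dag=dag,
)

create_cluster >> add_step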

Spark Job error: YarnAllocator: Exit status: -100. Diagnostics: Container released on a *lost* node

蓝咒 submitted on 2019-12-09 06:05:47
Question: I am running a job on AWS EMR 4.1, Spark 1.5, with the following conf:

spark-submit --deploy-mode cluster --master yarn-cluster --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 --conf spark.storage.memoryFraction=0.45 --conf spark.shuffle.memoryFraction=0.75 --conf spark.task.maxFailures=1 --conf spark.network.timeout=1800s

Then I got the error below. Where can I find out what "Exit status: -100" means? And how might I be able to fix this problem

External Hive metastore for EMR

无人久伴 submitted on 2019-12-08 06:51:21
Question: I am creating an EMR cluster with the default Hive metastore, after which I override hive-site.xml with some properties that point to an AWS RDS instance as the Hive metastore. Everything is fine, but after restarting the Hive server I am not able to use RDS as the Hive metastore. It is still using the default Hive metastore created by EMR.

Answer 1: You can override the default configurations for applications by supplying a configuration object for applications when you create a cluster. The
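
A sketch of what such a configuration object might look like when creating the cluster with boto3; the RDS endpoint, credentials, release label, and instance settings below are hypothetical placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# hive-site classification pointing the metastore at an external RDS database.
hive_site = {
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://my-rds-endpoint:3306/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive_user",
        "javax.jdo.option.ConnectionPassword": "hive_password",
    },
}

emr.run_job_flow(
    Name="cluster-with-external-metastore",
    ReleaseLabel="emr-5.12.0",            # hypothetical release label
    Applications=[{"Name": "Hive"}],
    Configurations=[hive_site],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

Supplying the metastore settings at cluster creation means Hive uses the external metastore from the start, instead of reverting to the EMR-created default after hive-site.xml is edited by hand and the server restarted.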

Running Spark app on EMR is slow

僤鯓⒐⒋嵵緔 submitted on 2019-12-08 06:45:52
Question: I am new to Spark and MapReduce, and I have a problem running Spark on an AWS Elastic MapReduce (EMR) cluster. The problem is that running on EMR takes a lot of time. For example, I have a few million records in a .csv file that I read and convert into a JavaRDD. With Spark, it took 104.99 seconds to compute simple mapToDouble() and sum() functions on this dataset, whereas when I did the same calculation without Spark, using Java 8 and converting the .csv file to a List, it took only 0.5 seconds.

EMR with multiple encryption key providers

折月煮酒 submitted on 2019-12-08 06:17:01
Question: I'm running an EMR cluster with S3 client-side encryption enabled using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemes: CSE with a custom key provider, and CSE-KMS. Is it possible to configure EMR to use both encryption types by defining some kind of mapping between S3 bucket and encryption type? Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering if it's possible to disable encryption on EMRFS