emr

AWS Athena: does `msck repair table` incur costs?

懵懂的女人 submitted on 2019-12-10 21:48:02
Question: I have ORC data in S3 that looks like this:

s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/

Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the path partition convention above for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions. I have
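
A minimal sketch of how the hourly partition refresh could be automated, assuming boto3; the database, table, and results-bucket names below are hypothetical placeholders, not from the question:

import boto3

# Hypothetical names; replace with your own database, table, and results location.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_orc_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
print(response["QueryExecutionId"])

Since the EMR job knows exactly which clientId/year/month/day/hour it just wrote, an ALTER TABLE ... ADD PARTITION statement covering only the new partitions is a lighter-weight alternative to having MSCK REPAIR TABLE scan the whole prefix.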

How to prevent EMR Spark step from retrying?

穿精又带淫゛_ submitted on 2019-12-10 15:59:51
Question: I have an AWS EMR cluster (emr-4.2.0, Spark 1.5.2) where I am submitting steps from the AWS CLI. My problem is that if the Spark application fails, YARN tries to run the application again (under the same EMR step). How can I prevent this? I tried setting --conf spark.yarn.maxAppAttempts=1, which is correctly set in Environment/Spark Properties, but it doesn't prevent YARN from restarting the application.

Answer 1: You should try to set spark.task.maxFailures to 1 (4 by default).
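
A sketch of how both settings could be attached to an EMR step submitted programmatically, assuming boto3; the cluster ID, step name, and application jar below are hypothetical placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# The two --conf flags carry the settings discussed above:
# no extra YARN application attempts and no task retries.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # hypothetical cluster ID
    Steps=[{
        "Name": "my-spark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--conf", "spark.yarn.maxAppAttempts=1",
                "--conf", "spark.task.maxFailures=1",
                "s3://my-bucket/jars/my-app.jar",   # hypothetical application jar
            ],
        },
    }],
)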

KryoSerializer cannot find my SparkKryoRegistrator

[亡魂溺海] submitted on 2019-12-10 10:24:53
Question: I am using Spark 2.0.2 on Amazon emr-5.2.1 in client mode. I use Kryo serialisation and register our classes in our own KryoRegistrator:

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[de.gaf.ric.workflow.RicKryoRegistrator].getName)
  .set("spark.kryo.registrationRequired", "true")
  .set("spark.kryoserializer.buffer.max", "512m")
implicit val sc = new SparkContext(sparkConf)

The process starts fine,

Can't instantiate SparkSession on EMR 5.0 HUE

旧城冷巷雨未停 submitted on 2019-12-10 10:16:32
Question: I'm running an EMR 5.0 cluster and I'm using HUE to create an Oozie workflow to submit a Spark 2.0 job. I have run the job with spark-submit directly on YARN and as a step on the same cluster. No problem. But when I do it through HUE I get the following error:

java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.internal.SessionState':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:949)
  at org.apache.spark

Adding postgresql jar through spark-submit on Amazon EMR

别来无恙 submitted on 2019-12-09 13:54:30
Question: I've tried spark-submit with --driver-class-path and with --jars, and also tried this method: https://petz2000.wordpress.com/2015/08/18/get-blas-working-with-spark-on-amazon-emr/ When using SPARK_CLASSPATH on the command line, as in SPARK_CLASSPATH=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar pyspark, I get this error: Found both spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former. But I'm not able to add it. How do I add the postgresql JDBC jar file so I can use it from pyspark? I'm
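
One common way to wire this up (a sketch, not taken from the thread) is to supply the driver jar at launch, e.g. pyspark --jars /home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar --driver-class-path /home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar, instead of SPARK_CLASSPATH, and then read over JDBC. The connection details below are hypothetical placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Assumes the JDBC driver jar was supplied at launch via --jars and
# --driver-class-path, so both the driver and the executors can load it.
sc = SparkContext(appName="postgres-example")
sqlContext = SQLContext(sc)

# Hypothetical connection details for illustration only.
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://my-host:5432/my_db")
      .option("dbtable", "public.my_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .load())
df.show()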

Airflow - Task Instance in EMR operator

别等时光非礼了梦想. submitted on 2019-12-09 11:32:49
Question: In Airflow, I'm facing the issue that I need to pass the job_flow_id to one of my EMR steps. I can retrieve the job_flow_id from the operator, but when I go to create the steps to submit to the cluster, the task_instance value is not right. I have the following code:

def issue_step(name, args):
    return [
        {
            "Name": name,
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": "s3://....", "Args": args},
        }
    ]

dag = DAG('example', description='My dag', schedule_interval='0 8 * *
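
A sketch of the pattern commonly used for this, reusing the question's issue_step helper and dag (assuming the Airflow 1.x contrib EMR operators; the task IDs, connection IDs, and step arguments are hypothetical): job_flow_id is a templated field, so it can pull the cluster ID from XCom at run time instead of at DAG-definition time.

from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

create_cluster = EmrCreateJobFlowOperator(
    task_id="create_cluster",
    aws_conn_id="aws_default",
    emr_conn_id="emr_default",
    dag=dag,
)

add_step = EmrAddStepsOperator(
    task_id="add_step",
    # Rendered at run time: pulls the job_flow_id pushed to XCom by create_cluster.
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
    steps=issue_step("my-step", ["arg1", "arg2"]),
    aws_conn_id="aws_default",
    dag=dag,
)

create_cluster >> add_step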

Spark Job error: YarnAllocator: Exit status: -100. Diagnostics: Container released on a *lost* node

蓝咒 submitted on 2019-12-09 06:05:47
Question: I am running a job on AWS EMR 4.1, Spark 1.5, with the following conf:

spark-submit --deploy-mode cluster --master yarn-cluster --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 --conf spark.storage.memoryFraction=0.45 --conf spark.shuffle.memoryFraction=0.75 --conf spark.task.maxFailures=1 --conf spark.network.timeout=1800s

Then I got the error below. Where can I find out what "Exit status: -100" means? And how might I be able to fix this problem

External Hive metastore for EMR

无人久伴 submitted on 2019-12-08 06:51:21
Question: I am creating an EMR cluster with the default Hive metastore, after which I override hive-site.xml with some properties that point to an AWS RDS instance as the Hive metastore. Everything is fine, but after restarting the Hive server I am not able to use RDS as the Hive metastore. It is still using the default Hive metastore created by EMR.

Answer 1: You can override the default configurations for applications by supplying a configuration object for applications when you create a cluster. The
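
A sketch of what such a configuration object might look like when creating the cluster with boto3; the RDS endpoint, credentials, release label, and instance settings below are hypothetical placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# hive-site classification pointing the metastore at an external RDS database.
hive_site = {
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://my-rds-endpoint:3306/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive_user",
        "javax.jdo.option.ConnectionPassword": "hive_password",
    },
}

emr.run_job_flow(
    Name="cluster-with-external-metastore",
    ReleaseLabel="emr-5.12.0",            # hypothetical release label
    Applications=[{"Name": "Hive"}],
    Configurations=[hive_site],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

Supplying the metastore settings at cluster creation means Hive uses the external metastore from the start, instead of reverting to the EMR-created default after hive-site.xml is edited by hand and the server restarted.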

Running Spark app on EMR is slow

僤鯓⒐⒋嵵緔 submitted on 2019-12-08 06:45:52
Question: I am new to Spark and MapReduce, and I have a problem running Spark on an AWS Elastic MapReduce (EMR) cluster. The problem is that running on EMR takes a lot of time. For example, I have a few million records in a .csv file that I read and convert into a JavaRDD. With Spark, it took 104.99 seconds to compute simple mapToDouble() and sum() functions on this dataset, whereas when I did the same calculation without Spark, using Java 8 and converting the .csv file to a List, it took only 0.5 seconds.

EMR with multiple encryption key providers

折月煮酒 submitted on 2019-12-08 06:17:01
Question: I'm running an EMR cluster with S3 client-side encryption enabled using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemes: CSE with a custom key provider, and CSE-KMS. Is it possible to configure EMR to use both encryption types by defining some kind of mapping between S3 bucket and encryption type? Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering if it's possible to disable encryption on EMRFS