amazon-emr

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

一世执手 submitted on 2019-12-02 17:39:31
Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled), and we realized that Spark 'SaveAsTable' (Parquet format) writes to S3 were ~4x slower than to HDFS, but we found a workaround of using the DirectParquetOutputCommitter [1] with Spark 1.6. Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary location and renames it later, and the rename operation is very expensive on S3. We also understand the risk of using 'DirectParquetOutputCommitter', which is the possibility of data corruption.
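A commonly suggested mitigation once the direct committer is gone in Spark 2.0 is to switch the Hadoop FileOutputCommitter to algorithm version 2, which moves task output into the final location at task commit instead of doing one big rename at job commit. A minimal sketch, assuming a SparkSession-based job; the bucket path and app name are illustrative, not from the question:

# Minimal sketch: reduce the S3 rename cost with FileOutputCommitter algorithm v2.
# Bucket path and app name are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-to-s3")
    # Commit each task's files directly into the destination directory,
    # skipping the job-level rename of the whole _temporary tree.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Parquet summary files cost extra S3 requests and are rarely needed.
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .getOrCreate()
)

df = spark.range(1000).withColumnRenamed("id", "value")
df.write.mode("overwrite").parquet("s3://my-bucket/example.parquet")  # hypothetical bucket

Note that algorithm v2 still renames per task and, like the old direct committer, is not atomic at the job level, so a failed job can leave partial output behind.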

Extremely slow S3 write times from EMR/Spark

我与影子孤独终老i submitted on 2019-12-02 17:12:49
I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR. My Spark job takes over 4 hours to complete, yet the cluster is only under load during the first 1.5 hours, so I was curious what Spark was doing all this time. I looked at the logs and found many s3 mv commands, one for each file, and looking directly at S3 I see all my files are in a _temporary directory. Secondly, I'm concerned about my cluster cost: it appears I only need 2 hours of compute for this specific task, yet I end up buying up to 5 hours. I'm curious if EMR
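The s3 mv lines and the _temporary directory are the default FileOutputCommitter staging every file and then copying it into place, which on S3 becomes a copy-and-delete per object at the end of the job. One workaround, sketched below under the assumption that EMR's s3-dist-cp tool is available on the driver node, is to write the output to HDFS first and then push it to S3 in a single bulk copy; all paths are illustrative:

# Sketch: commit to HDFS (cheap renames), then bulk-copy the finished output to S3.
# Input/output paths are illustrative, and invoking s3-dist-cp this way assumes
# it is on the PATH of the node running the driver (it ships with EMR).
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()

df = spark.read.json("s3://my-bucket/raw/")             # hypothetical input
df.write.mode("overwrite").parquet("hdfs:///tmp/out")   # fast commit on HDFS

# Copy to S3 only after the Spark write has fully committed.
subprocess.check_call([
    "s3-dist-cp",
    "--src", "hdfs:///tmp/out",
    "--dest", "s3://my-bucket/out.parquet",
])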

Spark History Server behind Load Balancer is redirecting to HTTP

我是研究僧i submitted on 2019-12-02 12:25:52
I am currently running Spark on AWS EMR, but when the Spark History Server sits behind a load balancer (AWS ELB), traffic is redirected from https to http, which then ends up getting denied because I don't allow http traffic through the load balancer for the given port. It appears that this might stem from YARN acting as a proxy as well, but I'm not sure. Source: https://stackoverflow.com/questions/56412083/spark-history-server-behind-load-balancer-is-redirecting-to-http

AWS EMR Parallel Mappers?

南笙酒味 submitted on 2019-12-02 07:53:56
I am trying to determine how many nodes I need for my EMR cluster. As part of best practices, the recommendation is: (Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time), as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013 , page 89. The question is how to determine how many parallel mappers the instance will support, since AWS doesn't publish this: https://aws.amazon.com/emr/pricing/ Sorry if I missed something obvious. Wayne To determine the number of parallel
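On YARN-based EMR, the number of mappers a node can run in parallel is roughly the memory the NodeManager offers to YARN divided by the memory requested per map container, so a back-of-the-envelope estimate can be computed from those two settings. A small sketch with hypothetical numbers (the memory figures are examples, not published EMR capacities):

# Rough estimate of parallel mappers per node and of cluster size.
# All memory figures are hypothetical examples, not official EMR values.
import math

def mappers_per_node(yarn_node_memory_mb, map_container_mb):
    """Concurrent map containers one node can hold, assuming memory is the bottleneck."""
    return yarn_node_memory_mb // map_container_mb

def nodes_needed(total_mappers, minutes_per_mapper, per_node_parallelism, target_minutes):
    """Nodes needed to finish all map tasks within the target wall-clock time."""
    total_task_minutes = total_mappers * minutes_per_mapper
    return math.ceil(total_task_minutes / (per_node_parallelism * target_minutes))

# Example: ~12 GB handed to YARN per node and 1.5 GB per map container
# gives about 8 concurrent mappers; 400 ten-minute map tasks finished
# within an hour would then need roughly 9 nodes.
per_node = mappers_per_node(12288, 1536)        # -> 8
print(nodes_needed(400, 10, per_node, 60))      # -> 9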

avro error on AWS EMR

蹲街弑〆低调 submitted on 2019-12-02 04:11:43
I'm using spark-redshift (https://github.com/databricks/spark-redshift), which uses Avro for the transfer. Reading from Redshift is OK, but while writing I'm getting: Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter I tried using Amazon EMR 4.1.0 (Spark 1.5.0) and 4.0.0 (Spark 1.4.1). I cannot do import org.apache.avro.generic.GenericData.createDatumWriter either, just import org.apache.avro.generic.GenericData. I'm using the Scala shell. I tried downloading several other avro-mapred and avro jars, tried setting {
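A NoSuchMethodError on GenericData.createDatumWriter usually means an older Avro jar on the cluster classpath is shadowing the newer one that spark-redshift's Avro path expects, so the usual first step is to put a matching Avro jar ahead of it for both the driver and the executors. A hedged sketch of the configuration keys involved; the jar path and version are assumptions, not taken from the question, and with spark-shell they would normally be passed on the command line:

# Sketch: configuration to make a newer Avro jar win over the older one bundled
# with the cluster. The jar location and version are hypothetical.
avro_jar = "/home/hadoop/avro-1.7.7.jar"

conf = {
    "spark.driver.extraClassPath": avro_jar,     # must be in place before the driver JVM starts
    "spark.executor.extraClassPath": avro_jar,
}

# Printed as spark-shell / spark-submit arguments, e.g.:
#   spark-shell --jars /home/hadoop/avro-1.7.7.jar --conf spark.driver.extraClassPath=... ...
print(" ".join("--conf {0}={1}".format(k, v) for k, v in sorted(conf.items())))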

pyspark error does not exist in the jvm error when initializing SparkContext

别来无恙 submitted on 2019-12-01 17:46:25
I am using Spark on EMR and writing a pyspark script; I am getting an error when trying to
from pyspark import SparkContext
sc = SparkContext()
This is the error:
File "pyex.py", line 5, in <module>
    sc = SparkContext()
File "/usr/local/lib/python3.4/site-packages/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
File "/usr/local/lib/python3.4/site-packages/pyspark/context.py", line 195, in _do_init
    self._encryption_enabled = self._jvm.PythonUtils.getEncryptionEnabled(self._jsc)
File "/usr/local/lib/python3.4/site-packages/py4j/java_gateway.py", line 1487, in __getattr__
    "{0}
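The "does not exist in the JVM" failure during SparkContext() typically indicates that the pip-installed pyspark is newer than the Spark runtime the cluster provides, so py4j asks the JVM for a method (here PythonUtils.getEncryptionEnabled) that the older Spark does not have. A quick sanity check, sketched below with illustrative version numbers:

# Sketch: compare the pip-installed pyspark with the cluster's Spark version;
# a mismatch (e.g. pyspark 2.4.x against a 2.3.x cluster) produces exactly this
# "does not exist in the JVM" py4j error. Version numbers are examples.
import subprocess
import pyspark

print("pip pyspark version:", pyspark.__version__)
subprocess.call(["spark-submit", "--version"])   # version of the Spark installed on the cluster

# If they differ, pin the Python package to the cluster's version, e.g.
#   pip install pyspark==2.3.2
# or skip the pip package entirely and use the pyspark bundled with the cluster
# (on EMR, SPARK_HOME is typically /usr/lib/spark).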

How to avoid reading old files from S3 when appending new data?

心不动则不痛 submitted on 2019-12-01 14:04:53
Every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In the spark-submit output I can see that significant time is spent reading old Parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26
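Two common reasons an append touches the existing files are Parquet schema merging and the maintenance of _metadata/_common_metadata summary files spanning the whole dataset; disabling both, or writing each run straight into its own partition directory instead of appending at the dataset root, usually stops the old-file reads. A sketch under those assumptions, using the Spark 2.x SparkSession API for brevity (the option names are real Spark/Parquet settings, but the new-data path and partition values are illustrative):

# Sketch: keep an append-only job from reading back existing Parquet files.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("append-parquet")
    # Don't merge schemas from all existing footers when the dataset is read.
    .config("spark.sql.parquet.mergeSchema", "false")
    # Don't rewrite _metadata/_common_metadata summary files over the whole dataset.
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .getOrCreate()
)

df = spark.read.json("s3://myBucket/new-batch/")   # hypothetical new data

# Option A: keep appending at the dataset root, relying on the settings above.
df.write.mode("append").partitionBy("id", "day").parquet("s3://myBucket/foo.parquet")

# Option B: write straight into the new partition directory so the job never
# lists the older partitions at all (drop the partition columns first).
(df.where("id = 123 AND day = '2016-11-27'")
   .drop("id", "day")
   .write.mode("overwrite")
   .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-27"))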

AWS EMR 5.11.0 - Apache Hive on Spark

佐手、 submitted on 2019-12-01 08:01:55
I am trying to set up Apache Hive on Spark on AWS EMR 5.11.0. Apache Spark version: 2.2.1. Apache Hive version: 2.3.2. The YARN logs show the error below:
18/01/28 21:55:28 ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native