amazon-emr

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

一世执手 submitted on 2019-12-02 17:39:31
Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled), and we realized that Spark 'SaveAsTable' (Parquet format) writes to S3 were ~4x slower than to HDFS, but we found a workaround of using the DirectParquetOutputCommitter [1] with Spark 1.6. Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary location and renames it later, and the rename operation is very expensive on S3. We also understand the risk of using 'DirectParquetOutputCommitter', which is the possibility of data corruption.
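A commonly suggested mitigation once the direct committer is gone in Spark 2.0 is to switch the Hadoop FileOutputCommitter to algorithm version 2, which moves task output into the final location at task commit instead of doing one big rename at job commit. A minimal sketch, assuming a SparkSession-based job; the bucket path and app name are illustrative, not from the question:

# Minimal sketch: reduce the S3 rename cost with FileOutputCommitter algorithm v2.
# Bucket path and app name are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-to-s3")
    # Commit each task's files directly into the destination directory,
    # skipping the job-level rename of the whole _temporary tree.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Parquet summary files cost extra S3 requests and are rarely needed.
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .getOrCreate()
)

df = spark.range(1000).withColumnRenamed("id", "value")
df.write.mode("overwrite").parquet("s3://my-bucket/example.parquet")  # hypothetical bucket

Note that algorithm v2 still renames per task and, like the old direct committer, is not atomic at the job level, so a failed job can leave partial output behind.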

Extremely slow S3 write times from EMR/Spark

我与影子孤独终老i submitted on 2019-12-02 17:12:49
I'm writing to see if anyone knows how to speed up S3 write times from Spark running in EMR. My Spark job takes over 4 hours to complete, yet the cluster is only under load during the first 1.5 hours, so I was curious what Spark was doing all this time. I looked at the logs and found many s3 mv commands, one for each file, and looking directly at S3 I see all my files are in a _temporary directory. Secondly, I'm concerned about my cluster cost: it appears I only need 2 hours of compute for this specific task, yet I end up buying up to 5 hours. I'm curious if EMR
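The s3 mv lines and the _temporary directory are the default FileOutputCommitter staging every file and then copying it into place, which on S3 becomes a copy-and-delete per object at the end of the job. One workaround, sketched below under the assumption that EMR's s3-dist-cp tool is available on the driver node, is to write the output to HDFS first and then push it to S3 in a single bulk copy; all paths are illustrative:

# Sketch: commit to HDFS (cheap renames), then bulk-copy the finished output to S3.
# Input/output paths are illustrative, and invoking s3-dist-cp this way assumes
# it is on the PATH of the node running the driver (it ships with EMR).
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()

df = spark.read.json("s3://my-bucket/raw/")             # hypothetical input
df.write.mode("overwrite").parquet("hdfs:///tmp/out")   # fast commit on HDFS

# Copy to S3 only after the Spark write has fully committed.
subprocess.check_call([
    "s3-dist-cp",
    "--src", "hdfs:///tmp/out",
    "--dest", "s3://my-bucket/out.parquet",
])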

Spark History Server behind Load Balancer is redirecting to HTTP

我是研究僧i submitted on 2019-12-02 12:25:52
I am currently running Spark on AWS EMR, but when the Spark History Server sits behind a load balancer (AWS ELB), traffic is redirected from https to http, which then ends up getting denied because I don't allow http traffic through the load balancer for the given port. It appears that this might stem from YARN acting as a proxy as well, but I'm not sure. Source: https://stackoverflow.com/questions/56412083/spark-history-server-behind-load-balancer-is-redirecting-to-http

AWS EMR Parallel Mappers?

南笙酒味 submitted on 2019-12-02 07:53:56
I am trying to determine how many nodes I need for my EMR cluster. As part of best practices, the recommendation is: (Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time), as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013 , page 89. The question is how to determine how many parallel mappers the instance will support, since AWS doesn't publish this: https://aws.amazon.com/emr/pricing/ Sorry if I missed something obvious. Wayne To determine the number of parallel
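On YARN-based EMR, the number of mappers a node can run in parallel is roughly the memory the NodeManager offers to YARN divided by the memory requested per map container, so a back-of-the-envelope estimate can be computed from those two settings. A small sketch with hypothetical numbers (the memory figures are examples, not published EMR capacities):

# Rough estimate of parallel mappers per node and of cluster size.
# All memory figures are hypothetical examples, not official EMR values.
import math

def mappers_per_node(yarn_node_memory_mb, map_container_mb):
    """Concurrent map containers one node can hold, assuming memory is the bottleneck."""
    return yarn_node_memory_mb // map_container_mb

def nodes_needed(total_mappers, minutes_per_mapper, per_node_parallelism, target_minutes):
    """Nodes needed to finish all map tasks within the target wall-clock time."""
    total_task_minutes = total_mappers * minutes_per_mapper
    return math.ceil(total_task_minutes / (per_node_parallelism * target_minutes))

# Example: ~12 GB handed to YARN per node and 1.5 GB per map container
# gives about 8 concurrent mappers; 400 ten-minute map tasks finished
# within an hour would then need roughly 9 nodes.
per_node = mappers_per_node(12288, 1536)        # -> 8
print(nodes_needed(400, 10, per_node, 60))      # -> 9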

avro error on AWS EMR

蹲街弑〆低调 submitted on 2019-12-02 04:11:43
I'm using spark-redshift (https://github.com/databricks/spark-redshift), which uses Avro for the transfer. Reading from Redshift is OK, but while writing I'm getting: Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter I tried using Amazon EMR 4.1.0 (Spark 1.5.0) and 4.0.0 (Spark 1.4.1). I cannot do import org.apache.avro.generic.GenericData.createDatumWriter either, just import org.apache.avro.generic.GenericData. I'm using the Scala shell. I tried downloading several other avro-mapred and avro jars, tried setting {
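A NoSuchMethodError on GenericData.createDatumWriter usually means an older Avro jar on the cluster classpath is shadowing the newer one that spark-redshift's Avro path expects, so the usual first step is to put a matching Avro jar ahead of it for both the driver and the executors. A hedged sketch of the configuration keys involved; the jar path and version are assumptions, not taken from the question, and with spark-shell they would normally be passed on the command line:

# Sketch: configuration to make a newer Avro jar win over the older one bundled
# with the cluster. The jar location and version are hypothetical.
avro_jar = "/home/hadoop/avro-1.7.7.jar"

conf = {
    "spark.driver.extraClassPath": avro_jar,     # must be in place before the driver JVM starts
    "spark.executor.extraClassPath": avro_jar,
}

# Printed as spark-shell / spark-submit arguments, e.g.:
#   spark-shell --jars /home/hadoop/avro-1.7.7.jar --conf spark.driver.extraClassPath=... ...
print(" ".join("--conf {0}={1}".format(k, v) for k, v in sorted(conf.items())))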

pyspark error does not exist in the jvm error when initializing SparkContext

别来无恙 submitted on 2019-12-01 17:46:25
I am using Spark on EMR and writing a pyspark script; I am getting an error when trying to
from pyspark import SparkContext
sc = SparkContext()
This is the error:
File "pyex.py", line 5, in <module>
    sc = SparkContext()
File "/usr/local/lib/python3.4/site-packages/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
File "/usr/local/lib/python3.4/site-packages/pyspark/context.py", line 195, in _do_init
    self._encryption_enabled = self._jvm.PythonUtils.getEncryptionEnabled(self._jsc)
File "/usr/local/lib/python3.4/site-packages/py4j/java_gateway.py", line 1487, in __getattr__
    "{0}
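The "does not exist in the JVM" failure during SparkContext() typically indicates that the pip-installed pyspark is newer than the Spark runtime the cluster provides, so py4j asks the JVM for a method (here PythonUtils.getEncryptionEnabled) that the older Spark does not have. A quick sanity check, sketched below with illustrative version numbers:

# Sketch: compare the pip-installed pyspark with the cluster's Spark version;
# a mismatch (e.g. pyspark 2.4.x against a 2.3.x cluster) produces exactly this
# "does not exist in the JVM" py4j error. Version numbers are examples.
import subprocess
import pyspark

print("pip pyspark version:", pyspark.__version__)
subprocess.call(["spark-submit", "--version"])   # version of the Spark installed on the cluster

# If they differ, pin the Python package to the cluster's version, e.g.
#   pip install pyspark==2.3.2
# or skip the pip package entirely and use the pyspark bundled with the cluster
# (on EMR, SPARK_HOME is typically /usr/lib/spark).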

How to avoid reading old files from S3 when appending new data?

心不动则不痛 submitted on 2019-12-01 14:04:53
Every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In the spark-submit output I can see that significant time is spent reading old Parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26
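Two common reasons an append touches the existing files are Parquet schema merging and the maintenance of _metadata/_common_metadata summary files spanning the whole dataset; disabling both, or writing each run straight into its own partition directory instead of appending at the dataset root, usually stops the old-file reads. A sketch under those assumptions, using the Spark 2.x SparkSession API for brevity (the option names are real Spark/Parquet settings, but the new-data path and partition values are illustrative):

# Sketch: keep an append-only job from reading back existing Parquet files.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("append-parquet")
    # Don't merge schemas from all existing footers when the dataset is read.
    .config("spark.sql.parquet.mergeSchema", "false")
    # Don't rewrite _metadata/_common_metadata summary files over the whole dataset.
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .getOrCreate()
)

df = spark.read.json("s3://myBucket/new-batch/")   # hypothetical new data

# Option A: keep appending at the dataset root, relying on the settings above.
df.write.mode("append").partitionBy("id", "day").parquet("s3://myBucket/foo.parquet")

# Option B: write straight into the new partition directory so the job never
# lists the older partitions at all (drop the partition columns first).
(df.where("id = 123 AND day = '2016-11-27'")
   .drop("id", "day")
   .write.mode("overwrite")
   .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-27"))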

AWS EMR 5.11.0 - Apache Hive on Spark

佐手、 submitted on 2019-12-01 08:01:55
I am trying to set up Apache Hive on Spark on AWS EMR 5.11.0. Apache Spark version: 2.2.1. Apache Hive version: 2.3.2. The YARN logs show the error below:
18/01/28 21:55:28 ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native