amazon-emr

Can't get a SparkContext in new AWS EMR Cluster

拟墨画扇 submitted on 2019-12-09 05:11:09
Question: I just set up an AWS EMR cluster (EMR release 5.18 with Spark 2.3.2). I SSH into the master machine and run spark-shell or pyspark, and get the following error:

    $ spark-shell
    log4j:ERROR setFile(null,true) call failed.
    java.io.FileNotFoundException: /stderr (Permission denied)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>

S3 SlowDown error in Spark on EMR

谁都会走 submitted on 2019-12-09 02:18:17
Question: I am getting this error when writing a Parquet file; it has started to happen recently:

    com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
    Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown;
    Request ID: 2CA496E2AB87DC16), S3 Extended Request ID: 1dBrcqVGJU9VgoT79NAVGyN0fsbj9+6bipC7op97ZmP+zSFIuH72lN03ZtYabNIA2KaSj18a8ho=
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse
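Beyond raising the EMRFS retry count (the emrfs-site setting fs.s3.maxRetries) or reducing the number of concurrent writers, application code that talks to S3 directly can retry throttled calls with exponential backoff and jitter. A minimal sketch in plain Python; `with_backoff` and `is_throttle` are hypothetical helpers, not EMR or AWS SDK APIs:

```python
import random
import time


def with_backoff(call, max_attempts=5, base_delay=1.0,
                 is_throttle=lambda exc: "SlowDown" in str(exc)):
    """Retry `call` with exponential backoff plus jitter on throttling errors.

    `call` is any zero-argument function; `is_throttle` decides whether an
    exception looks like an S3 503 SlowDown response.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_throttle(exc):
                raise
            # Sleep 2^attempt * base seconds, plus jitter so that many
            # writers hitting the same prefix do not retry in lock-step.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The jitter matters: synchronized retries from many executors just reproduce the request spike that triggered the SlowDown in the first place.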

AWS EMR bootstrap action as sudo

做~自己de王妃 submitted on 2019-12-08 19:25:35
I need to update /etc/hosts for all instances in my EMR cluster (EMR AMI 4.3). The whole script is nothing more than:

    #!/bin/bash
    echo -e 'ip1 uri1' >> /etc/hosts
    echo -e 'ip2 uri2' >> /etc/hosts
    ...

This script needs to run as sudo or it fails. From https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses: "Bootstrap actions execute as the Hadoop user by default. You can execute a bootstrap action with root privileges by using sudo." Great news... but I can't figure out how to do this, and I can't find an example. I've tried a bunch of things...
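The reason the script above fails is that `>> /etc/hosts` performs the redirection in the unprivileged (hadoop-user) shell; only the command left of the redirect can be escalated. One pattern is to route just the append through `sudo tee -a`. A sketch of that idea in Python (the `ip1 uri1` entries are the placeholders from the question; the function name is hypothetical, and it is written as a function so it can be exercised against a scratch file instead of the real /etc/hosts):

```python
import os
import subprocess


def append_host_entries(entries, target="/etc/hosts"):
    """Append host entries to `target`, escalating via sudo only when needed."""
    payload = "".join(line + "\n" for line in entries)
    if os.access(target, os.W_OK):
        # Already writable (a scratch file, or the process runs as root).
        with open(target, "a") as fh:
            fh.write(payload)
    else:
        # Bootstrap actions run as the hadoop user; `sudo tee -a` performs
        # the privileged append, where a plain `>>` redirect would fail.
        subprocess.run(
            ["sudo", "tee", "-a", target],
            input=payload.encode(),
            stdout=subprocess.DEVNULL,
            check=True,
        )
```

The equivalent one-liner in the bash script itself would be piping the entries into `sudo tee -a /etc/hosts` rather than redirecting with `>>`.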

Spark Streaming 1.6.1 is not working with Kinesis ASL 1.6.1 and ASL 2.0.0-preview

感情迁移 submitted on 2019-12-08 14:03:04
Question: I am trying to run a Spark Streaming job on EMR with Kinesis, using Spark 1.6.1 with Kinesis ASL 1.6.1, writing a plain sample word-count example.

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
      <version>1.6.1</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-client</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-producer
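One common cause of this combination breaking (stated here as an assumption, not a confirmed diagnosis of this question): spark-streaming-kinesis-asl 1.6.1 already pulls in a compatible amazon-kinesis-client transitively, and pinning a newer KCL (1.6.3 above) alongside it can produce binary incompatibilities at runtime. A leaner POM fragment that relies on the transitive KCL version would look like:

```xml
<!-- Let spark-streaming-kinesis-asl bring its own compatible KCL;
     do not pin amazon-kinesis-client separately. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
```

Running `mvn dependency:tree` shows which KCL version the ASL artifact actually resolves to, which is worth checking before pinning anything.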

Spark not installed on EMR cluster

北战南征 submitted on 2019-12-08 13:13:23
Question: I have been using Spark on an EMR cluster for a few weeks now without problems. The setup used AMI 3.8.0 and Spark 1.3.1, and I passed '-x' as an argument to Spark (without this it didn't seem to be installed). I want to upgrade to a more recent version of Spark, and today I spun up a cluster with the emr-4.1.0 AMI, containing Spark 1.5.0. When the cluster is up it claims to have successfully installed Spark (at least on the cluster management page on AWS), but when I SSH into 'hadoop@

Hive INSERT OVERWRITE and INSERT INTO are very slow with S3 external table

谁都会走 submitted on 2019-12-08 11:19:10
Question: I am using AWS EMR. I have created external tables pointing to an S3 location. The "INSERT INTO TABLE" and "INSERT OVERWRITE" statements are very slow when the destination table is an external table pointing to S3. The main issue is that Hive first writes data to a staging directory and then moves the data to its final location. Does anyone have a better solution for this? Using S3 is really slowing down our jobs. Cloudera recommends using the setting hive.mv.files.threads, but it looks like
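The slow step is the staging-directory move: on S3 a "rename" is a copy plus delete per file. Two commonly suggested mitigations, sketched here as assumptions to verify against your Hive version and EMR release: keep the scratch directory off the blobstore so the final step is a single HDFS-to-S3 upload rather than an S3-to-S3 copy (HIVE-14270), and parallelize the file moves.

```sql
-- Keep intermediate/staging data on HDFS instead of S3 (HIVE-14270);
-- check the default for your Hive version before relying on it.
SET hive.blobstore.use.blobstore.as.scratchdir=false;
-- Parallelize the final file moves (the setting Cloudera recommends).
SET hive.mv.files.threads=15;
```

For very large loads, another common pattern is writing to an HDFS-backed table first and copying the result to S3 with s3-dist-cp in a separate step.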

Running Spark app on EMR is slow

僤鯓⒐⒋嵵緔 submitted on 2019-12-08 06:45:52
Question: I am new to Spark and MapReduce, and I have a problem running Spark on an Elastic MapReduce (EMR) AWS cluster. The problem is that running on EMR takes a lot of time. For example, I have a few million records in a .csv file that I read and converted into a JavaRDD. It took Spark 104.99 seconds to calculate simple mapToDouble() and sum() functions on this dataset, while the same calculation without Spark, using Java 8 and converting the .csv file to a List, took only 0.5 seconds.
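The 0.5 s figure is plausible: summing a column over a few million parsed rows in a single process is sub-second work, so the 100+ s on EMR is dominated by fixed overhead (YARN container startup, task scheduling, serialization, and S3/disk I/O) rather than the arithmetic itself. A small pure-Python sketch of the non-distributed baseline, using synthetic rows with a hypothetical two-column layout:

```python
import time

# Synthetic stand-in for a few million parsed CSV rows: "id,value" strings.
rows = [f"{i},{i * 0.5}" for i in range(1_000_000)]

start = time.perf_counter()
# Equivalent of mapToDouble() + sum(): parse the second column and add it up.
total = sum(float(line.split(",")[1]) for line in rows)
elapsed = time.perf_counter() - start

print(f"sum={total:.1f} computed in {elapsed:.3f}s")
```

Spark's per-job overhead only amortizes once the data no longer fits comfortably in one machine's memory or the computation per record is substantial; below that threshold a single-process implementation will usually win.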

EMR with multiple encryption key providers

折月煮酒 submitted on 2019-12-08 06:17:01
Question: I'm running an EMR cluster with S3 client-side encryption enabled, using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemes: CSE with a custom key provider, and CSE-KMS. Is it possible to configure EMR to use both encryption types by defining some kind of mapping between S3 bucket and encryption type? Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering whether it's possible to disable encryption on EMRFS

How to access statistics endpoint for a Spark Streaming application?

末鹿安然 submitted on 2019-12-08 04:44:28
As of Spark 2.2.0, there are new endpoints in the REST API for getting information about streaming jobs. I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode. When I hit the endpoint for my streaming jobs, all it gives me is the error message: "no streaming listener attached to <stream name>". I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need in order to get this endpoint working? This appears to be an issue specifically when running on the cluster; the same code running on Spark 2.2.0

Spark Clustered By/Bucket by dataset not using memory

北城以北 submitted on 2019-12-08 03:02:13
Question: I recently came across Spark's bucketBy/clusteredBy here. I tried to mimic this for a 1.1 TB source file from S3 (already in Parquet). The plan is to completely avoid a shuffle, since most of the datasets are always joined on the "id" column. Here is what I am doing:

    myDf.repartition(20)
      .write.partitionBy("day")
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .option("path", "s3://my-bucket/folder/1year_data_bucketed/")
      .mode("overwrite")
      .format("parquet")
      .bucketBy(20, "id")
      .sortBy("id")
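Worth noting about the snippet above: in Spark 2.x, bucketBy is only supported together with saveAsTable; combining it with a path-based save raises an AnalysisException. A pseudocode-style sketch of the writer chain rewritten accordingly (illustrative only, not executed here; it assumes `myDf` exists in a live SparkSession with a Hive-backed catalog, and the table name is hypothetical):

```python
# Sketch: bucketed writes must go through the table catalog via saveAsTable.
(myDf
    .repartition(20)
    .write
    .partitionBy("day")
    .bucketBy(20, "id")
    .sortBy("id")
    .option("compression", "snappy")
    .option("path", "s3://my-bucket/folder/1year_data_bucketed/")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("my_db.year_data_bucketed"))  # hypothetical table name
```

For the shuffle-free join to kick in, both sides of the join must be bucketed on the same column into the same number of buckets, and bucketed reads must be enabled (spark.sql.sources.bucketing.enabled, which is on by default).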