amazon-emr

How to set PYTHONHASHSEED on AWS EMR

Posted by 寵の児 on 2019-12-01 06:38:30
Is there any way to set an environment variable on all nodes of an EMR cluster? I am getting an error when trying to use reduceByKey() in Python3 PySpark, and the error concerns the hash seed. I can see this is a known issue, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it. I have tried adding a variable to spark-env through the cluster configuration: [ { "Classification": "spark-env", "Configurations": [ { "Classification": "export", "Properties": { "PYSPARK_PYTHON": "/usr/bin/python3",
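An alternative that avoids touching spark-env on every node is to pass the variable through Spark's own executor and application-master environment settings at submit time. A minimal PySpark sketch, assuming a Spark-on-YARN setup; the seed value 0 is arbitrary, it only has to be identical everywhere:

    # Sketch: propagate PYTHONHASHSEED to executors and the YARN application master
    # via standard Spark properties (spark.executorEnv.* / spark.yarn.appMasterEnv.*).
    # The value "0" is an arbitrary example; any fixed value works as long as it matches.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("hashseed-demo")
            .set("spark.executorEnv.PYTHONHASHSEED", "0")
            .set("spark.yarn.appMasterEnv.PYTHONHASHSEED", "0"))

    sc = SparkContext(conf=conf)
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    print(pairs.reduceByKey(lambda x, y: x + y).collect())
    sc.stop()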

Adding JDBC driver to Spark on EMR

Posted by 吃可爱长大的小学妹 on 2019-12-01 06:16:35
Question: I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting a java.sql.SQLException: No suitable driver found exception. I tried the following things: using addJar to add the driver JAR explicitly from the code; using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters; using spark.driver.userClassPathFirst=true. When I used that last option I got a different error because of a mix of dependencies with Spark. Anyway, this option
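For reference, a common pattern is to ship the driver JAR with --jars at submit time and name the driver class explicitly in the read. A hedged PySpark sketch; the JAR path, JDBC URL, table, credentials, and driver class below are placeholders, not values from the question:

    # Sketch: supply the driver JAR at submit time, e.g.
    #   spark-submit --jars /home/hadoop/mysql-connector-java.jar read_jdbc.py
    # All connection details below are placeholder examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://my-db-host:3306/mydb")
          .option("dbtable", "my_table")
          .option("user", "my_user")
          .option("password", "my_password")
          # Naming the driver class explicitly avoids "No suitable driver found"
          .option("driver", "com.mysql.jdbc.Driver")
          .load())
    df.show(5)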

Amazon EMR Pyspark Module not found

Posted by ☆樱花仙子☆ on 2019-12-01 04:04:01
I created an Amazon EMR cluster with Spark already on it. When I ssh into my cluster and run pyspark from the terminal, it goes into the pyspark shell. I uploaded a file using scp, and when I try to run it with python FileName.py, I get an import error: from pyspark import SparkContext ImportError: No module named pyspark How do I fix this? Answer 1: I added the following lines to ~/.bashrc for EMR 4.3: export SPARK_HOME=/usr/lib/spark export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.XXX-src.zip:$PYTHONPATH export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH Here py4j-0.XXX-src
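Besides editing ~/.bashrc, another option is to let the script locate the EMR Spark installation itself. A sketch using the third-party findspark package; /usr/lib/spark is the usual install path on EMR, but verify it on your cluster:

    # Sketch: resolve the pyspark package at runtime instead of relying on PYTHONPATH.
    # Requires `pip install findspark`; /usr/lib/spark is the typical EMR location.
    import findspark
    findspark.init("/usr/lib/spark")

    from pyspark import SparkContext

    sc = SparkContext(appName="filename-demo")
    print(sc.parallelize(range(10)).sum())
    sc.stop()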

How to use Hadoop Streaming with LZO-compressed Sequence Files?

Posted by 一世执手 on 2019-12-01 03:48:27
Question: I'm trying to play around with the Google Ngrams dataset using Amazon's Elastic MapReduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop Streaming. For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
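A streaming job over such data generally needs two things: an input format that turns sequence-file records back into text lines, and the LZO codec available on the cluster. The mapper below is an ordinary Python streaming mapper; the hadoop jar invocation in the comment is only a sketch (the jar path, input path, and the use of SequenceFileAsTextInputFormat assume a stock EMR Hadoop install with the hadoop-lzo libraries present):

    #!/usr/bin/env python
    # Sketch of a streaming mapper for the ngrams sequence files. Example invocation
    # (paths are placeholders; requires hadoop-lzo on the cluster):
    #   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    #     -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    #     -input s3://my-bucket/ngrams-input/ \
    #     -output s3://my-bucket/ngrams-output/ \
    #     -mapper mapper.py
    import sys

    for line in sys.stdin:
        # SequenceFileAsTextInputFormat emits "key<TAB>value"; the value holds the ngram row.
        _, _, value = line.rstrip("\n").partition("\t")
        fields = value.split("\t")
        if len(fields) >= 3:
            ngram, count = fields[0], fields[2]
            print("%s\t%s" % (ngram, count))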

S3 SlowDown error in Spark on EMR

Posted by Deadly on 2019-12-01 02:24:51
I am getting this error when writing a Parquet file; it has started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 2CA496E2AB87DC16), S3 Extended Request ID: 1dBrcqVGJU9VgoT79NAVGyN0fsbj9+6bipC7op97ZmP+zSFIuH72lN03ZtYabNIA2KaSj18a8ho= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient
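Two common mitigations for SlowDown (503) responses are cutting the number of files written in one burst and raising EMRFS's retry budget. A hedged PySpark sketch; the partition count of 64, the retry value, and the S3 paths are arbitrary examples, and fs.s3.maxRetries is an emrfs-site property that should be checked against your EMR release:

    # Sketch: fewer concurrent S3 PUTs plus more EMRFS retries on 503 SlowDown.
    # Partition count, retry count, and paths below are placeholder examples.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("slowdown-demo")
             .config("spark.hadoop.fs.s3.maxRetries", "20")
             .getOrCreate())

    df = spark.read.parquet("s3://my-bucket/input/")
    (df.coalesce(64)                      # fewer output files => fewer simultaneous requests
       .write.mode("overwrite")
       .parquet("s3://my-bucket/output/"))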

Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

Posted by 吃可爱长大的小学妹 on 2019-11-30 22:35:24
I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script on an EMR cluster. It takes DynamoDB as input and stores data in S3. This is the EMR step used in the EMR Activity: s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath} where output.directoryPath is: s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")} So this creates one folder and
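One pragmatic workaround, if the placeholder markers cannot be suppressed, is to delete the *_$folder$ keys after the step finishes. A hedged boto3 sketch as a cleanup pass; the bucket name and prefix are placeholders:

    # Sketch: remove the "_$folder$" placeholder keys Hadoop/EMR leaves behind in S3.
    # Bucket and prefix are placeholders; run after the pipeline step completes.
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    for page in paginator.paginate(Bucket="my-s3-bucket", Prefix="output/"):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("_$folder$"):
                s3.delete_object(Bucket="my-s3-bucket", Key=obj["Key"])
                print("deleted", obj["Key"])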

Optimizing GC on EMR cluster

Posted by *爱你&永不变心* on 2019-11-30 18:45:40
I am running a Spark job written in Scala on EMR, and the stdout of each executor is filled with GC allocation failures. 2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs] 2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K->1370376K(3294336K), 0.0091147 secs] [Times: user=0.11 sys=0.01, real=0.00 secs] 2016-12-07T23:42:22.525+0000:
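"Allocation Failure" is the normal trigger for a minor ParNew collection, so these lines are not errors in themselves; if GC pressure is the real concern, one common tweak is to pass GC flags to the executors and driver, for example switching to G1. A hedged sketch shown in PySpark for consistency with the other examples (the same --conf properties apply to a Scala job submitted with spark-submit); the flags are standard HotSpot options and the values are examples only, not an EMR-specific recipe:

    # Sketch: pass GC tuning/logging flags to executors and driver via Spark conf.
    # -XX:+UseG1GC and the logging flags are standard JVM options; values are examples.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("gc-tuning-demo")
            .set("spark.executor.extraJavaOptions",
                 "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 "
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
            .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC"))

    sc = SparkContext(conf=conf)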
