amazon-emr

How to set PYTHONHASHSEED on AWS EMR

Posted by 寵の児 on 2019-12-01 06:38:30
Is there any way to set an environment variable on all nodes of an EMR cluster? I am getting an error when trying to use reduceByKey() in Python3 PySpark, and the error concerns the hash seed. I can see this is a known issue, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it. I have tried adding a variable to spark-env through the cluster configuration: [ { "Classification": "spark-env", "Configurations": [ { "Classification": "export", "Properties": { "PYSPARK_PYTHON": "/usr/bin/python3",
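An alternative that avoids touching spark-env on every node is to pass the variable through Spark's own executor and application-master environment settings at submit time. A minimal PySpark sketch, assuming a Spark-on-YARN setup; the seed value 0 is arbitrary, it only has to be identical everywhere:

    # Sketch: propagate PYTHONHASHSEED to executors and the YARN application master
    # via standard Spark properties (spark.executorEnv.* / spark.yarn.appMasterEnv.*).
    # The value "0" is an arbitrary example; any fixed value works as long as it matches.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("hashseed-demo")
            .set("spark.executorEnv.PYTHONHASHSEED", "0")
            .set("spark.yarn.appMasterEnv.PYTHONHASHSEED", "0"))

    sc = SparkContext(conf=conf)
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    print(pairs.reduceByKey(lambda x, y: x + y).collect())
    sc.stop()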

Adding JDBC driver to Spark on EMR

Posted by 吃可爱长大的小学妹 on 2019-12-01 06:16:35
Question: I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting a java.sql.SQLException: No suitable driver found exception. I tried the following things: using addJar to add the driver JAR explicitly from the code; using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters; using spark.driver.userClassPathFirst=true. When I used that last option I got a different error because of a mix of dependencies with Spark. Anyway, this option
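For reference, a common pattern is to ship the driver JAR with --jars at submit time and name the driver class explicitly in the read. A hedged PySpark sketch; the JAR path, JDBC URL, table, credentials, and driver class below are placeholders, not values from the question:

    # Sketch: supply the driver JAR at submit time, e.g.
    #   spark-submit --jars /home/hadoop/mysql-connector-java.jar read_jdbc.py
    # All connection details below are placeholder examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://my-db-host:3306/mydb")
          .option("dbtable", "my_table")
          .option("user", "my_user")
          .option("password", "my_password")
          # Naming the driver class explicitly avoids "No suitable driver found"
          .option("driver", "com.mysql.jdbc.Driver")
          .load())
    df.show(5)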

Amazon EMR Pyspark Module not found

Posted by ☆樱花仙子☆ on 2019-12-01 04:04:01
I created an Amazon EMR cluster with Spark already on it. When I ssh into my cluster and run pyspark from the terminal, it goes into the pyspark shell. I uploaded a file using scp, and when I try to run it with python FileName.py, I get an import error: from pyspark import SparkContext ImportError: No module named pyspark How do I fix this? Answer 1: I added the following lines to ~/.bashrc for EMR 4.3: export SPARK_HOME=/usr/lib/spark export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.XXX-src.zip:$PYTHONPATH export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH Here py4j-0.XXX-src
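Besides editing ~/.bashrc, another option is to let the script locate the EMR Spark installation itself. A sketch using the third-party findspark package; /usr/lib/spark is the usual install path on EMR, but verify it on your cluster:

    # Sketch: resolve the pyspark package at runtime instead of relying on PYTHONPATH.
    # Requires `pip install findspark`; /usr/lib/spark is the typical EMR location.
    import findspark
    findspark.init("/usr/lib/spark")

    from pyspark import SparkContext

    sc = SparkContext(appName="filename-demo")
    print(sc.parallelize(range(10)).sum())
    sc.stop()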

How to use Hadoop Streaming with LZO-compressed Sequence Files?

Posted by 一世执手 on 2019-12-01 03:48:27
Question: I'm trying to play around with the Google Ngrams dataset using Amazon's Elastic MapReduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop Streaming. For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
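A streaming job over such data generally needs two things: an input format that turns sequence-file records back into text lines, and the LZO codec available on the cluster. The mapper below is an ordinary Python streaming mapper; the hadoop jar invocation in the comment is only a sketch (the jar path, input path, and the use of SequenceFileAsTextInputFormat assume a stock EMR Hadoop install with the hadoop-lzo libraries present):

    #!/usr/bin/env python
    # Sketch of a streaming mapper for the ngrams sequence files. Example invocation
    # (paths are placeholders; requires hadoop-lzo on the cluster):
    #   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    #     -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
    #     -input s3://my-bucket/ngrams-input/ \
    #     -output s3://my-bucket/ngrams-output/ \
    #     -mapper mapper.py
    import sys

    for line in sys.stdin:
        # SequenceFileAsTextInputFormat emits "key<TAB>value"; the value holds the ngram row.
        _, _, value = line.rstrip("\n").partition("\t")
        fields = value.split("\t")
        if len(fields) >= 3:
            ngram, count = fields[0], fields[2]
            print("%s\t%s" % (ngram, count))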

S3 SlowDown error in Spark on EMR

Posted by Deadly on 2019-12-01 02:24:51
I am getting this error when writing a Parquet file; it has started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 2CA496E2AB87DC16), S3 Extended Request ID: 1dBrcqVGJU9VgoT79NAVGyN0fsbj9+6bipC7op97ZmP+zSFIuH72lN03ZtYabNIA2KaSj18a8ho= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient
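Two common mitigations for SlowDown (503) responses are cutting the number of files written in one burst and raising EMRFS's retry budget. A hedged PySpark sketch; the partition count of 64, the retry value, and the S3 paths are arbitrary examples, and fs.s3.maxRetries is an emrfs-site property that should be checked against your EMR release:

    # Sketch: fewer concurrent S3 PUTs plus more EMRFS retries on 503 SlowDown.
    # Partition count, retry count, and paths below are placeholder examples.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("slowdown-demo")
             .config("spark.hadoop.fs.s3.maxRetries", "20")
             .getOrCreate())

    df = spark.read.parquet("s3://my-bucket/input/")
    (df.coalesce(64)                      # fewer output files => fewer simultaneous requests
       .write.mode("overwrite")
       .parquet("s3://my-bucket/output/"))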

Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

Posted by 吃可爱长大的小学妹 on 2019-11-30 22:35:24
I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script on an EMR cluster. It takes DynamoDB as input and stores data in S3. This is the EMR step used in the EMR Activity: s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath} where output.directoryPath is: s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")} So this creates one folder and
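One pragmatic workaround, if the placeholder markers cannot be suppressed, is to delete the *_$folder$ keys after the step finishes. A hedged boto3 sketch as a cleanup pass; the bucket name and prefix are placeholders:

    # Sketch: remove the "_$folder$" placeholder keys Hadoop/EMR leaves behind in S3.
    # Bucket and prefix are placeholders; run after the pipeline step completes.
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    for page in paginator.paginate(Bucket="my-s3-bucket", Prefix="output/"):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("_$folder$"):
                s3.delete_object(Bucket="my-s3-bucket", Key=obj["Key"])
                print("deleted", obj["Key"])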

Optimizing GC on EMR cluster

Posted by *爱你&永不变心* on 2019-11-30 18:45:40
I am running a Spark job written in Scala on EMR, and the stdout of each executor is filled with GC allocation failures. 2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs] 2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K->1370376K(3294336K), 0.0091147 secs] [Times: user=0.11 sys=0.01, real=0.00 secs] 2016-12-07T23:42:22.525+0000:
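"Allocation Failure" is the normal trigger for a minor ParNew collection, so these lines are not errors in themselves; if GC pressure is the real concern, one common tweak is to pass GC flags to the executors and driver, for example switching to G1. A hedged sketch shown in PySpark for consistency with the other examples (the same --conf properties apply to a Scala job submitted with spark-submit); the flags are standard HotSpot options and the values are examples only, not an EMR-specific recipe:

    # Sketch: pass GC tuning/logging flags to executors and driver via Spark conf.
    # -XX:+UseG1GC and the logging flags are standard JVM options; values are examples.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("gc-tuning-demo")
            .set("spark.executor.extraJavaOptions",
                 "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 "
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
            .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC"))

    sc = SparkContext(conf=conf)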
