emr

Detected Guava issue #1635 which indicates that a version of Guava less than 16.01 is in use

霸气de小男生 submitted on 2019-12-19 19:43:07
Question: I am running a Spark job on EMR and using the DataStax connector to connect to a Cassandra cluster. I am facing issues with the Guava jar; please find the details below. I am using the following Cassandra versions: cqlsh 5.0.1 | Cassandra 3.0.1 | CQL spec 3.3.1. I am running the Spark job on EMR 4.4 with the following Maven dependencies: org.apache.spark spark-streaming_2.10 1.5.0 <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.5.0</version> </dependency> <dependency> <groupId>org
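The excerpt is cut off before any answer, but one commonly suggested way to work around this family of Guava conflicts (an assumption on my part, not something stated above) is to have Spark prefer the classes shipped with the application over the older Guava bundled on the EMR classpath; shading Guava inside the application jar is the other usual route. A minimal PySpark sketch of the classpath option, whose two properties can equally be passed with --conf on spark-submit for a JVM job:

```python
from pyspark import SparkConf, SparkContext

# Sketch only: prefer the application's jars (including its newer Guava) over
# the cluster-provided ones. Whether this resolves the conflict depends on how
# the DataStax connector and its Guava dependency are packaged.
conf = (SparkConf()
        .setAppName("cassandra-connector-job")
        .set("spark.driver.userClassPathFirst", "true")
        .set("spark.executor.userClassPathFirst", "true"))

sc = SparkContext(conf=conf)
```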

EMR Spark - TransportClient: Failed to send RPC

£可爱£侵袭症+ submitted on 2019-12-19 13:07:28
Question: I'm getting this error. I tried increasing the memory on the cluster instances and in the executor and driver parameters, without success. 17/05/07 23:17:07 ERROR TransportClient: Failed to send RPC 6465703946954088562 to ip-172-30-12-164.eu-central-1.compute.internal/172.30.12.164:34706: java.nio.channels.ClosedChannelException Does anyone have any clue how to fix this error? By the way, I'm using YARN as the cluster manager. Thanks in advance. Answer 1: Finally I resolved the problem. It was due to insufficient disk

How to avoid reading old files from S3 when appending new data?

允我心安 submitted on 2019-12-19 12:06:15
Question: Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3: df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet") In the spark-submit output I can see that significant time is spent reading old Parquet files, for example: 16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet'
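The post is cut off before any answer, but a workaround often suggested for this pattern (my assumption, not from the excerpt) is to write each new batch directly into its target partition directory, so Spark never has to list or open the existing partitions under the dataset root:

```python
# Sketch: append only to the partition being produced in this run. The id and
# day values below are illustrative; in practice they come from the batch
# being processed. Dropping the partition columns keeps the file layout the
# same as the partitionBy(...) output.
new_batch = df.where("id = '123' AND day = '2016-11-27'").drop("id").drop("day")

(new_batch.write
    .mode("append")
    .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-27"))
```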

Pyspark - Load file: Path does not exist

99封情书 submitted on 2019-12-19 03:39:14
Question: I am a newbie to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ .appName("Protob Conversion to Parquet") \ .config("spark.some.config.option", "some-value") \ .getOrCreate() df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True) When I run the script, it raises the following error message: pyspark.sql.utils.AnalysisException: u'Path does not exist:
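The excerpt ends before any answer, but on EMR a path without a scheme is typically resolved against HDFS rather than the local filesystem, which would explain the "Path does not exist" error. A hedged sketch of the usual fix, assuming the file really does sit on the driver node:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Protob Conversion to Parquet")
         .getOrCreate())

# file:// points Spark at the node-local filesystem instead of the default
# HDFS. On a multi-node cluster the file must be readable from every executor,
# so copying it to HDFS or S3 first is the more robust option.
df = spark.read.csv("file:///home/hadoop/observations_temp.csv", header=True)
```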

Optimizing GC on EMR cluster

回眸只為那壹抹淺笑 submitted on 2019-12-18 19:33:09
Question: I am running a Spark job written in Scala on EMR, and the stdout of each executor is filled with GC allocation failures. 2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs] 2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K-
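The post is truncated here. For context (not taken from the excerpt): "Allocation Failure" entries from ParNew are ordinary young-generation collections rather than errors, so they are usually noise unless GC time starts to dominate the job. If tuning is needed, the collector and GC logging can be adjusted per executor; a hedged sketch of that knob, with illustrative rather than tuned values:

```python
from pyspark import SparkConf, SparkContext

# Sketch: switch executors to G1GC and make GC logging more explicit. The same
# two settings can be passed with --conf on spark-submit for a Scala job.
conf = (SparkConf()
        .setAppName("gc-tuning-example")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
        .set("spark.executor.memory", "4g"))

sc = SparkContext(conf=conf)
```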

Pyspark --py-files doesn't work

老子叫甜甜 submitted on 2019-12-18 14:14:44
Question: I am following what the documentation suggests (http://spark.apache.org/docs/1.1.1/submitting-applications.html), Spark version 1.1.0: ./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \ /home/hadoop/loganalysis/ship-test.py and the conf in code: conf = (SparkConf() .setMaster("yarn-client") .setAppName("LogAnalysis") .set("spark.executor.memory", "1g") .set("spark.executor.cores", "4") .set("spark.executor.num", "2") .set("spark.driver.memory", "4g") .set("spark.kryoserializer.buffer.mb
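The question is cut off here. One alternative often tried when --py-files does not make the packaged modules importable on the executors (an assumption, not from the truncated post) is to distribute the archive from the driver code itself:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("LogAnalysis"))
sc = SparkContext(conf=conf)

# Ship the zipped package to every executor so that modules inside
# parser-src.zip can be imported within map/filter functions.
sc.addPyFile("/home/hadoop/loganalysis/parser-src.zip")
```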

How do you make a HIVE table out of JSON data?

穿精又带淫゛_ submitted on 2019-12-18 10:05:49
Question: I want to create a Hive table out of some (nested) JSON data and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table. Does anyone have an example command to get me started? I can't find anything useful with Google ... Answer 1: You'll need to use a JSON SerDe in order for Hive to map your JSON to the columns in your table. A really good
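The answer is truncated, but the shape of the command it points at is a CREATE EXTERNAL TABLE whose row format is a JSON SerDe. A hedged sketch, issued here through a Hive-enabled SparkSession to stay consistent with the other examples (the column names and S3 location are hypothetical, and the SerDe jar must be available on the classpath):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-hive-table")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical schema; adjust the columns and LOCATION to match the data.
# org.apache.hive.hcatalog.data.JsonSerDe ships with Hive's hcatalog jars.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        id      STRING,
        payload STRUCT<kind:STRING, score:DOUBLE>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://myBucket/json-input/'
""")

spark.sql("SELECT id, payload.kind FROM events LIMIT 10").show()
```

The equivalent statement can be typed directly into the Hive console on the EMR master node; the SerDe choice is the important part.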

Too many open files in EMR

蹲街弑〆低调 submitted on 2019-12-18 06:57:03
Question: I am getting the following exception in my reducers: EMFILE: Too many open files at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369) at org.apache.hadoop.mapred.Child$4.run(Child.java:257) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth