emr

Detected Guava issue #1635 which indicates that a version of Guava less than 16.01 is in use

霸气de小男生 submitted on 2019-12-19 19:43:07
Question: I am running a Spark job on EMR and using the DataStax connector to connect to a Cassandra cluster. I am facing issues with the Guava jar; please find the details below. I am using the following Cassandra versions: cqlsh 5.0.1 | Cassandra 3.0.1 | CQL spec 3.3.1. I am running the Spark job on EMR 4.4 with the following Maven dependencies: org.apache.spark spark-streaming_2.10 1.5.0 <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.5.0</version> </dependency> <dependency> <groupId>org
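The excerpt is cut off before any answer, but one commonly suggested way to work around this family of Guava conflicts (an assumption on my part, not something stated above) is to have Spark prefer the classes shipped with the application over the older Guava bundled on the EMR classpath; shading Guava inside the application jar is the other usual route. A minimal PySpark sketch of the classpath option, whose two properties can equally be passed with --conf on spark-submit for a JVM job:

```python
from pyspark import SparkConf, SparkContext

# Sketch only: prefer the application's jars (including its newer Guava) over
# the cluster-provided ones. Whether this resolves the conflict depends on how
# the DataStax connector and its Guava dependency are packaged.
conf = (SparkConf()
        .setAppName("cassandra-connector-job")
        .set("spark.driver.userClassPathFirst", "true")
        .set("spark.executor.userClassPathFirst", "true"))

sc = SparkContext(conf=conf)
```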

EMR Spark - TransportClient: Failed to send RPC

£可爱£侵袭症+ submitted on 2019-12-19 13:07:28
Question: I'm getting this error. I tried increasing the memory on the cluster instances and in the executor and driver parameters, without success. 17/05/07 23:17:07 ERROR TransportClient: Failed to send RPC 6465703946954088562 to ip-172-30-12-164.eu-central-1.compute.internal/172.30.12.164:34706: java.nio.channels.ClosedChannelException Does anyone have any clue how to fix this error? By the way, I'm using YARN as the cluster manager. Thanks in advance. Answer 1: Finally I resolved the problem. It was due to insufficient disk

How to avoid reading old files from S3 when appending new data?

允我心安 submitted on 2019-12-19 12:06:15
Question: Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3: df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet") In the spark-submit output I can see that significant time is spent reading old Parquet files, for example: 16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet'
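The post is cut off before any answer, but a workaround often suggested for this pattern (my assumption, not from the excerpt) is to write each new batch directly into its target partition directory, so Spark never has to list or open the existing partitions under the dataset root:

```python
# Sketch: append only to the partition being produced in this run. The id and
# day values below are illustrative; in practice they come from the batch
# being processed. Dropping the partition columns keeps the file layout the
# same as the partitionBy(...) output.
new_batch = df.where("id = '123' AND day = '2016-11-27'").drop("id").drop("day")

(new_batch.write
    .mode("append")
    .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-27"))
```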

Pyspark - Load file: Path does not exist

99封情书 submitted on 2019-12-19 03:39:14
Question: I am a newbie to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ .appName("Protob Conversion to Parquet") \ .config("spark.some.config.option", "some-value") \ .getOrCreate() df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True) When I run the script, it raises the following error message: pyspark.sql.utils.AnalysisException: u'Path does not exist:
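The excerpt ends before any answer, but on EMR a path without a scheme is typically resolved against HDFS rather than the local filesystem, which would explain the "Path does not exist" error. A hedged sketch of the usual fix, assuming the file really does sit on the driver node:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Protob Conversion to Parquet")
         .getOrCreate())

# file:// points Spark at the node-local filesystem instead of the default
# HDFS. On a multi-node cluster the file must be readable from every executor,
# so copying it to HDFS or S3 first is the more robust option.
df = spark.read.csv("file:///home/hadoop/observations_temp.csv", header=True)
```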

Optimizing GC on EMR cluster

回眸只為那壹抹淺笑 submitted on 2019-12-18 19:33:09
Question: I am running a Spark job written in Scala on EMR, and the stdout of each executor is filled with GC allocation failures. 2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs] 2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K-
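The post is truncated here. For context (not taken from the excerpt): "Allocation Failure" entries from ParNew are ordinary young-generation collections rather than errors, so they are usually noise unless GC time starts to dominate the job. If tuning is needed, the collector and GC logging can be adjusted per executor; a hedged sketch of that knob, with illustrative rather than tuned values:

```python
from pyspark import SparkConf, SparkContext

# Sketch: switch executors to G1GC and make GC logging more explicit. The same
# two settings can be passed with --conf on spark-submit for a Scala job.
conf = (SparkConf()
        .setAppName("gc-tuning-example")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
        .set("spark.executor.memory", "4g"))

sc = SparkContext(conf=conf)
```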

Pyspark --py-files doesn't work

老子叫甜甜 submitted on 2019-12-18 14:14:44
Question: I am following what the documentation suggests (http://spark.apache.org/docs/1.1.1/submitting-applications.html), Spark version 1.1.0: ./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \ /home/hadoop/loganalysis/ship-test.py and the conf in code: conf = (SparkConf() .setMaster("yarn-client") .setAppName("LogAnalysis") .set("spark.executor.memory", "1g") .set("spark.executor.cores", "4") .set("spark.executor.num", "2") .set("spark.driver.memory", "4g") .set("spark.kryoserializer.buffer.mb
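The question is cut off here. One alternative often tried when --py-files does not make the packaged modules importable on the executors (an assumption, not from the truncated post) is to distribute the archive from the driver code itself:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("LogAnalysis"))
sc = SparkContext(conf=conf)

# Ship the zipped package to every executor so that modules inside
# parser-src.zip can be imported within map/filter functions.
sc.addPyFile("/home/hadoop/loganalysis/parser-src.zip")
```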

How do you make a HIVE table out of JSON data?

穿精又带淫゛_ submitted on 2019-12-18 10:05:49
Question: I want to create a Hive table out of some (nested) JSON data and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table. Does anyone have an example command to get me started? I can't find anything useful with Google ... Answer 1: You'll need to use a JSON SerDe in order for Hive to map your JSON to the columns in your table. A really good
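The answer is truncated, but the shape of the command it points at is a CREATE EXTERNAL TABLE whose row format is a JSON SerDe. A hedged sketch, issued here through a Hive-enabled SparkSession to stay consistent with the other examples (the column names and S3 location are hypothetical, and the SerDe jar must be available on the classpath):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-hive-table")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical schema; adjust the columns and LOCATION to match the data.
# org.apache.hive.hcatalog.data.JsonSerDe ships with Hive's hcatalog jars.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        id      STRING,
        payload STRUCT<kind:STRING, score:DOUBLE>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://myBucket/json-input/'
""")

spark.sql("SELECT id, payload.kind FROM events LIMIT 10").show()
```

The equivalent statement can be typed directly into the Hive console on the EMR master node; the SerDe choice is the important part.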

Too many open files in EMR

蹲街弑〆低调 submitted on 2019-12-18 06:57:03
Question: I am getting the following exception in my reducers: EMFILE: Too many open files at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369) at org.apache.hadoop.mapred.Child$4.run(Child.java:257) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth