pyspark

spark Athena connector

Submitted by ☆樱花仙子☆ on 2020-01-30 03:44:30
Question: I need to use Athena in Spark, but Spark uses prepareStatement when going through JDBC drivers, and that gives me the exception "com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented". Can you please let me know how I can connect to Athena from Spark?

Answer 1: I don't know how you'd connect to Athena from Spark, but you don't need to: you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark. There are two parts to…
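Along the lines of the answer, a minimal sketch of querying the Athena-registered data directly from Spark, assuming the cluster (e.g. EMR) is configured to use the AWS Glue Data Catalog as its Hive metastore; the database and table names here are hypothetical:

    from pyspark.sql import SparkSession

    # Assumes the tables Athena "registers" live in the Glue Data Catalog and the
    # cluster uses that catalog as its metastore, so Spark SQL can see them directly.
    # Database/table names are placeholders.
    spark = (SparkSession.builder
             .appName("read-athena-catalog-table")
             .enableHiveSupport()
             .getOrCreate())

    # Query the same table Athena queries, without the Athena JDBC driver.
    df = spark.sql("SELECT * FROM mydb.my_table WHERE year = 2020")
    df.show(10)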

Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array

Submitted by 大憨熊 on 2020-01-28 10:23:49
Question: I am working on updating a MySQL database using the PySpark framework, running on AWS Glue. I have a dataframe as follows:

    df2 = sqlContext.createDataFrame(
        [("xxx1", "81A01", "TERR NAME 55", "NY"),
         ("xxx2", "81A01", "TERR NAME 55", "NY"),
         ("x103", "81A01", "TERR NAME 01", "NJ")],
        ["zip_code", "territory_code", "territory_name", "state"])

    # Print out information about this data
    df2.show()
    +--------+--------------+--------------+-----+
    |zip_code|territory_code|territory_name|state|
    +--------+--------------+--------------+-----+…
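Spark's JDBC writer has no upsert mode, so one common workaround (a sketch only, not the asker's or an accepted answer's code) is to run the INSERT ... ON DUPLICATE KEY UPDATE through a plain MySQL client such as pymysql, assuming that library is available in the Glue job environment; the connection details and target table name below are placeholders, and df2 is the dataframe from the question:

    import pymysql  # assumed to be available to the Glue job

    # Placeholder connection details and table name.
    conn = pymysql.connect(host="my-mysql-host", user="user",
                           password="secret", db="mydb")
    upsert_sql = """
        INSERT INTO territory (zip_code, territory_code, territory_name, state)
        VALUES (%s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
            territory_name = VALUES(territory_name),
            state = VALUES(state)
    """
    rows = [tuple(r) for r in df2.collect()]  # acceptable for small dataframes only
    with conn.cursor() as cur:
        cur.executemany(upsert_sql, rows)
    conn.commit()
    conn.close()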

Failed to find leader for topics; java.lang.NullPointerException NullPointerException at org.apache.kafka.common.utils.Utils.formatAddress

Submitted by 坚强是说给别人听的谎言 on 2020-01-28 03:03:44
Question: When we try to stream data from an SSL-enabled Kafka topic, we get the error below. Can you please help us with this issue?

    19/11/07 13:26:54 INFO ConsumerFetcherManager: [ConsumerFetcherManager-1573151189884] Added fetcher for partitions ArrayBuffer()
    19/11/07 13:26:54 WARN ConsumerFetcherManager$LeaderFinderThread: [spark-streaming-consumer_dvtcbddc101.corp.cox.com-1573151189725-d40a510f-leader-finder-thread], Failed to find leader for Set([inst_monitor_status_test,2], [inst…
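The ConsumerFetcherManager in the log belongs to the old Kafka 0.8 integration, which does not support SSL. A sketch of what the same consumer could look like with Structured Streaming, where "kafka."-prefixed options are passed through to the newer Kafka client; broker addresses, topic name and keystore paths are placeholders, and this is an assumption about a fix rather than a quoted answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ssl-kafka-stream").getOrCreate()

    # SSL settings are forwarded to the Kafka 0.10+ client via "kafka." options.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9093,broker2:9093")
              .option("subscribe", "inst_monitor_status_test")
              .option("kafka.security.protocol", "SSL")
              .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
              .option("kafka.ssl.truststore.password", "changeit")
              .option("kafka.ssl.keystore.location", "/path/to/keystore.jks")
              .option("kafka.ssl.keystore.password", "changeit")
              .load())

    query = (stream.selectExpr("CAST(value AS STRING)")
             .writeStream.format("console").start())
    query.awaitTermination()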

Sum of array elements depending on value condition pyspark

Submitted by a 夏天 on 2020-01-28 02:31:14
Question: I have a PySpark dataframe:

    id | column
    ------------------------------
    1  | [0.2, 2, 3, 4, 3, 0.5]
    ------------------------------
    2  | [7, 0.3, 0.3, 8, 2,]
    ------------------------------

I would like to create 3 columns:
Column 1: contains the sum of the elements < 2
Column 2: contains the sum of the elements > 2
Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them)
If there are no such values, I put null. Expected result: id | column | column…
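One way to get the three conditional sums (an illustration, not a quoted answer) is to explode the array and aggregate with conditional sums; sum() ignores nulls and returns null when no element matches, which gives the null behaviour the question asks for:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    # Stand-in for the question's dataframe.
    df = spark.createDataFrame(
        [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]), (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
        ["id", "column"])

    result = (df.select("id", F.explode("column").alias("x"))
              .groupBy("id")
              .agg(F.sum(F.when(F.col("x") < 2, F.col("x"))).alias("sum_lt_2"),
                   F.sum(F.when(F.col("x") > 2, F.col("x"))).alias("sum_gt_2"),
                   F.sum(F.when(F.col("x") == 2, F.col("x"))).alias("sum_eq_2")))
    result.show()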

Splitting Date into Year, Month and Day, with inconsistent delimiters

Submitted by 别等时光非礼了梦想. on 2020-01-26 03:09:07
Question: I am trying to split my Date column, which is currently a string type, into 3 columns: Year, Month and Day. I use (PySpark):

    split_date = pyspark.sql.functions.split(df['Date'], '-')
    df = df.withColumn('Year', split_date.getItem(0))
    df = df.withColumn('Month', split_date.getItem(1))
    df = df.withColumn('Day', split_date.getItem(2))

I run into an issue because half my dates are separated by '-' and the other half by '/'. How can I use an OR operation to split the Date by either '-' or…
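Since split() takes a regular expression, a character class can match either delimiter in one pass; a minimal sketch (not a quoted answer) with a small stand-in dataframe:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    # Stand-in dataframe with both delimiters present.
    df = spark.createDataFrame([("2020-01-26",), ("2020/01/26",)], ["Date"])

    # The pattern [-/] matches either '-' or '/'.
    split_date = F.split(df["Date"], "[-/]")
    df = (df.withColumn("Year", split_date.getItem(0))
            .withColumn("Month", split_date.getItem(1))
            .withColumn("Day", split_date.getItem(2)))
    df.show()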

Separate multi line record with start and end delimiter

Submitted by 孤街醉人 on 2020-01-25 10:14:12
Question: I have a file like this (I am providing sample data; the real file is very large):

    QQ 1 2 3 ZZ b QQ 4 5 6 ZZ a QQ 9 8 23

I want to read the data between QQ and ZZ, so the dataframe should look like: [1,2,3] [4,5,6] [9,8]. The code I have tried is below, but it is failing for large data.

    from pyspark.sql.types import *
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    path = "/tmp/Poonam.Raskar/Sample.txt"
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    sc.setLogLevel…
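A sketch of one scalable approach (an assumption about what would work, not the accepted answer): let Hadoop split the input into records at each "QQ" marker via textinputformat.record.delimiter, then keep only what precedes "ZZ" in each record. The path follows the question; the exact record layout is assumed from the sample data:

    from pyspark import SparkContext

    sc = SparkContext()

    # Split the file into one record per "QQ" marker instead of per line.
    conf = {"textinputformat.record.delimiter": "QQ"}
    records = sc.newAPIHadoopFile(
        "/tmp/Poonam.Raskar/Sample.txt",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf).values()

    def extract(record):
        # Everything before "ZZ" (if present) is the record's payload.
        payload = record.split("ZZ")[0]
        return [tok for tok in payload.split() if tok]

    rows = records.map(extract).filter(lambda xs: xs)
    print(rows.collect())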

fs.s3 configuration with two S3 accounts with EMR

Submitted by 若如初见. on 2020-01-25 10:10:23
Question: I have a pipeline using Lambda and EMR, where I read CSV from an S3 bucket in account A and write Parquet to another S3 bucket in account B. I created the EMR cluster in account B, and it has access to S3 in account B. I cannot add access to account A's S3 bucket in EMR_EC2_DefaultRole (as that account is the enterprise-wide data storage), so I use an access key and secret key to access account A's bucket. This is done through a Cognito token.

METHOD 1: I am using the fs.s3 protocol to read CSV from S3 in account A and write to S3 in account B.…
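One way to keep the two credential sets separate (a sketch, not the asker's METHOD 1) is per-bucket configuration on the s3a connector, so the account-A keys apply only to the source bucket while everything else keeps using the instance role from account B; the bucket names and keys below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cross-account-s3").getOrCreate()
    hconf = spark.sparkContext._jsc.hadoopConfiguration()

    # Scope account A's credentials to the source bucket only (placeholder values).
    hconf.set("fs.s3a.bucket.account-a-source.access.key", "AKIA...")
    hconf.set("fs.s3a.bucket.account-a-source.secret.key", "...")

    df = spark.read.option("header", "true").csv("s3a://account-a-source/input/")
    df.write.mode("overwrite").parquet("s3a://account-b-target/output/")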

Spark: write a CSV with null values as empty columns

Submitted by 被刻印的时光 ゝ on 2020-01-25 09:48:11
Question: I'm using PySpark to write a dataframe to a CSV file like this:

    df.write.csv(PATH, nullValue='')

There is a column in that dataframe of type string. Some of the values are null. These null values display like this: ...,"",... I would like them to be displayed like this instead: ...,,... Is this possible with an option in csv.write()? Thanks!

Answer 1: Easily, with the emptyValue option set. emptyValue: sets the string representation of an empty value. If None is set, it uses the default value, ""…
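Following the answer's emptyValue hint, a sketch of how that might look (my reading of the hint, not a quoted code sample); the output path and stand-in dataframe are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Stand-in dataframe with a null and an empty string in a string column.
    df = spark.createDataFrame([("a", "x"), ("b", None), ("c", "")], ["id", "value"])

    # emptyValue='' keeps empty/null string cells from being written as "",
    # so the output reads ...,,... instead of ...,"",...
    df.write.csv("/tmp/out_csv", nullValue='', emptyValue='')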

Spark Stream - 'utf8' codec can't decode bytes

Submitted by 这一生的挚爱 on 2020-01-25 09:07:05
Question: I'm fairly new to stream programming. We have a Kafka stream which uses Avro, and I want to connect it to a Spark stream. I used the code below:

    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])

I got the error below:

    return s.decode('utf-8')
    File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-58:…
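createDirectStream decodes keys and values as UTF-8 by default, which breaks on Avro's binary payload. A sketch (an assumption about the fix, not a quoted answer) that passes identity decoders so the raw bytes reach the application, where an Avro deserializer (not shown) can handle them; broker and topic names are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="avro-kafka-stream")
    ssc = StreamingContext(sc, 10)

    brokers = "broker1:9092"
    topic = "my_avro_topic"
    # Identity decoders keep the raw Avro bytes instead of decoding them as UTF-8.
    kvs = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers},
        keyDecoder=lambda b: b,
        valueDecoder=lambda b: b)

    raw_values = kvs.map(lambda kv: kv[1])  # raw Avro bytes for downstream deserialization
    raw_values.count().pprint()

    ssc.start()
    ssc.awaitTermination()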