pyspark

spark Athena connector

Submitted by ☆樱花仙子☆ on 2020-01-30 03:44:30
Question: I need to use Athena in Spark, but Spark uses prepareStatement when going through JDBC drivers, and that gives me the exception "com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented". Can you please let me know how I can connect to Athena from Spark?

Answer 1: I don't know how you'd connect to Athena from Spark, but you don't need to: you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark. There are two parts to…
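Along the lines of the answer, a minimal sketch of querying the Athena-registered data directly from Spark, assuming the cluster (e.g. EMR) is configured to use the AWS Glue Data Catalog as its Hive metastore; the database and table names here are hypothetical:

    from pyspark.sql import SparkSession

    # Assumes the tables Athena "registers" live in the Glue Data Catalog and the
    # cluster uses that catalog as its metastore, so Spark SQL can see them directly.
    # Database/table names are placeholders.
    spark = (SparkSession.builder
             .appName("read-athena-catalog-table")
             .enableHiveSupport()
             .getOrCreate())

    # Query the same table Athena queries, without the Athena JDBC driver.
    df = spark.sql("SELECT * FROM mydb.my_table WHERE year = 2020")
    df.show(10)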

Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array

Submitted by 大憨熊 on 2020-01-28 10:23:49
Question: I am working on updating a MySQL database using the PySpark framework, running on AWS Glue. I have a dataframe as follows:

    df2 = sqlContext.createDataFrame(
        [("xxx1", "81A01", "TERR NAME 55", "NY"),
         ("xxx2", "81A01", "TERR NAME 55", "NY"),
         ("x103", "81A01", "TERR NAME 01", "NJ")],
        ["zip_code", "territory_code", "territory_name", "state"])

    # Print out information about this data
    df2.show()
    +--------+--------------+--------------+-----+
    |zip_code|territory_code|territory_name|state|
    +--------+--------------+--------------+-----+…
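Spark's JDBC writer has no upsert mode, so one common workaround (a sketch only, not the asker's or an accepted answer's code) is to run the INSERT ... ON DUPLICATE KEY UPDATE through a plain MySQL client such as pymysql, assuming that library is available in the Glue job environment; the connection details and target table name below are placeholders, and df2 is the dataframe from the question:

    import pymysql  # assumed to be available to the Glue job

    # Placeholder connection details and table name.
    conn = pymysql.connect(host="my-mysql-host", user="user",
                           password="secret", db="mydb")
    upsert_sql = """
        INSERT INTO territory (zip_code, territory_code, territory_name, state)
        VALUES (%s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
            territory_name = VALUES(territory_name),
            state = VALUES(state)
    """
    rows = [tuple(r) for r in df2.collect()]  # acceptable for small dataframes only
    with conn.cursor() as cur:
        cur.executemany(upsert_sql, rows)
    conn.commit()
    conn.close()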

Failed to find leader for topics; java.lang.NullPointerException NullPointerException at org.apache.kafka.common.utils.Utils.formatAddress

Submitted by 坚强是说给别人听的谎言 on 2020-01-28 03:03:44
Question: When we try to stream data from an SSL-enabled Kafka topic, we get the error below. Can you please help us with this issue?

    19/11/07 13:26:54 INFO ConsumerFetcherManager: [ConsumerFetcherManager-1573151189884] Added fetcher for partitions ArrayBuffer()
    19/11/07 13:26:54 WARN ConsumerFetcherManager$LeaderFinderThread: [spark-streaming-consumer_dvtcbddc101.corp.cox.com-1573151189725-d40a510f-leader-finder-thread], Failed to find leader for Set([inst_monitor_status_test,2], [inst…
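The ConsumerFetcherManager in the log belongs to the old Kafka 0.8 integration, which does not support SSL. A sketch of what the same consumer could look like with Structured Streaming, where "kafka."-prefixed options are passed through to the newer Kafka client; broker addresses, topic name and keystore paths are placeholders, and this is an assumption about a fix rather than a quoted answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ssl-kafka-stream").getOrCreate()

    # SSL settings are forwarded to the Kafka 0.10+ client via "kafka." options.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9093,broker2:9093")
              .option("subscribe", "inst_monitor_status_test")
              .option("kafka.security.protocol", "SSL")
              .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
              .option("kafka.ssl.truststore.password", "changeit")
              .option("kafka.ssl.keystore.location", "/path/to/keystore.jks")
              .option("kafka.ssl.keystore.password", "changeit")
              .load())

    query = (stream.selectExpr("CAST(value AS STRING)")
             .writeStream.format("console").start())
    query.awaitTermination()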

Sum of array elements depending on value condition pyspark

Submitted by a 夏天 on 2020-01-28 02:31:14
Question: I have a PySpark dataframe:

    id | column
    ------------------------------
    1  | [0.2, 2, 3, 4, 3, 0.5]
    ------------------------------
    2  | [7, 0.3, 0.3, 8, 2,]
    ------------------------------

I would like to create 3 columns:
Column 1: contains the sum of the elements < 2
Column 2: contains the sum of the elements > 2
Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them)
If there are no such values, I put null. Expected result: id | column | column…
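One way to get the three conditional sums (an illustration, not a quoted answer) is to explode the array and aggregate with conditional sums; sum() ignores nulls and returns null when no element matches, which gives the null behaviour the question asks for:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    # Stand-in for the question's dataframe.
    df = spark.createDataFrame(
        [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]), (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
        ["id", "column"])

    result = (df.select("id", F.explode("column").alias("x"))
              .groupBy("id")
              .agg(F.sum(F.when(F.col("x") < 2, F.col("x"))).alias("sum_lt_2"),
                   F.sum(F.when(F.col("x") > 2, F.col("x"))).alias("sum_gt_2"),
                   F.sum(F.when(F.col("x") == 2, F.col("x"))).alias("sum_eq_2")))
    result.show()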

Splitting Date into Year, Month and Day, with inconsistent delimiters

Submitted by 别等时光非礼了梦想. on 2020-01-26 03:09:07
Question: I am trying to split my Date column, which is currently a string type, into 3 columns: Year, Month and Day. I use (PySpark):

    split_date = pyspark.sql.functions.split(df['Date'], '-')
    df = df.withColumn('Year', split_date.getItem(0))
    df = df.withColumn('Month', split_date.getItem(1))
    df = df.withColumn('Day', split_date.getItem(2))

I run into an issue because half my dates are separated by '-' and the other half by '/'. How can I use an OR operation to split the Date by either '-' or…
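Since split() takes a regular expression, a character class can match either delimiter in one pass; a minimal sketch (not a quoted answer) with a small stand-in dataframe:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    # Stand-in dataframe with both delimiters present.
    df = spark.createDataFrame([("2020-01-26",), ("2020/01/26",)], ["Date"])

    # The pattern [-/] matches either '-' or '/'.
    split_date = F.split(df["Date"], "[-/]")
    df = (df.withColumn("Year", split_date.getItem(0))
            .withColumn("Month", split_date.getItem(1))
            .withColumn("Day", split_date.getItem(2)))
    df.show()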

Separate multi line record with start and end delimiter

Submitted by 孤街醉人 on 2020-01-25 10:14:12
Question: I have a file like this (I am providing sample data; the real file is very large):

    QQ 1 2 3 ZZ b QQ 4 5 6 ZZ a QQ 9 8 23

I want to read the data between QQ and ZZ, so the dataframe should look like: [1,2,3] [4,5,6] [9,8]. The code I have tried is below, but it is failing for large data.

    from pyspark.sql.types import *
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    path = "/tmp/Poonam.Raskar/Sample.txt"
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    sc.setLogLevel…
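A sketch of one scalable approach (an assumption about what would work, not the accepted answer): let Hadoop split the input into records at each "QQ" marker via textinputformat.record.delimiter, then keep only what precedes "ZZ" in each record. The path follows the question; the exact record layout is assumed from the sample data:

    from pyspark import SparkContext

    sc = SparkContext()

    # Split the file into one record per "QQ" marker instead of per line.
    conf = {"textinputformat.record.delimiter": "QQ"}
    records = sc.newAPIHadoopFile(
        "/tmp/Poonam.Raskar/Sample.txt",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf).values()

    def extract(record):
        # Everything before "ZZ" (if present) is the record's payload.
        payload = record.split("ZZ")[0]
        return [tok for tok in payload.split() if tok]

    rows = records.map(extract).filter(lambda xs: xs)
    print(rows.collect())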

fs.s3 configuration with two S3 accounts with EMR

Submitted by 若如初见. on 2020-01-25 10:10:23
Question: I have a pipeline using Lambda and EMR, where I read CSV from an S3 bucket in account A and write Parquet to another S3 bucket in account B. I created the EMR cluster in account B, and it has access to S3 in account B. I cannot add access to account A's S3 bucket in EMR_EC2_DefaultRole (as that account is the enterprise-wide data storage), so I use an access key and secret key to access account A's bucket. This is done through a Cognito token.

METHOD 1: I am using the fs.s3 protocol to read CSV from S3 in account A and write to S3 in account B.…
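One way to keep the two credential sets separate (a sketch, not the asker's METHOD 1) is per-bucket configuration on the s3a connector, so the account-A keys apply only to the source bucket while everything else keeps using the instance role from account B; the bucket names and keys below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cross-account-s3").getOrCreate()
    hconf = spark.sparkContext._jsc.hadoopConfiguration()

    # Scope account A's credentials to the source bucket only (placeholder values).
    hconf.set("fs.s3a.bucket.account-a-source.access.key", "AKIA...")
    hconf.set("fs.s3a.bucket.account-a-source.secret.key", "...")

    df = spark.read.option("header", "true").csv("s3a://account-a-source/input/")
    df.write.mode("overwrite").parquet("s3a://account-b-target/output/")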

Spark: write a CSV with null values as empty columns

Submitted by 被刻印的时光 ゝ on 2020-01-25 09:48:11
Question: I'm using PySpark to write a dataframe to a CSV file like this:

    df.write.csv(PATH, nullValue='')

There is a column in that dataframe of type string. Some of the values are null. These null values display like this: ...,"",... I would like them to be displayed like this instead: ...,,... Is this possible with an option in csv.write()? Thanks!

Answer 1: Easily, with the emptyValue option set. emptyValue: sets the string representation of an empty value. If None is set, it uses the default value, ""…
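Following the answer's emptyValue hint, a sketch of how that might look (my reading of the hint, not a quoted code sample); the output path and stand-in dataframe are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Stand-in dataframe with a null and an empty string in a string column.
    df = spark.createDataFrame([("a", "x"), ("b", None), ("c", "")], ["id", "value"])

    # emptyValue='' keeps empty/null string cells from being written as "",
    # so the output reads ...,,... instead of ...,"",...
    df.write.csv("/tmp/out_csv", nullValue='', emptyValue='')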

Spark Stream - 'utf8' codec can't decode bytes

Submitted by 这一生的挚爱 on 2020-01-25 09:07:05
Question: I'm fairly new to stream programming. We have a Kafka stream which uses Avro, and I want to connect it to a Spark stream. I used the code below:

    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])

I got the error below:

    return s.decode('utf-8')
    File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-58:…
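createDirectStream decodes keys and values as UTF-8 by default, which breaks on Avro's binary payload. A sketch (an assumption about the fix, not a quoted answer) that passes identity decoders so the raw bytes reach the application, where an Avro deserializer (not shown) can handle them; broker and topic names are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="avro-kafka-stream")
    ssc = StreamingContext(sc, 10)

    brokers = "broker1:9092"
    topic = "my_avro_topic"
    # Identity decoders keep the raw Avro bytes instead of decoding them as UTF-8.
    kvs = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers},
        keyDecoder=lambda b: b,
        valueDecoder=lambda b: b)

    raw_values = kvs.map(lambda kv: kv[1])  # raw Avro bytes for downstream deserialization
    raw_values.count().pprint()

    ssc.start()
    ssc.awaitTermination()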