spark-structured-streaming

dataframe look up and optimization

天大地大妈咪最大 submitted on 2020-07-25 03:48:11

Question: I am using spark-sql-2.4.3v with Java. I have the scenario below:

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate", "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
df.show

// this lookup data is populated from the DB
val ll = List(
  ("aaaa", 11),
  ("aaa", 12),
  ("aa", 13),
  ("a", 14)
)
val codeValudeDf = ll.toDF( "code"
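
The preview is cut off at the lookup DataFrame, so the sketch below only illustrates the usual optimization for this pattern: broadcasting the small, DB-populated lookup table and joining it against each value column. It assumes the lookup DataFrame's second column is named value, which is not visible in the truncated snippet.

import org.apache.spark.sql.functions.{broadcast, col}

// Rename the lookup's "code" so it does not clash with df's own "code" column.
val lookup = codeValudeDf.withColumnRenamed("code", "lookup_code")

// broadcast() ships the small table to every executor, so both joins stay
// map-side and the larger DataFrame is never shuffled.
val withCode1 = df
  .join(broadcast(lookup), col("value1") === col("value"), "left")
  .withColumnRenamed("lookup_code", "value1_code")
  .drop("value")

val result = withCode1
  .join(broadcast(lookup), col("value2") === col("value"), "left")
  .withColumnRenamed("lookup_code", "value2_code")
  .drop("value")

result.show(false)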

How to read stream of structured data and write to Hive table

六眼飞鱼酱① submitted on 2020-07-07 11:25:27

Question: There is a need to read a stream of structured data from Kafka and write it to an already existing Hive table. Upon analysis, it appears that one option is to do a readStream from the Kafka source and then a writeStream to a File sink in an HDFS file path. My question here is: is it possible to write directly to a Hive table? Or is there a workaround approach that can be followed for this use case? EDIT 1: .foreachBatch seems to be working, but it has the issue mentioned below
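
Since the edit says .foreachBatch "seems to be working", here is a minimal sketch of that route: each micro-batch is handed to the ordinary batch writer, which can insert into a pre-existing Hive table. The broker, topic, table name, and checkpoint location below are placeholders, not taken from the original post.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("kafka-to-hive")
  .enableHiveSupport()   // required so insertInto targets the Hive metastore
  .getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// insertInto appends each micro-batch to the existing Hive table;
// the batch schema must match the table schema.
val writeToHive: (DataFrame, Long) => Unit = (batch, batchId) => {
  batch.write.mode("append").insertInto("mydb.my_hive_table")
}

val query = stream.writeStream
  .foreachBatch(writeToHive)
  .option("checkpointLocation", "/tmp/checkpoints/kafka-to-hive")
  .start()

query.awaitTermination()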

Spark Streaming: Kafka group id not permitted in Spark Structured Streaming

天涯浪子 submitted on 2020-07-03 08:09:06

Question: I am writing a Spark Structured Streaming application in PySpark to read data from Kafka. However, the current version of Spark is 2.1.0, which does not allow me to set the group id as a parameter and will generate a unique id for each query. But the Kafka connection uses group-based authorization, which requires a pre-set group id. Hence, is there any workaround to establish the connection without updating Spark to 2.2, since my team does not want that? My code: if __name__ == "__main__":
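
For context, a sketch of the Kafka source options involved, in Scala for brevity (the option names are the same from PySpark). An explicit consumer group is only honoured by Structured Streaming from Spark 3.0 onward via the kafka.group.id option; on 2.1 the source always generates its own group id (prefixed spark-kafka-source-), so the usual 2.x workaround is a broker-side prefix ACL rather than a code change. Broker and topic names are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-group-id").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")    // placeholder broker
  .option("subscribe", "my_topic")                     // placeholder topic
  .option("kafka.group.id", "my-preauthorized-group")  // honoured in Spark 3.0+, not 2.1
  .load()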

How to pass a configuration file that is hosted in HDFS to a Spark Application?

心已入冬 submitted on 2020-06-29 08:03:05

Question: I'm working with Spark Structured Streaming and with Scala. I want to pass a config file to my Spark application. This configuration file is hosted in HDFS. For example, spark_job.conf (HOCON):

spark {
  appName: "",
  master: "",
  shuffle.size: 4
  etc..
}
kafkaSource {
  servers: "",
  topic: "",
  etc..
}
redisSink {
  host: "",
  port: 999,
  timeout: 2000,
  checkpointLocation: "hdfs location",
  etc..
}

How can I pass it to the Spark application? How can I read this file (hosted in HDFS) in Spark?

Answer 1:
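
The answer is cut off above, so the following is only a sketch of one common approach: open the file through Hadoop's FileSystem API and parse it with Typesafe Config, the library behind HOCON. The HDFS path is a placeholder. Another frequently used route is spark-submit --files with the HDFS path, which ships the file to each executor's working directory.

import java.io.InputStreamReader
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("conf-from-hdfs").getOrCreate()

// Open the HOCON file via Hadoop's FileSystem API and parse the stream.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val in = fs.open(new Path("hdfs:///configs/spark_job.conf"))  // placeholder path
val conf: Config = try {
  ConfigFactory.parseReader(new InputStreamReader(in)).resolve()
} finally {
  in.close()
}

// Read individual settings by their HOCON paths.
val kafkaServers = conf.getString("kafkaSource.servers")
val redisPort    = conf.getInt("redisSink.port")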

How to compute difference between timestamps with PySpark Structured Streaming

耗尽温柔 submitted on 2020-06-28 04:44:46

Question: I have the following problem with PySpark Structured Streaming. Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps. For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10", then I want to add a column in the second line called "Interval" saying "10 seconds". Is there anyone who knows how to achieve this? I tried to use the
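
The question is cut off mid-sentence, so the sketch below (in Scala, since PySpark did not expose an equivalent stateful API at the time) shows one way this is usually handled in Structured Streaming, where ordinary lag/window functions are not supported: keep the previous timestamp per user in flatMapGroupsWithState and emit the gap for each new event. The Event case class and the source Dataset are placeholders.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(userId: String, ts: Timestamp)
case class EventWithInterval(userId: String, ts: Timestamp, intervalSeconds: Option[Long])

// For each user, remember the last timestamp seen in state and emit the
// difference (in seconds) between each event and its predecessor.
def withIntervals(userId: String,
                  rows: Iterator[Event],
                  state: GroupState[Timestamp]): Iterator[EventWithInterval] = {
  var last: Option[Timestamp] = state.getOption
  val out = rows.toSeq.sortBy(_.ts.getTime).map { e =>
    val interval = last.map(prev => (e.ts.getTime - prev.getTime) / 1000)
    last = Some(e.ts)
    EventWithInterval(e.userId, e.ts, interval)
  }
  last.foreach(state.update)
  out.iterator
}

val spark = SparkSession.builder().appName("user-intervals").getOrCreate()
import spark.implicits._

// Placeholder: a streaming Dataset[Event] parsed from whatever source you use.
val events: Dataset[Event] = ???

val withGaps = events
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(withIntervals)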

How to pass Basic Authentication to Confluent Schema Registry?

巧了我就是萌 submitted on 2020-06-26 07:03:50

Question: I want to read data from a Confluent Cloud topic and then write to another topic. On localhost I haven't had any major problems, but the Schema Registry of Confluent Cloud requires passing some authentication data and I don't know how to supply it:

basic.auth.credentials.source=USER_INFO
schema.registry.basic.auth.user.info=:
schema.registry.url=https://xxxxxxxxxx.confluent.cloud

Below is the current code:

import com.databricks.spark.avro.SchemaConverters
import io.confluent
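
The code in the preview is truncated after the imports, so the following is only a sketch of the usual fix: pass the basic-auth properties to the schema-registry client itself (here via CachedSchemaRegistryClient's config map) rather than only to the Kafka consumer. The URL, API key, and secret are placeholders; depending on the client version the user-info key is basic.auth.user.info or schema.registry.basic.auth.user.info.

import scala.collection.JavaConverters._
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

// Placeholders: URL, API key and secret come from your Confluent Cloud cluster.
val schemaRegistryUrl = "https://xxxxxxxxxx.confluent.cloud"
val srApiKey    = "SR_API_KEY"
val srApiSecret = "SR_API_SECRET"

// Hand the basic-auth settings to the schema-registry client's config map.
val srConfig = Map(
  "basic.auth.credentials.source"        -> "USER_INFO",
  "schema.registry.basic.auth.user.info" -> s"$srApiKey:$srApiSecret"
).asJava

val schemaRegistryClient =
  new CachedSchemaRegistryClient(schemaRegistryUrl, 128, srConfig)

// e.g. fetch the latest value schema registered for a topic:
val latest = schemaRegistryClient.getLatestSchemaMetadata("my_topic-value")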

Spark - Reading JSON from Partitioned Folders using Firehose

二次信任 submitted on 2020-06-22 11:50:52

Question: Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)... great. How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' for the DataFrame reader? My next goal is for this to be a streaming DF, where new files persisted by Firehose into S3 naturally become part of the streaming DataFrame
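
A minimal sketch of one way to cover the YYYY/MM/DD/HH layout with glob patterns, for both the static read and the follow-up streaming case. The bucket and prefix are placeholders, and the streaming reader needs a schema supplied up front.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("firehose-json").getOrCreate()

// Wildcards cover the YYYY/MM/DD/HH directory levels plus the leaf files,
// so every JSON object under the prefix lands in one static DataFrame.
val staticDf = spark.read
  .json("s3a://my-bucket/firehose-prefix/*/*/*/*/*")

// The same glob works for the streaming case: new files written by Firehose
// under the prefix are picked up as they appear.
val streamingDf = spark.readStream
  .schema(staticDf.schema)
  .json("s3a://my-bucket/firehose-prefix/*/*/*/*/*")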