spark-structured-streaming

dataframe look up and optimization

天大地大妈咪最大 submitted on 2020-07-25 03:48:11

Question: I am using spark-sql-2.4.3v with Java. I have the scenario below:

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate", "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
df.show

// this lookup data is populated from the DB
val ll = List(
  ("aaaa", 11),
  ("aaa", 12),
  ("aa", 13),
  ("a", 14)
)
val codeValudeDf = ll.toDF( "code"
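
The preview is cut off at the lookup DataFrame, so the sketch below only illustrates the usual optimization for this pattern: broadcasting the small, DB-populated lookup table and joining it against each value column. It assumes the lookup DataFrame's second column is named value, which is not visible in the truncated snippet.

import org.apache.spark.sql.functions.{broadcast, col}

// Rename the lookup's "code" so it does not clash with df's own "code" column.
val lookup = codeValudeDf.withColumnRenamed("code", "lookup_code")

// broadcast() ships the small table to every executor, so both joins stay
// map-side and the larger DataFrame is never shuffled.
val withCode1 = df
  .join(broadcast(lookup), col("value1") === col("value"), "left")
  .withColumnRenamed("lookup_code", "value1_code")
  .drop("value")

val result = withCode1
  .join(broadcast(lookup), col("value2") === col("value"), "left")
  .withColumnRenamed("lookup_code", "value2_code")
  .drop("value")

result.show(false)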

How to read stream of structured data and write to Hive table

六眼飞鱼酱① submitted on 2020-07-07 11:25:27

Question: There is a need to read a stream of structured data from Kafka and write it to an already existing Hive table. Upon analysis, it appears that one option is to do a readStream from the Kafka source and then a writeStream to a File sink in an HDFS file path. My question here is: is it possible to write directly to a Hive table? Or is there a workaround approach that can be followed for this use case? EDIT 1: .foreachBatch seems to be working, but it has the issue mentioned below
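
Since the edit says .foreachBatch "seems to be working", here is a minimal sketch of that route: each micro-batch is handed to the ordinary batch writer, which can insert into a pre-existing Hive table. The broker, topic, table name, and checkpoint location below are placeholders, not taken from the original post.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("kafka-to-hive")
  .enableHiveSupport()   // required so insertInto targets the Hive metastore
  .getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// insertInto appends each micro-batch to the existing Hive table;
// the batch schema must match the table schema.
val writeToHive: (DataFrame, Long) => Unit = (batch, batchId) => {
  batch.write.mode("append").insertInto("mydb.my_hive_table")
}

val query = stream.writeStream
  .foreachBatch(writeToHive)
  .option("checkpointLocation", "/tmp/checkpoints/kafka-to-hive")
  .start()

query.awaitTermination()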

Spark Streaming: Kafka group id not permitted in Spark Structured Streaming

天涯浪子 submitted on 2020-07-03 08:09:06

Question: I am writing a Spark Structured Streaming application in PySpark to read data from Kafka. However, the current version of Spark is 2.1.0, which does not allow me to set the group id as a parameter and will generate a unique id for each query. But the Kafka connection uses group-based authorization, which requires a pre-set group id. Hence, is there any workaround to establish the connection without updating Spark to 2.2, since my team does not want that? My code: if __name__ == "__main__":
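
For context, a sketch of the Kafka source options involved, in Scala for brevity (the option names are the same from PySpark). An explicit consumer group is only honoured by Structured Streaming from Spark 3.0 onward via the kafka.group.id option; on 2.1 the source always generates its own group id (prefixed spark-kafka-source-), so the usual 2.x workaround is a broker-side prefix ACL rather than a code change. Broker and topic names are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-group-id").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")    // placeholder broker
  .option("subscribe", "my_topic")                     // placeholder topic
  .option("kafka.group.id", "my-preauthorized-group")  // honoured in Spark 3.0+, not 2.1
  .load()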

How to pass a configuration file that is hosted in HDFS to a Spark Application?

心已入冬 submitted on 2020-06-29 08:03:05

Question: I'm working with Spark Structured Streaming and with Scala. I want to pass a config file to my Spark application. This configuration file is hosted in HDFS. For example, spark_job.conf (HOCON):

spark {
  appName: "",
  master: "",
  shuffle.size: 4
  etc..
}
kafkaSource {
  servers: "",
  topic: "",
  etc..
}
redisSink {
  host: "",
  port: 999,
  timeout: 2000,
  checkpointLocation: "hdfs location",
  etc..
}

How can I pass it to the Spark application? How can I read this file (hosted in HDFS) in Spark?

Answer 1:
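
The answer is cut off above, so the following is only a sketch of one common approach: open the file through Hadoop's FileSystem API and parse it with Typesafe Config, the library behind HOCON. The HDFS path is a placeholder. Another frequently used route is spark-submit --files with the HDFS path, which ships the file to each executor's working directory.

import java.io.InputStreamReader
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("conf-from-hdfs").getOrCreate()

// Open the HOCON file via Hadoop's FileSystem API and parse the stream.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val in = fs.open(new Path("hdfs:///configs/spark_job.conf"))  // placeholder path
val conf: Config = try {
  ConfigFactory.parseReader(new InputStreamReader(in)).resolve()
} finally {
  in.close()
}

// Read individual settings by their HOCON paths.
val kafkaServers = conf.getString("kafkaSource.servers")
val redisPort    = conf.getInt("redisSink.port")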

How to compute difference between timestamps with PySpark Structured Streaming

耗尽温柔 submitted on 2020-06-28 04:44:46

Question: I have the following problem with PySpark Structured Streaming. Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps. For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10", then I want to add a column in the second line called "Interval" saying "10 seconds". Is there anyone who knows how to achieve this? I tried to use the
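
The question is cut off mid-sentence, so the sketch below (in Scala, since PySpark did not expose an equivalent stateful API at the time) shows one way this is usually handled in Structured Streaming, where ordinary lag/window functions are not supported: keep the previous timestamp per user in flatMapGroupsWithState and emit the gap for each new event. The Event case class and the source Dataset are placeholders.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(userId: String, ts: Timestamp)
case class EventWithInterval(userId: String, ts: Timestamp, intervalSeconds: Option[Long])

// For each user, remember the last timestamp seen in state and emit the
// difference (in seconds) between each event and its predecessor.
def withIntervals(userId: String,
                  rows: Iterator[Event],
                  state: GroupState[Timestamp]): Iterator[EventWithInterval] = {
  var last: Option[Timestamp] = state.getOption
  val out = rows.toSeq.sortBy(_.ts.getTime).map { e =>
    val interval = last.map(prev => (e.ts.getTime - prev.getTime) / 1000)
    last = Some(e.ts)
    EventWithInterval(e.userId, e.ts, interval)
  }
  last.foreach(state.update)
  out.iterator
}

val spark = SparkSession.builder().appName("user-intervals").getOrCreate()
import spark.implicits._

// Placeholder: a streaming Dataset[Event] parsed from whatever source you use.
val events: Dataset[Event] = ???

val withGaps = events
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(withIntervals)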

How to pass Basic Authentication to Confluent Schema Registry?

巧了我就是萌 submitted on 2020-06-26 07:03:50

Question: I want to read data from a Confluent Cloud topic and then write to another topic. On localhost I haven't had any major problems, but the Schema Registry of Confluent Cloud requires passing some authentication data and I don't know how to supply it:

basic.auth.credentials.source=USER_INFO
schema.registry.basic.auth.user.info=:
schema.registry.url=https://xxxxxxxxxx.confluent.cloud

Below is the current code:

import com.databricks.spark.avro.SchemaConverters
import io.confluent
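
The code in the preview is truncated after the imports, so the following is only a sketch of the usual fix: pass the basic-auth properties to the schema-registry client itself (here via CachedSchemaRegistryClient's config map) rather than only to the Kafka consumer. The URL, API key, and secret are placeholders; depending on the client version the user-info key is basic.auth.user.info or schema.registry.basic.auth.user.info.

import scala.collection.JavaConverters._
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

// Placeholders: URL, API key and secret come from your Confluent Cloud cluster.
val schemaRegistryUrl = "https://xxxxxxxxxx.confluent.cloud"
val srApiKey    = "SR_API_KEY"
val srApiSecret = "SR_API_SECRET"

// Hand the basic-auth settings to the schema-registry client's config map.
val srConfig = Map(
  "basic.auth.credentials.source"        -> "USER_INFO",
  "schema.registry.basic.auth.user.info" -> s"$srApiKey:$srApiSecret"
).asJava

val schemaRegistryClient =
  new CachedSchemaRegistryClient(schemaRegistryUrl, 128, srConfig)

// e.g. fetch the latest value schema registered for a topic:
val latest = schemaRegistryClient.getLatestSchemaMetadata("my_topic-value")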

Spark - Reading JSON from Partitioned Folders using Firehose

二次信任 submitted on 2020-06-22 11:50:52

Question: Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)... great. How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' for the DataFrame reader? My next goal is for this to be a streaming DF, where new files persisted by Firehose into S3 naturally become part of the streaming DataFrame
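
A minimal sketch of one way to cover the YYYY/MM/DD/HH layout with glob patterns, for both the static read and the follow-up streaming case. The bucket and prefix are placeholders, and the streaming reader needs a schema supplied up front.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("firehose-json").getOrCreate()

// Wildcards cover the YYYY/MM/DD/HH directory levels plus the leaf files,
// so every JSON object under the prefix lands in one static DataFrame.
val staticDf = spark.read
  .json("s3a://my-bucket/firehose-prefix/*/*/*/*/*")

// The same glob works for the streaming case: new files written by Firehose
// under the prefix are picked up as they appear.
val streamingDf = spark.readStream
  .schema(staticDf.schema)
  .json("s3a://my-bucket/firehose-prefix/*/*/*/*/*")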