spark-streaming

Spark Streaming - processing binary data file

巧了我就是萌 submitted on 2021-02-07 14:39:33
Question: I'm using pyspark 1.6.0. I have existing PySpark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code then parses the bits in the data to convert them into ints, strings, booleans, etc. Each binary file holds one record of data. In PySpark I read a binary file using: sc.binaryFiles("s3n://.......") This works well, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (hopefully the …
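
There is no exact streaming counterpart of sc.binaryFiles in PySpark 1.6, but StreamingContext.binaryRecordsStream comes close when every file consists of fixed-length records. A minimal sketch, assuming a placeholder bucket path and a hypothetical RECORD_LENGTH that must match the real record size; note it yields raw record bytes rather than (filename, data) pairs:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="binary-stream-sketch")
ssc = StreamingContext(sc, batchDuration=30)

# Placeholder: must equal the fixed size in bytes of one record.
RECORD_LENGTH = 1024

# binaryRecordsStream watches a directory for new flat binary files and emits
# each fixed-length record as raw bytes.
records = ssc.binaryRecordsStream("s3n://my-bucket/incoming/", RECORD_LENGTH)

def parse_record(raw):
    # Hypothetical parser: downstream code would unpack ints/strings/booleans here.
    return len(raw)

records.map(parse_record).pprint()

ssc.start()
ssc.awaitTermination()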

Enabling SSL between Apache spark and Kafka broker

你说的曾经没有我的故事 submitted on 2021-02-07 10:43:15
Question: I am trying to enable SSL between Apache Spark 1.4.1 and Kafka 0.9.0.0. I am using the spark-streaming-kafka_2.10 jar to connect to Kafka, and the KafkaUtils.createDirectStream method to read data from a Kafka topic. Initially I hit an OOM issue, which I resolved by increasing the driver memory; after that I am seeing the issue below. I have done a bit of reading and found that spark-streaming-kafka_2.10 uses the Kafka 0.8.2.1 API, which doesn't support SSL (Kafka supports …
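
For reference, connectors built against newer Kafka clients (for example Structured Streaming's spark-sql-kafka-0-10 package on Spark 2.x and later) accept SSL settings as kafka.*-prefixed options that are passed straight to the Kafka consumer; this is not possible with spark-streaming-kafka_2.10 on Spark 1.4.1. A minimal sketch under that assumption, with placeholder broker, topic, and keystore/truststore paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ssl-sketch").getOrCreate()

# The kafka.* prefixed options go directly to the underlying Kafka consumer,
# so SSL is configured the same way as for any Kafka 0.9+ client.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "my_topic")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/etc/kafka/client.truststore.jks")
      .option("kafka.ssl.truststore.password", "changeit")
      .option("kafka.ssl.keystore.location", "/etc/kafka/client.keystore.jks")
      .option("kafka.ssl.keystore.password", "changeit")
      .load())

query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()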

Escape quotes is not working in spark 2.2.0 while reading csv

断了今生、忘了曾经 submitted on 2021-02-07 10:34:18
Question: I am trying to read my tab-separated delimited file but am not able to read all the records. Here are my input records:

head1	head2	head3
a	b	c
a2	a3	a4
a1	"b1	"c1

My code:

var inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  // .option("inferSchema", "true")
  .option("nullValue", "")
  .option("escape", "\"")
  .option("multiLine", true)
  .option("nullValue", null)
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///C:/Users/prhasija/Desktop …
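
A workaround often suggested for rows containing a stray, unbalanced double quote is to turn quote handling off entirely (the Spark CSV reader disables the quote character when it is set to an empty string), so the tab delimiter alone drives the parsing. A minimal PySpark sketch under that assumption, with a hypothetical schema and path; whether it resolves this particular file is not verified:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("tsv-unbalanced-quotes").getOrCreate()

# Hypothetical schema matching the three-column sample above.
schema = StructType([StructField(c, StringType(), True)
                     for c in ["head1", "head2", "head3"]])

df = (spark.read
      .option("delimiter", "\t")
      .option("header", "true")
      # An empty quote character turns quoting off, so a lone " inside a field
      # is kept as ordinary data instead of opening a quoted value.
      .option("quote", "")
      .schema(schema)
      .csv("file:///path/to/input.tsv"))   # placeholder path

df.show(truncate=False)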

How to handle small file problem in spark structured streaming?

半世苍凉 submitted on 2021-02-06 02:59:53
Question: I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into the respective Parquet files in an HDFS store. I am able to store and read the Parquet files, and I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files. These Parquet files need to be read later by Hive …
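
Two common ways to keep the file count down are to shrink the number of output partitions per trigger inside the streaming query, and to compact the small files afterwards with a periodic batch job. A minimal sketch of both, with placeholder broker, topic, and HDFS paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction-sketch").getOrCreate()

kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "my_topic")
            .load())

# Option 1: fewer output partitions per trigger means fewer Parquet files.
query = (kafka_df.selectExpr("CAST(value AS STRING) AS value")
         .coalesce(1)                      # one file per trigger instead of many
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .trigger(processingTime="1 minute")
         .start())

# Option 2: periodically compact a day's small files with a plain batch job,
# then point Hive at the compacted location.
def compact(day_path, target_partitions=8):
    (spark.read.parquet(day_path)
          .repartition(target_partitions)
          .write.mode("overwrite")
          .parquet(day_path + "_compacted"))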

Does the state also gets removed on event timeout with spark structured streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35
Question: Q. Does the state get timed out and removed at the same time, or does only the timeout fire while the state still remains, for both ProcessingTimeTimeout and EventTimeTimeout? I was experimenting with mapGroupsWithState/flatMapGroupsWithState and have some confusion about the state timeout. Consider that I am maintaining state with a watermark of 10 seconds and applying a timeout based on event time, say:

ds.withWatermark("timestamp", "10 seconds")
  .groupByKey(...)
  .mapGroupsWithState( …

Spark Structural Streaming with Confluent Cloud Kafka connectivity issue

落爺英雄遲暮 submitted on 2021-02-04 16:41:16
Question: I am writing a Spark structured streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the Spark readStream() function is shallow and doesn't say much about the optional parameters, especially the auth mechanism. I am not sure which parameter is wrong and breaking the connectivity. Can anyone with experience in Spark help me get this connection started? Required parameters:

> Consumer({'bootstrap.servers':
> 'cluster.gcp.confluent.cloud:9092 …
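
The Consumer({...}) dictionary above is configuration for the confluent-kafka Python client; the Spark Kafka source instead takes the equivalent settings as kafka.*-prefixed readStream options, with Confluent Cloud using SASL_SSL and the PLAIN mechanism and the cluster API key/secret as username/password. A minimal sketch, with placeholder key, secret, bootstrap server and topic:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("confluent-cloud-sketch").getOrCreate()

# API_KEY / API_SECRET are placeholders for the Confluent Cloud cluster credentials.
jaas = ('org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="API_KEY" password="API_SECRET";')

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "earliest")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaas)
      .load())

(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
   .writeStream
   .format("console")
   .start()
   .awaitTermination())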

Killing spark streaming job when no activity

本秂侑毒 submitted on 2021-01-29 13:40:30
Question: I want to kill my Spark Streaming job when there is no activity (i.e. the receivers are not receiving messages) for a certain time. I tried doing this:

var counter = 0
myDStream.foreachRDD { rdd =>
  if (rdd.count() == 0L) {
    counter = counter + 1
    if (counter == 40) {
      ssc.stop(true, true)
    }
  } else {
    counter = 0
  }
}

Is there a better way of doing this? How would I make a variable available to all receivers and update it by 1 whenever there is no activity?

Answer 1: Use a NoSQL Table like …
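
Since foreachRDD runs its function on the driver, a plain driver-side counter is enough; the receivers never need to share the variable, and rdd.isEmpty() avoids a full count of every batch. A minimal PySpark sketch of the same idea, with a placeholder source, batch interval, and idle threshold:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="idle-shutdown-sketch")
ssc = StreamingContext(sc, batchDuration=15)

stream = ssc.textFileStream("hdfs:///tmp/incoming")   # placeholder source

idle_batches = {"count": 0}       # mutable holder so the closure can update it
MAX_IDLE_BATCHES = 40             # e.g. ~10 minutes of empty 15-second batches

def check_activity(time, rdd):
    # foreachRDD invokes this on the driver, so ordinary Python state is fine.
    if rdd.isEmpty():
        idle_batches["count"] += 1
        if idle_batches["count"] >= MAX_IDLE_BATCHES:
            ssc.stop(True, True)  # stop the SparkContext too, gracefully
    else:
        idle_batches["count"] = 0

stream.foreachRDD(check_activity)
ssc.start()
ssc.awaitTermination()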