spark-streaming

Spark Streaming - processing binary data file

巧了我就是萌 submitted on 2021-02-07 14:39:33
Question: I'm using pyspark 1.6.0. I have existing PySpark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code then parses the bits in the data to convert them into ints, strings, booleans, etc. Each binary file holds one record of data. In PySpark I read a binary file using: sc.binaryFiles("s3n://.......") This works well, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (hopefully the …
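
There is no exact streaming counterpart of sc.binaryFiles in PySpark 1.6, but StreamingContext.binaryRecordsStream comes close when every file consists of fixed-length records. A minimal sketch, assuming a placeholder bucket path and a hypothetical RECORD_LENGTH that must match the real record size; note it yields raw record bytes rather than (filename, data) pairs:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="binary-stream-sketch")
ssc = StreamingContext(sc, batchDuration=30)

# Placeholder: must equal the fixed size in bytes of one record.
RECORD_LENGTH = 1024

# binaryRecordsStream watches a directory for new flat binary files and emits
# each fixed-length record as raw bytes.
records = ssc.binaryRecordsStream("s3n://my-bucket/incoming/", RECORD_LENGTH)

def parse_record(raw):
    # Hypothetical parser: downstream code would unpack ints/strings/booleans here.
    return len(raw)

records.map(parse_record).pprint()

ssc.start()
ssc.awaitTermination()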

Enabling SSL between Apache spark and Kafka broker

你说的曾经没有我的故事 submitted on 2021-02-07 10:43:15
Question: I am trying to enable SSL between Apache Spark 1.4.1 and Kafka 0.9.0.0. I am using the spark-streaming-kafka_2.10 jar to connect to Kafka, and the KafkaUtils.createDirectStream method to read data from a Kafka topic. Initially I hit an OOM issue, which I resolved by increasing the driver memory; after that I am seeing the issue below. I have done a bit of reading and found that spark-streaming-kafka_2.10 uses the Kafka 0.8.2.1 API, which doesn't support SSL (Kafka supports …
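
For reference, connectors built against newer Kafka clients (for example Structured Streaming's spark-sql-kafka-0-10 package on Spark 2.x and later) accept SSL settings as kafka.*-prefixed options that are passed straight to the Kafka consumer; this is not possible with spark-streaming-kafka_2.10 on Spark 1.4.1. A minimal sketch under that assumption, with placeholder broker, topic, and keystore/truststore paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ssl-sketch").getOrCreate()

# The kafka.* prefixed options go directly to the underlying Kafka consumer,
# so SSL is configured the same way as for any Kafka 0.9+ client.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "my_topic")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/etc/kafka/client.truststore.jks")
      .option("kafka.ssl.truststore.password", "changeit")
      .option("kafka.ssl.keystore.location", "/etc/kafka/client.keystore.jks")
      .option("kafka.ssl.keystore.password", "changeit")
      .load())

query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()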

Escape quotes is not working in spark 2.2.0 while reading csv

断了今生、忘了曾经 submitted on 2021-02-07 10:34:18
Question: I am trying to read my tab-separated delimited file but am not able to read all the records. Here are my input records:

head1	head2	head3
a	b	c
a2	a3	a4
a1	"b1	"c1

My code:

var inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  // .option("inferSchema", "true")
  .option("nullValue", "")
  .option("escape", "\"")
  .option("multiLine", true)
  .option("nullValue", null)
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///C:/Users/prhasija/Desktop …
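
A workaround often suggested for rows containing a stray, unbalanced double quote is to turn quote handling off entirely (the Spark CSV reader disables the quote character when it is set to an empty string), so the tab delimiter alone drives the parsing. A minimal PySpark sketch under that assumption, with a hypothetical schema and path; whether it resolves this particular file is not verified:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("tsv-unbalanced-quotes").getOrCreate()

# Hypothetical schema matching the three-column sample above.
schema = StructType([StructField(c, StringType(), True)
                     for c in ["head1", "head2", "head3"]])

df = (spark.read
      .option("delimiter", "\t")
      .option("header", "true")
      # An empty quote character turns quoting off, so a lone " inside a field
      # is kept as ordinary data instead of opening a quoted value.
      .option("quote", "")
      .schema(schema)
      .csv("file:///path/to/input.tsv"))   # placeholder path

df.show(truncate=False)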

How to handle small file problem in spark structured streaming?

半世苍凉 submitted on 2021-02-06 02:59:53
Question: I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into the respective Parquet files in an HDFS store. I am able to store and read the Parquet files, and I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files. These Parquet files need to be read later by Hive …
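
Two common ways to keep the file count down are to shrink the number of output partitions per trigger inside the streaming query, and to compact the small files afterwards with a periodic batch job. A minimal sketch of both, with placeholder broker, topic, and HDFS paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction-sketch").getOrCreate()

kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "my_topic")
            .load())

# Option 1: fewer output partitions per trigger means fewer Parquet files.
query = (kafka_df.selectExpr("CAST(value AS STRING) AS value")
         .coalesce(1)                      # one file per trigger instead of many
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .trigger(processingTime="1 minute")
         .start())

# Option 2: periodically compact a day's small files with a plain batch job,
# then point Hive at the compacted location.
def compact(day_path, target_partitions=8):
    (spark.read.parquet(day_path)
          .repartition(target_partitions)
          .write.mode("overwrite")
          .parquet(day_path + "_compacted"))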

Does the state also gets removed on event timeout with spark structured streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35
Question: Q. Does the state get timed out and removed at the same time, or does only the timeout fire while the state still remains, for both ProcessingTimeTimeout and EventTimeTimeout? I was experimenting with mapGroupsWithState/flatMapGroupsWithState and have some confusion about the state timeout. Consider that I am maintaining state with a watermark of 10 seconds and applying a timeout based on event time, say:

ds.withWatermark("timestamp", "10 seconds")
  .groupByKey(...)
  .mapGroupsWithState( …

Spark Structural Streaming with Confluent Cloud Kafka connectivity issue

落爺英雄遲暮 submitted on 2021-02-04 16:41:16
Question: I am writing a Spark structured streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the Spark readStream() function is shallow and doesn't say much about the optional parameters, especially the auth mechanism. I am not sure which parameter is wrong and breaking the connectivity. Can anyone with experience in Spark help me get this connection started? Required parameters:

> Consumer({'bootstrap.servers':
> 'cluster.gcp.confluent.cloud:9092 …
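
The Consumer({...}) dictionary above is configuration for the confluent-kafka Python client; the Spark Kafka source instead takes the equivalent settings as kafka.*-prefixed readStream options, with Confluent Cloud using SASL_SSL and the PLAIN mechanism and the cluster API key/secret as username/password. A minimal sketch, with placeholder key, secret, bootstrap server and topic:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("confluent-cloud-sketch").getOrCreate()

# API_KEY / API_SECRET are placeholders for the Confluent Cloud cluster credentials.
jaas = ('org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="API_KEY" password="API_SECRET";')

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "earliest")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaas)
      .load())

(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
   .writeStream
   .format("console")
   .start()
   .awaitTermination())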

Killing spark streaming job when no activity

本秂侑毒 submitted on 2021-01-29 13:40:30
Question: I want to kill my Spark Streaming job when there is no activity (i.e. the receivers are not receiving messages) for a certain time. I tried doing this:

var counter = 0
myDStream.foreachRDD { rdd =>
  if (rdd.count() == 0L) {
    counter = counter + 1
    if (counter == 40) {
      ssc.stop(true, true)
    }
  } else {
    counter = 0
  }
}

Is there a better way of doing this? How would I make a variable available to all receivers and update it by 1 whenever there is no activity?

Answer 1: Use a NoSQL Table like …
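
Since foreachRDD runs its function on the driver, a plain driver-side counter is enough; the receivers never need to share the variable, and rdd.isEmpty() avoids a full count of every batch. A minimal PySpark sketch of the same idea, with a placeholder source, batch interval, and idle threshold:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="idle-shutdown-sketch")
ssc = StreamingContext(sc, batchDuration=15)

stream = ssc.textFileStream("hdfs:///tmp/incoming")   # placeholder source

idle_batches = {"count": 0}       # mutable holder so the closure can update it
MAX_IDLE_BATCHES = 40             # e.g. ~10 minutes of empty 15-second batches

def check_activity(time, rdd):
    # foreachRDD invokes this on the driver, so ordinary Python state is fine.
    if rdd.isEmpty():
        idle_batches["count"] += 1
        if idle_batches["count"] >= MAX_IDLE_BATCHES:
            ssc.stop(True, True)  # stop the SparkContext too, gracefully
    else:
        idle_batches["count"] = 0

stream.foreachRDD(check_activity)
ssc.start()
ssc.awaitTermination()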