spark-streaming

Error reading Kafka SSL client truststore file from Spark streaming

百般思念 submitted on 2021-02-10 18:30:39
Question: I have a Spark Streaming application reading from Kafka, running on EMR. Recently I implemented Kafka SSL. I am creating the Kafka client as shown below. When the application tries to read the truststore file, I get a strange error: Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: /tmp/kafka.client.truststore.jks (No such file or directory) What is causing this issue?
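The client-creation code referenced above is not included in this excerpt. A common cause of this particular FileNotFoundException on EMR is that the truststore exists on the node that submits the job but not on the executor nodes. Below is a minimal sketch of the usual fix, assuming a Structured Streaming Kafka source and hypothetical broker and topic names: ship the file with --files and reference it by its base name, which resolves in each executor's working directory.

    // Submit with the truststore shipped to every executor (path taken from the error message):
    //   spark-submit --files /tmp/kafka.client.truststore.jks ... my-streaming-app.jar

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("KafkaSslStream").getOrCreate()

    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")                      // assumed broker address
      .option("subscribe", "my-topic")                                        // assumed topic name
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "kafka.client.truststore.jks") // base name, not /tmp/...
      .option("kafka.ssl.truststore.password", sys.env.getOrElse("TRUSTSTORE_PASSWORD", ""))
      .load()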

pyspark structured streaming write to parquet in batches

牧云@^-^@ submitted on 2021-02-08 09:51:55
Question: I am doing some transformations on a Spark Structured Streaming dataframe and storing the transformed dataframe as parquet files in HDFS. Now I want the write to HDFS to happen in batches instead of transforming the whole dataframe first and then storing it. Answer 1: Here is a parquet sink example: # parquet sink example targetParquetHDFS = sourceTopicKAFKA .writeStream .format("parquet") # can be "orc", "json", "csv", etc. .outputMode("append") # can only be "append"
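The answer above is truncated. As a rough sketch of what a complete sink of this shape looks like, written here in Scala (the PySpark builder is the same apart from syntax), with hypothetical paths and trigger interval: a streaming file sink already writes one set of files per micro-batch, and the trigger controls how often those micro-batches run, so there is no need to materialise the whole dataframe first.

    import org.apache.spark.sql.streaming.Trigger

    // 'sourceTopicKAFKA' stands for the already-transformed streaming DataFrame from the answer above.
    val query = sourceTopicKAFKA.writeStream
      .format("parquet")                                        // could also be "orc", "json", "csv", ...
      .outputMode("append")                                     // file sinks only support append
      .option("path", "hdfs:///data/output/parquet")            // hypothetical output directory
      .option("checkpointLocation", "hdfs:///data/checkpoints") // required for file sinks
      .trigger(Trigger.ProcessingTime("60 seconds"))            // one write per minute
      .start()

    query.awaitTermination()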

spark streaming only executors on one machine are working

大憨熊 submitted on 2021-02-08 08:51:11
Question: I'm using Spark Streaming to process messages delivered by Kafka, and I have come across a problem. There are several executors set up on different machines to process the tasks, but only one executor, or to be specific, only the executors on one machine, is actually working while the others remain idle. Tasks are now heavily queued and I often get OOM alerts. Here is my config: --driver-cores 1 --driver-memory 512m --executor-memory 512m --conf spark.memory.useLegacyMode=true --conf spark
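The config is cut off, so this excerpt does not contain enough to diagnose the problem. One thing that commonly produces this symptom with Kafka (an assumption, not a diagnosis of this specific setup) is that a direct stream creates exactly one Spark partition per Kafka partition, so a topic with a single partition keeps all work in one task. A hedged sketch of spreading such a stream across executors by repartitioning each batch, with broker and topic names assumed:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("SpreadKafkaLoad")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")     // assumed broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("some-topic"))                              // assumed topic

    // One Spark partition per Kafka partition; repartition so every executor gets tasks.
    stream.repartition(12).foreachRDD { rdd =>
      rdd.foreachPartition(records => records.foreach(println))         // placeholder processing
    }

    ssc.start()
    ssc.awaitTermination()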

How to deduplicate and keep latest based on timestamp field in spark structured streaming?

天大地大妈咪最大 submitted on 2021-02-08 08:44:17
Question: Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example, if the micro-batches below are what I receive, I want to keep the most recent record (sorted on the timestamp field) for each country. batchId: 0 Australia, 10, 2020-05-05 00:00:06 Belarus, 10, 2020-05-05 00:00:06 batchId: 1 Australia, 10, 2020-05-05 00:00:08 Belarus, 10, 2020-05-05 00:00:03 Then output
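A hedged sketch of one common keep-latest approach: handle each micro-batch with foreachBatch and keep, per country, only the row with the greatest timestamp. Column names and the sink path are assumptions, and note that this deduplicates within a micro-batch only; keeping the latest across batches typically needs stateful processing (for example flatMapGroupsWithState) or an upsert-capable sink.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // 'events' stands for a streaming DataFrame with columns country, value, event_time.
    val query = events.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        val latestPerCountry = batch
          .withColumn("rn", row_number().over(
            Window.partitionBy("country").orderBy(col("event_time").desc)))
          .filter(col("rn") === 1)                                            // newest row per country
          .drop("rn")
        latestPerCountry.write.mode("append").parquet("/tmp/latest-per-country") // hypothetical sink
      }
      .start()

    query.awaitTermination()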

MapWithStateRDDRecord with kryo

旧城冷巷雨未停 submitted on 2021-02-08 07:54:49
Question: How can I register MapWithStateRDDRecord with Kryo? When I try `sparkConfiguration.registerKryoClasses(Array(classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord]))` I get the error: class MapWithStateRDDRecord in package rdd cannot be accessed in package org.apache.spark.streaming.rdd [error] classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord] I'd like to make sure that all serialization is done via Kryo, so I set SparkConf().set("spark.kryo.registrationRequired", "true")
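MapWithStateRDDRecord is private to the org.apache.spark.streaming.rdd package, which is why classOf cannot reference it from user code. A commonly suggested workaround, sketched here without being tied to a specific Spark version, is to look the class up by name at runtime instead:

    import org.apache.spark.SparkConf

    val sparkConfiguration = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")

    // Class.forName sidesteps the compile-time access check that breaks classOf[...]
    sparkConfiguration.registerKryoClasses(Array(
      Class.forName("org.apache.spark.streaming.rdd.MapWithStateRDDRecord")
    ))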

How to integrate Spark and Kafka for direct stream

元气小坏坏 submitted on 2021-02-08 05:58:16
Question: I am having difficulty creating a basic Spark Streaming application. Right now I am trying it on my local machine. I have done the following setup: set up Zookeeper; set up Kafka (version: kafka_2.10-0.9.0.1); created a topic using the command below: kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test; started a producer and a consumer in two different cmd terminals using the commands below. Producer: kafka-console-producer.bat --broker-list localhost:9092
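The question is cut off before the streaming code itself. A minimal sketch of a direct stream against the setup described, assuming the spark-streaming-kafka 0.8-style connector is on the classpath (the exact artifact name depends on the Spark version) and using the test topic created above:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DirectStreamDemo").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Direct approach: read from the broker directly, no receiver needed.
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("test"))

        // Each record is a (key, value) pair; print whatever is typed into the console producer.
        messages.map(_._2).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }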

How to execute async operations (i.e. returning a Future) from map/filter/etc.?

十年热恋 submitted on 2021-02-07 20:20:23
Question: I have a DataSet.map operation that needs to pull data in from an external REST API. The REST API client returns a Future[Int]. Is it possible to have the DataSet.map operation somehow await this Future asynchronously? Or will I need to block the thread using Await.result? Or is this just not the done thing... i.e. should I instead try to load the data held by the API into a DataSet of its own and perform a join? Thanks in advance! EDIT: Different from: Spark job with Async HTTP call
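A hedged sketch of the trade-off raised in the question: a map function has to return values, not Futures, so if the calls stay inside the job, one common compromise is mapPartitions, firing every call in a partition first and blocking once with Await.result on the combined Future. The callApi client below is hypothetical, standing in for the real REST client.

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    val spark = SparkSession.builder().appName("AsyncEnrich").getOrCreate()
    import spark.implicits._

    // Hypothetical REST client: one Future[Int] per input record.
    def callApi(id: String)(implicit ec: ExecutionContext): Future[Int] =
      Future(id.length)                                  // stand-in for the real HTTP call

    val ids = spark.createDataset(Seq("a", "bb", "ccc"))

    val enriched = ids.mapPartitions { part =>
      implicit val ec: ExecutionContext = ExecutionContext.global
      // Start every call for the partition, then block once on the combined Future,
      // rather than blocking per record inside map().
      val futures = part.map(id => callApi(id).map(n => (id, n))).toVector
      Await.result(Future.sequence(futures), 2.minutes).iterator
    }

    enriched.show()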