spark-streaming

Error reading Kafka SSL client truststore file from Spark streaming

百般思念 submitted on 2021-02-10 18:30:39
Question: I have a Spark Streaming application reading from Kafka, running on EMR. Recently I implemented Kafka SSL. I am creating the Kafka client as shown below. When the application tries to read the truststore file, I get a strange error: Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: /tmp/kafka.client.truststore.jks (No such file or directory) What is causing this issue?
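The client-creation code referenced above is not included in this excerpt. A common cause of this particular FileNotFoundException on EMR is that the truststore exists on the node that submits the job but not on the executor nodes. Below is a minimal sketch of the usual fix, assuming a Structured Streaming Kafka source and hypothetical broker and topic names: ship the file with --files and reference it by its base name, which resolves in each executor's working directory.

    // Submit with the truststore shipped to every executor (path taken from the error message):
    //   spark-submit --files /tmp/kafka.client.truststore.jks ... my-streaming-app.jar

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("KafkaSslStream").getOrCreate()

    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")                      // assumed broker address
      .option("subscribe", "my-topic")                                        // assumed topic name
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "kafka.client.truststore.jks") // base name, not /tmp/...
      .option("kafka.ssl.truststore.password", sys.env.getOrElse("TRUSTSTORE_PASSWORD", ""))
      .load()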

pyspark structured streaming write to parquet in batches

牧云@^-^@ submitted on 2021-02-08 09:51:55
Question: I am doing some transformations on a Spark Structured Streaming dataframe and storing the transformed dataframe as parquet files in HDFS. Now I want the write to HDFS to happen in batches instead of transforming the whole dataframe first and then storing it. Answer 1: Here is a parquet sink example: # parquet sink example targetParquetHDFS = sourceTopicKAFKA .writeStream .format("parquet") # can be "orc", "json", "csv", etc. .outputMode("append") # can only be "append"
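The answer above is truncated. As a rough sketch of what a complete sink of this shape looks like, written here in Scala (the PySpark builder is the same apart from syntax), with hypothetical paths and trigger interval: a streaming file sink already writes one set of files per micro-batch, and the trigger controls how often those micro-batches run, so there is no need to materialise the whole dataframe first.

    import org.apache.spark.sql.streaming.Trigger

    // 'sourceTopicKAFKA' stands for the already-transformed streaming DataFrame from the answer above.
    val query = sourceTopicKAFKA.writeStream
      .format("parquet")                                        // could also be "orc", "json", "csv", ...
      .outputMode("append")                                     // file sinks only support append
      .option("path", "hdfs:///data/output/parquet")            // hypothetical output directory
      .option("checkpointLocation", "hdfs:///data/checkpoints") // required for file sinks
      .trigger(Trigger.ProcessingTime("60 seconds"))            // one write per minute
      .start()

    query.awaitTermination()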

spark streaming only executors on one machine are working

大憨熊 submitted on 2021-02-08 08:51:11
Question: I'm using Spark Streaming to process messages delivered by Kafka, and I have come across a problem. There are several executors set up on different machines to process the tasks, but only one executor, or to be specific, only the executors on one machine, is actually working while the others remain idle. Tasks are now heavily queued and I often get OOM alerts. Here is my config: --driver-cores 1 --driver-memory 512m --executor-memory 512m --conf spark.memory.useLegacyMode=true --conf spark
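The config is cut off, so this excerpt does not contain enough to diagnose the problem. One thing that commonly produces this symptom with Kafka (an assumption, not a diagnosis of this specific setup) is that a direct stream creates exactly one Spark partition per Kafka partition, so a topic with a single partition keeps all work in one task. A hedged sketch of spreading such a stream across executors by repartitioning each batch, with broker and topic names assumed:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("SpreadKafkaLoad")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")     // assumed broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("some-topic"))                              // assumed topic

    // One Spark partition per Kafka partition; repartition so every executor gets tasks.
    stream.repartition(12).foreachRDD { rdd =>
      rdd.foreachPartition(records => records.foreach(println))         // placeholder processing
    }

    ssc.start()
    ssc.awaitTermination()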

How to deduplicate and keep latest based on timestamp field in spark structured streaming?

天大地大妈咪最大 submitted on 2021-02-08 08:44:17
Question: Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example, if the micro-batches below are what I receive, I want to keep the most recent record (sorted on the timestamp field) for each country. batchId: 0 Australia, 10, 2020-05-05 00:00:06 Belarus, 10, 2020-05-05 00:00:06 batchId: 1 Australia, 10, 2020-05-05 00:00:08 Belarus, 10, 2020-05-05 00:00:03 Then output
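A hedged sketch of one common keep-latest approach: handle each micro-batch with foreachBatch and keep, per country, only the row with the greatest timestamp. Column names and the sink path are assumptions, and note that this deduplicates within a micro-batch only; keeping the latest across batches typically needs stateful processing (for example flatMapGroupsWithState) or an upsert-capable sink.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // 'events' stands for a streaming DataFrame with columns country, value, event_time.
    val query = events.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        val latestPerCountry = batch
          .withColumn("rn", row_number().over(
            Window.partitionBy("country").orderBy(col("event_time").desc)))
          .filter(col("rn") === 1)                                            // newest row per country
          .drop("rn")
        latestPerCountry.write.mode("append").parquet("/tmp/latest-per-country") // hypothetical sink
      }
      .start()

    query.awaitTermination()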

MapWithStateRDDRecord with kryo

旧城冷巷雨未停 submitted on 2021-02-08 07:54:49
Question: How can I register MapWithStateRDDRecord with Kryo? When I try `sparkConfiguration.registerKryoClasses(Array(classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord]))` I get the error: class MapWithStateRDDRecord in package rdd cannot be accessed in package org.apache.spark.streaming.rdd [error] classOf[org.apache.spark.streaming.rdd.MapWithStateRDDRecord] I'd like to make sure that all serialization is done via Kryo, so I set SparkConf().set("spark.kryo.registrationRequired", "true")
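MapWithStateRDDRecord is private to the org.apache.spark.streaming.rdd package, which is why classOf cannot reference it from user code. A commonly suggested workaround, sketched here without being tied to a specific Spark version, is to look the class up by name at runtime instead:

    import org.apache.spark.SparkConf

    val sparkConfiguration = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")

    // Class.forName sidesteps the compile-time access check that breaks classOf[...]
    sparkConfiguration.registerKryoClasses(Array(
      Class.forName("org.apache.spark.streaming.rdd.MapWithStateRDDRecord")
    ))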

How to integrate Spark and Kafka for direct stream

元气小坏坏 submitted on 2021-02-08 05:58:16
Question: I am having difficulty creating a basic Spark Streaming application. Right now I am trying it on my local machine. I have done the following setup: set up Zookeeper; set up Kafka (version: kafka_2.10-0.9.0.1); created a topic using the command below: kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test; started a producer and a consumer in two different cmd terminals using the commands below. Producer: kafka-console-producer.bat --broker-list localhost:9092
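The question is cut off before the streaming code itself. A minimal sketch of a direct stream against the setup described, assuming the spark-streaming-kafka 0.8-style connector is on the classpath (the exact artifact name depends on the Spark version) and using the test topic created above:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DirectStreamDemo").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Direct approach: read from the broker directly, no receiver needed.
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("test"))

        // Each record is a (key, value) pair; print whatever is typed into the console producer.
        messages.map(_._2).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }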

How to execute async operations (i.e. returning a Future) from map/filter/etc.?

十年热恋 submitted on 2021-02-07 20:20:23
Question: I have a DataSet.map operation that needs to pull data in from an external REST API. The REST API client returns a Future[Int]. Is it possible to have the DataSet.map operation somehow await this Future asynchronously? Or will I need to block the thread using Await.result? Or is this just not the done thing... i.e. should I instead try to load the data held by the API into a DataSet of its own and perform a join? Thanks in advance! EDIT: Different from: Spark job with Async HTTP call
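A hedged sketch of the trade-off raised in the question: a map function has to return values, not Futures, so if the calls stay inside the job, one common compromise is mapPartitions, firing every call in a partition first and blocking once with Await.result on the combined Future. The callApi client below is hypothetical, standing in for the real REST client.

    import org.apache.spark.sql.SparkSession
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    val spark = SparkSession.builder().appName("AsyncEnrich").getOrCreate()
    import spark.implicits._

    // Hypothetical REST client: one Future[Int] per input record.
    def callApi(id: String)(implicit ec: ExecutionContext): Future[Int] =
      Future(id.length)                                  // stand-in for the real HTTP call

    val ids = spark.createDataset(Seq("a", "bb", "ccc"))

    val enriched = ids.mapPartitions { part =>
      implicit val ec: ExecutionContext = ExecutionContext.global
      // Start every call for the partition, then block once on the combined Future,
      // rather than blocking per record inside map().
      val futures = part.map(id => callApi(id).map(n => (id, n))).toVector
      Await.result(Future.sequence(futures), 2.minutes).iterator
    }

    enriched.show()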