spark-streaming

Spark Streaming - read json from Kafka and write json to another Kafka topic

不羁岁月 submitted on 2019-12-07 06:22:56
Question: I'm trying to prepare a Spark Streaming application (Spark 2.1, Kafka 0.10). I need to read data from the Kafka topic "input", find the correct records, and write the result to the topic "output". I can read data from Kafka with the KafkaUtils.createDirectStream method. I converted the RDD to JSON and prepared the filters:

    val messages = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)
    )
    val elementDstream = messages.map(v => v.value).foreachRDD { rdd =>
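A minimal sketch of the write-back side, continuing from the messages DStream above; the broker address, the filter condition, and the "output" topic name are assumptions for illustration only:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Producer settings are serializable, so they can be defined on the driver.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    messages.map(_.value).foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Create the (non-serializable) producer on the executor, once per partition.
        val producer = new KafkaProducer[String, String](producerProps)
        partition
          .filter(json => json.contains("\"status\":\"ok\""))   // stand-in for the real "correct data" check
          .foreach(json => producer.send(new ProducerRecord[String, String]("output", json)))
        producer.close()
      }
    }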

For each RDD in a DStream how do I convert this to an array or some other typical Java data type?

谁说我不能喝 submitted on 2019-12-07 03:10:13
Question: I would like to convert a DStream into an array, a list, etc. so I can then translate it to JSON and serve it on an endpoint. I'm using Apache Spark and ingesting Twitter data. How do I perform this operation on the DStream statuses? I can't seem to get anything to work other than print().

    import org.apache.spark._
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._
    import org.apache.spark.streaming.StreamingContext._
    import
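A minimal sketch of one way to do this, assuming statuses has already been mapped to a DStream[String] of tweet texts: collect each micro-batch on the driver and append it to a buffer that an endpoint could later serialize to JSON. collect() pulls a whole batch to the driver, so this only suits small volumes.

    import scala.collection.mutable.ArrayBuffer

    // Shared driver-side buffer; synchronized because foreachRDD runs on the streaming thread.
    val collected = ArrayBuffer.empty[String]

    statuses.foreachRDD { rdd =>
      val batch: Array[String] = rdd.collect()          // DStream contents -> Array for this batch
      collected.synchronized { collected ++= batch }
    }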

Is it possible to obtain specific message offset in Kafka+SparkStreaming?

匆匆过客 submitted on 2019-12-07 02:53:19
Question: I'm trying to obtain and store the offset for a specific message in Kafka by using the Spark direct stream. Looking at the Spark documentation, it is simple to obtain the range of offsets for each partition, but what I need is to store the start offset for each message of a topic after a full scan of the queue.

Answer 1: Yes, you can use the MessageAndMetadata version of createDirectStream, which allows you to access message metadata. You can find an example here which returns a DStream of Tuple3.

    val ssc = new
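A sketch of that overload from the old spark-streaming-kafka (0.8) API; the topic name, the starting offsets, and the Tuple3 shape are illustrative:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Start reading topic "input", partition 0, from offset 0 (placeholder values).
    val fromOffsets = Map(TopicAndPartition("input", 0) -> 0L)

    // The messageHandler runs per record and can expose that record's own offset.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
        (String, Long, String)](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.offset, mmd.message())
    )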

Can't access kafka.serializer.StringDecoder

给你一囗甜甜゛ submitted on 2019-12-07 02:04:26
I have added the sbt packages for Kafka and Spark Streaming as follows:

    "org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
    "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1"

However, when I want to use the Kafka direct stream, I can't access it:

    val topics = "CCN_TOPIC,GGSN_TOPIC"
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers)
    val messages = org.apache.spark.streaming.kafka.KafkaUtils[String, String, kafka.serializer.StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

The compiler doesn't recognize kafka
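For reference, a sketch of how this call usually looks once kafka.serializer.StringDecoder resolves; it assumes the spark-streaming-kafka_2.10 dependency (which pulls in the Kafka 0.8 client containing StringDecoder) is on the compile classpath rather than scoped as "provided":

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Keys and values are decoded as plain strings.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)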

Websphere MQ as a data source for Apache Spark Streaming

末鹿安然 submitted on 2019-12-06 18:20:09
Question: I was digging into the possibilities for Websphere MQ as a data source for Spark Streaming because it is needed in one of our use cases. I learned that MQTT is the protocol that supports communication with MQ data structures, but since I am a newbie to Spark Streaming I need some working examples. Has anyone tried to connect MQ with Spark Streaming? Please advise on the best way to do so.

Answer 1: So, I am posting here the working code for CustomMQReceiver which connects the
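The answer is cut off above, so here is only the Spark-side skeleton of such a custom receiver (onStart/onStop/store are the standard Receiver API); the actual MQ connection and message-reading calls are omitted and would come from the IBM MQ client libraries:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class CustomMQReceiver(host: String, port: Int, queueManager: String, queue: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        // Connect to MQ on a background thread so onStart returns quickly.
        new Thread("WebSphere MQ Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      override def onStop(): Unit = {
        // Close the MQ connection here.
      }

      private def receive(): Unit = {
        while (!isStopped()) {
          // Read the next message from the queue (MQ-specific code omitted)
          // and hand it to Spark:
          // store(messageText)
        }
      }
    }

    // Usage: val mqStream = ssc.receiverStream(new CustomMQReceiver("mqhost", 1414, "QM1", "Q1"))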

Structured Streaming Kafka Source Offset Storage

杀马特。学长 韩版系。学妹 submitted on 2019-12-06 16:39:33
Question: I am using the Structured Streaming source for Kafka (Integration guide), which, as stated, does not commit any offsets. One of my goals is to monitor it (check whether it's lagging behind, etc.). Even though it does not commit the offsets, it handles them by querying Kafka from time to time and checking which one is the next to process. According to the documentation the offsets are written to HDFS so that in case of failure they can be recovered, but the question is: where are they being stored? Is there any
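For context, a sketch of where those offsets end up: Structured Streaming keeps its own offset log under the query's checkpoint location (an offsets/ subdirectory), not in Kafka's __consumer_offsets topic. The paths, topic, and sink below are placeholders:

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input")
      .load()
      .writeStream
      .format("parquet")
      .option("path", "/data/out")
      // Offsets for each batch are written under hdfs:///checkpoints/my-query/offsets
      .option("checkpointLocation", "hdfs:///checkpoints/my-query")
      .start()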

Spark Kafka Streaming CommitAsync Error [duplicate]

无人久伴 submitted on 2019-12-06 16:12:55
Question: This question already has an answer here: Exception while accessing KafkaOffset from RDD (1 answer). Closed last year. I am new to Scala and the RDD concept. I am reading messages from Kafka using the Kafka stream API in Spark and trying to commit after the business work, but I am getting an error. Note: I am using repartition for parallel work. How do I read the offset from the stream API and commit it to Kafka?

    scalaVersion := "2.11.8"
    val sparkVersion = "2.2.0"
    val connectorVersion = "2.0.7"
    val kafka_stream_version = "1.6
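The usual pattern with spark-streaming-kafka-0-10, sketched here under the assumption that stream is the DStream returned by KafkaUtils.createDirectStream: read the offset ranges from the original stream's RDD before repartitioning (derived RDDs lose them), then commit once the batch's work is done:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // Only the RDD that comes straight from the Kafka direct stream carries offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Do the parallel business work on a repartitioned copy.
      rdd.repartition(8).foreachPartition { records =>
        records.foreach { record =>
          // business logic per record
        }
      }

      // Commit back to Kafka once the batch has been processed.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }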

How to check if n consecutive events from a Kafka stream are greater or less than a threshold limit

人盡茶涼 submitted on 2019-12-06 16:11:19
I am new to pyspark. I have written a pyspark program to read a Kafka stream using a window operation. I am publishing the messages below to Kafka every second, with different sources and temperatures along with a timestamp.

    {"temperature":34,"time":"2019-04-17 12:53:02","source":"1010101"}
    {"temperature":29,"time":"2019-04-17 12:53:03","source":"1010101"}
    {"temperature":28,"time":"2019-04-17 12:53:04","source":"1010101"}
    {"temperature":34,"time":"2019-04-17 12:53:05","source":"1010101"}
    {"temperature":45,"time":"2019-04-17 12:53:06","source":"1010101"}
    {"temperature":34,"time":"2019-04-17 12:53
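One way to frame the check, sketched in Scala rather than PySpark and assuming readings is a DStream of (source, temperature) pairs already parsed from the JSON: slide a window of n seconds over the stream and flag a source only when every reading in the window crossed the threshold (the question publishes one reading per source per second, so n readings in the window means n consecutive seconds).

    import org.apache.spark.streaming.Seconds

    val threshold = 30   // illustrative limit
    val n = 5            // number of consecutive readings to check

    val flagged = readings
      .window(Seconds(n), Seconds(1))
      .map { case (source, temp) => (source, (1, if (temp > threshold) 1 else 0)) }
      // per source: (total readings in window, readings above the threshold)
      .reduceByKey { case ((c1, a1), (c2, a2)) => (c1 + c2, a1 + a2) }
      .filter { case (_, (count, above)) => count == n && above == n }
      .map { case (source, _) => source }

    flagged.print()   // sources whose last n readings were all above the threshold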

How to convert Spark Streaming output into a dataframe or store it in a table

我怕爱的太早我们不能终老 submitted on 2019-12-06 15:38:10
My code is:

    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("hello" -> 5))
    val data = lines.map(_._2)
    data.print()

My output has 50 different values in a format like this:

    {"id:st04","data:26-02-2018 20:30:40","temp:30", "press:20"}

Can anyone help me store this data in a table of the form below? I would really appreciate it.

    | id   | date                | temp | press |
    | st01 | 26-02-2018 20:30:40 | 30   | 20    |
    | st01 | 26-02-2018 20:30:45 | 80   | 70    |

T. Gawęda: You can use the foreachRDD function together with the normal Dataset API:

    data.foreachRDD(rdd => {
      // rdd is RDD[String]
      // foreachRDD is
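The answer is truncated above; a sketch of the same idea, assuming the payload is (or has been massaged into) valid JSON such as {"id":"st04","data":"26-02-2018 20:30:40","temp":30,"press":20} and that a Hive-compatible table name is available:

    import org.apache.spark.sql.SparkSession

    data.foreachRDD { rdd =>
      // Get (or create) the session inside foreachRDD so the code also works after recovery.
      val spark = SparkSession.builder().getOrCreate()

      if (!rdd.isEmpty()) {
        // Parse each JSON string of the batch into a DataFrame with the columns shown above.
        val df = spark.read.json(rdd).select("id", "data", "temp", "press")
        df.write.mode("append").saveAsTable("sensor_readings")   // hypothetical table name
      }
    }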

Two node DSE spark cluster error setting up second node. Why?

醉酒当歌 submitted on 2019-12-06 15:33:15
I have a DSE Spark cluster with 2 nodes. One DSE analytics node cannot start after I install it with Spark enabled; without Spark it starts just fine. On my other node Spark is enabled and it starts and works just fine. Why is that, and how can I solve it? Thanks. Here is my error log:

    ERROR [main] 2016-02-27 20:35:43,353 CassandraDaemon.java:294 - Fatal exception during initialization
    org.apache.cassandra.exceptions.ConfigurationException: Cannot start node if snitch's data center (Analytics) differs from previous data center (Cassandra). Please fix the snitch configuration, decommission and