I'm looking for a way to write a DStream to an output Kafka topic, but only when the micro-batch RDDs actually produce something.
I'm using Spark Streaming and spark-streaming
In my example I want to send events taken from one Kafka topic to another. I do a simple word count: I read data from the input Kafka topic, count the words, and write the counts to an output Kafka topic. The goal is to write the contents of a JavaPairDStream to the output Kafka topic using Spark Streaming.
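To make the goal concrete, here is a minimal plain-Java sketch of what each micro-batch computes, with no Spark involved. The class name `BatchWordCount` and the single-space `SPACE` pattern are assumptions made for illustration; the streaming job expresses the same steps with `flatMap`, `mapToPair` and `reduceByKey`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class BatchWordCount {
    // Assumption: lines are split on single spaces
    private static final Pattern SPACE = Pattern.compile(" ");

    /** Counts how often each word occurs in the lines of one micro-batch. */
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : SPACE.split(line)) {
                // merge() plays the role of reduceByKey: it sums the 1s per word
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(count(Arrays.asList("a b a", "b c")));
    }
}
```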
//Spark Configuration
SparkConf sparkConf = new SparkConf().setAppName("SendEventsToKafka");
String brokerUrl = "localhost:9092";
String inputTopic = "receiverTopic";
String outputTopic = "producerTopic";
//Create the java streaming context
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
//Prepare the list of topics we listen for
Set<String> topicList = new TreeSet<>();
topicList.add(inputTopic);
//Kafka direct stream parameters
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", brokerUrl);
kafkaParams.put("group.id", "kafka-cassandra" + new SecureRandom().nextInt(100));
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
//Kafka output topic specific properties
Properties props = new Properties();
props.put("bootstrap.servers", brokerUrl);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "1");
props.put("retries", "3");
props.put("linger.ms", 5);
//Here we create a direct stream for kafka input data.
final JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topicList, kafkaParams));
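//Turn each Kafka record into a (key, value) pair
JavaPairDStream<String, String> results = messages
        .mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
            @Override
            public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
                return new Tuple2<>(record.key(), record.value());
            }
        });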
//Keep only the message value
JavaDStream<String> lines = results.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
//Split each line into words
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String x) {
        log.info("Line retrieved {}", x);
        return Arrays.asList(SPACE.split(x)).iterator();
    }
});
//Map each word to (word, 1) and sum the counts per word
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        log.info("Word to count {}", s);
        return new Tuple2<>(s, 1);
    }
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        log.info("Count with reduceByKey {}", i1 + i2);
        return i1 + i2;
    }
});
//Here we iterate over the JavaPairDStream to write words and their counts into kafka
wordCounts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
    @Override
    public void call(JavaPairRDD<String, Integer> arg0) throws Exception {
        //collectAsMap() brings the whole batch result back to the driver
        Map<String, Integer> wordCountMap = arg0.collectAsMap();
        //topicList is never filled in this snippet, so the RDD saved to Cassandra below is empty
        List<WordOccurence> topicList = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
            //Here we send event to kafka output topic
            publishToKafka(entry.getKey(), entry.getValue().longValue(), outputTopic, props);
        }
        JavaRDD<WordOccurence> WordOccurenceRDD = jssc.sparkContext().parallelize(topicList);
        CassandraJavaUtil.javaFunctions(WordOccurenceRDD)
                .writerBuilder(keyspace, table, CassandraJavaUtil.mapToRow(WordOccurence.class))
                .saveToCassandra();
        log.info("Words successfully added : {}, keyspace {}, table {}", words, keyspace, table);
    }
});
jssc.start();
jssc.awaitTermination();
The wordCounts variable is of type JavaPairDStream<String, Integer>. I just iterate over it using foreachRDD and write to Kafka with a dedicated function:
public static void publishToKafka(String word, Long count, String topic, Properties props) {
    //A new producer is created (and closed) on every call; closing it flushes pending records
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        ObjectMapper mapper = new ObjectMapper();
        String jsonInString = mapper.writeValueAsString(word + " " + count);
        String event = "{\"word_stats\":" + jsonInString + "}";
        log.info("Message to send to kafka : {}", event);
        //send() is asynchronous: the record is only handed over to the producer here
        producer.send(new ProducerRecord<>(topic, event));
        log.info("Event : {} sent to kafka", event);
    } catch (Exception e) {
        log.error("Problem while publishing the event to kafka : " + e.getMessage(), e);
    }
}
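As for writing only when a micro-batch actually produced something: the rough idea is to check for an empty result before sending anything, and to send every event of the batch through one shared sender instead of opening a producer per message. Below is a minimal plain-Java sketch of that logic. `WordStatsPublisher`, `toEvent` and `publishBatch` are hypothetical names, the `Consumer<String>` stands in for a single reused KafkaProducer, and the hand-written escaping only covers quotes and backslashes (the function above relies on Jackson for that).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class WordStatsPublisher {
    /** Builds the same payload shape as publishToKafka: {"word_stats":"<word> <count>"}. */
    public static String toEvent(String word, long count) {
        // Assumption: escaping quotes and backslashes is enough for this sketch
        String text = (word + " " + count).replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"word_stats\":\"" + text + "\"}";
    }

    /** Publishes one micro-batch through the given sender and returns the number of events sent. */
    public static int publishBatch(Map<String, Integer> wordCounts, Consumer<String> sender) {
        if (wordCounts.isEmpty()) {
            return 0; // the batch produced nothing, so nothing is written
        }
        int sent = 0;
        for (Map.Entry<String, Integer> entry : wordCounts.entrySet()) {
            sender.accept(toEvent(entry.getKey(), entry.getValue()));
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        Map<String, Integer> batch = new LinkedHashMap<>();
        System.out.println(publishBatch(batch, sent::add)); // empty batch: nothing sent
        batch.put("spark", 3);
        System.out.println(publishBatch(batch, sent::add));
        System.out.println(sent);
    }
}
```

In the Spark job itself the equivalent guard would be a call to `isEmpty()` on the RDD at the top of the foreachRDD callback, returning early when it is true.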
Hope that helps!