apache-kafka-streams

Kafka Streams - reducing the memory footprint for large state stores

Posted by 放肆的年华 on 2019-12-06 12:28:46
Question: I have a topology (see below) that reads off a very large topic (over a billion messages per day). The memory usage of this Kafka Streams app is pretty high, and I was looking for some suggestions on how I might reduce the footprint of the state stores (more details below). Note: I am not trying to scapegoat the state stores; I just think there may be a way for me to improve my topology - see below. // stream receives 1 billion+ messages per day stream .flatMap((key, msg) -> rekeyMessages
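
One knob that is often suggested for this kind of footprint problem is bounding RocksDB's own memory through a RocksDBConfigSetter. The sketch below is only an illustration under assumed sizes (the class name and the 16 MB / 8 MB values are made up and need tuning for the real workload); it is registered via StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG.

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

// A minimal sketch: cap RocksDB's block cache and write buffers per store.
// Register with: props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
//                          BoundedMemoryRocksDBConfig.class);
public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCacheSize(16 * 1024 * 1024L);   // 16 MB block cache (assumed size)
        tableConfig.setBlockSize(16 * 1024L);                // 16 KB blocks
        options.setTableFormatConfig(tableConfig);

        options.setMaxWriteBufferNumber(2);                  // fewer in-memory memtables
        options.setWriteBufferSize(8 * 1024 * 1024L);        // 8 MB per memtable (assumed size)
    }
}
```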

Kafka Stream: Graceful shutdown

Posted by 馋奶兔 on 2019-12-06 10:42:55
Question: If we start the KafkaStreams app in the background (say on Linux), is there a way to signal the app from outside so that it can initiate a graceful shutdown? Answer 1: As described in the docs (https://kafka.apache.org/11/documentation/streams/tutorial), it is recommended to register a shutdown hook that calls KafkaStreams#close() for a clean shutdown: final CountDownLatch latch = new CountDownLatch(1); // attach shutdown handler to catch control-c Runtime.getRuntime().addShutdownHook(new Thread(
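
For reference, the complete pattern from that tutorial looks roughly like the sketch below (topic names and config values are placeholders). Because the hook runs on SIGINT/SIGTERM, an external `kill <pid>` on Linux is enough to trigger the graceful shutdown asked about above.

```java
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class GracefulShutdownApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "graceful-shutdown-demo"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // hypothetical

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");   // trivial pass-through topology

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        final CountDownLatch latch = new CountDownLatch(1);

        // SIGTERM/SIGINT (e.g. `kill <pid>` or Ctrl-C) runs this hook,
        // which closes the app cleanly and then releases the latch.
        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (final Throwable e) {
            System.exit(1);
        }
        System.exit(0);
    }
}
```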

How to discover and filter out duplicate records in Kafka Streams

Posted by 十年热恋 on 2019-12-06 10:35:47
Say you have a topic with a null key and the value is {id:1, name:Chris, age:99}. Let's say you want to count the number of people by name. You would do something like the following: nameStream.groupBy((key,value) -> value.getName()) .count(); Now let's say duplicate records are possible, and you can tell a record is a duplicate based on the id. For example: {id:1, name:Chris, age:99} {id:1, name:Chris, age:xx} should result in a count of one, and {id:1, name:Chris, age:99} {id:2, name:Chris, age:xx} should result in a count of 2. How would you accomplish this? I thought reduce would work, but
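
One way to approach this (a sketch, not the only option): first collapse duplicates into a KTable keyed by id, then group that table by name and count. This assumes the value type exposes getId() and getName() as in the example records; note that the selectKey() causes a repartition.

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class DedupCount {
    // Assumes a User POJO with String getId() and String getName().
    static KTable<String, Long> countUniqueUsersByName(final KStream<String, User> nameStream) {
        return nameStream
                .selectKey((key, value) -> value.getId())     // re-key by id (forces repartition)
                .groupByKey()
                .reduce((oldValue, newValue) -> newValue)     // KTable: one latest record per id
                .groupBy((id, user) -> KeyValue.pair(user.getName(), user))
                .count();                                     // number of distinct ids per name
    }
}
```

Because the count is built from a KTable, a later record that changes a user's name is handled correctly: the old name's count is decremented and the new name's count incremented.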

using kafka-streams to conditionally sort a json input stream

Posted by 会有一股神秘感。 on 2019-12-06 10:26:05
I am new to developing Kafka Streams applications. My stream processor is meant to sort (i.e., route) JSON messages to output topics based on the value of the UserID key in the input JSON message. Message 1: {"UserID": "1", "Score":"123", "meta":"qwert"} Message 2: {"UserID": "5", "Score":"780", "meta":"mnbvs"} Message 3: {"UserID": "2", "Score":"0", "meta":"fghjk"} I have read here (Dynamically connecting a Kafka input stream to multiple output streams) that there is no dynamic solution. In my use case I know the user keys and the output topics to which I need to route the input stream, so I am writing separate processor applications
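
Instead of separate applications, a single topology can split the stream with KStream#branch(), one predicate per known UserID. The sketch below uses hypothetical topic and config names and assumes the UserID has already been promoted to the record key (e.g. via selectKey() after parsing the JSON).

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RouteByUserId {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "score-router");        // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // hypothetical
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("scores-input");        // hypothetical topic

        @SuppressWarnings("unchecked")
        KStream<String, String>[] branches = input.branch(
                (key, value) -> "1".equals(key),
                (key, value) -> "5".equals(key),
                (key, value) -> true);              // everything else

        branches[0].to("user-1-scores");
        branches[1].to("user-5-scores");
        branches[2].to("other-scores");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

From Kafka 2.0 onward, a single to() call with a TopicNameExtractor lambda is another way to choose the output topic per record without enumerating branches.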

Monitoring number of consumer for the Kafka topic

Posted by 淺唱寂寞╮ on 2019-12-06 09:53:25
Question: We are using Prometheus and Grafana for monitoring our Kafka cluster. Our application uses Kafka Streams, and there is a chance the stream gets stopped due to an exception. We log that event via setUncaughtExceptionHandler, but we also need some kind of alerting when the stream stops. What we currently have is jmx_exporter running as an agent that exposes Kafka metrics through an endpoint, and Prometheus fetches the metrics from that endpoint. We don't see any kind of metrics
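
Besides the JMX metrics, the client itself can report when it stops: KafkaStreams#setStateListener is invoked on every state transition, so an alert can be fired when the instance reaches ERROR or NOT_RUNNING. A minimal sketch follows; the actual alerting call is left as a placeholder, since the mechanism (custom gauge, webhook, pager) is an assumption here.

```java
import org.apache.kafka.streams.KafkaStreams;

public final class StreamsStateAlerting {

    // Call before streams.start(); the listener fires on every state transition.
    public static void register(final KafkaStreams streams) {
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.ERROR
                    || newState == KafkaStreams.State.NOT_RUNNING) {
                // Replace with a real alert: e.g. set a custom gauge that
                // jmx_exporter/Prometheus scrapes, or call a webhook.
                System.err.println("Kafka Streams went " + oldState + " -> " + newState);
            }
        });
    }
}
```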

Why does kafka streams reprocess the messages produced after broker restart

Posted by 末鹿安然 on 2019-12-06 09:51:11
I have a single-node Kafka broker and a simple streams application. I created 2 topics (topic1 and topic2). The flow is: produce to topic1 -> process the message -> write to topic2. Note: for each message produced, only one message is written to the destination topic. I produced a single message. After it was written to topic2, I stopped the Kafka broker. After some time I restarted the broker and produced another message on topic1. Now the streams app processed that message 3 times. Then, without stopping the broker, I produced messages to topic1 and waited for the streams app to write to topic2 before producing again. Streams
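
If the duplicates come from offsets that were not committed before the broker went away, the usual levers are the processing guarantee and the commit interval. The sketch below shows those settings with placeholder application id and bootstrap servers; whether this explains the triple processing depends on the actual failure sequence, so treat it as a mitigation to try rather than a confirmed diagnosis.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ReprocessingMitigation {
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic1-to-topic2");    // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // hypothetical broker
        // Default is at-least-once: offsets are committed every commit.interval.ms,
        // so an unclean interruption can lead to records being reprocessed.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);              // commit more often
        return props;
    }
}
```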

Deserialise a POJO in Kafka Streams

Posted by 偶尔善良 on 2019-12-06 09:02:22
Question: My Kafka topic has messages of this format: user1,subject1,80|user1,subject2,90 user2,subject1,70|user2,subject2,100 and so on. I have created a User POJO as below. class User implements Serializable{ /** * */ private static final long serialVersionUID = -253687203767610477L; private String userId; private String subject; private String marks; public User(String userId, String subject, String marks) { super(); this.userId = userId; this.subject = subject; this.marks = marks; } public String
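
One way to avoid writing a custom Deserializer at all is to consume the values as plain Strings and parse them inside the topology. A minimal sketch, assuming the String serde is configured as the default value serde and reusing the User constructor from the question:

```java
import java.util.Arrays;
import java.util.stream.Collectors;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class UserParsing {
    // Splits "user1,subject1,80|user1,subject2,90" into two User records.
    static KStream<String, User> buildUserStream(final StreamsBuilder builder, final String topic) {
        KStream<String, String> raw = builder.stream(topic);
        return raw.flatMapValues(value ->
                Arrays.stream(value.split("\\|"))            // one record per '|' segment
                      .map(part -> part.split(","))
                      .map(f -> new User(f[0], f[1], f[2]))  // userId, subject, marks
                      .collect(Collectors.toList()));
    }
}
```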

How can I get the offset value in KStream

Posted by 流过昼夜 on 2019-12-06 08:57:12
Question: I'm developing a PoC with Kafka Streams. Now I need to get the offset value in the stream consumer and use it to generate a unique key (topic-offset)->hash for each message. The reason is: the producers are syslog and only a few of them have IDs. I cannot generate a UUID in the consumer because, in case of reprocessing, I need to regenerate the same key. My problem is: the org.apache.kafka.streams.processor.ProcessorContext class exposes an .offset() method that returns the value, but I'm using
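
If the DSL is being used, KStream#transform() provides access to the ProcessorContext, and its topic(), partition() and offset() methods are enough to build a key that stays stable across reprocessing. A minimal sketch; the input key/value types and the topic names in the usage comment are assumptions.

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Re-keys each record with "<topic>-<partition>-<offset>", which is deterministic
// for a given record even if the topic is reprocessed from the beginning.
public class OffsetKeyTransformer implements Transformer<byte[], String, KeyValue<String, String>> {

    private ProcessorContext context;

    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(final byte[] key, final String value) {
        final String newKey = context.topic() + "-" + context.partition() + "-" + context.offset();
        return KeyValue.pair(newKey, value);
    }

    @Override
    public void close() { }
}

// usage (hypothetical topics): stream.transform(OffsetKeyTransformer::new).to("keyed-output");
```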

Gui viewer for RocksDb sst files

Posted by ε祈祈猫儿з on 2019-12-06 07:03:20
I'm working with Kafka, which saves its data into RocksDB. Now I want to have a look at the db keys and values that Kafka created. I downloaded FastoNoSQL and tried it, but failed. The folder contains: .sst files, .log files, a CURRENT file, an IDENTITY file, a LOCK file, LOG files, MANIFEST files, and OPTIONS files. How can I view the values? Keylord (since version 5.0) can open RocksDB databases; for example, here is the RocksDB store of a Kafka Streams WordCount application (screenshot omitted). For RocksDB db files you can also use FastoNoSQL. It looks like the reason you have data stored in RocksDB files is because you are using Apache Kafka's Streams API.
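
If the goal is just to inspect what a Kafka Streams application has put into its state store, interactive queries can read the store through the running application instead of opening the RocksDB files with an external viewer. A sketch, assuming a key-value store named "counts-store" (hypothetical) holding String keys and Long counts, queried while the application is in the RUNNING state:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreDump {
    // Prints every key/value pair in the (hypothetical) "counts-store".
    static void dump(final KafkaStreams streams) {
        final ReadOnlyKeyValueStore<String, Long> store =
                streams.store("counts-store", QueryableStoreTypes.<String, Long>keyValueStore());
        try (KeyValueIterator<String, Long> it = store.all()) {
            while (it.hasNext()) {
                final KeyValue<String, Long> entry = it.next();
                System.out.println(entry.key + " -> " + entry.value);
            }
        }
    }
}
```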

Why does co-partitioning of two KStreams in Kafka require the same number of partitions for both streams?

Posted by *爱你&永不变心* on 2019-12-06 03:33:56
I wanted to know why co-partitioning of two KStreams in Kafka requires the same number of partitions for both streams, as stated in the documentation. As the name "co-partition" indicates, you want to put data from different topics but with the same key onto the same Kafka Streams application instance. If you don't have the same number of partitions, it's not possible to get this behavior. Assume you have topic A with 2 partitions and topic B with 3 partitions. Then it can happen that one record with key X is hashed to partitions A-0 and B-1 (i.e., not the same
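
A simplified sketch of the default partitioner's logic makes the point concrete: the partition is essentially hash(key) mod the number of partitions, so with 2 vs. 3 partitions the same key usually maps to different partition numbers and the two records end up in different tasks. The use of Utils.murmur2/toPositive below mirrors the producer's default behavior for keyed records but is not a literal copy of its internal code.

```java
import org.apache.kafka.common.utils.Utils;

public class CoPartitionDemo {
    public static void main(String[] args) {
        byte[] key = "X".getBytes();

        // Simplified default-partitioner logic: hash(key) mod #partitions.
        int partitionInA = Utils.toPositive(Utils.murmur2(key)) % 2; // topic A: 2 partitions
        int partitionInB = Utils.toPositive(Utils.murmur2(key)) % 3; // topic B: 3 partitions

        // With different partition counts the results generally differ, so records
        // with the same key cannot be read and joined by the same stream task.
        System.out.println("key X -> A-" + partitionInA + ", B-" + partitionInB);
    }
}
```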