apache-kafka-streams

Kafka streams filtering: broker or consumer side?

故事扮演 submitted on 2019-12-01 11:40:01
Question: I am looking into Kafka Streams. I want to filter my stream using a filter with very low selectivity (one match in a few thousand records). I was looking at this method: https://kafka.apache.org/0100/javadoc/org/apache/kafka/streams/kstream/KStream.html#filter(org.apache.kafka.streams.kstream.Predicate) But I can't find any evidence of whether the filter is evaluated by the consumer (I really do not want to transfer many GB to the consumer just to throw them away) or inside the broker (yay!). If it's evaluated…
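For reference, the Predicate passed to filter() runs inside the Streams application itself (i.e. on the consumer side); brokers do not execute user code, so every record still crosses the network before being dropped. A minimal sketch against a newer Streams API, with hypothetical topic names and marker string:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LowSelectivityFilter {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // The predicate is evaluated in this JVM, not on the broker, so all
        // records are fetched over the network before being discarded.
        input.filter((key, value) -> value != null && value.contains("rare-marker"))
             .to("filtered-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "low-selectivity-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}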

Kafka Streams 2.1.1 class cast while flushing timed aggregation to store

寵の児 submitted on 2019-12-01 11:34:48
Question: I'm trying to use Kafka Streams to perform a windowed aggregation and emit the result only after a certain session window has closed. To achieve this I'm using the suppress function. The problem is that I can't find a way to make this simple test work, because when it tries to persist the state I get a class cast exception: it tries to cast Windowed to String. I have tried to provide the aggregate function with a Materialized<Windowed<String>, Long, StateStore<>>, but it doesn't type-check…
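The usual shape of the fix for this kind of ClassCastException is to give the windowed aggregation explicit serdes instead of letting the flush fall back to the configured default (String) serde. A hedged sketch against Kafka Streams 2.1.x, where the topic names, the 5-minute inactivity gap, and the use of count() instead of the asker's aggregate() are all assumptions:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.state.SessionStore;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(SessionWindows.with(Duration.ofMinutes(5)))
    // Explicit key/value serdes here, so the session-windowed state is not
    // pushed through the default String serde when it is persisted.
    .count(Materialized.<String, Long, SessionStore<Bytes, byte[]>>as("session-counts")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.Long()))
    // Emit a single result per session only once the window has closed.
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
    .to("closed-session-counts", Produced.with(Serdes.String(), Serdes.Long()));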

Set timestamp in output with Kafka Streams

假装没事ソ submitted on 2019-12-01 11:12:04
I'm getting CSVs in a Kafka topic "raw-data"; the goal is to transform them by sending each line to another topic "data" with the right timestamp (different for each line). Currently I have two streamers: one splits the lines of "raw-data" and sends them to an "internal" topic (no timestamp); the other uses a TimestampExtractor that consumes "internal" and sends the lines to "data". I'd like to remove this "internal" topic by setting the timestamp directly, but I couldn't find a way (timestamp extractors are only used at consumption time). I've stumbled upon this line in the documentation:
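One route that avoids the extra topic, assuming a Streams version recent enough to have the To parameter on ProcessorContext#forward and assuming a CSV layout where the first column is an epoch-millis timestamp, is a Transformer that sets the output timestamp per forwarded line:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

// Splits each CSV payload from "raw-data" into lines and forwards every line
// downstream with its own record timestamp, so no intermediate topic or
// TimestampExtractor is needed.
class LineTimestampTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(String key, String csvPayload) {
        for (String line : csvPayload.split("\n")) {
            long ts = Long.parseLong(line.split(",")[0]);   // assumed: timestamp in the first column
            context.forward(key, line, To.all().withTimestamp(ts));
        }
        return null;   // everything has already been forwarded explicitly
    }

    @Override
    public void close() { }
}

Wired in with builder.<String, String>stream("raw-data").transform(LineTimestampTransformer::new).to("data"), the "internal" topic and its extractor can be dropped.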

Kafka KStream - using AbstractProcessor with a Window

别说谁变了你拦得住时间么 submitted on 2019-12-01 11:00:08
Question: I'm hoping to group together windowed batches of output from a KStream and write them to a secondary store. I was expecting to see .punctuate() get called roughly every 30 seconds. What I got instead is saved here (the original file was several thousand lines long). Summary: .punctuate() is being called seemingly at random and then repeatedly; it doesn't appear to adhere to the value set via ProcessorContext.schedule(). Edit: another run of the same code produced calls to .punctuate()…
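For context, in the 0.10.x Processor API punctuate() was driven by stream time (record timestamps), not wall-clock time, which can make a "30 second" schedule fire irregularly or in bursts when event time jumps. Newer releases let the schedule be pinned to wall-clock time explicitly. A hedged sketch (class name and the secondary-store write are placeholders, and the Duration overload assumes a recent Streams version):

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

class BatchingProcessor extends AbstractProcessor<String, String> {
    private final List<String> batch = new ArrayList<>();

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // Fires every 30 seconds of wall-clock time, regardless of how event time advances.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            // write the buffered batch to the secondary store here (store client omitted)
            batch.clear();
        });
    }

    @Override
    public void process(String key, String value) {
        batch.add(value);
    }
}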

Kafka Streams: How to use persistentKeyValueStore to reload existing messages from disk?

风流意气都作罢 submitted on 2019-12-01 10:52:44
Question: My code currently uses an InMemoryKeyValueStore, which avoids any persistence to disk or to Kafka. I want to use RocksDB (Stores.persistentKeyValueStore) so that the app will reload its state from disk. I'm trying to implement this, and I'm very new to Kafka and the Streams API, so I would appreciate help on how to make the changes while I keep trying to understand things as I go. I tried to create the state store here: StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> store = Stores.…
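A hedged sketch of the switch, assuming a store name of "store-items", StoreItem being the asker's own class, and a custom Serde existing for the list value type (a RocksDB-backed store has to serialize values to disk and to the changelog topic, unlike an in-memory object store):

import java.util.LinkedList;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// Placeholder: a Serde for LinkedList<StoreItem> must be supplied by the application.
Serde<LinkedList<StoreItem>> storeItemListSerde = /* custom serde for the list type */ null;

StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> store =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("store-items"),   // RocksDB-backed, reloaded from disk on restart
        Serdes.String(),
        storeItemListSerde);

StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(store);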

Kafka streams: Read from ALL partitions in every instance of an application

荒凉一梦 submitted on 2019-12-01 10:15:58
Question: When using a KTable, Kafka Streams doesn't allow instances to read from multiple partitions of a particular topic when the number of instances/consumers equals the number of partitions. I tried achieving this with a GlobalKTable; the problem is that the data gets overwritten, and aggregation cannot be applied to it. Suppose I have a topic named "data_in" with 3 partitions (P1, P2, P3). When I run 3 instances (I1, I2, I3) of a Kafka Streams application, I want each instance…
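To illustrate the constraint being described: a GlobalKTable does give every instance a full local copy of all partitions of "data_in", but it is lookup-only state, so later records for a key overwrite earlier ones and no grouping/aggregation can be applied to it. Serdes below are assumed to be String:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Each of I1, I2, I3 materializes all of P1-P3 locally, but only as a
// key/value lookup table; aggregation over it is not supported.
GlobalKTable<String, String> allPartitions = builder.globalTable(
    "data_in",
    Consumed.with(Serdes.String(), Serdes.String()),
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("data-in-global-store"));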

How many times kafka stream invokes poll() for fetching records from kafka topic

本小妞迷上赌 submitted on 2019-12-01 08:11:33
Question: I am trying to understand the Kafka Streams processor a bit more. I want to know how frequently a Kafka Streams processor polls to fetch data from Kafka. As I understand it, Kafka Streams internally creates a Kafka Consumer client that fetches the data from Kafka (and invokes poll()). So after poll() is called for the first time, when is it called again to fetch data from Kafka? Does it happen many times per second? How can I know how many times poll(…
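For reference, the stream thread calls Consumer#poll() continuously in its run loop; the poll.ms setting (100 ms by default) is only the maximum time a single poll() call may block waiting for records, not a pause between polls, so under load polls can happen many times per second. A small config sketch (application id and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "poll-frequency-demo");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// poll.ms bounds how long each poll() may block; it does not add a delay between polls.
props.put(StreamsConfig.POLL_MS_CONFIG, 100L);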

Tombstone messages not removing record from KTable state store?

ぃ、小莉子 submitted on 2019-12-01 08:02:33
I am creating a KTable by processing data from a KStream, but when I send a tombstone message (a key with a null payload) the record is not removed from the KTable. Sample:

public KStream<String, GenericRecord> processRecord(@Input(Channel.TEST) KStream<GenericRecord, GenericRecord> testStream) {
    KTable<String, GenericRecord> table = testStream
        .map((genericRecord, genericRecord2) -> KeyValue.pair(genericRecord.get("field1") + "", genericRecord2))
        .groupByKey()
        .reduce((genericRecord, v1) -> v1, Materialized.as("test-store"));
    GenericRecord genericRecord = new GenericData.Record(getAvroSchema(keySchema…
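One likely reason, and a hedged workaround: records with a null value are skipped by groupByKey().reduce(), so the tombstone never reaches "test-store". Mapping the null to a surrogate first and returning null from aggregate() does produce a delete for that key in the resulting KTable. The sketch below uses String values for brevity (the question uses GenericRecord), and the topic name and TOMBSTONE marker are illustrative:

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

final String TOMBSTONE = "\u0000TOMBSTONE\u0000";   // surrogate so the record survives the aggregation's null check

StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> table = builder.<String, String>stream("test")
    .map((key, value) -> KeyValue.pair(key, value == null ? TOMBSTONE : value))
    .groupByKey()
    .aggregate(
        () -> null,                                            // no prior value for a new key
        (key, value, agg) -> TOMBSTONE.equals(value) ? null    // a null aggregate deletes the row
                                                     : value,
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("test-store"));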

KafkaStreams serde exception

末鹿安然 submitted on 2019-12-01 07:04:52
I am playing with Kafka and the Streams technology; I have created a custom serializer and deserializer for the KStream that I will use to receive messages from a given topic. Now, the problem is that I am creating a serde this way: JsonSerializer<EventMessage> serializer = new JsonSerializer<>(); JsonDeserializer<EventMessage> deserializer = new JsonDeserializer<>(EventMessage.class); Serde<EventMessage> messageSerde = Serdes.serdeFrom(serializer, deserializer); Serializer implementation: public class JsonSerializer<T> implements Serializer<T> { private Gson gson = new Gson(); public void…
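A hedged completion of the Gson-based pair the excerpt starts (the method bodies are a plausible reconstruction, not the asker's actual code; EventMessage is the asker's own type):

import com.google.gson.Gson;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

public class JsonSerializer<T> implements Serializer<T> {
    private final Gson gson = new Gson();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? null : gson.toJson(data).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() { }
}

class JsonDeserializer<T> implements Deserializer<T> {
    private final Gson gson = new Gson();
    private final Class<T> type;

    JsonDeserializer(Class<T> type) {
        this.type = type;
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public T deserialize(String topic, byte[] bytes) {
        return bytes == null ? null : gson.fromJson(new String(bytes, StandardCharsets.UTF_8), type);
    }

    @Override
    public void close() { }
}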

Kafka Streams error - Offset commit failed on partition, request timed out

拈花ヽ惹草 submitted on 2019-12-01 06:38:47
We use Kafka Streams for consuming, processing, and producing messages, and in our PROD environment we have run into errors on multiple topics: ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=app-xxx-StreamThread-3-consumer, groupId=app] Offset commit failed on partition xxx-1 at offset 13920: The request timed out.[] These errors occur rarely for topics with a small load, but for topics with a high load (and spikes) they occur dozens of times a day per topic. The topics have multiple partitions (e.g. 10). The issue does not seem to affect the processing of data (despite…
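The timeout values below are purely illustrative, not taken from the thread; they only show where the relevant knobs live. Kafka Streams lets the embedded consumer's settings be overridden with a prefix, so commit requests can be given more headroom and issued less often under load spikes:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Give offset-commit (and other broker) requests more time before they fail as timed out.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), 60_000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG), 60_000);
// Committing less frequently reduces the number of commit requests issued per topic.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000);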