How to process a KStream in a batch of max size or fallback to a time window?

后端 未结 2 2016
长情又很酷
长情又很酷 2020-12-21 14:38

I would like to create a Kafka stream-based application that processes a topic and takes messages in batches of size X (i.e. 50) but if the stream has low flow, to give me w

相关标签:
2条回答
  • 2020-12-21 15:05

    The simplest way might be, to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.

    To enforce processing if you don't read the limit in a certain amount of time, you can register a wall-clock punctuation.

    0 讨论(0)
  • 2020-12-21 15:07

    @Matthias J. Sax answer is nice, I just want to add an example for this, I think it might be useful for someone. let's say we want to combine incoming values into the following type:

    public class MultipleValues { private List<String> values; }
    

    To collect messages into batches with max size, we need to create transformer:

    public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
        private ProcessorContext processorContext;
        private String stateStoreName;
        private KeyValueStore<String, MultipleValues> keyValueStore;
        private Cancellable scheduledPunctuator;
    
        public MultipleValuesTransformer(String stateStoreName) {
            this.stateStoreName = stateStoreName;
        }
    
        @Override
        public void init(ProcessorContext processorContext) {
            this.processorContext = processorContext;
            this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
            scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
        }
    
        @Override
        public KeyValue<String, MultipleValues> transform(String key, String value) {
            MultipleValues itemValueFromStore = keyValueStore.get(key);
            if (isNull(itemValueFromStore)) {
                itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
            } else {
                List<String> values = new ArrayList<>(itemValueFromStore.getValues());
                values.add(value);
                itemValueFromStore = itemValueFromStore.toBuilder()
                        .values(values)
                        .build();
            }
            if (itemValueFromStore.getValues().size() >= 50) {
                processorContext.forward(key, itemValueFromStore);
                keyValueStore.put(key, null);
            } else {
                keyValueStore.put(key, itemValueFromStore);
            }
            return null;
        }
    
        private void doPunctuate(long timestamp) {
            KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all();
            while (valuesIterator.hasNext()) {
                KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
                if (nonNull(keyValue.value)) {
                    processorContext.forward(keyValue.key, keyValue.value);
                    keyValueStore.put(keyValue.key, null);
                }
            }
        }
    
        @Override
        public void close() {
            scheduledPunctuator.cancel();
        }
    }
    

    and we need to create key-value store, add it to StreamsBuilder, and build KStream flow using transform method

    Properties props = new Properties();
    ...
    Serde<MultipleValues> multipleValuesSerge = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
    StreamsBuilder builder = new StreamsBuilder();
    String storeName = "multipleValuesStore";
    KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
    StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
            Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerge);
    builder.addStateStore(storeBuilder);
    
    builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
            .transform(() -> new MultipleValuesTransformer(storeName), storeName)
            .print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
    KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
    kafkaStreams.start();
    

    with such approach we used the incoming key for which we did aggregation. if you need to collect messages not by key, but by some message's fields, you need the following flow to trigger rebalancing on KStream (by using intermediate topic):

    .selectKey(..)
    .through(intermediateTopicName)
    .transform( ..)
    
    0 讨论(0)
提交回复
热议问题