Kafka streams: Read from ALL partitions in every instance of an application

Submitted by 荒凉一梦 on 2019-12-01 10:15:58

Question


When using a KTable, Kafka Streams doesn't let an instance read from multiple partitions of a topic when the number of instances/consumers equals the number of partitions. I tried to achieve this with a GlobalKTable, but the problem is that data gets overwritten, and aggregations cannot be applied to it.

Let's suppose I have a topic named "data_in" with 3 partitions (P1, P2, P3). When I run 3 instances (I1, I2, I3) of a Kafka Streams application, I want each instance to read data from all partitions of "data_in": I1 reads from P1, P2 and P3, I2 reads from P1, P2 and P3, and I3 likewise.

EDIT: Keep in mind that the producer can publish two records with the same ID into two different partitions of "data_in". So when running two different instances, the GlobalKTable will be overwritten.

How can I achieve this? Here is a portion of my code:

private KTable<String, theDataList> globalStream() {

    // KStream of records from data-in topic using String and theDataSerde deserializers
    KStream<String, Data> trashStream = getBuilder().stream("data_in",Consumed.with(Serdes.String(), SerDes.theDataSerde));

    // Apply an aggregation operation on the original KStream records using an intermediate representation of a KStream (KGroupedStream)
    KGroupedStream<String, Data> KGS = trashStream.groupByKey();

    Materialized<String, theDataList, KeyValueStore<Bytes, byte[]>> materialized = Materialized.as("agg-stream-store");
    materialized = materialized.withValueSerde(SerDes.theDataDataListSerde);

    // Return a KTable
    return KGS.aggregate(() -> new theDataList(), (key, value, aggregate) -> {
        if (!value.getValideData())
            aggregate.getList().removeIf((t) -> t.getTimestamp() <= value.getTimestamp());
        else
            aggregate.getList().add(value);
        return aggregate;
    }, materialized);
}
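As an aside, the behavior of the aggregate lambda above can be illustrated in plain Java, independent of Kafka. This is a minimal sketch: the `Event` record and the `apply` helper are hypothetical stand-ins for the real `Data` class and aggregator, assuming an "invalid" record is meant to evict all entries with an equal or earlier timestamp.

```java
import java.util.ArrayList;
import java.util.List;

public class AggregatorSketch {
    // Hypothetical stand-in for the Data class: just a timestamp and a validity flag.
    record Event(long timestamp, boolean valid) {}

    // Mirrors the aggregate lambda: an invalid event evicts all entries with an
    // equal or earlier timestamp; a valid event is simply appended.
    static List<Event> apply(List<Event> aggregate, Event value) {
        if (!value.valid())
            aggregate.removeIf(t -> t.timestamp() <= value.timestamp());
        else
            aggregate.add(value);
        return aggregate;
    }

    public static void main(String[] args) {
        List<Event> agg = new ArrayList<>();
        apply(agg, new Event(1, true));
        apply(agg, new Event(2, true));
        apply(agg, new Event(3, true));
        apply(agg, new Event(2, false)); // evicts the events at timestamps 1 and 2
        System.out.println(agg.size());             // 1
        System.out.println(agg.get(0).timestamp()); // 3
    }
}
```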

Answer 1:


Either change the number of partitions of your input topic "data_in" to 1, or use a GlobalKTable to get the data from all partitions of the topic and then join your stream with it. With that, your application instances no longer have to be in different consumer groups.

The code will look like this:

private GlobalKTable<String, theDataList> globalStream() {

   // KStream of records from data-in topic using String and theDataSerde deserializers
  KStream<String, Data> trashStream = getBuilder().stream("data_in", Consumed.with(Serdes.String(), SerDes.theDataSerde));

  trashStream.to("new_data_in"); // by sending to another topic you're forcing a repartition through that topic

  KStream<String, Data> newTrashStream = getBuilder().stream("new_data_in", Consumed.with(Serdes.String(), SerDes.theDataSerde));

  // Apply an aggregation operation on the original KStream records using an intermediate representation of a KStream (KGroupedStream)
  KGroupedStream<String, Data> KGS = newTrashStream.groupByKey();

  Materialized<String, theDataList, KeyValueStore<Bytes, byte[]>> materialized = Materialized.as("agg-stream-store");
  materialized = materialized.withValueSerde(SerDes.theDataDataListSerde);

  // Aggregate and publish the result to the "agg_data_in" topic
  KGS.aggregate(() -> new theDataList(), (key, value, aggregate) -> {
      if (!value.getValideData())
          aggregate.getList().removeIf((t) -> t.getTimestamp() <= value.getTimestamp());
      else
        aggregate.getList().add(value);
      return aggregate;
  }, materialized)
  .to("agg_data_in");

  return getBuilder().globalTable("agg_data_in",
          Consumed.with(Serdes.String(), SerDes.theDataDataListSerde));
}

EDIT: I edited the code above to force a repartition through a topic called "new_data_in".
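The answer mentions joining your stream with the GlobalKTable but doesn't show it. A rough sketch of that wiring, assuming the answer's context, might look like the fragment below; `Enriched` and the joiner logic are hypothetical placeholders, not part of the original answer. Only `KStream.join(GlobalKTable, KeyValueMapper, ValueJoiner)` is the actual Kafka Streams API.

```java
// Sketch only: joining a live KStream against the GlobalKTable built above.
// "Enriched" and "new Enriched(...)" are hypothetical; substitute your own type.
KStream<String, Data> liveStream =
        getBuilder().stream("data_in", Consumed.with(Serdes.String(), SerDes.theDataSerde));

GlobalKTable<String, theDataList> aggTable = globalStream();

KStream<String, Enriched> joined = liveStream.join(
        aggTable,
        (key, value) -> key,                                   // map each stream record to the table's key
        (value, aggregate) -> new Enriched(value, aggregate)); // combine record with its aggregate
```

Because the table is a GlobalKTable, every instance holds a full copy of the aggregated state, so the join works regardless of how "data_in" is partitioned.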



Source: https://stackoverflow.com/questions/53719700/kafka-streams-read-from-all-partitions-in-every-instance-of-an-application
