Question
When using a KTable, Kafka Streams doesn't allow an instance to read from multiple partitions of a given topic when the number of instances/consumers equals the number of partitions. I tried to achieve this using a GlobalKTable, but the problem is that its data gets overwritten, and no aggregation can be applied to it.
Let's suppose I have a topic named "data_in" with 3 partitions (P1, P2, P3). When I run 3 instances (I1, I2, I3) of a Kafka Streams application, I want each instance to read data from all partitions of "data_in": I1 reads from P1, P2 and P3, I2 reads from P1, P2 and P3, and so on.
EDIT: Keep in mind that the producer can publish two records with the same ID into two different partitions of "data_in". So when running two different instances, the GlobalKTable gets overwritten.
How can I achieve this? Here is a portion of my code:
private KTable<String, theDataList> globalStream() {
    // KStream of records from the "data_in" topic using String and theDataSerde deserializers
    KStream<String, Data> trashStream = getBuilder().stream("data_in", Consumed.with(Serdes.String(), SerDes.theDataSerde));
    // Apply an aggregation operation on the original KStream records using an intermediate representation of a KStream (KGroupedStream)
    KGroupedStream<String, Data> KGS = trashStream.groupByKey();
    Materialized<String, theDataList, KeyValueStore<Bytes, byte[]>> materialized = Materialized.as("agg-stream-store");
    materialized = materialized.withValueSerde(SerDes.theDataDataListSerde);
    // Return a KTable
    return KGS.aggregate(() -> new theDataList(), (key, value, aggregate) -> {
        if (!value.getValideData())
            aggregate.getList().removeIf((t) -> t.getTimestamp() <= value.getTimestamp());
        else
            aggregate.getList().add(value);
        return aggregate;
    }, materialized);
}
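For reference, all three instances run with the same application.id, which is why they form one consumer group and the partitions get split among them. A minimal config sketch (the application id and broker address below are placeholders, not from my actual setup):

private KafkaStreams startStreams() {
    Properties props = new Properties();
    // All instances share this application.id, so they join one consumer group
    // and the 3 partitions of "data_in" are divided among the 3 instances.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // placeholder id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
    KafkaStreams streams = new KafkaStreams(getBuilder().build(), props);
    streams.start();
    return streams;
}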
Answer 1:
Either change the number of partitions of your input topic "data_in" to 1, or use a GlobalKTable to get the data from all partitions of the topic and then join your stream with it (see the join sketch at the end of this answer). With that, your application instances no longer have to be in different consumer groups.
The code will look like this:
private GlobalKTable<String, theDataList> globalStream() {
    // KStream of records from the "data_in" topic using String and theDataSerde deserializers
    KStream<String, Data> trashStream = getBuilder().stream("data_in", Consumed.with(Serdes.String(), SerDes.theDataSerde));
    // By sending to another topic you're forcing a repartition on that topic
    trashStream.to("new_data_in");
    KStream<String, Data> newTrashStream = getBuilder().stream("new_data_in", Consumed.with(Serdes.String(), SerDes.theDataSerde));
    // Apply an aggregation operation on the repartitioned KStream using an intermediate representation of a KStream (KGroupedStream)
    KGroupedStream<String, Data> KGS = newTrashStream.groupByKey();
    Materialized<String, theDataList, KeyValueStore<Bytes, byte[]>> materialized = Materialized.as("agg-stream-store");
    materialized = materialized.withValueSerde(SerDes.theDataDataListSerde);
    // Write the aggregation result to an output topic
    KGS.aggregate(() -> new theDataList(), (key, value, aggregate) -> {
        if (!value.getValideData())
            aggregate.getList().removeIf((t) -> t.getTimestamp() <= value.getTimestamp());
        else
            aggregate.getList().add(value);
        return aggregate;
    }, materialized)
    // KTable no longer has a to() method in recent Kafka Streams versions; convert to a KStream first
    .toStream()
    .to("agg_data_in", Produced.with(Serdes.String(), SerDes.theDataDataListSerde));
    // Read the aggregated topic back as a GlobalKTable: every instance materializes all of its partitions
    return getBuilder().globalTable("agg_data_in", Consumed.with(Serdes.String(), SerDes.theDataDataListSerde));
}
EDIT: I edited the code above to force a repartition through a topic called "new_data_in".
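Note (not in the original answer): depending on your Kafka Streams version, the to("new_data_in") / stream("new_data_in") pair can be collapsed into a single call:

// Equivalent to the to(...) + stream(...) pair above; through(...) is deprecated
// since Kafka Streams 2.6 in favor of repartition(...), which manages the
// repartition topic for you.
KStream<String, Data> newTrashStream = trashStream.through("new_data_in", Produced.with(Serdes.String(), SerDes.theDataSerde));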
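To illustrate the join mentioned at the start of this answer, here is a minimal sketch (the key selector and joiner below are illustrative, not from the original answer):

// Enrich each incoming record with the aggregated list for the same ID,
// looked up from the GlobalKTable that every instance fully replicates.
KStream<String, theDataList> enriched = trashStream.join(
    globalStream(),                   // the GlobalKTable built above
    (key, value) -> key,              // look up by the record's own key
    (value, aggregate) -> aggregate); // illustrative joiner: keep the aggregated list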
Source: https://stackoverflow.com/questions/53719700/kafka-streams-read-from-all-partitions-in-every-instance-of-an-application