Question
This is a follow-up to: Kafka Streams - How to scale Kafka store generated changelog topics
Let's hypothetically assume the stream consumer needs to do some transformation before storing the data (indexing by v->k instead of k->v).
In the end, the goal is for each consumer to store the full set of transformed records (v->k) in RocksDB. I understand that an upstream processor could produce v->k from k->v, and the final consumer could then simply materialize the new topic as a GlobalKTable. But what happens if the whole pipeline is done in the end consumer?
KTable<Key, Value> table = builder.table(topic);
table.groupBy((k, v) -> KeyValue.pair(v, k))
     .reduce((newValue, aggValue) -> newValue,  // adder
             (newValue, aggValue) -> null,      // subtractor
             Materialized.as(STORE_NAME));
Which of these options is the best practice or the most optimal for this scenario (please correct me if my assumptions are off)?
1. If all consumers have different applicationIds, they will each consume all the k->v events and each generate its own intermediate changelog topic containing the full content (which is not optimal storage-wise).
2. If all consumers have the same applicationId but are in different consumer groups, each independently loading all the k->v events, they will all write the same computed v->k events to a shared changelog topic (named after the applicationId). This does not look optimal either, as the same data would be computed and produced multiple times.
3. If all consumers have the same applicationId and are in the same group, each consuming only a slice of the k->v events (according to partition assignment), they will each contribute their part of the computed v->k events to the shared changelog topic. But I am unclear whether each materialized RocksDB will hold the full set of data or only the slice that flowed through its own consumer pipeline?
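To build intuition for option 3, here is a plain-Java sketch (not Kafka Streams code; the partition count, sample data, and use of `hashCode` instead of Kafka's murmur2 partitioner are illustrative assumptions) of how records re-keyed to v->k are routed by the new key, so each instance's local store only ever sees its own slice:

```java
import java.util.*;

public class PartitionShardingDemo {

    // Kafka partitions by a hash of the serialized key (murmur2);
    // plain hashCode() is used here only to illustrate the routing.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // assumed partition count
        Map<String, String> source = Map.of(
                "k1", "v1", "k2", "v2", "k3", "v3", "k4", "v4");

        // One local store per Streams instance (instance i owns partition i).
        List<Map<String, String>> localStores = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            localStores.add(new HashMap<>());
        }

        // Invert k->v to v->k, then route by the NEW key, as groupBy would.
        source.forEach((k, v) ->
                localStores.get(partitionFor(v, numPartitions)).put(v, k));

        // Each instance holds only a slice; only the union of all local
        // stores is the full inverted data set.
        int total = localStores.stream().mapToInt(Map::size).sum();
        System.out.println("total records across all instances = " + total);
        localStores.forEach(System.out::println);
    }
}
```

No single `localStores` entry is guaranteed to contain all four inverted records; only their union does, which is exactly the sharding question raised in option 3.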
Answer 1:
For Kafka Streams, applicationId == groupId. Thus, (2) is not possible.
For (3), that state is sharded/partitioned and each instance has only part of the state.
If you want to get a full copy of the state, you need to use GlobalKTables instead of KTables.
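A minimal sketch of what the answer suggests, with the topic names (`source-topic`, `inverted-topic`) and String serdes being placeholder assumptions: an upstream step re-keys k->v into v->k and writes it out, and each instance then materializes that topic as a GlobalKTable, so every instance's RocksDB bootstraps the full inverted data set from all partitions:

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Materialized;

public class InvertedGlobalStore {
    static final String STORE_NAME = "inverted-store"; // placeholder

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Upstream: invert k->v into v->k and publish to a dedicated topic.
        builder.<String, String>stream("source-topic")
               .map((k, v) -> KeyValue.pair(v, k))
               .to("inverted-topic");

        // Downstream: a GlobalKTable consumes ALL partitions of the topic,
        // so every instance materializes the complete inverted state locally.
        GlobalKTable<String, String> inverted =
                builder.globalTable("inverted-topic",
                                    Materialized.as(STORE_NAME));

        // builder.build() would then be passed to new KafkaStreams(...)
        // together with the usual application.id / bootstrap.servers config.
    }
}
```

Note the trade-off: the GlobalKTable gives each instance a full local copy, at the cost of every instance consuming and storing the entire topic, which is the storage overhead the question's option 1 worried about.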
Source: https://stackoverflow.com/questions/50993292/kafka-streams-shared-changelog-topic