Question
This is a follow-up to: Kafka Streams - How to scale Kafka store generated changelog topics
Let's hypothetically assume the stream consumer needs to do some transformation before storing the data (indexing by v->k instead of k->v).
In the end, the goal is for each consumer to store the full set of transformed records (v->k) in RocksDB. I understand that an upstream processor could produce v->k from k->v, and the final consumer could then simply materialize the new topic as a GlobalKTable. But what happens if the whole pipeline is done in the end consumer?
KTable<Key, Value> table = builder.table(topic);
table.groupBy((k, v) -> KeyValue.pair(v, k))
     .reduce((newValue, aggValue) -> newValue,  // adder
             (newValue, aggValue) -> null,      // subtractor
             Materialized.as(STORE_NAME));
Which of these options is the best practice or the most optimal for this scenario (please correct me if my assumptions are off)?
1. If all consumers have different applicationIds, they will each consume all the k->v events and each generate its own intermediate changelog topic containing the full content (which is not optimal storage-wise).
2. If all consumers have the same applicationId but are in different consumer groups, each independently loading all the k->v events, they will all write the same computed v->k events to a shared changelog topic (named after the applicationId). This does not look optimal either, as the same data would be computed and produced multiple times.
3. If all consumers have the same applicationId and are in the same group, each consuming only a slice of the k->v events (according to partition assignment), they will each contribute their part of the computed v->k events to the shared changelog topic. But I am unclear whether each materialized RocksDB will hold the full set of data or only the slice that flowed through its own consumer pipeline?
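To build intuition for option 3, here is a plain-Java sketch (not Kafka Streams code; the partition count, sample data, and use of `hashCode` instead of Kafka's murmur2 partitioner are illustrative assumptions) of how records re-keyed to v->k are routed by the new key, so each instance's local store only ever sees its own slice:

```java
import java.util.*;

public class PartitionShardingDemo {

    // Kafka partitions by a hash of the serialized key (murmur2);
    // plain hashCode() is used here only to illustrate the routing.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // assumed partition count
        Map<String, String> source = Map.of(
                "k1", "v1", "k2", "v2", "k3", "v3", "k4", "v4");

        // One local store per Streams instance (instance i owns partition i).
        List<Map<String, String>> localStores = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            localStores.add(new HashMap<>());
        }

        // Invert k->v to v->k, then route by the NEW key, as groupBy would.
        source.forEach((k, v) ->
                localStores.get(partitionFor(v, numPartitions)).put(v, k));

        // Each instance holds only a slice; only the union of all local
        // stores is the full inverted data set.
        int total = localStores.stream().mapToInt(Map::size).sum();
        System.out.println("total records across all instances = " + total);
        localStores.forEach(System.out::println);
    }
}
```

No single `localStores` entry is guaranteed to contain all four inverted records; only their union does, which is exactly the sharding question raised in option 3.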
Answer 1:
For Kafka Streams, applicationId == groupId. Thus, (2) is not possible.
For (3), that state is sharded/partitioned and each instance has only part of the state.
If you want to get a full copy of the state, you need to use GlobalKTables instead of KTables.
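A minimal sketch of what the answer suggests, with the topic names (`source-topic`, `inverted-topic`) and String serdes being placeholder assumptions: an upstream step re-keys k->v into v->k and writes it out, and each instance then materializes that topic as a GlobalKTable, so every instance's RocksDB bootstraps the full inverted data set from all partitions:

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Materialized;

public class InvertedGlobalStore {
    static final String STORE_NAME = "inverted-store"; // placeholder

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Upstream: invert k->v into v->k and publish to a dedicated topic.
        builder.<String, String>stream("source-topic")
               .map((k, v) -> KeyValue.pair(v, k))
               .to("inverted-topic");

        // Downstream: a GlobalKTable consumes ALL partitions of the topic,
        // so every instance materializes the complete inverted state locally.
        GlobalKTable<String, String> inverted =
                builder.globalTable("inverted-topic",
                                    Materialized.as(STORE_NAME));

        // builder.build() would then be passed to new KafkaStreams(...)
        // together with the usual application.id / bootstrap.servers config.
    }
}
```

Note the trade-off: the GlobalKTable gives each instance a full local copy, at the cost of every instance consuming and storing the entire topic, which is the storage overhead the question's option 1 worried about.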
Source: https://stackoverflow.com/questions/50993292/kafka-streams-shared-changelog-topic