Join multiple Kafka topics by key


Question


How can I write a consumer that joins multiple Kafka topics in a scalable way?

I have a topic that publishes events with a key, and a second topic that publishes other events, with the same key, related to a subset of the first. I would like to write a consumer that subscribes to both topics and performs some additional actions for the subset that appears in both topics.

I can do this easily with a single consumer: read everything from both topics, maintain state locally, and perform the actions once both events have been read for a given key. But I need the solution to scale.
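For illustration, a naive single-consumer version might look like this (a sketch assuming the kafka-python client and hypothetical topic names events-a / events-b):

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'events-a', 'events-b',                      # hypothetical topic names
    bootstrap_servers='localhost:9092',
    group_id='naive-join',
    key_deserializer=lambda k: k.decode('utf-8'),
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

pending_a = {}   # key -> event from the first topic, waiting for its match
pending_b = {}   # key -> event from the second topic, waiting for its match

for msg in consumer:
    store, other = (pending_a, pending_b) if msg.topic == 'events-a' \
        else (pending_b, pending_a)
    store[msg.key] = msg.value
    if msg.key in other:
        # Both sides have arrived for this key -> perform the extra action.
        print('joined', msg.key, pending_a[msg.key], pending_b[msg.key])
```

This works, but all keys flow through one process, which is exactly the part that doesn't scale.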

Ideally I need to tie the topics together so that they are partitioned the same way and the partitions are assigned to consumers in sync. How can I do this?

I know Kafka Streams joins topics together such that keys are allocated to the same nodes. How does it do that? P.S. I can't use Kafka Streams because I'm using Python.


Answer 1:


Too bad you are on Python -- Kafka Streams would be a perfect fit :)

If you want to do this manually, you will need to implement your own PartitionAssignor -- this implementation must ensure that partitions are co-located in the assignment: assume you have 4 partitions per topic (let's call the topics A and B); then partitions A_0 and B_0 must be assigned to the same consumer (likewise A_1 and B_1, ...).
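A minimal, library-agnostic sketch of that co-location rule (pure Python, hypothetical consumer ids):

```python
def co_located_assignment(consumer_ids, topics, num_partitions):
    """Assign partition i of every topic to the same consumer,
    round-robin over the sorted consumer ids."""
    consumer_ids = sorted(consumer_ids)
    assignment = {c: [] for c in consumer_ids}
    for p in range(num_partitions):
        owner = consumer_ids[p % len(consumer_ids)]
        assignment[owner].extend((topic, p) for topic in topics)
    return assignment

# 4 partitions, topics A and B, two consumers:
# c1 -> A_0, B_0, A_2, B_2   and   c2 -> A_1, B_1, A_3, B_3
print(co_located_assignment(['c1', 'c2'], ['A', 'B'], 4))
```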

I hope your Python consumer allows you to specify a custom partition assignor via the config parameter partition.assignment.strategy.
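For example, with the kafka-python client this is exposed as the partition_assignment_strategy keyword argument, which takes a list of assignor classes. The built-in round-robin assignor below is only a stand-in for the custom co-locating assignor you would actually plug in (see the sketch further down in this answer):

```python
from kafka import KafkaConsumer
from kafka.coordinator.assignors.roundrobin import RoundRobinPartitionAssignor

consumer = KafkaConsumer(
    'events-a', 'events-b',                      # hypothetical topic names
    bootstrap_servers='localhost:9092',
    group_id='co-located-join',
    # Replace the stand-in with a custom AbstractPartitionAssignor subclass
    # that co-locates equal partition numbers across topics.
    partition_assignment_strategy=[RoundRobinPartitionAssignor],
)
```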

This is the PartitionAssignor Kafka Streams uses: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java

Streams uses the concept of tasks -- a task gets assigned the partitions of different topics that share the same partition number. Streams also tries to do a "sticky assignment" -- i.e., it avoids moving tasks (and thus partitions) during a rebalance if possible. To this end, each consumer encodes its "old assignment" in the rebalance metadata.

Basically, the method #subscription() is called on each consumer that is alive. It sends the consumer's subscription information (i.e., which topics the consumer wants to subscribe to) plus optional metadata to the brokers.

In a second step, the leader of the consumer group computes the actual assignment within #assign(). The responsible broker collects all the information given by #subscription() in the first phase of the rebalance and hands it to #assign(). Thus, the leader gets a global overview of the whole group and can ensure that partitions are assigned in a co-located manner.

In the last step, the broker receives the computed assignment from the leader and broadcasts it to all consumers of the group. This results in a call to #onAssignment() on each consumer.
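Putting the three phases together, a hedged sketch of such a custom assignor for kafka-python could look like the following. CoLocatedAssignor is a hypothetical name; the sketch assumes all group members subscribe to both topics and that every topic has the same partition count (a prerequisite for a key-based join anyway). In kafka-python, metadata() plays the role of #subscription(), assign() runs on the group leader, and on_assignment() is the final callback on every consumer:

```python
from kafka.coordinator.assignors.abstract import AbstractPartitionAssignor
from kafka.coordinator.protocol import (
    ConsumerProtocolMemberAssignment, ConsumerProtocolMemberMetadata)


class CoLocatedAssignor(AbstractPartitionAssignor):
    """Hypothetical assignor: partition i of every subscribed topic goes to
    the same consumer, so joins by key stay on one instance."""
    name = 'co-located'
    version = 0

    @classmethod
    def metadata(cls, topics):
        # Phase 1 (#subscription): advertise the topics this consumer wants.
        return ConsumerProtocolMemberMetadata(cls.version, list(topics), b'')

    @classmethod
    def assign(cls, cluster, members):
        # Phase 2 (#assign): runs only on the group leader, which sees every
        # member's subscription and can therefore co-locate partitions.
        topics = set()
        for metadata in members.values():
            topics.update(metadata.subscription)

        # Assumes equal partition counts across the subscribed topics.
        num_partitions = min(
            len(cluster.partitions_for_topic(t) or ()) for t in topics)

        member_ids = sorted(members)
        assignment = {m: {t: [] for t in topics} for m in member_ids}
        for p in range(num_partitions):
            owner = member_ids[p % len(member_ids)]
            for topic in topics:
                assignment[owner][topic].append(p)

        return {
            member_id: ConsumerProtocolMemberAssignment(
                cls.version, sorted(assignment[member_id].items()), b'')
            for member_id in member_ids
        }

    @classmethod
    def on_assignment(cls, assignment):
        # Phase 3 (#onAssignment): nothing extra to do for this sketch.
        pass
```

This class could then be passed in the partition_assignment_strategy list shown earlier.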

This might also help:

  • https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Architecture
  • http://docs.confluent.io/current/streams/architecture.html


Source: https://stackoverflow.com/questions/43027286/join-multiple-kafka-topics-by-key
