Is Kafka Stream StateStore global over all instances or just local?

南笙酒味 提交于 2019-12-01 16:14:06
Matthias J. Sax

This depends on your view on a state store.

  1. In Kafka Streams a state is shared and thus each instance holds part of the overall application state. For example, using DSL stateful operator use a local RocksDB instance to hold their shard of the state. Thus, with this regard the state is local.

  2. On the other hand, all changes to the state are written into a Kafka topic. This topic does not "live" on the application host but in the Kafka cluster and consists of multiple partition and can be replicated. In case of an error, this changelog topic is used to recreate the state of the failed instance in another still running instance. Thus, as the changelog is accessible by all application instances, it can be considered to be global, too.

Keep in mind, that the changelog is the truth of the application state and the local stores are basically caches of shards of the state.

Moreover, in the WordCount example, a record stream (the data stream) gets partitioned by words, such that the count of one word will be maintained by a single instance (and different instances maintain the counts for different words).

For an architectural overview, I recommend http://docs.confluent.io/current/streams/architecture.html

Also this blog post should be interesting http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/

If worth mentioning that there is a GlobalKTable improvement proposal

GlobalKTable will be fully replicated once per KafkaStreams instance. That is, each KafkaStreams instance will consume all partitions of the corresponding topic.

From the Confluent Platform's mailing list, I've got this information

You could start prototyping using Kafka 0.10.2 (or trunk) branch...

0.10.2-rc0 already has GlobalKTable!

Here's the actual PR.

And the person that told me that was Matthias J. Sax ;)

Use a Processor instead of Transformer, for all the transformations you want to perform on the input topic, whenever there is a usecase of lookingup data from GlobalStateStore . Use context.forward(key,value,childName) to send the data to the downstream nodes. context.forward(key,value,childName) may be called multiple times in a process() and punctuate() , so as to send multiple records to downstream node. If there is a requirement to update GlobalStateStore, do this only in Processor passed to addGlobalStore(..) because, there is a GlobalStreamThread associated with GlobalStateStore, which keeps the state of the store consistent across all the running kstream instances.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!