Question
I know this question was asked before here: Kafka Streaming Concurrency?
Still, this seems very strange to me. According to the documentation (or maybe I am missing something), each partition has a task, meaning a separate processor instance, and each task is executed by a different thread. But when I tested it, I saw that one thread can call different processor instances, and that the same processor instance can be called from different threads. So if you want to keep any in-memory state (the old-fashioned way) in your processor, do you have to lock?
Example code:
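import java.util.UUID;
import org.apache.kafka.streams.processor.AbstractProcessor;
public class SomeProcessor extends AbstractProcessor<String, JsonObject> {
    // Unique per processor instance, so the output shows which instance handled each record.
    private final String ID = UUID.randomUUID().toString();
    @Override
    public void process(String key, JsonObject value) {
        System.out.println("Thread id: " + Thread.currentThread().getId() + " ID: " + ID);
    }
}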
OUTPUT:
Thread id: 88 ID: 26b11094-a094-404b-b610-88b38cc9d1ef
Thread id: 88 ID: c667e669-9023-494b-9345-236777e9dfda
Thread id: 88 ID: c667e669-9023-494b-9345-236777e9dfda
Thread id: 90 ID: 0a43ecb0-26f2-440d-88e2-87e0c9cc4927
Thread id: 90 ID: c667e669-9023-494b-9345-236777e9dfda
Thread id: 90 ID: c667e669-9023-494b-9345-236777e9dfda
Is there a way to enforce one thread per processor instance?
Answer 1:
The number of threads per instance is a configuration parameter (num.stream.threads, with a default value of 1). Thus, if you start a single KafkaStreams instance, you get num.stream.threads threads.
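As a minimal sketch of where this parameter is set, the configuration below uses plain java.util.Properties with string keys, so it runs without the Kafka libraries on the classpath. The class name, application id, and broker address are placeholders assumed for illustration; they do not come from the question.

```java
import java.util.Properties;

class StreamThreadsConfigSketch {
    // Builds a Kafka Streams style configuration with an explicit thread count.
    static Properties buildConfig(int numThreads) {
        Properties props = new Properties();
        // Placeholder values, assumed for illustration only.
        props.put("application.id", "some-processor-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Number of stream threads started by one KafkaStreams instance (default: 1).
        props.put("num.stream.threads", Integer.toString(numThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(2);
        System.out.println("num.stream.threads = " + props.getProperty("num.stream.threads"));
    }
}
```

In a real application, a Properties object like this would be handed to the KafkaStreams instance together with the topology.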
Tasks split up the work into parallel units (based on your input topic partitions) and are assigned to threads. Thus, if you have multiple tasks and a single thread, all tasks will be assigned to this thread. If you have two threads (summed over all KafkaStreams instances), each thread executes about 50% of the tasks.
Note: because a Kafka Streams application is distributed in nature, there is no difference between running a single KafkaStreams instance with multiple threads and running multiple KafkaStreams instances with one thread each. Tasks will be distributed over all available threads of your application.
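To make the arithmetic concrete, here is a small sketch that spreads tasks over the total number of threads. It is an illustration only: the round-robin rule and the class name TaskDistributionSketch are assumptions, not the assignment logic Kafka Streams actually uses, and the count of four tasks is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TaskDistributionSketch {
    // Spreads task ids 0..numTasks-1 over the given number of threads, round-robin.
    // Illustration only, not Kafka Streams' actual assignment logic.
    static List<List<Integer>> distribute(int numTasks, int totalThreads) {
        List<List<Integer>> perThread = new ArrayList<>();
        for (int t = 0; t < totalThreads; t++) {
            perThread.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            perThread.get(task % totalThreads).add(task);
        }
        return perThread;
    }

    public static void main(String[] args) {
        int numTasks = 4; // assumed: four input partitions, one task each
        // One instance with two threads, or two instances with one thread each:
        // either way the application has two threads in total.
        System.out.println(distribute(numTasks, 1 * 2));
        System.out.println(distribute(numTasks, 2 * 1));
    }
}
```

Both calls produce the same split of two tasks per thread, because only the total thread count matters in this sketch.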
If you want to share any data structure between tasks and you have more than one thread, it's your responsibility to synchronize access to this data structure. Note that the task-to-thread assignment can change during runtime, and thus all access must be synchronized. However, this pattern is not recommended as it limits scalability. You should design your program with no shared data structures! The main reason is that your program is, in general, distributed over multiple machines, and thus different KafkaStreams instances cannot access a shared data structure anyway. Sharing a data structure would only work within a single JVM, but using a single JVM prevents horizontal scale-out of your application.
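If a shared structure is used despite this advice, synchronized access within one JVM could be sketched as follows. The example uses only the JDK. SharedCounts and its two worker threads are hypothetical stand-ins for processors running on different stream threads; none of this is Kafka Streams API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedCounts {
    // Shared between threads, so a thread-safe map is used instead of a plain HashMap.
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // ConcurrentHashMap.merge performs the read-modify-write atomically per key.
    void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    // Runs two threads that each increment the same key perThread times.
    static long runDemo(int perThread) {
        SharedCounts shared = new SharedCounts();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                shared.increment("some-key");
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return shared.get("some-key");
    }

    public static void main(String[] args) {
        System.out.println(runDemo(10_000)); // prints 20000
    }
}
```

No update is lost because each increment is atomic. As the answer points out, this only helps inside a single JVM and does nothing for KafkaStreams instances running on other machines.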
Source: https://stackoverflow.com/questions/47119317/kafka-stream-processor-thread-safe