Is a Kafka Streams processor thread-safe?

Question

I know this question was asked before here: Kafka Streaming Concurrency?

Still, this seems very strange to me. According to the documentation (or maybe I am missing something), each partition gets its own task, meaning its own processor instance, and each task is executed by a different thread. But when I tested it, I saw that one thread can run different processor instances, and the same instance can show up on different threads (see the output below). So if you want to keep any in-memory state (the old-fashioned way) in your processor, do you have to lock?

Example code:

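import java.util.UUID;
import org.apache.kafka.streams.processor.AbstractProcessor;

// JsonObject comes from whichever JSON library the project uses (not shown in the question).
public class SomeProcessor extends AbstractProcessor<String, JsonObject> {

   // Unique per processor instance, so the log shows which instance handled a record.
   private final String ID = UUID.randomUUID().toString();

   @Override
   public void process(String key, JsonObject value) {
     System.out.println("Thread id: " + Thread.currentThread().getId() + " ID: " + ID);
   }
}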

OUTPUT:

Thread id: 88 ID: 26b11094-a094-404b-b610-88b38cc9d1ef

Thread id: 88 ID: c667e669-9023-494b-9345-236777e9dfda

Thread id: 90 ID: 0a43ecb0-26f2-440d-88e2-87e0c9cc4927

Thread id: 90 ID: c667e669-9023-494b-9345-236777e9dfda

Is there a way to enforce one thread per processor instance?

Answer 1:

The number of threads per KafkaStreams instance is a configuration parameter (num.stream.threads, with a default value of 1). Thus, if you start a single KafkaStreams instance, you get num.stream.threads threads.
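As a minimal sketch, the setting can be passed as a plain property when building the configuration. The class name StreamThreadsConfig, the application id, and the broker address are placeholders that do not appear in the question; the key num.stream.threads is the one named above (Kafka Streams also exposes it as StreamsConfig.NUM_STREAM_THREADS_CONFIG).

```java
import java.util.Properties;

public class StreamThreadsConfig {

    // Builds the properties that would be handed to `new KafkaStreams(topology, props)`.
    // "example-app" and "localhost:9092" are placeholder values for illustration.
    public static Properties build(int numThreads) {
        Properties props = new Properties();
        props.put("application.id", "example-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Number of stream threads this KafkaStreams instance will start (default: 1).
        props.put("num.stream.threads", String.valueOf(numThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(2);
        System.out.println("num.stream.threads = " + props.getProperty("num.stream.threads"));
    }
}
```

With this configuration, a single KafkaStreams instance would start two stream threads and spread its tasks over them.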

Tasks split the work into parallel units (based on your input topic partitions) and are assigned to threads. Thus, if you have multiple tasks and a single thread, all tasks will be assigned to this thread. If you have two threads (summed over all KafkaStreams instances), each thread executes about 50% of the tasks.
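The effect of the thread count on task placement can be illustrated with a toy model. This is plain Java, not Kafka code, and the round-robin rule is an assumption made only for illustration; Kafka's real assignment logic is more involved. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskAssignmentSketch {

    // Spreads task indices 0..numTasks-1 over numThreads threads, round-robin.
    // Illustration only: this is NOT Kafka's real assignment algorithm.
    public static List<List<Integer>> assign(int numTasks, int numThreads) {
        List<List<Integer>> perThread = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            perThread.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            perThread.get(task % numThreads).add(task);
        }
        return perThread;
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + assign(4, 1)); // [[0, 1, 2, 3]]
        System.out.println("2 threads: " + assign(4, 2)); // [[0, 2], [1, 3]]
    }
}
```

With four tasks and one thread, that thread runs all four processor instances, which matches the question's output where one thread id appears with several processor IDs.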

Note: because a Kafka Streams application is distributed in nature, there is no difference between running a single KafkaStreams instance with multiple threads and running multiple KafkaStreams instances with one thread each. Tasks will be distributed over all available threads of your application.

If you want to share any data structure between tasks and you have more than one thread, it is your responsibility to synchronize access to this data structure. Note that the task-to-thread assignment can change at runtime, so all access must be synchronized. However, this pattern is not recommended because it limits scalability. You should design your program with no shared data structures! The main reason is that your program is, in general, distributed over multiple machines, so different KafkaStreams instances cannot access a shared data structure anyway. Sharing a data structure only works within a single JVM, and relying on a single JVM prevents you from scaling your application out horizontally.
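If sharing within one JVM really cannot be avoided, one way to synchronize access is a concurrent collection. The following is a sketch in plain Java with no Kafka dependency: the class SharedCounterSketch, the record method, and the key name are hypothetical, and the worker threads only stand in for stream threads calling process(). It carries the same limitation described above, since the map is invisible to other JVMs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedCounterSketch {

    // Shared by all workers in this JVM; ConcurrentHashMap.merge is atomic per key.
    private static final Map<String, Long> COUNTS = new ConcurrentHashMap<>();

    // Stand-in for what a process(key, value) call might do with shared state.
    static void record(String key) {
        COUNTS.merge(key, 1L, Long::sum);
    }

    // Runs `threads` workers that each record `perThread` events for the same key.
    public static long run(int threads, int perThread) {
        COUNTS.clear();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    record("some-key");
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for the workers so their updates are visible before reading.
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return COUNTS.getOrDefault("some-key", 0L);
    }

    public static void main(String[] args) {
        System.out.println(run(2, 10_000)); // 20000 when no update is lost
    }
}
```

Replacing the ConcurrentHashMap with a plain HashMap would make lost updates possible, which is the kind of bug the locking question is about.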

Source: https://stackoverflow.com/questions/47119317/kafka-stream-processor-thread-safe

Tags

java

multithreading

apache-kafka-streams