Thread Synchronization for DoFn in Apache Beam

Posted by 别说谁变了你拦得住时间么 on 2021-01-27 20:06:49

Question


I am writing a DoFn whose instance variable elements (i.e., a shared resource) is mutated in the @ProcessElement method:

import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;

public class DemoDoFn extends DoFn<String, Void> {
  private final int batchSize;

  private transient List<String> elements;

  public DemoDoFn(int batchSize) {
    this.batchSize = batchSize;
  }

  @StartBundle
  public void startBundle() {
    elements = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(@Element String element, ProcessContext context) {
    elements.add(element); // <-------- mutated

    if (elements.size() >= batchSize) {
      flushBatch();
    }
  }

  @FinishBundle
  public void finishBundle() {
    flushBatch();
  }

  private void flushBatch() {
    // Flush all elements, e.g., send all elements in a single API call to a server

    // Initialize a new array list for next batch
    elements = new ArrayList<>(); // <-------- mutated
  }
}

Question 1: Do I need to add the synchronized keyword to the @ProcessElement method in order to avoid a race condition?

According to the Thread-compatibility section of the Apache Beam programming guide: "Each instance of your function (DoFn) object is accessed by a single thread at a time on a worker instance, unless you explicitly create your own threads. Note, however, that the Beam SDKs are not thread-safe. If you create your own threads in your user code, you must provide your own synchronization."

Question 2: Does "Each instance of your function object is accessed by a single thread at a time on a worker instance" indicate that Beam will synchronize @ProcessElement or the entire DoFn behind the scenes?

This IBM paper makes the following points, and I quote:

  1. "Third, the Beam programming guide guarantee that each user-defined function instance will only be executed by a single thread at a time. This means that the runner has to synchronize the entire function invocation, which could lead to significant performance bottlenecks."
  2. "Beam promises applications that there will only be a single thread executing their user-defined functions at a time. Therefore, if the underline engine spawns multiple threads, the runner has to synchronize the entire DoFn or GroupByKey invocation."
  3. "As Beam forbids multiple threads from entering the same PTransform instance, engines lose the opportunity to use operator parallelism."

The paper seems to indicate that the entire DoFn invocation is synchronized.


Answer 1:


I know this is an old question, but I was researching the same thing. No, you don't need synchronized on your processElement method because, as you quoted: "Each instance of your function (DoFn) object is accessed by a single thread at a time on a worker instance"

Here is an example of an official Beam class that mutates an instance variable: https://github.com/apache/beam/blob/0c01636fc8610414859d946cb93eabc904123fc8/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L1369
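
To make the "single thread at a time per instance" idea concrete, here is a minimal plain-Java sketch with no Beam dependency. Batcher and ThreadConfinementDemo are hypothetical names of mine that only mimic the shape of the DoFn in the question. The sketch assumes that each worker thread gets its own instance, which is my simplified reading of the quoted guarantee and not a description of how any particular runner is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the batching DoFn in the question (not a Beam class).
class Batcher {
  private final int batchSize;
  private List<String> elements = new ArrayList<>(); // deliberately unsynchronized
  private int flushed = 0;

  Batcher(int batchSize) {
    this.batchSize = batchSize;
  }

  // Plays the role of @ProcessElement: buffer, then flush when the batch is full.
  void process(String element) {
    elements.add(element);
    if (elements.size() >= batchSize) {
      flush();
    }
  }

  // Plays the role of flushBatch(): count the buffered elements and start a new batch.
  void flush() {
    flushed += elements.size();
    elements = new ArrayList<>();
  }

  int flushedCount() {
    return flushed;
  }
}

class ThreadConfinementDemo {
  // Runs `workers` threads, each with its OWN Batcher, and returns the total
  // number of flushed elements. No instance is ever shared between threads.
  static int runConfined(int workers, int perWorker, int batchSize) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (int w = 0; w < workers; w++) {
        results.add(pool.submit(() -> {
          Batcher batcher = new Batcher(batchSize); // one instance per thread
          for (int i = 0; i < perWorker; i++) {
            batcher.process("e" + i);
          }
          batcher.flush(); // plays the role of @FinishBundle
          return batcher.flushedCount();
        }));
      }
      int total = 0;
      for (Future<Integer> result : results) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(runConfined(4, 1000, 7)); // prints 4000
  }
}
```

Because every Batcher is confined to one thread, the unsynchronized ArrayList is safe and no elements are lost. If several threads that you created yourself shared a single Batcher, the same code would need synchronization, which matches the caveat in the guide about creating your own threads.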



Source: https://stackoverflow.com/questions/57130258/thread-synchronization-for-dofn-in-apache-beam
