Singleton in Google Dataflow

Submitted by 拥有回忆 on 2019-12-23 01:24:46

Question


I have a Dataflow pipeline that reads messages from Pub/Sub. I need to enrich each message using a couple of APIs. I want a single instance of each API client to be used for processing all records, so that the client is not initialized for every request.

I tried creating a static variable, but I still see the API being initialized many times.

How can I avoid initializing a variable multiple times in Google Dataflow?


Answer 1:


Dataflow uses multiple machines in parallel to do data analysis, so your API will have to be initialized at least once per machine.

In fact, Dataflow does not have strong guarantees on the life of these machines, so they may come and go relatively frequently.

A simple way to have your job access an external service without initializing the API more often than necessary is to initialize it in your DoFn's @Setup method:

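import org.apache.beam.sdk.transforms.DoFn;

// The String element types are an assumption; substitute your own input and output types.
class APICallingDoFn extends DoFn<String, String> {
    // transient: the handle is created on the worker in @Setup, not serialized with the DoFn.
    private transient ExternalServiceHandle handle = null;

    // Called once per DoFn instance, before any element is processed.
    @Setup
    public void initializeExternalAPI() {
        // ...
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // ... process each element -- setup will have been called
    }
}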

You need to do this because neither Beam nor Dataflow guarantees the lifetime of a DoFn instance or of a worker.
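If several DoFn instances end up in the same worker JVM, a setup method still runs once per instance. One way to get closer to "once per JVM" is to have the setup step fetch the client from a lazily initialized static holder. The following is a minimal sketch in plain Java with no Beam dependency, so it can be run on its own. ExpensiveClient, SharedClient, EnrichFn and SingletonDemo are hypothetical names made up for illustration, and EnrichFn only mimics the shape of a DoFn. Each separate worker machine would still create its own client.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive API client; not a real library class.
final class ExpensiveClient {
    // Counts constructor calls so the demo can report how often initialization ran.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    ExpensiveClient() {
        INIT_COUNT.incrementAndGet();
    }

    String enrich(String message) {
        return message + ":enriched";
    }
}

// Initialization-on-demand holder: the JVM initializes Holder once, on first
// access, and class initialization is thread-safe.
final class SharedClient {
    private SharedClient() {}

    private static final class Holder {
        static final ExpensiveClient INSTANCE = new ExpensiveClient();
    }

    static ExpensiveClient get() {
        return Holder.INSTANCE;
    }
}

// Mimics the shape of a DoFn: many instances may exist in one JVM, and each
// one looks up the shared client in its setup step (the @Setup analogue).
final class EnrichFn {
    private transient ExpensiveClient client;

    void setup() {
        client = SharedClient.get();
    }

    String process(String message) {
        return client.enrich(message);
    }
}

final class SingletonDemo {
    // Creates several EnrichFn instances on a thread pool and returns the
    // number of times the client was constructed.
    static int runDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            final int id = i;
            pool.submit(() -> {
                EnrichFn fn = new EnrichFn();
                fn.setup();
                fn.process("msg-" + id);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ExpensiveClient.INIT_COUNT.get();
    }

    public static void main(String[] args) {
        System.out.println("initializations: " + runDemo());
    }
}
```

In a real pipeline the same idea would mean calling something like SharedClient.get() from the @Setup method, under the assumption that the client is thread-safe, since several DoFn instances would then share it.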

Hope this helps.



Source: https://stackoverflow.com/questions/44646447/singleton-in-google-dataflow
