Partition data coming from CSV so I can process larger batches rather than individual lines

小鲜肉 2020-12-16 17:57

I am just getting started with Google Dataflow. I have written a simple flow that reads a CSV file from Cloud Storage. One of the steps involves calling a web service to

2 answers
  • 2020-12-16 18:19

    You can buffer elements in a member variable of your DoFn and call your web service once the buffer is large enough, then call it again in finishBundle to flush whatever is left. For example:

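    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import java.util.ArrayList;
    import java.util.List;

    // Assumes the Dataflow SDK 1.x DoFn API. MAX_CALL_SIZE is an example value and
    // callServiceWithData is a placeholder for your own web service client.
    abstract class CallServiceFn extends DoFn<String, String> {
      private static final int MAX_CALL_SIZE = 100;
      private final List<String> elements = new ArrayList<>();

      // Sends one batch to the web service and returns its results.
      abstract Iterable<String> callServiceWithData(List<String> batch);

      @Override
      public void processElement(ProcessContext c) {
        elements.add(c.element());
        if (elements.size() >= MAX_CALL_SIZE) {
          for (String result : callServiceWithData(elements)) {
            c.output(result);
          }
          elements.clear();
        }
      }

      // Flush whatever is still buffered when the bundle ends.
      @Override
      public void finishBundle(Context c) {
        if (!elements.isEmpty()) {
          for (String result : callServiceWithData(elements)) {
            c.output(result);
          }
          elements.clear();
        }
      }
    }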
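
To see how the buffer-and-flush pattern splits input, here is a minimal sketch in plain Java with no Dataflow dependency. The `Batcher` class and its `batchesOf` helper are hypothetical names made up for this illustration; `add` plays the role of processElement and `flush` the role of finishBundle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of any Dataflow SDK: buffers items and
// hands them to a callback in batches of at most maxSize.
class Batcher<T> {
  private final int maxSize;
  private final Consumer<List<T>> sink;
  private final List<T> buffer = new ArrayList<>();

  Batcher(int maxSize, Consumer<List<T>> sink) {
    this.maxSize = maxSize;
    this.sink = sink;
  }

  // Mirrors processElement: buffer the item, flush once the batch is full.
  void add(T item) {
    buffer.add(item);
    if (buffer.size() >= maxSize) {
      flush();
    }
  }

  // Mirrors finishBundle: send whatever is left, skipping empty batches.
  void flush() {
    if (!buffer.isEmpty()) {
      sink.accept(new ArrayList<>(buffer)); // copy, because buffer is reused
      buffer.clear();
    }
  }

  // Convenience wrapper: split a list of lines into batches.
  static List<List<String>> batchesOf(List<String> lines, int maxSize) {
    List<List<String>> batches = new ArrayList<>();
    Batcher<String> batcher = new Batcher<>(maxSize, batches::add);
    for (String line : lines) {
      batcher.add(line);
    }
    batcher.flush();
    return batches;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
    System.out.println(batchesOf(lines, 3)); // prints [[a, b, c], [d, e, f], [g]]
  }
}
```

With seven lines and a batch size of 3, the last batch holds only one element. That leftover is why the final flush matters: without it, the trailing lines would never reach the web service.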
    
  • 2020-12-16 18:39

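    Note that Apache Beam has since added a GroupIntoBatches transform, which makes this even easier: it groups the values of a keyed PCollection into batches of up to a given size.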
