Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

死守一世寂寞 2020-11-29 12:25

I want to take advantage of BigQuery's new time-partitioned tables, but I am unsure whether this is currently possible in version 1.6 of the Dataflow SDK.

6 Answers
  •  一整个雨季
    2020-11-29 12:49

    The approach I took (it also works in streaming mode):

    • Define a custom window for the incoming record
    • Convert the window into the table/partition name

      p.apply(PubsubIO.Read
                  .subscription(subscription)
                  .withCoder(TableRowJsonCoder.of())
              )
              .apply(Window.into(new TablePartitionWindowFn()) )
              .apply(BigQueryIO.Write
                             .to(new DayPartitionFunc(dataset, table))
                             .withSchema(schema)
                             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
              );
      

    The window is set from the incoming data. The end Instant can be ignored, because only the start value is used to set the partition:

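    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.coders.Coder;
    import com.google.cloud.dataflow.sdk.transforms.windowing.BoundedWindow;
    import com.google.cloud.dataflow.sdk.transforms.windowing.GlobalWindow;
    import com.google.cloud.dataflow.sdk.transforms.windowing.IntervalWindow;
    import com.google.cloud.dataflow.sdk.transforms.windowing.NonMergingWindowFn;
    import com.google.cloud.dataflow.sdk.transforms.windowing.WindowFn;
    import java.util.Arrays;
    import java.util.Collection;
    import org.joda.time.Instant;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;

    public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

        // Builds a window that starts at midnight UTC of the record's "DTTM" date.
        private IntervalWindow assignWindow(AssignContext context) {
            TableRow source = (TableRow) context.element();
            String dttm = (String) source.get("DTTM");

            DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();

            Instant start = Instant.parse(dttm, formatter);
            // The end only has to lie after the start; it is not used for the partition name.
            Instant end = start.withDurationAdded(1000, 1);

            return new IntervalWindow(start, end);
        }

        @Override
        public Coder<IntervalWindow> windowCoder() {
            return IntervalWindow.getCoder();
        }

        @Override
        public Collection<IntervalWindow> assignWindows(AssignContext c) throws Exception {
            return Arrays.asList(assignWindow(c));
        }

        @Override
        public boolean isCompatible(WindowFn<?, ?> other) {
            return false;
        }

        @Override
        public IntervalWindow getSideInputWindow(BoundedWindow window) {
            if (window instanceof GlobalWindow) {
                throw new IllegalArgumentException(
                        "Attempted to get side input window for GlobalWindow from non-global WindowFn");
            }
            // No side-input window mapping is provided.
            return null;
        }
    }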
    

    Setting the table partition dynamically:

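    import com.google.cloud.dataflow.sdk.transforms.SerializableFunction;
    import com.google.cloud.dataflow.sdk.transforms.windowing.BoundedWindow;
    import com.google.cloud.dataflow.sdk.transforms.windowing.IntervalWindow;
    import org.joda.time.DateTimeZone;
    import org.joda.time.format.DateTimeFormat;

    public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {

        // Table reference up to and including the '$' of the partition decorator.
        private final String destination;

        public DayPartitionFunc(String dataset, String table) {
            this.destination = dataset + "." + table + "$";
        }

        @Override
        public String apply(BoundedWindow boundedWindow) {
            // The cast below is safe because TablePartitionWindowFn only produces IntervalWindows.
            String dayString = DateTimeFormat.forPattern("yyyyMMdd")
                                             .withZone(DateTimeZone.UTC)
                                             .print(((IntervalWindow) boundedWindow).start());
            return destination + dayString;
        }
    }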
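
    To sanity-check the naming without running a pipeline, the mapping from a DTTM value to a window start and then to a partition decorator can be sketched on its own. This is only an illustration under a few assumptions: it uses java.time instead of Joda-Time so that it has no dependencies, and PartitionNames, windowStart and partitionFor are hypothetical names, not part of the Dataflow SDK.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of the Dataflow SDK.
final class PartitionNames {

    // BASIC_ISO_DATE prints a date as yyyyMMdd, the form that follows '$' in the table name.
    private static final DateTimeFormatter DECORATOR = DateTimeFormatter.BASIC_ISO_DATE;

    // Window start: midnight UTC of the record's date, given as yyyy-MM-dd.
    static Instant windowStart(String dttm) {
        return LocalDate.parse(dttm).atStartOfDay(ZoneOffset.UTC).toInstant();
    }

    // Destination in the form dataset.table$yyyyMMdd, derived from the window start only.
    static String partitionFor(String dataset, String table, Instant windowStart) {
        String day = DECORATOR.format(windowStart.atZone(ZoneOffset.UTC).toLocalDate());
        return dataset + "." + table + "$" + day;
    }

    public static void main(String[] args) {
        Instant start = windowStart("2016-05-01");
        // Prints my_dataset.my_table$20160501
        System.out.println(partitionFor("my_dataset", "my_table", start));
    }
}
```

    Only the window start feeds into the name, which is why the end of the window can be arbitrary in the code above.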
    

    Is there a better way of achieving the same outcome?
