I wanted to take advantage of the new BigQuery functionality of time partitioned tables, but am unsure this is currently possible in the 1.6 version of the Dataflow SDK.
The approach I took (works in the streaming mode, too):
Convert the window into the table/partition name
p.apply(PubsubIO.Read
.subscription(subscription)
.withCoder(TableRowJsonCoder.of())
)
.apply(Window.into(new TablePartitionWindowFn()) )
.apply(BigQueryIO.Write
.to(new DayPartitionFunc(dataset, table))
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
Setting the window based on the incoming data, the End Instant can be ignored, as the start value is used for setting the partition:
public class TablePartitionWindowFn extends NonMergingWindowFn
Setting the table partition dynamically:
public class DayPartitionFunc implements SerializableFunction {
String destination = "";
public DayPartitionFunc(String dataset, String table) {
this.destination = dataset + "." + table+ "$";
}
@Override
public String apply(BoundedWindow boundedWindow) {
// The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
String dayString = DateTimeFormat.forPattern("yyyyMMdd")
.withZone(DateTimeZone.UTC)
.print(((IntervalWindow) boundedWindow).start());
return destination + dayString;
}}
Is there a better way of achieving the same outcome?