Autodetect BigQuery schema within Dataflow?


If you are using protocol buffers as objects in your PCollections (which should perform very well on the Dataflow back-end), you might be able to use a util I wrote in the past. It parses the protobuf schema into a BigQuery schema at runtime by inspecting the protobuf descriptor.

I quickly uploaded it to GitHub; it's still a work in progress, but you might be able to use it, or be inspired to write something similar using Java reflection (I might do that myself at some point).
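
To make the idea concrete, here is a minimal sketch of that kind of descriptor inspection. It is not the actual util; the class name, the exact type mappings, and the handling of repeated and nested fields are my own assumptions, so adjust them to your needs:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FieldDescriptor;

import java.util.ArrayList;
import java.util.List;

public class ProtoSchemaSketch {
    // Walk the protobuf descriptor and build an equivalent BigQuery TableSchema.
    public static TableSchema makeTableSchema(Descriptor descriptor) {
        List<TableFieldSchema> fields = new ArrayList<>();
        for (FieldDescriptor field : descriptor.getFields()) {
            TableFieldSchema column = new TableFieldSchema().setName(field.getName());
            switch (field.getJavaType()) {
                case INT:
                case LONG:
                    column.setType("INTEGER");
                    break;
                case FLOAT:
                case DOUBLE:
                    column.setType("FLOAT");
                    break;
                case BOOLEAN:
                    column.setType("BOOLEAN");
                    break;
                case MESSAGE:
                    // Nested messages become RECORD columns with their own sub-fields.
                    column.setType("RECORD");
                    column.setFields(makeTableSchema(field.getMessageType()).getFields());
                    break;
                default:
                    // Strings, enums and bytes are mapped to STRING here for simplicity.
                    column.setType("STRING");
            }
            if (field.isRepeated()) {
                column.setMode("REPEATED");
            }
            fields.add(column);
        }
        return new TableSchema().setFields(fields);
    }
}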

You can use the util as follows:

// Build the BigQuery schema from the generated protobuf class's descriptor.
TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());

// Write to BigQuery, creating the table with that schema if it does not exist.
enhanced_events.apply(BigQueryIO.Write.to(tableToWrite)
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

Here the CREATE_IF_NEEDED disposition creates the table with the specified schema if it does not already exist, and ProtobufClass is the class generated from your protobuf schema by the proto compiler.

I'm not sure about reading from BigQuery, but for writes I think something like this will work with the latest Java SDK.

.apply("WriteBigQuery", BigQueryIO.Write
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .to(outputTableName));


Note: the BigQuery table reference must be of the form <project_name>:<dataset_name>.<table_name>.
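
For example (the project, dataset, and table names below are placeholders):

String outputTableName = "my-project:my_dataset.my_table";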