BigQueryIO Read performance using withTemplateCompatibility

烈酒焚心 提交于 2019-12-10 07:04:28

问题


Apache Beam 2.1.0 had a bug with template pipelines that read from BigQuery which meant they could only be executed once. More details here https://issues.apache.org/jira/browse/BEAM-2058

This has been fixed with the release of Beam 2.2.0, you can now read from BigQuery using the withTemplateCompatibility option, your template pipeline can now be run multiple times.

  pipeline
    .apply("Read rows from table."
         , BigQueryIO.readTableRows()
                     .withTemplateCompatibility()
                     .from("<your-table>")
                     .withoutValidation())

This implementation seems to come with a huge performance cost to BigQueryIO read operation, I now have batch pipelines what ran in 8-11 minutes now consistently taking 45-50 minutes to complete. The only difference between both pipelines is the .withTemplateCompatibility().

Am trying to understand the reasons for the huge drop in performance and if there is any way to improve them.

Thanks.

Solution: based on jkff's input.

  pipeline
    .apply("Read rows from table."
         , BigQueryIO.readTableRows()
                     .withTemplateCompatibility()
                     .from("<your-table>")
                     .withoutValidation())
    .apply("Reshuffle",  Reshuffle.viaRandomKey())

回答1:


I suspect this is due to the fact that withTemplateCompatibility comes at the cost of disabling dynamic rebalancing for this read step.

I would expect it to have significant impact only if you're reading a small or moderate amount of data, but performing very heavy processing on it. In this case, try adding a Reshuffle.viaRandomKey() onto your BigQueryIO.read(). It will materialize a temporary copy of the data, but will parallelize downstream processing much better.



来源:https://stackoverflow.com/questions/47755692/bigqueryio-read-performance-using-withtemplatecompatibility

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!