Question
The first step of a Dataflow pipeline we're building reads from BigQuery using the Python Beam API:
import apache_beam as beam

# First step of the pipeline: read the whole table from BigQuery.
beam.io.Read(
    beam.io.BigQuerySource(
        project=google_project,
        table=table_name,
        dataset=big_query_dataset_id
    )
)
The table in question has 9 billion+ rows.
The BigQuery export jobs that this call kicks off finish very quickly, usually within 3-5 minutes, and produce the expected amount of data as *.avro files in a folder for Dataflow to read.
However, when the pipeline actually executes, things appear to work properly for about 10-20 minutes while the first step reads the data into a PCollection: we can see the wall time increasing, the element count on that step's Output collection increasing, and workers scaling up to help.
After a certain point (usually around 1 billion elements or rows of data), though, the wall time and the element count both begin to steadily decrease. The vCPU hours keep increasing at the expected rate, meaning we are still running in some way and still paying for CPU time, yet the wall time keeps going down and the PCollection output/element count keeps trending toward zero. It's quite baffling: nothing looks amiss in the logs (it at least appears things are working?), but given the number of workers required and the cost, we'd really like to see evidence that things are moving forward.
I even gave this the benefit of the doubt that perhaps there was something crazy going on at the browser level, but I can confirm the behavior across different browsers and even different people looking at the same job.
Has anybody ever seen this before, and if so what is causing it? Is it just a bug in the step display/graphing that Dataflow provides or is there something else going on here?
Thanks in advance for any help!
Edit - I was able to solve the problem through a lot of experimentation.
The wall time appears to go backwards because workers were crashing when they ran out of memory while trying to handle some hot keys. The crashed workers then stop reporting, and that seems to make the wall time on those steps go down.
Overall we solved the problem by a combination of things:
- We moved as much logic as we could out of GroupBy steps and into lightweight combiners (see the sketch after this list).
- We limited the overall number of GroupBys.
- We enabled Shuffle Mode (Dataflow's service-based shuffle), which seemed to help with some of the chokepoint GroupBys.
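As a concrete illustration of the first point, here is a minimal sketch assuming a simple per-key sum over the rows read from BigQuery; the field names ('user_id', 'amount') and the fanout value of 16 are hypothetical placeholders, not our actual pipeline code:

import apache_beam as beam

# Before (memory-hungry): GroupByKey pulls every value for a key onto one
# worker, so a single hot key can exhaust that worker's memory.
#
# totals = (rows
#           | 'KeyByUser' >> beam.Map(lambda row: (row['user_id'], row['amount']))
#           | 'GroupByUser' >> beam.GroupByKey()
#           | 'SumPerUser' >> beam.MapTuple(lambda key, values: (key, sum(values))))

# After: a lightweight combiner pre-aggregates values on each worker before
# the shuffle, and hot-key fanout spreads the heaviest keys across extra workers.
totals = (rows
          | 'KeyByUser' >> beam.Map(lambda row: (row['user_id'], row['amount']))
          | 'SumPerUser' >> beam.CombinePerKey(sum).with_hot_key_fanout(16))

On the Shuffle Mode point: that is Dataflow's service-based shuffle, which at the time was enabled with the --experiments=shuffle_mode=service pipeline option; it moves GroupBy shuffle state off the workers and onto the Dataflow service.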
Hopefully this helps somebody else who runs into this problem.
Source: https://stackoverflow.com/questions/54892476/google-dataflow-wall-time-pcollection-output-numbers-going-backwards