The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs filled up with warning messages saying that one
I'm having the same issue. I've found that the most common cause is one of the jobs failing to insert into the BigQuery table or, much less commonly, failing to save a file into the GCS bucket. The thread in charge does not catch the exception and keeps waiting for the job to finish. This is a bug in Apache Beam, and I have already created a ticket for it.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-7693
Let's see if the Apache Beam team fixes this issue (it is literally an unhandled exception).
For now, my recommendation is to validate the constraints of your data before the insertion. Keep in mind things like:
1) Maximum row size (as of 2019, 1 MB for streaming inserts and 100 MB for batch loads).
2) Rows missing REQUIRED values should be routed to a dead letter beforehand instead of ever reaching the insert job.
3) If your rows can contain unknown fields, don't forget to enable the ignoreUnknownValues option (otherwise they will make your job die).
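A minimal Java sketch of what I mean for 2) and 3), assuming the Beam Java SDK and a TableRow-based pipeline; the project/dataset/table name and the sample row are placeholders you would replace with your own source:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class SafeBigQueryWrite {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Placeholder input; in a real pipeline this would be your event source.
        PCollection<TableRow> rows = p.apply("SampleRows",
            Create.of(new TableRow().set("id", 1).set("unknown_field", "x"))
                .withCoder(TableRowJsonCoder.of()));

        WriteResult result = rows.apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")          // placeholder table
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .ignoreUnknownValues()                         // drop fields not in the schema
                // Stop retrying permanent errors (e.g. missing REQUIRED values) forever,
                // so they surface as failed inserts instead of stalling the step.
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        // Rows rejected by BigQuery land here instead of blocking the pipeline;
        // send them to your dead-letter sink of choice (GCS, Pub/Sub, another table...).
        PCollection<String> deadLetters = result.getFailedInserts()
            .apply("FormatDeadLetters", MapElements
                .into(TypeDescriptors.strings())
                .via((TableRow row) -> row.toString()));

        p.run();
      }
    }

With retryTransientErrors, rows that hit a permanent error (like a missing REQUIRED value) come back through getFailedInserts instead of being retried forever, so you can dump them to a dead letter and keep the pipeline moving.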
I presume you are only having issues during peak hours because more "unsatisfied" events (rows violating the constraints above) are coming in.
Hopefully this helps a little bit.