Question
The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs became full of warning messages saying that one specific step got stuck. The peculiar thing is that the failing steps are different: one is a BigQuery output and the other a Cloud Storage output.
The following are the log messages that we are receiving:
For BigQuery output:
Processing stuck in step <STEP_NAME>/StreamingInserts/StreamingWriteTables/StreamingWrite for at least <TIME> without outputting or completing in state finish
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:765)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:829)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:131)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
For Cloud Storage output:
Processing stuck in step <STEP_NAME>/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least <TIME> without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:421)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1007)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:726)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
All pipelines have been drained and redeployed, but the same thing happened again after a while (3 to 4 hours). Some of them had been running for more than 40 days and suddenly got into this state without any changes in the code.
I would like to ask for help in finding the reason for this problem. These are the IDs of some of the Dataflow jobs with these problems:
Stuck in BigQuery output: 2019-03-04_04_46_31-3901977107649726570
Stuck in Cloud Storage output: 2019-03-04_07_50_00-10623118563101608836
Answer 1:
As you correctly pointed out, this is likely because of a deadlock issue with the Conscrypt library, which was being used as the default security provider. As of Beam 2.9.0, Conscrypt is no longer the default security provider.
Another option is to downgrade to Beam 2.4.0, where Conscrypt was also not the default provider.
For streaming pipelines, you can simply update your pipeline with the new SDK, and things should work.
As a short term workaround, you can kill the workers that are stuck to remove the deadlock issue, but you'll eventually run into the problem again. It's best to update to 2.9.0.
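If it helps to confirm what the workers actually ended up with, a small diagnostic sketch like the one below (not part of the asker's pipeline) prints the registered JCE security providers in priority order; the provider name "Conscrypt" in the commented-out removal call is an assumption and should be checked against the printed output.

import java.security.Provider;
import java.security.Security;

public class ProviderCheck {
  public static void main(String[] args) {
    // Providers are consulted in priority order; the first matching one wins for TLS.
    Provider[] providers = Security.getProviders();
    for (int i = 0; i < providers.length; i++) {
      System.out.println((i + 1) + ": " + providers[i].getName() + " - " + providers[i].getInfo());
    }
    // Last-resort workaround (assumes the provider registers under the name "Conscrypt"):
    // Security.removeProvider("Conscrypt");
  }
}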
Answer 2:
I'm having the same issue. I've found that the most common cause is that one of the jobs failed to insert into the BigQuery table or, far less commonly, failed to save the file into the GCS bucket. The thread in charge does not catch the exception and keeps waiting for the job. This is a bug in Apache Beam, and I have already created a ticket for it.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-7693
Let's see if the Apache Beam folks fix this issue (it is literally an unhandled exception).
So far my recommendation is to validate your data against its constraints before the insertion. Keep in mind things like:
1) Maximum row size (as of 2019 it is 1 MB for streaming inserts and 100 MB for batch loads).
2) Missing REQUIRED values should be routed to a dead letter beforehand so they never reach the write step.
3) If you have unknown fields, don't forget to enable the ignoreUnknownValues option (otherwise they will make your job die).
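To illustrate the dead-letter recommendation, here is a minimal sketch (not the asker's pipeline) using Beam's Java SDK; the table spec and the REQUIRED "event_id" field are made-up placeholders, and the size check is only a rough approximation of the streaming-insert payload limit.

import java.nio.charset.StandardCharsets;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class ValidatedBigQueryWrite {

  // Streaming-insert row limit mentioned above; treated here as a rough cutoff.
  private static final int MAX_ROW_BYTES = 1_000_000;

  static final TupleTag<TableRow> VALID = new TupleTag<TableRow>() {};
  static final TupleTag<TableRow> DEAD_LETTER = new TupleTag<TableRow>() {};

  /** Routes oversized rows or rows missing a required field to a dead-letter output. */
  static class ValidateFn extends DoFn<TableRow, TableRow> {
    @ProcessElement
    public void processElement(@Element TableRow row, MultiOutputReceiver out) {
      // Rough size check: string length of the row as a proxy for the insert payload size.
      int approxBytes = row.toString().getBytes(StandardCharsets.UTF_8).length;
      boolean hasRequired = row.get("event_id") != null; // "event_id" is a placeholder field
      if (approxBytes > MAX_ROW_BYTES || !hasRequired) {
        out.get(DEAD_LETTER).output(row);
      } else {
        out.get(VALID).output(row);
      }
    }
  }

  /** Validates rows first, then streams only the valid ones into BigQuery. */
  static PCollection<TableRow> writeValidated(PCollection<TableRow> rows) {
    PCollectionTuple routed = rows.apply("ValidateRows",
        ParDo.of(new ValidateFn()).withOutputTags(VALID, TupleTagList.of(DEAD_LETTER)));

    routed.get(VALID).apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder table spec
            .ignoreUnknownValues()                 // drop unknown fields instead of failing the insert
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // Dead-lettered rows can be written to GCS or a separate table for later inspection.
    return routed.get(DEAD_LETTER);
  }
}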
I presume that you are only having issues during peak hours because more of these "unsatisfied" events are coming in.
Hopefully this helps a little bit.
Answer 3:
I was running into the same error, and the reason was that I had created an empty BigQuery table without specifying a schema. Make sure to create the BQ table with a schema before you load data into it via Dataflow.
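As an alternative to creating the table by hand, a short sketch like the following (the table spec and field names are placeholders) supplies the schema to BigQueryIO so the sink can create the missing table itself:

import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class CreateTableWithSchema {
  static BigQueryIO.Write<TableRow> writeWithSchema() {
    // Schema for the destination table; the fields are placeholders.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("event_id").setType("STRING").setMode("REQUIRED"),
        new TableFieldSchema().setName("payload").setType("STRING")));

    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table spec
        .withSchema(schema)                    // schema supplied up front...
        .withCreateDisposition(                // ...so the table is created with it if missing
            BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED);
  }
}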
Source: https://stackoverflow.com/questions/54990412/dataflow-pipeline-processing-stuck-in-step-step-name-for-at-least-time-wi