Solutions to fix stuck timers / activities in Cadence/SWF/StepFunctions

Question

Timers are durable in workflow engines like Cadence, SWF and Step Functions. A durable timer is useful for use cases that need to wait for a long period of time and then wake up to execute some business logic. Because it is durable, it is resilient to various failures, which makes the programming model much better for developers.

But what if you want to change a timer after it has started? Like this example:

@Override
public void sampleWorkflowWithTimer(Input input){
  //...
  //some business logic before the timer
  //use a durable timer waiting for 7 days
  Workflow.sleep(Duration.ofDays(7));
  //send an email after the timer fires 
  activities.sendEmailReminder(input);
  //continue with other business 
  //...
}

After the timer has started, you may want it to wait for 3 days instead, or even cancel it. What if you simply change the workflow code to this?

@Override
public void sampleWorkflowWithTimer(Input input){
  //...
  //some business logic before the timer
  //use a durable timer waiting for 3 days
  Workflow.sleep(Duration.ofDays(3));
  //send an email after the timer fires 
  activities.sendEmailReminder(input);
  //continue with other business after the timer fires 
  //...
}

This does NOT work!

In SWF, Step Functions and Cadence, this will only work for new workflow executions that haven’t started the timer yet. But what you really want to fix are the workflows with “stuck” timers waiting for 7 days.

In SWF and Cadence, this makes things even worse: the “stuck” workflows now become truly stuck on “non-deterministic errors” (NDE) until the workflow times out. Behind the scenes, Cadence/SWF turns the durable timer into a timer event in the workflow history, recorded with the timer value. During replay, the workflow expects to see that timer with exactly the same value. Seeing a different timer value causes an NDE in the workflow.
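To make the replay rule concrete, here is a minimal sketch in plain Java. It is a simplified model, not the Cadence or SWF implementation: the class ReplaySketch and its replay method are hypothetical names, and real history events carry much more than a duration.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical, simplified model of replay; not the Cadence or SWF implementation.
class ReplaySketch {
    // Compares the timers recorded in history with the timers the current code requests.
    // Throws if the code asks for a timer value that differs from the recorded one.
    static void replay(List<Duration> recordedTimers, List<Duration> requestedTimers) {
        int count = Math.min(recordedTimers.size(), requestedTimers.size());
        for (int i = 0; i < count; i++) {
            Duration recorded = recordedTimers.get(i);
            Duration requested = requestedTimers.get(i);
            if (!recorded.equals(requested)) {
                throw new IllegalStateException(
                    "non-deterministic: history has timer " + recorded
                        + " but code requested " + requested);
            }
        }
    }
}
```

In this model, a history that recorded a 7-day timer replays cleanly against code that still asks for 7 days, and fails as soon as the code asks for 3 days.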

Activities can get stuck as well. This happens when an activity uses a very large timeout and then a process, thread or RPC call hangs, or the host gets killed during a deployment.

So what is the solution?

Answer 1:

Solutions to stuck timers

In Step Functions, there is no real solution, because it’s impossible to change the state machine for workflows that have already started running. So you must be careful when using a durable timer (wait) in Step Functions, especially when the timer value is so large that you may later want to change it. You may break the big timer value down into smaller increments and add some checkpoints, or use an activity to simulate a timer. It’s certainly painful to work around it like that.
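The idea of breaking a big timer into smaller increments can be sketched in plain Java as follows. This only shows the arithmetic of the split; the class TimerChunks and its split method are hypothetical names, and how each increment and checkpoint is expressed in a state machine is left out.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one long wait into shorter waits.
class TimerChunks {
    // Returns waits of at most `step` that add up to `total`.
    // A checkpoint can run between two consecutive waits.
    static List<Duration> split(Duration total, Duration step) {
        if (step.isZero() || step.isNegative()) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<Duration> chunks = new ArrayList<>();
        Duration remaining = total;
        while (remaining.compareTo(step) > 0) {
            chunks.add(step);
            remaining = remaining.minus(step);
        }
        // Add the last, possibly shorter, wait if anything is left.
        if (!remaining.isZero() && !remaining.isNegative()) {
            chunks.add(remaining);
        }
        return chunks;
    }
}
```

With a 7-day total and a 1-day step, this gives seven 1-day waits, so a decision to stop waiting can take effect within a day instead of a week.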

In SWF there is a solution, but it is complicated because of SWF’s poor workflow versioning support. The basic idea is as follows:

1. Add a feature flag to the workflow input. Newly started workflows can use 3 days as the timer value, while already started workflows are not affected.
2. Change the “Workflow.sleep()” call to use a Promise, and use another promise to wait for an “operation signal”. When processing the operation signal, wait for a new timer instead (or skip the timer). Note that this is a backward-compatible change, because waiting for a signal does not involve any workflow history event.
3. Send the operation signal to all started workflows. This may be tedious if you have many workflows to signal and no good way to find them and send the signals.

This approach works with Cadence as well; it is very similar to Solution #3.

The rest of this answer describes three different approaches that you can take in Cadence.

Solution #1: Reset Workflow

This is probably the easiest solution to understand and apply. As in the sample above, you update the workflow to use 3 days as the timer value and let the existing workflows run into NDE. Then collect those workflows and reset them with the “LastDecisionCompleted” resetType.

./cadence --do <domain> wf reset --resetType LastDecisionCompleted -w <workflowID> -r <runID> --reason "some reason"

LastDecisionCompleted resetType means forgetting about the result of the last workflow decision task. It’s exactly the one that scheduled the 7-day timer in this case.

You may want to use the batch reset command if you have many of them to reset. See the CLI documentation for the reset feature.

Behind the scenes, reset lets the stuck workflows forget about the last 7-day timer and continue from just before the timer was scheduled. Because the code has been updated to use a 3-day timer, the workflows will now run as you expect.

Solution #2: Versioning + Batch Reset

Cadence has much more powerful workflow versioning support:

“getVersion is used to safely perform backwards incompatible changes to workflow definitions. It is not allowed to update workflow code while there are workflows running as it is going to break determinism. The solution is to have both old code that is used to replay existing workflows as well as the new one that is used when it is executed for the first time.

getVersion returns maxSupported version when is executed for the first time. This version is recorded into the workflow history as a marker event. Even if maxSupported version is changed the version that was recorded is returned on replay. DefaultVersion constant contains version of code that wasn't versioned before.”

We can use this versioning to make the timer changes:

@Override
public void sampleWorkflowWithTimer(Input input){
  //...
  //some business logic before the timer
  //use versioning to switch the durable timer from 7 days to 3 days
  int version = Workflow.getVersion("timerChange", Workflow.DEFAULT_VERSION, 1);
  if (version == Workflow.DEFAULT_VERSION) {
    Workflow.sleep(Duration.ofDays(7));
  } else {
    // Because the workflow has waited for some time,
    // you may want to sleep for 3-timeAlreadyElapsed instead
    Workflow.sleep(Duration.ofDays(3));
  }
  //send an email after the timer fires 
  activities.sendEmailReminder(input);
  //continue with other business 
  //...
}

Note the benefit of this versioning over a feature flag in the workflow input as in SWF: not only will newly started workflows use the 3-day timer value, but already started workflows will also use it, as long as they haven’t started the 7-day timer yet.
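The code comment in the versioned workflow suggests sleeping for 3 days minus the time already elapsed. A minimal sketch of that arithmetic in plain Java could look like the following. The class RemainingWait and its parameters are hypothetical; inside a workflow, the current time would have to come from the workflow’s own deterministic clock rather than the system clock, and where the original start time comes from depends on your workflow.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: how long is left of a target wait that started earlier.
class RemainingWait {
    // Returns target minus the time elapsed since timerStartedAt, never less than zero.
    static Duration remaining(Duration target, Instant timerStartedAt, Instant now) {
        Duration elapsed = Duration.between(timerStartedAt, now);
        Duration left = target.minus(elapsed);
        return left.isNegative() ? Duration.ZERO : left;
    }
}
```

A workflow that has already waited 1 day would then sleep 2 more days, and one that has already waited 5 days would not sleep at all.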

Then we can go fix the workflows that have already started 7-day timers. We will reset those workflows with the LastDecisionCompleted resetType. Reset becomes slightly easier to use because of versioning:

Cadence automatically adds search attributes to workflows that use versioning, which lets you look for workflows with specific versions in their history. In this case, the workflows that started 3-day timers will have a search attribute “CadenceChangeVersion” with value “timerChange-1”. Therefore, to find the stuck workflows, we can use the following query:

WorkflowType = "YourWorkflowType" AND CloseTime = missing AND StartTime < "NewCodeDeployTime" AND CadenceChangeVersion != "timerChange-1"

Here WorkflowType = "YourWorkflowType" restricts the search to that particular workflow type, CloseTime = missing selects open workflows only, StartTime < "NewCodeDeployTime" selects workflows that were started with the old code (7-day timer), and CadenceChangeVersion != "timerChange-1" excludes workflows that started the new timer value (3 days).

The above query can also match workflows that started with the old code but haven’t started their timers yet. If you want to be more precise, you can add HistoryLength to the query. You need to work out the approximate range of history length (event count) of the workflows that are stuck at the 7-day timer.

Once you have the query, use the batch reset command to reset the workflows:

./cadence wf reset-batch --query 'WorkflowType = "YourWorkflowType" AND CloseTime = missing AND StartTime < "NewCodeDeployTime" AND CadenceChangeVersion != "timerChange-1"' --resetType LastDecisionCompleted --reason "some reason"

Solution #3: Versioning + Batch Signal

This approach is quite similar to the one described for SWF. First we change the workflow code using versioning instead of a feature flag.

@Override
public void sampleWorkflowWithTimer(Input input){
  //...
  //some business logic before the timer
  //use a durable timer waiting for 7 days
  int version = Workflow.getVersion("timerChange", Workflow.DEFAULT_VERSION, 1);
  if (version == Workflow.DEFAULT_VERSION) {
     final boolean received = Workflow.await(Duration.ofDays(7), 
                                  () -> this.operationSignal == true);
     if(received){
        // Because the workflow has waited for some time,
        // you may want to sleep for 3-timeAlreadyElapsed instead
        Workflow.sleep(Duration.ofDays(3));
     }
  } else {
     Workflow.sleep(Duration.ofDays(3));
  }
  //send an email after the timer fires 
  activities.sendEmailReminder(input);
  //continue with other business 
  //...
}

// workflow field set by the signal handler below
private boolean operationSignal = false;

@Override
public void operationSignal(final String signal) {
  // you can add more cases to this operationSignal
  this.operationSignal = true;
  LOGGER.info("receive operationSignal: " + signal);
}

As mentioned above, this versioning is different from the SWF input feature flag. Workflows started with the old code can also use the 3-day timer, as long as they haven’t started the 7-day one yet.

It’s worth mentioning that the change from Workflow.sleep() to Workflow.await() is backward compatible. That’s because they both schedule a timer with the same value, 7 days, and waiting for a signal doesn’t need any history event.

Now you can send an operationSignal to all the waiting workflows. As above, we can use a query to find all of those workflows, and then use the batch signal command to signal them.

./cadence wf batch start --query 'WorkflowType = "YourWorkflowType" AND CloseTime = missing AND StartTime < "NewCodeDeployTime" AND CadenceChangeVersion != "timerChange-1"' --reason "some reason" --bt signal --input "anything" --sig SampleWorkflow::operationSignal

Once a stuck workflow receives an operationSignal, it will get unblocked from the 7-day timer and execute the new code in “received” logic.

Note that it’s more convenient to send signals to workflows in batch in Cadence than in SWF. The batch job runs as a system workflow in Cadence, so it is guaranteed to execute.

What about Stuck Activities

What if you scheduled and started an activity that has a very long timeout, but later on didn’t want to wait for it to complete, time out or fail? The workflows are now stuck waiting for the activity.

Prevention

Before jumping into solutions, note that using a long activity timeout without a proper heartbeat is considered an anti-pattern in Cadence/SWF/Step Functions, so you should avoid it from the very beginning. If you expect an activity to run for a long time, e.g. more than 10 minutes, you should set a proper heartbeat timeout value and call the heartbeat API in the activity. See more details in this answer for Cadence.

Heartbeating is not only important when an activity gets stuck on some IO or dependency. A more likely case is a worker deployment or an activity worker failure, after which the activities need to be restarted.

class MyActivitiesImpl implements MyActivities {

    @Override
    public String myHeartbeatActivity() {
      ...
      // after any IO/RPC call/some time period
      Activity.heartbeat(heartbeatDetails);
      ...
    }
}

Note that the actual heartbeat call to the Cadence server is optimized by the client SDK, so calling it every 1ms won’t cause any performance issue. Internally, the SDK decides to make an RPC call when about 80% of the heartbeat timeout has passed.
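The throttling described above can be illustrated with a small plain-Java sketch. This is not the Cadence SDK’s code: the class HeartbeatThrottle is a hypothetical name, and sending the very first heartbeat immediately is an assumption made for the illustration.

```java
import java.time.Duration;

// Hypothetical sketch of client-side heartbeat throttling; not the Cadence SDK's code.
class HeartbeatThrottle {
    private final long minIntervalMillis;
    private long lastSentMillis;
    private boolean sentOnce = false;

    HeartbeatThrottle(Duration heartbeatTimeout) {
        // Allow a real call at most once per 80% of the heartbeat timeout.
        this.minIntervalMillis = heartbeatTimeout.toMillis() * 8 / 10;
    }

    // Returns true when this heartbeat should result in a real RPC call.
    // Assumption: the first heartbeat is always sent.
    boolean shouldSend(long nowMillis) {
        if (!sentOnce || nowMillis - lastSentMillis >= minIntervalMillis) {
            sentOnce = true;
            lastSentMillis = nowMillis;
            return true;
        }
        return false;
    }
}
```

With a 10-second heartbeat timeout, activity code could call the heartbeat every millisecond and only roughly one call per 8 seconds would reach the server in this model.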

So, what if my activity is already stuck?

However, mistakes always happen as we are all human.

Assume our workflow code looks like this:

@Override
public void sampleWorkflowWithLongTimeoutActivity(Input input){
  //...
  //some business logic before the activity
  activities.helloActivity(input);
  //continue with other business 
  //...
}

And many workflows now are stuck at the helloActivity. Because this activity is stuck due to incorrect timeouts, to mitigate the issue, the first thing you should do is update the activity options to use the correct timeout values, or use heartbeat if the activity is a long running activity.

Updating the activity timeout options is a backward-compatible change in both Cadence and SWF, without any versioning. In Step Functions, however, activity timeouts are specified in the state machine, and any change to the state machine, however small, only takes effect for new workflow executions. Therefore, for Cadence and SWF, fixing the timeout options works for any workflow that hasn’t started the activity yet, but for Step Functions it only works for completely new workflows started from the beginning.

What about the workflows that have started the activities with wrong timeout values?

Obviously, there is nothing you can do in Step Functions state machines. You can kill the workflows and restart them, but that’s painful if the workflows hold valuable state that you don’t want to lose by restarting. The only thing you can do then is on the activity side:

- Write the activity code carefully (make sure there is no dead loop or dead waiting block).
- Monitor the activity execution with proper metrics.
- Add logic in the activity, running in a separate thread, that returns a failure to the state machine early when retrying won’t help at the moment. The separate thread tries to act like the correct timeout enforced on the workflow side (though it is not actually the same).
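The separate-thread idea can be sketched in plain Java with the standard java.util.concurrent classes. This is only an illustration under assumptions: the class DeadlineGuard and its runWithDeadline method are hypothetical names, and reporting the failure back to the state machine is left to the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: enforces a deadline inside the activity itself.
class DeadlineGuard {
    // Runs `work` on a separate thread and gives up after `timeoutMillis`.
    static <T> T runWithDeadline(Callable<T> work, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(work);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the stuck work; this only helps if the work responds to interruption.
            future.cancel(true);
            throw new IllegalStateException("work exceeded " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for work", e);
        } catch (ExecutionException e) {
            // The work itself failed; surface its cause to the caller.
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The activity would catch the IllegalStateException and report a failure early, instead of letting the workflow wait for the overly long timeout.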

Solution #1

For SWF/Cadence, you can change the workflow code with a solution similar to the one for stuck timers.

@Override
public void sampleWorkflowWithLongTimeoutActivity(Input input){
  //...
  //some business logic before the activity
  //helloActivity returns nothing, so start it asynchronously with Async.procedure
  Promise<Void> hello = Async.procedure(activities::helloActivity, input);
  Workflow.await(() -> hello.isCompleted() || this.operationSignal);
  if (this.operationSignal) {
    // add your logic to handle the situation that we skip the wrong activity timeout.
    // You may want to schedule the same activity again with correct timeouts
  } else {
    //continue with other business like before to be compatible
  }
 
}

The trick here is to change synchronous to asynchronous. Changing the single activity from synchronous to asynchronous usually causes NDE, but in this case the workflow immediately waits for the activity and a signal. Internally this will have the same workflow history, hence it’s a backward compatible change.

Solution #2

Luckily, if you are using Cadence, you can simply reset the workflows with the LastDecisionCompleted resetType. It is a lifesaver.

You can also use batch reset if you have many of them to reset.

Solution #3

You can also use the CLI to complete or fail the activity:

./cadence --do <domain> wf activity complete -w <workflowID> -r <runID> --activity_id <activityID> --result <result> --identity <some_identity_string>

and

./cadence --do <domain> wf activity fail -w <workflowID> -r <runID> --activity_id <activityID> --reason <reason> --detail <detail> --identity <some_identity_string>

Source: https://stackoverflow.com/questions/65118584/solutions-to-fix-stuck-timers-activities-in-cadence-swf-stepfunctions

Tags

aws-step-functions

amazon-swf

cadence-workflow

uber-cadence