How should I clean up hung grandchild processes when an alarm trips in Perl?

倾然丶 夕夏残阳落幕 提交于 2019-12-03 12:24:41

I've read the question a few times, and I think I sort of get what you are trying to do. You have a control script. This script spawns children to do some stuff, and these children spawn the grandchildren to actually do the work. The problem is that the grandchildren can be too slow (waiting for STDIN, or whatever), and you want to kill them. Furthermore, if there is one slow grandchild, you want the entire child to die (killing the other grandchildren, if possible).

So, I tried implementing this two ways. The first was to make the parent spawn a child in a new UNIX session, set a timer for a few seconds, and kill the entire child session when the timer went off. This made the parent responsible for both the child and the grandchildren. It also didn't work right.

The next strategy was to make the parent spawn the child, and then make the child responsible for managing the grandchildren. It would set a timer for each grandchild, and kill it if the process hadn't exited by expiration time. This works great, so here is the code.

We'll use EV to manage the children and timers, and AnyEvent for the API. (You can try another AnyEvent event loop, like Event or POE. But I know that EV correctly handles the condition where a child exits before you tell the loop to monitor it, which eliminates annoying race conditions that other loops are vulnerable to.)

#!/usr/bin/env perl

use strict;
use warnings;
use feature ':5.10';

use AnyEvent;
use EV; # you need EV for the best child-handling abilities

We need to keep track of the child watchers:

# active child watchers
my %children;

Then we need to write a function to start the children. The things the parent spawns are called children, and the things the children spawn are called jobs.

sub start_child($$@) {
    my ($on_success, $on_error, @jobs) = @_;

The arguments are a callback to be called when the child completes successfully (meaning its jobs were also a success), a callback when the child did not complete successfully, and then a list of coderef jobs to run.

In this function, we need to fork. In the parent, we setup a child watcher to monitor the child:

    if(my $pid = fork){ # parent
        # monitor the child process, inform our callback of error or success
        say "$$: Starting child process $pid";
        $children{$pid} = AnyEvent->child( pid => $pid, cb => sub {
            my ($pid, $status) = @_;
            delete $children{$pid};

            say "$$: Child $pid exited with status $status";
            if($status == 0){
            else {

In the child, we actually run the jobs. This involves a little bit of setup, though.

First, we forget the parent's child watchers, because it doesn't make sense for the child to be informed of its siblings exiting. (Fork is fun, because you inherit all of the parent's state, even when that makes no sense at all.)

    else { # child
        # kill the inherited child watchers
        %children = ();
        my %timers;

We also need to know when all the jobs are done, and whether or not they were all a success. We use a counting conditional variable to determine when everything has exited. We increment on startup, and decrement on exit, and when the count is 0, we know everything's done.

I also keep a boolean around to indicate error state. If a process exits with a non-zero status, error goes to 1. Otherwise, it stays 0. You might want to keep more state than this :)

        # then start the kids
        my $done = AnyEvent->condvar;
        my $error = 0;


(We also start the count at 1 so that if there are 0 jobs, our process still exits.)

Now we need to fork for each job, and run the job. In the parent, we do a few things. We increment the condvar. We set a timer to kill the child if it's too slow. And we setup a child watcher, so we can be informed of the job's exit status.

    for my $job (@jobs) {
            if(my $pid = fork){
                say "[c] $$: starting job $job in $pid";

                # this is the timer that will kill the slow children
                $timers{$pid} = AnyEvent->timer( after => 3, interval => 0, cb => sub {
                    delete $timers{$pid};

                    say "[c] $$: Killing $pid: too slow";
                    kill 9, $pid;

                # this monitors the children and cancels the timer if
                # it exits soon enough
                $children{$pid} = AnyEvent->child( pid => $pid, cb => sub {
                    my ($pid, $status) = @_;
                    delete $timers{$pid};
                    delete $children{$pid};

                    say "[c] [j] $$: job $pid exited with status $status";
                    $error ||= ($status != 0);

Using the timer is a little bit easier than alarm, since it carries state with it. Each timer knows which process to kill, and it's easy to cancel the timer when the process exits successfully -- we just delete it from the hash.

That's the parent (of the child). The child (of the child; or the job) is really simple:

            else {
                # run kid
                exit 0; # just in case

You could also close stdin here, if you wanted to.

Now, after all the processes have been spawned, we wait for them to all exit by waiting on the condvar. The event loop will monior the children and timers, and do the right thing for us:

        } # this is the end of the for @jobs loop

        # block until all children have exited

Then, when all the children have exited, we can do whatever cleanup work we want, like:

            say "[c] $$: One of your children died.";
            exit 1;
        else {
            say "[c] $$: All jobs completed successfully.";
            exit 0;
    } # end of "else { # child"
} # end of start_child

OK, so that's the child and grandchild/job. Now we just need to write the parent, which is a lot easier.

Like the child, we are going to use a counting condvar to wait for our children.

# main program
my $all_done = AnyEvent->condvar;

We need some jobs to do. Here's one that is always successful, and one that will be successful if you press return, but will fail if you just let it be killed by the timer:

my $good_grandchild = sub {
    exit 0;

my $bad_grandchild = sub {
    my $line = <STDIN>;
    exit 0;

So then we just need to start the child jobs. If you remember way back to the top of start_child, it takes two callbacks, an error callback, and a success callback. We'll set those up; the error callback will print "not ok" and decrement the condvar, and the success callback will print "ok" and do the same. Very simple.

my $ok  = sub { $all_done->end; say "$$: $_[0] ok" };
my $nok = sub { $all_done->end; say "$$: $_[0] not ok" };

Then we can start a bunch of children with even more grandchildren jobs:

say "starting...";

$all_done->begin for 1..4;
start_child $ok, $nok, ($good_grandchild, $good_grandchild, $good_grandchild);
start_child $ok, $nok, ($good_grandchild, $good_grandchild, $bad_grandchild);
start_child $ok, $nok, ($bad_grandchild, $bad_grandchild, $bad_grandchild);
start_child $ok, $nok, ($good_grandchild, $good_grandchild, $good_grandchild, $good_grandchild);

Two of those will timeout, and two will succeed. If you press enter while they're running, though, then they might all succeed.

Anyway, once those have started, we just need to wait for them to finish:


say "...done";

exit 0;

And that's the program.

One thing that we aren't doing that Parallel::ForkManager does is "rate limiting" our forks so that only n children are running at a time. This is pretty easy to manually implement, though:

 use Coro;
 use AnyEvent::Subprocess; # better abstraction than manually
                           # forking and making watchers
 use Coro::Semaphore;

 my $job = AnyEvent::Subprocess->new(
    on_completion => sub {}, # replace later
    code          => sub { the child process };

 my $rate_limit = Coro::Semaphore->new(3); # 3 procs at a time

 my @coros = map { async {
     my $guard = $rate_limit->guard;
     $job->clone( on_completion => Coro::rouse_cb )->run($_);
 }} ({ args => 'for first job' }, { args => 'for second job' }, ... );

 # this waits for all jobs to complete
 my @results = map { $_->join } @coros;

The advantage here is that you can do other things while your children are running -- just spawn more threads with async before you do the blocking join. You also have a lot more control over the children with AnyEvent::Subprocess -- you can run the child in a Pty and feed it stdin (like with Expect), and you can capture its stdin and stdout and stderr, or you can ignore those things, or whatever. You get to decide, not some module author that's trying to make things "simple".

Anyway, hope this helps.

Brian - it's a bit crude and non-idiomatic, but one approach I've seen taken is this: anytime you fork, you:

  1. Give the child process a first "-id" dummy parameter to the program, with a somewhat unique (per PID) value - a good candidate could be up-to-millisecond timestamp + parent's PID.

  2. The parent records the child PID and a -id value into a (ideally, persistent) registry along with the desired timeout/kill time.

Then have a watcher process (either the ultimate grandparent or a separate process with the same UID) simply cycle through the registry periodically, and check which processes needing to be killed (as per to-kill-time) are still hanging around (by matching both PID and "-id" parameter value in the registry with the PIDs and command line in process table); and send signal 9 to such process (or be nice and try to kill gently first by trying to send signal 2).

The unique "-id" parameter is obviously intended to prevent killing some innocent process that just happened to re-use a prior process's PID by coincidence, which is probably likely given the scale you mentioned.

The idea of a registry helps with the problem of "already disassociated" grand-children since you no longer depend on the system to keep parent/child association for you.

This is kind of brute force, but since nobody answered yet I figured I'll though my 3 cents worth of an idea your way.

I have to solve this same problem in a module I've been working on. I'm not completely satisfied with all of my solution(s) either, but what generally works on Unix is to

  1. change a child's process group
  2. spawn grandchildren as necessary
  3. change the child's process group again (say, back to its original value)
  4. signal the grandchildren's process group to kill the grandchildren

Something like:

use Time::HiRes qw(sleep);

sub be_sleepy { sleep 2 ** (5 * rand()) }
$SIGINT = 2;

for (0 .. $ARGV[1]) {
    print ".";
    print "\n" unless ++$count % 50;
    if (fork() == 0) {   
        # a child process
        # $ORIGINAL_PGRP and $NEW_PGRP should be global or package or object level vars
        $ORIGINAL_PGRP = getpgrp(0);
        setpgrp(0, $$);
        $NEW_PGRP = getpgrp(0);

        local $SIG{ALRM} = sub {
            die "$$ timed out\n";

        eval {
            alarm 2;
            while (rand() < 0.5) {
                if (fork() == 0) {
            alarm 0;

        exit 0;

sub kill_grandchildren {
    setpgrp(0, $ORIGINAL_PGRP);
    kill -$SIGINT, $NEW_PGRP;   # or  kill $SIGINT, -$NEW_PGRP

This isn't completely fool proof. The grandchildren might change their process groups or trap signals.

None of this will work on Windows, of course, but let's just say that TASKKILL /F /T is your friend.

Update: This solution doesn't handle (for me, anyway) the case when the child process invokes system "perl -le '<STDIN>'". For me, this immediately suspends the process, and prevents the SIGALRM from firing and the SIGALRM handler from running. Is closing STDIN the only workaround?