Signal handling and checkpointing for mpif90

Question

I have written code that traps the Ctrl+C signal (SIGINT) when compiled with gfortran, and it works.

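program trap
    external trap_term
    ! Signal 2 is SIGINT (Ctrl+C); signal() is a gfortran extension
    call signal(2, trap_term)
    call sleep(60)
end program trap

function trap_term()
    integer :: trap_term
    print *, 'done'
    trap_term = 0   ! define the result before using it as the exit status
    call exit(trap_term)
end function trap_term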

How would one write exactly the same thing for mpif90? Also, what is the best way to add checkpointing and (preferably automatic) restart, so that the code resumes from where it left off when running on parallel processors?

This is required because I have a fixed time allocation on clusters. Jobs are killed after a fixed number of hours and have to be resubmitted.

Answer 1:

Writing your software to checkpoint on receipt of a kill signal from the operating system is likely to be far less useful than you probably hope it will be. Suppose that you can code your program to write a full checkpoint in the time available to it when it is told to stop. You are then left with restarting your program from the arbitrary point at which it was previously stopped. That's a far from trivial problem.

Why not do what many of us used to do, and many of us still do, in this domain? Write your code to checkpoint every X iterations, or at intervals of approximately Y minutes (you choose X and Y), and write routines to restart from one of those checkpoints in the event that a previous execution has been halted prematurely. This way you only have to restart from a single, well-defined state of execution.
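As a rough illustration of the periodic scheme, here is a minimal sketch. The module name checkpoint_mod, the routines save_checkpoint, load_checkpoint and run_steps, and the file layout are all invented for this example, and the "work" is a placeholder that just increments an array. It is serial; in an MPI code one option (an assumption on my part, not something from the question) would be for each rank to call it with its own file name.

```fortran
! Sketch only: module, routine names and file layout are hypothetical.
module checkpoint_mod
  use iso_fortran_env, only: real64
  implicit none
  private
  public :: save_checkpoint, load_checkpoint, run_steps
contains

  ! Write the iteration counter and the state array to a stream file.
  subroutine save_checkpoint(fname, iter, state)
    character(len=*), intent(in) :: fname
    integer, intent(in) :: iter
    real(real64), intent(in) :: state(:)
    integer :: u
    open(newunit=u, file=fname, form='unformatted', access='stream', &
         status='replace', action='write')
    write(u) iter, size(state)
    write(u) state
    close(u)
  end subroutine save_checkpoint

  ! Read a checkpoint if the file exists; found is .false. on a fresh start.
  subroutine load_checkpoint(fname, iter, state, found)
    character(len=*), intent(in) :: fname
    integer, intent(out) :: iter
    real(real64), intent(inout) :: state(:)
    logical, intent(out) :: found
    integer :: u, n
    iter = 0
    inquire(file=fname, exist=found)
    if (.not. found) return
    open(newunit=u, file=fname, form='unformatted', access='stream', &
         status='old', action='read')
    read(u) iter, n
    if (n /= size(state)) error stop 'checkpoint size mismatch'
    read(u) state
    close(u)
  end subroutine load_checkpoint

  ! Resume from the last checkpoint (if any) and iterate up to n_target,
  ! saving a checkpoint every "every" iterations.
  subroutine run_steps(fname, n_target, every, state, iter)
    character(len=*), intent(in) :: fname
    integer, intent(in) :: n_target, every
    real(real64), intent(inout) :: state(:)
    integer, intent(out) :: iter
    logical :: found
    call load_checkpoint(fname, iter, state, found)
    do while (iter < n_target)
      iter = iter + 1
      state = state + 1.0_real64   ! stand-in for one step of real work
      if (mod(iter, every) == 0) call save_checkpoint(fname, iter, state)
    end do
  end subroutine run_steps

end module checkpoint_mod
```

Calling run_steps again with the same file name after an interrupted run picks up from the last saved iteration instead of from zero, so at most "every" iterations of work are repeated.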

You should probably be writing these checkpoint and restart routines anyway to guard against hardware problems, which only become worse as the CPU count rises and the number of network connections multiplies.

I suppose you could write your code to keep an eye on the wall clock, as it were: tell it on start-up that it has an allowance of N hours, and have it checkpoint at N-n hours, where n is long enough to complete the checkpoint with a small margin of error. But this approach won't help if a CPU fails mid-computation.
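The wall-clock idea could look something like the sketch below. The module walltime_mod and its routines are hypothetical names; allowance_s plays the role of N and margin_s the role of n, both in seconds. It uses the standard system_clock intrinsic. In an MPI code you would probably want one rank to make the decision and broadcast it (for example with MPI_Bcast), so that all ranks checkpoint at the same iteration; that part is not shown here.

```fortran
! Sketch only: module and routine names are hypothetical.
module walltime_mod
  use iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: start_timer, elapsed_seconds, out_of_time
  integer(int64), save :: t0 = 0, rate = 1
contains

  ! Record the clock reading at program start.
  subroutine start_timer()
    call system_clock(count=t0, count_rate=rate)
  end subroutine start_timer

  ! Seconds of wall-clock time since start_timer was called.
  function elapsed_seconds() result(s)
    real(real64) :: s
    integer(int64) :: now
    call system_clock(count=now)
    s = real(now - t0, real64) / real(rate, real64)
  end function elapsed_seconds

  ! .true. once fewer than margin_s seconds of the allowance remain.
  pure logical function out_of_time(elapsed_s, allowance_s, margin_s)
    real(real64), intent(in) :: elapsed_s, allowance_s, margin_s
    out_of_time = elapsed_s >= allowance_s - margin_s
  end function out_of_time

end module walltime_mod
```

You would call start_timer once at start-up, then test out_of_time(elapsed_seconds(), allowance_s, margin_s) at the end of each iteration and write the final checkpoint when it turns true.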

Answer 2:

tl;dr: Do as High Performance Mark and francescalus suggest.

In addition to what HPM says in his answer, keep in mind that what you're allowed to do in a signal handler is extremely limited. For instance, allocating memory is not allowed, which in turn rules out a lot of other things such as Fortran (or C stdio) I/O, because the Fortran I/O routines may allocate memory for their own use. You can see a list of so-called 'async-signal-safe' POSIX functions at, e.g., http://man7.org/linux/man-pages/man7/signal.7.html.

Among the few things you can reliably do in a signal handler is set a flag variable, which you then check later in your main program. E.g. after an iteration has finished, you check the flag to decide whether to checkpoint and exit, and then do all the I/O and whatever else in the "normal" context, not in the signal handler context. This is essentially what francescalus explained in his comment to HPM's answer.
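For mpif90, the flag approach might be sketched roughly as follows. All names here (stop_flag_mod, on_signal, install_handlers, any_rank_stopping) are made up for illustration. It assumes the underlying compiler is gfortran, since signal is a gfortran extension, and that signal numbers 2 and 15 are SIGINT and SIGTERM as on Linux. Whether every rank actually receives the signal depends on the launcher and the batch system, so the ranks combine their flags with MPI_Allreduce and all take the same decision.

```fortran
! Sketch only: module and procedure names are hypothetical.
module stop_flag_mod
  use mpi
  implicit none
  private
  public :: stop_requested, install_handlers, any_rank_stopping
  ! volatile: the value can change asynchronously when a signal arrives
  logical, volatile, save :: stop_requested = .false.
contains

  ! Register the handler for SIGINT (2) and SIGTERM (15).
  subroutine install_handlers()
    external on_signal
    call signal(2, on_signal)    ! gfortran extension
    call signal(15, on_signal)
  end subroutine install_handlers

  ! Collective call: .true. on every rank if any rank has seen a signal.
  function any_rank_stopping(comm) result(stop_all)
    integer, intent(in) :: comm
    logical :: stop_all
    logical :: local_stop
    integer :: ierr
    local_stop = stop_requested
    call MPI_Allreduce(local_stop, stop_all, 1, MPI_LOGICAL, MPI_LOR, &
                       comm, ierr)
  end function any_rank_stopping

end module stop_flag_mod

! The handler only sets the flag: no I/O, no allocation, no MPI calls.
subroutine on_signal()
  use stop_flag_mod, only: stop_requested
  implicit none
  stop_requested = .true.
end subroutine on_signal

! Intended use in the main program, after MPI_Init:
!   call install_handlers()
!   do iter = 1, n_iter
!     ... one iteration of work ...
!     if (any_rank_stopping(MPI_COMM_WORLD)) then
!       ... write the checkpoint here, in normal context ...
!       exit
!     end if
!   end do
```

The reduction costs one small collective per iteration. If that is too much for short iterations, checking every few iterations would be a reasonable variation.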

Source: https://stackoverflow.com/questions/33619071/signal-handling-and-check-pointing-for-mpif90

Tags

fortran

signals

mpi

restart

signal-handling