java: inconsistent watchdog timeout in systemd-notify

My Java application is installed on openSUSE 13.2, and I'm using systemd (version 210) for process control.

I would like to take advantage of the systemd watchdog functionality using systemd-notify. However, I notice the app restarting due to inconsistent timeouts from the watchdog.

With WatchdogSec=120, and the app configured to call systemd-notify every 60 seconds, I observe restarts roughly every 5 to 20 minutes.

Here is the (slightly redacted) systemd unit file for the process:

# Cool systemd service
[Unit]
Description=Something Awesome
After=awesomeparent.service
Requires=awesomeparent.service

[Service]
Type=simple
WorkingDirectory=/opt/awesome
Environment="AWESOME_HOME=/opt/awesome" 
User=awesomeuser
Restart=always
WatchdogSec=120
NotifyAccess=all
ExecStart=/home/awesome/jre1.8.0_05/bin/java -jar awesome.jar

[Install]
WantedBy=multi-user.target

And here is the code for calling systemd-notify:

String pidStr = ManagementFactory.getRuntimeMXBean().getName();
pidStr = pidStr.split("@")[0];

String cmd = "/usr/bin/systemd-notify";

Process process = new ProcessBuilder(cmd, 
                                    "MAINPID=" + pidStr, 
                                    "WATCHDOG=1").redirectErrorStream(true)
                                                 .start();

int exitCode = process.waitFor();
if (exitCode != 0) {
    // stderr was merged into stdout above, so this captures any error text.
    String output = IOUtils.toString(process.getInputStream());
    Log.MAIN_LOG.error("Failed to notify systemd:" +
                       (output.isEmpty() ? "" : " " + output) +
                       " Exit code: " + exitCode);
}

In the logs, I never see the failure messages (process always returns 0 exit code) and I'm 100% sure that the task IS being executed once per minute, on the minute. I can see the task log being executed immediately before restarts.

Anyone have any ideas why systemd-notify just doesn't work sometimes?

I'm thinking about writing code to call sd_pid_notify directly, but would like to know if there's a simple config thing I can do before going that route.

Here's the JNA code that solved the problem:

import com.sun.jna.Library;
import com.sun.jna.Native;

/**
 * The task issues a notification to the systemd watchdog. The systemd watchdog
 * will restart the service if the notification is not received.
 */
public class WatchdogNotifierTask implements Runnable {

  private static final String SYSTEMD_SO = "systemd";
  private static final String WATCHDOG_READY = "WATCHDOG=1";

  @Override
  public void run() {
    try {
      int returnCode = SystemD.INSTANCE.sd_notify(0, WATCHDOG_READY);
      if (returnCode < 0) {
        Log.MAIN_LOG.error(
            "Systemd watchdog returned a negative error code: " + returnCode);
      } else {
        Log.MAIN_LOG.debug("Successfully updated systemd watchdog.");
      }
    } catch (Exception e) {
      Log.MAIN_LOG.error("calling sd_notify native code failed with exception: ", e);
    }
  }

  /**
   * This is a Linux-specific interface to load the systemd shared library and call the
   * sd_notify function. Should we need other systemd functionality, it can be loaded here.
   * It uses JNA for native library calls.
   */
  interface SystemD extends Library {
    SystemD INSTANCE = (SystemD) Native.loadLibrary(SYSTEMD_SO, SystemD.class);
    int sd_notify(int unset_environment, String state);
  }
}
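
The task above still has to be scheduled. Rather than hard-coding 60 seconds, the interval can be derived from the WATCHDOG_USEC environment variable, which systemd sets for services that have WatchdogSec= configured; the sd_watchdog_enabled documentation recommends notifying at half that timeout. The following is only a sketch under assumptions: the class name WatchdogSchedule and the method notifyIntervalMicros are made up for illustration, the keep-alive Runnable is a placeholder for the real task, and the WATCHDOG_PID check that sd_watchdog_enabled also performs is left out.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the names here are illustrative and not part of any systemd or JNA API.
public class WatchdogSchedule {

    /**
     * Returns half of WATCHDOG_USEC (in microseconds), or empty when the
     * variable is missing, non-numeric, or not positive.
     */
    public static OptionalLong notifyIntervalMicros(Map<String, String> env) {
        String raw = env.get("WATCHDOG_USEC");
        if (raw == null) {
            return OptionalLong.empty();
        }
        try {
            long timeoutMicros = Long.parseLong(raw.trim());
            if (timeoutMicros <= 0) {
                return OptionalLong.empty();
            }
            // Never return 0, because a scheduler rejects a zero period.
            return OptionalLong.of(Math.max(1L, timeoutMicros / 2));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        OptionalLong interval = notifyIntervalMicros(System.getenv());
        if (!interval.isPresent()) {
            System.out.println("Watchdog not enabled; nothing to schedule.");
            return;
        }
        // Placeholder for the real keep-alive task (for example the JNA task above).
        Runnable keepAlive = () -> System.out.println("keep-alive due");
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Runs until the JVM is stopped, as a service would.
        scheduler.scheduleAtFixedRate(keepAlive, 0, interval.getAsLong(), TimeUnit.MICROSECONDS);
    }
}
```

With WatchdogSec=120 the variable would hold 120000000, so the computed interval is 60000000 microseconds, which matches the 60-second period used in the question.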

JdeBP

Anyone have any ideas why systemd-notify just doesn't work sometimes?

This is actually a long-standing problem in several systemd protocols, not just in the readiness notification protocol spoken by systemd-notify. The protocol for sending things directly to systemd's own journal also has this problem.

Both protocols attempt to find out stuff about the sending, client-end, process by reading things out of /proc/client-process-id/*. Unfortunately, systemd-notify is a short-lived program that exits as soon as it has sent the message to the server. So reading /proc/client-process-id/* does not yield the information about the client end that the server needs. In particular, the server cannot determine what (systemd) control group the client-end belongs to, and thus determine what service unit controls it, and thus determine whether it is a process that is allowed to send readiness notification messages.

As you have discovered, calling a library routine in-process in your actual dæmon, instead of forking a short-lived child process to run systemd-notify, avoids this problem, because your dæmon does not exit immediately after sending the notification. Be aware, however, that if you issue a readiness notification immediately before exiting your dæmon (as, ironically, some dæmons do in order to notify the world that they are terminating), you'll encounter this same problem even with an in-process library function.

There's no need to call a systemd library function as native code in order to speak this protocol, by the way. (And not using the library function gains you the advantage of speaking this protocol properly even if systemd isn't at the server end of it — a failing of the systemd library function.) It's not a hard protocol to speak in Java, and the systemd manual page describes the protocol. You look at an environment variable, open a datagram socket, use the variable's value for the name of the socket to send to, send a single datagram message, and then close the socket. Java is capable of this. ☺
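
The steps just listed can be sketched roughly as follows. Treat this as an illustration under assumptions, not a finished implementation: the class NotifyProtocolSketch and the UnixDatagramSender interface are hypothetical names invented here. To my knowledge the JDK standard library does not expose AF_UNIX datagram sockets (the Unix-domain support added in Java 16 covers stream channels only), so the actual send is left behind that interface, to be backed by whatever Unix-socket library or native binding you prefer. The sketch covers the parts that are plain Java: reading NOTIFY_SOCKET, translating a leading "@" into the NUL byte that marks an abstract-namespace socket, and building the newline-separated VARIABLE=VALUE message.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names belong to the JDK or to systemd.
public class NotifyProtocolSketch {

    /** Assumed transport: sends one datagram to an AF_UNIX socket address. */
    public interface UnixDatagramSender {
        void send(byte[] socketAddress, byte[] payload);
    }

    /** Socket address bytes from NOTIFY_SOCKET, or empty when no socket is configured. */
    public static Optional<byte[]> socketAddress(Map<String, String> env) {
        String name = env.get("NOTIFY_SOCKET");
        if (name == null || name.isEmpty()) {
            return Optional.empty();
        }
        byte[] address = name.getBytes(StandardCharsets.UTF_8);
        if (address[0] == '@') {
            address[0] = 0; // a leading '@' denotes the abstract namespace
        }
        return Optional.of(address);
    }

    /** Joins assignments such as "WATCHDOG=1" into a single datagram payload. */
    public static byte[] payload(String... assignments) {
        return String.join("\n", assignments).getBytes(StandardCharsets.UTF_8);
    }

    /** Sends one notification; returns false when NOTIFY_SOCKET is not set. */
    public static boolean notify(Map<String, String> env, UnixDatagramSender sender,
                                 String... assignments) {
        Optional<byte[]> address = socketAddress(env);
        if (!address.isPresent()) {
            return false;
        }
        sender.send(address.get(), payload(assignments));
        return true;
    }

    public static void main(String[] args) {
        // Stand-in sender that only prints; replace it with a real AF_UNIX datagram send.
        UnixDatagramSender printer = (address, data) ->
            System.out.println("would send: " + new String(data, StandardCharsets.UTF_8));
        boolean sent = notify(System.getenv(), printer, "WATCHDOG=1");
        System.out.println(sent ? "notification handed to sender" : "NOTIFY_SOCKET not set");
    }
}
```

Keeping the transport behind a small interface means the message construction can be exercised without systemd present, and the same code works with any server that speaks this protocol on the socket named by NOTIFY_SOCKET.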
