Understanding the ALOM Watchdog Timer


ALOM features a watchdog mechanism to detect and respond to a system hang, should one ever occur.

Note: The ALOM watchdog feature is not supported on all platforms. For more information about whether your host system is supported, refer to the Release Notes for your version of the ALOM software.

The ALOM watchdog is a timer that a user application continually resets for as long as the operating system and the application are running. If the system hangs, the user application can no longer reset the timer. The timer then expires, and ALOM performs the action that the user has configured, eliminating the need for operator intervention.
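The reset-or-expire behavior described above can be illustrated with a small model. This is only a sketch of the general watchdog pattern, not ALOM's implementation or interface: the class name WatchdogTimer, its methods, and the injected clock are all hypothetical names chosen for illustration.

```python
class WatchdogTimer:
    """Toy model of a watchdog timer (hypothetical, not the ALOM API).

    The timer expires if reset() is not called within timeout_s seconds.
    """

    def __init__(self, timeout_s, on_expire, clock):
        self.timeout_s = timeout_s    # length of the timeout window
        self.on_expire = on_expire    # action to run when the timer expires
        self.clock = clock            # callable returning the current time
        self.deadline = None          # None means the watchdog is disabled
        self.expired = False

    def enable(self):
        """Start monitoring: the first deadline is one timeout from now."""
        self.deadline = self.clock() + self.timeout_s
        self.expired = False

    def reset(self):
        """Called by the user application while it is still healthy."""
        if self.deadline is not None and not self.expired:
            self.deadline = self.clock() + self.timeout_s

    def poll(self):
        """Called by the monitoring side; runs the action once on expiry."""
        if (self.deadline is not None and not self.expired
                and self.clock() >= self.deadline):
            self.expired = True
            self.on_expire()
        return self.expired
```

In this model a healthy application calls reset() more often than once per timeout window, so poll() never finds a passed deadline. When the application hangs, the resets stop, the deadline passes, and the configured action runs exactly once.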

To fully understand the ALOM watchdog timer, it helps to know how the feature's components and their configuration variables interact:

  1. When the ALOM watchdog timer is enabled, it automatically begins monitoring the host server and detects when the host or application hangs or stops running. The default timeout period is 60 seconds: if the ALOM watchdog timer does not hear from the host system within that 60-second window, ALOM automatically performs the action that you specify in the sys_autorestart variable. You can change the timeout period through the sys_wdttimeout variable.
  2. If you set XIR (externally initiated reset) as the action that ALOM performs when the watchdog timeout period expires, ALOM attempts to XIR the host system. If the XIR does not complete within the number of seconds set through the sys_xirtimeout variable, ALOM forces the server to perform a hard reset instead.
  3. The user application should enable the ALOM watchdog after the host system has booted. Separately, ALOM starts a boot timer as soon as the host is powered on or reset, in order to detect host boot failures. The host is considered fully booted once the ALOM watchdog timer is started. If the host fails to boot within the time that you set through the sys_boottimeout variable, ALOM takes the action that you specify through the sys_bootrestart variable. To keep the system from going through an endless cycle of reboots, you can set the maximum number of attempted reboots through the sys_maxbootfail variable. Once the system has gone through that number of reboots, ALOM performs the action that you specify through the sys_bootfailrecovery variable.
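The way these variables interact can be sketched as decision logic. The following is a simplified model under stated assumptions, not ALOM's actual behavior: the field names mirror the variables named above, but the action strings ("hard_reset", "restart", "recover") are illustrative labels rather than verified ALOM values, the functions are hypothetical, and the model assumes that sys_maxbootfail counts failed boot attempts.

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    """Illustrative container; fields mirror the variables in the text."""
    sys_autorestart: str        # action when the watchdog timer expires
    sys_xirtimeout: int         # seconds allowed for an XIR to complete
    sys_boottimeout: int        # seconds allowed for the host to boot
    sys_bootrestart: str        # action when a boot attempt times out
    sys_maxbootfail: int        # failed boots allowed before recovery
    sys_bootfailrecovery: str   # action once that limit is reached
    sys_wdttimeout: int = 60    # watchdog timeout; 60 s is the stated default


def on_watchdog_expiry(cfg, xir_seconds):
    """Return the actions taken after the watchdog timer expires.

    xir_seconds is how long the XIR took, or None if it never completed.
    """
    actions = [cfg.sys_autorestart]
    if cfg.sys_autorestart == "xir":
        if xir_seconds is None or xir_seconds > cfg.sys_xirtimeout:
            # The XIR did not finish in time, so escalate to a hard reset.
            actions.append("hard_reset")
    return actions


def on_boot_attempts(cfg, boot_seconds):
    """Return the actions taken across a series of boot attempts.

    boot_seconds lists how long each attempt took (None = never finished).
    """
    actions = []
    failures = 0
    for seconds in boot_seconds:
        if seconds is not None and seconds <= cfg.sys_boottimeout:
            return actions  # host booted in time; nothing more to do
        failures += 1
        if failures >= cfg.sys_maxbootfail:
            # Limit reached: recover instead of restarting again.
            actions.append(cfg.sys_bootfailrecovery)
            return actions
        actions.append(cfg.sys_bootrestart)
    return actions
```

For example, with sys_maxbootfail set to 3, three consecutive failed boots in this model produce two restart actions followed by the recovery action, which is what prevents an endless reboot cycle.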

Managed system interface variables

Sample ALOM watchdog program