Remote reboot / keepalive using watchdog

I needed something that can do the following:
– automatically reboot if another server goes missing/offline
– reboot if instructed by another server
– it must be failproof and work in all conditions (bad hardware, bad memory, lots of errors)
Why? Because I want more control, more automation and don’t want to depend on datacenter staff for reboots.
Let’s implement this using the watchdog daemon.


! Warning !
This is very dangerous to implement and/or test on a production server. You might inadvertently reboot your server, so check everything twice, read all the references/documentation and make sure you understand what’s going on.


So, what’s a watchdog?
A watchdog is something, either hardware-based or kernel-based, that monitors the system for “normal” behaviour and if it fails, performs a system reset to hopefully recover normal operation.
There are 2 parts:
– user-space background daemon: periodically writes to the watchdog device (every second) to signal that everything is still OK, and stops writing if any fault is detected
– hardware module or kernel module: Intel chipsets come with an Intel TCO watchdog (iTCO_wdt); if that is not available, the Linux kernel has a software watchdog module. Both do the same thing: they monitor the watchdog device and if there’s no activity (e.g. for 1 minute) a hard reset/reboot is forced
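
To see which of the two drivers a machine is actually using, the list of loaded modules can be filtered. A minimal sketch, assuming the software fallback is the softdog module; detect_wdt_module is just a hypothetical helper name:

```shell
# Hypothetical helper: reads `lsmod` output on stdin and prints the name of
# any known watchdog driver it finds (iTCO_wdt = Intel hardware watchdog,
# softdog = the kernel's software watchdog).
detect_wdt_module() {
    awk '$1 == "iTCO_wdt" || $1 == "softdog" { print $1 }'
}

# Typical use on a live system:
#   lsmod | detect_wdt_module
#   ls -l /dev/watchdog      # the device node the daemon writes to
```

If nothing is printed, no watchdog driver from that short list is loaded; other hardware may ship a different driver.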

Why does this watchdog functionality exist?
Let’s say that somehow the whole system locks up, or there’s not enough memory left: how are you going to log in and fix the system? You might not be able to, so this is a failsafe method that tries to detect hard-to-recover problems and then initiates a reset/reboot at the lowest possible level.

What I wanted to implement:
– watchdog runs a custom script as an additional check
– if the custom script returns anything other than 0, then the watchdog will initiate a reboot
– the custom script handles 2 cases:
a) checks if a file called “remote-reboot-flag” exists and, if yes, signals the watchdog for a reboot
b) checks if a file called “keepalive” exists and hasn’t been modified in 15 minutes and, if yes, signals the watchdog for a reboot

This allows the following things:
– I can remotely reboot the server by creating the file “remote-reboot-flag” using an NFS mount, the watchdog then kicks in
– another server periodically touches the “keepalive” file, then if something happens the file won’t be updated anymore and the watchdog kicks in

watchdog configuration
/etc/watchdog.conf
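
Only a handful of settings matter for this setup. A minimal sketch, assuming the default device path and a 60 second limit for the custom script (illustrative values, adjust to your system). The option names come from the watchdog.conf man page; print_watchdog_conf is a hypothetical helper so the fragment can be reviewed before it goes anywhere near /etc/watchdog.conf:

```shell
# Hypothetical helper: prints a watchdog.conf fragment for review.
# Option names are from watchdog.conf(5); the values are illustrative.
print_watchdog_conf() {
    cat <<'EOF'
# device the daemon writes to
watchdog-device = /dev/watchdog
# seconds between writes to the device
interval        = 1
# custom check, a non-zero exit code means trouble
test-binary     = /root/watchdog-custom-script
# seconds the custom check is allowed to run
test-timeout    = 60
EOF
}
```

Running print_watchdog_conf and comparing its output against the existing file by eye is safer than appending blindly.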

the custom watchdog script, chose PHP for this one
/root/watchdog-custom-script
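
The two checks are simple enough to sketch in a few lines of shell. This is not the PHP version, just the same logic under stated assumptions: the function name is made up, the file paths are passed in because only the file names are fixed, and removing the flag and the stale keepalive file before signalling is an added precaution against a reboot loop. stat -c is the GNU coreutils form.

```shell
# Sketch of the custom check: returns 0 when healthy, 1 to ask for a reboot.
# Arguments: flag file, keepalive file, max keepalive age in seconds.
check_watchdog_files() {
    flag_file=$1
    keepalive_file=$2
    max_age_seconds=${3:-900}   # 15 minutes

    # Case a) another server explicitly asked for a reboot.
    if [ -e "$flag_file" ]; then
        # Assumption: remove the flag first so the machine does not
        # reboot again as soon as it comes back up.
        rm -f "$flag_file"
        return 1
    fi

    # Case b) keepalive exists but has not been touched recently.
    if [ -e "$keepalive_file" ]; then
        now=$(date +%s)
        # If the file vanished in the meantime, treat it as "not armed".
        mtime=$(stat -c %Y "$keepalive_file") || return 0
        if [ $((now - mtime)) -gt "$max_age_seconds" ]; then
            # Assumption: drop the stale file so the check re-arms only
            # once the other server touches it again.
            rm -f "$keepalive_file"
            return 1
        fi
    fi

    return 0
}
```

As a standalone script, the last line would call the function with the real paths and exit with its return code, so the watchdog daemon sees 0 or non-zero.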

Restart the watchdog daemon:
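
The exact command depends on the init system. A hedged sketch, assuming the service is called watchdog (the name used by the Debian/Ubuntu package); restart_watchdog and the DRY_RUN switch are hypothetical conveniences, not part of the watchdog package:

```shell
# Hypothetical wrapper: restart the watchdog daemon, or only print the
# command when DRY_RUN is set to a non-empty value.
restart_watchdog() {
    run=${DRY_RUN:+echo}
    if command -v systemctl >/dev/null 2>&1; then
        $run systemctl restart watchdog
    else
        $run service watchdog restart
    fi
}
```

Calling it as DRY_RUN=1 restart_watchdog first shows what would be executed without touching the daemon.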

Check the log files, /var/log/syslog might contain something similar to this:
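
A quick way to confirm the daemon picked up the script is to search the log for its path. A small sketch; log_mentions_script is a hypothetical helper, and the log location varies between distributions:

```shell
# Hypothetical helper: succeeds if the log file mentions the script path.
# Arguments: log file, script path (matched as a fixed string).
log_mentions_script() {
    log_file=$1
    script_path=$2
    grep -F -q -- "$script_path" "$log_file"
}

# e.g. log_mentions_script /var/log/syslog /root/watchdog-custom-script
```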

Make sure the logs do contain your custom script path. I also recommend activating the softboot setting for watchdog, on some Linux distributions this can be added to /etc/default/watchdog:
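
As a sketch, assuming a Debian-style /etc/default/watchdog that is sourced as shell and passes watchdog_options to the daemon (check your distribution’s file before relying on this variable name):

```shell
# /etc/default/watchdog (Debian-style layout, assumed)
# --softboot is the long form of the daemon's -b option, see watchdog(8).
watchdog_options="--softboot"
```

The daemon has to be restarted for the new option to take effect.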

Second server setup
For the keepalive function, I have the following set up on another server in a cron.d file:
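
A minimal sketch of such an entry, assuming the monitored server’s directory is mounted at /mnt/server1 and a 5 minute schedule; both are illustrative, the schedule only has to stay well under the 15 minute limit. print_keepalive_cron is a hypothetical helper that prints the line for review:

```shell
# Hypothetical helper: prints a cron.d entry (with the user field that
# /etc/cron.d files require) that touches the keepalive file every 5 minutes.
print_keepalive_cron() {
    keepalive_file=${1:-/mnt/server1/keepalive}
    printf '*/5 * * * * root touch %s\n' "$keepalive_file"
}
```

The printed line can then be saved into a file under /etc/cron.d on the second server.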

To perform a remote reboot from another server that has an NFS mount, you just run:
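
In essence this is a single touch on the flag file. A sketch, with request_remote_reboot as a hypothetical helper and the mount path as an assumption:

```shell
# Hypothetical helper: ask the monitored server to reboot by creating the
# flag file inside the directory mounted from it over NFS.
request_remote_reboot() {
    flag_file=$1
    touch "$flag_file"
}

# e.g. request_remote_reboot /mnt/server1/remote-reboot-flag
```

The reboot follows the next time the watchdog daemon runs the custom script and finds the flag.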

Safety checks
Check the custom script separately before setting it in watchdog.conf. Run it manually and check all conditions and file paths. You don’t want to get into a reboot loop!

References:
http://www.sat.dundee.ac.uk/psc/watchdog/watchdog-background.html
https://linux.die.net/man/8/watchdog
https://linux.die.net/man/5/watchdog.conf
https://www.jann.cc/2013/02/02/linux_watchdog.html
