I needed something that can do the following:
– automatically reboot if another server goes missing/offline
– reboot if instructed by another server
– it must be failproof and work in all conditions (bad hardware, bad memory, lots of errors)
Why? Because I want more control, more automation and don’t want to depend on datacenter staff for reboots.
Let’s implement this using the watchdog daemon.
! Warning !
This is very dangerous to implement and/or test on a production server. You might inadvertently reboot your server, so check everything twice, read all the references/documentation and make sure you understand what’s going on.
So, what’s a watchdog?
A watchdog is something, either hardware-based or kernel-based, that monitors the system for “normal” behaviour and if it fails, performs a system reset to hopefully recover normal operation.
There are 2 parts:
– user-space background daemon – periodically writes to the watchdog device (every second) to mark that everything is still ok at this point, and stops writing if any faults are detected
– hardware module or kernel module – Intel CPUs come with an Intel TCO watchdog (iTCO_wdt); if that is not available, the Linux kernel has a software watchdog module (softdog). Both do the same thing: they monitor the watchdog device and, if there’s no activity (e.g. for 1 min), force a hard reset/reboot
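To make the first part more concrete, here is a minimal shell sketch of what such a keepalive loop does. It is an illustration only, not the watchdog daemon’s actual code: the function name pet_watchdog and its three parameters are made up for this example. The 'V' written before closing is the "magic close" character from the Linux watchdog API, which disarms the timer where the driver allows it. Try it against a scratch file only; pointing it at the real /dev/watchdog arms the timer and can reboot the machine.

```shell
#!/bin/sh
# Hypothetical helper: write keepalives to a watchdog device (or a test file).
pet_watchdog() {
    dev=$1        # device path, or a plain file for a harmless dry run
    interval=$2   # seconds to wait between keepalive writes
    count=$3      # number of keepalive writes before stopping

    # The device is opened once (fd 3) and kept open for the whole loop,
    # because closing it without the magic character may trigger a reset.
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3              # any write restarts the countdown
            i=$((i + 1))
            if [ "$i" -lt "$count" ]; then
                sleep "$interval"
            fi
        done
        printf 'V' >&3                  # magic close: ask the driver to disarm
    } 3>"$dev"
}
```

For example, `pet_watchdog /tmp/fake-watchdog 1 5` writes five dots, one per second, followed by a V.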
Why does this watchdog functionality exist?
Let’s say that somehow the whole system locks up, or there’s not enough memory left. How are you going to log in and fix the system? You might not be able to, so this is a failsafe method that tries to detect hard-to-recover problems and then initiates a reset/reboot at the lowest possible level.
What I wanted to implement:
– watchdog runs a custom script as an additional check
– if the custom script returns anything other than 0, then the watchdog will initiate a reboot
– the custom script handles 2 cases:
a) checks if a file called “remote-reboot-flag” exists and, if yes, signals the watchdog for a reboot
b) checks if a file called “keepalive” exists and hasn’t been modified in 15 minutes and, if yes, signals the watchdog for a reboot
This allows the following things:
– I can remotely reboot the server by creating the file “remote-reboot-flag” using an NFS mount, the watchdog then kicks in
– another server periodically touches the “keepalive” file, then if something happens the file won’t be updated anymore and the watchdog kicks in
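The decision logic described above can be sketched in a few lines of shell, which is handy for trying out the two cases in a scratch directory. This is a dry-run sketch under a few assumptions: the function name check_flags and its parameters are invented here, it relies on a find that supports -mmin (GNU and BSD find do), and unlike a real check script it never deletes the flag files.

```shell
#!/bin/sh
# Hypothetical dry-run version of the two checks.
# Prints the verdict; returns 255 to request a reboot, 0 when all is well.
check_flags() {
    reboot_flag=$1    # e.g. /root/remote-reboot-flag
    keepalive=$2      # e.g. /root/keepalive
    max_age_min=$3    # e.g. 15

    # case a) someone created the reboot flag
    if [ -e "$reboot_flag" ]; then
        echo "reboot flag found"
        return 255
    fi

    # case b) keepalive exists but was last modified too long ago;
    # find prints the path only when the file is older than the limit
    if [ -e "$keepalive" ] &&
       [ -n "$(find "$keepalive" -mmin +"$max_age_min")" ]; then
        echo "keepalive is stale"
        return 255
    fi

    echo "ok"
    return 0
}
```

A missing keepalive file counts as "ok" here, matching case b) above, which only fires when the file exists.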
watchdog configuration
/etc/watchdog.conf
    # let's run our checks only every 10 seconds, don't use more than 30 or 60 secs!
    interval = 10

    # path to our custom script
    test-binary = /root/watchdog-custom-script

    # max execution time for our script
    test-timeout = 10
the custom watchdog script (I chose PHP for this one)
/root/watchdog-custom-script
    #!/usr/bin/php
    <?php

    $filename_keepalive = '/root/keepalive';
    $filename_reboot = '/root/remote-reboot-flag';

    // case a) another server asked for a reboot
    if (file_exists($filename_reboot)) {
        echo "reboot flag file found, returning -1\n";
        unlink($filename_reboot); // remove the flag so it does not trigger again after the reboot
        exit(-1); // reboot
    }

    // case b) the keepalive file exists but has not been touched for too long
    if (file_exists($filename_keepalive)) {
        $mtime = filemtime($filename_keepalive);
        $time_dif = time() - $mtime;
        $max_dif = 60 * 15; // 15 min
        if ($time_dif > $max_dif) {
            echo "keepalive modification time too much, returning -1\n";
            unlink($filename_keepalive); // remove it so the check stays quiet until it is touched again
            exit(-1); // reboot
        }
    }

    exit(0); // everything's ok
Restart the watchdog daemon:
    /etc/init.d/watchdog restart
Check the log files, /var/log/syslog might contain something similar to this:
    Dec 9 14:56:37 sandy systemd[1]: Stopping watchdog keepalive daemon...
    Dec 9 14:56:37 sandy wd_keepalive[12068]: stopping watchdog keepalive daemon (5.14)
    Dec 9 14:56:37 sandy systemd[1]: Stopped watchdog keepalive daemon.
    Dec 9 14:56:37 sandy systemd[1]: Starting watchdog daemon...
    Dec 9 14:56:37 sandy watchdog[12114]: starting daemon (5.14):
    Dec 9 14:56:37 sandy watchdog[12114]: int=10s realtime=yes sync=no soft=yes mla=0 mem=0
    Dec 9 14:56:37 sandy watchdog[12114]: ping: no machine to check
    Dec 9 14:56:37 sandy watchdog[12114]: file: no file to check
    Dec 9 14:56:37 sandy watchdog[12114]: pidfile: no server process to check
    Dec 9 14:56:37 sandy watchdog[12114]: interface: no interface to check
    Dec 9 14:56:37 sandy watchdog[12114]: temperature: no sensors to check
    Dec 9 14:56:37 sandy watchdog[12114]: test=/root/watchdog-sandy(10) repair=none(0) alive=/dev/watchdog heartbeat=none to=root no_act=no force=no
    Dec 9 14:56:37 sandy watchdog[12114]: watchdog now set to 60 seconds
    Dec 9 14:56:37 sandy watchdog[12114]: hardware watchdog identity: iTCO_wdt
    Dec 9 14:56:37 sandy systemd[1]: Started watchdog daemon.
Make sure the logs contain your custom script path (the test= entry). I also recommend activating the softboot setting for watchdog; on some Linux distributions this can be added to /etc/default/watchdog:
    # Specify additional watchdog options here (see manpage).
    watchdog_options=" --softboot "
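Checking the log by eye is easy to forget after a config change, so here is a small shell sketch that looks for the script path in the watchdog startup lines. It assumes the log format shown in the excerpt above (a watchdog[pid] line containing test=PATH(timeout)); the function name script_in_log is made up for this example, and other watchdog versions may log differently.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a watchdog log line registers the script
# as its test binary, i.e. contains "test=<script>(".
script_in_log() {
    logfile=$1    # e.g. /var/log/syslog
    script=$2     # e.g. /root/watchdog-custom-script

    # first keep only lines written by the watchdog daemon,
    # then look for the test= entry naming our script
    grep -F 'watchdog[' "$logfile" | grep -qF "test=$script("
}
```

Usage would look like `script_in_log /var/log/syslog /root/watchdog-custom-script && echo "custom script is registered"`.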
Second server setup
For the keepalive function, I have the following set up on another server in a cron.d file:
    MAILTO=root

    */2 * * * * root touch /nfs-path/keepalive

    # EOF
To perform a remote reboot from another server that has an NFS mount, you just run:
    touch /nfs-path/remote-reboot-flag
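Since a mistyped file name here means either no reboot or no keepalive, the two touch commands can be wrapped in small shell functions. This is only a convenience sketch: the function names are invented, the directory is passed in as a parameter, and the check merely confirms that the directory exists and is writable. It cannot tell a mounted NFS share from an empty local mount point.

```shell
#!/bin/sh
# Hypothetical helpers for the second server.

# Touch a file in the shared directory, complaining if the directory is unusable.
touch_shared() {
    dir=$1     # shared directory, e.g. /nfs-path
    name=$2    # file name the watchdog script looks for

    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "shared directory $dir is not available" >&2
        return 1
    fi
    touch "$dir/$name"
}

# Refresh the keepalive file (what the cron job does every 2 minutes).
send_keepalive() { touch_shared "$1" keepalive; }

# Ask the watched server to reboot on its next check.
request_reboot() { touch_shared "$1" remote-reboot-flag; }
```

With these loaded, `request_reboot /nfs-path` does the same as the touch command above.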
Safety checks
Check the custom script separately before setting it in watchdog.conf. Run it manually and check all conditions and file paths. You don’t want to get into a reboot loop!
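A manual run is more useful when the exit code is printed, because that is the only thing the watchdog looks at. The wrapper below is a small sketch for that; the name try_check is made up, and the wording of the messages simply follows the rule stated earlier (anything other than 0 leads to a reboot). Keep in mind that running the real script by hand still deletes the flag files it finds. The watchdog(8) man page in the references also describes a --no-action option for test runs; check that your version supports it.

```shell
#!/bin/sh
# Hypothetical wrapper: run a check command once and explain its exit code.
try_check() {
    rc=0
    "$@" || rc=$?    # run the command exactly as given, remember its status

    if [ "$rc" -eq 0 ]; then
        echo "exit code 0: watchdog would carry on"
    else
        echo "exit code $rc: watchdog would initiate a reboot"
    fi
    return "$rc"
}
```

For example, `try_check /root/watchdog-custom-script` runs the script once and prints which way the watchdog would go.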
References:
http://www.sat.dundee.ac.uk/psc/watchdog/watchdog-background.html
https://linux.die.net/man/8/watchdog
https://linux.die.net/man/5/watchdog.conf
https://www.jann.cc/2013/02/02/linux_watchdog.html