Auto-Restarting a Service with Nagios

James Young · June 17, 2013

I haven’t worked out why yet, but this seems to be a common theme - the PHP/FastCGI service dies periodically, which causes outages with my blog (Nginx does not like it if the back end goes away).  So, I need a solution to fix this.  Enter Nagios!

Nagios is able to have customized event handlers.  Those event handlers can be set up to perform any action you want - such as restarting a service.  So, we’ll use Nagios to restart the service every time it dies.

First, create a script in /usr/local/lib64/nagios/plugins/eventhandlers/restart-fastcgi ;

# Restarts the php-fpm FastCGI service if it dies

case "$1" in
        case "$2" in
                case "$3" in
			echo -n "Starting Fast-CGI service (3rd soft critical state)..."
			sudo /sbin/service php-fpm start | /bin/mail -s "[] FastCGI Restarted" root
			echo -n "Starting Fast-CGI service ..."
			sudo /sbin/service php-fpm start | /bin/mail -s "[] FastCGI Restarted" root
exit 0

Ok, now we’ll need to configure sudoers to allow the nagios user to run ‘service start php-fpm’ without credentials.  Add this to your sudoers with visudo;

Defaults:nagios         !requiretty,visiblepw
Cmnd_Alias      NAGIOS_START_PHPFPM = /sbin/service php-fpm start
nagios          ALL=(root)      NOPASSWD: NAGIOS_START_PHPFPM

Now, we’ll test that we can actually do it.  As root, do this;

su - nagios
/usr/local/lib64/nagios/plugins/eventhandlers/restart-fastcgi CRITICAL SOFT 3

You should then get an email sent to root saying it’s starting the service.  Obviously it won’t actually DO it (it’s already running).  Check in your /var/log/secure that the sudo command worked.  If so, great!  Now we need to set up Nagios itself to do the restart.

First, we’ll define a command to do the restart (note, I use $USER8$ to point to the local event handlers folder);

define command{
        command_name    restart-fastcgi

Then we’ll add that event handler to the service check we already have in place for checking our FastCGI service;

define service{
        use                     generic-service
        host_name               yourhostnamehere
        service_description     PHP-FPM Service
        max_check_attempts      4
        event_handler           restart-fastcgi
        flap_detection_enabled  0
        check_command           check_local_procs!0:!1:!RSDT -C php-fpm

After that, everything should work.  Don’t forget to restart Nagios.  Specifically, you want max_check_attempts to be at least one more than the limit you set in the script, since on the third SOFT failure it will try a restart - you probably don’t want Nagios yelling at you about a critical error (and going to a HARD state) before it’s tried a restart.  Then again, you might.  Change it as you want.

Now, we can be brave and manually stop the php-fpm service and watch Nagios to see if it restarts.  It should, after a few minutes.  You can tune the script above to make it do the restart faster (on the first soft fail if you want) if you want.

Good luck!

Twitter, Facebook