HWiNFO Sensors in CheckMK

I use HWiNFO64 on my Windows PCs to monitor the various temperature and fan sensors.  I wanted to get this data into CheckMK for monitoring purposes.  Here’s how I did it.

First, in HWiNFO, tag any sensors you want to monitor for the Vista gadget.  This causes HWiNFO to populate registry keys with the relevant data.  You’ll then need to make a custom plugin for CheckMK in C:\Program Files (x86)\checkmk\plugins named “hwinfo64.cmd”, containing the following;

Now test from your CheckMK server: you should see the <<<hwinfo64>>> section in the agent output for the host.  Great.  Next, we need to write a check in CheckMK to interpret that data.  Make a new check ‘hwinfo64’ in /omd/sites/SITENAME/local/share/check_mk/checks, replacing SITENAME with your OMD site name;

Apologies for the terrible Python, my Python is very weak.  Also note that this assumes that all temperature-type sensors are in Celsius units, and all fan-type sensors are in RPM units.

Once that’s done, you should be able to add services to your host and the HWiNFO sensors will be automatically inventoried and show up.  They will use some default thresholds.  In order to customize those thresholds, edit etc/check_mk/main.mk in your OMD site and do something like this;

That will set the warning/crit threshold for CPU temp checks at 70/80 C, and the threshold for GPU checks at 90/100 C, on the machines ‘desktop1’ and ‘desktop2’.  Set as appropriate for your environment.
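As a rough illustration of what those per-host thresholds mean, here is a minimal Python sketch.  The rule table, the default levels and the function names are all hypothetical (this is not CheckMK's own rule syntax); it only shows how a warn/crit pair is picked by host and service name and then applied to a reading.

```python
import re

OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical rule table (NOT CheckMK syntax): (warn, crit), hosts, service pattern
RULES = [
    ((70, 80), {"desktop1", "desktop2"}, r"CPU"),
    ((90, 100), {"desktop1", "desktop2"}, r"GPU"),
]
DEFAULT_LEVELS = (60, 70)  # assumed defaults, purely for illustration


def levels_for(host, service):
    """Return the first (warn, crit) pair whose host and pattern match."""
    for levels, hosts, pattern in RULES:
        if host in hosts and re.search(pattern, service):
            return levels
    return DEFAULT_LEVELS


def state_for(host, service, value_c):
    """Map a temperature in Celsius to a Nagios-style state."""
    warn, crit = levels_for(host, service)
    if value_c >= crit:
        return CRITICAL
    if value_c >= warn:
        return WARNING
    return OK
```

With this table, a GPU reading of 95 C on desktop2 would be a WARNING, while the same reading on a host not listed falls back to the default levels.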

Legacy Nagios checks with CheckMK

I’ve recently started converting my old Nagios installs across to using CheckMK.  As part of this, I have a collection of old Nagios checks that I want to be able to use verbatim in CheckMK as legacy checks.  Here’s how you do that.

After you create your site using OMD, go into the site with ‘su - <sitename>’.  Then, edit etc/check_mk/main.mk and add something like this;

legacy_checks = [
  ( ( "check_solar!250!100", "Solar Output", True), [ "inverter" ] ),
]

extra_nagios_conf += r"""
  # 'check_solar' - Checks status of solar array
  # ARG1 = Warning level
  # ARG2 = Critical level
  define command{
    command_name check_solar
    command_line $USER2$/check_solar $ARG1$ $ARG2$
  }
"""

Now, put your script (in this case it’s check_solar) into local/lib/nagios/plugins/ .  What’s going on here is this;

  • Define a legacy Nagios check calling the command check_solar with parameters 250 and 100.  The check will have a description of Solar Output, outputs performance statistics, and will be assigned to the host named inverter.
  • Define a chunk of legacy Nagios config defining the check_solar command.

Then, go into your inverter host, edit services, and the manual service should appear.  Save config and you’re done!  Pretty easy.
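For anyone writing a plugin from scratch to use this way, the contract is small: print one status line (optionally followed by ‘|’ and perfdata) and exit with 0, 1 or 2.  The sketch below is a hypothetical stand-in, not the actual check_solar script; it assumes the two arguments are lower bounds in watts, so a reading at or below 250 warns and at or below 100 is critical.

```python
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def solar_check(watts, warn, crit):
    """Return (exit_code, output_line); lower readings are worse (assumed)."""
    if watts <= crit:
        code = CRITICAL
    elif watts <= warn:
        code = WARNING
    else:
        code = OK
    # Perfdata follows the 'label=value;warn;crit' convention.
    perfdata = f"output={watts};{warn};{crit}"
    return code, f"{LABELS[code]} - solar output {watts} W | {perfdata}"
```

A real plugin would print the line and pass the code to sys.exit(); how the wattage is actually read from the inverter is left out here.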

 

 

Auto-Restarting a Service with Nagios

I haven’t worked out why yet, but this seems to be a common theme – the PHP/FastCGI service dies periodically, which causes outages with my blog (Nginx does not like it if the back end goes away).  So, I need a solution to fix this.  Enter Nagios!

Nagios is able to have customized event handlers.  Those event handlers can be set up to perform any action you want – such as restarting a service.  So, we’ll use Nagios to restart the service every time it dies.

First, create a script in /usr/local/lib64/nagios/plugins/eventhandlers/restart-fastcgi ;

#!/bin/sh
#
# Restarts the php-fpm FastCGI service if it dies
#
# restart-fastcgi $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$

case "$1" in
OK)
        ;;
WARNING)
        ;;
UNKNOWN)
        ;;
CRITICAL)
        case "$2" in
        SOFT)
                case "$3" in
                3)
                        echo -n "Starting Fast-CGI service (3rd soft critical state)..."
                        sudo /sbin/service php-fpm start | /bin/mail -s "[blog.zencoffee.org] FastCGI Restarted" root
                        ;;
                esac
                ;;
        HARD)
                echo -n "Starting Fast-CGI service ..."
                sudo /sbin/service php-fpm start | /bin/mail -s "[blog.zencoffee.org] FastCGI Restarted" root
                ;;
        esac
        ;;
esac
exit 0

Ok, now we’ll need to configure sudoers to allow the nagios user to run ‘service php-fpm start’ without a password.  Add this to your sudoers with visudo;

Defaults:nagios         !requiretty,visiblepw
Cmnd_Alias      NAGIOS_START_PHPFPM = /sbin/service php-fpm start
nagios          ALL=(root)      NOPASSWD: NAGIOS_START_PHPFPM

Now, we’ll test that we can actually do it.  As root, do this;

su - nagios
/usr/local/lib64/nagios/plugins/eventhandlers/restart-fastcgi CRITICAL SOFT 3 127.0.0.1

You should then get an email sent to root containing the output of the service start command.  Obviously it won’t actually DO anything (the service is already running).  Check in your /var/log/secure that the sudo command worked.  If so, great!  Now we need to set up Nagios itself to do the restart.

First, we’ll define a command to do the restart (note, I use $USER8$ to point to the local event handlers folder);

define command{
        command_name    restart-fastcgi
        command_line    $USER8$/restart-fastcgi $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}

Then we’ll add that event handler to the service check we already have in place for checking our FastCGI service;

define service{
        use                     generic-service
        host_name               yourhostnamehere
        service_description     PHP-FPM Service
        max_check_attempts      4
        event_handler           restart-fastcgi
        flap_detection_enabled  0
        check_command           check_local_procs!0:!1:!RSDT -C php-fpm
}

After that, everything should work.  Don’t forget to restart Nagios.  Specifically, you want max_check_attempts to be at least one more than the limit you set in the script, since on the third SOFT failure it will try a restart – you probably don’t want Nagios yelling at you about a critical error (and going to a HARD state) before it’s tried a restart.  Then again, you might.  Change it as you want.

Now, we can be brave and manually stop the php-fpm service and watch Nagios to see if it restarts.  It should, after a few minutes.  You can tune the script above to make it do the restart faster (on the first soft fail, say) if you want.
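The decision the event handler makes is small enough to write down as a function.  This Python sketch mirrors the nested case statement in the shell script; the function name and the soft_attempt parameter are made up for illustration.

```python
def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the event handler should start the service.

    Arguments correspond to $SERVICESTATE$, $SERVICESTATETYPE$ and
    $SERVICEATTEMPT$ as passed by Nagios.
    """
    if state != "CRITICAL":
        return False          # OK, WARNING, UNKNOWN: do nothing
    if state_type == "HARD":
        return True           # always act on a hard critical
    # Soft critical: act only on the chosen attempt number
    return state_type == "SOFT" and int(attempt) == soft_attempt
```

Lowering soft_attempt to 1 corresponds to restarting on the first soft failure.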

Good luck!

Implementing check-by-ssh with Nagios

I wanted to get some Nagios checks running from my home Nagios box to my new VPS, and I wanted to do it via SSH (at the time I didn’t know about NSClient++ with certificates).  Fortunately, this is (reasonably!) easy to do.

First, you need a nagios account on the target server.  We’ll assume you already have one, and its shell is set to /bin/bash.  It does not need a password, and indeed it shouldn’t even have one.  We’re going to use SSH keys the whole way through.

Server Configuration

On your Nagios server, we’ll need to swap over to the nagios user and create an SSH key pair with no passphrase;

sudo su -
sudo -u nagios bash
cd
ssh-keygen
cat ~/.ssh/id_rsa.pub

With that in place, you’re ready to configure the target.  Leave this window open, and copy the printed public key to the clipboard.

Target Configuration

On your target machine, edit /etc/ssh/sshd_config and add the following;

Match User nagios
PasswordAuthentication no
RSAAuthentication yes
PubkeyAuthentication yes

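Doing the above sets things up so that the nagios user must use public key authentication when logging into the target server, and cannot use a password.  Things are more secure that way.  (On recent OpenSSH releases the RSAAuthentication option is deprecated, since it only applied to SSH protocol 1, so that line can be left out there.)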

Now, you’ll need to paste the contents of the /var/spool/nagios/.ssh/id_rsa.pub file you created on the server into authorized_keys on the target with;

sudo su -
sudo -u nagios bash
cd
mkdir ~/.ssh
cat >> ~/.ssh/authorized_keys
[PRESS CTRL-D WHEN PASTED]
chmod -R og= ~/.ssh

With that in place, you’re in a good position to test out the connection.

Testing the SSH Connection

All of the following tests will happen on the server machine, using the terminal you already have open logged in as the nagios user.

Check that you can ssh into your target as the nagios user;

ssh nagios@target.example.com

If this doesn’t work, examine the error message.  You may have port 22 blocked, the nagios user may not be allowed to log in via SSH, or the nagios user’s shell may be set to /sbin/nologin.

If this works, now try and log in with the various permutations that may be used for the hostname, eg;

ssh nagios@target
ssh nagios@192.168.1.1

Each time you should be prompted to accept the key, and do so (if the fingerprint is right).  You’re doing this to populate the known_hosts file for the nagios user on your server, so that check_by_ssh can work properly.

Now, we can test check_by_ssh directly.  Do this;

cd /usr/lib64/nagios/plugins
./check_by_ssh -H target.example.com -n target -C uptime
./check_by_ssh -H target.example.com -n target -C '/usr/lib64/nagios/plugins/check_disk -w 20% -c 10%'

You should see first the uptime of the host followed by a regular looking Nagios check for checking the local disk.  If you don’t, go check that you actually have the check_disk plugin in that location, and make sure that SELinux isn’t causing grief.

Configuring Nagios on the server

Now that you’ve established that the check_by_ssh plugin can work, you need to define a new command definition for it.  We’ll do an example for running the check_disk plugin, and assume that $USER1$ corresponds to /usr/lib64/nagios/plugins on both machines.

define command{
        command_name    check_byssh_disk
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -n $HOSTNAME$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$'
}

Now you have a new command check_byssh_disk, which works exactly like the regular check_local_disk check does, except it will run against a remote host using SSH.  The connection is made to the address field of the host definition block, and the name passed with -n comes from its host_name field.
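To see how the macros expand, here is a small Python sketch that assembles the same command line for one host.  The function name and the example values are hypothetical; the point is that everything after -C travels as a single argument and is run by the shell on the target.

```python
import shlex


def check_by_ssh_argv(address, name, plugin_dir, warn, crit, path):
    """Build the argument list Nagios would effectively run."""
    # The remote command is one string, quoted for the remote shell.
    remote = shlex.join(
        [f"{plugin_dir}/check_disk", "-w", warn, "-c", crit, "-p", path]
    )
    return [f"{plugin_dir}/check_by_ssh", "-H", address, "-n", name, "-C", remote]


argv = check_by_ssh_argv(
    "192.168.1.1", "target", "/usr/lib64/nagios/plugins", "20%", "10%", "/"
)
```

The list could be handed to subprocess.run() as the nagios user for a manual test; that step is left out so the sketch has no side effects.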

NOTE – This is a fairly simple way of getting this going, but be aware that Nagios checks via SSH are fairly resource hungry (SSH session establish/teardown is needed for every check).  There’s a better way – using NSClient++ with certificates.

Quota Usage Checking with Nagios

My ISP (Adam Internet) supplies an XML usage page, which uses Basic authentication.  You sign up on their admin page for a token, and you then use those credentials (username:token) as the creds when fetching the usage page, and you get back a whole bunch of XML.  There are various smartphone tools to use that data.  What I wanted was a Nagios check, and also for perfdata to go into pnp4nagios from that based on my Internet usage.

At this link you can find a script I wrote which does all of this.  You’ll need to install the LWP and XML::XPath modules to Perl, which you can do in CentOS with;

yum install perl-XML-XPath

LWP should be installed by default.  Usage is quite simple, just call it like this;

check_adam -u username:token -w 90 -c 95

to set warning threshold at 90% and critical threshold at 95%.  The script works by fetching the XML usage data, parsing it using XPath, and pulling out various bits of info.  It then generates a status line for Nagios similar to the following;

OK: 10 of 100 GiB (10%). 0.5 GiB today. Day 24.

This shows you absolute usage, quota, percentage usage, amount used today, and how far you are into your billing cycle.  The threshold is applied to the percentage listed there.  At some stage I’ll put in some cleverness around alerting if you’re going to run out of quota at your current usage, or something similar.
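The threshold logic and the status line can be sketched in a few lines of Python.  This is not the Perl script itself, just an illustration; the function and parameter names are made up, and fetching and parsing the XML is left out entirely.

```python
def quota_status(used_gib, quota_gib, today_gib, day, warn_pct, crit_pct):
    """Return (exit_code, status_line) from usage figures and % thresholds."""
    pct = 100.0 * used_gib / quota_gib
    if pct >= crit_pct:
        code, label = 2, "CRITICAL"
    elif pct >= warn_pct:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    line = (
        f"{label}: {used_gib:g} of {quota_gib:g} GiB ({pct:.0f}%). "
        f"{today_gib:g} GiB today. Day {day}."
    )
    return code, line
```

Called with 10 GiB used of 100, 0.5 GiB today, day 24 and thresholds of 90 and 95, it produces the example line shown above.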

Perfdata returned is quite extensive – quota, usage overall, usage today, days into cycle, SNR (both directions), Sync speed (both directions), attenuation (both directions).

Enjoy.

Getting data from HWiNFO64 to Nagios

As discussed, I recently set up Nagios for monitoring on my home network.  On my main PC, I use HWiNFO64 for keeping track of CPU temperatures and fan speeds.  I wanted a way to get HWiNFO data into Nagios, and also into Cacti for graphing performance data (in particular, temperatures).

It turns out that HWiNFO64 supports a Vista sidebar gadget.  If you enable this functionality and then enable various sensors to appear in the Gadget, what HWiNFO64 actually does is create a whole bunch of registry values and update them with the appropriate sensor telemetry.  You can then leverage this using NSClient++ and an appropriate external check script to get this data into Nagios.

You can get the external check script from my Google Code repository.  This script will go and check any HWiNFO checks you’ve named and then check them against any warning and critical thresholds you supply.  You’ll need to know the SID that your HWiNFO data is under – look in HKEY_USERS using REGEDIT, and find which SID has the \Software\HWiNFO64\VSB key in it.
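To give a feel for what a check script has to do with that key, here is a hedged Python sketch that groups the gadget values into one record per sensor.  It assumes the values are named with a numeric suffix (Sensor0, Label0, Value0, ValueRaw0 and so on); check your own registry before relying on that.  The input is a plain dict so the sketch runs anywhere; on Windows it could be filled from the key with the standard winreg module.

```python
import re

# Assumed naming scheme for the values under ...\Software\HWiNFO64\VSB
VALUE_NAME = re.compile(r"(Sensor|Label|Value|ValueRaw)(\d+)")


def group_sensors(values):
    """Turn {'Label0': ..., 'Value0': ...} into a list of per-sensor dicts."""
    sensors = {}
    for name, data in values.items():
        match = VALUE_NAME.fullmatch(name)
        if match:
            field, index = match.group(1), int(match.group(2))
            sensors.setdefault(index, {})[field] = data
    return [sensors[i] for i in sorted(sensors)]


def find_sensor(sensors, search):
    """Return the first sensor whose label contains the search text."""
    for sensor in sensors:
        if search.lower() in sensor.get("Label", "").lower():
            return sensor
    return None
```

The partial, case-insensitive match in find_sensor is one way to get the "must match at least partially" behaviour; the real script may do it differently.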

Now that’s in place, you need to edit your nsclient.ini (you did install NSClient++, right?), and add a few things.  This is what you’ll need;

[/settings/NRPE/server]
allow arguments = true

[/settings/external scripts]
allow arguments = true

[/settings/external scripts/wrappings]
vbs=cscript.exe //T:30 //NoLogo scripts\lib\wrapper.vbs %SCRIPT% %ARGS%

[/settings/external scripts/wrapped scripts]
check_hwinfo=check_hwinfo.vbs /sid:"$ARG1$" /sensor:"$ARG2$" /warn:"$ARG3$" /crit:"$ARG4$"

Now, from your Nagios box, you should be able to run something like;

/usr/lib64/nagios/plugins/check_nrpe -H YOURWINDOWSHOSTADDRESS -c check_hwinfo -a 'YOURSIDHERE' 'CPU Package,Motherboard,CPU Fan' '70,40,500:' '80,50,100:'

Your SID should look like ‘S-1-5-21-NUMBERS-NUMBERS-NUMBERS-NUMBERS’.  This example above will do the following (note that your text labels in HWiNFO64 must match at least partially what you search for above);

  • Read the ‘CPU Package’ sensor.  Return CRITICAL if it’s 80 degrees or above, WARNING if it’s 70 degrees or above
  • Read the ‘Motherboard’ sensor.  Return CRITICAL if it’s 50 degrees or above, WARNING if it’s 40 degrees or above
  • Read the ‘CPU Fan’ sensor.  Return CRITICAL if it’s under 100 rpm, or WARNING if it’s under 500 rpm
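The two threshold styles used there can be sketched in Python as follows.  This follows the behaviour described in the list above (a plain number alerts at or above it, a number with a trailing colon alerts below it) and is not a full implementation of the Nagios plugin range syntax; the function names are made up.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def breaches(value, threshold):
    """'80' alerts when value >= 80; '100:' alerts when value < 100."""
    if threshold.endswith(":"):
        return value < float(threshold[:-1])
    return value >= float(threshold)


def sensor_state(value, warn, crit):
    """Check critical first so it wins when both thresholds are breached."""
    if breaches(value, crit):
        return CRITICAL
    if breaches(value, warn):
        return WARNING
    return OK
```

So a CPU Fan reading of 450 rpm against ‘500:’ and ‘100:’ gives WARNING, and a CPU Package reading of 75 against ‘70’ and ‘80’ gives WARNING as well.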

If you see nothing, run check_nrpe as above, but use the alias_disk command instead.  If you still see nothing, there’s something wrong with your NSClient configuration.  Maybe the service isn’t running?  Or maybe you need to restart it?  Firewall issues?

Once that’s done, add a check to Nagios as you normally would, and you’ll be able to monitor your HWiNFO64 data from Nagios.  Perfdata comes through normally, so you can use that to graph stuff too.

More on pnp4nagios and Cacti shortly.

IPMI and CentOS with the N36L

Over this weekend I’ve been setting up Nagios on my Microserver.  Specifically, I want to have alerting when/if the Microserver reaches critical conditions such as RAID device failure, over temperature, on UPS, stuff like that.

For monitoring temperatures and fan speed, I want to use IPMI since I have the Remote Access Card fitted.  However, IPMI doesn’t ‘just work’ out of the box.  This post will explain how to make it work.

echo modprobe ipmi_devintf >> /etc/sysconfig/modules/ipmi.modules
echo modprobe ipmi_si >> /etc/sysconfig/modules/ipmi.modules
# the boot scripts only run *.modules files that are executable
chmod +x /etc/sysconfig/modules/ipmi.modules
echo options ipmi_si type=kcs ports=0xca2 >> /etc/modprobe.d/ipmi_si.conf
yum install -y freeipmi
modprobe ipmi_devintf
modprobe ipmi_si
ipmi-sensors

Assuming that all worked, you should see the output of your Microserver’s IPMI data, which should look something like this;

7936: Watchdog (Watchdog 2): [OK]
22599: CPU_THEMAL (Temperature): 42.00 C (NA/110.00): [OK]
22619: NB_THERMAL (Temperature): 41.00 C (NA/105.00): [OK]
22593: SEL Rate (Other Fru): 6.00 msgs (NA/90.00): [OK]
22620: AMBIENT_THERMAL (Temperature): 27.00 C (NA/45.00): [OK]
22617: EvtLogDisabled (Event Logging Disabled): [OK]
22618: System Event (System Event): [OK]
22608: SYS_FAN (Fan): 1100.00 RPM (0.00/NA): [OK]
22621: CPU Thermtrip (Processor): [OK]
1536: Sys Pwr Monitor (Power Unit): [OK]

Note the typo in CPU_THEMAL.  Good one, HP.  NB_THERMAL is the northbridge temperature, which should be pretty similar to the CPU temperature on this motherboard.  AMBIENT_THERMAL shows temperature inside the case, and SYS_FAN shows the fan speed.  The critical thresholds are listed.
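If you want to pick those readings apart yourself, a regular expression gets you most of the way.  The Python sketch below handles only the line layout shown above (other freeipmi versions print a different format), and the field names are my own.

```python
import re

# id: name (type): [reading units (low/high): ][status]
LINE = re.compile(
    r"(\d+): (.+?) \((.+?)\): "
    r"(?:([\d.]+) (\S+) \((\S+)/(\S+)\): )?"
    r"\[(.+)\]"
)


def parse_line(line):
    """Parse one ipmi-sensors line; return None if it does not match."""
    match = LINE.fullmatch(line.strip())
    if not match:
        return None
    ident, name, kind, reading, units, low, high, status = match.groups()
    return {
        "id": int(ident),
        "name": name,
        "type": kind,
        "reading": float(reading) if reading is not None else None,
        "units": units,
        "low": low,
        "high": high,
        "status": status,
    }
```

Sensors without a numeric reading (the Watchdog line, for example) come back with reading set to None.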

If you’re using Nagios, you will now want the Nagios IPMI Monitoring Plugin.  Get this, and also install perl-IPC-Run.  Run the check as root with sudo, and you’re sorted.  Better explanation of that when I get around to writing more about Nagios.

Oh, and if you’re using SELinux, be prepared to battle with it to make it work properly with Nagios plugins requiring root…

Pushover – Easy Notifications

Recently got myself a new work phone – a Galaxy S3 4G.  At the same time, I decided to do some research on what I could do about getting notifications to the phone.  Notifications of things like RAID array problems, stuff going onto UPS power, that sort of thing.

Enter Pushover.  This thing’s pretty awesome.  Basically, you buy the app for whatever device it’s going on, and then you sign into their website and get your user key.  From there, you can create an ‘application’, which is an API key you can use to send notifications from other things.  You can use all sorts of languages to send notifications, and you can even send a curl request from a normal shell script to send alerts.  Something like this;

curl -s -k \
 -F "token=YOURAPIKEYHERE" \
 -F "user=YOURUSERKEYHERE" \
 -F "message=Content Here" \
 -F "title=Title Here" \
 -F "priority=0" \
 https://api.pushover.net/1/messages.json

Fire that off, and tada!  You get an alert to your phone!  It’s pretty awesome.  You can also tie it together with IFTTT to receive notifications from all sorts of things (your favourite RSS feed getting updated, email matching specific criteria landing in your inbox etc).
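The same request can be made from Python with nothing but the standard library.  This is a minimal sketch, assuming the same form fields as the curl example; the function names are made up and the keys are placeholders.

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_request(token, user, message, title, priority=0):
    """Build the form-encoded POST without sending it."""
    data = urllib.parse.urlencode(
        {
            "token": token,
            "user": user,
            "message": message,
            "title": title,
            "priority": str(priority),
        }
    ).encode("utf-8")
    return urllib.request.Request(PUSHOVER_URL, data=data, method="POST")


def send(token, user, message, title, priority=0):
    """Send the notification and return the HTTP status code."""
    request = build_request(token, user, message, title, priority)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send("YOURAPIKEYHERE", "YOURUSERKEYHERE", "Content Here", "Title Here") would fire the alert; build_request is kept separate so the payload can be inspected without touching the network.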

Give it a go.