Splunkd High CPU after leap second addition?

My alerting system yelled at me about high CPU load on my Splunk Free VM.

A bit of examination revealed that the load average was indeed abnormally high (around 10), although nothing else appeared to be wrong.  Then a quick look at dmesg made the penny drop:

Jan 1 10:29:59 splunk kernel: Clock: inserting leap second 23:59:60 UTC

Err.  The high CPU load average started at 10:30am, right when the leap second was added.

A restart of all the services resolved the issue.  Load average is back down to its normal levels.

Netflow Collector on Splunk – Interesting Bug

The Splunk Add-on for Netflow appears to have a bug.  If you run through the configure.sh script and accept all the defaults, it refuses to ingest any Netflow data.

This is because the add-on’s cleanup script deletes all ASCII netflow data that is more than -1 days old, which matches every file, including the ones that have just been written.

You can easily fix this by either rerunning configure.sh and typing in every value explicitly, or editing /opt/splunk/etc/apps/Splunk_TA_flowfix/bin/flowfix.sh and changing the following line:

# Cleanup files older than -1
find /opt/splunk/etc/apps/Splunk_TA_flowfix/nfdump-ascii -type f -mtime +-1 -exec rm -f {} \;

Change the +-1 to +1.  This tells the script to clean up all ASCII netflow data older than 1 day (i.e., rather than everything older than some point in the future).
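If you’d rather not edit the file by hand, the same change can be scripted.  This is only a rough sketch: fix_flowfix_cleanup is a helper name I’ve made up for illustration (it isn’t part of the add-on), and it assumes the cleanup line looks exactly like the one quoted above.

```shell
#!/bin/sh
# Patch the cleanup line in flowfix.sh so it removes files older than 1 day
# instead of "older than -1 days" (which matches every file).
# fix_flowfix_cleanup is a hypothetical helper name, not part of the add-on.
fix_flowfix_cleanup() {
    script="$1"
    # Keep a backup next to the original before changing anything.
    cp "$script" "$script.bak" || return 1
    # Rewrite only the broken age test; every other line is copied unchanged.
    sed 's/-mtime +-1/-mtime +1/' "$script.bak" > "$script"
}
```

Call it as fix_flowfix_cleanup /opt/splunk/etc/apps/Splunk_TA_flowfix/bin/flowfix.sh, then compare flowfix.sh against flowfix.sh.bak to check that only the one line changed.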

Splunk integration with Docker

I’ve changed over my log aggregation system from the Elastic Stack to Splunk Free over the past few days.  The primary driver for this is that I use Splunk at work, and since Splunk Free allows 500 MB/day of ingestion, that’s plenty for all my home stuff.  So, using Splunk at home means I gain valuable experience for using Splunk professionally.

What we’ll be talking about here is how you integrate your Docker logging into Splunk.

Configure an HTTP Event Collector

Firstly, you’ll need to enable the Splunk HTTP Event Collector.  In the Splunk UI, click Settings -> Data Inputs -> HTTP Event Collector -> Global Settings.

Click Enabled alongside ‘All Tokens’, and enable SSL.  This will enable the HTTP Event Collector on port 8088 (the default), using the Splunk default certificate.  This isn’t enormously secure (you should use your own cert), but this’ll do for now.

Now, in the HTTP Event Collector window, click New Token and add a token.  Give it whatever details you like, and set the source type to json_no_timestamp.  I’d suggest you send the results to a new index, for now.

Continue through the wizard, and you’ll get an access token.  Keep it; you’ll need it later.
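Before touching Docker, it’s worth checking that the collector actually accepts the token.  The sketch below posts a single test event with curl.  send_hec_test_event is a name I’ve made up for illustration, the host and token are placeholders you supply yourself, and port 8088 assumes you kept the default.

```shell
#!/bin/sh
# Send one test event to the Splunk HTTP Event Collector.
# send_hec_test_event is a hypothetical helper; host and token are placeholders.
send_hec_test_event() {
    splunk_host="$1"
    hec_token="$2"
    # -k skips certificate verification, which is only needed while Splunk
    # is still using its default certificate.
    curl -k "https://${splunk_host}:8088/services/collector/event" \
        -H "Authorization: Splunk ${hec_token}" \
        -d '{"event": "HEC test event", "sourcetype": "json_no_timestamp"}'
}
```

Run it as send_hec_test_event PUTYOURSPLUNKHOSTHERE PUTYOURTOKENHERE.  If the token is good, Splunk should answer with a short JSON response and the event should turn up in the index you chose for the token.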

Configure Docker Default Log Driver

You now need to configure the default logging method used by Docker.  NOTE – Doing this will break the docker logs command, but you can find everything in Splunk anyway.  More on that soon.

You will need to override the startup command for dockerd to include some additional options.  You can do this on CentOS 7 by creating the file /etc/systemd/system/docker.service.d/docker-settings.conf with the following contents:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --log-driver=splunk --log-opt splunk-token=PUTYOURTOKENHERE --log-opt splunk-url=https://PUTYOURSPLUNKHOSTHERE:8088 --log-opt tag={{.ImageName}}/{{.Name}}/{{.ID}} --log-opt splunk-insecureskipverify=1

The options should be fairly self-explanatory.  The tag= option configures the tag that is attached to the JSON objects emitted by Docker, so that it contains the image name, container name, and unique ID of the container.  By default it’ll be just the unique ID, which frankly isn’t very useful post-mortem.  The last option skips certificate verification, which is what allows the use of the default Splunk SSL certificate.  Get rid of this option when you use a proper certificate.

Getting the driver in place

Now you’ve done that, you should be able to restart the Docker host, then reprovision all the containers to change their logging options.  In my case, this is a simple docker-compose down followed by docker-compose up, after a reboot.
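If a full reboot is inconvenient, reloading systemd and restarting the Docker service should be enough to pick up the drop-in file.  A minimal sketch, assuming the service is called docker and you’re running as root.  reload_docker_logging is a name I’ve made up for illustration.

```shell
#!/bin/sh
# Apply the systemd drop-in and confirm Docker picked up the new default driver.
# reload_docker_logging is a hypothetical helper name; run it as root.
reload_docker_logging() {
    # Make systemd re-read its unit files, including the new drop-in.
    systemctl daemon-reload || return 1
    # Restart dockerd so it starts with the new ExecStart line.
    systemctl restart docker || return 1
    # Print the default logging driver now in effect.
    docker info --format '{{.LoggingDriver}}'
}
```

The last command should print splunk once the new options are in effect.  Existing containers keep the logging settings they were created with, which is why they still need to be reprovisioned.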

The docker logs command will be broken now, but you can instead use Splunk to replicate the functionality, like this;

index=docker host=dockerhost | spath tag | search tag="*mycontainer*" | table _time,line

With the search time range set to the last 60 minutes, that will return the logs for the container mycontainer running on the host dockerhost.

You can then start doing wizardry like this;

index=docker | spath tag | search tag="nginx*" 
| rex field=line "^(?<remote_addr>\S+) - (?<remote_user>\S+) \[(?<time_local>.+)\] \"(?<request>.+)\" (?<status>\d+) (?<body_bytes>\d+) \"(?<http_referer>.+)\" \"(?<http_user_agent>.+)\" \"(?<http_x_forwarded_for>.+)\"$"
| rex field=request "^(?<request_method>\S+) (?<request_url>\S+) (?<request_protocol>\S+)$"
| table _time,tag,remote_addr,request_url

This dynamically parses the NGINX container logs emitted by Docker, splits them up into fields, and then lists the time, tag, remote IP, and requested URL for each request.

I’m sure there are better ways of doing this (such as parsing the logs at index time instead of at search time), but this way works pretty well and should function as a decent starting point.