How many times do you SSH into your remote servers and cat, grep, tail, and awk through your logs? That might work for three servers running a handful of services, but if you have more, you should definitely spend some time centralizing your logs.

I personally prefer Graylog2, which deals very well with different log formats like GELF and the various Syslog RFCs. Just start a listener for the format and forward the messages to your Graylog2 instance.

A few of my servers are NAT'ed behind VDSL lines with dynamic IPs; some physical and virtual servers are hosted elsewhere with static public IPs. All of them run Linux. This setup normally makes monitoring hard, so I use a fully meshed tinc VPN: I don't have to deal with securing many different applications and protocols, and everything is reachable in a flat private /24 IPv4 network. I manage the basic server configuration with Ansible, and most services run as containers using Docker.

What is the benefit?

  • The GELF listener allows me to use the Docker built-in GELF logging driver to easily centralize containerized application logs
  • Log4j2 provides a GELF output as well which gives me full access to my logs from my OpenNMS instances
  • Graylog2 can parse Syslog in its various RFC flavors, which gives me centralized system logs from my Linux systems
  • It is very easy to deploy the configurations with Ansible

Setting up a Graylog2 Service Stack

Define a service stack using Docker Compose to get Graylog2 up and running. Here is the docker-compose.yml file I use; just download it and run docker-compose up -d.

The following ports get exposed:

  • 9000: The Graylog2 web application
  • 12201: GELF UDP listener for my Java applications
  • 514: GELF UDP Syslog listener to forward my system logs
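The exact Compose file isn't reproduced here, but a minimal sketch of such a stack looks roughly like this. Service names, image tags, and secrets are assumptions for illustration; Graylog2 needs MongoDB and Elasticsearch as backing services, and the root password hash below is the SHA-256 of "admin", which you should replace:

```yaml
version: '2'
services:
  mongo:
    image: mongo:3
  elasticsearch:
    image: elasticsearch:2
    command: elasticsearch -Des.cluster.name=graylog
  graylog:
    image: graylog2/server
    environment:
      # Replace both secrets with your own values
      GRAYLOG_PASSWORD_SECRET: somepasswordpepper
      GRAYLOG_ROOT_PASSWORD_SHA2: 8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918
    links:
      - mongo
      - elasticsearch
    ports:
      - "9000:9000"       # web interface
      - "12201:12201/udp" # GELF UDP input
      - "514:514/udp"     # Syslog UDP input
```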

Configure a GELF UDP and a Syslog UDP Input

After the first login to Graylog2 you have to create two inputs. The GELF UDP input receives log messages from my OpenNMS applications and Docker daemons, and the Syslog UDP input receives my Syslog messages.

Graylog2 Input Configuration

Configure Syslog forwarder

My systems run rsyslog, so rsyslogd needs to be configured. I use Ansible to create a file in /etc/rsyslog.d/50-graylog-forwarding with the following content:

if $programname == 'snmpd' and $msg contains 'statfs' then {
    stop
}

*.* @;RSYSLOG_SyslogProtocol23Format

My Graylog2 instance is running and listening on 514/udp. After restarting rsyslog the logs are forwarded. The if-statement at the top ensures I don't log a lot of garbage coming from snmpd, which is described in more detail in this blog post.

Configure OpenNMS to forward logs to Graylog2

Modify ${OPENNMS_HOME}/etc/log4j2.xml and add a GELF UDP log appender, which is described in our OpenNMS Wiki.
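The wiki has the details; as a rough sketch, recent Log4j2 versions ship a built-in GelfLayout that can be combined with a UDP Socket appender. The host name here is a placeholder, and the exact attributes may differ from what the wiki recommends:

```xml
<!-- Hypothetical appender snippet for log4j2.xml -->
<Socket name="gelf" host="graylog.example.com" port="12201" protocol="udp">
  <GelfLayout compressionType="ZLIB"/>
</Socket>
```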

All the daemons will now also forward their logs to Graylog2 via UDP, and they are searchable by service, node label, node id and daemon. Different OpenNMS instances running various versions are identified by an application_name tag.

Graylog2 Screenshot

Forward Docker Container logs

To get logs from my applications running in Docker containers, I have configured them to use the GELF driver by adding this snippet to my service definition:

  logging:
    driver: "gelf"
    options:
      gelf-address: "udp://"
      tag: "horizon-core-web:stable"

The tag is used to identify the service (you can be as creative as you want), and the gelf-address tells the Docker daemon where to forward the messages: my Graylog2 GELF listener input.

Happy logging and searching

To monitor your systems you rely heavily on SNMP; out of the box it gives you a lot of possibilities for getting important performance and status information.

The main topic of security is often not considered. SNMP versions 1 and 2c transmit everything in plain text over the wire. There is also no user/password authentication method, just a shared community string which grants access to the information. To address these problems, SNMP v3 was introduced.

The Linux Net-SNMP agent supports SNMP v3 and OpenNMS does as well, so nothing prevents us from using encryption and user authentication.

WARNING: I assume Net-SNMP uses SHA-1, which is not considered secure anymore. As far as I know today, there is no Net-SNMP implementation available which supports SHA-2 with a 256-bit hash.

Nevertheless, here is the way to configure SNMP v3. It is still better than sending everything over the wire in plain text. In critical environments, I would definitely consider adding mechanisms on the network layer to isolate and protect the management network from the rest of the world to reduce the attack surface.

Make your Net-SNMP configuration modular

Today, people run configuration management tools to roll out configurations to a lot of systems. Net-SNMP lets you use an include drop-in folder to extend the default configuration, which is very handy for including device-dependent configuration snippets.

All you have to do is add the following line to your snmpd.conf:

includeDir /etc/snmp/conf.d

All files ending in .conf will now be added to your Net-SNMP configuration. This makes it easy for configuration management tools to add device-dependent disk, process or log monitoring directives without mangling one large snmpd.conf with variables.
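For example, a device-dependent drop-in could look like this; the file name and thresholds are made up for illustration:

```
# /etc/snmp/conf.d/disk.conf
# Raise an error when the root filesystem has less than 10% free space
disk / 10%

# Make sure the sshd process is running
proc sshd
```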

How to configure Net-SNMP with SNMP v3

The first step is to create a user with a password and tell the agent which methods to use for encryption and authentication:

createUser monitor SHA 0p3nnm5423 AES opennmsopennms
rouser monitor priv

This creates a user named monitor and uses SHA for the message authentication code. For encryption you have the choice between DES and AES; I would recommend the newer AES encryption method. I can also recommend using something like apg to create better passwords.

Once you have added the configuration, restart the Net-SNMP daemon; you can then test it with the following command:

snmpget -v 3 -u monitor -l authPriv -a SHA -A 0p3nnm5423 -x AES -X opennmsopennms localhost SNMPv2-MIB::sysLocation.0

You should be able to get the system location. Next, you can configure OpenNMS to use SNMP v3 for your IP address or a whole range in the Web UI by going to "Admin -> Configure SNMP Community by IP".

That's it – happy monitoring.

Centralizing logs is important as soon as you have more than two servers. In my environment the bare metal is monitored with Net-SNMP and my services are deployed as containers with Docker. All system logs are sent to a Graylog2 instance, and I quickly noticed a few ugly entries caused by snmpd.

Cannot statfs /run/docker/netns/...: Permission denied

You will notice quite a few of them. My first approach was to adjust the log level of the SNMP daemon in /etc/default/snmpd with

SNMPDOPTS='-Ls3d -Lf /dev/null -u snmp -g snmp -I -smux,mteTrigger,mteTriggerConf -p /run/'

The Net-SNMP man page describes the logging options; with -Ls3d I raised the threshold to "Error" instead of "Warning", but it didn't help. Researching the web, I found this topic in Red Hat's Bugzilla.

It turns out snmpd reads /proc/mounts, runs statfs on each entry and logs an error. One of the commenters found a solution: use rsyslog to filter this type of message with:

if $programname == 'snmpd' and $msg contains 'statfs' then {
    stop
}

The result is now a much cleaner log with less garbage.

Happy Logging

As most of us have noticed, a few companies changed our perspective on how to develop software and deploy it as a service. There are quite a few differences between selling a box with 10 CDs every year and developing and delivering your software as a service. This article is a collection of thoughts and ideas I had and wanted to write down.

Who cares about a version number?

Users don't give a shit about version numbers anymore; everything that matters needs to be focused on the user. Great user experience, functionality and a good "Effort-to-Outcome" ratio for solving their problems will make your software successful.

Usability improvements, features and fixes are delivered immediately, and this is where all the fuss about continuous delivery and the DevOps culture kicks in.

Pets and Cattle

Virtualisation technology forced hardware manufacturers to change their mindset and make their boxes behave like cattle instead of pets. The same will happen to Linux distributions and configuration management tools with all the fuss about containers; I believe they haven't really noticed yet. If you want hardware as a commodity, you need the Linux kernel as a commodity, and the diversity and history of Linux distributions are in your way.

All the ugly ifdefs in configuration management tools just to make your service run on a specific distribution using yum, apt, or apk, and the nasty glue you have to write to configure your service to run in a container, are still painful. Applications often aren't built to run in such environments and you have to hack a lot of stuff, but that is a whole different story.

Monitoring is important

Everybody tells you monitoring is important. You can only improve what you measure, and you need to know where things go downhill.

Current monitoring tools have fallen behind today's needs. Most of them only let you think in terms of bare-metal boxes like hosts and IP addresses. Some tools do only one part, performance management XOR fault management, but you need both. If you run two tools, you have to maintain them and glue them together, and you want alerting without maintaining it twice.

Monitoring needs to change

Monitoring tools need to change to become part of the software development and service deployment process.

Most of them are built by people with an operations background, which is often quite good, but not with a software development background. We need more software development expertise in monitoring tools.

We should make monitoring part of our test suites. Why not define an "Operation Test" after an "Integration Test" and let it run in your monitoring tool? Additionally, monitoring people should make clear to users what the difference is between "Performance Management" and "Application Profiling".

Monitoring people should adopt terms like whitebox and blackbox testing for operational services. For example, when you test the status code of a web application's landing page, it is a blackbox test. When you measure internal application-specific metrics with JMX, you have a whitebox test.

Fight against Alarm Fatigue

Monitoring applications tend to overmonitor your environment by default. You measure a lot and it tells you a lot, but you overlook the important things in all the noise. The signal-to-noise ratio is too low and people develop alarm fatigue. Rule #1 in alerting: "Only notify someone when human interaction is really necessary."

Applications and Monitoring

With services deployed in containers, the whole idea of provisioning needs to change. Monitoring tools should allow you to model an "Application Service" with associated performance and availability metrics, and alerting should also be possible on those "Application Services". Performance metrics and operational tests are driven by high-level services, mostly through REST APIs. Containers will come and go, providing resources to this "Application Service"; they can no longer be treated as long-living hosts with statically assigned IP addresses.

Monitoring need to be more intelligent

When talking about intelligence, everybody thinks of Artificial Intelligence. In monitoring it is much simpler, because current tools are ridiculously stupid and you don't have to throw AI at the problem. Diagnose from the bottom up, from low complexity to high complexity, which also means from cheap to expensive in terms of required hardware and network resources. We want to monitor high-level services; a monitoring tool can diagnose a problem by itself and provide a lot of useful information. For example:

We test an HTTPS service for a 200 OK response with a timeout of 2 seconds.

Instead of just reporting "Service Down", the monitoring tool can diagnose the problem itself with a few cheap and simple tests, from the perspective of the monitoring system, to give a NOC engineer an overview of what went wrong and save him time. For the test above: was the connection refused, or was the HTTP status code just something other than 200 OK?

Connection refused

  1. Can IPv4/IPv6 addresses be looked up for the host name?
  2. Can the IPv4/IPv6 addresses be reached via ICMP?
  3. If not, what is the traceroute output for the IPv4/IPv6 addresses?
  4. If possible, give me a link to the logs of the last-hop nodes in the time range +/- the service polling interval
  5. If they can be reached, is TCP port 443 open?
  6. If not, give me a link to the warning+ logs of the web server in the time range +/- the service polling interval

Not 200 OK

  1. Use the resolved IPv4/IPv6 address of the web server and give me a link to the warning+ logs of the web server in the time range +/- the service polling interval

You don't need a rocket scientist for this.
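The bottom-up idea can be sketched in a few lines. This is an illustration, not a real monitoring tool: the function name and return shape are my own, and the ICMP/traceroute and log-link steps are left out because raw sockets need privileges.

```python
import socket

def diagnose(hostname, port=443, timeout=2.0):
    """Run cheap checks first and stop at the first hard failure.

    Returns a list of (check, ok) tuples. A real tool would ping,
    traceroute and link to logs between the DNS and TCP steps.
    """
    results = []
    # 1. Cheapest: can the host name be resolved to IPv4/IPv6 addresses?
    try:
        socket.getaddrinfo(hostname, port)
        results.append(("dns", True))
    except socket.gaierror:
        results.append(("dns", False))
        return results  # everything else depends on name resolution
    # 2. More expensive: can a TCP connection be opened on the service port?
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            results.append(("tcp", True))
    except OSError:
        results.append(("tcp", False))
    return results
```

The ordering is the whole point: each step only runs if the cheaper one before it succeeded, so the report already tells you where in the stack things broke.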

Many of these things are configured as permanent monitoring, e.g. ICMP or DNS lookups, but are mostly only necessary when you need to diagnose a problem. You only care about them when a high-level application service fails. You can do a similar thing with response times. You care about your application response time over a longer period of time; only when it goes through the roof do you have to ask immediately: was the network path slow (ICMP)? Was the name lookup slow (DNS)? Was the web server response slow (HTTP)?

While building Docker executables, I ran into an interesting corner case. Fortunately the Docker IRC channel helped me investigate, with special credits to Ravensoul.

When you build a container as an executable, you can use ENTRYPOINT for the binary to execute and CMD as a default, overridable argument. In most cases the CMD is the --help argument, to provide useful default behavior when you just run the container without specifying anything.

In my case I built a Ruby-based executable, and because I need the environment variables, I used bash -c <command> as the ENTRYPOINT and --help as the default CMD argument, like this:

ENTRYPOINT ["/bin/bash", "-c", "/path/to/myRuby"]

CMD ["--help"]

I noticed the --help argument was not used when just running the container. To verify the problem in an isolated environment, I created a small example for investigation:

FROM alpine

# Note: Alpine ships /bin/sh (BusyBox) rather than bash;
# the positional-parameter behavior shown below is the same.
ENTRYPOINT ["/bin/sh", "-c", "ps"]

CMD ["--help"]

When I ran this container I noticed the ps command was executed, but without the --help argument. It turned out the problem is using /bin/bash -c as the ENTRYPOINT: when you execute /bin/bash -c 'echo ${0}' myFirstArgument, you will notice that myFirstArgument becomes ${0}, which is normally the name of the script itself.

From the bash man page:

If there are arguments after the string, they are assigned to the positional parameters, starting with $0
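You can see this behavior directly on the command line; the first word after the command string lands in $0, not $1:

```shell
/bin/bash -c 'echo "0=$0 1=$1"' first second
# prints: 0=first 1=second
```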

To get around this problem, I wrapped my command in an entrypoint script and used "${@}" to pass all arguments, which fixed my problem.

Happy dockering.