2016-04-19

Monitoring, metrics collection and visualization using InfluxDB and Grafana

In addition to providing you the Aiven service, our crew also does a fair amount of software consulting in the cloud context. A very common topic we are asked to help with is metrics collection and monitoring. Here's a walkthrough of how to use InfluxDB and Grafana for one kind of solution to the problem. We offer both as managed Aiven services for quick and easy adoption.

Case example

As an example, here's a dashboard screenshot from a pilot project we built recently for our customer:



This particular instance is used to monitor the health of a quality assurance system on an industrial manufacturing line. The system being monitored uses IP-based video cameras coupled with IO triggers to record a JPEG image of the artifacts as they pass through the various processing stages. The resulting dashboard allows us to verify at a single glance that the system is working properly.

At the top of the dashboard you'll see a simple reading from the device's temperature sensor. Any large deviation from the norm would be a good warning of an oncoming hardware fault:




The next plotted metric is the size of the JPEG compressed image from the imaging device:




Interestingly, this relatively simple metric reveals a lot about the health of both the sensor and any lenses and lighting sources involved. Due to the nature of JPEG encoding, the size of the frame varies slightly even in rather static scenes, so it makes a good quick overall indicator that the component is running fine and returning up-to-date content.
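As a minimal sketch of how this observation could be turned into an automated check (this is not code from the pilot project), the following function flags a feed as suspect when the last few JPEG sizes are all identical. The function name `looksFrozen` and the default window of five samples are assumptions chosen for illustration.

```javascript
// Hypothetical helper: returns true when the most recent `window` sizes
// are all identical, which suggests the camera is returning a stale frame.
function looksFrozen(sizes, window = 5) {
    if (sizes.length < window) {
        return false; // not enough samples to judge
    }
    const recent = sizes.slice(-window);
    return recent.every((size) => size === recent[0]);
}

console.log(looksFrozen([48436, 48502, 48391, 48440, 48477])); // false
console.log(looksFrozen([48436, 48436, 48436, 48436, 48436])); // true
```

A real deployment would probably want a longer window or a tolerance band, since a perfectly static scene under stable lighting could occasionally produce repeated sizes.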

The two graphs at the bottom track the last update time from each of the camera feeds and each of the IO trigger services respectively:
  

 
Here, we expect each of the cameras to update several times a second. The IO triggers are interrogated in long-poll mode with a timeout of 15 seconds. These expectations give natural upper bounds on the update age for monitoring and alerting purposes. In fact, the left-hand side shows two readings that correlate with temporary network glitches.
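To illustrate how such limits might drive an alert, here is a small sketch of a staleness check. The 15-second limit for IO triggers comes from the long-poll timeout mentioned above; the 1-second limit for cameras, the feed kind names and the function name are assumptions made for this example only.

```javascript
// Maximum allowed age of the last update, in seconds.
// io_trigger: taken from the 15 s long-poll timeout.
// camera: assumed placeholder for "several times a second".
const MAX_AGE_SECONDS = { camera: 1, io_trigger: 15 };

// Returns true when the feed has not updated within its allowed limit.
function isStale(kind, lastUpdateMs, nowMs = Date.now()) {
    const limit = MAX_AGE_SECONDS[kind];
    if (limit === undefined) {
        throw new Error('unknown feed kind: ' + kind);
    }
    return (nowMs - lastUpdateMs) / 1000 > limit;
}

const now = Date.now();
console.log(isStale('io_trigger', now - 20000, now)); // true
console.log(isStale('camera', now - 200, now));       // false
```

In practice some slack on top of the raw limits is likely needed to absorb network jitter.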


Building blocks

The visualization and dashboard tool shown above is Grafana. The underlying storage for the telemetry data is InfluxDB. In this case, we use Telegraf as a local StatsD-compatible collection point for capturing the data and transmitting it securely to the InfluxDB instance. Finally, we use a number of taps and sensors across the network that feed samples to Telegraf using StatsD client libraries in Node.js, Python and Java, depending on the component.

In this project we are using Aiven InfluxDB and Aiven Grafana hosted services, but any other InfluxDB / Grafana should work more or less the same way.

InfluxDB - The metrics database

We start by launching an InfluxDB service in Aiven:


The service is automatically launched in a minute or so.

InfluxDB is a time-series database with some awesome features:
  • Adaptive compression algorithms allow storing huge numbers of data points
  • Individual metrics data points can be tagged with key=value tags and queried based on them
  • Advanced query language allows queries whose output data requires little or no post-processing
  • It is FAST!
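To make the tagging and query language points a little more concrete, here is a sketch that builds an InfluxQL query averaging one sensor's readings per minute. It assumes the gauge readings are stored in a field named "value" under a measurement called image_size with a source tag; check your own schema before relying on those names. The helper function itself is hypothetical and does no escaping, so it should only be fed trusted input.

```javascript
// Builds an InfluxQL query that averages a field per time bucket,
// filtered by a single tag value.
// Assumption: readings are stored in a field named "value".
function buildMeanQuery(measurement, tagKey, tagValue, range, bucket) {
    return 'SELECT mean("value") FROM "' + measurement + '"' +
        ' WHERE "' + tagKey + '" = \'' + tagValue + '\'' +
        ' AND time > now() - ' + range +
        ' GROUP BY time(' + bucket + ')';
}

console.log(buildMeanQuery('image_size', 'source', '30', '1h', '1m'));
// SELECT mean("value") FROM "image_size" WHERE "source" = '30' AND time > now() - 1h GROUP BY time(1m)
```

The result is already bucketed and averaged by the database, which is the "little or no post-processing" benefit in practice.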


Telegraf - The metrics collector

Next we will need the connection parameters for our InfluxDB instance. The information required for connecting our Telegraf collection agent to InfluxDB (hostname, username, password, etc.) can be found on the service overview page:




We typically run a single Telegraf container per environment. In order to make Telegraf talk to our InfluxDB and to accept StatsD input, we will need to modify its configuration file telegraf.conf a little bit and add the following sections:


    [outputs]
        [outputs.influxdb]
        url = "https://teledb.htn-aiven-demo.aivencloud.com:21950"
        database = "dbb253c1e025704a4494f3f65412b70e30"
        username = "usr2059f5ef88fb46e49bd1f5fd0d464d80"
        password = "password_goes_here"
        ssl_ca = "/etc/telegraf/htn-aiven-demo-ca.crt"
        precision = "s"

    [inputs]
        [inputs.statsd]
        service_address = "127.0.0.1:8125"
        delete_gauges = true
        delete_counters = true
        delete_sets = false
        delete_timings = true
        percentiles = [90]
        allowed_pending_messages = 10000
        percentile_limit = 1000

We want our InfluxDB connection to be secure against man-in-the-middle attacks, so we have included the service's CA certificate in the configuration file. This will force the InfluxDB server to prove its identity to our Telegraf client. The certificate can be downloaded from the Aiven web console:
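The same certificate can be used by any other client that talks to the service. As a hedged sketch, the Node.js snippet below builds HTTPS request options that trust only the downloaded CA certificate and rejects servers that fail verification. It assumes the service exposes InfluxDB's standard /ping endpoint; the helper names are made up for this example.

```javascript
const fs = require('fs');
const https = require('https');

// Builds request options that trust only the given CA certificate (PEM).
function buildPingOptions(host, port, caPem) {
    return {
        host: host,
        port: port,
        path: '/ping',            // assumed: standard InfluxDB ping endpoint
        method: 'GET',
        ca: caPem,                // replaces the default list of trusted CAs
        rejectUnauthorized: true  // fail if the server certificate does not verify
    };
}

// Issues the request; calls back with an error or the HTTP status code.
function pingInfluxDB(options, callback) {
    const req = https.request(options, (res) => {
        res.resume(); // discard the response body
        callback(null, res.statusCode);
    });
    req.on('error', callback);
    req.end();
}

// Example usage (not executed here):
// const ca = fs.readFileSync('/etc/telegraf/htn-aiven-demo-ca.crt');
// pingInfluxDB(buildPingOptions('teledb.htn-aiven-demo.aivencloud.com', 21950, ca), console.log);
```

If someone intercepts the connection and presents a certificate not signed by this CA, the request fails with an error instead of silently sending credentials.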




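Here's an example StatsD code snippet for a Node.js component:

    var statsd = require('node-statsd');

    // Point the client at the Telegraf StatsD listener
    var statsd_client = new statsd({
        host: '<telegraf_ip>',
        port: 8125,
    });

    // Report the JPEG size as a gauge, tagged with the source sensor
    statsd_client.gauge('image_size,source=30', 48436,
        function(error, bytes) {
            // Close the UDP socket once the sample has been sent
            statsd_client.close();
        }
    );
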
The StatsD UDP protocol uses a very simple textual message format, and sending a metric takes only a few CPU cycles, so even a high-throughput server can transmit metrics for every request it processes without hurting overall performance. The StatsD receiver in Telegraf parses the incoming messages and consolidates the metrics, typically writing data to the metrics database at a much slower pace. This helps keep the load on both the source application and the metrics database under control.
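To show just how simple the wire format is, here is a sketch that formats and sends a gauge sample using only Node's built-in dgram module, without a client library. The function names are made up for this example, and the host and port passed to `sendGauge` are whatever your Telegraf listener uses.

```javascript
const dgram = require('dgram');

// Formats one gauge sample in the StatsD line format: <name>:<value>|g
function formatGauge(name, value) {
    return name + ':' + value + '|g';
}

// Sends a single datagram to a StatsD listener and closes the socket.
function sendGauge(host, port, name, value, callback) {
    const socket = dgram.createSocket('udp4');
    const payload = Buffer.from(formatGauge(name, value));
    socket.send(payload, port, host, (err) => {
        socket.close();
        if (callback) {
            callback(err);
        }
    });
}

console.log(formatGauge('image_size,source=30', 48436));
// image_size,source=30:48436|g

// Example usage (not executed here):
// sendGauge('127.0.0.1', 8125, 'image_size,source=30', 48436);
```

Because UDP is fire-and-forget, the sender never waits for the collector, which is why the per-request overhead stays so low.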

In the above code sample, we use Telegraf's StatsD extension for tagging support with the source=30 parameter. This handy little feature is what allows us to easily slice and display the collected metrics by sensor, or just plot all metrics regardless of the source sensor. This is one of the killer features of InfluxDB and Telegraf!
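Conceptually, the tagged metric name from the code sample splits into a measurement name and a set of key=value tags. The sketch below mimics that split to show what ends up in the database; it is an illustration only, not Telegraf's actual parsing code, and the function name is hypothetical.

```javascript
// Splits a name such as "image_size,source=30" into a measurement
// name and a map of tags. Pairs without "=" are ignored.
function parseTaggedName(bucket) {
    const parts = bucket.split(',');
    const tags = {};
    for (const pair of parts.slice(1)) {
        const idx = pair.indexOf('=');
        if (idx > 0) {
            tags[pair.slice(0, idx)] = pair.slice(idx + 1);
        }
    }
    return { measurement: parts[0], tags: tags };
}

console.log(JSON.stringify(parseTaggedName('image_size,source=30')));
// {"measurement":"image_size","tags":{"source":"30"}}
```

Every sensor then writes to the same image_size measurement, and the source tag is what a query filters or groups on.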

OK, so now we are transmitting metrics from our application through the Telegraf daemon to our InfluxDB database. Next up is building a Grafana dashboard that visualizes the collected data.

Grafana - The dashboard

We launch our Grafana from the Aiven console by creating a new service:




Normally an InfluxDB instance needs to be added manually as a data source in Grafana. In this case, however, we can skip that step, as InfluxDB and Grafana services launched under the same project in Aiven are automatically configured to talk to each other.

We like Grafana a lot because it makes it simple to define visually appealing, yet useful graphs and it integrates with InfluxDB well. Grafana has a user-friendly query builder specifically for building queries for InfluxDB, and with a little practice it takes little time to conjure fabulous charts from almost any source data.

The Grafana web URL, username and password are available on the service overview page:




Opening Grafana in the browser, logging in with the credentials from above and defining a simple graph with an InfluxDB query editor... PROFIT!




That's it for now. Getting application metrics delivered from the application to a pretty dashboard doesn't take much effort nowadays!

What next?

We use Telegraf, InfluxDB and Grafana rather extensively in our own Aiven monitoring infrastructure. We have also added a couple more components to the stack, such as Apache Kafka, but that is a topic for an upcoming blog post. Stay tuned! :-)


Hosted InfluxDB and Grafana at Aiven.io

InfluxDB and Grafana are available in our Aiven.io service. You can sign up for a free trial at aiven.io.

Have fun monitoring your apps!

Cheers,

    Team Aiven
