2016-04-28

PostgreSQL cloud backups with PGHoard

PGHoard is the cloud backup and restore solution we're using in Aiven. We started PGHoard development in early 2015 when the Aiven project was launched as a way to provide real-time streaming backups of PostgreSQL to a potentially untrusted cloud object storage.

PGHoard has an extensible object storage interface, which currently works with the following cloud object stores:
  • Amazon Web Services S3
  • Google Cloud Storage
  • OpenStack Swift
  • Ceph's RADOSGW utilizing either the S3 or Swift drivers 
  • Microsoft Azure Storage (currently experimental)

Data integrity

PostgreSQL backups consist of full database backups ("basebackups") plus the write-ahead log ("WAL") and related metadata. Both basebackups and WAL are required to restore a consistent copy of the database.

PGHoard handles the full, periodic backups (driving pg_basebackup) as well as streaming the write-ahead log of the database.  Constantly streaming WAL as it's generated allows PGHoard to restore a database to any point in time since the oldest basebackup was taken.  This is used to implement Aiven's Database Forks and Point-in-Time Recovery as described in our PostgreSQL FAQ.

To save disk space and reduce the data that needs to be sent over the network (potentially incurring extra costs), backups are compressed by default using Google's Snappy, a fast compression algorithm with a reasonable compression ratio. LZMA (a slower algorithm with a very high compression ratio) is also supported.

To protect backups from unauthorized access and to ensure their integrity, PGHoard can also transparently encrypt and authenticate the data using RSA, AES and SHA256.  Each basebackup and WAL segment gets a unique random AES key which is encrypted with RSA.  HMAC-SHA256 is used for file integrity checking.
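
To illustrate the scheme, here's a rough openssl sketch of the same envelope-encryption idea. This is illustrative only, not PGHoard's actual file format, and the file names are hypothetical:

$ openssl rand -hex 32 > aes.key                # unique random AES key per file
$ openssl enc -aes-256-cbc -in base.tar -out base.tar.enc -pass file:aes.key
$ openssl pkeyutl -encrypt -pubin -inkey backup_rsa_public.pem \
    -in aes.key -out aes.key.rsa                # the AES key travels RSA-encrypted
$ openssl dgst -sha256 -hmac "$(cat aes.key)" base.tar.enc   # HMAC-SHA256 integrity check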

Restoration is key

As noted in the opening paragraph, PGHoard is a backup and restore tool: backups are largely useless unless they can be restored.  Experience tells us that backups, even if set up at some point, are usually not restorable unless restoration is routinely tested; it also shows that restoration is rarely practiced unless it's easy to do and automate.

This is why PGHoard also includes tooling to restore backups, allowing you to create new master or standby databases from the object store archives.  This makes it possible to set up a new database replica with a single command, which first restores the basebackup from object storage and then sets up PostgreSQL's recovery.conf to fetch the remaining WAL files from the archive and, optionally, to connect to an existing master server after that.

Preparing PostgreSQL for PGHoard

First, we will need to create a replication user account. We'll just use the psql command-line client for this:

postgres=# CREATE USER backup WITH REPLICATION PASSWORD 'secret';
CREATE ROLE


We also need to allow this new user to make connections to the database. In PostgreSQL this is done by editing the pg_hba.conf configuration file and adding a line something like this:

host  replication  backup  127.0.0.1/32  md5

We'll also need to ensure that our PostgreSQL instance is configured to allow WAL replication out from the server and that it has the appropriate wal_level setting. We'll edit postgresql.conf and edit or add the following settings:

max_wal_senders = 2  # minimum two with pg_receivexlog mode!
wal_level = archive  # 'hot_standby' or 'logical' are also ok


Finally, since we have modified PostgreSQL configuration files, we'll need to restart PostgreSQL to apply the new settings by running "pg_ctl restart", "systemctl restart postgresql" or "service postgresql restart", etc., depending on the Linux distribution being used.  Note that it's not enough to "reload" PostgreSQL if the WAL settings were changed.
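
After the restart, it's worth verifying that the settings actually took effect, for example with psql:

$ psql -U postgres -c "SHOW wal_level;"
 wal_level
-----------
 archive
(1 row)

$ psql -U postgres -c "SHOW max_wal_senders;"
 max_wal_senders
-----------------
 2
(1 row)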

Now we are ready on the PostgreSQL side and can move on to PGHoard.

Installing PGHoard

PGHoard's source distribution includes packaging scripts for Debian, Fedora and Ubuntu.  Instructions for building distribution-specific packages can be found in the PGHoard README.  As PGHoard is a Python package it can also be installed on any system with Python 3 by running "pip3 install pghoard".
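
For example, on a system with Python 3 and pip available (a sketch; exact package names may vary by distribution):

$ pip3 install pghoard
$ pghoard --help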

Taking backups with PGHoard

PGHoard provides a number of tools that can be launched from the command-line:
  • pghoard - The backup daemon itself, can be run under systemd or sysvinit
  • pghoard_restore - Backup restoration tool
  • pghoard_archive_sync - Command for verifying archive integrity
  • pghoard_create_keys - Backup encryption key utility
  • pghoard_postgres_command - Used as PostgreSQL's archive_command and restore_command
First, we will launch the pghoard daemon to start taking backups. pghoard requires a small JSON configuration file that contains the settings for the PostgreSQL connection and for the target backup storage. We'll name the file pghoard.json:

{
    "backup_location": "./metadata",
    "backup_sites": {
        "example-site": {
            "nodes": [
                {
                    "host": "127.0.0.1",
                    "password": "secret",
                    "port": 5432,
                    "user": "backup"
                }
            ],
            "object_storage": {
                "storage_type": "local",
                "directory": "./backups"
            }
        }
    }
}


In the above file we list pghoard's local working directory (backup_location), our PostgreSQL connection settings (nodes) and where we want to store the backups (object_storage). In this example we'll just write the backup files to a local disk instead of a remote cloud object storage.

Then we just need to run the pghoard daemon and point it to our configuration file:

$ pghoard --short-log --config pghoard.json
DEBUG   Loading JSON config from: './pghoard.json', signal: None
INFO    pghoard initialized, own_hostname: 'ohmu1', cwd: '/home/mel/backup'
INFO    Creating a new basebackup for 'example-site' because there are currently none
INFO    Started: ['/usr/bin/pg_receivexlog', '--status-interval', '1', '--verbose', '--directory', './metadata/example-site/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'"], running as PID: 8809
INFO    Started: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_0'], running as PID: 8815, basebackup_location: './metadata/example-site/basebackup_incoming/2016-04-28_0/base.tar'
INFO    Compressed 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000025' to 805706 bytes (4%), took: 0.056s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000025', size: 805706, took 0.003s
INFO    Ran: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_0'], took: 0.331s to run, returncode: 0
INFO    Compressed 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000026' to 797357 bytes (4%), took: 0.057s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000026', size: 797357, took 0.011s
INFO    Compressed 80187904 byte file './metadata/example-site/basebackup_incoming/2016-04-28_0/base.tar' to 15981960 bytes (19%), took: 0.335s
INFO    'UPLOAD' transfer of key: 'example-site/basebackup/2016-04-28_0', size: 15981960, took 0.026s



PGHoard automatically connected to the PostgreSQL database server, noticed that we didn't have any backups and immediately created a new basebackup and started the real-time streaming of WAL files (which act as incremental backups). Each file was compressed before upload to optimize transfer and storage costs.

As long as you keep PGHoard running, it will make full backups using the default schedule (once per 24 hours) and continuously stream WAL files.
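
The schedule and retention can be tuned per backup site in pghoard.json by adding, for example, the following keys under the site definition (option names per the PGHoard README; worth double-checking against your PGHoard version):

    "basebackup_count": 2,
    "basebackup_interval_hours": 24,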

Looking at the contents of the "backups" directory, we see that our backups now contain a full database backup plus a couple of WAL files, and some metadata for each of the files:

$ find backups/ -type f
backups/example-site/xlog/000000010000000000000025
backups/example-site/xlog/000000010000000000000025.metadata
backups/example-site/xlog/000000010000000000000026
backups/example-site/xlog/000000010000000000000026.metadata
backups/example-site/basebackup/2016-04-28_0
backups/example-site/basebackup/2016-04-28_0.metadata


Available backups can be listed with the pghoard_restore tool:

$ pghoard_restore list-basebackups --config pghoard.json
Available 'example-site' basebackups:

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_0            15 MB        76 MB  2016-04-28T06:40:46Z


Looks like we are all set. Now let's try a restore!

Restoring a backup

Restoring a backup is a matter of running a single command:

$ pghoard_restore get-basebackup --config pghoard.json --target-dir restore-test
Found 1 applicable basebackup

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_0            15 MB        76 MB  2016-04-28T06:40:46Z
    metadata: {'compression-algorithm': 'snappy', 'start-wal-segment': '000000010000000000000026', 'pg-version': '90406'}

Selecting 'example-site/basebackup/2016-04-28_0' for restore
Basebackup complete.
You can start PostgreSQL by running pg_ctl -D restore-test start
On systemd based systems you can run systemctl start postgresql
On SYSV Init based systems you can run /etc/init.d/postgresql start


The pghoard_restore command automatically chooses the latest available backup, then downloads, unpacks, decompresses and (when encryption is used) decrypts it into the specified target directory. The end result will be a complete PostgreSQL data directory (e.g. something like /var/lib/postgresql/9.5/main or /var/lib/pgsql/data, depending on the distro), ready to be used by a PostgreSQL instance.

There are additional command-line options for finer control over the restoration process, for example restoring to a particular point in time or transaction (PITR), or choosing whether the restored database will act as a master or a standby.
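
For example, a point-in-time restore could look something like this (flag names as we recall them; check "pghoard_restore get-basebackup --help" for the authoritative list):

$ pghoard_restore get-basebackup --config pghoard.json \
    --target-dir restore-pitr \
    --recovery-target-time 2016-04-28T07:00:00Z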

Backup encryption

In order to encrypt our backups, we'll need to create an encryption key pair. PGHoard provides a handy command for automatically creating a key pair and storing it into our configuration file:

$ pghoard_create_keys --key-id example --config pghoard.json
Saved new key_id 'example' for site 'example-site' in 'pghoard.json'
NOTE: The pghoard daemon does not require the 'private' key in its configuration file, it can be stored elsewhere to improve security


Note that in most cases you will want to extract the private key away from the configuration file and store it safely elsewhere, away from the machine that makes the backups. The pghoard daemon only needs the encryption public key during normal operation. The private key is only required by the restore tool, or by the daemon while restoring a backup.

Uploading backups to the cloud

Sending backups to object storage in the cloud is simple: we just need the cloud's access credentials, and then we'll modify the object_storage section of pghoard.json:

            "object_storage": {
                "aws_access_key_id": "XXX",
                "aws_secret_access_key": "XXX",
                "bucket_name": "backups",
                "region": "eu-central-1",
                "storage_type": "s3"
            }


Now when we restart pghoard, the backups are sent to AWS S3 in Frankfurt:

$ pghoard --short-log --config pghoard.json
DEBUG   Loading JSON config from: './pghoard.json', signal: None
INFO    pghoard initialized, own_hostname: 'ohmu1', cwd: '/home/mel/backup'
INFO    Started: ['/usr/bin/pg_receivexlog', '--status-interval', '1', '--verbose', '--directory', './metadata/example-site/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'"], running as PID: 8001
INFO    Creating a new basebackup for 'example-site' because there are currently none
INFO    Started: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_1'], running as PID: 8014, basebackup_location: './metadata/example-site/basebackup_incoming/2016-04-28_1/base.tar'
INFO    Ran: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_1'], took: 0.350s to run, returncode: 0
INFO    Compressed and encrypted 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000027' to 799445 bytes (4%), took: 0.406s
INFO    Compressed and encrypted 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000028' to 797784 bytes (4%), took: 0.137s
INFO    Compressed and encrypted 80187904 byte file './metadata/example-site/basebackup_incoming/2016-04-28_1/base.tar' to 15982372 bytes (19%), took: 0.417s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000028', size: 797784, took 0.885s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000027', size: 799445, took 1.104s
INFO    'UPLOAD' transfer of key: 'example-site/basebackup/2016-04-28_1', size: 15982372, took 4.911s



The restore tool works the same way regardless of where the backups are stored:

$ pghoard_restore list-basebackups --config pghoard.json
Available 'example-site' basebackups:

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_1            15 MB        76 MB  2016-04-28T09:39:37Z



PostgreSQL 9.2+ and Python 3.3+ required

Today we released PGHoard version 1.2.0 with support for Python 3.3 and PostgreSQL 9.2, plus enhanced support for handling network outages.  These features were driven by external users; at Aiven we always use the latest PostgreSQL versions (9.5.2 at the time of writing) and access object storages near the database machines.


PGHoard in Aiven.io

We're happy to talk more about PGHoard and help you set up your backups with it.  You can also sign up for a free trial of our Aiven.io PostgreSQL service where PGHoard will take care of your backups.


Cheers,
Team Aiven

2016-04-26

SSL-enabled custom domains in Aiven Grafana and Kibana


We've just rolled out a new feature in Aiven: custom domains with valid SSL certificates for web frontends.

Last week we introduced valid SSL certificates from Let's Encrypt for our Grafana and Kibana (Elasticsearch frontend) services in Aiven.  This allows your browser to immediately recognize and trust the web frontend services you launch from Aiven, which we host at domains like grafana.my-project.aivencloud.com.  Previously the services' certificates were signed by Aiven's own CA, which caused web browsers to show a warning.


With the launch of the Custom Domains feature today you can create a CNAME in your own domain pointing to your Aiven service.  Once the CNAME is set up you can register it in Aiven using our Web Console and we'll automatically set up a valid SSL certificate for it.  This way you can set up secure services like grafana.example.com and search.example.com in just a few clicks in our console.
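
For example, to serve Grafana at grafana.example.com, the DNS entry is a plain CNAME record, in zone-file notation something like this (hypothetical names):

grafana.example.com.  IN  CNAME  grafana.my-project.aivencloud.com.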

Try SSL-enabled custom domains for free

You can sign up for a free trial of our services at Aiven.io and try all of our services with US$10 worth of free credits, including ones with SSL certificates for custom domains.

The SSL certificate feature is available in all Startup, Business and Premium plans for Grafana and Elasticsearch.  If the Let's Encrypt project lifts its SSL certificate creation limits we may be able to provide this service also for Hobbyist plans in the future.

We value your feedback

We are always interested in ways of making our service better. Please send your feedback and suggestions via email, Facebook, LinkedIn or using our support system.

Cheers,

Team Aiven

2016-04-19

Monitoring, metrics collection and visualization using InfluxDB and Grafana

In addition to providing you the Aiven service, our crew also does a fair amount of software consulting in the cloud context. A very common topic we are asked to help with is metrics collection and monitoring. Here's a walkthrough of how to utilize InfluxDB and Grafana for one kind of solution to the problem. We offer both as managed Aiven services for quick and easy adoption.

Case-example

As an example, here's a dashboard screenshot from a pilot project we built recently for our customer:



This particular instance is used to monitor the health of a quality assurance system on an industrial manufacturing line. The system being monitored uses IP-based video cameras coupled with IO triggers to record a JPEG image of the artifacts passing through various processing steps. The resulting dashboard allows verifying with a single glance that the system is working properly.

At the top of the dashboard you'll see a simple reading from the device's temperature sensor. Any large deviation from the norm would be a good warning of an oncoming hardware fault:




The next plotted metric is the size of the JPEG-compressed image from the imaging device:




Interestingly, this relatively simple metric reveals a lot about the health of both the sensor and any lenses and lighting sources involved. Due to the nature of JPEG encoding, the encoded frame size varies slightly even in rather static scenes, so it makes a good quick overall indicator that the component is running fine and returning up-to-date content.

The two graphs at the bottom track the last update time from each of the camera feeds and each of the IO trigger services respectively:

Here, we expect each of the cameras to update several times a second. The IO triggers are interrogated in a long-poll mode with a timeout of 15 seconds. These expectations yield natural maximum allowable limits for monitoring and alerting purposes. In fact, the left-hand side shows two readings that correlate with temporary network glitches.


Building blocks

The visualization and dashboard tool shown above is Grafana. The underlying storage for the telemetry data is InfluxDB. In this case, we utilize Telegraf as a local StatsD-compatible collection point for capturing and transmitting the data securely into the InfluxDB instance. And finally, we use a number of taps and sensors across the network that feed samples to Telegraf using StatsD client libraries in Node.js, Python and Java, depending on the component.

In this project we are using the Aiven InfluxDB and Aiven Grafana hosted services, but any other InfluxDB/Grafana setup should work in more or less the same way.

InfluxDB - The metrics database

We start by launching an InfluxDB service in Aiven:



The service is automatically launched in a minute or so.

InfluxDB is a time-series database with some awesome features:
  • Adaptive compression algorithms allow storing huge numbers of data points
  • Individual metrics data points can be tagged with key=value tags and queried based on them
  • Advanced query language allows queries whose output data requires little or no post-processing (see the example below)
  • It is FAST!
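
For instance, using the influx command-line client against the service we'll configure below, a query that averages a gauge per minute and groups it by a tag needs no post-processing at all (the query and credentials here are illustrative):

$ influx -ssl -host teledb.htn-aiven-demo.aivencloud.com -port 21950 \
    -username usr2059f5ef88fb46e49bd1f5fd0d464d80 -password ... \
    -database dbb253c1e025704a4494f3f65412b70e30 \
    -execute 'SELECT mean("value") FROM "image_size" WHERE time > now() - 1h GROUP BY time(1m), "source"'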


Telegraf - The metrics collector

Next we will need the connection parameters for our InfluxDB instance. The information required for connecting our Telegraf collection agent to InfluxDB (hostname, username, password, etc.) can be found on the service overview page:



We typically run a single Telegraf container per environment. In order to make Telegraf talk to our InfluxDB and to accept StatsD input, we will need to modify its configuration file telegraf.conf a little bit and add the following sections:


[outputs]
    [outputs.influxdb]
        url = "https://teledb.htn-aiven-demo.aivencloud.com:21950"
        database = "dbb253c1e025704a4494f3f65412b70e30"
        username = "usr2059f5ef88fb46e49bd1f5fd0d464d80"
        password = "password_goes_here"
        ssl_ca = "/etc/telegraf/htn-aiven-demo-ca.crt"
        precision = "s"

[inputs]
    [inputs.statsd]
        service_address = "127.0.0.1:8125"
        delete_gauges = true
        delete_counters = true
        delete_sets = false
        delete_timings = true
        percentiles = [90]
        allowed_pending_messages = 10000
        percentile_limit = 1000
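
Telegraf's test mode is a handy way to check that the configuration parses; it runs the configured inputs once and prints the gathered metrics to stdout:

$ telegraf -config telegraf.conf -test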

We want our InfluxDB connection to be secure against man-in-the-middle attacks, so we have included the service's CA certificate in the configuration file. This allows the Telegraf client to verify the identity of the InfluxDB server. The certificate can be downloaded from the Aiven web console:



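To double-check the TLS setup by hand, you can, for example, connect with openssl s_client using the same CA file; a successful verification ends with "Verify return code: 0 (ok)":

$ openssl s_client -connect teledb.htn-aiven-demo.aivencloud.com:21950 \
    -CAfile htn-aiven-demo-ca.crt </dev/null
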
Here's an example StatsD code blob for a Node.js component:
    // Report a gauge value; the ',source=30' suffix becomes an InfluxDB tag
    var StatsD = require('node-statsd');
    var statsd_client = new StatsD({
        host: '<telegraf_ip>',
        port: 8125,
    });
    statsd_client.gauge('image_size,source=30', 48436,
        function(error, bytes) {
            statsd_client.close();
        }
    );
The StatsD UDP protocol uses a super simple textual message format, and sending a metric takes only a few CPU cycles, so even a high request-throughput server can transmit metrics for every request processed without hurting overall performance. The StatsD receiver in Telegraf parses the incoming metric messages and consolidates them, typically writing to the metrics database at a much slower pace. This really helps keep both the source software's and the metrics database's load levels under control.
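
The wire format is in fact simple enough to exercise by hand; for example, the same gauge can be sent to the Telegraf listener configured above with netcat:

$ echo -n 'image_size,source=30:48436|g' | nc -u -w1 127.0.0.1 8125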

In the above code sample, we use Telegraf's StatsD extension for tagging support with the source=30 tag. This handy little feature is what allows us to easily slice and display the collected metrics by sensor, or just plot all metrics regardless of the source sensor. This is one of the killer features of InfluxDB and Telegraf!

OK, so now we are transmitting metrics from our application through the Telegraf daemon to our InfluxDB database. Next up is building a Grafana dashboard that visualizes the collected data.

Grafana - The dashboard

We launch our Grafana from the Aiven console by creating a new service:



Normally an InfluxDB data source needs to be added manually in Grafana; however, in this case we can skip that step, as InfluxDB and Grafana services launched under the same project in Aiven are automatically configured to talk to each other.

We like Grafana a lot because it makes it simple to define visually appealing yet useful graphs, and it integrates well with InfluxDB. Grafana has a user-friendly query builder specifically for building InfluxDB queries, and with a little practice you can quickly conjure fabulous charts from almost any source data.

The Grafana web URL, username and password are available on the service overview page:




Opening Grafana in the browser, logging in with the credentials from above and defining a simple graph with the InfluxDB query editor... PROFIT!



That's it for now. Getting application metrics delivered from the application to a pretty dashboard doesn't take much effort nowadays!

What next?

We use Telegraf, InfluxDB and Grafana rather extensively in our own Aiven monitoring infrastructure. However, we have added a couple more components to the stack, such as Apache Kafka, but that is a topic for an upcoming blog post. Stay tuned! :-)


Hosted InfluxDB and Grafana at Aiven.io

InfluxDB and Grafana are available in our Aiven.io service; you can sign up for a free trial at aiven.io.

Have fun monitoring your apps!

Cheers,

Team Aiven