Useful scripts for checking Postgres replication lag
We have been using Postgres replication successfully now on a number of projects - mainly to ensure that we have data backed up in real-time to a separate disaster recovery site. But a useful side effect of having this up-to-date backup on a Postgres instance, is we now have the ability to take regular snapshots of it without affecting performance of the live site, and to provide read-only views onto the data with the benefit of it being bang up to date.
One of our bespoke learning management systems - The Skills Network's EQUAL system now has a million enrolments and tens of millions of pieces of learner and tutor work. It is imperative that none of this work is ever lost.
We use a couple of scripts to monitor the performance and integrity of the slave by querying either the master or slave the Postgres database.
We monitor the current replication lag of the slave periodically to check things aren't going awry by running the following SQL command on the slave database.
slave=# select now()-pg_last_xact_replay_timestamp() as replication_lag; replication_lag ----------------- 00:00:00.419338 (1 row)
This returns a Postgres timestamp which in this case is showing less than a second. We very rarely see this value exceeding a couple of seconds.
We find it more useful to query the master database to check the active flag of the slave. Remember that you may have more than one slave returned if you have multiple. The names of the slaves are defined in their recovery.conf using the value primary_slot_name = 'slave'
master=# select * from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn -----------+--------+-----------+--------+----------+--------+------------+------+--------------+------------- slave | | physical | | | t | 16749 | | | D3/CC942000 (1 row)
The active column is a Postgres boolean, which we check every minute to ensure the slave is still attached and active, and send an email if not. Remember to write the script to raise the checking interval when its down though, or you may get a lot of emails!
Finally another check you can do is to see the size of the lag in bytes, though we havent needed to use this in terms of raising alerts. Amend the above query on the pg_replication_slots view to see the size of the pending replication in bytes.
master=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS ret_bytes FROM pg_replication_slots; slot_name | database | active | ret_bytes -----------+----------+--------+---------------- slave | | t | 15856 (1 row)
Put a few of these queries together and kick back, never having to wonder if your Postgres replication is still active.