Possible disaster scenarios

Failure of:

Machine affected

HA system response

PSQL database service

Primary

Failover, automatic service reboot. Replication stops in the meantime.

Secondary

Automatic service reboot. Replication stops in the meantime.

Monitor

Automatic service reboot. Replication continues but no failover possible in the meantime.

Server shutdown

Primary

Failover. Replication stops, waits for a signal from secondary.

Secondary

Replication on primary stops. Waits for a signal from secondary.

Monitor

The primary database is still usable. Primary and secondary nodes wait for a connection to monitor. Replication continues.

All

Database unavailable, no replication, no failover possible.

Server reboot

Primary

Failover, automatic service reboot on startup.

Secondary

Automatic service reboot. Replication stops in the meantime.

Monitor

Automatic service reboot. Replication continues but no failover possible in the meantime.

All

Database unavailable, no replication, no failover possible.

pg_autoctl corrupted and/or deleted

All

Database unavailable, no replication, no failover possible.

Controlled switchover

Note

In a controlled switchover situation it is possible to organize the sequence of events in a way to avoid data loss and lower downtime to a minimum. Because the HA cluster described here uses synchronous replication, triggering a manual failover doesn’t risk data loss risks. The monitor server keeps the current primary health at the time when the failover is triggered, and drives the failover accordingly.

Triggering a controlled switchover is the same as a manual failover described above.

Recovery

Database service failure

If the PostgreSQL database fails on one of the machines, the system will automatically reboot the affected service, but the replication process is unavailable for the duration.

Server shutdown

If either of the component machines is shut down, a manual restart is required. The failover processes will automatically start with the machine, and reinitialize the connections. If only the monitor server is affected, replication continues and failover is still possible.

Server reboot

The failover system is configured to automatically restart with the server, and no manual intervention is required. If only the monitor server is affected, replication continues but no failover can be triggered until it’s available.

pg_autoctl setup failure

On the current primary database machine:

/usr/pgsql-12/bin/postgres -D /var/lib/pgsql/[node-?] -p [port]

Edit the preferences.cfg file for Central, and change the following line, using the connection string:

postgres://[node-?]:[port]/mmsuite?target_session_attrs=read-write

Restart Central:

systemctl restart mmcentral

Complete shutdown

If the startup scripts are correct in all of the machines a manual boot of the machines in the correct order (1. monitor; 2. primary; 3. secondary) will be enough to reinitialize the cluster. On each machine, use the ps -ef | grep monitor (or primary/secondary) command after boot to verify the pg_autoctl process is running.

If something’s not working, or you’d like to manually restart the services to recover, follow these steps.

Note

You can create bash scripts of each step to execute instead of manually running through them.

Start the monitor machine:

sudo su - postgres
export PATH="$PATH:/usr/pgsql-12/bin"
pg_autoctl run --pgdata ./[monitor]/

Start the primary machine:

sudo su - postgres
export PATH="$PATH:/usr/pgsql-12/bin"
pg_autoctl run --pgdata ./[node-1]/

If an error message states an instance is already running, remove the referenced file:

rm /tmp/pg_autoctl/var/lib/pgsql/[node-1]/pg_autoctl.pid

And re-run the application:

pg_autoctl run --pgdata ./[node-1]/

Start the secondary machine(s):

sudo su - postgres
export PATH="$PATH:/usr/pgsql-12/bin"
pg_autoctl run --pgdata ./[node-2]/

If an error message states an instance is already running, remove the referenced file:

rm /tmp/pg_autoctl/var/lib/pgsql/[node-2]/pg_autoctl.pid

And re-run the application:

pg_autoctl run --pgdata ./[node-2]/