Setting Up Automatic Database Failover
Automatic failover is a mechanism that detects when the database master becomes unavailable and promotes a replica to master without manual intervention. The goal is to reduce RTO (Recovery Time Objective) from tens of minutes to seconds.
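The core loop every failover manager runs can be sketched as follows (illustrative Python, not Patroni's actual code; the `Cluster` class and node names are hypothetical):

```python
class Cluster:
    """Toy model of a replicated database cluster (not Patroni's API)."""

    def __init__(self, leader, lags):
        self.leader = leader
        self.leader_alive = True
        self.lags = lags  # replica name -> replication lag in bytes

    def failover(self):
        """If the leader is down, promote the least-lagged replica."""
        if not self.leader_alive:
            self.leader = min(self.lags, key=self.lags.get)
            self.leader_alive = True
        return self.leader

cluster = Cluster("node1", {"node2": 100, "node3": 5000})
cluster.leader_alive = False   # master crashes
print(cluster.failover())      # node2 - the replica with the lowest lag wins
```

The key design point is that "promote the least-lagged replica" bounds data loss on asynchronous replication: anything not yet shipped to the chosen replica is lost.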
Tools for PostgreSQL
Patroni — the de facto standard
Patroni is a Python daemon that runs on each PostgreSQL node. It uses a DCS (Distributed Consensus Store: etcd, Consul, or ZooKeeper) for coordination and leader election.
# /etc/patroni/patroni.yml (on each node)
scope: production-cluster
namespace: /service/
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.1.10:8008

etcd3:
  hosts: 192.168.1.20:2379,192.168.1.21:2379,192.168.1.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 30
    maximum_lag_on_failover: 1048576  # 1MB max lag on failover
    synchronous_mode: false

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.1.10:5432
  data_dir: /var/lib/postgresql/14/main
  authentication:
    replication:
      username: replication
      password: replication_password
  parameters:
    max_connections: 200
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
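The `maximum_lag_on_failover` setting above is a promotion filter: a replica whose lag exceeds the bound is never considered as a candidate. A sketch of that eligibility check (illustrative Python, not Patroni internals):

```python
MAX_LAG = 1048576  # bytes, mirrors maximum_lag_on_failover above

def eligible_candidates(lags, max_lag=MAX_LAG):
    """Replicas that may be promoted: lag within the configured bound,
    best (lowest-lag) candidate first."""
    return sorted((n for n, lag in lags.items() if lag <= max_lag),
                  key=lags.get)

lags = {"node2": 4096, "node3": 2_000_000}
print(eligible_candidates(lags))  # ['node2'] - node3 is too far behind
```

If no replica qualifies, Patroni refuses to fail over automatically: losing more than 1MB of WAL is judged worse than staying down until an operator decides.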
Start:
systemctl start patroni
Check cluster status:
patronictl -c /etc/patroni/patroni.yml list
# Shows leader and replicas with lag
Manual switchover (planned):
patronictl -c /etc/patroni/patroni.yml switchover production-cluster
etcd for coordination
# etcd cluster on three servers
etcd --name etcd1 \
  --data-dir /var/lib/etcd \
  --listen-peer-urls http://192.168.1.20:2380 \
  --listen-client-urls http://192.168.1.20:2379 \
  --initial-advertise-peer-urls http://192.168.1.20:2380 \
  --advertise-client-urls http://192.168.1.20:2379 \
  --initial-cluster "etcd1=http://192.168.1.20:2380,etcd2=http://192.168.1.21:2380,etcd3=http://192.168.1.22:2380"
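Three servers is not an arbitrary number: etcd needs a Raft majority (quorum) to make progress, so even member counts buy no extra fault tolerance. The arithmetic, in plain Python:

```python
def quorum(n):
    """Votes needed for an etcd (Raft) cluster of n members to commit."""
    return n // 2 + 1

def tolerated_failures(n):
    """Members the cluster can lose while still reaching quorum."""
    return n - quorum(n)

for n in (1, 2, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
# 3 members survive 1 failure; 2 members survive none - same as 1.
```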
HAProxy for transparent switching
Patroni exposes health-check endpoints on its REST API:

- GET /master → 200 if the node is the leader
- GET /replica → 200 if the node is a replica
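HAProxy turns those checks into routing: only the server answering 200 on /master stays in the write backend, so client connections follow the leader automatically. The selection logic, simulated (hypothetical helper, not HAProxy code):

```python
def writable_backends(health):
    """health maps node -> HTTP status of GET /master on port 8008.
    HAProxy keeps only the servers whose health check returned 200."""
    return [node for node, status in health.items() if status == 200]

# Only the current Patroni leader answers /master with 200:
health = {"node1": 200, "node2": 503, "node3": 503}
print(writable_backends(health))  # ['node1'] - all writes go to the leader

# After a failover, node2 starts answering 200 and traffic moves with it:
health = {"node1": 503, "node2": 200, "node3": 503}
print(writable_backends(health))  # ['node2']
```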
# haproxy.cfg
frontend postgres_write
    bind *:5000
    mode tcp
    default_backend postgres_master

backend postgres_master
    mode tcp
    option httpchk GET /master
    http-check expect status 200
    server node1 192.168.1.10:5432 check port 8008
    server node2 192.168.1.11:5432 check port 8008
    server node3 192.168.1.12:5432 check port 8008
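The same pattern scales out reads: a second frontend can health-check /replica instead of /master and balance across standbys. A sketch (port 5001 and the backend name are arbitrary choices, not from the original config):

```
frontend postgres_read
    bind *:5001
    mode tcp
    default_backend postgres_replicas

backend postgres_replicas
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    server node1 192.168.1.10:5432 check port 8008
    server node2 192.168.1.11:5432 check port 8008
    server node3 192.168.1.12:5432 check port 8008
```

Applications then use port 5000 for writes and 5001 for reads, and neither needs to know which physical node is currently the leader.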
MySQL: InnoDB Cluster
MySQL Router + Group Replication is the official Oracle solution:
# Initialize cluster (MySQL Shell)
mysqlsh root@192.168.1.10:3306
JS> dba.createCluster('myCluster')
JS> cluster = dba.getCluster()
JS> cluster.addInstance('root@192.168.1.11:3306')
JS> cluster.status()
Set up MySQL Router:
mysqlrouter --bootstrap root@192.168.1.10:3306 --directory /etc/mysqlrouter
Router automatically directs writes to the primary and updates routing on failover.
Testing failover
# Force a failover away from the current leader
patronictl -c /etc/patroni/patroni.yml failover production-cluster --force
# Monitor switching
watch -n 1 patronictl -c /etc/patroni/patroni.yml list
Typical Patroni + etcd failover time: 10–30 seconds.
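Where those seconds go can be estimated from the `ttl` and `loop_wait` values in the Patroni config above (a back-of-the-envelope bound, not a guarantee; real timings vary with load and network):

```python
ttl = 30        # leader lease lifetime (bootstrap.dcs.ttl)
loop_wait = 10  # seconds between Patroni housekeeping runs

# Worst case: the leader dies right after renewing its lease, so the
# failure is only noticed once the lease expires; the next housekeeping
# loop on a replica then starts the election and promotion.
worst_detection = ttl
worst_total = ttl + loop_wait
print(f"detection <= {worst_detection}s, promotion starts <= {worst_total}s")
```

Typical failovers finish faster because the lease is usually already part-way through its TTL when the leader dies; lowering `ttl` shrinks the window at the cost of more DCS traffic and more sensitivity to network blips.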
Timeline
Setting up a Patroni cluster on 3 nodes with HAProxy and etcd takes 3–4 business days.