# Developing a Disaster Recovery Plan (DRP)
A Disaster Recovery Plan (DRP) is a documented set of procedures for restoring IT infrastructure after a catastrophic failure. Without one, the team panics, data is lost, and recovery takes days instead of hours. A good DRP exists in two formats: a detailed document for analysis and brief runbooks for execution at 3 AM.
## DRP Structure
### 1. Classify scenarios by priority
| Scenario | RTO | RPO | Probability |
|---|---|---|---|
| Application server down | 15 min | 0 | High |
| Primary DB failure | 30 min | 5 min | Medium |
| Data center loss (region) | 4 h | 1 h | Low |
| Ransomware attack / data deletion | 8 h | 24 h | Low |
| Deployment error (critical regression) | 30 min | 0 | High |
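RTO (Recovery Time Objective) is the longest acceptable downtime and RPO (Recovery Point Objective) is the largest acceptable window of lost data. The targets in the table only mean something if drill results are compared against them. The following is a minimal sketch of such a comparison, not a definitive tool: the function names `to_seconds` and `meets_objective` and the 25-minute drill figure are assumptions made for illustration.

```bash
#!/bin/bash
# Sketch: compare a measured recovery time (or data-loss window) with an
# objective written the way the table writes it ("15 min", "4 h", "0").
# Function names and the drill figure below are illustrative assumptions.

# to_seconds "15 min" prints 900, "4 h" prints 14400, "0" prints 0
to_seconds() {
  local value unit
  read -r value unit <<< "$1"
  case "${unit:-}" in
    min) echo $(( value * 60 )) ;;
    h)   echo $(( value * 3600 )) ;;
    "")  echo "$value" ;;
    *)   echo "unknown unit: $unit" >&2; return 1 ;;
  esac
}

# meets_objective MEASURED_SECONDS "OBJECTIVE"
# Exit status 0 if the measured value is within the objective.
meets_objective() {
  local measured=$1 limit
  limit=$(to_seconds "$2") || return 2
  [ "$measured" -le "$limit" ]
}

# Hypothetical drill: the primary DB was restored in 25 minutes (RTO 30 min)
if meets_objective $(( 25 * 60 )) "30 min"; then
  echo "RTO met"
else
  echo "RTO missed"
fi
```

Recording one such result per scenario after each drill shows which rows of the table are realistic and which need better tooling or a revised target.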
### 2. Responsible parties and contacts
```yaml
# drp/contacts.yml
incident_commander: "CTO"
primary_oncall: "DevOps Team"
contacts:
  - role: "Incident Commander"
    name: "Aleksei Petrov"
    phone: "+7-xxx-xxx-xxxx"
    telegram: "@apetrov"
  - role: "DB Admin"
    name: "Marina Sidorova"
    phone: "+7-xxx-xxx-xxxx"
```
### 3. Inventory of critical components
```yaml
# drp/inventory.yml
critical_systems:
  - name: "PostgreSQL Primary"
    host: "db-primary.internal"
    backup_location: "s3://backups/postgres/"
    backup_frequency: "hourly"
    replication: "streaming to db-replica-1, db-replica-2"
```
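The inventory states where backups live and how often they run, but not whether the latest one actually exists. A minimal sketch of a freshness check follows. It assumes the listing has the `DATE TIME SIZE KEY` layout that `aws s3 ls` prints, that timestamps are in the machine's local time zone, and that GNU `date` (with `-d`) is available. The function names are hypothetical.

```bash
#!/bin/bash
# Sketch: is the newest backup recent enough? Requires GNU date (-d).
# Input lines are assumed to look like: 2024-01-01 12:00:05   2048 base.tar.gz

# Read a listing on stdin and print its newest entry.
# Lines that do not start with a date (for example prefix rows) are ignored.
newest_backup() {
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} ' | sort -k1,1 -k2,2 | tail -n 1
}

# backup_age_seconds "YYYY-MM-DD HH:MM:SS" NOW_EPOCH
backup_age_seconds() {
  local backup_epoch
  backup_epoch=$(date -d "$1" +%s) || return 1
  echo $(( $2 - backup_epoch ))
}

# backup_is_fresh LISTING_LINE NOW_EPOCH MAX_AGE_SECONDS
# Exit status 0 if the backup on that line is no older than MAX_AGE_SECONDS.
backup_is_fresh() {
  local day time _rest age
  read -r day time _rest <<< "$1"
  age=$(backup_age_seconds "$day $time" "$2") || return 2
  [ "$age" -le "$3" ]
}
```

For the hourly PostgreSQL backups above, a check could feed `aws s3 ls s3://backups/postgres/ | newest_backup` into `backup_is_fresh "$line" "$(date +%s)" 3600`, possibly with some slack added to the limit so a backup that is still uploading does not raise an alarm.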
### Runbook: primary DB loss
````markdown
# RUNBOOK: PostgreSQL Primary Failure
**Time:** 15-30 minutes
**Requirements:** AWS access, SSH access to the servers
## Steps
### 1. Confirm failure (2 min)
```bash
# Is the host reachable at all?
ssh db-primary.internal
# Does PostgreSQL answer queries?
psql -h db-primary.internal -U postgres -c "SELECT 1;"
```
### 2. Select best replica (3 min)
Check the replication lag on each replica and select the one with the least lag.
### 3. Promote replica (5 min)
```bash
# --candidate names the replica to promote
patronictl -c /etc/patroni/patroni.yml failover cluster-name --candidate db-replica-1
```
### 4. Redirect traffic (5 min)
Update DNS or HAProxy so that clients connect to the new primary.
### 5. Restart application (2 min)
```bash
kubectl rollout restart deployment/api -n production
```
### 6. Verify
```bash
curl https://api.example.com/health
psql -h db-primary.example.com -U app -c "SELECT count(*) FROM users;"
```
````
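The verification step probes each endpoint once. Right after a failover the first probe can fail while DNS changes propagate and the application reconnects, so a single failure does not prove the recovery failed. A minimal sketch of a retry helper follows. The name `retry` and the attempt counts in the usage comments are assumptions, not part of the runbook.

```bash
#!/bin/bash
# Sketch: run a check repeatedly until it succeeds or attempts run out.
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Do not sleep after the final attempt
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage with the checks from the verification step (not executed here):
#   retry 10 3 curl -fsS https://api.example.com/health
#   retry 5 2 psql -h db-primary.example.com -U app -c "SELECT 1;"
```

`curl -f` makes HTTP error responses count as failures, so the helper retries on a 5xx response as well as on a refused connection.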
### Automate DR procedures
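```bash
#!/bin/bash
# scripts/dr/db-failover.sh
# Automatic failover on confirmed primary failure.
# Expects DB_PRIMARY_HOST, DB_REPLICAS (space-separated) and SLACK_WEBHOOK
# in the environment. Replica host names are assumed to match the Patroni
# member names.
set -euo pipefail

# 1. Confirm failure: pg_isready exits 0 only if the server accepts connections
if pg_isready -h "$DB_PRIMARY_HOST" -U postgres -t 5; then
  echo "[$(date -u)] Primary is up. Aborting."
  exit 1
fi

# 2. Find best replica (smallest replay delay in seconds)
BEST_REPLICA=""
BEST_LAG=999999999
for replica in $DB_REPLICAS; do
  LAG=$(psql -h "$replica" -U postgres -tAc "
    SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()))::int
  " 2>/dev/null || echo 999999999)
  # Skip replicas that returned an empty (NULL) or non-numeric value
  [[ "$LAG" =~ ^-?[0-9]+$ ]] || continue
  if [ "$LAG" -lt "$BEST_LAG" ]; then
    BEST_LAG=$LAG
    BEST_REPLICA=$replica
  fi
done

# Never call patronictl without a candidate
if [ -z "$BEST_REPLICA" ]; then
  echo "[$(date -u)] No reachable replica found. Aborting." >&2
  exit 1
fi

# 3. Promote (Patroni): --candidate names the replica to promote
patronictl -c /etc/patroni/patroni.yml failover \
  postgres-cluster --candidate "$BEST_REPLICA" --force

# 4. Notify Slack
curl -s -X POST "$SLACK_WEBHOOK" -H 'Content-type: application/json' \
  -d "{\"text\": \"DB Failover completed. New primary: $BEST_REPLICA (lag was ${BEST_LAG}s)\"}"
```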
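Automated DR scripts are rarely run for real, so they tend to break unnoticed. One common way to rehearse them safely is a dry-run switch that prints destructive commands instead of executing them. The sketch below assumes a hypothetical `run` wrapper and a `DRY_RUN` environment variable, neither of which is part of the script above.

```bash
#!/bin/bash
# Sketch: wrap state-changing commands so a drill can print them
# instead of executing them. DRY_RUN and run() are illustrative names.

# run COMMAND [ARGS...]
# With DRY_RUN=1 the command is printed (shell-quoted) and not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run]'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# Usage inside a DR script (not executed here):
#   run kubectl rollout restart deployment/api -n production
# Invoking the script as DRY_RUN=1 ./script.sh then only prints that command.
```

Read-only steps such as the `pg_isready` probe and the lag queries can stay unwrapped, so a drill still exercises connectivity and replica selection while skipping the promotion itself.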
## Timeline
Developing a complete DRP with runbooks for 5–10 scenarios, an inventory, and automation scripts takes 3–5 business days.







