Monitoring and Testing Disaster Recovery for 1C-Bitrix
A written recovery plan without regular verification does not work. The team does not know the real RTO, backups may be corrupted, and the production configuration may have changed since the last drill. DR monitoring is not just observing the current state — it is regularly confirming that the recovery plan can be executed within the declared timeframe.
What to Monitor in the Context of DR
Backup State
Monitor not just whether a backup was created, but its integrity:
#!/bin/bash
# Check the latest DB dump
BACKUP_FILE="/backups/db/bitrix_$(date +%Y%m%d).sql.gz"
MIN_SIZE=104857600 # 100 MB — minimum expected size
if [ ! -f "$BACKUP_FILE" ]; then
    echo "CRITICAL: Backup file not found: $BACKUP_FILE"
    exit 2
fi
FILE_SIZE=$(stat -c%s "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt "$MIN_SIZE" ]; then
    echo "CRITICAL: Backup too small: ${FILE_SIZE} bytes"
    exit 2
fi
# Check gzip integrity
if ! gzip -t "$BACKUP_FILE" 2>/dev/null; then
    echo "CRITICAL: Backup file is corrupted"
    exit 2
fi
echo "OK: Backup size ${FILE_SIZE} bytes, integrity OK"
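This script can run as an external check in Nagios or Zabbix; exit code 2 follows the Nagios plugin convention for CRITICAL. Prometheus does not execute scripts itself, so there the result has to be exposed through an exporter (for example, the node_exporter textfile collector). An alert fires if the backup is missing, too small, or corrupted.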
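The script above checks existence, size, and integrity, but not how old the backup is, which is what the RPO target actually constrains. A minimal sketch of an age check follows; the function names and the example limit are illustrative, and it assumes GNU coreutils (stat -c, date +%s).

```shell
# Sketch: alert when the newest backup is older than the RPO.
# Assumes GNU coreutils; function names are illustrative.

# Print the age of a file in seconds, based on its modification time.
backup_age_seconds() {
    local file="$1"
    local now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 1
    echo $(( now - mtime ))
}

# Print a Nagios-style status line; return 0 (OK) or 2 (CRITICAL).
check_backup_age() {
    local file="$1" max_age="$2" age
    if ! age=$(backup_age_seconds "$file" 2>/dev/null); then
        echo "CRITICAL: cannot stat backup: $file"
        return 2
    fi
    if [ "$age" -gt "$max_age" ]; then
        echo "CRITICAL: backup is ${age}s old (limit ${max_age}s)"
        return 2
    fi
    echo "OK: backup is ${age}s old"
    return 0
}
```

With a 4-hour RPO the call would be check_backup_age "$BACKUP_FILE" 14400. The return code can be passed straight to exit in the check script.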
DB Replication
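-- Alert when Seconds_Behind_Master > 300; NULL means replication is not running
-- (MySQL 8.0.22+ also offers SHOW REPLICA STATUS, where the field is Seconds_Behind_Source)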
SHOW SLAVE STATUS\G
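In Zabbix, collect the value either through a MySQL monitoring template for the agent or through a custom UserParameter (zabbix_get is useful for testing the key from the server side):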
# zabbix_agentd.conf
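# Credentials are read from an option file (the path is an example) rather than passed with -p,
# which would expose the password in the process list
UserParameter=mysql.slave.lag,mysql --defaults-file=/etc/zabbix/.my.cnf -e "SHOW SLAVE STATUS\G" 2>/dev/null | awk '/Seconds_Behind_Master/ {print $2}'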
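The UserParameter returns the raw value, which is the string NULL when the replication threads are stopped. A numeric trigger alone can miss that case. The sketch below shows one way to classify the value in a wrapper script. classify_lag is an illustrative name, and the default thresholds are assumptions: 300 s from the alert comment, 60 s from the target in the DR metrics.

```shell
# Sketch: classify replication lag, treating NULL or empty as a failure.
# classify_lag and the default thresholds are illustrative assumptions.
classify_lag() {
    local lag="$1" warn="${2:-60}" crit="${3:-300}"
    case "$lag" in
        ''|NULL)
            echo "CRITICAL: replication is not running (lag is NULL)"
            return 2 ;;
        *[!0-9]*)
            echo "UNKNOWN: unexpected lag value: $lag"
            return 3 ;;
    esac
    if [ "$lag" -gt "$crit" ]; then
        echo "CRITICAL: replica is ${lag}s behind"
        return 2
    elif [ "$lag" -gt "$warn" ]; then
        echo "WARNING: replica is ${lag}s behind"
        return 1
    fi
    echo "OK: replica is ${lag}s behind"
}
```

The function takes the lag as its first argument, for example the output of the mysql and awk pipeline shown above. The return codes follow the Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).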
Free Space on the Backup Server
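# Warn when less than 20% free space remains (more than 80% used)
# -P keeps each filesystem on one line, so the column positions are stable
df -P /backups | awk 'NR==2 {sub(/%/,"",$5); if ($5+0 > 80) print "WARNING: disk " $5 "% used"}'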
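The one-liner prints a warning but always exits with status 0, so a monitoring system cannot tell a full disk from a healthy one by exit code. A possible variant that returns a status is sketched below; used_pct_from_df and check_disk are illustrative names, and the 80% limit mirrors the threshold above.

```shell
# Sketch: extract the used percentage from `df -P` output and classify it.
# Function names are illustrative.

# Read `df -P <path>` output on stdin and print the Capacity column as a number.
used_pct_from_df() {
    awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print a status line; return 0 when usage is within the limit, 1 otherwise.
check_disk() {
    local used="$1" limit="${2:-80}"
    if [ "$used" -gt "$limit" ]; then
        echo "WARNING: disk ${used}% used (limit ${limit}%)"
        return 1
    fi
    echo "OK: disk ${used}% used"
}
```

A typical call would be check_disk "$(df -P /backups | used_pct_from_df)".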
Recovery Endpoint Availability
A simple healthcheck on the backup server, monitored from both the primary DC and an external monitoring service:
<?php
// /health.php on the backup server
// (nothing may be printed before the opening tag, otherwise the headers cannot be sent)
header('Content-Type: application/json');
$checks = [];

// Check DB availability
try {
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=bitrix_db', 'bitrix_ro', '***');
    $pdo->query('SELECT 1');
    $checks['db'] = 'ok';
} catch (Exception $e) {
    $checks['db'] = 'fail';
}

// Check Redis; a missing extension or a connection error must not crash the healthcheck
try {
    $redis = new Redis();
    $checks['redis'] = $redis->connect('127.0.0.1', 6379) ? 'ok' : 'fail';
} catch (Throwable $e) {
    $checks['redis'] = 'fail';
}

// Check Bitrix filesystem
$checks['files'] = file_exists('/var/www/bitrix/bitrix/php_interface/dbconn.php') ? 'ok' : 'fail';

// Any failed check turns the response into 503 so that monitoring treats the node as unavailable
$status = in_array('fail', $checks, true) ? 503 : 200;
http_response_code($status);
echo json_encode(['status' => $status === 200 ? 'ok' : 'degraded', 'checks' => $checks]);
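On the monitoring side, the HTTP status code is enough to raise an alert, but the alert is more useful when it names the failing component. The sketch below pulls the failed check names out of the response body. It relies on the compact JSON that json_encode produces by default (no spaces around the colon); failed_checks is an illustrative name, and a real JSON parser is the better choice for anything more complex.

```shell
# Sketch: list the failed checks from the health endpoint's JSON body (read on stdin).
# Assumes compact JSON such as {"status":"degraded","checks":{"db":"ok","redis":"fail"}}.
failed_checks() {
    grep -o '"[a-z_]*":"fail"' | cut -d'"' -f2
}
```

The body can be piped in from curl, for example curl -s --max-time 10 "$URL" | failed_checks, where URL points at health.php. Note that curl must be called without -f here, because -f discards the body of a 503 response.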
Regular DR Drills: Methodology
Quarterly drill: a full restore to an isolated test environment.
- Take the latest DB and file backups
- Deploy them to a clean server
- Time each stage
- After the restore, run an automated smoke test
#!/bin/bash
# dr_smoke_test.sh: runs after restore
BASE_URL="https://test-recovery.example.com"
FAILED=0

# Fetch a URL and verify that the body contains the expected fixed string
check() {
    local name="$1"
    local url="$2"
    local expected="$3"
    local response
    response=$(curl -sf --max-time 30 "$url")
    if echo "$response" | grep -qF -- "$expected"; then
        echo "PASS: $name"
    else
        echo "FAIL: $name: expected '$expected' not found"
        FAILED=1
    fi
}

check "Homepage" "$BASE_URL/" "1C-Bitrix"
check "Catalog" "$BASE_URL/catalog/" "Catalog"
check "Cart API" "$BASE_URL/bitrix/components/bitrix/sale.basket.basket/" "basket"
check "Health endpoint" "$BASE_URL/health.php" '"status":"ok"'

# Exit non-zero on failure so that the calling job can detect it
if [ "$FAILED" -eq 0 ]; then
    echo "All checks passed"
else
    echo "Some checks FAILED"
    exit 1
fi
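The drill also calls for timing each stage. One way to do that without a stopwatch is to wrap every stage command in a small helper that records its duration. This is a sketch only: run_stage, STAGE_LOG, and the log format are illustrative, not part of any existing tooling.

```shell
# Sketch: time each recovery stage and append the result to a log file.
# run_stage and STAGE_LOG are illustrative names.
STAGE_LOG="${STAGE_LOG:-./dr_stage_times.log}"

# Run a command, append "<stage><TAB><seconds><TAB><exit code>" to the log,
# and return the command's own exit code.
run_stage() {
    local name="$1" start end rc=0
    shift
    start=$(date +%s)
    "$@" || rc=$?
    end=$(date +%s)
    printf '%s\t%s\t%s\n' "$name" "$(( end - start ))" "$rc" >> "$STAGE_LOG"
    return "$rc"
}
```

Each stage is then started as run_stage "db_restore" followed by the actual restore command, and the sum of the second column gives the measured recovery time for the drill.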
Monthly drill: a DB-only restore. Verify that the dump is fresh: restore it to a test server and query b_sale_order, b_iblock_element, and b_catalog_price to confirm that the latest records are no older than the RPO allows.
-- Check data freshness after restore
SELECT MAX(DATE_INSERT) as latest_order FROM b_sale_order;
-- Should not be older than RPO (e.g., not older than 4 hours)
SELECT COUNT(*) FROM b_iblock_element WHERE ACTIVE = 'Y';
-- Compare with the expected number of active products
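The comparison against the RPO can be automated once the latest_order value has been read from the restored database. The sketch below is a minimal version of that comparison. is_within_rpo is an illustrative name; it assumes GNU date and that the timestamp is in the same time zone as the shell. On a site with quiet periods the newest order can legitimately be older than the RPO, so the result is a signal to investigate rather than proof of a stale dump.

```shell
# Sketch: decide whether the newest restored record is within the RPO.
# Usage: is_within_rpo "<YYYY-MM-DD HH:MM:SS>" <rpo_seconds> [now_epoch]
# Assumes GNU date (-d); is_within_rpo is an illustrative name.
is_within_rpo() {
    local latest="$1" rpo_seconds="$2" now="${3:-$(date +%s)}"
    local latest_epoch age
    latest_epoch=$(date -d "$latest" +%s) || return 2
    age=$(( now - latest_epoch ))
    if [ "$age" -le "$rpo_seconds" ]; then
        echo "OK: latest record is ${age}s old"
    else
        echo "STALE: latest record is ${age}s old (RPO ${rpo_seconds}s)"
        return 1
    fi
}
```

The optional third argument lets the drill pass the moment the backup was taken instead of the current time, which avoids counting the restore duration as data loss.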
DR Metrics and SLA
| Metric | Target value | How it is measured |
|---|---|---|
| DB backup: age of last valid backup | < RPO (e.g. 4 h) | Monitoring + file timestamp |
| Replication: Seconds_Behind_Master | < 60 s under normal conditions | Zabbix/Prometheus |
| Drill duration (full restore) | Compared against RTO | Timed at each drill |
| Successful drills per quarter | ≥ 1 | Testing log |
| File backup age | < 24 h | rsync monitoring |
DR Reporting
After each drill, record:
- Date and time of drill
- Plan version (revision number)
- Time for each recovery stage
- Actual RTO vs planned RTO
- Issues discovered during the drill
- Plan updates following the drill
This log is not a formality. It reveals trends: whether RTO is degrading over time (the site grows, backups become larger, the procedure is not updated).
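Trends are easier to spot when the log is machine-readable. As a sketch, each drill could be appended as one CSV row together with a verdict on actual versus planned RTO. The column layout, the DRILL_LOG path, and the record_drill name are assumptions for illustration.

```shell
# Sketch: append one drill record to a CSV log and compare actual vs planned RTO.
# The column layout, DRILL_LOG path and function name are illustrative.
DRILL_LOG="${DRILL_LOG:-./dr_drills.csv}"

# Usage: record_drill <date> <plan_revision> <actual_rto_min> <planned_rto_min>
record_drill() {
    local drill_date="$1" revision="$2" actual="$3" planned="$4" verdict
    if [ "$actual" -le "$planned" ]; then
        verdict="within_rto"
    else
        verdict="rto_exceeded"
    fi
    # Write the header once, when the log does not exist yet
    [ -f "$DRILL_LOG" ] || echo "date,plan_revision,actual_rto_min,planned_rto_min,verdict" > "$DRILL_LOG"
    echo "${drill_date},${revision},${actual},${planned},${verdict}" >> "$DRILL_LOG"
    echo "$verdict"
}
```

Issues found and plan updates are free text and fit better in the drill report itself; the CSV only carries the numbers needed to plot RTO over time.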
DR Monitoring Setup Timeline
Setting up backup monitoring, replication checks, and healthcheck endpoints with integration into Zabbix/Prometheus, plus the first drill with an automated smoke test, takes 3–5 business days.







