Configuring RTO/RPO for a 1C-Bitrix Project
The business says: "the site must not be down for more than an hour". The engineer nods and goes off to configure replication. Six months later, it turns out that recovery from the latest backup takes 4 hours, and the business never knew. RTO and RPO are not technical characteristics; they are agreements with the business that must be documented and backed by the infrastructure.
What are RTO and RPO in the context of Bitrix
RPO (Recovery Point Objective): the maximum permissible data loss, expressed as a window of time. If RPO = 1 hour, then in a disaster no more than the last hour of transactions can be lost: orders, registrations, inventory changes.
RTO (Recovery Time Objective): the maximum permissible downtime. If RTO = 30 minutes, the site must be operational again within 30 minutes of the incident.
Typical values for an online store on Bitrix: RPO = 1 hour, RTO = 2 hours. For high-load projects: RPO = 5 minutes, RTO = 15 minutes. The stricter the requirements, the more expensive the infrastructure.
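To make the two definitions concrete, here is a minimal sketch that records the agreed targets and checks an incident (or a recovery drill) against them. The class and function names are hypothetical, and the 40 minutes of lost data is an illustrative figure; the 4-hour recovery comes from the scenario in the introduction.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business."""
    rpo: timedelta  # maximum tolerated window of lost data
    rto: timedelta  # maximum tolerated downtime


def violations(targets: RecoveryObjectives,
               data_lost: timedelta,
               downtime: timedelta) -> list[str]:
    """Return the names of the objectives that an incident or drill missed."""
    missed = []
    if data_lost > targets.rpo:
        missed.append("RPO")
    if downtime > targets.rto:
        missed.append("RTO")
    return missed


# Typical online-store targets from the text: RPO = 1 hour, RTO = 2 hours.
store = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=2))

# Illustrative incident: 40 minutes of data lost, 4 hours to recover.
print(violations(store, data_lost=timedelta(minutes=40),
                 downtime=timedelta(hours=4)))  # ['RTO']
```

Writing the targets down in a form that can be checked is what turns "the site must not be down for more than an hour" into something a drill can pass or fail.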
Technical solutions for different RPO levels
RPO = several hours. Hourly pg_dump to external storage is sufficient. Simple, cheap, but slow to restore for large databases.
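A periodic dump does not give an RPO equal to its interval. pg_dump captures the database as of the moment it starts, and the dump is usable only once it has finished and been uploaded. The sketch below works out the worst case under that reasoning; the 20-minute dump duration is an assumed figure, and the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, dump_duration: timedelta) -> timedelta:
    """Worst-case data loss for dumps started every `interval`.

    A failure just before the current dump completes falls back to the
    previous dump, whose snapshot is `interval + dump_duration` old.
    """
    return interval + dump_duration


def meets_rpo(interval: timedelta, dump_duration: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss stays within the agreed RPO."""
    return worst_case_data_loss(interval, dump_duration) <= rpo


# Assumed figures: hourly dumps that take 20 minutes including upload.
hourly, duration = timedelta(hours=1), timedelta(minutes=20)
print(worst_case_data_loss(hourly, duration))             # 1:20:00
print(meets_rpo(hourly, duration, rpo=timedelta(hours=1)))  # False
print(meets_rpo(hourly, duration, rpo=timedelta(hours=2)))  # True
```

Under these assumptions an hourly dump fits an RPO of "several hours" comfortably, but not an RPO of exactly one hour.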
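RPO = minutes. PostgreSQL streaming replication in synchronous mode: synchronous_commit = on together with a non-empty synchronous_standby_names (synchronous_commit = on by itself is the default and does not make replication synchronous). Each transaction is confirmed only after its WAL has been flushed on the replica. Cost: roughly +5–15 ms per transaction, depending on network latency between the servers.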
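To get a feel for what the added commit latency means, the following back-of-the-envelope sketch computes the ceiling on sequential commits for a single connection. The 1 ms baseline is an assumption to be replaced with a measured value; aggregate throughput across many connections is affected far less.

```python
def max_commits_per_second(base_commit_ms: float, sync_overhead_ms: float) -> float:
    """Upper bound on sequential commits per second for ONE connection."""
    return 1000.0 / (base_commit_ms + sync_overhead_ms)


BASE_MS = 1.0  # assumed local commit latency; measure your own

# 0 ms = asynchronous baseline, 5 and 15 ms = the range quoted in the text.
for overhead in (0.0, 5.0, 15.0):
    rate = max_commits_per_second(BASE_MS, overhead)
    print(f"+{overhead:>4.1f} ms -> {rate:7.1f} commits/s per connection")
```

Code paths that issue many small transactions in a loop, such as bulk imports, are the ones most likely to notice the difference.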
RPO = seconds. Patroni with synchronous replication plus continuous WAL archiving to S3 via archive_command. Combined with a base backup, the WAL archive lets you restore the database to any point in time covered by the archive (PITR, Point-in-Time Recovery).
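# postgresql.conf for PITR
wal_level = replica    # minimum level for archiving; the default since PostgreSQL 10
archive_mode = on
# Ship each completed WAL segment to S3; a non-zero exit status makes PostgreSQL retry
archive_command = 'aws s3 cp %p s3://backup-bucket/wal/%f'
archive_timeout = 60   # assumed value: switch segments at least every 60 s so low-traffic periods still ship WAL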
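Archiving is only half of PITR; the restore side needs the mirror-image settings. The sketch below renders them for a given target time. It assumes PostgreSQL 12 or later, where these parameters live in postgresql.conf and recovery is triggered by an empty recovery.signal file in the data directory after a base backup has been restored. The helper name and the target timestamp are hypothetical.

```python
from datetime import datetime, timezone


def pitr_settings(wal_prefix: str, target: datetime) -> str:
    """Render recovery settings for restoring up to `target`.

    `wal_prefix` is the location archive_command wrote to; restore_command
    copies segment %f from that archive back to the path %p.
    """
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return "\n".join([
        f"restore_command = 'aws s3 cp {wal_prefix}/%f %p'",
        f"recovery_target_time = '{stamp}'",
        "recovery_target_action = 'promote'",
    ])


# Hypothetical target: one minute before a bad deployment at 14:00 UTC.
print(pitr_settings("s3://backup-bucket/wal",
                    datetime(2024, 5, 1, 13, 59, tzinfo=timezone.utc)))
```

Generating the settings from the same bucket path used by archive_command avoids the classic mistake of restoring from a different prefix than the one being archived to.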
Technical solutions for different RTO levels
RTO = several hours. Restore from pg_dump plus code deployment from git. Recovery time grows roughly linearly with database size: a 10 GB database takes approximately 45–90 minutes to restore.
RTO = 30–60 minutes. A standby server running a hot replica. During an incident, failover is manual: promote the replica, then change DNS or the application config. Not automatic, but fast.
RTO = less than 10 minutes. Automatic failover via Patroni + HAProxy, with no human intervention. Requires up-front setup and regular testing.
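The linear rule of thumb can be turned into a quick estimate. This sketch simply scales the 10 GB figure quoted above (45–90 minutes); real restore speed depends on hardware, indexes, and dump format, so treat the result as a first guess to be replaced by a measured drill. The function names are hypothetical.

```python
def estimate_restore_minutes(size_gb: float) -> tuple[float, float]:
    """Scale the 10 GB -> 45-90 min figure linearly with database size."""
    low_per_gb, high_per_gb = 45 / 10, 90 / 10  # 4.5 and 9.0 minutes per GB
    return size_gb * low_per_gb, size_gb * high_per_gb


def fits_rto(size_gb: float, rto_minutes: float) -> bool:
    """True only if even the pessimistic estimate fits within the RTO."""
    return estimate_restore_minutes(size_gb)[1] <= rto_minutes


print(estimate_restore_minutes(10))   # (45.0, 90.0)
print(estimate_restore_minutes(40))   # (180.0, 360.0)
print(fits_rto(40, rto_minutes=240))  # False
```

By this estimate a 40 GB database already cannot be restored from a dump within a 4-hour RTO in the pessimistic case, before any of the other recovery steps are counted.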
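Automatic failover works because something keeps asking each node whether it is the primary. HAProxy typically does this with an HTTP health check against Patroni's REST API, where GET /primary returns 200 only on the current leader. The sketch below performs the same probe from Python; port 8008 is Patroni's default, while the host names and helper functions are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def patroni_is_primary(host: str, port: int = 8008, timeout: float = 2.0) -> bool:
    """Probe Patroni's REST API; GET /primary answers 200 only on the leader."""
    try:
        url = f"http://{host}:{port}/primary"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Replicas answer 503 (raised as HTTPError); dead nodes time out.
        return False


def find_primary(hosts: Iterable[str],
                 probe: Callable[[str], bool] = patroni_is_primary) -> Optional[str]:
    """Return the first host that reports itself as primary, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None


# Offline illustration with a fake probe; node names are hypothetical.
fake_state = {"db1": False, "db2": True}
print(find_primary(["db1", "db2"], probe=fake_state.__getitem__))  # db2
```

The same check is useful in monitoring and in failover tests: after killing the primary, the time until another node starts answering 200 is the database part of the measured RTO.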
Solution matrix for Bitrix
| Project Size | RPO | RTO | Infrastructure |
|---|---|---|---|
| Up to 5k orders/day | 1 hour | 4 hours | pg_dump to S3, deploy from git |
| 5–50k orders/day | 15 min | 1 hour | Streaming replica + manual failover |
| Over 50k orders/day | 1 min | 10 min | Patroni + HAProxy + WAL archiving |
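The matrix can be expressed as a small lookup, which is handy when the thresholds feed into capacity planning or a checklist. This is a sketch: the type and function names are hypothetical, and placing exactly 5,000 and exactly 50,000 orders in the lower tier is an assumption, since the table leaves the boundaries open.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    rpo_minutes: int
    rto_minutes: int
    infrastructure: str


def tier_for(orders_per_day: int) -> Tier:
    """Map daily order volume to the matching row of the matrix."""
    if orders_per_day < 0:
        raise ValueError("orders_per_day must be non-negative")
    if orders_per_day <= 5_000:
        return Tier(60, 240, "pg_dump to S3, deploy from git")
    if orders_per_day <= 50_000:
        return Tier(15, 60, "Streaming replica + manual failover")
    return Tier(1, 10, "Patroni + HAProxy + WAL archiving")


print(tier_for(12_000))
```

Order volume is only a proxy: a low-volume store with high order values may still justify the stricter tier, which is why the final targets come from the business rather than from the table.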
Calculating real RTO: what is included in recovery time
Recovery time is the sum of all steps, not just "restore database":
- Incident detection: from 0 to 15 minutes (depends on monitoring)
- Failover decision: 5–10 minutes
- DB recovery/promotion: depends on the chosen recovery method (restore from dump or replica promotion)
- Application configuration change: 2–5 minutes
- Cache warming: the first requests after recovery are slow because Redis/memcached are empty
- Health verification: 5–10 minutes
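Adding the steps up shows how far the real figure drifts from "promotion takes a minute". The sketch below sums best and worst cases for a manual-failover setup. The ranges for detection, decision, configuration change, and verification come from the list; the 1–5 minutes for replica promotion and the 5–10 minutes for cache warming are assumptions.

```python
# (best case, worst case) in minutes for each step of the list above.
STEPS = {
    "incident detection":        (0, 15),
    "failover decision":         (5, 10),
    "DB promotion":              (1, 5),    # assumption: replica promotion
    "application config change": (2, 5),
    "cache warming":             (5, 10),   # assumption: period of slow responses
    "health verification":       (5, 10),
}


def rto_budget(steps: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case durations across all steps."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


best, worst = rto_budget(STEPS)
print(f"realistic RTO: {best}-{worst} minutes")  # realistic RTO: 18-55 minutes
```

With these numbers, a setup whose database step takes at most five minutes still needs close to an hour in the worst case, so it supports an RTO of 60 minutes but not 30.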
The cache warming step is often left out of RTO calculations. After recovery, the database takes the full load with nothing cached: the Bitrix cache is empty and OPcache is cold. The first 5–10 minutes of operation are the period of peak database load. Without rate limiting, this can overwhelm the newly restored server.
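One simple way to apply rate limiting during warm-up is to admit traffic on a ramp instead of all at once. The sketch below computes an allowed request rate that grows linearly over the warm-up window. The 10-minute window matches the upper figure above; the request rates and the function name are hypothetical, and the result would still have to be fed into whatever limiter fronts the site.

```python
def allowed_rps(elapsed_s: float, warmup_s: float = 600.0,
                start_rps: float = 20.0, full_rps: float = 200.0) -> float:
    """Requests per second to admit `elapsed_s` seconds after recovery.

    Ramps linearly from start_rps to full_rps over warmup_s, then stays flat.
    """
    if elapsed_s <= 0:
        return start_rps
    if elapsed_s >= warmup_s:
        return full_rps
    return start_rps + (full_rps - start_rps) * (elapsed_s / warmup_s)


for t in (0, 150, 300, 600, 900):
    print(t, allowed_rps(t))
```

A linear ramp is the simplest option; a ramp driven by the measured cache hit rate would track actual warm-up more closely, at the cost of more moving parts during an incident.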
Documentation and testing
RTO/RPO targets without a documented runbook are worthless. The runbook should contain the exact sequence of commands for each failure scenario: primary DB failure, web server failure, loss of /upload/, server compromise.
# Example runbook section: PostgreSQL failover (manual)
# 1. Verify that primary is unavailable (exit status 2 means no response)
pg_isready -h primary.db -p 5432
# 2. Promote replica
ssh replica.db 'pg_ctl promote -D /var/lib/postgresql/data'
# 3. Update Bitrix config (keeps a .bak copy; the dot is escaped so it matches literally)
sed -i.bak 's/primary\.db/replica.db/g' /var/www/bitrix/bitrix/.settings.php
# 4. Clear cache (script path as given here; verify it exists in your installation)
php /var/www/bitrix/bitrix/modules/main/cli/cache_clear.php
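A runbook is also the natural unit for measuring actual RTO during a drill. The sketch below runs a list of steps in order, times each one, and compares the total with the target. The function name and the placeholder steps are hypothetical; a real drill would invoke the runbook commands, for example through subprocess.

```python
import time
from typing import Callable, Sequence


def run_drill(steps: Sequence[tuple[str, Callable[[], None]]],
              rto_seconds: float,
              clock: Callable[[], float] = time.monotonic) -> dict:
    """Run runbook steps in order, timing each one and the drill as a whole."""
    timings = {}
    started = clock()
    for name, action in steps:
        step_start = clock()
        action()  # an exception here aborts the drill
        timings[name] = clock() - step_start
    total = clock() - started
    return {"timings": timings, "total": total, "within_rto": total <= rto_seconds}


# Placeholder steps standing in for the real runbook commands.
report = run_drill(
    [("promote replica", lambda: time.sleep(0.01)),
     ("update config", lambda: time.sleep(0.01))],
    rto_seconds=3600,
)
print(report["within_rto"], sorted(report["timings"]))
```

Keeping the per-step timings from each drill shows which step is growing over time, which is usually more actionable than the total alone.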
What we configure
- Target RPO and RTO, defined together with the business
- Infrastructure selected and configured to meet those targets
- WAL archiving for PostgreSQL PITR when RPO < 15 minutes
- Runbook with recovery commands for each failure scenario
- Testing schedule: quarterly recovery from backup, with the actual RTO measured







