Configuring Bitrix24 On-Premise High Availability
High availability is not about clustering for its own sake. It is the answer to the question: what happens when each component of the system fails, and how quickly will it recover? For Bitrix24 On-Premise, the targets must be defined upfront: SLA 99.9% (8.7 hours of downtime per year) is fundamentally different from SLA 99.99% (52 minutes).
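The downtime budgets follow directly from the SLA percentage. A quick sketch of the arithmetic (assuming a 365-day year):

```shell
# Allowed downtime per year for a given SLA fraction (365-day year assumed)
sla_downtime() {
    awk -v sla="$1" 'BEGIN {
        h = (1 - sla) * 365 * 24
        printf "%.2f hours (%.2f minutes per year)\n", h, h * 60
    }'
}
sla_downtime 0.999     # three nines
sla_downtime 0.9999    # four nines
```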
Single Point of Failure Analysis
Before building for high availability, identify all SPOFs in your installation:
| Component | Risk | Solution |
|---|---|---|
| Web server (single) | Complete outage on failure | Active-Active cluster |
| MySQL without replica | Data loss + downtime | Master-Slave + auto-failover |
| NFS (single) | File loss + downtime | GlusterFS or S3 |
| Redis (single) | Session loss (all users logged out) | Redis Sentinel |
| Load balancer | Complete outage | keepalived + VIP |
| DNS | Unreachable by hostname | Two DNS servers or Anycast |
Keepalived + Virtual IP for the Load Balancer
The load balancer is the most critical component: it must not itself be a SPOF.
# /etc/keepalived/keepalived.conf on the MASTER node
# The vrrp_script must be defined before the vrrp_instance that references it
vrrp_script chk_nginx {
    script "killall -0 nginx"   # exits 0 while an nginx process exists
    interval 2
    weight -20
}
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret
    }
    virtual_ipaddress {
        192.168.1.100/24   # VIP: this IP is registered in DNS
    }
    track_script {
        chk_nginx
    }
}
When the MASTER fails, keepalived automatically transfers the VIP to the BACKUP node. Switchover takes 2–3 seconds.
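The BACKUP node gets an almost identical configuration; only the state and priority differ (the values below are illustrative, the priority just has to be lower than the MASTER's):

```
# /etc/keepalived/keepalived.conf on the BACKUP node
# identical to the MASTER config except for these two lines:
vrrp_instance VI_1 {
    state BACKUP
    priority 90
    ...
}
```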
Database Auto-Failover
Manual Master → Slave promotion during an incident means 15–30 minutes of downtime. Automatic failover via Orchestrator or MHA eliminates this:
Orchestrator is the most mature failover solution for MySQL/MariaDB:
# Inspect the current replication topology
orchestrator-client -c topology -i db-master:3306
# With recovery enabled, Orchestrator promotes the best replica when the master fails
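Automatic master recovery has to be enabled explicitly in Orchestrator's configuration. A minimal recovery-related fragment of /etc/orchestrator.conf.json might look like this (key names are from Orchestrator's documentation; values are illustrative):

```json
{
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "orc_topology_password",
  "RecoverMasterClusterFilters": ["*"],
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "RecoveryPeriodBlockSeconds": 300
}
```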
After a master change, Bitrix24 must receive the new database address. This is handled by ProxySQL — a proxy in front of MySQL that transparently redirects connections when the topology changes.
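A sketch of the ProxySQL side, entered through its admin interface (port 6032; hostnames and credentials are placeholders). ProxySQL watches the read_only flag on each server and moves it between the writer and reader hostgroups when Orchestrator promotes a replica:

```shell
mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<'SQL'
-- writer hostgroup 10, reader hostgroup 20
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'bitrix');
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, 'db-master', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, 'db-replica', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SQL
```

Bitrix24 then connects to ProxySQL instead of MySQL directly, so the application configuration never changes during a failover.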
GlusterFS for Fault-Tolerant Storage
NFS is simple and inexpensive, but when it fails the entire cluster loses access to files. GlusterFS is a distributed file system with built-in replication:
# On storage1, after peering the nodes (gluster peer probe storage2)
gluster volume create bitrix-files replica 2 \
    storage1:/data/bitrix storage2:/data/bitrix
gluster volume start bitrix-files
# Mount on each web node
mount -t glusterfs storage1:/bitrix-files /home/bitrix/www/upload
When one node fails, GlusterFS continues operating on the second. Changes are synchronized automatically upon recovery.
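Note that a plain replica 2 volume is vulnerable to split-brain if the two nodes lose contact with each other while both keep accepting writes. A common mitigation is a third, lightweight arbiter brick that stores only metadata and breaks ties (storage3 below is an assumed extra host):

```shell
# The arbiter brick needs very little disk space
gluster volume create bitrix-files replica 3 arbiter 1 \
    storage1:/data/bitrix storage2:/data/bitrix storage3:/data/arbiter
```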
Health Checks and Auto-Recovery
Monitoring without automated responses is only half the job. Configure automatic reactions:
- nginx health checks to remove unhealthy backends from the pool (passive max_fails/fail_timeout in open-source nginx; the active health_check directive requires NGINX Plus)
- systemd auto-restart for nginx, php-fpm, and redis on crash
- Cron-based replication lag check with a Telegram alert when lag > 60 seconds
# Cron job: alert to Telegram when replication lag exceeds 60 seconds
LAG=$(mysql -u monitor -p"$MONITOR_PASS" -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
# Seconds_Behind_Master is NULL when replication is broken, so check for that too
if [ -n "$LAG" ] && { [ "$LAG" = "NULL" ] || [ "$LAG" -gt 60 ]; }; then
    curl -s -X POST "https://api.telegram.org/bot$TOKEN/sendMessage" \
        -d chat_id="$CHAT" -d text="Replication lag: $LAG"
fi
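The first two reactions in the list above can be sketched as config fragments (paths, addresses, and values are illustrative):

```
# /etc/nginx/conf.d/upstream.conf: passive health check in open-source nginx,
# a backend is removed for 30 s after 3 failed attempts
upstream bitrix_backend {
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

# /etc/systemd/system/nginx.service.d/restart.conf: auto-restart on crash
# (the same drop-in works for php-fpm and redis units)
[Service]
Restart=on-failure
RestartSec=5
```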
RTO/RPO for Various Failure Scenarios
| Scenario | RPO (data loss) | RTO (recovery time) |
|---|---|---|
| Web node failure | 0 | < 5 sec (keepalived) |
| DB master failure | < 5 sec | 1–2 min (Orchestrator) |
| NFS/GlusterFS failure | 0 (replication) | < 30 sec |
| Complete datacenter loss | Per backup RPO (1 hour) | 2–4 hours |
High availability costs money: at a minimum, roughly double the infrastructure. But one hour of downtime for a corporate portal used by 200 employees can easily justify the investment. Always calculate the ROI before designing the architecture.