Configuring Bitrix24 On-Premise Fault Tolerance

High availability is not about clustering for its own sake. It is the answer to the question: what happens when each component of the system fails, and how quickly will it recover? For Bitrix24 On-Premise, the targets must be defined upfront: SLA 99.9% (8.7 hours of downtime per year) is fundamentally different from SLA 99.99% (52 minutes).
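Those downtime budgets follow directly from the SLA percentage; a quick sketch of the arithmetic (the helper function name is ours, not from any tool):

```shell
# Illustrative helper: yearly downtime budget in minutes for a given SLA
sla_downtime_minutes() {
    # 525600 minutes in a (non-leap) year * allowed failure fraction
    awk -v sla="$1" 'BEGIN { printf "%.0f\n", 525600 * (1 - sla / 100) }'
}
sla_downtime_minutes 99.9    # → 526 (about 8.7 hours)
sla_downtime_minutes 99.99   # → 53 (about 52 minutes)
```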

Single Point of Failure Analysis

Before building for high availability, identify all SPOFs in your installation:

Component                Risk                                  Solution
Web server (single)      Complete outage on failure            Active-Active cluster
MySQL without replica    Data loss + downtime                  Master-Slave + auto-failover
NFS (single)             File loss + downtime                  GlusterFS or S3
Redis (single)           Session loss (all users logged out)   Redis Sentinel
Load balancer            Complete outage                       keepalived + VIP
DNS                      Unreachable by hostname               Two DNS servers or Anycast
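Of the remedies in the table, Redis Sentinel is the only one not covered in a later section. A minimal sentinel.conf sketch, with illustrative addresses and names (run Sentinel on at least three nodes so a quorum of 2 is meaningful):

```
# /etc/redis/sentinel.conf — minimal sketch; IP and master name are illustrative
sentinel monitor bitrix-redis 192.168.1.10 6379 2   # quorum of 2 sentinels
sentinel down-after-milliseconds bitrix-redis 5000  # declare master down after 5 s
sentinel failover-timeout bitrix-redis 10000
```

Note that clients must be Sentinel-aware to follow the promoted master after a failover.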

Keepalived + Virtual IP for the Load Balancer

The load balancer is itself the most critical component: all traffic flows through it, so it must not be a SPOF:

# /etc/keepalived/keepalived.conf — MASTER node
# The health-check script is declared before the instance that references it
vrrp_script chk_nginx {
    script "killall -0 nginx"   # exits 0 while an nginx process exists
    interval 2
    weight -20                  # drop priority by 20 when the check fails
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51        # must match on MASTER and BACKUP
    priority 100                # BACKUP node uses a lower value, e.g. 90
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass your_secret   # keepalived truncates this to 8 characters
    }

    virtual_ipaddress {
        192.168.1.100/24  # VIP — this IP is registered in DNS
    }

    track_script {
        chk_nginx
    }
}

When the MASTER fails, keepalived automatically transfers the VIP to the BACKUP node. Switchover takes 2–3 seconds.
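For completeness, the BACKUP node runs an almost identical config; a minimal sketch showing only the lines that differ:

```
# /etc/keepalived/keepalived.conf — BACKUP node (sketch; mirrors the MASTER config)
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51   # same ID as on the MASTER
    priority 90            # lower than the MASTER's 100
    advert_int 1
    # authentication, virtual_ipaddress and track_script blocks
    # are identical to the MASTER node
}
```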

Database Auto-Failover

Manual Master → Slave promotion during an incident means 15–30 minutes of downtime. Automatic failover via Orchestrator or MHA eliminates this:

Orchestrator — the most mature solution for MySQL/MariaDB:

# Inspect the replication topology that Orchestrator has discovered
orchestrator-client -c topology -i db-master:3306
# On master failure, Orchestrator automatically promotes the best replica

After a master change, Bitrix24 must receive the new database address. This is handled by ProxySQL — a proxy in front of MySQL that transparently redirects connections when the topology changes.
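A minimal sketch of that ProxySQL wiring, entered through the ProxySQL admin interface (hostgroup IDs and hostnames here are illustrative assumptions, not values from this setup):

```
-- ProxySQL admin interface (default port 6032); IDs and hostnames illustrative
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'db-master', 3306), (20, 'db-replica', 3306);
-- Hostgroup 10 = writer, 20 = reader; an Orchestrator post-failover hook
-- (or ProxySQL's own monitoring) moves the new master into hostgroup 10
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```

Bitrix24 keeps pointing at ProxySQL's address, so the application needs no reconfiguration after a failover.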

GlusterFS for Fault-Tolerant Storage

NFS is simple and inexpensive, but when it fails the entire cluster loses access to files. GlusterFS is a distributed file system with built-in replication:

# On both storage nodes
# replica 2 is prone to split-brain; replica 3 or an arbiter brick is safer
gluster volume create bitrix-files replica 2 \
    storage1:/data/bitrix storage2:/data/bitrix

gluster volume start bitrix-files

# Mount on web nodes; the backup server keeps the mount working if storage1 dies
mount -t glusterfs -o backup-volfile-servers=storage2 \
    storage1:/bitrix-files /home/bitrix/www/upload

When one node fails, GlusterFS continues operating on the second. Changes are synchronized automatically upon recovery.
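To see how far behind a recovered node is, `gluster volume heal <vol> info` prints a "Number of entries:" line per brick; a small parsing sketch (the function name is ours):

```shell
# Sketch: total pending heal entries across bricks, parsed from the
# output of `gluster volume heal <vol> info`
pending_heals() {
    awk '/Number of entries:/ { sum += $NF } END { print sum + 0 }'
}
# Example: two bricks with 2 and 3 unsynced entries
printf 'Number of entries: 2\nNumber of entries: 3\n' | pending_heals   # → 5
```

A nonzero result means self-heal is still catching up; zero means the bricks are back in sync.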

Health Checks and Auto-Recovery

Monitoring without automated responses is only half the job. Configure automatic reactions:

  • nginx passive health checks (max_fails/fail_timeout in the upstream block) to remove unhealthy backends from the pool; the active health_check directive requires NGINX Plus
  • systemd auto-restart for nginx, php-fpm, and redis on crash
  • Cron-based replication lag check with a Telegram alert when lag > 60 seconds
# Cron-based replica lag check with Telegram alert (TOKEN and CHAT exported in the cron environment)
LAG=$(mysql -u monitor -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
# NULL means replication is broken; treat it as an alert too
if [ "$LAG" = "NULL" ] || [ "${LAG:-0}" -gt 60 ]; then
    curl -s -X POST "https://api.telegram.org/bot${TOKEN}/sendMessage" -d chat_id="${CHAT}" -d text="Replica lag: ${LAG}s"
fi
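The systemd auto-restart from the list above can be a drop-in override; an illustrative fragment for nginx (the same pattern applies to php-fpm and redis):

```
# /etc/systemd/system/nginx.service.d/restart.conf — illustrative drop-in
[Service]
Restart=on-failure
RestartSec=5
```

Apply it with `systemctl daemon-reload` followed by `systemctl restart nginx`.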

RTO/RPO for Various Failure Scenarios

Scenario                   RPO (data loss)           RTO (recovery time)
Web node failure           0                         < 5 sec (keepalived)
DB master failure          < 5 sec                   1–2 min (Orchestrator)
NFS/GlusterFS failure      0 (replication)           < 30 sec
Complete datacenter loss   Per backup RPO (1 hour)   2–4 hours

High availability costs money — at minimum a doubling of infrastructure. But the cost of one hour of downtime for a corporate portal used by 200 employees justifies the investment. Always calculate ROI before designing the architecture.