1C-Bitrix Clustering
Imagine a flash sale: 10,000 users are on the site at once, the server goes down with a 502 error, carts disappear, and managers call support. We have seen this dozens of times. The solution is clustering: load balancing between servers, database replication, and automatic failover. Order an audit of your current infrastructure, and within 2 days we will determine whether you need a cluster and which kind. Our experience: 40+ high-load projects on Bitrix.
Why is 1C-Bitrix clustering critical for fault tolerance?
80-90% of requests in a typical project are SELECT. Catalog, product pages, filters — all reads. Master-slave replication routes SELECTs to slave servers, leaving the master for writes only. The 'Web Cluster' module (Business edition and higher) routes requests automatically.
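The read/write split described above can be illustrated with a minimal sketch. It is not the Web Cluster module's actual logic; the class name `ReadWriteRouter`, the server labels, and the prefix-based detection of reads are assumptions made for illustration.

```python
import itertools


class ReadWriteRouter:
    """Toy router: reads rotate across slaves, everything else goes to the master."""

    READ_PREFIXES = ("SELECT", "SHOW")

    def __init__(self, master, slaves):
        self.master = master
        # Cycle through slaves in order; None means "no slaves configured".
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, sql):
        statement = sql.lstrip().upper()
        is_read = statement.startswith(self.READ_PREFIXES)
        # SELECT ... FOR UPDATE takes locks, so it has to run on the master.
        if is_read and "FOR UPDATE" not in statement and self._slaves:
            return next(self._slaves)
        return self.master


router = ReadWriteRouter("master", ["slave1", "slave2"])
print(router.route("SELECT * FROM catalog"))       # a slave
print(router.route("UPDATE orders SET paid = 1"))  # the master
```

A real router would parse statements properly and keep transactions on one connection; the point here is only that the read-heavy share of traffic never touches the master.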
Common stumbling blocks. On the master, set binlog_format = ROW: STATEMENT-based replication of non-deterministic functions such as UUID() or SYSDATE() produces inconsistencies that can take a week to debug. Every server needs a unique server-id, and the binary log must be enabled on the master. On each slave, set read_only = ON and configure relay-log. Initialize slaves with xtrabackup rather than mysqldump, which can lock tables for half an hour on a 20 GB database.
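The checklist above lends itself to an automated pre-flight check. The sketch below assumes the settings have already been read into plain dictionaries keyed by MySQL variable names; the function name and the dictionary layout are hypothetical.

```python
def check_replication_settings(master, slaves):
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    if str(master.get("binlog_format", "")).upper() != "ROW":
        problems.append("master: binlog_format should be ROW")
    if not master.get("log_bin"):
        problems.append("master: binary log is not enabled")
    # server_id must be present everywhere and must not repeat.
    server_ids = [master.get("server_id")] + [s.get("server_id") for s in slaves]
    if None in server_ids or len(set(server_ids)) != len(server_ids):
        problems.append("server_id must be set and unique on every node")
    for index, slave in enumerate(slaves):
        if not slave.get("read_only"):
            problems.append(f"slave {index}: read_only should be ON")
    return problems


master = {"binlog_format": "ROW", "log_bin": True, "server_id": 1}
slaves = [{"server_id": 2, "read_only": True}]
print(check_replication_settings(master, slaves))  # no problems reported
```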
Metric #1 — Seconds_Behind_Master. If a slave lags by 5+ seconds, a customer places an order, returns to their personal account — and the order is missing (SELECT went to a lagging slave). The module allows manual exclusion of critical queries from slave routing.
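A lag-aware routing rule might look like the following sketch. The 5-second threshold comes from the text above; the function name, the `critical` flag, and the shape of the `slave_lag` mapping are assumptions for illustration and do not reflect the module's API.

```python
def pick_read_target(master, slave_lag, max_lag_seconds=5, critical=False):
    """Choose where a read should go.

    slave_lag maps a slave name to its Seconds_Behind_Master value,
    or to None when replication on that slave is stopped.
    """
    if critical:
        # Read-your-own-writes queries (such as "my orders") go to the master.
        return master
    healthy = {
        name: lag
        for name, lag in slave_lag.items()
        if lag is not None and lag < max_lag_seconds
    }
    if not healthy:
        # Every slave is lagging or broken: fall back to the master.
        return master
    # Prefer the least-lagging slave; ties are broken by name.
    return min(healthy, key=lambda name: (healthy[name], name))


lag = {"slave1": 7, "slave2": 1, "slave3": None}
print(pick_read_target("master", lag))                 # slave2
print(pick_read_target("master", lag, critical=True))  # master
```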
Failover: Orchestrator promotes a slave to master in 15-30 seconds, and ProxySQL redirects traffic to the new master. The module supports up to 9 slave connections with configurable weights. For integrity checks, use pt-table-checksum from Percona Toolkit. Eliminating inefficient infrastructure can save up to 40% of the budget, a significant annual amount for projects with 50,000+ unique visitors. For more information on replication, see the MySQL replication documentation and the Wikipedia article on database replication.
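To show what configurable weights mean in practice, here is a sketch of smooth weighted round-robin, a technique used by some load balancers. It is an assumption that weights behave proportionally like this; the module's own selection algorithm is not described in the text, and the class name is hypothetical.

```python
class WeightedSlavePool:
    MAX_SLAVES = 9  # limit stated in the text above

    def __init__(self, weights):
        if not 0 < len(weights) <= self.MAX_SLAVES:
            raise ValueError("expected between 1 and 9 slaves")
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}

    def next(self):
        """Pick the next slave so that picks are proportional to the weights."""
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen


pool = WeightedSlavePool({"slave1": 3, "slave2": 1})
print([pool.next() for _ in range(4)])  # slave1 is picked three times out of four
```

Compared with random weighted choice, this variant spreads picks evenly over time, so a heavy slave does not receive long bursts of consecutive queries.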
When is clustering necessary?
Not every project needs it. Specific markers:
- 50,000-100,000 unique visitors per day — a single server starts returning 502 errors during peak hours
- Peak spikes of 5-10 times (sales, flash sales) — load grows in minutes, vertical scaling is not enough
- SLA of 99.9% (no more than about 8.8 hours of downtime per year), which is unattainable with a single server
- Geographic distribution of users
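The downtime budget behind an SLA figure in the list above is simple arithmetic. A small sketch, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(sla_percent):
    """Hours of downtime per year permitted by an availability SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} h/year")
```

For 99.9% this gives roughly 8.76 hours per year, which a single unplanned hardware failure can easily consume.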
Sometimes composite caching, SQL optimization, and vertical scaling are sufficient. We will honestly tell you if a cluster is not yet needed. Investments in clustering typically pay off within 3-6 months under peak loads. The project budget is determined individually.
What does the cluster architecture consist of?
Load balancer. HAProxy, nginx upstream, or cloud LB. Round-robin for even distribution, ip-hash for session stickiness, least connections for adaptive balancing. Health checks remove dead servers from the pool. SSL termination on the balancer offloads web nodes.
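The three balancing strategies and the effect of health checks can be compared in a small model. This is a simplified sketch, not HAProxy or nginx behavior in detail; the `Balancer` class and the server names are hypothetical.

```python
import itertools
import zlib


class Balancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.alive = set(self.servers)
        self.active = {s: 0 for s in self.servers}  # open connections per server
        self._rr = itertools.cycle(self.servers)

    def mark(self, server, healthy):
        """Record a health-check result for one server."""
        if healthy:
            self.alive.add(server)
        else:
            self.alive.discard(server)

    def _pool(self):
        pool = [s for s in self.servers if s in self.alive]
        if not pool:
            raise RuntimeError("no healthy servers")
        return pool

    def round_robin(self):
        self._pool()  # raises if nothing is alive
        while True:
            server = next(self._rr)
            if server in self.alive:
                return server

    def ip_hash(self, client_ip):
        # The same client IP maps to the same server while the pool is unchanged.
        pool = self._pool()
        return pool[zlib.crc32(client_ip.encode()) % len(pool)]

    def least_connections(self):
        return min(self._pool(), key=lambda s: self.active[s])


lb = Balancer(["web1", "web2", "web3"])
lb.mark("web2", False)  # failed health check
print([lb.round_robin() for _ in range(4)])  # web2 never appears
```

Note that in this sketch ip-hash remaps some clients whenever the healthy pool changes, which is one more reason to keep sessions in a shared store rather than rely on stickiness.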
Web servers. Identical nginx + php-fpm, each with a full copy of the code. Sessions in Redis/Memcached, not on disk (otherwise users lose their cart when switching servers). In the cloud — auto-scaling: load increases — servers are added, load decreases — they are removed.
Cache. Redis Cluster with data sharding across nodes. Redis Sentinel for small clusters. Memcached is fast but lacks persistence. Configuration in .settings.php — servers, weights, sharding strategy.
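For context on how sharding works in Redis Cluster itself: each key is mapped to one of 16384 hash slots using CRC16, and a `{tag}` inside the key forces related keys into the same slot. The sketch below reproduces that mapping with the standard library; it illustrates Redis Cluster's documented scheme and is not a description of the Bitrix cache configuration.

```python
import binascii

SLOTS = 16384  # number of hash slots in Redis Cluster


def key_slot(key):
    """Return the Redis Cluster hash slot for a key, honoring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag is used
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant.
    return binascii.crc_hqx(data, 0) % SLOTS


print(key_slot("foo"))
print(key_slot("{user1000}.cart") == key_slot("{user1000}.session"))
```

Keys that must be read or written together, such as a user's cart and session, can share a hash tag so that they land on the same node.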
File storage. Uploads, images — accessible from each node. NFS for 2-3 servers, but it is a single point of failure. GlusterFS — distributed file system without single point of failure. S3 (MinIO, AWS, Yandex Object Storage) — offload static files to object storage, the Bitrix module works out of the box.
How to ensure failover at each cluster level?
| Level | Mechanism | RTO |
|---|---|---|
| Load balancer | Keepalived + VRRP | < 5 sec |
| Web servers | Health check | < 10 sec |
| MySQL master | Orchestrator / ProxySQL | < 30 sec |
| MySQL slave | Removal from pool | < 5 sec |
| Redis | Sentinel / Cluster failover | < 15 sec |
| Files | GlusterFS replication | Automatic |
The cluster is 5 times more reliable than a single server — if any node fails, the service continues to operate.
What are common clustering setup mistakes?
- Sessions on files — when a server goes down, users lose cart and authentication.
- Unmonitored Seconds_Behind_Master — sales suffer and SLA is unmet.
- Single point of failure at the file storage level (NFS without replication).
- Lack of replication monitoring — data inconsistencies go undetected.
We include checks for all these points in our audit and testing.
What is the clustering process?
- Load audit — load profile, bottlenecks, load testing. We find the ceiling of a single server.
- Design — components tailored to requirements and budget. Not everyone needs GlusterFS — sometimes NFS and backups suffice.
- Infrastructure — servers, network, firewalls. Ansible for automation — any node can be recreated in minutes.
- Migration — transfer with minimal downtime. Components are connected sequentially, each step verified.
- Testing — simulation of peak conditions. We crash the master, disconnect a web server, kill Redis — see how the system behaves.
- Documentation — architecture diagram, runbook, disaster recovery plans.
What does clustering work include?
| Deliverable | Description |
|---|---|
| Current load audit | Request profile, bottlenecks, load testing |
| Project documentation | Architecture diagram, runbook, disaster recovery plan |
| Infrastructure | Server, network, firewall setup (Ansible) |
| Migration | Transfer with minimal downtime, phased component connection |
| Testing | Simulation of peak conditions: crash master, disconnect web server, kill Redis |
| Team training | Documentation, 2 weeks of post-implementation consultations |
| Warranty | 6 months of correct cluster operation — if something goes wrong, we fix it within 24 hours |
What are the typical timelines?
| Task | Timeline |
|---|---|
| Audit and design | 1-2 weeks |
| Basic cluster (2 web + master-slave MySQL) | 2-3 weeks |
| Full cluster with failover at all levels | 4-6 weeks |
| Monitoring + load testing | 2-4 weeks |
Contact us to get an engineer consultation and a preliminary project estimate within 2 days. We will calculate the cost based on your specific needs. Order an audit to find out the exact architecture and budget.







