1C-Bitrix Clustering
Master-Slave MySQL Replication — The Core of the Whole Endeavor
Let's start with the main thing. 80-90% of queries in a typical Bitrix project are SELECTs. Catalog, product cards, filters, listings — all reads. Master-slave replication offloads those SELECTs to slave servers while the master handles only writes. The "Web Cluster" module (Business edition and above) routes queries automatically.
Where the setup usually trips up:
On master: binlog_format = ROW. Not STATEMENT — on complex queries with NOW(), UUID(), or non-deterministic functions, STATEMENT replication produces discrepancies between master and slave. Debugging that takes a week. Plus a mandatory unique server-id and enabled binary log.
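The master side can be sketched in a few lines of my.cnf (paths, IDs, and retention are illustrative; tune for your environment):

```ini
# /etc/mysql/conf.d/master.cnf — minimal master sketch
[mysqld]
server-id        = 1                              # must be unique across the cluster
log_bin          = /var/log/mysql/mysql-bin.log   # binary log is mandatory for replication
binlog_format    = ROW                            # deterministic replication of NOW(), UUID(), etc.
expire_logs_days = 7                              # keep binlogs from filling the disk
sync_binlog      = 1                              # durability over raw write speed
```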
On slave: read_only = ON, its own server-id, relay-log. Initialization via xtrabackup — mysqldump on a 20 GB database will lock tables for half an hour.
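A matching slave-side sketch (again, values are illustrative):

```ini
# /etc/mysql/conf.d/slave.cnf — minimal slave sketch
[mysqld]
server-id         = 2                             # unique, different from the master
read_only         = ON                            # reject writes from applications
super_read_only   = ON                            # MySQL 5.7+: also blocks SUPER users
relay-log         = /var/log/mysql/relay-bin
log_slave_updates = ON                            # needed if this slave may be promoted
```

For seeding: take a hot backup on the master with `xtrabackup --backup --target-dir=/backup/base`, run `xtrabackup --prepare --target-dir=/backup/base`, restore the datadir on the slave, then issue `CHANGE MASTER TO ...` using the binlog coordinates recorded in `xtrabackup_binlog_info`.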
Seconds_Behind_Master — the number one metric. If the slave lags by 5+ seconds, a customer places an order, returns to their account — and the order isn't there because the SELECT went to a lagging slave. The "Web Cluster" module lets you exclude critical queries from slave routing, but this needs to be configured manually.
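The routing decision described above can be sketched as a small function: reads go to the least-lagging healthy replica, while critical reads (and all reads when every replica is stale) fall back to the master. Names and the threshold are illustrative, not the Web Cluster module's actual API:

```python
# Sketch of lag-aware read routing based on Seconds_Behind_Master
# from SHOW SLAVE STATUS. A lag of None means replication is broken.

MAX_LAG_SECONDS = 5  # beyond this, the replica is considered stale

def pick_read_host(replicas, critical=False):
    """replicas: list of (host, seconds_behind_master) tuples.
    Critical reads (e.g. 'show me my just-placed order') always hit the master."""
    if critical:
        return "master"
    healthy = [(h, lag) for h, lag in replicas
               if lag is not None and lag <= MAX_LAG_SECONDS]
    if not healthy:
        return "master"  # all replicas lagging or broken — fall back
    # route to the least-lagging replica
    return min(healthy, key=lambda hl: hl[1])[0]
```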
Failover: Orchestrator or ProxySQL promote a slave to master in 15-30 seconds. The module supports up to 9 slave connections with configurable load distribution weights. Integrity checks: pt-table-checksum from Percona Toolkit; on large databases, silent master-slave drift is inevitable without regular checks.
When You Actually Need a Cluster
Not every project does. Specific markers:
- 50,000-100,000 unique visitors per day — a single server starts returning 502s during peak hours
- Peak spikes of 5-10x (sales, flash sales) — load grows within minutes, you can't scale vertically
- SLA 99.9% (no more than 8.7 hours of downtime per year) — unachievable with a single server
- Geographically distributed users
Sometimes composite caching, SQL optimization, and vertical scaling are enough. We'll be honest if a cluster isn't needed yet.
Architecture — Four Layers
Load balancer. HAProxy, nginx upstream, or a cloud LB. Round-robin for even distribution, ip-hash for session affinity, least connections for adaptive balancing. Health checks automatically remove dead servers from the pool. SSL termination on the balancer offloads the web nodes.
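With nginx as the balancer, the whole layer fits in one upstream block (hostnames, certificate paths, and thresholds are illustrative):

```nginx
# nginx balancer sketch: health checks, SSL termination, least-connections
upstream bitrix_backend {
    least_conn;                          # adaptive balancing; use ip_hash for affinity
    server web1.internal:80 max_fails=3 fail_timeout=10s;
    server web2.internal:80 max_fails=3 fail_timeout=10s;
    server web3.internal:80 backup;      # only used when the others are down
}

server {
    listen 443 ssl;                      # SSL terminated here, offloading web nodes
    ssl_certificate     /etc/nginx/ssl/site.crt;
    ssl_certificate_key /etc/nginx/ssl/site.key;

    location / {
        proxy_pass http://bitrix_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

`max_fails`/`fail_timeout` give passive health checking: a node that fails three times is pulled from rotation for 10 seconds.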
Web servers. Identical nginx + php-fpm nodes, each with a full copy of the code. Critical: sessions in Redis/Memcached, not on disk — otherwise when switching between servers, the user "loses" their cart. In the cloud, we configure autoscaling — load increases, servers are added; load drops, extras are shut down.
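Moving sessions off disk is a two-line php.ini change, assuming the phpredis extension is installed (host and credentials are illustrative):

```ini
; php.ini sketch — sessions in Redis so any web node sees the same cart
session.save_handler = redis
session.save_path    = "tcp://redis1.internal:6379?auth=secret&database=1"
```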
Cache. Redis Cluster with data sharding across nodes is the preferred option. Redis Sentinel is simpler and suits smaller setups: it provides failover but not sharding. Memcached is fast but has no persistence. Configuration lives in .settings.php: servers, weights, sharding strategy.
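The shape of the .settings.php cache section looks roughly like this; exact keys vary between Bitrix versions and the Web Cluster module manages multi-server pools through its own admin UI, so treat this as a single-node sketch:

```php
<?php
// Fragment of bitrix/.settings.php — cache backend sketch (keys are version-dependent)
return [
    'cache' => [
        'value' => [
            'type'  => 'redis',
            'redis' => [
                'host' => '127.0.0.1',
                'port' => 6379,
            ],
            'sid'   => 'site1',   // cache namespace prefix
        ],
    ],
];
```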
File storage. Uploads, product images — must be accessible from every node. NFS is a workable option for 2-3 servers, but it's a single point of failure. GlusterFS — a distributed filesystem with no single point of failure, data is replicated between nodes. S3 (MinIO, AWS, Yandex Object Storage) — offloading static files to object storage, the Bitrix module works out of the box.
Failover at Every Layer
| Layer | Mechanism | RTO |
|---|---|---|
| Load balancer | Keepalived + VRRP | < 5 sec |
| Web servers | Balancer health check | < 10 sec |
| MySQL master | Orchestrator / ProxySQL | < 30 sec |
| MySQL slave | Exclusion from pool | < 5 sec |
| Redis | Sentinel / Cluster failover | < 15 sec |
| Files | GlusterFS replication | Automatic |
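The balancer-layer failover from the table above is a keepalived pair sharing one virtual IP; a sketch for the primary node (interface, ID, and VIP are illustrative):

```conf
# /etc/keepalived/keepalived.conf — primary balancer sketch
vrrp_instance VI_1 {
    state MASTER            # BACKUP on the standby balancer
    interface eth0
    virtual_router_id 51    # must match on both nodes
    priority 100            # lower (e.g. 90) on the standby
    advert_int 1            # 1 s advertisements; failover well under 5 s
    authentication {
        auth_type PASS
        auth_pass s3cret
    }
    virtual_ipaddress {
        192.0.2.10/24       # the VIP that DNS points clients at
    }
}
```

When the MASTER stops sending VRRP advertisements, the BACKUP claims the VIP after roughly three missed intervals, which is where the sub-5-second RTO comes from.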
Our Approach
- Load audit — load profile, bottlenecks, load testing. We find the ceiling of a single server.
- Design — components tailored to requirements and budget. Not everyone needs GlusterFS — sometimes NFS and backups are enough.
- Infrastructure — servers, networking, firewalls. Ansible for automation — any node can be recreated in minutes.
- Migration — transfer with minimal downtime. Components are connected sequentially, each step verified.
- Testing — simulating peak conditions. We kill the master, take down a web server, destroy Redis — and watch how the system behaves.
- Documentation — architecture diagram, runbook, disaster recovery plans.
Timelines
| Task | Timeline |
|---|---|
| Audit and design | 1-2 weeks |
| Basic cluster (2 web + master-slave MySQL) | 2-3 weeks |
| Full cluster with failover at all layers | 4-6 weeks |
| Monitoring + load testing | 2-4 weeks |