Setting Up Automatic Database Failover
Automatic failover is a mechanism that detects when the database master becomes unavailable and promotes a replica to master without manual intervention. The goal is to reduce RTO (Recovery Time Objective) from tens of minutes to seconds.
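The core loop every failover manager runs can be sketched as follows (illustrative Python, not Patroni's actual code; the `Cluster` class and node names are hypothetical):

```python
class Cluster:
    """Toy model of a replicated database cluster (not Patroni's API)."""

    def __init__(self, leader, lags):
        self.leader = leader
        self.leader_alive = True
        self.lags = lags  # replica name -> replication lag in bytes

    def failover(self):
        """If the leader is down, promote the least-lagged replica."""
        if not self.leader_alive:
            self.leader = min(self.lags, key=self.lags.get)
            self.leader_alive = True
        return self.leader

cluster = Cluster("node1", {"node2": 100, "node3": 5000})
cluster.leader_alive = False   # master crashes
print(cluster.failover())      # node2 - the replica with the lowest lag wins
```

The key design point is that "promote the least-lagged replica" bounds data loss on asynchronous replication: anything not yet shipped to the chosen replica is lost.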
Tools for PostgreSQL
Patroni — the de facto standard
Patroni is a Python daemon that runs on each PostgreSQL node. It uses a DCS (Distributed Consensus Store: etcd, Consul, or ZooKeeper) for coordination and leader election.
# /etc/patroni/patroni.yml (on each node)
scope: production-cluster
namespace: /service/
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.1.10:8008

etcd3:
  hosts: 192.168.1.20:2379,192.168.1.21:2379,192.168.1.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 30
    maximum_lag_on_failover: 1048576  # 1MB max lag on failover
    synchronous_mode: false

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.1.10:5432
  data_dir: /var/lib/postgresql/14/main
  authentication:
    replication:
      username: replication
      password: replication_password
  parameters:
    max_connections: 200
    shared_buffers: 256MB
    wal_level: replica
    hot_standby: "on"
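The `maximum_lag_on_failover` setting above is a promotion filter: a replica whose lag exceeds the bound is never considered as a candidate. A sketch of that eligibility check (illustrative Python, not Patroni internals):

```python
MAX_LAG = 1048576  # bytes, mirrors maximum_lag_on_failover above

def eligible_candidates(lags, max_lag=MAX_LAG):
    """Replicas that may be promoted: lag within the configured bound,
    best (lowest-lag) candidate first."""
    return sorted((n for n, lag in lags.items() if lag <= max_lag),
                  key=lags.get)

lags = {"node2": 4096, "node3": 2_000_000}
print(eligible_candidates(lags))  # ['node2'] - node3 is too far behind
```

If no replica qualifies, Patroni refuses to fail over automatically: losing more than 1MB of WAL is judged worse than staying down until an operator decides.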
Start:
systemctl start patroni
Check cluster status:
patronictl -c /etc/patroni/patroni.yml list
# Shows leader and replicas with lag
Manual switchover (planned):
patronictl -c /etc/patroni/patroni.yml switchover production-cluster
etcd for coordination
# etcd cluster on three servers
etcd --name etcd1 \
  --data-dir /var/lib/etcd \
  --listen-peer-urls http://192.168.1.20:2380 \
  --listen-client-urls http://192.168.1.20:2379 \
  --initial-advertise-peer-urls http://192.168.1.20:2380 \
  --advertise-client-urls http://192.168.1.20:2379 \
  --initial-cluster "etcd1=http://192.168.1.20:2380,etcd2=http://192.168.1.21:2380,etcd3=http://192.168.1.22:2380"
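Three servers is not an arbitrary number: etcd needs a Raft majority (quorum) to make progress, so even member counts buy no extra fault tolerance. The arithmetic, in plain Python:

```python
def quorum(n):
    """Votes needed for an etcd (Raft) cluster of n members to commit."""
    return n // 2 + 1

def tolerated_failures(n):
    """Members the cluster can lose while still reaching quorum."""
    return n - quorum(n)

for n in (1, 2, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
# 3 members survive 1 failure; 2 members survive none - same as 1.
```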
HAProxy for transparent switching
Patroni exposes health-check endpoints on its REST API:

- GET /master → 200 if the node is the leader
- GET /replica → 200 if the node is a replica
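HAProxy turns those checks into routing: only the server answering 200 on /master stays in the write backend, so client connections follow the leader automatically. The selection logic, simulated (hypothetical helper, not HAProxy code):

```python
def writable_backends(health):
    """health maps node -> HTTP status of GET /master on port 8008.
    HAProxy keeps only the servers whose health check returned 200."""
    return [node for node, status in health.items() if status == 200]

# Only the current Patroni leader answers /master with 200:
health = {"node1": 200, "node2": 503, "node3": 503}
print(writable_backends(health))  # ['node1'] - all writes go to the leader

# After a failover, node2 starts answering 200 and traffic moves with it:
health = {"node1": 503, "node2": 200, "node3": 503}
print(writable_backends(health))  # ['node2']
```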
# haproxy.cfg
frontend postgres_write
    bind *:5000
    mode tcp
    default_backend postgres_master

backend postgres_master
    mode tcp
    option httpchk GET /master
    http-check expect status 200
    server node1 192.168.1.10:5432 check port 8008
    server node2 192.168.1.11:5432 check port 8008
    server node3 192.168.1.12:5432 check port 8008
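The same pattern scales out reads: a second frontend can health-check /replica instead of /master and balance across standbys. A sketch (port 5001 and the backend name are arbitrary choices, not from the original config):

```
frontend postgres_read
    bind *:5001
    mode tcp
    default_backend postgres_replicas

backend postgres_replicas
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    server node1 192.168.1.10:5432 check port 8008
    server node2 192.168.1.11:5432 check port 8008
    server node3 192.168.1.12:5432 check port 8008
```

Applications then use port 5000 for writes and 5001 for reads, and neither needs to know which physical node is currently the leader.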
MySQL: InnoDB Cluster
MySQL Router + Group Replication is the official Oracle solution:
# Initialize cluster (MySQL Shell)
mysqlsh root@192.168.1.10:3306
JS> dba.createCluster('myCluster')
JS> cluster = dba.getCluster()
JS> cluster.addInstance('root@192.168.1.11:3306')
JS> cluster.status()
Set up MySQL Router:
mysqlrouter --bootstrap root@192.168.1.10:3306 --directory /etc/mysqlrouter
Router automatically directs writes to the primary and updates routing on failover.
Testing failover
# Force a failover away from the current leader
patronictl -c /etc/patroni/patroni.yml failover production-cluster --force
# Monitor switching
watch -n 1 patronictl -c /etc/patroni/patroni.yml list
Typical Patroni + etcd failover time: 10–30 seconds.
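Where those seconds go can be estimated from the `ttl` and `loop_wait` values in the Patroni config above (a back-of-the-envelope bound, not a guarantee; real timings vary with load and network):

```python
ttl = 30        # leader lease lifetime (bootstrap.dcs.ttl)
loop_wait = 10  # seconds between Patroni housekeeping runs

# Worst case: the leader dies right after renewing its lease, so the
# failure is only noticed once the lease expires; the next housekeeping
# loop on a replica then starts the election and promotion.
worst_detection = ttl
worst_total = ttl + loop_wait
print(f"detection <= {worst_detection}s, promotion starts <= {worst_total}s")
```

Typical failovers finish faster because the lease is usually already part-way through its TTL when the leader dies; lowering `ttl` shrinks the window at the cost of more DCS traffic and more sensitivity to network blips.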
Timeline
Setting up a Patroni cluster on 3 nodes with HAProxy and etcd takes 3–4 business days.