Avoiding Connection Exhaustion During Replica Failover

Connection exhaustion during replica promotion is not a stochastic anomaly; it is a deterministic failure mode resulting from misaligned pool configurations, aggressive client-side retry logic, and topology routing lag. This runbook provides a configuration-first, step-by-step execution guide to maintain connection stability, enforce strict pool boundaries, and execute controlled failovers without cascading service degradation.

Symptom Identification & Telemetry Baselines

Detect exhaustion before cascading failure. Monitor pool saturation thresholds (>90% utilized), TCP SYN retransmission spikes, proxy queue depth metrics, and application-level error patterns (FATAL: too many connections for role, connection refused, pool exhausted).

Real-Time Alert Thresholds

Deploy actionable alert rules in Prometheus/Grafana to trigger automated mitigation:

# Pool saturation warning (PgBouncer/ProxySQL)
(pgbouncer_pools_active_connections / pgbouncer_pools_max_connections) > 0.85
# Connection wait latency degradation
pgbouncer_pools_wait_time_seconds > 0.5
# Proxy health check failure rate
rate(haproxy_server_check_failures_total{backend="db_read_replicas"}[1m]) > 0.3

Database-side baseline validation:

SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Must remain < max_connections * 0.85 during normal operations

Log Pattern Extraction

Differentiate transient connection churn from true exhaustion using structured log parsing:

  • Churn/Reconnect: (?i)connection.*reset|tcp.*reset|idle.*timeout|server.*closed
  • True Exhaustion: (?i)too many connections|pool exhausted|connection refused|FATAL.*role|ECONNREFUSED

Correlate timestamps across proxy access logs, database audit logs, and application stdout to isolate the exact failover window and identify the originating client fleet.
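
For illustration, a minimal Go sketch that applies the two patterns above to classify log lines before correlation (the classify helper and the sample lines are hypothetical):

package main

import (
    "fmt"
    "regexp"
)

// Patterns taken from the list above; (?i) makes them case-insensitive.
var (
    churnRe      = regexp.MustCompile(`(?i)connection.*reset|tcp.*reset|idle.*timeout|server.*closed`)
    exhaustionRe = regexp.MustCompile(`(?i)too many connections|pool exhausted|connection refused|FATAL.*role|ECONNREFUSED`)
)

// classify tags a raw log line so true exhaustion can be alerted on separately from churn.
func classify(line string) string {
    switch {
    case exhaustionRe.MatchString(line):
        return "exhaustion"
    case churnRe.MatchString(line):
        return "churn"
    default:
        return "other"
    }
}

func main() {
    for _, l := range []string{
        `FATAL: too many connections for role "app"`,
        "server closed the connection unexpectedly",
    } {
        fmt.Println(classify(l), "<-", l)
    }
}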

Root Cause Analysis: Why Failover Triggers Exhaustion

Topology changes disrupt the connection lifecycle. DNS TTL propagation delays, proxy routing lag, and unbounded client retries generate a thundering herd. Stale connection validation often bypasses pool limits, and improper routing and pooling configurations (see Connection Routing & Pooling Strategies) amplify retry storms during replica promotion events.

DNS vs Proxy Routing Latency

DNS-based routing introduces propagation windows (TTL 30s–300s) where clients resolve to decommissioned or read-only replicas. This creates split-brain connection states: half the fleet connects to an offline node, triggering immediate ECONNREFUSED or 57P01 errors. Proxy-based routing (L4/L7) eliminates DNS lag but requires active health-check synchronization to avoid routing to nodes in recovering or promoting states.

Retry Storm Mechanics

Default exponential backoff (base_delay * 2^attempt) without jitter causes synchronized retry waves. During a 5-second promotion window, 1,000 client instances, each re-opening dozens of pooled connections at 1s, 2s, and 4s intervals, can generate more than 10,000 near-simultaneous connection attempts, instantly saturating max_connections and exhausting the proxy's max_client_conn buffer. Without jitter, retries align to the same millisecond, overwhelming TCP handshake queues.
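
A minimal sketch of full-jitter backoff in Go, assuming illustrative base and cap values; the delay is drawn uniformly from [0, base * 2^attempt] so concurrent clients spread out instead of retrying in lockstep:

package main

import (
    "math/rand"
    "time"
)

// backoffWithJitter returns a randomized delay for the given retry attempt.
// Full jitter: pick uniformly from [0, min(cap, base*2^attempt)] so that
// thousands of clients do not align on the same retry instant.
func backoffWithJitter(attempt int) time.Duration {
    const (
        base    = 250 * time.Millisecond // illustrative base delay
        maxWait = 10 * time.Second       // illustrative cap
    )
    exp := base << uint(attempt) // base * 2^attempt
    if exp <= 0 || exp > maxWait {
        exp = maxWait
    }
    return time.Duration(rand.Int63n(int64(exp) + 1))
}

func main() {
    for attempt := 0; attempt < 5; attempt++ {
        time.Sleep(backoffWithJitter(attempt))
        // ... attempt to (re)connect here; break out of the loop on success
    }
}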

Pre-Failover Configuration & Pool Hardening

Align pool sizing with expected failover concurrency per Connection Pool Architecture for Read Replicas. Enforce strict limits, idle timeouts, and validation queries to prevent stale connections from consuming slots during topology transitions.

Proxy Layer Configuration

PgBouncer (pgbouncer.ini)

[databases]
app_db = host=replica1 port=5432 dbname=app_db

[pgbouncer]
listen_port = 6432
max_client_conn = 10000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
server_lifetime = 3600
server_idle_timeout = 30
server_connect_timeout = 5
tcp_keepalive = 1
tcp_keepidle = 30
tcp_keepintvl = 10
tcp_keepcnt = 3

ProxySQL (proxysql.cnf)

mysql_servers =
(
    {
        address="replica1"
        port=3306
        hostgroup=10
        max_connections=2000
        max_replication_lag=5
    }
)
mysql_query_rules =
(
    {
        rule_id=1
        active=1
        match_pattern="^SELECT"
        destination_hostgroup=10
        apply=1
    }
)

Application Pool Tuning

Apply failover-safe defaults across common frameworks:

  • Java/HikariCP: maximumPoolSize=50, minimumIdle=10, connectionTimeout=2000, validationTimeout=1000, leakDetectionThreshold=30000
  • Python/SQLAlchemy: pool_size=20, max_overflow=10, pool_timeout=30, pool_recycle=1800, pool_pre_ping=True
  • Go database/sql: SetMaxOpenConns(50), SetMaxIdleConns(10), SetConnMaxLifetime(30 * time.Minute), SetConnMaxIdleTime(5 * time.Minute) (see the sketch after this list)
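
As a concrete illustration, a minimal Go sketch applying the database/sql values listed above; the DSN is a placeholder and the lib/pq driver is an assumption (any database/sql driver works):

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // assumed driver; swap for your environment
)

func main() {
    // Placeholder DSN; point it at the pooler (e.g. PgBouncer on 6432), not the replica directly.
    db, err := sql.Open("postgres", "host=127.0.0.1 port=6432 dbname=app_db sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }

    // Failover-safe defaults from the list above.
    db.SetMaxOpenConns(50)                  // hard ceiling per application instance
    db.SetMaxIdleConns(10)                  // warm floor without hoarding pool slots
    db.SetConnMaxLifetime(30 * time.Minute) // recycle well before server_lifetime (3600s) expires
    db.SetConnMaxIdleTime(5 * time.Minute)  // release idle connections during quiet periods

    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }
}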

Mitigation Runbook: Active Failover Execution

Execute the following steps in order. Do not skip steps. Maintain strict observability during each phase.

Graceful Drain Sequence

  1. Pre-emptive Pause: Halt new connections to the target replica.
# PgBouncer (admin console)
psql -p 6432 pgbouncer -c "PAUSE app_db;"
# ProxySQL (admin interface; load to runtime so the change takes effect)
mysql -u admin -p -h 127.0.0.1 -P 6032 -e "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=10; LOAD MYSQL SERVERS TO RUNTIME;"
  2. Connection Eviction: Wait for active transactions to commit or roll back. Monitor pg_stat_activity until the active connection count for the draining replica reaches 0.
  3. Client Notification: Trigger application-level SIGUSR1 handling or a health endpoint (/db/readiness) that returns 503 for read traffic (see the sketch after this sequence).
  4. Resume Routing: Once drained, point the pool at the promoted standby.
# Update the [databases] host entry to the promoted standby, then reload and resume
psql -p 6432 pgbouncer -c "RELOAD;"
psql -p 6432 pgbouncer -c "RESUME app_db;"
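
A minimal Go sketch of the readiness behavior from step 3; the /db/readiness path comes from the step above, while the draining flag and how it is toggled (e.g. from a SIGUSR1 handler) are assumptions:

package main

import (
    "log"
    "net/http"
    "sync/atomic"
)

// draining is set to true when the drain sequence starts and back to false
// after routing is resumed; a SIGUSR1 handler or admin endpoint would flip it.
var draining atomic.Bool

func main() {
    http.HandleFunc("/db/readiness", func(w http.ResponseWriter, r *http.Request) {
        if draining.Load() {
            // Tell the load balancer to stop routing read traffic here.
            http.Error(w, "database draining", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}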

Circuit Breaker Activation

Throttle connection attempts during the promotion window to prevent retry storms.

  • Resilience4j: Configure failureRateThreshold=50, waitDurationInOpenState=5000ms, slidingWindowSize=20.
  • Envoy Proxy: Set circuit_breakers thresholds: max_connections: 1000, max_pending_requests: 1000, max_retries: 3.
  • Custom Middleware: Implement a token bucket or semaphore limiting concurrent connection attempts to pool_size * 1.2. Return 429 Too Many Requests or 503 Service Unavailable when exhausted (see the semaphore sketch after this list).
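
A minimal Go sketch of the semaphore variant; the pool_size * 1.2 sizing comes from the item above, while the ConnGate type, the 100 ms acquisition timeout, and ErrOverloaded are assumptions:

package main

import (
    "context"
    "errors"
    "time"
)

// ErrOverloaded is what the HTTP layer translates into 429 or 503.
var ErrOverloaded = errors.New("connection attempts throttled")

// ConnGate caps in-flight connection attempts using a buffered channel as a semaphore.
type ConnGate struct {
    slots chan struct{}
}

// NewConnGate sizes the gate at pool_size * 1.2, as suggested above.
func NewConnGate(poolSize int) *ConnGate {
    return &ConnGate{slots: make(chan struct{}, poolSize*12/10)}
}

// Acquire waits briefly for a slot; if none frees up, the caller sheds load.
func (g *ConnGate) Acquire(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
    defer cancel()
    select {
    case g.slots <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ErrOverloaded
    }
}

// Release frees a slot once the connection attempt has finished, success or not.
func (g *ConnGate) Release() { <-g.slots }

func main() {
    gate := NewConnGate(50) // pool_size=50 allows 60 concurrent attempts
    if err := gate.Acquire(context.Background()); err != nil {
        return // respond 429 Too Many Requests / 503 Service Unavailable here
    }
    defer gate.Release()
    // ... open or check out the database connection here
}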

Post-Failover Validation & Rollback Procedures

Verify connection normalization. Define strict rollback triggers: replication lag > 10s, connection error rate > 5% for > 60s, or primary promotion failure.

Connection Normalization Checks

Run validation queries immediately post-failover:

-- Verify active/idle ratio
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
-- Verify wait queue clearance (PgBouncer)
SHOW POOLS; -- run on the PgBouncer admin console; cl_waiting and maxwait must be 0
-- Verify proxy routing table (ProxySQL)
SELECT * FROM runtime_mysql_servers WHERE hostgroup_id=10; -- status must be ONLINE

Confirm max_connections utilization is < 60% and tcp_keepalive probes are stable.

Automated Rollback Triggers

If validation fails, execute idempotent rollback:

# 1. Halt write routing immediately
psql -p 6432 pgbouncer -c "PAUSE app_db;"
# 2. Revert proxy routing to original topology
mysql -u admin -p -h 127.0.0.1 -P 6032 -e "UPDATE mysql_servers SET status='ONLINE' WHERE hostgroup_id=10 AND hostname='original_replica'; LOAD MYSQL SERVERS TO RUNTIME;"
# 3. Flush stale connections
psql -p 6432 pgbouncer -c "RECONNECT app_db;"
# 4. Resume traffic
psql -p 6432 pgbouncer -c "RESUME app_db;"

Log all state transitions. Do not attempt automatic retry without manual DBA sign-off.

Long-Term Architectural Safeguards

Prevent recurrence through topology-aware scaling and automated validation.

Topology-Aware Pool Scaling

Integrate pool auto-scalers with cluster state managers (e.g., Patroni, Orchestrator). When a replica transitions to promoting, dynamically reduce max_client_conn and increase server_idle_timeout to absorb connection spikes. Use Kubernetes HPA or custom controllers to adjust application SetMaxOpenConns based on pg_stat_activity metrics.
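
For illustration, a minimal Go sketch of the application-side half of this idea: poll pg_stat_activity and shrink SetMaxOpenConns as the server approaches max_connections. The 85% threshold, 15-second interval, DSN, and ceilings are assumptions; a production controller would also consume the cluster manager's promoting/recovering state as described above:

package main

import (
    "database/sql"
    "time"

    _ "github.com/lib/pq" // assumed driver
)

// adjustPool polls server-side utilization and scales the local pool ceiling so the
// combined client fleet stays clear of max_connections during topology transitions.
func adjustPool(db *sql.DB, normalMax, reducedMax int) {
    for range time.Tick(15 * time.Second) { // illustrative polling interval
        var active, maxConns int
        row := db.QueryRow(`SELECT
            (SELECT count(*) FROM pg_stat_activity WHERE state = 'active'),
            current_setting('max_connections')::int`)
        if err := row.Scan(&active, &maxConns); err != nil {
            continue // keep the previous limit on transient errors
        }
        if float64(active) > 0.85*float64(maxConns) { // assumed back-off threshold
            db.SetMaxOpenConns(reducedMax)
        } else {
            db.SetMaxOpenConns(normalMax)
        }
    }
}

func main() {
    db, err := sql.Open("postgres", "host=127.0.0.1 port=6432 dbname=app_db sslmode=disable")
    if err != nil {
        panic(err)
    }
    go adjustPool(db, 50, 20) // normal and reduced ceilings are illustrative
    select {}
}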

Chaos Engineering & Failover Drills

Schedule monthly controlled failover simulations using tools like Chaos Mesh or Gremlin. Inject network partitions, DNS TTL spikes, and proxy health-check delays. Validate that circuit breakers engage, pools drain gracefully, and rollback procedures execute within defined RTO/RPO windows. Document telemetry baselines and update runbooks quarterly.