Implementing Read/Write Splitting at the Proxy Layer
1. Architectural Foundations of Proxy-Based Routing
Decoupling application logic from database topology is a foundational reliability pattern. By introducing a dedicated proxy layer (e.g., ProxySQL, PgBouncer, HAProxy, or Envoy), teams centralize traffic distribution, enforce routing policies, and abstract the underlying primary/replica topology from service code. The proxy operates at either Layer 4 (TCP stream multiplexing) or Layer 7 (SQL-aware parsing), intercepting client connections, inspecting payloads, and forwarding traffic to the appropriate hostgroup.
Topology discovery and health-checking are the proxy’s primary operational responsibilities. Probes typically run at configurable intervals (health_check_interval_ms=2000), executing lightweight commands (pg_isready, SELECT 1, or mysql_ping). When a replica fails health checks consecutively (max_failing_checks=3), the proxy marks it as OFFLINE_SOFT or SHUNNED, draining existing connections and halting new read traffic. During degraded states, the proxy must implement a deterministic fallback: route all traffic to the primary, or reject reads with a 503 Service Unavailable equivalent to prevent stale data propagation.
When establishing baseline routing policies, teams should align proxy configurations with broader Connection Routing & Pooling Strategies to ensure consistent traffic distribution across primary and replica nodes. Misaligned pool sizing or routing thresholds at the proxy layer frequently cascade into connection storms during failover events.
# HAProxy L4/L7 Hybrid Configuration for PostgreSQL
defaults
    mode tcp
    timeout connect 1000ms
    timeout client 30000ms
    timeout server 30000ms
    maxconn 20000

frontend db_read
    bind *:5433
    default_backend pg_read_replicas

backend pg_read_replicas
    balance leastconn
    # PostgreSQL does not speak HTTP, so an httpchk probe would always fail;
    # use HAProxy's native pgsql-check instead (the check user is illustrative
    # and must exist in the database and pg_hba.conf)
    option pgsql-check user haproxy_check
    server replica1 10.0.1.10:5432 check inter 2000 fall 3 rise 2 maxconn 5000
    server replica2 10.0.1.11:5432 check inter 2000 fall 3 rise 2 maxconn 5000
    # Degraded-state fallback: if all replicas fail, route to the primary
    server primary 10.0.1.5:5432 backup check inter 2000 fall 3 rise 2
2. Proxy Configuration Patterns & Query Parsing
Routing accuracy hinges on the proxy’s query parsing engine. Regex-based matching is computationally cheap but brittle; it struggles with whitespace variations, comments, and multi-line statements. Full SQL parsing (used in SQL-aware routers such as Vitess, which builds an AST for every statement) normalizes queries, identifies statement boundaries, and accurately classifies intent. Production deployments typically employ a hybrid approach: regex for fast-path filtering, with complex or ambiguous statements falling through to deeper parsing or a conservative default route.
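A minimal sketch of that hybrid fast path, assuming a hypothetical QueryClassifier rather than any particular proxy's engine: unambiguous statements are matched by regex, and anything the fast path cannot classify falls back to the primary, standing in here for a full parser.

// Hypothetical fast-path classifier: regex for the common cases,
// conservative fallback (primary) for anything ambiguous.
import java.util.regex.Pattern;

public final class QueryClassifier {

    private static final Pattern READ = Pattern.compile(
            "^\\s*(?:/\\*.*?\\*/\\s*)*SELECT\\b",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern FORCE_PRIMARY = Pattern.compile(
            "\\bFOR\\s+(UPDATE|SHARE)\\b|^\\s*(BEGIN|START|SET|INSERT|UPDATE|DELETE|ALTER|CREATE|DROP)\\b",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public enum Route { PRIMARY, REPLICA }

    public static Route classify(String sql) {
        // Deny-list first: locking reads and writes must never hit a replica.
        if (FORCE_PRIMARY.matcher(sql).find()) {
            return Route.PRIMARY;
        }
        if (READ.matcher(sql).find()) {
            return Route.REPLICA;
        }
        // Ambiguous statement: fall back to the primary (the safe default)
        // rather than risk a stale or incorrect read.
        return Route.PRIMARY;
    }
}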
Critical routing rules must explicitly handle transaction boundaries. Once a BEGIN or SET autocommit=0 is detected, the proxy must pin the session to the primary hostgroup for the transaction’s duration. Failing to do so causes split-session routing, where a SELECT executes on a lagging replica mid-transaction, violating read-your-writes consistency. Prepared statements (PREPARE/EXECUTE) also require careful handling: the proxy must track statement IDs and ensure EXECUTE routes to the same hostgroup where PREPARE was issued.
# ProxySQL query rules (MySQL syntax shown; ProxySQL's PostgreSQL support uses the analogous pgsql_query_rules table)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply) VALUES
(10, 1, '^SELECT\s.*\sFOR\s(UPDATE|SHARE)', 10, 1), -- Pin locking reads to primary
(20, 1, '^SELECT\s', 20, 1),                        -- Route standard reads to replicas
(30, 1, '^(BEGIN|START|COMMIT|ROLLBACK|INSERT|UPDATE|DELETE|ALTER|CREATE|DROP)', 10, 1),
(99, 1, '.*', 10, 1);                               -- Default fallback to primary
# Pin in-flight transactions to the hostgroup where they began
UPDATE mysql_users SET transaction_persistent=1;
# Critical thresholds: shun replicas lagging more than 30s (max_replication_lag
# is a per-server column, not a global variable); fail server connects fast
UPDATE mysql_servers SET max_replication_lag=30 WHERE hostgroup_id=20;
UPDATE global_variables SET variable_value=1000 WHERE variable_name='mysql-connect_timeout_server';
# None of the above takes effect until loaded to runtime
LOAD MYSQL QUERY RULES TO RUNTIME; LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME; LOAD MYSQL VARIABLES TO RUNTIME;
Optimizing connection reuse at this layer requires understanding the underlying Connection Pool Architecture for Read Replicas to prevent pool exhaustion during traffic spikes and ensure efficient multiplexing. A common incident vector is false-positive read routing: SELECT ... FOR UPDATE, implicit transaction boundaries from ORM connection pooling, or SHOW VARIABLES queries accidentally hitting replicas. Mitigation requires strict deny-lists and explicit apply=1 flags to halt rule evaluation once a match occurs.
3. Application Integration & ORM Coordination
The application driver’s interaction with the proxy dictates routing transparency. In transparent proxy routing, the application connects to a single VIP or DNS endpoint, and the proxy handles all distribution. In explicit routing, the application uses connection hints, query tags, or separate connection strings. Transparent routing reduces code complexity but shifts consistency guarantees to the proxy’s configuration. Explicit routing provides deterministic control but increases coupling to infrastructure topology.
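One concrete form of explicit routing, assuming the PostgreSQL JDBC driver (which supports multi-host URLs; hosts below are illustrative): maintain separate read and write connection strings, where targetServerType=preferSecondary asks the driver to prefer replicas but fall back to the primary.

// Explicit routing via pgJDBC multi-host URLs (endpoints are illustrative).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public final class ExplicitEndpoints {

    // Writes always target the primary.
    static final String WRITE_URL =
            "jdbc:postgresql://10.0.1.5:5432/appdb?targetServerType=primary";

    // Reads prefer replicas, falling back to the primary if none are reachable.
    static final String READ_URL =
            "jdbc:postgresql://10.0.1.10:5432,10.0.1.11:5432,10.0.1.5:5432/appdb"
            + "?targetServerType=preferSecondary&loadBalanceHosts=true";

    static Connection open(boolean readOnly, String user, String pass) throws SQLException {
        return DriverManager.getConnection(readOnly ? READ_URL : WRITE_URL, user, pass);
    }
}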
Replication lag is the primary operational constraint. Proxies must monitor Seconds_Behind_Master (MySQL) or pg_stat_replication (PostgreSQL). When lag exceeds a defined threshold (max_replication_lag=30), the proxy should temporarily route reads to the primary or return a routing error. Connection strings must include resilient retry parameters: connect_timeout=3s, socket_timeout=10s, and retries=3. DNS-based endpoint resolution (dns_cache_ttl=5s) should be paired with proxy-native failover to avoid stale DNS records during primary promotion.
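Where the proxy's own lag metrics are unavailable, the same gate can be approximated from the application side. A sketch, assuming PostgreSQL 10+ (which exposes replay_lag in pg_stat_replication) and a JDBC connection to the primary; the 30-second budget mirrors the max_replication_lag threshold above:

// Checks replica lag via pg_stat_replication on the primary.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class LagGate {
    private static final double MAX_LAG_SECONDS = 30.0;

    // Returns true if every attached replica is within the lag budget.
    // A NULL replay_lag (idle replication stream) is treated as fresh.
    static boolean replicasFresh(Connection primary) throws SQLException {
        String sql = "SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0) "
                   + "FROM pg_stat_replication";
        try (Statement st = primary.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getDouble(1) <= MAX_LAG_SECONDS;
        }
    }
}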
For teams preferring application-level control over transparent proxy routing, evaluating ORM Middleware for Automatic Query Routing provides a complementary approach to enforce routing at the code layer. During degraded states (e.g., proxy network partition), the application driver must implement exponential backoff with jitter, circuit breakers, and a direct-primary fallback path. Without these safeguards, connection pool starvation and cascading timeouts will rapidly degrade service availability.
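A sketch of the driver-side safeguard described above: exponential backoff with full jitter around connection acquisition, falling back to a direct-primary DataSource once retries are exhausted. The DataSource names and retry constants are illustrative.

// Exponential backoff with full jitter; bypasses the proxy when it stays
// unreachable rather than starving the connection pool.
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;
import javax.sql.DataSource;

public final class ResilientConnector {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 100;

    static Connection connect(DataSource proxy, DataSource directPrimary)
            throws SQLException, InterruptedException {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return proxy.getConnection();
            } catch (SQLException e) {
                // Full jitter: sleep a random duration up to the exponential cap.
                long cap = BASE_DELAY_MS << attempt; // 100ms, 200ms, 400ms
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
        // Degraded state: direct-primary fallback path.
        return directPrimary.getConnection();
    }
}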
4. Framework-Specific Implementation: Spring Data JPA
Java/Spring ecosystems typically implement proxy routing via AbstractRoutingDataSource. This class intercepts connection requests, evaluates a routing key, and delegates to the appropriate physical datasource (primary or replica). The routing key is usually derived from @Transactional(readOnly = true), ThreadLocal context, or custom annotations.
Synchronization with Spring’s TransactionManager is critical. The routing decision must occur before the physical connection is acquired, and the ThreadLocal context must be cleared in afterCompletion to prevent routing leaks across pooled connections. Multi-tenant architectures extend this pattern by combining tenant identifiers with read/write flags, requiring careful bean lifecycle management to avoid cross-tenant data exposure.
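Where the routing key comes from a ThreadLocal (e.g., for custom annotations rather than the readOnly flag), the afterCompletion cleanup can be registered as a TransactionSynchronization. A sketch with illustrative names:

// Illustrative ThreadLocal routing context with transaction-scoped cleanup.
import org.springframework.transaction.support.TransactionSynchronization;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public final class RoutingContext {
    private static final ThreadLocal<String> KEY = new ThreadLocal<>();

    public static void useReplica() {
        KEY.set("REPLICA");
        if (TransactionSynchronizationManager.isSynchronizationActive()) {
            TransactionSynchronizationManager.registerSynchronization(
                    new TransactionSynchronization() {
                        @Override
                        public void afterCompletion(int status) {
                            // Clear on completion so a pooled worker thread cannot
                            // leak the routing decision into the next request.
                            KEY.remove();
                        }
                    });
        }
    }

    public static String currentKey() {
        String key = KEY.get();
        return key != null ? key : "PRIMARY";
    }
}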
Developers deploying this pattern in Java ecosystems should reference How to implement read/write splitting in Spring Data JPA for production-ready datasource routing configurations.
# application.yml
spring:
  datasource:
    url: jdbc:postgresql://proxy-cluster.internal:6432/appdb
    username: ${DB_USER}
    password: ${DB_PASS}
    hikari:
      maximum-pool-size: 50
      minimum-idle: 10
      connection-timeout: 3000
      validation-timeout: 2000
      max-lifetime: 1800000
  jpa:
    properties:
      hibernate:
        jdbc:
          time_zone: UTC
        default_schema: public
// AbstractRoutingDataSource implementation snippet
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public class RoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        // True only inside @Transactional(readOnly = true) boundaries.
        boolean readOnly = TransactionSynchronizationManager.isCurrentTransactionReadOnly();
        return readOnly ? "REPLICA" : "PRIMARY";
    }
}
// Degraded-state handling: if the replica pool is exhausted, HikariCP throws
// SQLTransientConnectionException. Implement a fallback DataSource that routes
// to the primary endpoint when replica health checks fail.
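A wiring sketch for the router, assuming the two Hikari pools are defined as beans elsewhere (names are illustrative). Wrapping the router in LazyConnectionDataSourceProxy defers physical connection acquisition until first use, so the readOnly flag is already set when determineCurrentLookupKey() runs:

// Datasource wiring sketch; primaryPool and replicaPool are assumed beans.
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy;

@Configuration
public class RoutingDataSourceConfig {

    @Bean
    @Primary
    public DataSource dataSource(DataSource primaryPool, DataSource replicaPool) {
        RoutingDataSource router = new RoutingDataSource();
        Map<Object, Object> targets = Map.of("PRIMARY", primaryPool, "REPLICA", replicaPool);
        router.setTargetDataSources(targets);
        router.setDefaultTargetDataSource(primaryPool); // safe default
        router.afterPropertiesSet();
        // Defer connection acquisition so the routing key is evaluated
        // after @Transactional(readOnly = true) has taken effect.
        return new LazyConnectionDataSourceProxy(router);
    }
}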
5. Validation, Observability & Failure Modes
Proxy routing correctness cannot be assumed; it must be validated continuously. Testing strategies should include synthetic query routing verification, artificial replication lag injection (e.g., pausing WAL replay with pg_wal_replay_pause() on a PostgreSQL standby, or configuring a delayed replica in MySQL), and controlled failover drills. Metrics must be exported to Prometheus/Grafana: proxy_query_distribution_ratio, replica_lag_seconds, connection_churn_rate, and routing_error_count. Alert thresholds should be set conservatively: replica_lag_seconds > 15 triggers a warning; > 30 forces primary-only routing.
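Synthetic routing verification can be as simple as asking the read endpoint whether it landed on a standby. A sketch assuming PostgreSQL, where pg_is_in_recovery() returns true only on a replica (the endpoint matches the HAProxy frontend above; under the backup-to-primary fallback, a primary answer is expected only in degraded states):

// Probes the read endpoint and fails if a read was served by the primary.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class RoutingProbe {
    public static void main(String[] args) throws Exception {
        String readEndpoint = "jdbc:postgresql://proxy-cluster.internal:5433/appdb";
        try (Connection c = DriverManager.getConnection(readEndpoint,
                     System.getenv("DB_USER"), System.getenv("DB_PASS"));
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT pg_is_in_recovery()")) {
            rs.next();
            if (!rs.getBoolean(1)) {
                throw new IllegalStateException(
                        "Read endpoint routed to the primary; check proxy rules");
            }
        }
    }
}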
Common failure modes include split-brain routing (writes still reaching the demoted primary because proxy state or DNS lags behind the promotion), stale reads (replication lag exceeding application tolerance), and transaction boundary leaks (the proxy dropping session state during connection recycling). Distributed tracing (Jaeger/Zipkin) with database span tags is essential for correlating proxy routing decisions with application latency spikes.
Before deploying to production, engineers must validate routing logic and failover behavior using How to test read/write splitting in local Docker environments to catch misconfigurations early. In degraded states, automated runbooks should trigger: disable replica routing, scale connection pool minimums, and alert on-call engineers. Graceful degradation is non-negotiable; routing to the primary under load is preferable to serving stale data or dropping transactions.