Step-by-Step Guide to Setting Up Cross-AZ Read Replicas
Establish baseline architecture requirements, define RTO/RPO targets, and align on cross-AZ network topology. Review core replication concepts in Database Replication Fundamentals & Architecture to standardize terminology before provisioning. Cross-AZ replicas mitigate zonal failures but introduce asynchronous replication lag; architect for eventual consistency where applicable and enforce strict isolation boundaries for transactional workloads.
Pre-Deployment Architecture Validation
Audit the following before provisioning:
- VPC Routing: Confirm route tables provide direct, low-latency paths between the source and target availability zones.
- Security Groups: Validate that ingress/egress rules permit TCP 5432 (or the engine-specific port) in both directions without NAT traversal.
- Bandwidth Headroom: Confirm cross-AZ bandwidth limits will not throttle the initial base backup transfer.
- Primary Readiness: Verify WAL configuration (archive_mode=on), provisioned IOPS capacity (minimum 3x peak write throughput), and IAM roles for automated snapshotting.
- DNS: Lower TTLs to ≤60s on all database endpoints to enable rapid routing failover during zonal degradation.
- Clock Sync: Ensure NTP/Chrony synchronization across all AZs to prevent clock skew from breaking replication slot validation.
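A minimal pre-flight sketch of the network and clock checks above, assuming hypothetical hostnames, PostgreSQL's default port, and standard tooling (nc, dig, chronyc):
# Pre-flight checks: port reachability, DNS TTL, and clock sync.
# Hostnames are placeholders; substitute your actual endpoints.
PRIMARY_HOST=db-primary-az1.internal.example.com
REPLICA_HOST=db-replica-az2.internal.example.com
DB_PORT=5432
# 1. Confirm the replication port is reachable (run from each AZ to cover both directions).
nc -z -w 5 "$PRIMARY_HOST" "$DB_PORT" && echo "primary reachable"
nc -z -w 5 "$REPLICA_HOST" "$DB_PORT" && echo "replica reachable"
# 2. Verify DNS TTL is <=60s on the endpoints (TTL is the second field of the answer).
dig +noall +answer "$PRIMARY_HOST" | awk '{print "TTL:", $2}'
# 3. Check clock sync; chronyc reports offset from the NTP source.
chronyc tracking | grep -E 'System time|Leap status'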
Step 1: Provisioning the Cross-AZ Replica Instance
Deploy the replica using infrastructure-as-code or CLI. Match the primary’s instance class, storage type (e.g., gp3/io2), and KMS encryption keys to prevent decryption overhead and cross-AZ data transfer penalties. Apply explicit AZ placement tags for scheduler awareness and capacity reservation.
# AWS RDS CLI example
aws rds create-db-instance-read-replica \
  --db-instance-identifier app-db-replica-az2 \
  --source-db-instance-identifier app-db-primary-az1 \
  --availability-zone us-east-1b \
  --db-instance-class db.r6g.2xlarge \
  --storage-type gp3 \
  --allocated-storage 500 \
  --no-auto-minor-version-upgrade
Disable automated backups (backup_retention_period=0) on the replica during the initial sync to eliminate I/O contention and WAL shipping delays. Re-enable backups once the replica is in sync; note that backup retention is an instance-level setting changed via modify-db-instance, not a parameter group.
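A hedged sketch of the backup toggle; the 7-day retention value and the use of --apply-immediately are assumptions to adapt to your change-management policy:
# Disable automated backups on the replica for the initial sync.
aws rds modify-db-instance \
  --db-instance-identifier app-db-replica-az2 \
  --backup-retention-period 0 \
  --apply-immediately
# Once the replica is available and lag is stable, re-enable (7 days assumed).
aws rds wait db-instance-available --db-instance-identifier app-db-replica-az2
aws rds modify-db-instance \
  --db-instance-identifier app-db-replica-az2 \
  --backup-retention-period 7 \
  --apply-immediately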
Step 2: Configuring Replication & Connection Parameters
Tune engine-specific parameters to optimize async replication and connection handling. For PostgreSQL:
# postgresql.conf (Primary)
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
synchronous_commit = off # Async commit: no wait for local or standby flush; lowest write latency, but the most recent commits can be lost on a crash
# postgresql.conf (Replica)
hot_standby = on
hot_standby_feedback = on
max_standby_streaming_delay = 30s
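wal_level, max_wal_senders, and max_replication_slots take effect only after a postmaster restart. A quick verification pass, assuming self-managed PostgreSQL (on RDS, apply the equivalent changes through parameter groups instead):
# Restart the primary so the replication settings above take effect.
sudo systemctl restart postgresql
# Confirm the settings are live and whether any change still awaits a restart.
psql -h db-primary-az1.internal.example.com -U postgres -c \
  "SELECT name, setting, pending_restart
     FROM pg_settings
    WHERE name IN ('wal_level','max_wal_senders','max_replication_slots','synchronous_commit');"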
Configure the connection pooler to enforce strict timeouts and idle session eviction:
# pgbouncer.ini
# Close server connections idle longer than 5 minutes; recycle hourly.
server_idle_timeout = 300
server_lifetime = 3600
# Booleans are written as 0/1 in pgbouncer.ini; keepalive probes start after 30s idle.
tcp_keepalive = 1
tcp_keepidle = 30
tcp_keepintvl = 10
tcp_keepcnt = 3
These settings prevent stale socket retention during transient cross-AZ network blips and ensure rapid connection recycling under load.
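PgBouncer applies these settings on RELOAD without dropping live sessions. A quick check via the admin console; the port and admin user are assumptions from a default installation:
# Reload pgbouncer.ini and confirm the keepalive settings took effect.
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW CONFIG;" \
  | grep -E 'tcp_keep|server_idle_timeout|server_lifetime'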
Step 3: Implementing Connection Routing & Read/Write Splitting
Deploy a stateful proxy layer (PgBouncer, HAProxy, or cloud-native router) between application servers and database endpoints. Implement read/write splitting using DNS CNAMEs or application-level middleware (e.g., Spring AbstractRoutingDataSource, Rails connects_to, or Django DATABASE_ROUTERS). Route SELECT queries to the replica endpoint and INSERT/UPDATE/DELETE to the primary. Reference advanced routing strategies in Designing Multi-Region Read Replica Topologies for latency-aware query distribution, sticky sessions for transactional consistency, and graceful connection draining during maintenance windows.
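Routing implementations vary by stack, so as a framework-agnostic smoke test the sketch below writes through the writer endpoint and polls the reader endpoint until the row materializes, exercising both the split and the replication path. Endpoints, credentials, and the probe table are hypothetical:
# Smoke-test read/write splitting: DML to the writer endpoint, SELECT to the reader.
WRITER=app-db-primary.internal.example.com
READER=app-db-replica.internal.example.com
TOKEN=$(date +%s%N)
psql -h "$WRITER" -U app -d appdb -c \
  "CREATE TABLE IF NOT EXISTS routing_probe(token text, ts timestamptz DEFAULT now());
   INSERT INTO routing_probe(token) VALUES ('$TOKEN');"
# Poll the reader until the row replicates; iterations approximate replication lag.
for i in $(seq 1 50); do
  FOUND=$(psql -h "$READER" -U app -d appdb -tAc \
    "SELECT count(*) FROM routing_probe WHERE token = '$TOKEN';")
  [ "$FOUND" = "1" ] && { echo "replicated after ~$((i * 100))ms"; break; }
  sleep 0.1
done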
Step 4: Validation, Monitoring & Consistency Checks
Execute synthetic read/write workloads using pgbench or sysbench to verify routing behavior and failover thresholds. Monitor replication lag continuously:
-- Run on primary
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       extract(epoch from replay_lag) AS lag_seconds -- replay_lag is already an interval
FROM pg_stat_replication;
Validate read-after-write consistency against application transaction boundaries under both REPEATABLE READ and READ COMMITTED; isolation levels govern visibility within a single node, so replica-routed reads remain eventually consistent regardless of level. Configure alert routing for lag spikes exceeding 500ms and connection pool saturation >85%. Implement automated circuit breakers in the proxy layer to temporarily bypass replicas when lag thresholds are breached, as sketched below.
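A minimal lag-based breaker, assuming HAProxy fronts the replica with its runtime socket enabled and a backend named db/replica-az2 as elsewhere in this guide; a production breaker belongs in the proxy or application layer rather than a shell loop:
# Poll replication lag; eject or restore the replica in HAProxy at a 2s threshold.
THRESHOLD=2.0
HAPROXY_SOCK=/var/run/haproxy.sock   # assumes the stats socket is enabled
while sleep 5; do
  LAG=$(psql -h db-primary-az1.internal.example.com -U postgres -tAc \
    "SELECT coalesce(max(extract(epoch from replay_lag)), 0) FROM pg_stat_replication;")
  if (( $(echo "$LAG > $THRESHOLD" | bc -l) )); then
    echo "disable server db/replica-az2" | socat stdio "$HAPROXY_SOCK"
  else
    echo "enable server db/replica-az2" | socat stdio "$HAPROXY_SOCK"
  fi
done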
Troubleshooting & Runbook Execution
Structured diagnostic workflow for production incidents involving replica degradation, routing failures, or consistency violations.
Symptom Identification
Detect elevated replication lag (replay_lag > 5s), connection pool exhaustion (waiting_connections > 0), stale read anomalies (missing recently committed rows), or proxy routing loops (cascading 503s). Correlate symptoms with APM traces (DB spans showing retry storms), database error logs (postgresql.log), and cross-AZ network telemetry (VPC Flow Logs, CloudWatch Network In/Out).
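A first-pass triage bundle covering each symptom class; connection parameters and the log path are placeholders, and SHOW POOLS assumes PgBouncer's admin console:
# Replication health: state should be 'streaming'; 'catchup' indicates backlog.
psql -h db-primary-az1.internal.example.com -U postgres -c \
  "SELECT application_name, state, extract(epoch from replay_lag) AS lag_seconds
     FROM pg_stat_replication;"
# Pool saturation: cl_waiting > 0 means clients are queued for a server connection.
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
# Recent replication errors on the replica.
sudo grep -iE 'wal_receiver|terminating|canceling' /var/log/postgresql/postgresql.log | tail -20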
Root Cause Analysis
Isolate the failure vector:
- Network Partitioning: Check AZ-to-AZ latency and packet loss via ping/mtr or cloud network metrics.
- WAL Shipping Bottlenecks: Verify wal_keep_size (wal_keep_segments before PostgreSQL 13) and max_slot_wal_keep_size aren’t causing disk pressure on the primary.
- I/O Contention: Monitor replica iowait and disk utilization. Cross-AZ bandwidth throttling often manifests as sustained wal_receiver stalls.
- Routing Misconfiguration: Audit DNS caching delays (dig +trace) and application connection leak patterns (unreturned connections to the pool). A diagnostic sketch for these checks follows the list.
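A hedged sketch of those checks; hostnames are placeholders, and the slot query uses pg_wal_lsn_diff to compute retained WAL per slot:
# 1. AZ-to-AZ path quality: sustained loss or jitter implicates the network.
mtr --report --report-cycles 20 db-replica-az2.internal.example.com
# 2. WAL retained per slot on the primary; large values signal disk pressure ahead.
psql -h db-primary-az1.internal.example.com -U postgres -c \
  "SELECT slot_name, active,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
     FROM pg_replication_slots;"
# 3. Replica I/O pressure: high %iowait with saturated device utilization confirms contention.
iostat -x 5 3
# 4. DNS path and caching: confirm the endpoint resolves fresh with low TTLs.
dig +trace app-db-replica.internal.example.com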
Mitigation Strategies
Apply immediate containment:
- Dynamic Pool Resizing: Increase max_client_conn and reserve_pool_size in PgBouncer to absorb retry storms (see the sketch after this list).
- Commit Behavior Adjustment: Temporarily set synchronous_commit = local on the primary to stop waiting for cross-AZ flush acknowledgment while lag recovers (note: commits acknowledged under local are not guaranteed to survive a failover to the replica).
- Traffic Rerouting: Shift 100% of read traffic to the primary via DNS weight adjustment or proxy config reload (pgbouncer -R).
- Vertical Scale: Upgrade the replica instance class or storage IOPS if CPU/IO bottlenecks are confirmed.
- Circuit Breakers: Implement application-level fallbacks to bypass degraded replicas automatically when lag_seconds > 2.0.
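For the pool resize, a sketch that raises the limits in pgbouncer.ini and applies them via RELOAD without dropping sessions; the target values are assumptions:
# Raise connection headroom, then reload (live sessions are preserved).
sudo sed -i \
  -e 's/^max_client_conn *=.*/max_client_conn = 2000/' \
  -e 's/^reserve_pool_size *=.*/reserve_pool_size = 10/' \
  /etc/pgbouncer/pgbouncer.ini
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"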
Rollback Procedures
If data divergence or unrecoverable lag occurs, execute controlled rollback:
- Graceful Drain: Halt new connections to the replica via the proxy admin interface: PgBouncer's DISABLE command on the admin console (or pgbouncer -R for an online restart), or HAProxy disable server db/replica-az2.
- Disable Read Routing: Remove the replica endpoint's DNS CNAME or set its weight to 0.
- Promote/Isolate: If divergence is detected, promote the replica to standalone: pg_ctl promote -D /var/lib/postgresql/data or aws rds promote-read-replica.
- Restore Primary Config: Revert IaC state to baseline. Verify checksum integrity using pg_checksums or pg_verifybackup before re-establishing replication.
- Re-sync Topology: Drop the degraded replication slot (SELECT pg_drop_replication_slot('replica_slot');), recreate the replica from a fresh base backup, and re-enable standard async replication. Validate that the pg_stat_replication state transitions from catchup to streaming before restoring read routing (a re-sync sketch follows).
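A self-managed re-sync sketch using pg_basebackup; RDS handles this internally when you recreate the replica. Paths, the replication user, and the slot name follow the examples above and are assumptions:
# Rebuild the standby from a fresh base backup and re-create its replication slot.
PGDATA=/var/lib/postgresql/data
PRIMARY=db-primary-az1.internal.example.com
# 1. Stop the degraded standby and clear its data directory.
sudo systemctl stop postgresql
sudo rm -rf "${PGDATA:?}"/*
# 2. Stream a fresh base backup; -C/-S create and bind the slot, -R writes
#    standby.signal plus primary_conninfo for streaming replication.
sudo -u postgres pg_basebackup \
  -h "$PRIMARY" -U replicator \
  -D "$PGDATA" -X stream -P -R \
  -C -S replica_slot
# 3. Restart and confirm the walreceiver is streaming before restoring read traffic.
sudo systemctl start postgresql
psql -h "$PRIMARY" -U postgres -c \
  "SELECT application_name, state FROM pg_stat_replication;"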