Comparing PostgreSQL Streaming Replication vs MySQL GTID
1. Protocol Architecture & Baseline Configuration Validation
1.1 PostgreSQL WAL Streaming Mechanics (wal_level, max_wal_senders, hot_standby)
PostgreSQL replication operates on continuous Write-Ahead Log (WAL) streaming. The primary streams physical WAL segments to standby nodes, which apply them sequentially. Critical parameters dictate throughput and retention:
- wal_level = replica (or logical if logical decoding is required)
- max_wal_senders >= replica_count + 2 (reserve slots for failover/monitoring)
- hot_standby = on (enables read queries during recovery)
- wal_keep_size (modern replacement for wal_keep_segments; dictates disk retention for lagging replicas)
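A minimal sketch of applying these on the primary with ALTER SYSTEM (the sender count and retention size are illustrative; wal_level and max_wal_senders changes require a restart, and hot_standby only takes effect on standbys):

# Apply baseline streaming-replication settings (illustrative values)
psql -U postgres <<'SQL'
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;  -- replica_count + 2 headroom
ALTER SYSTEM SET hot_standby = on;      -- read-only effect applies on standbys
ALTER SYSTEM SET wal_keep_size = '2GB';
SQL
pg_ctl restart -D /var/lib/postgresql/data   # wal_level needs a restart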
1.2 MySQL GTID Enforcement (gtid_mode=ON, enforce_gtid_consistency, binlog_format=ROW)
MySQL Global Transaction Identifiers (GTIDs) decouple replication topology from physical log positions. Each transaction receives a unique server_uuid:transaction_id pair. Strict enforcement is mandatory for topology-safe failovers:
- gtid_mode = ON
- enforce_gtid_consistency = ON (blocks non-transactional statements and CREATE TABLE ... SELECT)
- binlog_format = ROW (strongly recommended; row events apply deterministically on replicas)
- auto_position = 1 in the replica's CHANGE MASTER TO / CHANGE REPLICATION SOURCE TO statement
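On a running MySQL 8.0 server, GTID mode must be walked up one step at a time after consistency enforcement is on; a minimal sketch, assuming SET PERSIST fits your change-management process (for new fleets, set the final values in my.cnf instead):

# Stepwise GTID enablement on a live server
mysql -u root -p <<'SQL'
SET PERSIST enforce_gtid_consistency = ON;
SET PERSIST gtid_mode = OFF_PERMISSIVE;
SET PERSIST gtid_mode = ON_PERMISSIVE;
-- wait for Ongoing_anonymous_transaction_count to reach 0 before the final step
SET PERSIST gtid_mode = ON;
SQL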
1.3 Step-by-Step Parameter Verification Checklist
Before diagnosing runtime degradation, validate baseline state alignment. Review foundational replication concepts in Database Replication Fundamentals & Architecture to ensure parameter parity across your fleet.
Actionable Steps:
- Verify wal_level >= replica and list active replication slots on the primary:
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
- Confirm gtid_mode=ON and enforce_gtid_consistency=ON across all nodes:
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'enforce_gtid_consistency';
- Capture live replication status on both engines:
# PostgreSQL
psql -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lag FROM pg_stat_replication;"
# MySQL
mysql -e "SHOW REPLICA STATUS\G" | grep -E "Seconds_Behind_Source|Retrieved_Gtid_Set|Executed_Gtid_Set"
2. Symptom Identification: Replication Lag & Routing Failures
2.1 Metric Thresholds: pg_stat_replication vs SHOW REPLICA STATUS
Application-level timeouts often precede visible database alerts. PostgreSQL exposes replay_lag (interval) and write_lag in pg_stat_replication. MySQL reports Seconds_Behind_Source (integer, can return NULL during I/O stalls). Establish strict SLA thresholds:
- Warning: replay_lag > 5s or Seconds_Behind_Source > 5s
- Critical: replay_lag > 30s or Seconds_Behind_Source > 30s
- Stall Indicator: replay_lag static for >60s, or Seconds_Behind_Source = NULL
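A lag probe that applies these exact thresholds might look like the sketch below (primary_host/replica_host are placeholders; feed the output into your alerting pipeline):

#!/usr/bin/env bash
# Classify lag against the 5s warning / 30s critical thresholds above
pg_lag=$(psql -h primary_host -At -c \
  "SELECT COALESCE(EXTRACT(EPOCH FROM max(replay_lag)), 0)::int FROM pg_stat_replication;")
my_lag=$(mysql -h replica_host -N -e "SHOW REPLICA STATUS\G" \
  | awk '/Seconds_Behind_Source/ {print $2}')

for lag in "$pg_lag" "$my_lag"; do
  if [[ -z "$lag" || "$lag" == "NULL" ]]; then echo "STALL: lag unreadable (possible I/O stall)"
  elif (( lag > 30 )); then echo "CRITICAL: ${lag}s behind"
  elif (( lag > 5 ));  then echo "WARNING: ${lag}s behind"
  else echo "OK: ${lag}s behind"
  fi
done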
2.2 Read-After-Write Inconsistency Patterns
When routing layers (ProxySQL, PgBouncer, HAProxy) fail to track replication state, clients experience stale reads immediately following writes. Symptoms include:
- Application retries returning 404 or missing rows after INSERT/UPDATE
- Session affinity bypass causing split-brain reads
- FATAL: too many connections (PostgreSQL) when connection pools exhaust retries against lagging endpoints
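For PostgreSQL, a stale-read window can be confirmed directly by checking whether the replica has replayed past the primary's write position at the moment of the write; a sketch with placeholder hostnames:

# Capture the primary's WAL position right after the write...
target_lsn=$(psql -h primary_host -At -c "SELECT pg_current_wal_lsn();")
# ...then ask the replica if it has replayed past it (t = safe to read)
psql -h replica_host -At -c \
  "SELECT pg_wal_lsn_diff('$target_lsn', pg_last_wal_replay_lsn()) <= 0;"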
2.3 Connection Router Circuit Breaker Triggers
Cross-region latency compounds routing failures. Topology-aware proxies must implement lag-based health checks. When thresholds are breached, routers should trigger OFFLINE_SOFT or READ_ONLY states rather than dropping traffic. For multi-region deployments, align routing fallback matrices with Designing Multi-Region Read Replica Topologies to prevent cascading primary overload.
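ProxySQL implements this natively: setting max_replication_lag on a server entry shuns it once monitored lag exceeds the threshold and re-admits it automatically on recovery (admin credentials and hostgroup id below are illustrative):

# ProxySQL admin interface (default port 6032)
mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<'SQL'
UPDATE mysql_servers SET max_replication_lag = 5 WHERE hostgroup_id = 20;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SQL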
Actionable Steps:
- Monitor replay_lag and Seconds_Behind_Source against the 5s/30s SLA thresholds via Prometheus/Grafana or Datadog.
- Inspect connection router pool state for shunned or saturated endpoints:
# ProxySQL (replace READ_HOSTGROUP with your reader hostgroup id)
SELECT * FROM stats_mysql_connection_pool WHERE hostgroup_id IN (READ_HOSTGROUP);
# PgBouncer
tail -f /var/log/pgbouncer/pgbouncer.log | grep "server connection"
- Correlate stale-read reports with the originating transaction_id (MySQL) or xid (PostgreSQL) in database logs.
3. Root Cause Analysis: WAL Retention Limits vs GTID Gap Detection
3.1 PostgreSQL: WAL Archiving Exhaustion & Inactive Slots
Orphaned replication slots (active = false) prevent WAL cleanup, causing pg_wal directory growth until disk exhaustion. Conversely, once needed segments are gone the primary cannot stream them: this occurs when a replica without a slot falls further behind than wal_keep_size covers, or when max_slot_wal_keep_size invalidates a lagging slot, forcing a full replica re-sync.
- Diagnostic: SELECT slot_name, restart_lsn, active FROM pg_replication_slots;
- Failure Mode: FATAL: requested WAL segment ... has already been removed
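To see how much WAL each slot is pinning before it becomes a disk-exhaustion incident, compare restart_lsn against the current write position:

# Bytes of WAL retained per slot; a large value on an inactive slot
# is the cleanup blocker described above
psql -c "SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;"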
3.2 MySQL: Binary Log Purge Conflicts & GTID Set Divergence
Premature binary log purging (binlog_expire_logs_seconds, or the deprecated expire_logs_days) breaks auto_position=1 recovery if the replica has not yet retrieved the purged GTID range: the primary can no longer serve the missing transactions, and the replica's I/O thread halts with a GTID set mismatch.
- Diagnostic: Compare gtid_executed vs gtid_purged:
SELECT @@GLOBAL.gtid_executed;
SELECT @@GLOBAL.gtid_purged;
- Failure Mode: Error_code: 1236 (GTID set mismatch or missing binlog file)
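GTID_SUBTRACT makes the gap explicit. Run it on the primary with the replica's Executed_Gtid_Set pasted in (the UUID range below is illustrative); if any of the returned transactions also appear in the primary's gtid_purged, auto-positioned recovery is impossible and a re-clone is required:

# Transactions the primary has executed but the replica has not
mysql -N -e "SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed,
  '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5');"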
3.3 Network I/O Bottleneck & Disk Subsystem Diagnostics
Differentiate transient network partitions from persistent configuration drift. On PostgreSQL, growing write_lag/flush_lag in pg_stat_replication suggests network saturation; on MySQL, a Retrieved_Gtid_Set that stops advancing or an I/O thread stuck in Connecting to source points the same way. By contrast, Replica_SQL_Running: No with a growing relay log indicates disk I/O pressure or constraint violations on apply.
Actionable Steps:
- Audit pg_replication_slots for inactive (active = false) slots; drop if safe: SELECT pg_drop_replication_slot('slot_name');
- Compare gtid_executed vs gtid_purged sets for missing transaction ranges. If the primary's gtid_purged contains ranges the replica has not executed, the replica requires re-sync.
- Run iostat -xz 1 and mtr --report --report-cycles 5 <primary_ip> to isolate disk vs network bottlenecks. Target await < 10ms and packet loss < 0.1%.
4. Mitigation Runbook: Re-Syncing & Connection Pool Re-Routing
4.1 Step 1: Isolate Primary & Drain Read Pools
Prevent write routing to degraded replicas and drain active connections gracefully.
# ProxySQL admin interface (replace READ_HOSTGROUP and replica_ip)
UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=READ_HOSTGROUP AND hostname='replica_ip';
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
# PgBouncer admin console (e.g. psql -p 6432 -U pgbouncer pgbouncer)
# Pause database to drain active queries; RESUME dbname; re-enables it
PAUSE dbname;
4.2 Step 2: Execute Fast Re-Sync (pg_rewind vs MySQL Clone Plugin)
Avoid full logical dumps. Use timeline-aware physical sync tools.
PostgreSQL (pg_rewind):
# Stop replica, ensure primary is accessible
# (pg_rewind needs wal_log_hints=on or data checksums enabled on the cluster)
pg_ctl stop -D /var/lib/postgresql/data
pg_rewind --target-pgdata=/var/lib/postgresql/data --source-server="host=primary port=5432 user=replicator dbname=postgres"
MySQL (Clone Plugin):
-- On the replica (the clone plugin must also be active on the donor)
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'primary_ip:3306';
CLONE INSTANCE FROM 'repl_user'@'primary_ip':3306 IDENTIFIED BY 'secure_password';
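CLONE INSTANCE restarts the recipient automatically, but replication must still be re-pointed at the primary afterward; a sketch using auto-positioning (credentials are the same placeholders as above):

# On the freshly cloned replica, after its automatic restart
mysql -u root -p <<'SQL'
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary_ip',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = 'secure_password',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;
SQL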
4.3 Step 3: Validate Alignment & Re-Enable Routing
Verify replication catch-up before returning traffic.
- PostgreSQL: SELECT replay_lag FROM pg_stat_replication WHERE client_addr = 'replica_ip'; (Target: < 1s)
- MySQL: SHOW REPLICA STATUS\G (Target: Seconds_Behind_Source = 0, Replica_IO_Running: Yes, Replica_SQL_Running: Yes)
4.4 Step 4: Implement Read-Preference Overrides
Temporarily enforce read_preference=primary for critical write-heavy paths until replica consistency is confirmed. Update routing health checks to aggressive polling:
- ping_interval=2
- max_lag_ms=5000
- Reset PgBouncer server_reset_query to DISCARD ALL; to prevent stale connection reuse.
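In ProxySQL, such an override can be expressed as a temporary query rule that pins reads to the writer hostgroup (rule id and hostgroup ids are illustrative); delete the rule once replica consistency is confirmed:

# Route SELECTs to the writer hostgroup (10) until replicas are verified
mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<'SQL'
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (99, 1, '^SELECT', 10, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SQL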
Actionable Steps:
- Run pg_rewind or CLONE INSTANCE FROM as documented above.
- Tighten router health checks to ping_interval=2 and max_lag_ms=5000.
- Confirm byte-level and GTID alignment before re-enabling routing:
-- PostgreSQL
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication;
-- MySQL
SELECT @@global.gtid_executed;
5. Rollback Procedures & Consistency Verification
5.1 Point-in-Time Recovery (PITR) Fallback for PostgreSQL
If pg_rewind fails or data corruption is detected, branch to a recovery timeline.
- Configure postgresql.conf on the replica:
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_time = '2024-06-15 14:30:00 UTC'
recovery_target_action = 'promote'
- Create recovery.signal (not standby.signal, which would hold the node in standby instead of promoting it) and restart: pg_ctl start -D /var/lib/postgresql/data
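End to end, the fallback looks like the sketch below; it assumes $PGDATA has already been restored from a base backup taken before the recovery target (paths and timestamp are illustrative):

#!/usr/bin/env bash
# PITR fallback: point recovery at the archive, set a target, promote
PGDATA=/var/lib/postgresql/data

cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_time = '2024-06-15 14:30:00 UTC'
recovery_target_action = 'promote'
EOF

touch "$PGDATA/recovery.signal"   # triggers targeted recovery on startup
pg_ctl start -D "$PGDATA"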
5.2 Binary Log Replay & GTID Injection for MySQL
Manually bridge GTID gaps when binlogs are partially available.
-- Stop replica SQL thread
STOP REPLICA SQL_THREAD;
-- Inject missing GTID range
SET GTID_NEXT='server_uuid:transaction_id';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
-- Resume auto-positioning
START REPLICA SQL_THREAD;
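After the SQL thread resumes, confirm the injected transaction actually closed the gap; GTID_SUBSET returns 1 once the previously missing range (illustrative UUID below) is contained in the replica's executed set:

# 1 = the replica's executed set now covers the injected range
mysql -N -e "SELECT GTID_SUBSET('3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5',
  @@GLOBAL.gtid_executed);"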
5.3 Data Checksum Validation & Routing State Reset
Before restoring full production traffic, verify structural and row-level integrity.
- PostgreSQL: pg_checksums --check -D /var/lib/postgresql/data (requires data checksums enabled at initdb and a cleanly stopped cluster)
- MySQL: mysqlcheck --check --all-databases -u root -p
Reset connection router configurations to baseline ONLINE states. Monitor for a 15-minute stability window before decommissioning temporary routing overrides.
Actionable Steps:
- Apply postgresql.conf recovery targets and restart the replica.
- Verify gtid_executed continuity.
- Run pg_checksums --check or mysqlcheck --check across critical schemas.
- Confirm there are no Error_code: 1236 or FATAL replication errors in logs before closing the incident.