Comparing PostgreSQL Streaming Replication and MySQL GTID Replication

1. Protocol Architecture & Baseline Configuration Validation

1.1 PostgreSQL WAL Streaming Mechanics (wal_level, max_wal_senders, hot_standby)

PostgreSQL replication operates on continuous Write-Ahead Log (WAL) streaming. The primary streams physical WAL records to standby nodes, which apply them sequentially. Critical parameters dictate throughput and retention (a baseline sketch follows the list):

  • wal_level = replica (or logical if logical decoding is required)
  • max_wal_senders >= replica_count + 2 (reserve headroom for failover and monitoring connections)
  • hot_standby = on (enables read queries during recovery)
  • wal_keep_size (modern replacement for wal_keep_segments; dictates disk retention for lagging replicas)
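
A minimal primary-side sketch; the sender count and retention size below are illustrative for a three-replica fleet, not prescriptive:

# postgresql.conf (primary)
wal_level = replica        # 'logical' only if logical decoding is required
max_wal_senders = 5        # 3 replicas + 2 reserved for failover/monitoring
hot_standby = on           # applies on standbys; allows reads during recovery
wal_keep_size = 2GB        # WAL retained for lagging replicas without slots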

1.2 MySQL GTID Enforcement (gtid_mode=ON, enforce_gtid_consistency, binlog_format=ROW)

MySQL Global Transaction Identifiers (GTIDs) decouple replication topology from physical log positions. Each transaction receives a unique server_uuid:transaction_id pair. Strict enforcement is mandatory for topology-safe failovers (example settings follow the list):

  • gtid_mode = ON
  • enforce_gtid_consistency = ON (blocks non-transactional statements and CREATE TABLE ... SELECT)
  • binlog_format = ROW (required for deterministic GTID application)
  • MASTER_AUTO_POSITION = 1 (SOURCE_AUTO_POSITION = 1 from MySQL 8.0.23) on replica CHANGE MASTER TO / CHANGE REPLICATION SOURCE TO statements
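
A hedged baseline sketch, assuming MySQL 8.0.23+ for the newer statement syntax; host, user, and password values are placeholders:

# my.cnf (all nodes; a unique server_id and log_bin are prerequisites)
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_format = ROW

-- On the replica (8.0.23+ syntax)
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary_ip',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = 'secure_password',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;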

1.3 Step-by-Step Parameter Verification Checklist

Before diagnosing runtime degradation, validate baseline state alignment. Review foundational replication concepts in Database Replication Fundamentals & Architecture to ensure parameter parity across your fleet.

Actionable Steps:

  • Verify wal_level >= replica and active replication slots on the primary:
SHOW wal_level;
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
  • Confirm gtid_mode=ON and enforce_gtid_consistency=ON across all nodes:
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'enforce_gtid_consistency';
  • Inspect live replication status on both engines:
# PostgreSQL
psql -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lag FROM pg_stat_replication;"

# MySQL
mysql -e "SHOW REPLICA STATUS\G" | grep -E "Seconds_Behind_Source|Retrieved_Gtid_Set|Executed_Gtid_Set"

2. Symptom Identification: Replication Lag & Routing Failures

2.1 Metric Thresholds: pg_stat_replication vs SHOW REPLICA STATUS

Application-level timeouts often precede visible database alerts. PostgreSQL exposes replay_lag (interval) and write_lag in pg_stat_replication. MySQL reports Seconds_Behind_Source (integer; NULL when the replication threads stall or stop, so never treat NULL as zero). Establish strict SLA thresholds (a normalization query follows the list):

  • Warning: replay_lag > 5s or Seconds_Behind_Source > 5s
  • Critical: replay_lag > 30s or Seconds_Behind_Source > 30s
  • Stall Indicator: replay_lag static for >60s, or Seconds_Behind_Source = NULL
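
A minimal normalization sketch for scraping both engines into one numeric series:

-- PostgreSQL: replay_lag is an interval; convert to seconds for threshold checks
SELECT application_name, EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication;

-- MySQL: read Seconds_Behind_Source; treat NULL as a stall, never as zero lag
SHOW REPLICA STATUS\G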

2.2 Read-After-Write Inconsistency Patterns

When routing layers (ProxySQL, PgBouncer, HAProxy) fail to track replication state, clients experience stale reads immediately following writes. Symptoms include (an LSN-fencing sketch follows the list):

  • Application retries returning 404 or missing rows after INSERT/UPDATE
  • Session affinity bypass causing split-brain reads
  • FATAL: too many connections (PostgreSQL) when connection pools exhaust retries against lagging endpoints
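
One mitigation is LSN fencing: capture the primary's flush position after the write commits, then gate replica reads on replay progress. A minimal sketch; the literal LSN is illustrative:

-- On the primary, immediately after COMMIT
SELECT pg_current_wal_lsn();
-- On the replica, before serving the read (true = safe to serve)
SELECT pg_last_wal_replay_lsn() >= '0/5D3F2A10'::pg_lsn AS caught_up;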

2.3 Connection Router Circuit Breaker Triggers

Cross-region latency compounds routing failures. Topology-aware proxies must implement lag-based health checks. When thresholds are breached, routers should trigger OFFLINE_SOFT or READ_ONLY states rather than dropping traffic. For multi-region deployments, align routing fallback matrices with Designing Multi-Region Read Replica Topologies to prevent cascading primary overload.
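
In ProxySQL, that policy can be expressed as per-server lag shunning; a sketch, with READ_HOSTGROUP standing in for the actual hostgroup id:

-- ProxySQL admin: automatically shun replicas lagging more than 30s
UPDATE mysql_servers SET max_replication_lag = 30 WHERE hostgroup_id = READ_HOSTGROUP;
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;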

Actionable Steps:

  • Track replay_lag and Seconds_Behind_Source against the 5s/30s warning/critical thresholds from section 2.1 via Prometheus/Grafana or Datadog.
  • Inspect router-level connection state:
# ProxySQL
SELECT * FROM stats_mysql_connection_pool WHERE hostgroup IN (READ_HOSTGROUP);
# PgBouncer
tail -f /var/log/pgbouncer/pgbouncer.log | grep "server connection"
  • Correlate stale-read reports with the transaction_id or xid recorded in database logs.

3. Root Cause Analysis: WAL Retention Limits vs GTID Gap Detection

3.1 PostgreSQL: WAL Archiving Exhaustion & Inactive Slots

Orphaned replication slots (active = false) prevent WAL cleanup, causing pg_wal directory growth until disk exhaustion. When wal_keep_size is exceeded and slots are inactive, the primary cannot stream required segments, forcing replica re-sync; a slot-retention query follows the diagnostics below.

  • Diagnostic: SELECT slot_name, restart_lsn, active FROM pg_replication_slots;
  • Failure Mode: FATAL: requested WAL segment ... has already been removed
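
To quantify how much WAL each slot is pinning before disk pressure becomes acute, built-in LSN arithmetic suffices (a sketch):

-- Bytes of WAL retained per slot; large values on inactive slots signal trouble
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;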

3.2 MySQL: Binary Log Purge Conflicts & GTID Set Divergence

Premature binary log purging (expire_logs_days, superseded by binlog_expire_logs_seconds in MySQL 8.0) breaks auto_position=1 recovery if the replica hasn't consumed the purged GTID range. The replica then aborts with a GTID set mismatch between its Executed_Gtid_Set and the primary's available binlog history.

  • Diagnostic: Compare gtid_executed vs gtid_purged (a GTID_SUBTRACT sketch follows):
SHOW VARIABLES LIKE 'gtid_executed';
SHOW VARIABLES LIKE 'gtid_purged';
  • Failure Mode: Error_code: 1236 (GTID set mismatch or missing binlog file)
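
GTID set arithmetic makes the gap explicit. A sketch to run on the replica, substituting the primary's gtid_purged value for the placeholder literal:

-- Non-empty result = transactions purged on the primary that this replica never executed
SELECT GTID_SUBTRACT('<primary_gtid_purged_set>', @@GLOBAL.gtid_executed) AS unrecoverable_gtids;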

3.3 Network I/O Bottleneck & Disk Subsystem Diagnostics

Differentiate transient network partitions from persistent configuration drift. High wal_receiver latency, or MySQL lag that keeps growing while Slave_IO_State reports Waiting for master to send event, indicates network saturation; Slave_SQL_Running: No points to an apply-side failure such as disk I/O pressure or constraint violations.

Actionable Steps:

  • Query pg_replication_slots for inactive (active = false) slots; drop them if safe: SELECT pg_drop_replication_slot('slot_name');
  • Compare the primary's gtid_purged set against the replica's gtid_executed for missing transaction ranges. If the primary has purged transactions the replica never executed, the replica requires re-sync.
  • Run iostat -xz 1 and mtr --report --report-cycles 5 <primary_ip> to isolate disk vs network bottlenecks. Target await < 10ms and packet loss < 0.1%.

4. Mitigation Runbook: Re-Syncing & Connection Pool Re-Routing

4.1 Step 1: Isolate Primary & Drain Read Pools

Prevent write routing to degraded replicas and drain active connections gracefully.

# ProxySQL
UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=READ_HOSTGROUP AND hostname='replica_ip';
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

# PgBouncer (issue from the admin console, e.g. psql -p 6432 pgbouncer)
# Pause database to drain active queries
PAUSE dbname;

4.2 Step 2: Execute Fast Re-Sync (pg_rewind vs MySQL Clone Plugin)

Avoid full logical dumps. Use timeline-aware physical sync tools.

PostgreSQL (pg_rewind):

# Stop replica; the primary must remain accessible
pg_ctl stop -D /var/lib/postgresql/data
# Note: pg_rewind requires wal_log_hints=on or data checksums enabled on the target cluster
pg_rewind --target-pgdata=/var/lib/postgresql/data --source-server="host=primary port=5432 user=replicator dbname=postgres"

MySQL (Clone Plugin):

-- On replica (recipient); the donor must also have the clone plugin installed
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'primary_ip:3306';  -- remote clones fail without this
CLONE INSTANCE FROM 'repl_user'@'primary_ip':3306 IDENTIFIED BY 'secure_password';
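
The recipient restarts itself when the clone completes, but replication must still be re-pointed; a sketch reusing the placeholder credentials above:

-- On the cloned replica; the clone carries gtid_executed, so auto-positioning resumes cleanly
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary_ip',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = 'secure_password',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;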

4.3 Step 3: Validate Alignment & Re-Enable Routing

Verify replication catch-up before returning traffic.

  • PostgreSQL: SELECT replay_lag FROM pg_stat_replication WHERE client_addr = 'replica_ip'; (Target: < 1s)
  • MySQL: SHOW REPLICA STATUS\G (Target: Seconds_Behind_Source = 0, Replica_IO_Running: Yes, Replica_SQL_Running: Yes)

4.4 Step 4: Implement Read-Preference Overrides

Temporarily enforce read_preference=primary for critical write-heavy paths until replica consistency is confirmed. Update routing health checks to aggressive polling (a ProxySQL sketch follows the list):

  • ping_interval=2
  • max_lag_ms=5000
  • Set PgBouncer server_reset_query = DISCARD ALL to prevent stale session state from being reused.
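
A hedged ProxySQL rendering of the aggressive polling above (monitor intervals are in milliseconds; equivalent knobs vary by router):

-- ProxySQL admin: poll backend health every 2s
UPDATE global_variables SET variable_value = '2000' WHERE variable_name = 'mysql-monitor_ping_interval';
LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;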

Actionable Steps:

  • Re-sync with pg_rewind or CLONE INSTANCE FROM as documented above.
  • Tighten router health checks to ping_interval=2 and max_lag_ms=5000.
  • Confirm catch-up with engine-native lag checks:
-- PostgreSQL
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication;
-- MySQL
SELECT @@global.gtid_executed;

5. Rollback Procedures & Consistency Verification

5.1 Point-in-Time Recovery (PITR) Fallback for PostgreSQL

If pg_rewind fails or data corruption is detected, branch to a recovery timeline.

  1. Configure postgresql.conf on the replica:
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_time = '2024-06-15 14:30:00 UTC'
recovery_target_action = 'promote'
  2. Create recovery.signal (not standby.signal, which resumes streaming instead of targeted recovery) and restart: pg_ctl start -D /var/lib/postgresql/data

5.2 Binary Log Replay & GTID Injection for MySQL

Manually bridge GTID gaps when binlogs are partially available.

-- Stop replica SQL thread
STOP REPLICA SQL_THREAD;
-- Inject an empty transaction for each missing GTID (repeat per transaction)
SET GTID_NEXT='server_uuid:transaction_id';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
-- Resume auto-positioning
START REPLICA SQL_THREAD;

5.3 Data Checksum Validation & Routing State Reset

Before restoring full production traffic, verify structural and row-level integrity.

  • PostgreSQL: pg_checksums --check -D /var/lib/postgresql/data (requires data_checksums enabled at initdb; run only while the cluster is cleanly shut down)
  • MySQL: mysqlcheck --check --all-databases -u root -p

Reset connection router configurations to baseline ONLINE states. Monitor for a 15-minute stability window before decommissioning temporary routing overrides.
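
A minimal ProxySQL reset mirroring the drain in 4.1, using the same placeholders:

-- Return the verified replica to rotation
UPDATE mysql_servers SET status='ONLINE' WHERE hostgroup_id=READ_HOSTGROUP AND hostname='replica_ip';
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;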

Actionable Steps:

  • Apply postgresql.conf recovery targets and restart the replica.
  • Verify gtid_executed continuity after any manual GTID injection.
  • Run pg_checksums --check or mysqlcheck --check across critical schemas.
  • Confirm no recurring Error_code: 1236 or FATAL replication errors in the logs before closing the incident.