It starts with a support ticket. A user updated their profile but the change isn’t showing. They refresh — old data is back. Your application reads from a MySQL replica that’s lagging behind the primary, returning stale data.
You SSH into the replica:
SHOW REPLICA STATUS\G
You find the line that matters:
Seconds_Behind_Master: 347
Nearly 6 minutes behind. Every read from this server returns data that’s 6 minutes old. For a profile update that’s annoying. For e-commerce inventory, a customer is buying a product that sold out 5 minutes ago.
How Replication Works
Three components:
Binlog on the primary. Every write (INSERT, UPDATE, DELETE, DDL) is recorded in the binary log.
IO thread on the replica. Connects to the primary, reads binlog events, writes them to the local relay log.
SQL thread on the replica. Reads relay log events and executes them. This is where data changes on the replica.
Lag happens at two points: IO thread can’t fetch fast enough (network), or SQL thread can’t replay fast enough (most common).
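The arithmetic of lag is blunt: if the primary commits events faster than the SQL thread replays them, the relay-log backlog grows every second. A toy sketch with made-up rates:

```python
# Toy model of replica lag: when the primary produces binlog events faster
# than the single SQL thread can replay them, the backlog grows linearly.
def backlog_after(seconds, write_rate=5000, apply_rate=3000):
    """Events queued in the relay log after `seconds` of sustained load.

    write_rate and apply_rate are hypothetical events/second figures.
    """
    return max(0, (write_rate - apply_rate) * seconds)

# After 5 minutes at these (hypothetical) rates:
print(backlog_after(300))  # 600000 events waiting for the SQL thread
```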
Step 1: Read SHOW REPLICA STATUS
SHOW REPLICA STATUS\G
On MySQL before 8.0.22, use SHOW SLAVE STATUS\G.
Critical fields (on 8.0.22+ the output uses Source in place of Master, e.g. Seconds_Behind_Source, Read_Source_Log_Pos):
Replica_IO_Running — must be Yes. If No, check network, credentials, and whether the primary’s binlog was purged.
Replica_SQL_Running — must be Yes. If No, check Last_SQL_Error.
Seconds_Behind_Master — the lag. Approximate.
Read_Master_Log_Pos — how far IO thread has read.
Exec_Master_Log_Pos — how far SQL thread has executed. Gap = queued work.
Relay_Log_Space — unprocessed data size. Large and growing = SQL thread can’t keep up.
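The queued-work gap is just subtraction on two of those fields. A minimal sketch that parses the key/value lines of `SHOW REPLICA STATUS\G` output (the sample numbers are hypothetical):

```python
# Parse the "Key: Value" lines of SHOW REPLICA STATUS\G output and compute
# the gap between what the IO thread has fetched (Read_Master_Log_Pos) and
# what the SQL thread has executed (Exec_Master_Log_Pos).
def parse_status(text):
    status = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status

raw = """\
     Replica_IO_Running: Yes
    Replica_SQL_Running: Yes
    Read_Master_Log_Pos: 905432100
    Exec_Master_Log_Pos: 901210400
  Seconds_Behind_Master: 347
"""
s = parse_status(raw)
queued = int(s["Read_Master_Log_Pos"]) - int(s["Exec_Master_Log_Pos"])
print(queued)  # 4221700 bytes of relay log waiting to be replayed
```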
Step 2: IO Thread or SQL Thread?
Check the primary’s position:
-- On primary:
SHOW MASTER STATUS\G
If the replica’s Read_Master_Log_Pos is close to the primary’s Position (and the log file names match), the IO thread is keeping up. The SQL thread is the bottleneck.
If Read_Master_Log_Pos is far behind: network bandwidth, slow primary storage, or SSL overhead on the IO thread.
In practice, the SQL thread is the bottleneck far more often than the IO thread. Let’s fix it.
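The triage above can be sketched as a function. The positions and the slack threshold are hypothetical, and this assumes both replica positions refer to the same binlog file as the primary’s:

```python
# Rough triage: compare the primary's binlog position with what the replica
# has read (IO thread) and executed (SQL thread). The slack threshold is an
# arbitrary illustration, not a MySQL constant.
def bottleneck(primary_pos, read_pos, exec_pos, slack=1_000_000):
    if primary_pos - read_pos > slack:
        return "io_thread"   # fetch is behind: network, SSL, primary storage
    if read_pos - exec_pos > slack:
        return "sql_thread"  # replay is behind: the common case
    return "caught_up"

print(bottleneck(905500000, 905432100, 901210400))  # sql_thread
```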
Step 3: Why the SQL Thread Is Slow
Cause 1: Single-threaded replay.
By default, MySQL replays events on one thread. The primary processes thousands of concurrent transactions across 32 cores, but the replica replays them one at a time.
SHOW VARIABLES LIKE 'replica_parallel_workers';
If 0 or 1, you’re single-threaded. (Before 8.0.26 the variable is named slave_parallel_workers.)
Fix — enable parallel replication:
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA;
In my.cnf:
[mysqld]
replica_parallel_workers = 4
replica_parallel_type = LOGICAL_CLOCK
replica_preserve_commit_order = ON
Start with 4 workers. Going above 8-16 rarely helps.
Cause 2: Missing indexes on the replica.
With row-based replication, the replica locates each changed row by primary key. No primary key (and no unique index to fall back on) means a table scan per row change. Multiply by thousands of changes per second and the SQL thread grinds.
Check what SQL thread is doing:
SHOW PROCESSLIST;
If you see slow UPDATE or DELETE from the system user, check that table’s indexes:
SHOW CREATE TABLE the_table;
Add missing primary keys.
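A quick way to spot the problem from the DDL itself, given the text that SHOW CREATE TABLE returns. This is a string check for illustration; a real audit would query information_schema. The table definition is a made-up example:

```python
# Given the DDL string from SHOW CREATE TABLE, does the table declare a
# PRIMARY KEY? A secondary KEY is not enough for row-based replay.
def has_primary_key(create_table_sql):
    return "PRIMARY KEY" in create_table_sql.upper()

ddl = """CREATE TABLE events (
  user_id int NOT NULL,
  payload text,
  KEY idx_user (user_id)
) ENGINE=InnoDB"""
print(has_primary_key(ddl))  # False: row-based replay will scan this table
```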
Cause 3: Large transactions.
A single transaction updating 500,000 rows generates an enormous run of binlog row events that the replica must replay as one transaction, blocking everything queued behind it.
Fix on the application side — batch operations:
-- Instead of:
UPDATE orders SET status = 'archived' WHERE created_at < '2024-01-01';
-- Do:
UPDATE orders SET status = 'archived'
WHERE created_at < '2024-01-01' AND status != 'archived'
LIMIT 1000;
Loop until no rows affected. Each batch commits separately.
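The loop looks like this in application code. A minimal sketch: `execute` stands in for your DB client’s statement call and is assumed to return the affected-row count (e.g. `cursor.rowcount`); `commit` closes each batch’s transaction.

```python
# Batched archive loop: run the LIMITed UPDATE repeatedly, committing after
# each batch, until a batch touches zero rows.
BATCH_SQL = """
UPDATE orders SET status = 'archived'
WHERE created_at < '2024-01-01' AND status != 'archived'
LIMIT 1000
"""

def archive_in_batches(execute, commit):
    """execute(sql) -> affected row count; commit() ends the transaction."""
    total = 0
    while True:
        affected = execute(BATCH_SQL)
        commit()  # each batch is its own small transaction in the binlog
        total += affected
        if affected == 0:
            break
    return total
```

Each iteration produces a small, independently replayable transaction instead of one monster event group.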
Cause 4: DDL operations.
ALTER TABLE on a large table blocks the SQL thread for its entire duration; lag during DDL is expected and clears once the statement finishes on the replica. For very large tables, an online schema change tool such as pt-online-schema-change avoids the stall.
Step 4: Monitoring That Works
Seconds_Behind_Master lies. It is computed from the timestamp of the relay-log event the SQL thread is currently executing, so when the IO thread stalls and the relay log drains, it reads 0 even though the replica is far behind the primary.
Use pt-heartbeat from Percona Toolkit:
On primary:
pt-heartbeat --update --database heartbeat --create-table
On replica:
pt-heartbeat --monitor --database heartbeat
Writes timestamps every second. True lag, no estimation.
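The heartbeat idea in miniature: the primary writes its own clock into a replicated row, and the replica subtracts that timestamp from its local clock. One caveat the sketch makes explicit: this is only accurate if both servers’ clocks are NTP-synced. The timestamps below are illustrative:

```python
from datetime import datetime, timezone

# Heartbeat lag = replica's clock minus the timestamp the primary wrote
# into the replicated heartbeat row. Requires synced clocks (NTP).
def lag_seconds(heartbeat_ts, now):
    return (now - heartbeat_ts).total_seconds()

wrote = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)   # row on primary
read  = datetime(2024, 6, 1, 12, 5, 47, tzinfo=timezone.utc)  # replica clock
print(lag_seconds(wrote, read))  # 347.0
```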
Alert thresholds:
- < 1s — acceptable
- 1-10s — watch
- 10-60s — investigate
- > 60s — critical
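The thresholds above, written as a function you could drop into an alerting hook (the tier names are just the labels from the list):

```python
# Map a measured lag (seconds) to the alert tiers listed above.
def lag_severity(seconds):
    if seconds < 1:
        return "ok"
    if seconds <= 10:
        return "watch"
    if seconds <= 60:
        return "investigate"
    return "critical"

print(lag_severity(347))  # critical
```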
The Diagnostic Sequence
- SHOW REPLICA STATUS — threads running? What’s the error?
- Compare positions — IO thread keeping up?
- SHOW PROCESSLIST — what’s the SQL thread doing?
- Check parallel replication — workers > 1?
- Check table indexes — primary keys present?
- Check for DDL — ALTER TABLE in progress?
Most lag resolves with parallel replication and proper indexing. The single-threaded SQL thread is by far the most common cause, and it’s the easiest fix: one config change, which only became the default (replica_parallel_workers = 4) in MySQL 8.0.27.