Postgres replica conflicts: Part 1
Why do we get replica conflicts and the reason.
We were occasionally seeing the below errors in our app logs when some queries run on the postgres replica.
“Canceling statement due to conflict with recovery”
There are many reasons why conflicts can occur on a hot standby(replica) as described here.
Also from across the web.
A replication conflict occurs whenever the recovery process cannot apply WAL information from the primary server to the standby, because the change would disrupt query processing there. These conflicts cannot happen with queries on the primary server, but they happen on the streaming replication standby server because the primary server has limited knowledge about what is going on on the standby.
Which Conflict ?
Now in our case why is the conflict happening ? You check the stats by running the below query on your replica
select *from pg_stat_database_conflicts;datid | datname | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock— — — -+ — — — — — -+ — — — — — — — — — + — — — — — — + — — — — — — — — + — — — — — — — — -+ — — — — — — — —16404 | myservice | 0 | 0 | 3 | 0 | 016403 | rdsadmin | 0 | 0 | 0 | 0 | 016401 | postgres | 0 | 0 | 0 | 0 | 014372 | template0 | 0 | 0 | 0 | 0 | 016402 | template1 | 0 | 0 | 0 | 0 | 0
So count 3 indicates the number of occurrences of conflicts of type confl_snapshot.
As per the docs, this means
Number of queries in this database that have been canceled due to old snapshots
From docs, the above error occurs when
Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still “see” any of the rows to be removed.
If the replica is applying a wal received from the primary and that wal’s application on replica includes removing dead rows, and if at the same time a query(or queries) running on the replica has visibility of the same dead rows, then a snapshot conflict occurs.
To over simplify(yeah overly hence might not be accurate), in postgres when a transaction starts it takes a snapshot (assume s1) and which is nothing but the state of the database when the transaction began.So assume at s1, the db had 100 rows.This is on the replica.
So while t1 was still in progress, a transaction t2 on master deleted 5 rows and these changes are streamed in the wal file to the replica.So now as t1 is running the replica tries to apply the wal changes but as t1’s snapshot s1 was aware of those 5 rows which are being deleted now, a conflict arises as the replica assumes t1 might be needing those deleted rows.
Note, the transaction t1 in this case doesn’t even need thos deleted rows, but that does not matter.Irrespective of the data being needed by the query, the query on the replica is cancelled due to the conflict.Its more about the snapshot time than the actual data being accessed by the query.
Continued in part 2 …