You are here

Replication Troubleshooting - Classic VS GTID

In previous posts, I was talking about how to set up MySQL replication, Classic Replication (based on binary logs information) and Transaction-based Replication (based on GTID). In this article I'll summarize how to troubleshoot MySQL replication for the most common issues we might face with a simple comparison how can we get them solved in the different replication methods (Classic VS GTID).

There are two main operations we might need to do in a replication setup:

  • Skip or ignore a statement that causes the replication to stop.
  • Re-initialize a slave when the Replication is broke and could not be started anymore.

Skip or Ignore statement


Basically, the slave should be always synchronized with its master having the same copy of data, but for some reasons there might be inconsistency between both of them (unsafe statement in SBR, Slave is not read_only and was modified apart of replication queries, .. etc) which causes errors and stops the replication, e.g. if the master inserted a record which was already inserted on the slave (Duplicate entry) or updated/deleted a row which was not exist on the slave, ... etc.

To solve this issue, we have to either reverse what we have done on the slave (e.g. delete the inserted rows) if that was made by mistake and is known or we can skip executing those statements on the slave and continue the replication again (I'll focus on skipping a statement in this post as it needs different interaction in Classic and GTID replication).

Sample error messages (from SHOW SLAVE STATUS output):
Last_SQL_Error: Could not execute Write_rows event on table test.t1; Duplicate entry '4' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000304, end_log_pos 285
Last_SQL_Error: Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000304, end_log_pos 492
Last_SQL_Error: Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000304, end_log_pos 688

How to solve that issue ?


CLASSIC REPLICATION


Solving this problem is a straight forward process in the classic replication setup, what only we need is to issue the following SQL commands on the slave's:

SQL> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; SQL> START SLAVE;

GTID REPLICATION


Solving this problem is not a straight forward in GTID replication like it is in the Classic replication and the variable SQL_SLAVE_SKIP_COUNTER wont be useful in this area anymore.

To get this problem solved in a GTID replication we will need to inject an empty transaction as follows:

  • Check which transaction is causing the problem:
    SQL> SHOW SLAVE STATUS\G . . Retrieved_Gtid_Set: b9b4712a-df64-11e3-b391-60672090eb04:1-7 Executed_Gtid_Set: 4f6d62ed-df65-11e3-b395-60672090eb04:1, b9b4712a-df64-11e3-b391-60672090eb04:1-6 Auto_Position: 1

    Retrieved_Gtid_Set means the retrieved GTIDs from the master

    Executed_Gtid_Set means the executed GTIDs on the slave.

    According to the above output, the slave retrieved GTIDs from 1:7 (b9b4712a-df64-11e3-b391-60672090eb04:1-7) and executed only from 1:6 (b9b4712a-df64-11e3-b391-60672090eb04:1-6), so the problem is in transaction number 7.

  • Inject an empty transaction:
    SQL> SET GTID_NEXT='b9b4712a-df64-11e3-b391-60672090eb04:7'; SQL> BEGIN;COMMIT; SQL> SET GTID_NEXT='AUTOMATIC'; SQL> START SLAVE;

    BE CAUTIOUS: The first part of Executed_Gtid_Set (4f6d62ed-df65-11e3-b395-60672090eb04:1) is the local executed GTIDs (not received from the master) while the second part (b9b4712a-df64-11e3-b391-60672090eb04:1-6) is the executed GTIDs which retrieved from the master (check the master's UUID by either checking the UUID value in "Retrieved_Gtid_Set" which is basically for the master's UUID or by issuing SHOW GLOBAL VARIABLES LIKE 'server_uuid'; on the master server). So we should make sure that we are using the master's UUID when injecting an empty transaction, otherwise, the problem will still remain and the slave wont be started.

Note:

After starting the slave successfully in either classic or GTID replication we might need to use a combination of Percona tools pt-table-checksum and pt-table-sync to fix the inconsistency problem.

Re-initialize/ re-build a slave


For many reasons, we might end up with only re-build a slave to get the replication working, e.g. if we stopped a slave for a while where the master purged the binary log file that is needed by that slave or there are many duplicate entry errors so that pt-table-checksum and pt-table-sync could not be used then we have to re-initialize the slave from the beginning by having a fresh backup from the master server and restore it on the slave. Lets check how can we do that in both replication methods.

How to solve that issue ?


CLASSIC REPLICATION


Sample error message:

Last_IO_Errno: 1236 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

Fix steps:

  • Backup the master server by the following command:
    shell> mysqldump -u root -p --all-databases --flush-privileges --single-transaction --master-data=2 --flush-logs --triggers --routines --events --hex-blob >/path/to/backupdir/full_backup-$TIMESTAMP.sql
  • Restore the backup file on the slave:
    shell> mysql -u root -p < /path/to/backupdir/full_backup-$TIMESTAMP.sql
  • Get the binary logs information when the backup was taken:
    shell> head -n 50 /path/to/backupdir/full_backup-$TIMESTAMP.sql|grep "CHANGE MASTER TO" CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=120;
  • Issue the "CHANGE MASTER TO" command using the new information:
    SQL> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=120;
  • Start the slave:
    SQL> START SLAVE;

NOTE:

Xtrabackup tool could be used instead of mysqldump,especially, if the database size is big. Check out this link for more information.

GTID REPLICATION


Sample error message:

Last_IO_Errno: 1236 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

Fix steps:

  • Backup the master server by the following command:
    shell> mysqldump -u root -p --all-databases --flush-privileges --single-transaction --flush-logs --triggers --routines --events --hex-blob >/path/to/backupdir/full_backup-$TIMESTAMP.sql
  • Check the GTID value when the backup was taken:
    shell> head -n 50 /path/to/backupdir/full_backup-$TIMESTAMP.sql|grep PURGED SET @@GLOBAL.GTID_PURGED='b9b4712a-df64-11e3-b391-60672090eb04:1-8';
  • Reset the GTID_EXECUTED and GTID_PURGED values on the slave:
    SQL> RESET MASTER;
  • Restore the backup file on the slave:
    shell> mysql -u root -p < /path/to/backupdir/full_backup-$TIMESTAMP.sql
  • Make sure that the values of GTID_EXEUCTED and GTID_PURGED are the correct ones:
    SQL> SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; +---------------+------------------------------------------+ | Variable_name | Value | +---------------+------------------------------------------+ | gtid_executed | b9b4712a-df64-11e3-b391-60672090eb04:1-8 | +---------------+------------------------------------------+ 1 row in set (0.00 sec) SHOW GLOBAL VARIABLES LIKE 'gtid_purged'; +---------------+------------------------------------------+ | Variable_name | Value | +---------------+------------------------------------------+ | gtid_purged | b9b4712a-df64-11e3-b391-60672090eb04:1-8 | +---------------+------------------------------------------+ 1 row in set (0.01 sec)
  • Start the slave:
    SQL> START SLAVE;

NOTES:

  • If we didn't reset the GTID_EXECUTED and GTID_PURGED values on the slave before restoring the backup file, the following error will be appeared:
    shell> mysql -u root -p < /path/to/backupdir/full_backup-$TIMESTAMP.sql. ERROR 1840 (HY000): @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.

    The above error indicates that the statement at the beginning of the backup file - which is "SET @@GLOBAL.GTID_PURGED='b9b4712a-df64-11e3-b391-60672090eb04:1-8';" - failed because GTID_PURGED cannot be set unless GTID_EXECUTED is empty. Since GTID_EXECUTED is a read only variable, the only way to empty its value is to issue "RESET MASTER" on the slave server before restoring the backup file.

  • Xtrabackup tool could be used as well instead of mysqldump to get this problem solved and without the need to reset GTID_EXECUTED and GTID_PURGED values . Check out this link for more information.

Conclusion


While GTID provides many benefits over the classic replication but it has different troubleshooting and fix strategies which must be known first before deploying GTID in production systems.

Comments

Inserting an empty transaction can be dangerous. Another server might request that transaction from that server and then it gets an empty transaction with the same GTID as the original one. So this is only safe for slaves which will never become a master or if all possible slaves already have processed this gtid.
dveedencomment

If you used the combination of Percona tools (pt-table-checksum and pt-table-sync) - as I mentioned - to get the data synchronized after injecting an empty transaction and starting the slave you won't face that problem. Otherwise, yes this could be dangerous if that slave promoted to be master AND the other slaves didn't replicate that transaction yet or when you take a database backup from that slave as well.

abdel-mawlacomment