You are here

Logging Galera Cluster conflicts

We typically suggest our customers to use our MySQL/Galera Cluster my.cnf configuration template to avoid MySQL configuration and performance problems.

And we are paranoid as well. Thus we enable all useful logging:

wsrep_log_conflicts = 1

But this has also some consequences of more visibility...

If you monitor carefully your Galera Cluster for example with the FromDual Performance Monitor for MySQL and MariaDB, you might probably see some strange values increasing from time to time:

mysql< SHOW GLOBAL STATUS LIKE 'wsrep_local_%r_s';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cert_failures | 42    |
| wsrep_local_bf_aborts     | 13    |
+---------------------------+-------+

Those values are indicators that some transactions (Galera write sets) did to not succeed and were aborted by Galera. In this case the paranoid logging helps to find, what exactly was aborted and possibly helps to find out, if this can or should be fixed:

150410  1:44:18 [Note] WSREP: cluster conflict due to certification failure for threads:
150410  1:44:18 [Note] WSREP: Victim thread:
   THD: 151856, mode: local, state: executing, conflict: cert failure, seqno: 30399304
   SQL: UPDATE login SET lTsexpire = UNIX_TIMESTAMP(NOW()) + lTimeout WHERE lSessionId = 'va3ta7besku82k56ncv3bnhlj5'

*** Priority TRANSACTION:
TRANSACTION 464359568, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s)
MySQL thread id 4, OS thread handle 0x7f1c0916c700, query id 8190690 Update_rows_log_event::find_row(30399302)

*** Victim TRANSACTION:
TRANSACTION 464359562, ACTIVE 0 sec
mysql tables in use 1, locked 1
2 lock struct(s), heap size 360, 1 row lock(s), undo log entries 1
MySQL thread id 151856, OS thread handle 0x7f1c09091700, query id 8190614 172.20.100.11 sam_angiz query end
UPDATE login SET lTsexpire = UNIX_TIMESTAMP(now()) + lTimeout WHERE lSessionId = 'va3ta7besku82k56ncv3bnhlj5'
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 835205 page no 3 n bits 72 index `PRIMARY` of table `fromdual`.`login` trx table locks 1 total table locks 2  trx id 464359562 lock_mode X locks rec but not gap lock hold time 0 wait time before grant 0
150410  1:44:18 [Note] WSREP: cluster conflict due to high priority abort for threads:
150410  1:44:18 [Note] WSREP: Winning thread:
   THD: 4, mode: applier, state: executing, conflict: no conflict, seqno: 30399302
   SQL: (null)
150410  1:44:18 [Note] WSREP: Victim thread:
   THD: 151856, mode: local, state: committing, conflict: no conflict, seqno: -1
   SQL: UPDATE login SET lTsexpire = UNIX_TIMESTAMP(now()) + lTimeout WHERE lSessionId = 'va3ta7besku82k56ncv3bnhlj5'

In the above Galera conflict 2 login transactions where running at the same time. They both come with the same Session ID and want to update the expiry timestamp. Now how to solve or fix this:

  • First check, if this table has a Primary Key (tables without a PK causes full table scans which can last for long time, increasing the chance for conflicts).
  • Second check, if there is a (UNIQUE?) index on lSessionId. A missing index leads to full table scans which increases the chance for conflicts.
  • Third check WHY 2 logins from the same Session ID can arrive at the same time (within 1 second) on 2 different Galera nodes (Ajax requests, etc...). Try to avoid such situations.

Comments

We finally got the results. The table does NOT have a Primary Key and/or an index on lSessionId:

CREATE TABLE IF NOT EXISTS `genLogins` (
  `lSessionId` varchar(128) NOT NULL,
  `lUid` int(8) NOT NULL DEFAULT '0',
  `lLang` char(2) NOT NULL,
  `lTimeout` int(10) NOT NULL DEFAULT '0',
  `lReloadsess` enum('y','n') NOT NULL DEFAULT 'n',
  `lTsexpire` int(10) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Please also read: Disadvantages of explicitly NOT using InnoDB Primary Keys and Limitations of MySQL.

The alter_engine.php script of our MySQL BasEnv (MyEnv) can help you finding such tables...

Shinguzcomment

Trx # 86153878218 exists in both Galera Cluster conflics which are 46 seconds apart (= long running transaction?). Table `rt3`.`Tickets` seems to be involved in both cases. One should investigate also in this query or transaction we cannot see here...

MySQL thread id 6147, OS thread handle 0x7f5af77fe700, query id 5394536 192.168.1.42 node1
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 67799 page no 2924419 n bits 136 index `PRIMARY` of table `rt3`.`Tickets` trx id 86153787673 lock_mode X locks rec but not gap
2020-06-09 08:27:24 9737 [Note] WSREP: cluster conflict due to high priority abort for threads:
2020-06-09 08:27:24 9737 [Note] WSREP: Winning thread:
   THD: 26, mode: applier, state: executing, conflict: no conflict, seqno: 1297074501
   SQL: (null)
2020-06-09 08:27:24 9737 [Note] WSREP: Victim thread:
   THD: 6147, mode: local, state: executing, conflict: no conflict, seqno: -1
   SQL: SELECT main.* FROM Attributes main  WHERE (main.ObjectId = 3137700935) AND (main.ObjectType = 'RT::Transaction')  ORDER BY main.id ASC
 
*** Priority TRANSACTION:
TRANSACTION 86153878502, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
MySQL thread id 50, OS thread handle 0x7f5ae820a700, query id 5510090 System lock
 
*** Victim TRANSACTION:
TRANSACTION 86153878218, ACTIVE 1 sec
, undo log entries 21

----

MySQL thread id 6152, OS thread handle 0x7f5b0415d700, query id 5510091 192.168.1.42 node2 update
INSERT INTO Transactions (OldValue, NewReference, OldReference, NewValue, Creator, ObjectId, Field, Type, ReferenceType, Created, Data, ObjectType) VALUES (NULL, '868352903', NULL, NULL, '108846305', '234300347', '5392', 'CustomField', 'RT::ObjectCustomFieldValue', '2020-06-09 08:28:10', NULL, 'RT::Ticket')
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 67799 page no 387918 n bits 136 index `PRIMARY` of table `rt3`.`Tickets` trx id 86153878218 lock_mode X locks rec but not gap
2020-06-09 08:28:10 9737 [Note] WSREP: cluster conflict due to high priority abort for threads:
2020-06-09 08:28:10 9737 [Note] WSREP: Winning thread:
   THD: 50, mode: applier, state: executing, conflict: no conflict, seqno: 1297075194
   SQL: (null)
2020-06-09 08:28:10 9737 [Note] WSREP: Victim thread:
   THD: 6152, mode: local, state: executing, conflict: no conflict, seqno: -1
   SQL: INSERT INTO Transactions (OldValue, NewReference, OldReference, NewValue, Creator, ObjectId, Field, Type, ReferenceType, Created, Data, ObjectType) VALUES (NULL, '868352903', NULL, NULL, '108846305', '234300347', '5392', 'CustomField', 'RT::ObjectCustomFieldValue', '2020-06-09 08:28:10', NULL, 'RT::Ticket')

Shinguzcomment