
My thoughts about MySQL (Cluster) replication

At Johan's request, I am writing down my concerns about MySQL (Cluster) replication. These are issues I run into again and again with customers:

  • SQL nodes still lose their connection to the cluster too easily after a data node or management node restart (which leads to gaps, see the next point). Automatic failover or reconnection is just a dream (it works in maybe 90% of the cases at best).
  • Gaps: Whenever I do a cluster reconfiguration (which should not be necessary too often, see losing the connection above) or a mysqld restart, I get a gap (these are the planned ones; see also automatic channel failover below). Then there are the unplanned ones...
    I cannot understand why we are not able to keep at least a certain amount of traffic in a binlog-injector buffer to avoid the majority of these gaps. This could be something like a round-robin buffer which stores all the cluster events that should be sent to the binlog injector thread. If we configure this buffer to 1 GByte, we can keep close to 2 minutes of traffic (at 10 MByte/s), which should be enough to cover hiccups and quick restarts of mysqld. In environments with much less traffic the same buffer could hold hours of events, which should make us look practically unbreakable.
    Together with automatic channel failover this would be much more robust.
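To make the idea concrete, here is a toy Python model of such a binlog-injector ring buffer. The class and method names are invented for illustration; this is a sketch of the mechanism, not actual mysqld code:

```python
from collections import deque

class BinlogInjectorBuffer:
    """Toy ring buffer: keep the most recent cluster events (up to
    max_bytes) so a restarted binlog injector can replay them instead
    of producing a gap."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.events = deque()  # (epoch, payload) pairs

    def append(self, epoch, payload):
        self.events.append((epoch, payload))
        self.used += len(payload)
        # Drop the oldest events once the buffer is full (round-robin).
        while self.used > self.max_bytes:
            _, old = self.events.popleft()
            self.used -= len(old)

    def replay_since(self, last_applied_epoch):
        """Return all buffered events newer than the injector's last
        applied epoch, or None if they were already overwritten
        (then the gap is unavoidable)."""
        if self.events and self.events[0][0] > last_applied_epoch + 1:
            return None
        return [e for e in self.events if e[0] > last_applied_epoch]

# At 10 MByte/s, a 1 GByte buffer holds roughly 100 seconds of traffic:
seconds_buffered = (1 * 1024**3) / (10 * 1024**2)
```

A mysqld that reconnects within that window would call replay_since() with its last applied epoch and continue without a gap.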
  • Automatic channel failover: It should not be too difficult to switch automatically from one channel to another (or a third) one. If the issues above are solved, it also becomes less likely that a channel failover is necessary at all.
    A gap event should trigger a channel failover plus a STOP SLAVE SQL_THREAD, instead of just stopping replication with an error. It should also be possible to switch back automatically to the original slave if that slave is still talking to its master after the gap, to see whether it could continue working...
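The proposed behaviour could look roughly like this toy state machine (all names are invented for illustration; a real implementation would live in the replication code, not in Python):

```python
class ChannelManager:
    """Toy model of automatic channel failover: on a gap event, stop
    the SQL thread and switch to the next replication channel instead
    of just stopping with an error."""

    def __init__(self, channels):
        self.channels = channels  # e.g. ["channel_1", "channel_2"]
        self.active = 0
        self.log = []

    def on_gap_event(self):
        failed = self.channels[self.active]
        self.log.append(f"STOP SLAVE SQL_THREAD on {failed}")
        self.active = (self.active + 1) % len(self.channels)
        self.log.append(f"START SLAVE on {self.channels[self.active]}")

    def try_switch_back(self, original_still_replicating):
        # Switch back to the original channel if its slave is still
        # talking to its master after the gap.
        if original_still_replicating and self.active != 0:
            self.active = 0
            self.log.append(f"switched back to {self.channels[0]}")
```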
  • Atomic operations: They do not necessarily have to be durable, but they must be atomic. We have this problem on the master side (as described in issue 27201) and also on the slave side. If we guarantee atomicity, we should not lose too much performance (should we?) but we can avoid a lot of trouble in replication: the relay log is not atomic with the slave at all. If I crash a slave under heavy load, it gets stuck in trouble (in about 25% of the cases). This is reproducible, and a bug has been filed for it.
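The "atomic but not necessarily durable" property is essentially the classic write-to-temp-then-rename pattern. A minimal Python sketch (the idea of a plain position file is my own illustration, not how the relay log is actually stored):

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace a file's content atomically: readers see either the old
    or the new content, never a torn write. Atomic but not durable --
    an fsync would be needed for durability, which (as argued above)
    is not strictly required here."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp, path)  # POSIX rename within a filesystem is atomic
    except Exception:
        os.unlink(tmp)
        raise
```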
  • The binary-log mechanism is a nightmare for every DB operator: It is very easy to make errors in MySQL replication and restore/recovery (including Cluster PITR). We should make the mysql restore/recovery mechanism aware of the last applied transaction (a global transaction ID?), and by default it should refuse to apply binary logs with the wrong file or position (some other databases cannot even be forced to do it wrong!).
    The Cluster recovery/PITR procedure is furthermore not documented and should also follow these rules/mechanisms. It would also be nice if, after ndb_restore -m -r -e is done, it asked for the binary log to apply next (at the right position!).
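The safety check I have in mind could be as simple as this sketch (the (file, position) tuple representation and the function name are my own invention):

```python
def check_binlog_continuity(last_applied, next_binlog, force=False):
    """Refuse to apply a binary log that does not start where the
    restore/recovery left off. Both arguments are (file, position)
    tuples describing the last applied transaction and the start of
    the binlog about to be applied."""
    if next_binlog == last_applied:
        return True
    if force:
        return True  # today's behaviour: you can always do it wrong
    raise ValueError(
        f"refusing to apply binlog starting at {next_binlog}; "
        f"last applied transaction was at {last_applied}"
    )
```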
  • Rolling restart: From time to time a rolling restart is necessary. This has to be done manually for each node (some customers have written scripts for this), and even worse, you also have to switch tools to do it.
    My suggestion: a) implement a ROLLING RESTART command, and b) also allow restarting SQL nodes from the mgmd (mgmd and mysqld talk to each other anyway (see the binlog injector thread), so it should not be too difficult to send a SIGHUP from the mgmd to the mysqld).
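What a ROLLING RESTART command would have to do internally is not complicated. A sketch of the sequencing logic, with the two callbacks standing in for the actual management-API calls:

```python
def rolling_restart(node_ids, restart_node, wait_until_started):
    """Restart the nodes one at a time, waiting for each node to come
    back before touching the next, so the cluster stays available
    throughout. The callbacks are placeholders for management-API
    calls (e.g. a node restart and a status poll)."""
    for node_id in node_ids:
        restart_node(node_id)        # tell the node to restart
        wait_until_started(node_id)  # block until it reports "started"
```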
  • Monitoring: MySQL Cluster monitoring without the MGM API is a nightmare too, and even with the MGM API it is poor! What I would like to have is a standardized interface to my system metrics (namely SQL!), with all the important information (performance metrics, buffer fill levels, etc.) in some cluster tables. Basically all the stuff which is written to the cluster log with ALL REPORT MemoryUsage, ALL REPORT BackupStatus and ALL CLUSTERLOG STATISTICS=15 (or maybe all events?), so I can access it easily.
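Today one ends up scraping the cluster log instead. A sketch of what that looks like, assuming a MemoryUsage line of the form "Node 2: Data usage is 5%(177 32K pages of total 3200)" (the exact format may vary between versions):

```python
import re

# Rough shape of a MemoryUsage line in the cluster log; treat the
# pattern as an assumption, not a stable interface.
MEMORY_LINE = re.compile(
    r"Node (?P<node>\d+): (?P<kind>\w+) usage is (?P<pct>\d+)%"
    r"\((?P<used>\d+) 32K pages of total (?P<total>\d+)\)"
)

def parse_memory_report(line):
    """Turn one cluster-log memory line into a metrics dict -- i.e.
    the kind of row one would like to SELECT from a cluster table
    instead of scraping it from a log file."""
    m = MEMORY_LINE.match(line)
    if not m:
        return None
    return {k: int(v) if v.isdigit() else v
            for k, v in m.groupdict().items()}
```

This is exactly the kind of fragile glue code that SQL-accessible cluster tables would make unnecessary.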
  • Just a little (maybe?) bug I ran into with recent MySQL Cluster releases: a gap now requires more than one (at least 2) SET GLOBAL sql_slave_skip_counter=1 statements to get the slave running again... This makes replication less predictable for automated processes.

All these points endanger my day-to-day production operations and can cause errors in enterprise and carrier-grade environments.
