Uptime of a MariaDB Galera Cluster

A while ago somebody on Google Groups asked for the Uptime of a Galera Cluster. The answer is easy... Wait, no! Not so easy... The uptime of a Galera Node is easy (or not?). But Uptime of the whole Galera Cluster?

My answer then was: "Grep the error log." My answer now is still: "Grep the error log." But slightly different:

$ grep 'view(view_id' *
2019-03-07 16:10:26 [Note] WSREP: view(view_id(PRIM,0e0a2851,1) memb {
2019-03-07 16:14:37 [Note] WSREP: view(view_id(PRIM,0e0a2851,2) memb {
2019-03-07 16:16:23 [Note] WSREP: view(view_id(PRIM,0e0a2851,3) memb {
2019-03-07 16:55:56 [Note] WSREP: view(view_id(NON_PRIM,0e0a2851,3) memb {
2019-03-07 16:56:04 [Note] WSREP: view(view_id(PRIM,6d80bb1a,5) memb {
2019-03-07 17:00:28 [Note] WSREP: view(view_id(NON_PRIM,6d80bb1a,5) memb {
2019-03-07 17:01:11 [Note] WSREP: view(view_id(PRIM,24f67954,7) memb {
2019-03-07 17:18:58 [Note] WSREP: view(view_id(NON_PRIM,24f67954,7) memb {
2019-03-07 17:19:31 [Note] WSREP: view(view_id(PRIM,a380c8cb,9) memb {
2019-03-07 17:20:27 [Note] WSREP: view(view_id(PRIM,a380c8cb,11) memb {
2019-03-08  7:58:38 [Note] WSREP: view(view_id(PRIM,753a350f,15) memb {
2019-03-08 11:31:38 [Note] WSREP: view(view_id(NON_PRIM,753a350f,15) memb {
2019-03-08 11:31:43 [Note] WSREP: view(view_id(PRIM,489e3c67,17) memb {
2019-03-08 11:31:58 [Note] WSREP: view(view_id(PRIM,489e3c67,18) memb {
...
2019-03-22  7:05:53 [Note] WSREP: view(view_id(NON_PRIM,49dc20da,49) memb {
2019-03-22  7:05:53 [Note] WSREP: view(view_id(PRIM,49dc20da,50) memb {
2019-03-26 12:14:05 [Note] WSREP: view(view_id(NON_PRIM,49dc20da,50) memb {
2019-03-27  7:33:25 [Note] WSREP: view(view_id(NON_PRIM,22ae25aa,1) memb {

So this Cluster had an Uptime of about 18 days and 20 hours. Why can I see this? Simple: the last number inside the parentheses of each view_id line is a counter. This number seems to be the same as wsrep_cluster_conf_id, which is reset by a full Galera Cluster shutdown. In the last line of the output it is back at 1.
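The same reading of the log can be sketched in a few lines of Python. This is only a rough sketch: the names parse_views and cluster_runs are made up for this example, and it assumes that the counter only ever drops after a full Cluster shutdown, which is my interpretation and not something Galera guarantees. It also only sees what this one Node logged.

```python
import re
from datetime import datetime

# Matches lines like:
# 2019-03-08  7:58:38 [Note] WSREP: view(view_id(PRIM,753a350f,15) memb {
VIEW_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})\s+(\d{1,2}):(\d{2}):(\d{2}) .*"
    r"view\(view_id\((PRIM|NON_PRIM),([0-9a-f]+),(\d+)\)"
)


def parse_views(lines):
    """Yield (timestamp, state, counter) for every view_id line."""
    for line in lines:
        m = VIEW_RE.search(line)
        if m is None:
            continue  # not a view line (for example the "..." placeholder)
        ts = datetime(*(int(g) for g in m.groups()[:6]))
        yield ts, m.group(7), int(m.group(9))


def cluster_runs(views):
    """Split the views into runs; a new run starts when the counter drops.

    Assumption: a dropping counter means the whole Cluster was restarted.
    Returns a list of (first_view, last_view) timestamps, one per run.
    """
    runs = []
    prev_counter = None
    for ts, _state, counter in views:
        if prev_counter is None or counter < prev_counter:
            runs.append([ts, ts])   # counter went back: start a new run
        else:
            runs[-1][1] = ts        # same run: extend it to this view
        prev_counter = counter
    return [(start, end) for start, end in runs]
```

Called as cluster_runs(parse_views(open('error.log'))), where error.log is just a placeholder file name, the first run for the output above goes from 2019-03-07 16:10:26 to 2019-03-26 12:14:05. That is 18 days, 20:03:39.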

So far so good. But, wait, what is the definition of Uptime? Hmmm, not so helpful, how should I interpret this for a 3-Node Galera Cluster?

I would say a good definition for Uptime of a Galera Cluster would be: "At least one Galera Node must be available for the application for reading and writing." That means PRIM in the output above. But the output above comes from one Node only, so we still cannot say if there was at least one Galera Node available (reading and writing) at any time. For this we have to compare ALL 3 MariaDB Error Logs... So grepping alone does not help. We need a good Monitoring solution to answer this question...
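Comparing the Error Logs of all Nodes boils down to merging time intervals. Here is a hedged sketch of that idea in Python, with function names I invented for this example (prim_intervals, merge_intervals, gaps). It assumes you already have, per Node, a list of (timestamp, state) pairs taken from the view_id lines, and that a Node counts as available from a PRIM line until its next NON_PRIM line. That is a simplification: a Node which crashes without logging a NON_PRIM view, or which is PRIM but not yet synced, is not handled.

```python
from datetime import datetime


def prim_intervals(views, log_end):
    """Turn one Node's (timestamp, state) views into PRIM intervals.

    log_end closes an interval that is still open at the end of the log.
    """
    intervals = []
    start = None
    for ts, state in views:
        if state == "PRIM" and start is None:
            start = ts                      # Node entered PRIM
        elif state == "NON_PRIM" and start is not None:
            intervals.append((start, ts))   # Node left PRIM
            start = None
    if start is not None:
        intervals.append((start, log_end))
    return intervals


def merge_intervals(intervals):
    """Union of the PRIM intervals of all Nodes, sorted by start time."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gaps(merged):
    """Periods during which no Node at all was in PRIM."""
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(merged, merged[1:])
    ]
```

If gaps() comes back empty for the merged intervals of all 3 Nodes, then by the definition above the Cluster was up for the whole period covered by the logs. Every entry it does return is a candidate for a real outage and worth a look in the Error Logs.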

PS: Who has found the little fake in this blog?
