6.6.3. Handling Multiple Host Failures

This section outlines the procedure for keeping the cluster operational with only one functional host remaining, the "last man standing" scenario.

Typically, when a node fails, the cluster marks the node as failed and continues to operate. If the failed node happened to be the primary, then a failover will occur.

If a second node then fails, one of two things will happen:

If the manager process/host on the failed node is still reachable, then the datasource will be marked as failed and the cluster will continue to operate.

If the manager process/host is no longer reachable, then the entire cluster goes into a failsafe state. This is intentional, and protects against a possible split-brain: the last remaining manager cannot tell whether the other nodes are unreachable because of a network partition or because of a genuine outage of the hosts.

If the outage of the nodes is genuine, then you can manually bring the last remaining node online and still have a functional cluster whilst you begin the work of fixing the broken hosts.

The procedure below explains the steps. Note that this scenario deals with a 3-node cluster which failed in the following way:

  • db1 started as the master but then failed; db3 was promoted.

  • db3 failed next.

  • Since db1 and db3 were unreachable, a failover to db2 did not happen; instead, all nodes were placed into the SHUNNED(FAILSAFE_SHUN) state.

The following shows the output within cctrl after the multiple host failures:

[LOGICAL] /alpha > ls

COORDINATOR[db2:AUTOMATIC:ONLINE]

ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db2[7519](ONLINE, created=0, active=0)                                 |
+---------------------------------------------------------------------------------+

DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN))                                               |
|STATUS [SHUNNED] [2025/06/04 01:51:34 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=STOPPED)                                                         |
|  REPLICATOR(state=STATUS NOT AVAILABLE)                                         |
|  DATASERVER(state=UNKNOWN)                                                      |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE_SHUN), progress=-1, latency=-1.000)                   |
|STATUS [OK] [2025/06/04 01:49:28 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db3, state=OFFLINE)                              |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(master:SHUNNED(FAILSAFE_SHUN))                                               |
|STATUS [OK] [2025/06/04 01:51:38 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=STOPPED)                                                         |
|  REPLICATOR(state=STATUS NOT AVAILABLE)                                         |
|  DATASERVER(state=UNKNOWN)                                                      |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+

Since we, as the human operator, know that db2 is a viable host, we can issue the following commands to bring the host online as the primary:

[LOGICAL] /alpha > set policy maintenance
[LOGICAL] /alpha > set force true
[LOGICAL] /alpha > datasource db2 welcome
[LOGICAL] /alpha > datasource db2 master
[LOGICAL] /alpha > replicator db2 master
[LOGICAL] /alpha > datasource db2 online
[LOGICAL] /alpha > replicator db2 online
[LOGICAL] /alpha > set policy automatic
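The sequence above can also be generated for review before it is run. The sketch below is a hypothetical helper (the function name is illustrative, and piping the output into cctrl is an assumption about your cctrl invocation, not a documented requirement); it simply prints the same command sequence for a chosen surviving node:

```shell
# Hypothetical helper: prints the cctrl command sequence that promotes a
# surviving node after a FAILSAFE_SHUN event. Review the output first,
# then (assuming cctrl reads commands from stdin in your environment)
# pipe it in, e.g.:  failsafe_recover_cmds db2 | cctrl
failsafe_recover_cmds() {
  node="$1"
  cat <<EOF
set policy maintenance
set force true
datasource ${node} welcome
datasource ${node} master
replicator ${node} master
datasource ${node} online
replicator ${node} online
set policy automatic
EOF
}

# Print the sequence for db2, the surviving node in this scenario:
failsafe_recover_cmds db2
```

Keeping the node name in a single parameter reduces the risk of promoting the wrong host when typing the commands one by one under pressure.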

The cctrl ls output will now look like the following:

[LOGICAL] /alpha > ls

COORDINATOR[db2:AUTOMATIC:ONLINE]

ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db2[7519](ONLINE, created=0, active=0)                                 |
+---------------------------------------------------------------------------------+

DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN))                                               |
|STATUS [SHUNNED] [2025/06/04 01:51:34 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=STOPPED)                                                         |
|  REPLICATOR(state=STATUS NOT AVAILABLE)                                         |
|  DATASERVER(state=UNKNOWN)                                                      |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(master:ONLINE, progress=7, THL latency=0.271)                               |
|STATUS [OK] [2025/06/04 01:49:28 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=ONLINE)                                         |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(master:SHUNNED(FAILSAFE_SHUN))                                               |
|STATUS [OK] [2025/06/04 01:51:38 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=STOPPED)                                                         |
|  REPLICATOR(state=STATUS NOT AVAILABLE)                                         |
|  DATASERVER(state=UNKNOWN)                                                      |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+

When a failed node has been repaired, if the host is viable and does not need any host or database restore, then once all software on the host has been started, you can issue:

[LOGICAL] /alpha > recover

or:

[LOGICAL] /alpha > datasource <host> recover

If these steps fail, then you may need to reprovision the host using tprovision. For more details on this process, see Section 6.6.1, “Recover a failed Replica”.
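When several hosts need recovery, the per-node form of the command can be generated in the same reviewable way. The sketch below is a hypothetical wrapper (the function name is illustrative, and piping its output into cctrl is an assumption about your setup); it only prints one recover command per repaired node:

```shell
# Hypothetical wrapper: prints a "datasource <host> recover" command for
# each repaired node, so the list can be reviewed before being fed to
# cctrl (assuming cctrl reads commands from stdin in your environment).
recover_cmds() {
  for node in "$@"; do
    printf 'datasource %s recover\n' "$node"
  done
}

# Both failed hosts in this scenario, once their software is running:
recover_cmds db1 db3
```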