This section outlines the procedure for keeping the cluster operational when only one functional host remains, a scenario also known as "the last man standing".
Typically, when a node fails, the cluster will mark the node as failed and continue to operate. If the failed node happened to be the primary, then a failover will occur.
If a second node then fails, one of two things will happen:
If the manager process/host on the failed node is still reachable, then the datasource will be marked as failed and the cluster will continue to operate.
If the manager process/host is no longer reachable, then the entire cluster goes into a failsafe state. This is intentional and protects against a possible split-brain, since the last remaining manager cannot tell whether it is unable to reach the other nodes because of a network partition or because of a genuine outage of the hosts.
If the outage of the nodes is genuine, then you can manually bring up the last remaining node and still have a functional cluster whilst you begin the work of fixing the broken hosts.
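A quick way to confirm from the command line that the cluster has entered this failsafe state is to search the cctrl ls output for the SHUNNED(FAILSAFE_SHUN) marker. The example below is a minimal sketch; it assumes cctrl is in the PATH and accepts commands piped on standard input, which may vary between installations:

shell> echo "ls" | cctrl | grep "FAILSAFE_SHUN"

If any datasources are listed, the cluster is in the failsafe-shunned state and the manual procedure below applies.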
The procedure below explains the steps. Note that in this scenario we are dealing with a 3-node cluster that failed in the following way: db1 started as the master but then failed, and db3 was promoted. db3 then failed as well. Since db1 and db3 were unreachable, a failover to db2 did not happen; instead, all nodes were placed into the SHUNNED(FAILSAFE_SHUN) state.
This shows the output within cctrl after the multiple host failures:
[LOGICAL] /alpha > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db2[7519](ONLINE, created=0, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN)) |
|STATUS [SHUNNED] [2025/06/04 01:51:34 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=STOPPED) |
| REPLICATOR(state=STATUS NOT AVAILABLE) |
| DATASERVER(state=UNKNOWN) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE_SHUN), progress=-1, latency=-1.000) |
|STATUS [OK] [2025/06/04 01:49:28 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=db3, state=OFFLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(master:SHUNNED(FAILSAFE_SHUN)) |
|STATUS [OK] [2025/06/04 01:51:38 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=STOPPED) |
| REPLICATOR(state=STATUS NOT AVAILABLE) |
| DATASERVER(state=UNKNOWN) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
Since we, as the human operator, know that db2 is a viable host, we can issue the following steps to bring the host online as the primary:
[LOGICAL] /alpha >set policy maintenance
[LOGICAL] /alpha >set force true
[LOGICAL] /alpha >datasource db2 welcome
[LOGICAL] /alpha >datasource db2 master
[LOGICAL] /alpha >replicator db2 master
[LOGICAL] /alpha >datasource db2 online
[LOGICAL] /alpha >replicator db2 online
[LOGICAL] /alpha >set policy automatic
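If you prefer to apply the same sequence non-interactively, it could be scripted by feeding the commands to cctrl on standard input. This is a sketch only, assuming cctrl reads commands from stdin and that db2 is the surviving host; review each step before running it:

shell> cctrl << EOF
set policy maintenance
set force true
datasource db2 welcome
datasource db2 master
replicator db2 master
datasource db2 online
replicator db2 online
set policy automatic
EOF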
The cctrl ls output will now look like the following:
[LOGICAL] /alpha > ls
COORDINATOR[db2:AUTOMATIC:ONLINE]
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@db2[7519](ONLINE, created=0, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN)) |
|STATUS [SHUNNED] [2025/06/04 01:51:34 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=STOPPED) |
| REPLICATOR(state=STATUS NOT AVAILABLE) |
| DATASERVER(state=UNKNOWN) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(master:ONLINE, progress=7, THL latency=0.271) |
|STATUS [OK] [2025/06/04 01:49:28 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(master:SHUNNED(FAILSAFE_SHUN)) |
|STATUS [OK] [2025/06/04 01:51:38 PM UTC] |
+---------------------------------------------------------------------------------+
| MANAGER(state=STOPPED) |
| REPLICATOR(state=STATUS NOT AVAILABLE) |
| DATASERVER(state=UNKNOWN) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
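Before directing application traffic back at the cluster, you may also want to confirm on db2 itself that the replicator is acting as the primary. A simple check with trepctl, run on db2 and assuming trepctl is in the PATH, is sketched below; the role should report master and the state ONLINE, matching the cctrl output above:

shell> trepctl status | grep -E 'role|state'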
When the failed nodes have been recovered, and provided the node is viable and does not need a host or database restore, then once all of the software on the host has been started, you can issue:
[LOGICAL] /alpha >recover
or:
[LOGICAL] /alpha >datasource <host> recover
If these steps fail, then you may need to reprovision the host using tprovision. For more details on this process, see Section 6.6.1, “Recover a failed Replica”.
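As a final illustration, the end-to-end return of a repaired node such as db1 to the cluster might look like the following. This is a hedged sketch: the startall wrapper for starting the Tungsten components and the use of cctrl via standard input are assumptions about your installation and may differ, so substitute the commands appropriate to your environment:

shell> startall                              # start the Tungsten components on the repaired host
shell> echo "datasource db1 recover" | cctrl  # then re-add the datasource from a node with a running manager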