6.3.4. Understanding Datasource States

All datasources will be in one of a number of states that indicate their current operational status.
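
For example, the current state of every datasource can be checked from within cctrl using the ls command (a minimal sketch; the service name alpha is illustrative):

[LOGICAL] /alpha > ls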

6.3.4.1. ONLINE State

A datasource in the ONLINE state is operating normally, with replication, connector, and other traffic being handled as usual.

6.3.4.2. OFFLINE State

A datasource in the OFFLINE state does not accept connections through the connector for either reads or writes.
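
A datasource can also be placed into the OFFLINE state manually from within cctrl, for example ahead of planned work (a sketch; the host name db3 is illustrative):

cctrl> datasource db3 offline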

When the dataservice is in the AUTOMATIC policy mode, a datasource in the OFFLINE state is automatically recovered and placed into the ONLINE state. If this operation fails, the datasource remains in the OFFLINE state.
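
For example, to place the dataservice into the AUTOMATIC policy mode from within cctrl (a minimal sketch):

cctrl> set policy automatic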

When the dataservice is in the MAINTENANCE or MANUAL policy mode, the datasource will remain in the OFFLINE state until it is explicitly switched to the ONLINE state.
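
For example, to bring an OFFLINE datasource back into service by hand (a sketch; the host name db3 is illustrative):

cctrl> datasource db3 online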

6.3.4.3. FAILED State

When a datasource fails, for example because one of the services for the datasource stops responding or fails, the datasource is placed into the FAILED state. In the example below, the underlying dataserver has failed:

+---------------------------------------------------------------------------------+
|db3(slave:FAILED(DATASERVER 'db3@alpha' STOPPED), progress=3, latency=0.194)     |
|STATUS [CRITICAL] [2025/01/24 02:59:42 PM UTC]                                   |
|REASON[DATASERVER 'db3@alpha' STOPPED]                                           |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=STOPPED)                                                      |
|  CONNECTIONS(created=8, active=0)                                               |
+---------------------------------------------------------------------------------+

For a FAILED datasource, the recover command within cctrl can be used to attempt to return the datasource to the operational state. If this fails, the underlying fault must be identified and addressed before the datasource can be recovered.
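
For example, using the host from the output above (a sketch):

cctrl> datasource db3 recover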

6.3.4.4. SHUNNED State

The SHUNNED state implies that the datasource is OFFLINE. Unlike the OFFLINE state, however, a SHUNNED datasource is not automatically recovered.

A datasource in the SHUNNED state is not connected to, or an active part of, the dataservice. Individual services can be reconfigured and restarted, and maintenance on the operating system or any other component can be carried out while a host is in the SHUNNED state without affecting the other members of the dataservice.

Datasources can be manually or automatically shunned. The current reason for the SHUNNED state is indicated in the status output. For example, in the sample below, the node db3 was manually shunned for maintenance reasons:

+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(MANUALLY-SHUNNED), progress=3, latency=0.000)                  |
|STATUS [SHUNNED] [2025/01/24 03:00:42 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=8, active=0)                                               |
+---------------------------------------------------------------------------------+
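
A typical maintenance flow shuns the node, performs the required work, and then returns the node to service (a sketch using the host from the example above):

cctrl> datasource db3 shun
(perform maintenance on db3)
cctrl> datasource db3 welcome
cctrl> datasource db3 online (if needed)
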
6.3.4.4.1. Various SHUNNED States

A SHUNNED node can have a number of different sub-states, depending on the actions or events that have occurred within the cluster. The sub-states are described below, along with examples and possible troubleshooting steps and solutions, where applicable.

Warning

Please THINK before you issue ANY commands. These are examples ONLY, and are not to be followed blindly, because every situation is different.

6.3.4.4.1.1. SHUNNED(DRAIN-CONNECTIONS)

The DRAIN-CONNECTIONS state means that the datasource [NODE|CLUSTER] drain [timeout] command has completed successfully, and the node or cluster is now SHUNNED as requested.

The datasource drain command prevents new connections to the specified datasource, while leaving ongoing connections untouched. If a timeout (in seconds) is given, ongoing connections are severed after the timeout expires. The command returns immediately, whether or not a timeout is given. Under the hood, it puts the datasource into the SHUNNED state, with lastShunReason set to DRAIN-CONNECTIONS. This feature is available as of version 7.0.2.

[LOGICAL] / > use global
[LOGICAL] /global > datasource beta drain
[LOGICAL] /global > ls
DATASOURCES:
+---------------------------------------------------------------------------------+
|alpha(composite master:ONLINE, global progress=215, max latency=0.937)           |
|STATUS [OK] [2025/01/24 03:08:54 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  alpha(master:ONLINE, progress=8, max latency=0.937)                            |
|  alpha_from_beta(relay:ONLINE, progress=207, max latency=0.633)                 |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|beta(composite master:SHUNNED(DRAIN-CONNECTIONS), global progress=8, max         |
|latency=0.955)                                                                   |
|STATUS [SHUNNED] [2025/01/24 03:11:50 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  beta(master:SHUNNED, progress=207, max latency=0.611)                          |
|  beta_from_alpha(relay:ONLINE, progress=8, max latency=0.955)                   |
+---------------------------------------------------------------------------------+

[LOGICAL] /global > recover

6.3.4.4.1.2. SHUNNED(FAILSAFE_SHUN)

The FAILSAFE_SHUN state means that there was a complete network partition, such that none of the nodes were able to communicate with each other. Database writes are blocked to prevent a split-brain from occurring.

+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE_SHUN), progress=56747909871, THL                |
|latency=12.157)                                                             |
|STATUS [OK] [2025/01/24 02:08:54 PM UTC]                                    |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=master, state=ONLINE)                                      |
| DATASERVER(state=ONLINE)                                                   |
| CONNECTIONS(created=374639937, active=0)                                   |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE_SHUN), progress=-1, latency=-1.000)              |
|STATUS [SHUNNED] [2025/01/24 03:11:50 PM UTC]                               |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=slave, master=db1, state=OFFLINE)                          |
| DATASERVER(state=STOPPED)                                                  |
| CONNECTIONS(created=70697946, active=0)                                    |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:SHUNNED(FAILSAFE_SHUN), progress=56747909871, latency=12.267)     |
|STATUS [SHUNNED] [2025/01/24 02:32:50 PM UTC]                               |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                           |
| DATASERVER(state=ONLINE)                                                   |
| CONNECTIONS(created=168416988, active=0)                                   |
+----------------------------------------------------------------------------+

cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover

6.3.4.4.1.3. SHUNNED(MANUALLY-SHUNNED)

The MANUALLY-SHUNNED state means that an administrator has issued the datasource {NODE|CLUSTER} shun command using cctrl or the REST API, resulting in the specified node or cluster being SHUNNED.
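
For example, to manually shun a single Replica node (a sketch; the host name db2 matches the output below):

cctrl> datasource db2 shun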

Warning

Unless it is unavoidable, and you are directed to do so by Continuent Support, you should never manually shun the Primary node.

+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(MANUALLY-SHUNNED), progress=8, latency=0.937)                  |
|STATUS [SHUNNED] [2025/01/24 03:16:56 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=9, active=0)                                               |
+---------------------------------------------------------------------------------+

cctrl> datasource db2 welcome
cctrl> datasource db2 online (if needed)
cctrl> recover

6.3.4.4.1.4. SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER)

The CONFLICTS-WITH-COMPOSITE-MASTER state means that the cluster already has an active Primary, and this Primary cannot be brought online because of that conflict.

+----------------------------------------------------------------------------+
|db1(master:SHUNNED(CONFLICTS-WITH-COMPOSITE-MASTER),                        |
|progress=25475128064, THL latency=0.010)                                    |
|STATUS [SHUNNED] [2025/01/24 03:16:56 PM UTC]                               |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=master, state=ONLINE)                                      |
| DATASERVER(state=ONLINE)                                                   |
| CONNECTIONS(created=2568, active=0)                                        |
+----------------------------------------------------------------------------+

6.3.4.4.1.5. SHUNNED(FAILSAFE AFTER Shunned by fail-safe procedure)

The FAILSAFE AFTER Shunned by fail-safe procedure state means that the Manager voting quorum encountered an unrecoverable problem and shut down database writes to prevent a split-brain situation.

+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILSAFE AFTER Shunned by fail-safe                      |
|procedure), progress=96723577, THL latency=0.779)                           |
|STATUS [OK] [2025/01/24 03:08:57 PM UTC]                                    |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=master, state=ONLINE)                                      |
| DATASERVER(state=ONLINE)                                                   |
| CONNECTIONS(created=135, active=0)                                         |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:SHUNNED(FAILSAFE AFTER Shunned by fail-safe                       |
|procedure), progress=96723575, latency=0.788)                               |
|STATUS [SHUNNED] [2025/01/24 03:16:56 PM UTC]                               |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                           |
| DATASERVER(state=ONLINE)                                                   |
| CONNECTIONS(created=28, active=0)                                          |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db5(slave:SHUNNED:ARCHIVE (FAILSAFE AFTER Shunned by                        |
|fail-safe procedure), progress=96723581, latency=0.905)                     |
|STATUS [OK] [2025/01/24 03:08:57 PM UTC]                                    |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                      |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                           |
| DATASERVER(state=ONLINE)                                                   |
| CONNECTIONS(created=23, active=0)                                          |
+----------------------------------------------------------------------------+

cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover

6.3.4.4.1.6. SHUNNED(SUBSERVICE-SWITCH-FAILED)

The SUBSERVICE-SWITCH-FAILED state means that the cluster tried to switch the Primary role to another node in response to an administrative request, but was unable to do so because of a failure at the sub-service level in a Composite Active/Active cluster.

+---------------------------------------------------------------------------------+
|db1(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668586,                   |
|latency=1.197)                                                                   |
|STATUS [OK] [2025/01/24 03:08:57 PM UTC]                                         |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                           |
| REPLICATOR(role=relay, master=db4, state=ONLINE)                                |
| DATASERVER(state=ONLINE)                                                        |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668586,                   |
|latency=1.239)                                                                   |
|STATUS [SHUNNED] [2025/01/24 03:16:56 PM UTC]                                    |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                           |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                                |
| DATASERVER(state=ONLINE)                                                        |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=6668591,                   |
|latency=0.501)                                                                   |
|STATUS [OK] [2025/01/24 03:08:57 PM UTC]                                         |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                           |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                                |
| DATASERVER(state=ONLINE)                                                        |
+---------------------------------------------------------------------------------+

cctrl> use {SUBSERVICE-NAME-HERE}
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover

6.3.4.4.1.7. SHUNNED(FAILED-OVER-TO-node)

The FAILED-OVER-TO-node state means that the cluster automatically and successfully invoked a failover from one node to another. The fact that there appear to be two masters is completely normal after a failover, and indicates that the cluster should be manually recovered once the node that failed has been fixed.

Note

After recovery, the failed master will be recovered as a slave. The cluster does not automatically switch back to the original master; should this be required, issue the switch to nodename command.

+---------------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILED-OVER-TO-db3), progress=8, THL latency=0.825)           |
|STATUS [SHUNNED] [2025/01/24 03:22:57 PM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=DEGRADED)                                        |
|  DATASERVER(state=STOPPED)                                                      |
|  CONNECTIONS(created=1, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=8, latency=0.937)                                     |
|STATUS [OK] [2025/01/24 03:23:17 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                               |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=9, active=0)                                               |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(master:ONLINE, progress=10, THL latency=0.165)                               |
|STATUS [OK] [2025/01/24 03:23:02 PM UTC]                                         |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=master, state=ONLINE)                                          |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=1, active=0)                                               |
+---------------------------------------------------------------------------------+

cctrl> recover
cctrl> switch to db1 (if required)

6.3.4.4.1.8. SHUNNED(SET-RELAY)

The SET-RELAY state means that the cluster was in the middle of a switch which failed to complete, either in the passive cluster of a Composite Active/Passive deployment or in a Composite Active/Active sub-service.

+---------------------------------------------------------------------------------+
|db1(relay:SHUNNED(SET-RELAY), progress=-1, latency=-1.000)                       |
|STATUS [SHUNNED] [2025/01/24 03:22:57 PM UTC]                                    |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                           |
| REPLICATOR(role=relay, master=db4, state=SUSPECT)                               |
| DATASERVER(state=ONLINE)                                                        |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db2(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=14932,                     |
|latency=0.000)                                                                   |
|STATUS [SHUNNED] [2025/01/24 03:22:57 PM UTC]                                    |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                           |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                                |
| DATASERVER(state=ONLINE)                                                        |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|db3(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=14932,                     |
|latency=0.000)                                                                   |
|STATUS [SHUNNED] [2025/01/24 03:22:57 PM UTC]                                    |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE)                                                           |
| REPLICATOR(role=slave, master=db1, state=ONLINE)                                |
| DATASERVER(state=ONLINE)                                                        |
+---------------------------------------------------------------------------------+

cctrl> use {PASSIVE-SERVICE-NAME-HERE}
cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover

6.3.4.4.1.9. SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER)

The FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER state means that the cluster tried to automatically fail over the Primary role to another node, but was unable to do so.

+----------------------------------------------------------------------------+
|db1(master:SHUNNED(FAILOVER-ABORTED AFTER UNABLE TO COMPLETE FAILOVER       |
|       FOR DATASOURCE 'db1'. CHECK COORDINATOR MANAGER LOG),                |
|       progress=21179013, THL latency=4.580)                                |
|STATUS [SHUNNED] [2025/01/24 03:22:57 PM UTC]                               |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=master, state=ONLINE)                                     |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=294474815, active=0)                                  |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db2(slave:ONLINE, progress=21179013, latency=67.535)                        |
|STATUS [OK] [2025/01/24 03:23:02 PM UTC]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=22139851, active=1)                                   |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|db3(slave:ONLINE, progress=21179013, latency=69.099)                        |
|STATUS [OK] [2025/01/24 03:23:02 PM UTC]                                    |
+----------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                     |
|  REPLICATOR(role=slave, master=db1, state=ONLINE)                          |
|  DATASERVER(state=ONLINE)                                                  |
|  CONNECTIONS(created=66651718, active=7)                                   |
+----------------------------------------------------------------------------+
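
The cause of the aborted failover should be investigated in the coordinator Manager log, as the status output suggests, before any recovery is attempted. A possible recovery sequence, mirroring the other states above (an illustrative sketch only; heed the warning at the start of this section):

cctrl> set force true
cctrl> datasource db1 welcome
cctrl> datasource db1 online (if needed)
cctrl> recover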

6.3.4.4.1.10. SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)

The CANNOT-SYNC-WITH-HOME-SITE state is a composite-level state which means that the sites were unable to see each other at some point in time. This scenario may require a manual recovery at the composite level for the cluster to heal.

From the alpha side:
beta(composite master:SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)

From the beta side:
alpha(composite master:SHUNNED(CANNOT-SYNC-WITH-HOME-SITE)

cctrl> use {COMPOSITE-SERVICE-NAME-HERE}
cctrl> recover