In a Composite Active/Active topology, a switch or a failover not only promotes a
Replica to be the new Primary, but also requires cross-site communications to be
reconfigured. The process therefore assumes that cross-site communication is
online and working. In some situations cross-site communication may be down, or
cross-site replication may be in an OFFLINE:ERROR state, for example because a
DDL or DML statement that worked in the local cluster failed to apply in the
remote cluster.
If a switch or failover occurs and the process is unable to reconfigure the
cross-site replicators, the local switch will still succeed; however, the
associated cross-site services will be placed into a
SHUNNED(SUBSERVICE-SWITCH-FAILED) state.
This guide explains how to recover from this situation.
The examples are based on a 2-cluster topology, with clusters named nyc and
london, and a composite dataservice named global. The clusters are configured
with the following dataservers:
nyc: db1 (Primary), db2 (Replica), db3 (Replica)
london: db4 (Primary), db5 (Replica), db6 (Replica)
The cross-site replicators in both clusters are in an OFFLINE:ERROR state due
to failing DDL.
A switch was then issued, promoting db3 as the new Primary in nyc and db5 as
the new Primary in london.
At the top level the switch was a success; however, in the cctrl output below you
can see that all datasources in the london_from_nyc sub-service are in the
SHUNNED(SUBSERVICE-SWITCH-FAILED) state and that only partial reconfiguration has
happened. The same can also be observed in the nyc_from_london sub-service.
shell> cctrl
Tungsten Cluster 7.0.3 build 141
nyc: session established
[LOGICAL:EXPERT] / > use london_from_nyc
london_from_nyc: session established, encryption=false, authentication=false
[LOGICAL:EXPERT] /london_from_nyc > ls

COORDINATOR[db6:AUTOMATIC:ONLINE]

DATASOURCES:
+---------------------------------------------------------------------------------+
|db4(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000)        |
|STATUS [SHUNNED] [2025/01/29 11:55:42 AM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=relay, master=db3, state=SUSPECT)                              |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------+
|db5(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000)        |
|STATUS [SHUNNED] [2025/01/29 11:55:42 AM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db4, state=SUSPECT)                              |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------+
|db6(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000)        |
|STATUS [SHUNNED] [2025/01/29 11:55:42 AM UTC]                                    |
+---------------------------------------------------------------------------------+
|  MANAGER(state=ONLINE)                                                          |
|  REPLICATOR(role=slave, master=db4, state=SUSPECT)                              |
|  DATASERVER(state=ONLINE)                                                       |
|  CONNECTIONS(created=0, active=0)                                               |
+---------------------------------------------------------------------------------+
The role of db5 should be relay and the role of db4 should be slave.
The replicators for db4 and db6 should be Replicas of db5.
db5 has been correctly configured as the new Primary in london, as has db3 in nyc.
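The comparison above between the actual and the expected roles can also be made mechanically from saved `ls` output. The following is a minimal sketch and not part of the product: `parse_ls` and `role_mismatches` are hypothetical helper names, and the regular expressions assume the datasource box layout shown in the example output above.

```python
import re

# Datasource header row, e.g. |db4(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, ...
HEADER = r"\|\s*(?P<name>[\w.-]+)\((?P<role>\w+):(?P<state>[A-Z]+(?:\([A-Z-]+\))?)"
# Replicator row, e.g. REPLICATOR(role=relay, master=db3, state=SUSPECT)
REPLICATOR = r"REPLICATOR\(role=\w+, master=(?P<master>[\w.-]+)"
ROW = re.compile(f"{HEADER}|{REPLICATOR}")


def parse_ls(text):
    """Map each datasource name to its role, state and replicator source."""
    datasources = {}
    current = None
    # Walk header and replicator rows in the order they appear, so each
    # replicator row is attached to the datasource header preceding it.
    for match in ROW.finditer(text):
        if match.group("name"):
            current = match.group("name")
            datasources[current] = {
                "role": match.group("role"),
                "state": match.group("state"),
                "master": None,
            }
        elif current is not None:
            datasources[current]["master"] = match.group("master")
    return datasources


def role_mismatches(datasources, relay, remote_primary):
    """List (name, field, actual, expected) for every value that is wrong.

    Expected layout: `relay` has the relay role and reads from the remote
    Primary; every other datasource is a slave of `relay`.
    """
    problems = []
    for name, info in datasources.items():
        want_role = "relay" if name == relay else "slave"
        want_master = remote_primary if name == relay else relay
        if info["role"] != want_role:
            problems.append((name, "role", info["role"], want_role))
        if info["master"] != want_master:
            problems.append((name, "master", info["master"], want_master))
    return problems


# Rows abbreviated from the example output in this guide.
SAMPLE = """
|db4(relay:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000) |
| REPLICATOR(role=relay, master=db3, state=SUSPECT) |
|db5(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000) |
| REPLICATOR(role=slave, master=db4, state=SUSPECT) |
|db6(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000) |
| REPLICATOR(role=slave, master=db4, state=SUSPECT) |
"""

for name, field, actual, expected in role_mismatches(parse_ls(SAMPLE), "db5", "db3"):
    print(f"{name}: {field} is {actual}, expected {expected}")
```

The sketch only reports differences; deciding which recovery steps to apply still depends on the actual cluster state.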
The actual state of the cluster in each scenario may be different, depending upon the cause of the loss of cross-site communication. Using the steps below, apply the actions that relate to your own cluster state. If in any doubt, always contact Continuent Support for assistance.
The first step is to ensure that the initial replication errors have been resolved and that the replicators are in an online state. The steps needed to resolve the replicators will depend on the reason for the error.
From one node, connect into cctrl at the expert level:
shell> cctrl -expert
Next, connect to the cross-site subservice, in this example london_from_nyc:
[LOGICAL:EXPERT] / > use london_from_nyc
london_from_nyc: session established, encryption=false, authentication=false
Enable override of commands issued:
[LOGICAL:EXPERT] /london_from_nyc > set force true
Bring the datasources online:
[LOGICAL:EXPERT] /london_from_nyc > datasource db4 online
[LOGICAL:EXPERT] /london_from_nyc > datasource db5 online
[LOGICAL:EXPERT] /london_from_nyc > datasource db6 online
Issue the switch to force the cluster to move the relay role to the correct node:
[LOGICAL:EXPERT] /london_from_nyc > switch to db5
WARNING: This is an expert-level command: Switch in a subservice will make the RELAY different from MASTER. This may cause data corruption or make the cluster unavailable.
Do you want to continue? (y/n)> y
SET POLICY: MAINTENANCE => MAINTENANCE
EVALUATING SLAVE: db5(stored=12, applied=12, latency=257.083, datasource-group-id=0)
SELECTED SLAVE: db5@london_from_nyc
Savepoint switch_1(cluster=london_from_nyc, source=db4, created=2025/01/29 13:33:32 UTC) created
PRIMARY IS REMOTE. USING 'thl://db3:2112/' for the MASTER URI
PUT THE NEW RELAY 'db5@london_from_nyc' ONLINE
PUT THE PRIOR RELAY 'db4@london_from_nyc' ONLINE AS A SLAVE
SWITCH TO 'db5@london_from_nyc' WAS SUCCESSFUL
In this example, cctrl warns about switching the relay. In normal operation this
warning should be heeded, since the relay and master roles should be on the same
host. In this case we can ignore it, because we are intentionally moving the relay
to restore the alignment of relay and master that the warning is there to protect.
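The command sequence above can be written out for review before it is typed into cctrl. This is a minimal sketch: `relay_switch_commands` is a hypothetical helper that only builds the command strings used in this guide. It does not run them, and it does not answer the y/n confirmation that the switch command asks for.

```python
def relay_switch_commands(subservice, datasources, new_relay):
    """Build the cctrl commands that bring all datasources online and
    then move the relay role to `new_relay`."""
    if new_relay not in datasources:
        raise ValueError(f"{new_relay} is not one of {datasources}")
    commands = [f"use {subservice}", "set force true"]
    commands += [f"datasource {name} online" for name in datasources]
    commands.append(f"switch to {new_relay}")
    return commands


# Values taken from the example topology in this guide.
for command in relay_switch_commands("london_from_nyc", ["db4", "db5", "db6"], "db5"):
    print(command)
```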
In some situations the relay role may have switched, but the replicators are in a
mixed state and not configured correctly. For example, db5 may show correctly as
the new relay, but its replicator may not have updated and may still point to the
old master in the nyc cluster (db1 instead of db3).
In that situation, the following steps should be followed:
From one node, connect into cctrl at the expert level:
shell> cctrl -expert
Next, connect to the cross-site subservice, in this example london_from_nyc:
[LOGICAL:EXPERT] / > use london_from_nyc
london_from_nyc: session established, encryption=false, authentication=false
Enable override of commands issued:
[LOGICAL:EXPERT] /london_from_nyc > set force true
Bring the relay datasource online:
[LOGICAL:EXPERT] /london_from_nyc > datasource db5 online
Next, place the service into MAINTENANCE mode:
[LOGICAL:EXPERT] /london_from_nyc > set policy maintenance
If the relay replicator needs its source changed to the new Primary in the remote cluster, take the replicator offline. If the relay source is already correct, skip ahead to the step that reconfigures the replicators of the remaining datasources.
[LOGICAL:EXPERT] /london_from_nyc > replicator db5 offline
Change the source of the relay replicator:
[LOGICAL:EXPERT] /london_from_nyc > replicator db5 relay nyc/db3
Bring the relay replicator online:
[LOGICAL:EXPERT] /london_from_nyc > replicator db5 online
For each datasource whose replicator needs to be reconfigured, issue the following commands:
[LOGICAL:EXPERT] /london_from_nyc > replicator {datasource} offline
[LOGICAL:EXPERT] /london_from_nyc > replicator {datasource} slave {relay}
[LOGICAL:EXPERT] /london_from_nyc > replicator {datasource} online
For example:
[LOGICAL:EXPERT] /london_from_nyc > replicator db4 offline
[LOGICAL:EXPERT] /london_from_nyc > replicator db4 slave db5
[LOGICAL:EXPERT] /london_from_nyc > replicator db4 online
Once all replicators are using the correct source, we can then bring the cluster back:
[LOGICAL:EXPERT] /london_from_nyc > cluster welcome
Some of the datasources may still be in the SHUNNED state. For each of those, issue the following:
[LOGICAL:EXPERT] /london_from_nyc > datasource {datasource} online
For example:
[LOGICAL:EXPERT] /london_from_nyc > datasource db4 online
Once all nodes are online, we can then return the cluster to AUTOMATIC:
[LOGICAL:EXPERT] /london_from_nyc > set policy automatic
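The replicator reconfiguration part of this procedure, including the choice of whether the relay itself needs a new source, can be sketched as a small command builder. `replicator_plan` and its `relay_source_ok` flag are hypothetical names introduced for illustration; the command strings are the ones shown in the steps above, and the sketch prints them for review rather than executing anything.

```python
def replicator_plan(relay, remote_source, replicas, relay_source_ok=False):
    """Build the cctrl replicator commands for one cross-site subservice.

    remote_source uses the cluster/host form from this guide, e.g. "nyc/db3".
    Set relay_source_ok=True when the relay already reads from the correct
    remote Primary, so only the local replicas are repointed.
    """
    commands = []
    if not relay_source_ok:
        commands += [
            f"replicator {relay} offline",
            f"replicator {relay} relay {remote_source}",
            f"replicator {relay} online",
        ]
    for name in replicas:
        if name == relay:
            raise ValueError(f"{name} is the relay and cannot also be a replica")
        commands += [
            f"replicator {name} offline",
            f"replicator {name} slave {relay}",
            f"replicator {name} online",
        ]
    return commands


# Example topology: db5 is the relay, db4 and db6 should read from it.
for command in replicator_plan("db5", "nyc/db3", ["db4", "db6"]):
    print(command)
```

Only pass the datasources whose replicators actually need changing as `replicas`; one that already points at the relay can be left out.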
Whichever process you followed above, repeat it on the other cluster if required.
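On either cluster, the last two steps depend on the datasource states reported by `ls`: any datasource still SHUNNED is brought online first, and the policy is returned to AUTOMATIC only once every datasource is ONLINE. A minimal sketch of that decision follows. `finishing_commands` is a hypothetical helper, and its regular expression assumes the datasource header format shown in the `ls` output earlier in this guide.

```python
import re

# Captures the name and the leading state word of each datasource header row,
# e.g. |db4(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), ...  ->  ("db4", "SHUNNED")
HEADER = re.compile(r"\|\s*([\w.-]+)\(\w+:([A-Z]+)")


def finishing_commands(ls_output):
    """Suggest the next cctrl commands based on datasource states."""
    states = dict(HEADER.findall(ls_output))
    shunned = [name for name, state in states.items() if state == "SHUNNED"]
    if shunned:
        return [f"datasource {name} online" for name in shunned]
    if states and all(state == "ONLINE" for state in states.values()):
        return ["set policy automatic"]
    # Nothing parsed, or some datasource is in another state: suggest nothing
    # and leave the decision to the operator.
    return []


# Hypothetical `ls` rows for illustration.
example = """
|db4(slave:SHUNNED(SUBSERVICE-SWITCH-FAILED), progress=-1, latency=-1.000) |
|db5(relay:ONLINE, progress=12, latency=0.100) |
"""
for command in finishing_commands(example):
    print(command)
```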