Reference Guide

Failure/Recovery from failure of Site/Node

This chapter presents the following topics:

• Planned failover

• Operation steps

Planned failover

Windows Admin Center provides a Switch Direction feature that lets you migrate workloads from one site to the other. The switch must be initiated separately for each volume. VMs hosted on a volume follow that volume to the other site after 10 minutes. This feature is helpful in scenarios such as:

● Planned downtime is scheduled for a site

● A potential weather event could take the site down

To use the Switch Direction feature, open Windows Admin Center and select Storage Replica in the left pane. Select the SR partnership whose replication direction you want to change, select More, and then click Switch Direction.
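
For scripted environments, the same reversal is typically expressed with the Storage Replica cmdlet Set-SRPartnership. The following is a minimal Python sketch that only assembles the command line. It does not run anything. It assumes that the cmdlet is an acceptable substitute for the Switch Direction button in your environment. The node and replication group names in the usage note are hypothetical placeholders.

```python
def build_switch_direction_command(new_source, source_rg, new_destination, destination_rg):
    """Return a Set-SRPartnership command line that reverses replication.

    new_source / source_rg: node and replication group that become the source.
    new_destination / destination_rg: node and group that become the destination.
    All values passed in are supplied by the caller; none are discovered here.
    """
    args = [
        ("-NewSourceComputerName", new_source),
        ("-SourceRGName", source_rg),
        ("-DestinationComputerName", new_destination),
        ("-DestinationRGName", destination_rg),
    ]
    parts = ["Set-SRPartnership"]
    for flag, value in args:
        # Quote each value so names containing spaces stay a single argument.
        parts.append(f'{flag} "{value}"')
    return " ".join(parts)
```

For example, build_switch_direction_command("node-b1", "rg-site-b", "node-a1", "rg-site-a") returns a command that makes the hypothetical group rg-site-b the new source. Because the switch is made per volume, build one command for each partnership.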

In the event of a site failure, if a volume is replicating synchronously, the data and log volumes automatically come online on the surviving site, along with the VMs associated with that volume, because the RPO is 0. With asynchronous replication, the data and log volumes do not come online automatically because the RPO is greater than 0.
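
The rule above can be captured as a small decision helper, for example to list which volumes an operator must bring online manually after a site failure. This is an illustrative sketch only. The ReplicationMode type, the function names, and the volume names in the usage note are hypothetical and are not part of any product API.

```python
from enum import Enum


class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"    # RPO is 0
    ASYNCHRONOUS = "asynchronous"  # RPO is greater than 0


def comes_online_automatically(mode):
    """Only synchronously replicated volumes fail over on their own."""
    return mode is ReplicationMode.SYNCHRONOUS


def volumes_needing_manual_action(volumes):
    """Return the names of volumes that stay offline after a site failure.

    volumes maps a volume name to its ReplicationMode.
    """
    return sorted(
        name for name, mode in volumes.items()
        if not comes_online_automatically(mode)
    )
```

Given {"Volume1": ReplicationMode.SYNCHRONOUS, "Volume2": ReplicationMode.ASYNCHRONOUS}, the helper returns ["Volume2"], the only volume that needs operator attention.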

When the failed site comes back online, the Replica and Replica-Log volumes are moved to the primary site with persistent disk reservations, and replication resumes. For a synchronously replicated volume, the replication direction cannot be changed until replication is 100 percent complete.

Operation steps

The following sections describe the steps to take in the event of different failure types.

Node failure

Handling a node failure on either site in a stretched cluster environment is no different from handling one in a traditional or standalone Azure Stack HCI cluster. A complete node failure can result from operating system or HBA corruption, or from a complete hardware failure of the node. In either case, restoring system functionality is the priority.

The high-level steps are:

1. Replace the hardware as needed.

2. Re-install the operating system on the operating system drives (if needed).

3. Join the system to the domain.

4. Assign the new node IP addresses that belong to the site where the node is hosted.

5. Add the node to the existing stretched cluster.

6. Based on the IP subnets used or the Cluster Fault Domains added, the cluster adds the drives to the correct pool.

7. Wait for the storage jobs to complete.

During this process, workloads on the affected site continue to run, and replication should not be interrupted.
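
Because site placement depends on the IP subnets in use, it can help to confirm that the addresses planned for a replacement node fall within the intended site's subnet before you add the node to the cluster. The following is a minimal sketch using Python's standard ipaddress module. The site names and subnets are hypothetical and must be replaced with the values for your environment.

```python
import ipaddress

# Hypothetical site-to-subnet mapping; substitute your own values.
SITE_SUBNETS = {
    "SiteA": ipaddress.ip_network("192.168.10.0/24"),
    "SiteB": ipaddress.ip_network("192.168.20.0/24"),
}


def site_for_ip(ip):
    """Return the site whose subnet contains ip, or None if no site matches."""
    addr = ipaddress.ip_address(ip)
    for site, network in SITE_SUBNETS.items():
        if addr in network:
            return site
    return None
```

With the mapping above, site_for_ip("192.168.10.21") returns "SiteA". An address outside both subnets returns None, which indicates that the planned address should be corrected before the node is joined.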
