Reference Guide

Failure/Recovery from failure of Site/Node
This chapter presents the following topics:
Planned failover
Operation steps
Planned failover
Windows Admin Center has a Switch Direction feature that lets you migrate workloads from one site to the other. The switch
must be initiated separately for each volume. VMs hosted on a volume follow that volume to the destination site after 10 minutes.
This feature is helpful in scenarios such as:
There is a planned downtime
A potential weather event could take the site down
To use the Switch Direction feature, open Windows Admin Center and select Storage Replica in the left pane. Then select the
SR partnership whose replication direction you want to change. Select More, and then click Switch Direction.
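Where the direction change needs to be scripted rather than clicked through, Storage Replica also exposes it through the Set-SRPartnership PowerShell cmdlet. The following is a minimal sketch in Python that only assembles such a command line; it does not run anything. The node and replication group names are hypothetical placeholders, and the exact invocation should be verified against the Storage Replica documentation for your build before use.

```python
def build_switch_direction_command(new_source, new_source_rg,
                                   new_destination, new_destination_rg):
    """Build a Set-SRPartnership command line that reverses replication.

    All four arguments are placeholders supplied by the caller:
    the node and replication group that become the new source, and the
    node and replication group that become the new destination.
    """
    values = (new_source, new_source_rg, new_destination, new_destination_rg)
    for value in values:
        # Reject empty names and embedded quotes so the quoted
        # arguments below stay well formed.
        if not value or '"' in value:
            raise ValueError(f"invalid name: {value!r}")

    return (
        f'Set-SRPartnership -NewSourceComputerName "{new_source}" '
        f'-SourceRGName "{new_source_rg}" '
        f'-DestinationComputerName "{new_destination}" '
        f'-DestinationRGName "{new_destination_rg}"'
    )


if __name__ == "__main__":
    # Hypothetical names: node-b1/rg-site-b become the new source.
    print(build_switch_direction_command(
        "node-b1", "rg-site-b", "node-a1", "rg-site-a"))
```

Because the switch is made per volume, a script built on this sketch would issue one such command for each SR partnership.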
In the event of a site failure, if a volume is replicating synchronously, the data and log volumes automatically come online
on the surviving site, along with the VMs associated with that volume, because the recovery point objective (RPO) is 0. For
asynchronous replication, the data and log volumes do not come online automatically because the RPO is not 0.
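The site-failure behavior described above can be summarized as a small decision function. This is an illustrative model only, not product code; the ReplicationMode type and the site_failure_outcome function are names invented for this sketch.

```python
from enum import Enum


class ReplicationMode(Enum):
    """Replication modes discussed in this chapter (hypothetical type)."""
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


def site_failure_outcome(mode):
    """Describe what happens to a replicated volume when its site fails.

    Synchronous replication has an RPO of 0, so the data and log
    volumes (and the associated VMs) come online automatically on the
    surviving site. Asynchronous replication has a nonzero RPO, so the
    volumes do not come online automatically.
    """
    if mode is ReplicationMode.SYNCHRONOUS:
        return {"rpo_is_zero": True, "comes_online_automatically": True}
    return {"rpo_is_zero": False, "comes_online_automatically": False}


if __name__ == "__main__":
    for mode in ReplicationMode:
        print(mode.value, site_failure_outcome(mode))
```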
When the failed site comes back online, the Replica and Replica-Log volumes are moved to the primary site with persistent disk
reservations, and replication begins again. For a synchronously replicated volume, the replication direction cannot be changed
until replication is 100 percent complete.
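As a sketch of how an operations script might guard against switching direction too early, the check below encodes the rule for synchronously replicated volumes. The function name and the percent_complete input are assumptions made for illustration; how the completion percentage is obtained is outside the scope of this sketch.

```python
def sync_direction_change_allowed(percent_complete):
    """Return True only when resynchronization has fully completed.

    Applies to synchronously replicated volumes: the replication
    direction cannot be changed until replication is 100 percent
    complete. percent_complete is a value from 0 to 100 supplied by
    the caller (hypothetical input).
    """
    if not 0 <= percent_complete <= 100:
        raise ValueError("percent_complete must be between 0 and 100")
    return percent_complete == 100


if __name__ == "__main__":
    for progress in (0, 57.5, 100):
        print(progress, sync_direction_change_allowed(progress))
```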
Operation steps
The following sections describe the steps to take in the event of different failure types.
Node failure
Handling a node failure on either site in a stretched cluster environment is no different from handling one in a traditional or
standalone Azure Stack HCI cluster. A complete node failure can result from operating system or HBA corruption, or from a complete
hardware failure on the node. In either case, restoring system functionality is the priority.
The high-level steps are:
1. Replace the hardware as needed.
2. Re-install the operating system on the operating system drives (if needed).
3. Join the system to the domain.
4. Assign the new node IP addresses that are specific to the site where the node is hosted.
5. Add the node to the existing stretched cluster.
6. Based on the IP subnets used or the Cluster Fault Domains added, the cluster adds the drives to the correct pool.
7. Wait for the storage jobs to complete.
8. During this process, the workloads on the affected site continue to run, and there should be no interruption of
replication.
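Steps 4 and 6 depend on the replacement node receiving IP addresses from the subnet of the site where it is hosted. A pre-check along the following lines can catch a mismatch before the node is added to the cluster. This is a minimal sketch using Python's standard ipaddress module; the site names and subnets are hypothetical placeholders, not values from this guide.

```python
import ipaddress

# Hypothetical site-to-subnet layout; replace with the real subnets.
SITE_SUBNETS = {
    "SiteA": "192.168.10.0/24",
    "SiteB": "192.168.20.0/24",
}


def site_for_ip(ip, site_subnets=SITE_SUBNETS):
    """Return the name of the site whose subnet contains ip, or None."""
    address = ipaddress.ip_address(ip)
    for site, subnet in site_subnets.items():
        if address in ipaddress.ip_network(subnet):
            return site
    return None


def node_ip_matches_site(ip, expected_site, site_subnets=SITE_SUBNETS):
    """Check that a node IP belongs to the site where the node is hosted."""
    return site_for_ip(ip, site_subnets) == expected_site


if __name__ == "__main__":
    # A replacement node hosted in SiteA with a planned address.
    print(node_ip_matches_site("192.168.10.25", "SiteA"))
    # The same address would be wrong for a node hosted in SiteB.
    print(node_ip_matches_site("192.168.10.25", "SiteB"))
```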