Reference Guide

Site failure

A site failure in a stretched cluster topology requires rebuilding all of the nodes of the affected site. If the failure happens at the primary site, the following events occur:

● All volumes hosted on the affected site and associated VMs become inaccessible.

● After a brief period, the volumes move to the secondary site.

● The VMs restart on the secondary site.

● Depending on whether synchronous or asynchronous replication is being used, you either have zero data loss or data loss within the limits of the defined RPO:

○ For replica volumes configured with synchronous replication, the VMs are crash consistent. Application recovery depends on the backup and recovery mechanisms available for the application.

○ For replica volumes configured with asynchronous replication, the VMs are not crash consistent. The default RPO is 30 seconds. The RPO can be configured using PowerShell or Windows Admin Center. Application recovery still depends on the backup and recovery mechanisms available for the application.
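The replication mode and RPO described above can be inspected and adjusted with the Storage Replica PowerShell cmdlets. The following is a minimal sketch, assuming it is run on a cluster node where the StorageReplica module is available. The node names and replication group names are hypothetical placeholders, and the 60-second RPO is only an illustrative value.

```powershell
# Sketch only: node and replication group names are hypothetical placeholders.
$sourceNode = 'site1-node1'
$sourceRG   = 'rg-site1'
$destNode   = 'site2-node1'
$destRG     = 'rg-site2'

# Show each replication group with its current mode and asynchronous RPO.
# (Property names assumed from typical Get-SRGroup output; a missing
# property simply displays as empty.)
Get-SRGroup | Select-Object Name, ReplicationMode, AsyncRPO

# Switch the partnership to asynchronous replication with a 60-second RPO.
# The value is in seconds and is an example, not a recommendation.
Set-SRPartnership -SourceComputerName $sourceNode -SourceRGName $sourceRG `
    -DestinationComputerName $destNode -DestinationRGName $destRG `
    -ReplicationMode Asynchronous -AsyncRPO 60
```

A larger RPO tolerates more replication lag but increases the amount of data that can be lost if the primary site fails.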

Site recovery

Follow these steps to recover the nodes on the failed site:

1. Remove the failed nodes from the cluster and remove their computer accounts from Active Directory.

2. Remove the Storage Replica partnership (SRPartnership) and the replication groups (SRGroup) using PowerShell cmdlets. Alternatively, disable replication from Failover Cluster Manager.

3. Bring up all the nodes on the affected site. Use the same node names and IP addresses that were used before the crash.

4. Join the nodes to the domain.

5. Add all the nodes to the existing stretched cluster at the same time.

6. All drives on the recovered site are added to a new storage pool.

7. Re-create and enable replication for replica volumes and associated log volumes using Failover Cluster Manager or PowerShell cmdlets.
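The cluster-side steps above (1, 2, 5, 6, and 7) can be sketched in PowerShell as follows. This is an illustrative outline rather than a validated procedure: it assumes it is run from a surviving node with the FailoverClusters, StorageReplica, and ActiveDirectory modules available, and every node name, replication group name, and volume path is a hypothetical placeholder. Rebuilding the nodes and joining them to the domain (steps 3 and 4) happen on the rebuilt nodes themselves and are not shown.

```powershell
# Sketch only: all names and paths below are hypothetical placeholders.
$failedNodes  = 'site1-node1', 'site1-node2'
$survivorNode = 'site2-node1'

# Step 1: evict the failed nodes and delete their computer accounts.
foreach ($node in $failedNodes) {
    Remove-ClusterNode -Name $node -Force
    Remove-ADComputer -Identity $node -Confirm:$false
}

# Step 2: remove the existing partnership, then the replication groups.
Get-SRPartnership | Remove-SRPartnership
Get-SRGroup | Remove-SRGroup

# Step 5: after the nodes are rebuilt and domain joined,
# add them back to the cluster in a single operation.
Add-ClusterNode -Name $failedNodes

# Step 6: list the non-primordial pools to check for the new site's pool.
Get-StoragePool -IsPrimordial $false

# Step 7: re-create replication from the surviving site to the recovered
# site for one data volume and its log volume.
New-SRPartnership -SourceComputerName $survivorNode -SourceRGName 'rg-site2' `
    -SourceVolumeName 'C:\ClusterStorage\Volume1' -SourceLogVolumeName 'L:' `
    -DestinationComputerName $failedNodes[0] -DestinationRGName 'rg-site1' `
    -DestinationVolumeName 'D:' -DestinationLogVolumeName 'E:' `
    -ReplicationMode Synchronous
```

Repeat the final command for each replica volume and log volume pair, and run the removal commands only after confirming that the failed site cannot be brought back in its original state.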

Failure/Recovery from failure of Site/Node