2) Heartbeat link failure - simulated by disconnecting the private network link on the active
server.
When the heartbeat link is removed from the active server, both servers detect the missing
heartbeat and attempt to fence each other. The active server is unable to fence the passive
server, since the missing link prevents it from communicating over the private network. The
passive server successfully fences the active server and takes ownership of the HA service.
3) Public link failure - simulated by disconnecting the InfiniBand or 10 Gigabit Ethernet link on the
active server.
The HA service is configured to monitor this link, as sketched below. When the public network
link is disconnected, the cluster service stops on the active server and is relocated to the
passive server.
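One way this type of public link monitoring is typically expressed in the Red Hat Cluster Suite is
through an rgmanager IP resource with link monitoring enabled. The excerpt below is a minimal,
illustrative sketch only; the service name and floating IP address are placeholder values, not the
exact resources of the tested configuration.

    <rm>
      <service name="HA_service" autostart="1" recovery="relocate">
        <!-- monitor_link="1" makes rgmanager check the link state of the
             interface carrying this floating IP; if the link goes down,
             the resource fails and the service is relocated -->
        <ip address="10.10.10.200" monitor_link="1"/>
        <!-- the LVM, file system, and NFS export resources of the HA
             service would normally be nested here as well -->
      </service>
    </rm>

With link monitoring in place, losing public connectivity results in a relocation rather than the
service continuing to run on a server that the clients can no longer reach.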
4) Private switch failure - simulated by powering off the private network switch.
When the private switch fails, both servers detect the missing heartbeat from the other server
and attempt to fence each other. Fencing is unsuccessful because the private network is
unavailable, and the HA service continues to run on the active server.
5) Fence device failure - simulated by disconnecting the iDRAC cable from the server.
If the iDRAC on a server fails, the server is fenced via the network PDUs, which are defined as
secondary fence devices in the cluster configuration files; see the sketch below.
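The fencing order described above can be expressed in cluster.conf as an ordered list of fence
methods per node: the iDRAC first, then the switched PDUs as a backup. The sketch below is
illustrative only; the agent names, addresses, credentials, and outlet numbers are assumptions
rather than the exact values of the tested configuration.

    <clusternode name="active" nodeid="1">
      <fence>
        <!-- Method 1: fence the node through its iDRAC -->
        <method name="1">
          <device name="idrac-active"/>
        </method>
        <!-- Method 2: if the iDRAC is unreachable, switch off the PDU
             outlets feeding both redundant power supplies, then switch
             them back on -->
        <method name="2">
          <device name="pdu-a" port="1" action="off"/>
          <device name="pdu-b" port="1" action="off"/>
          <device name="pdu-a" port="1" action="on"/>
          <device name="pdu-b" port="1" action="on"/>
        </method>
      </fence>
    </clusternode>

    <fencedevices>
      <fencedevice name="idrac-active" agent="fence_ipmilan" ipaddr="192.168.1.10" login="root" passwd="..."/>
      <fencedevice name="pdu-a" agent="fence_apc" ipaddr="192.168.1.20" login="apc" passwd="..."/>
      <fencedevice name="pdu-b" agent="fence_apc" ipaddr="192.168.1.21" login="apc" passwd="..."/>
    </fencedevices>

The cluster tries the methods in order, so the PDUs are used only when fencing through the iDRAC
fails, which is exactly the behavior exercised by this test.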
6) One SAS link failure - simulated by disconnecting one SAS link between the PowerEdge R710
server and the PowerVault MD3200 storage.
In the case where only one SAS link fails, the cluster service is not interrupted. Since there are
multiple paths from the server to the storage, a single SAS link failure does not break the data
path from the clients to the storage and thus does not trigger a cluster service failover.
For cases (1) through (6), the HA service failover was observed to take between half a minute
and one minute. This reaction time is faster with this version of the cluster suite than with the
previous version (4). Thus, in a healthy cluster, any failure event should be detected by the Red
Hat cluster management daemon and acted upon within minutes. Note that this is the failover
time on the NFS servers; the impact to the clients could be longer.
7) Multiple SAS link failures - simulated by disconnecting all SAS links between one PowerEdge
R710 server and the PowerVault MD3200 storage.
When all SAS links on the active server fail, the multipath daemon on the active server retries
the paths to the storage based on the parameters configured in the multipath.conf file (see the
sketch at the end of this section). This retry period is set to 150 seconds by default. After it
times out, the HA service attempts to fail over to the passive server.
If the cluster service is unable to cleanly stop the LVM and the file system because of the
broken paths, a watchdog script reboots the active server after five minutes. At this point the
passive server fences the active server, restarts the HA service, and provides the data path to
the clients. This failover can therefore take anywhere from three to eight minutes, depending
on whether the service stops cleanly: roughly the 150-second multipath timeout plus a normal
failover at the low end, and the additional five-minute watchdog reboot plus fencing and
service restart at the high end.
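The 150-second retry window referenced above is governed by two device-mapper-multipath
parameters: the polling interval and the number of retries performed after all paths have failed.
The stanza below is a minimal sketch that reproduces the 150-second behavior described in the
text (30 retries at a 5-second polling interval); it is not the complete multipath.conf of the
tested configuration, which would also typically contain device-specific settings for the
PowerVault MD3200 arrays.

    defaults {
        # paths are checked every 5 seconds
        polling_interval     5
        # after all paths fail, keep queueing and retrying I/O for
        # 30 polling intervals (30 x 5 s = 150 s) before failing it
        no_path_retry        30
        user_friendly_names  yes
    }

Lowering no_path_retry shortens the window before the HA service attempts to fail over after a
total path loss, at the cost of failing I/O sooner during a transient path disruption.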