Dell HPC NFS Storage Solution High Availability Configurations with Large Capacities
3.2. Potential failures and fault-tolerant mechanisms in NSS-HA
Many types of failures and faults can impact the functionality of NSS-HA. Table 1 lists the
potential failures that an NSS-HA solution can tolerate, based on the architecture described in
Section 3. The analysis below assumes that the HA cluster service is running on the “active”
server; the “passive” server is the other member of the cluster.
Table 1. NSS-HA mechanisms to handle failures

| Failure type | Mechanism to handle failure |
|---|---|
| Single local disk failure on a server | Operating system installed on a two-disk RAID 1 device with one hot spare. Single disk failure is unlikely to bring down server. |
| Single server failure | Monitored by the cluster service. Service fails over to passive server. |
| Power supply or power bus failure | Dual power supplies in each server. Each power supply connected to a separate power bus. Server will continue functioning with a single power supply. |
| Fence device failure | iDRAC used as primary fence device. Switched PDUs used as secondary fence devices. |
| SAS cable/port failure | Two SAS cards in each NFS server. Each card has a SAS cable to storage. A single SAS card/cable failure will not impact data availability. |
| Dual SAS cable/card failure | Monitored by the cluster service. If all data paths to the storage are lost, service fails over to the passive server. |
| InfiniBand/10GbE link failure | Monitored by the cluster service. Service fails over to passive server. |
| Private switch failure | Cluster service continues on the active server. If there is an additional component failure, service is stopped and system administrator intervention required. |
| Heartbeat network interface failure | Monitored by the cluster service. Service fails over to passive server. |
| RAID controller failure on MD3200 storage array | Dual controllers in MD3200. The second controller handles all data requests. Performance may be degraded but functionality is not impacted. |
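
The responses in Table 1 fall into three groups: the failure is absorbed by a redundant component, the service fails over to the passive server, or the service is stopped pending administrator action. The Python sketch below is one way to model that classification; it is an illustration only, not part of the NSS-HA software. The dictionary keys, the `Response` enum, and the `respond` function are hypothetical names introduced here, and treating a fence device failure as absorbed by the secondary fence device is an assumption drawn from the primary/secondary arrangement in the table.

```python
from enum import Enum


class Response(Enum):
    """The three outcomes described in Table 1."""
    CONTINUE_ON_ACTIVE = "redundancy absorbs the failure; service stays on the active server"
    FAIL_OVER = "cluster service moves the service to the passive server"
    STOP_AND_ALERT_ADMIN = "service is stopped; administrator intervention required"


# Hypothetical keys, one per row of Table 1, mapped to the documented response.
TABLE_1 = {
    "local_disk": Response.CONTINUE_ON_ACTIVE,              # RAID 1 plus hot spare
    "server": Response.FAIL_OVER,
    "power_supply_or_bus": Response.CONTINUE_ON_ACTIVE,     # dual supplies, separate buses
    "fence_device": Response.CONTINUE_ON_ACTIVE,            # assumption: secondary fence device takes over
    "single_sas_path": Response.CONTINUE_ON_ACTIVE,         # second SAS card/cable still serves data
    "all_sas_paths": Response.FAIL_OVER,
    "ib_or_10gbe_link": Response.FAIL_OVER,
    "private_switch": Response.CONTINUE_ON_ACTIVE,          # tolerated only as a lone failure
    "heartbeat_interface": Response.FAIL_OVER,
    "md3200_raid_controller": Response.CONTINUE_ON_ACTIVE,  # second controller, possibly degraded
}


def respond(failure: str, private_switch_down: bool = False) -> Response:
    """Return the Table 1 response for a single failure.

    If the private switch is already down, any additional component
    failure stops the service, as stated in the private switch row.
    """
    if failure not in TABLE_1:
        raise KeyError(f"failure type not listed in Table 1: {failure}")
    if private_switch_down and failure != "private_switch":
        return Response.STOP_AND_ALERT_ADMIN
    return TABLE_1[failure]


if __name__ == "__main__":
    for name in TABLE_1:
        print(f"{name:24} -> {respond(name).name}")
```

The sketch covers only single failures plus the one compound case that Table 1 describes explicitly (a private switch failure followed by another component failure); other combinations of failures are outside what the table specifies.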