Dell HPC NFS Storage Solution High Availability Configurations with Large Capacities

2.1. Availability in NSS-HA

A major goal of the NSS-HA solution is to improve storage service availability in the presence of

possible failures or faults. This goal is achieved by a “failover” process implemented by the Red Hat

Enterprise High Availability Cluster software stack.

Figure 2 shows a typical scenario of how storage service availability is guaranteed in the NSS-HA

solution. In this scenario, a kernel crash occurs on the active NFS server, which is the

NFS gateway for the compute cluster. Service availability is protected in three steps:

1) Failure detection – Resources related to the storage service, such as file system, service IP

address, etc., are defined, configured and monitored for health by the HA cluster. Any

interruption in access to the storage will be detected. In this case, once a kernel crash occurs

on NFS server 1 (the active one), NFS server 2 stops receiving its heartbeat signal and

concludes that server 1 has failed.

2) Fencing – In the HA cluster, once a node notices that the other node has failed, it will fence

(reboot) the failed node via a fence device. This is to ensure that only one server accesses the

data at any point to protect data integrity. In NSS-HA, a node can fence the other via the Dell

iDRAC or an APC PDU. The fence devices and corresponding fence commands are configured as

part of the HA cluster configuration process. In this case, NFS server 2 will fence NFS server 1.

3) Service failover – In the HA cluster, only after a node successfully fences the other can the

service failover process be started. Failover means that the HA service previously running on

the failed server is transferred to the healthy one. In this case, once NFS server 2 has

successfully fenced server 1, the HA service will be transferred to and started on NFS server 2.
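
The three steps above can be sketched as a small simulation. This is only an illustrative model, not the Red Hat cluster software or its API: the class names, the `fence_works` switch, and the returned status strings are all hypothetical, and a real fence operation would power-cycle the peer through a fence device rather than set a flag.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    alive: bool = True          # False once the kernel has crashed
    runs_service: bool = False  # True on the node hosting the HA service


@dataclass
class TwoNodeCluster:
    active: Node
    passive: Node
    fence_works: bool = True    # hypothetical switch to simulate a fence failure
    log: List[str] = field(default_factory=list)

    def heartbeat_lost(self) -> bool:
        # Step 1: failure detection. The passive node sees no heartbeat
        # from a crashed peer.
        return not self.active.alive

    def fence(self) -> bool:
        # Step 2: fencing. Only the attempt and its outcome are modeled here.
        self.log.append(f"fence {self.active.name}")
        if not self.fence_works:
            return False
        self.active.runs_service = False
        return True

    def failover(self) -> None:
        # Step 3: service failover, reached only after a successful fence.
        self.passive.runs_service = True
        self.log.append(f"start service on {self.passive.name}")
        self.active, self.passive = self.passive, self.active

    def monitor_once(self) -> str:
        if not self.heartbeat_lost():
            return "healthy"
        self.log.append(f"heartbeat lost from {self.active.name}")
        if not self.fence():
            return "fence-failed"  # service is NOT started on the peer
        self.failover()
        return "failed-over"


server1 = Node("nfs-server-1", runs_service=True)
server2 = Node("nfs-server-2")
cluster = TwoNodeCluster(active=server1, passive=server2)

server1.alive = False            # simulate the kernel crash on the active server
status = cluster.monitor_once()
print(status, cluster.log)
```

The point of the model is the ordering: the service is started on the healthy node only if fencing succeeds, so the two servers never access the data at the same time.
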

Figure 2. A failure scenario in NSS-HA

From the perspective of the compute cluster, performance degrades while the HA failover is in

progress. Otherwise, the failover is as transparent as possible to the compute cluster, and

user applications continue to function and access data as before.

The HA service is defined and configured as part of the cluster configuration process. In NSS-HA, the NFS

export, the service IP through which the compute nodes access the NFS server, and LVM are configured as an

HA service.
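
One way to picture an HA service is as an ordered group of resources that is started as a unit on one node. The sketch below is a minimal illustration under stated assumptions: the start order (LVM, then NFS export, then service IP), the resource names, and the `start_service` helper are hypothetical and are not taken from the NSS-HA configuration itself.

```python
from typing import Callable, List, Tuple

# A resource is a name plus a start function that reports success.
Resource = Tuple[str, Callable[[], bool]]


def start_service(resources: List[Resource]) -> Tuple[bool, List[str]]:
    """Start resources in order.

    Returns (True, started names in start order) on success, or
    (False, names that must be stopped, in reverse start order) on failure.
    """
    started: List[str] = []
    for name, start in resources:
        if not start():
            # A resource failed: report what has to be rolled back.
            return False, list(reversed(started))
        started.append(name)
    return True, started


# Assumed order for illustration: storage first, client-facing IP last.
ha_service: List[Resource] = [
    ("lvm", lambda: True),
    ("nfs_export", lambda: True),
    ("service_ip", lambda: True),
]

ok, names = start_service(ha_service)
print(ok, names)
```

Bringing the service IP up last, as assumed here, means that clients can reach the server only after the storage and the export are already in place.
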