NFS Server configuration
7) The XFS file system is mounted with the wsync option.
8) The XFS file system is exported using the NFS sync option.
9) The number of concurrent NFS threads is increased from the default of 8 to 256 on the NFS servers.
10) The default OS I/O scheduler is changed from cfq to deadline.
11) MTU is set to 9000 on the 10 Gigabit Ethernet networks.
12) The default NFS protocol used to export the XFS file system is configured to be version 3.
Performance analysis comparing NFSv3 and NFSv4 showed NFSv3 to perform significantly better in
some cases. If the security enhancements of NFSv4 are preferred, it can be used at some cost in
performance. Section 6.5 discusses the performance difference between v3 and v4. Appendix A:
NSS-HA Recipe includes details on how to set the NFS protocol version for the NSS-HA solution. A
consolidated sketch of the settings in items 7 through 12 appears after this list.
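As an illustration, the fragment below sketches where each of the settings above typically lives on a Red Hat-based NFS server. The device, path, interface, and server names (/dev/sdb, /mnt/xfs_data, eth2, nss-server) are placeholders, not the values used in the NSS-HA recipe; refer to Appendix A: NSS-HA Recipe for the exact commands.

    # 7) Mount XFS with the wsync option (example /etc/fstab entry)
    /dev/mapper/nss_vg-nss_lv  /mnt/xfs_data  xfs  wsync  0 0

    # 8) Export the file system with the NFS sync option (example /etc/exports entry)
    /mnt/xfs_data  *(rw,sync,no_root_squash)

    # 9) Raise the NFS daemon thread count from 8 to 256 (/etc/sysconfig/nfs)
    RPCNFSDCOUNT=256

    # 10) Change the I/O scheduler from cfq to deadline for a data disk
    echo deadline > /sys/block/sdb/queue/scheduler

    # 11) Enable jumbo frames on the 10GbE interface (example ifcfg-eth2 entry)
    MTU=9000

    # 12) Mount over NFS version 3 from a client
    mount -t nfs -o vers=3 nss-server:/mnt/xfs_data /mnt/nfs_data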
5.4. Functionality tests
The HA functionality of the solution was tested by simulating several component failures. The design of
the tests and the test results are similar to those for previous versions of the solution, since the broad
architecture has not changed with this release. A quick summary is provided in this section. For
detailed explanations, refer to the Solution Guide titled “Dell HPC NFS Storage Solution High
Availability Configurations, Version 1.1”.
Functionality was verified for both NFSv3-based and NFSv4-based solutions.
The following failures were simulated on the cluster.
1) Server failure
2) Heartbeat link failure
3) Public link failure
4) Private switch failure
5) Fence device failure
6) One SAS link failure
7) Multiple SAS link failures
This section briefly outlines the NSS-HA response to these failures. Details on how to configure the
solution to handle these failure scenarios are provided in Appendix A: NSS-HA Recipe.
Server response to a failure
The server response to a failure event within the HA cluster was recorded. Time to recover from a
failure was used as a performance metric. Time was measured from the point when the fault was
injected into the server running the HA service (active) until the service was migrated to and running
on the other server (passive).
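One simple way to capture this recovery time from a client's perspective, sketched below assuming a hard NFS mount at a hypothetical /mnt/nfs_data, is to timestamp a repeated access to the mount; the access blocks while the service fails over, so the gap between consecutive timestamps approximates the recovery time.

    # Run on an NFS client; /mnt/nfs_data is a placeholder mount point.
    while true; do
        date +%s                       # timestamp before each access
        ls /mnt/nfs_data > /dev/null   # blocks while the HA service is failing over
        sleep 1
    done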
1) Server failure - simulated by introducing a kernel panic.
When the active server fails, the heartbeat between the two servers is interrupted. The passive
server waits for a defined period of time and then attempts to fence the active server. Once
fencing is successful, the passive server takes ownership of the cluster service. Clients cannot
access the data until the failover process is complete.
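A minimal sketch of how this scenario might be exercised and observed is shown below. The panic is triggered through the standard Linux magic SysRq interface; the clustat command assumes the Red Hat cluster tooling that the NSS-HA solution is built on, and the log location is an example.

    # On the active server: force a kernel panic through the magic SysRq
    # interface (requires root and the SysRq interface to be enabled).
    echo c > /proc/sysrq-trigger

    # On the passive server: confirm that the peer is fenced and the HA
    # service starts locally.
    clustat                      # cluster and service status
    tail -f /var/log/messages    # fencing and service-start messages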