NFS Server configuration
7) The XFS file system is mounted with the wsync option.
8) The XFS file system is exported using the NFS sync option.
9) The number of concurrent NFS threads is increased from the default of 8 to 256 on the NFS servers.
10) The default OS I/O scheduler is changed from cfq to deadline.
11) MTU is set to 9000 on the 10 Gigabit Ethernet networks.
12) The default NFS protocol used to export the XFS file system is configured to be version 3.
Performance analysis comparing NFSv3 and NFSv4 showed NFSv3 to perform significantly better in
some cases. If the security enhancements of NFSv4 are preferred, it can be used at some cost in
performance. Section 6.5 discusses the performance difference between v3 and v4. Appendix A:
NSS-HA Recipe includes details on how to set the NFS protocol version for the NSS-HA solution. A
consolidated sketch of the settings in items 7 through 12 appears after this list.
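As an illustration, the fragment below sketches where each of the settings above typically lives on a Red Hat-based NFS server. The device, path, interface, and server names (/dev/sdb, /mnt/xfs_data, eth2, nss-server) are placeholders, not the values used in the NSS-HA recipe; refer to Appendix A: NSS-HA Recipe for the exact commands.

    # 7) Mount XFS with the wsync option (example /etc/fstab entry)
    /dev/mapper/nss_vg-nss_lv  /mnt/xfs_data  xfs  wsync  0 0

    # 8) Export the file system with the NFS sync option (example /etc/exports entry)
    /mnt/xfs_data  *(rw,sync,no_root_squash)

    # 9) Raise the NFS daemon thread count from 8 to 256 (/etc/sysconfig/nfs)
    RPCNFSDCOUNT=256

    # 10) Change the I/O scheduler from cfq to deadline for a data disk
    echo deadline > /sys/block/sdb/queue/scheduler

    # 11) Enable jumbo frames on the 10GbE interface (example ifcfg-eth2 entry)
    MTU=9000

    # 12) Mount over NFS version 3 from a client
    mount -t nfs -o vers=3 nss-server:/mnt/xfs_data /mnt/nfs_data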
5.4. Functionality tests
The HA functionality of the solution was tested by simulating several component failures. The design of
the tests and the test results are similar to those for previous versions of the solution, since the broad
architecture has not changed with this release. A quick summary is provided in this section. For
detailed explanations, refer to the Solution Guide titled “Dell HPC NFS Storage Solution High
Availability Configurations, Version 1.1”.
Functionality was verified for both NFSv3-based and NFSv4-based solutions.
The following failures were simulated on the cluster.
1) Server failure
2) Heartbeat link failure
3) Public link failure
4) Private switch failure
5) Fence device failure
6) One SAS link failure
7) Multiple SAS link failures
This section briefly outlines the NSS-HA response to these failures. Details on how to configure the
solution to handle these failure scenarios are provided in Appendix A: NSS-HA Recipe.
Server response to a failure
The server response to a failure event within the HA cluster was recorded. Time to recover from a
failure was used as a performance metric. Time was measured from the point when the fault was
injected into the server running the HA service (active) until the service was migrated to and running
on the other server (passive).
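One simple way to capture this recovery time from a client's perspective, sketched below assuming a hard NFS mount at a hypothetical /mnt/nfs_data, is to timestamp a repeated access to the mount; the access blocks while the service fails over, so the gap between consecutive timestamps approximates the recovery time.

    # Run on an NFS client; /mnt/nfs_data is a placeholder mount point.
    while true; do
        date +%s                       # timestamp before each access
        ls /mnt/nfs_data > /dev/null   # blocks while the HA service is failing over
        sleep 1
    done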
1) Server failure - simulated by introducing a kernel panic.
When the active server fails, the heartbeat between the two servers is interrupted. The passive
server waits for a defined period of time and then attempts to fence the active server. Once
fencing is successful, the passive server takes ownership of the cluster service. Clients cannot
access the data until the failover process is complete.
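A minimal sketch of how this scenario might be exercised and observed is shown below. The panic is triggered through the standard Linux magic SysRq interface; the clustat command assumes the Red Hat cluster tooling that the NSS-HA solution is built on, and the log location is an example.

    # On the active server: force a kernel panic through the magic SysRq
    # interface (requires root and the SysRq interface to be enabled).
    echo c > /proc/sysrq-trigger

    # On the passive server: confirm that the peer is fenced and the HA
    # service starts locally.
    clustat                      # cluster and service status
    tail -f /var/log/messages    # fencing and service-start messages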