Availability Guide for Problem Management

Auditing Systems for Fault Tolerance
Availability Guide for Problem Management 125509
Configuring Your Hardware for Fault Tolerance
You can ensure that your hardware configuration is fault tolerant by performing the
following tasks (some of which can be automated) in your system environment:
Testing backup paths
Performing powerfail testing
Configuring your hardware adequately for stress periods
Using mirrored disk drives
Avoiding a system freeze
Testing Backup Paths
While the preferred path to a device is operable, the system does not use any of the other
paths to that device. If an access path fails, the system switches to the backup path. Once
the fault on the original path is cleared, you may need to force the system to
switch back to the original path.
Since backup paths normally go untested, there is no guarantee that they are functional.
There must be a means of forcing the system to use these paths for testing purposes.
You can use DIVER to stop a processor and cause all processes and all lines to switch to
the backup. Forcing the system to use the backup processor for the device may affect the
performance of the processor to which the device has been switched, particularly if that
processor is being heavily used when you issue the command.
After you have tested the other access paths successfully, you may prefer to reinstate
the original paths.
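The switching behavior described in this subsection can be sketched as a small model. The
following Python sketch is illustrative only: the Device class, its method names, and the
path names are hypothetical and do not correspond to DIVER or to any system command. It
assumes that clearing a fault does not by itself move the device back to the preferred
path, so the switch back must be forced.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical model of a device with a preferred path and backup paths."""
    name: str
    paths: list[str]                      # paths[0] is the preferred path
    failed: set[str] = field(default_factory=set)
    active: str = ""

    def __post_init__(self) -> None:
        # While the preferred path is operable, no other path is used.
        self.active = self.paths[0]

    def fail(self, path: str) -> None:
        """Mark a path as failed; if it was in use, switch to an operable one."""
        self.failed.add(path)
        if self.active == path:
            self.active = self._first_operable()

    def clear_fault(self, path: str) -> None:
        """Clear a fault. Assumption: the active path does not change by itself."""
        self.failed.discard(path)

    def force_switch(self, path: str) -> None:
        """Explicitly move the device to a named, operable path."""
        if path not in self.paths:
            raise ValueError(f"{path} is not a path to {self.name}")
        if path in self.failed:
            raise RuntimeError(f"{path} is not operable")
        self.active = path

    def _first_operable(self) -> str:
        for path in self.paths:
            if path not in self.failed:
                return path
        raise RuntimeError(f"no operable path to {self.name}")


def exercise_backup_paths(dev: Device) -> list[str]:
    """Force the device onto each backup path in turn, then reinstate the preferred path."""
    exercised = []
    for path in dev.paths[1:]:
        dev.force_switch(path)
        exercised.append(dev.active)
    dev.force_switch(dev.paths[0])
    return exercised
```

In this model, a backup path that has never carried traffic is exercised only when
exercise_backup_paths is called, which mirrors the reason for forcing a switch during
testing rather than waiting for a real failure.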
Performing Powerfail Testing
Powerfail testing allows you to evaluate the ability of the hardware and software to
recover correctly from a loss of power to one or more components of the system or to
the entire system.
What Happens During a Power Failure
Between the power supply and the processor is a large capacitance. When the power
fails, this capacitance discharges, giving the processor enough time to shut down its
environment in a controlled manner.
The processor’s battery keeps the contents of memory intact for a maximum of
approximately six hours, but does not have enough power to run the processors and
peripherals. If power is restored before the battery drains, the Power On interrupt code
Note. These tests should not be done at the busiest time, nor when the system is inactive.
Either extreme would result in a test scenario that does not represent normal system activity.
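One way to apply the preceding note is to pick test hours from recorded load figures. The
sketch below is a minimal illustration under stated assumptions: the function name, the
input format (hour of day mapped to a utilization figure), and the 10% and 90% thresholds
are all hypothetical and should be adjusted to your own environment.

```python
def representative_hours(load_by_hour, idle_below=0.10, busy_above=0.90):
    """Return the hours whose load is neither near idle nor near the observed peak.

    load_by_hour maps an hour (0-23) to a utilization figure in any consistent unit.
    The thresholds are fractions of the observed peak and are assumptions.
    """
    peak = max(load_by_hour.values())
    if peak == 0:
        return []  # the system was inactive in every sample; no representative hour
    return sorted(
        hour
        for hour, load in load_by_hour.items()
        if idle_below < load / peak < busy_above
    )


# Hypothetical samples: overnight idle, mid-morning, afternoon peak, evening.
samples = {2: 0.02, 10: 0.55, 14: 0.95, 20: 0.40}
candidate_hours = representative_hours(samples)
```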
Note. It is important to know your primary and backup paths. A system configuration diagram
can give you this information. Section 2, Preventing Unplanned Outages, provides
information about system configuration diagrams.