Specifications

Chapter 4. Continuous availability and manageability 119
Draft Document for Review May 12, 2014 12:46 pm 5102ch04.fm
• Fault monitoring
Built-in self-test (BIST) checks the processor, cache, memory, and associated hardware
that is required to boot the operating system. BIST runs when the system is powered on at
the initial installation or after a hardware configuration change (for example, an upgrade).
If a noncritical error is detected, or if the error occurs in a resource that can be removed
from the system configuration, the boot process is designed to proceed to completion.
The errors are logged in the system nonvolatile random access memory (NVRAM). When
the operating system completes booting, the information is passed from the NVRAM to the
system error log, where it is analyzed by error log analysis (ELA) routines. Appropriate
actions are then taken to report the boot-time error for subsequent service, if required.
• Concurrent access to the service processor menus of the ASMI
This access allows nondisruptive changes to system default parameters, interrogation of
service processor progress and error logs, and setting and resetting of server indicators
(Guiding Light for midrange and high-end servers, Light Path for low-end servers). All
service processor functions are accessible without having to power down the system to the
standby state. The administrator or service representative can dynamically access the
menus from any web browser-enabled console that is attached to the Ethernet service
network, concurrently with normal system operation.
• Managing the interfaces for connecting uninterruptible power source systems to the
POWER processor-based systems, performing timed power-on (TPO) sequences, and
interfacing with the power and cooling subsystem
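The boot-time flow described under "Fault monitoring" in the list above can be sketched in
software. This is a minimal illustrative model, not IBM firmware: the class names, the
resource names, and the list objects that stand in for NVRAM and the system error log are
all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BootError:
    resource: str         # hypothetical resource name, for example "dimm3"
    critical: bool        # True if the system cannot run without this resource
    deconfigurable: bool  # True if the resource can be removed from the configuration


@dataclass
class System:
    nvram: list = field(default_factory=list)       # stands in for NVRAM
    error_log: list = field(default_factory=list)   # stands in for the system error log
    deconfigured: set = field(default_factory=set)  # resources dropped during boot

    def boot(self, bist_findings):
        """Log each BIST finding to NVRAM; return True if boot can complete."""
        for err in bist_findings:
            self.nvram.append(err)
            if err.deconfigurable:
                self.deconfigured.add(err.resource)  # boot continues without it
            elif err.critical:
                return False                         # cannot be worked around
        return True

    def drain_nvram(self):
        """After the OS is up, pass boot-time errors to the system error log."""
        self.error_log.extend(self.nvram)
        self.nvram.clear()

    def analyze(self):
        """Toy stand-in for ELA: list the resources that need service."""
        return sorted({err.resource for err in self.error_log})
```

In this model, a boot with one deconfigurable memory fault and one noncritical fan fault
completes, and both resources appear in the analysis after `drain_nvram()` is called.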
Error checkers
IBM POWER processor-based systems contain specialized hardware detection circuitry that
is used to detect erroneous hardware operations. Error checking hardware ranges from parity
error detection coupled with processor instruction retry and bus retry, to ECC correction on
caches and system buses.
All IBM hardware error checkers have distinct attributes:
• Continuous monitoring of system operations to detect potential calculation errors.
• Attempts to isolate physical faults based on runtime detection of each unique failure.
• Ability to initiate a wide variety of recovery mechanisms designed to correct the problem.
The POWER processor-based systems include extensive hardware and firmware
recovery logic.
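As an illustration of parity error detection coupled with retry, the following sketch models
a data transfer that carries one even-parity bit and is retried when the check fails. It is a
software analogy only. The function names, the retry limit, and the simulated bus are
assumptions for the example and do not describe the actual POWER hardware.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the number of set bits in word is odd."""
    return bin(word).count("1") & 1


def checked_read(read_fn, max_retries=1):
    """Call read_fn() -> (data, stored_parity); retry on a parity mismatch.

    Returns (data, attempts). Raises RuntimeError if the error persists.
    """
    for attempt in range(1, max_retries + 2):
        data, stored = read_fn()
        if parity(data) == stored:
            return data, attempt
    raise RuntimeError("persistent parity error")


def make_flaky_bus(value, bad_reads):
    """Hypothetical bus whose first bad_reads transfers flip bit 0 of the data."""
    state = {"reads": 0}
    good_parity = parity(value)

    def read():
        state["reads"] += 1
        data = value ^ 1 if state["reads"] <= bad_reads else value
        return data, good_parity

    return read
```

A single transient bit flip is detected on the first transfer and cleared by the retry. A
fault that outlasts the retry budget is raised to the caller, which is the point at which a
real system would turn to other recovery mechanisms.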
Fault isolation registers
Error-checker signals are captured and stored in hardware fault isolation registers (FIRs).
The associated logic circuitry is used to limit the domain of an error to the first checker that
encounters the error. In this way, runtime error diagnostics can be deterministic so that for
every check station, the unique error domain for that checker is defined and documented.
Ultimately, the error domain becomes the field-replaceable unit (FRU) call, and manual
interpretation of the data is not normally required.
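The idea of limiting the error domain to the first checker that fires, and of mapping that
domain to a FRU call, can be modeled as follows. This is a simplified software sketch. Real
FIRs are hardware registers, and the checker names and FRU labels here are hypothetical.

```python
class FaultIsolationRegister:
    """Toy FIR: records every checker that fired and remembers the first one."""

    def __init__(self, fru_by_checker):
        # Mapping of checker name -> FRU, defined ahead of time (hypothetical names).
        self.fru_by_checker = dict(fru_by_checker)
        self.bits = set()   # all checkers that have signaled
        self.first = None   # the first checker to encounter the error

    def signal(self, checker):
        if checker not in self.fru_by_checker:
            raise KeyError(checker)
        self.bits.add(checker)
        if self.first is None:
            self.first = checker  # later signals do not overwrite the first

    def fru_call(self):
        """Return the FRU for the first checker, or None if nothing fired."""
        if self.first is None:
            return None
        return self.fru_by_checker[self.first]
```

Because the checker-to-FRU mapping is fixed in advance, the FRU call follows directly from
the first captured signal, even when the error propagates and trips other checkers later.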
First-failure data capture
First-failure data capture (FFDC) is an error isolation technique. It ensures that when a fault is
detected in a system through error checkers or other types of detection methods, the root
cause of the fault will be captured without the need to re-create the problem or run an
extended tracing or diagnostics program.
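A minimal sketch of the first-failure principle follows: context is recorded continuously,
and a snapshot is frozen at the moment of the first detected fault so that the problem does
not need to be re-created. The rolling history buffer and all names are assumptions for the
example, not a description of how the hardware collects FFDC data.

```python
from collections import deque


class FirstFailureCapture:
    def __init__(self, history=4):
        self.recent = deque(maxlen=history)  # rolling context; oldest entries drop off
        self.snapshot = None                 # frozen at the first detected fault

    def record(self, event):
        """Record routine context as the system runs."""
        self.recent.append(event)

    def fault(self, description):
        """Capture data for the first fault only; later faults leave it untouched."""
        if self.snapshot is None:
            self.snapshot = {"fault": description, "context": list(self.recent)}
        return self.snapshot
```

Later faults return the original snapshot unchanged, so secondary effects do not overwrite
the data that points to the root cause.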
For the vast majority of faults, a good FFDC design means that the root cause is detected
automatically without intervention by a service representative. Pertinent error data related to
the fault is captured and saved for analysis. In hardware, FFDC data is collected from the fault