Chapter 4. Continuous availability and manageability 115
Draft Document for Review May 12, 2014 12:46 pm 5102ch04.fm
The L2 and L3 caches of the POWER8 processor-based systems can hold an unmodified
copy of data that resides in a portion of main memory. In this case, an uncorrectable
error in the cache simply triggers a reload of the cache line from main memory.
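The reload path for unmodified cache lines can be illustrated with a small model. The following Python sketch is illustrative only and does not represent POWER8 hardware or firmware logic. The names `CacheLine` and `handle_uncorrectable_error`, and the dictionary that stands in for main memory, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    address: int
    data: bytes
    modified: bool  # True if the line differs from main memory (hypothetical flag)


def handle_uncorrectable_error(line: CacheLine, main_memory: dict) -> str:
    """Return the recovery action taken for an uncorrectable cache error."""
    if not line.modified:
        # Unmodified line: main memory still holds a good copy, so reload it.
        line.data = main_memory[line.address]
        return "reloaded"
    # Modified line: no other good copy exists, so the data cannot be
    # recovered by a reload and needs separate uncorrectable-error handling.
    return "sue"


memory = {0x1000: b"good"}
clean = CacheLine(0x1000, b"bad!", modified=False)
print(handle_uncorrectable_error(clean, memory), clean.data)
```

In this model, only a line that matches main memory can be repaired by a reload. A modified line has no second copy to fall back on.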
In cases where the data cannot be recovered from another source, a technique named
Special Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in
memory or cache from immediately causing the system to terminate. That is, the system tags
the data and determines whether it will ever be used again:
- If the error is irrelevant (the data is never used again), SUE handling does not force a
  checkstop.
- If the data is used, termination can be limited to the program, kernel, or hypervisor that
  owns the data, or the I/O adapters that are controlled by an I/O hub controller can be
  frozen if the data is going to be transferred to an I/O device.
When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the “standard” ECC is no longer valid. The
service processor is then notified and takes appropriate actions. On a system running AIX 5.2
or later, or Linux, when a process attempts to use the data, the operating system is informed
of the error. The operating system might then terminate, or it might terminate only the
specific process that is associated with the corrupt data, depending on the operating system
and firmware level and on whether the data was associated with a kernel or non-kernel process.
Only in the case where the corrupt data is used by the POWER Hypervisor must the entire
system be rebooted, thereby preserving overall system integrity.
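The containment outcomes described above can be summarized as a mapping from the consumer of the tagged data to the scope of the resulting action. The following Python sketch is a simplified illustration, not firmware or operating system code. The `Consumer` enumeration, the `sue_action` function, and the action strings are hypothetical, and the real outcome also depends on the operating system and firmware level.

```python
from enum import Enum
from typing import Optional


class Consumer(Enum):
    USER_PROCESS = "non-kernel process"
    KERNEL = "operating system kernel"
    HYPERVISOR = "hypervisor"
    IO_TRANSFER = "transfer to an I/O device"


def sue_action(consumer: Optional[Consumer]) -> str:
    """Map the consumer of SUE-tagged data to a containment action."""
    if consumer is None:
        # The tagged data is never used again, so no checkstop is forced.
        return "no action"
    if consumer is Consumer.USER_PROCESS:
        return "terminate the process"
    if consumer is Consumer.KERNEL:
        return "terminate the operating system"
    if consumer is Consumer.IO_TRANSFER:
        return "freeze the I/O adapters"
    # Only corrupt data used by the hypervisor requires a full reboot.
    return "reboot the system"


for who in (None, *Consumer):
    print(who, "->", sue_action(who))
```

The point of the model is that the scope of the outage grows with the scope of the owner of the data: an unused error costs nothing, and only hypervisor use affects the whole system.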
Depending on system configuration and the source of the data, errors encountered during I/O
operations might not result in a machine check. Instead, the incorrect data is handled by the
processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing data from being written to the I/O device.
The PHB then enters a freeze mode, halting normal operations. Depending on the model and
type of I/O being used, the freeze might include the entire PHB chip, or simply a single bridge,
resulting in the loss of all I/O operations that use the frozen hardware until a power-on reset
of the PHB is performed. The impact on partitions depends on how the I/O is configured for
redundancy. In a server configured for failover availability, redundant adapters spanning
multiple PHB chips can enable the system to recover transparently, without partition loss.
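The effect of redundancy on a PHB freeze can be sketched with a simple model. The following Python example is an assumption-laden illustration: the function `partitions_losing_io`, the partition names, and the PHB identifiers are all hypothetical, and the model treats a freeze as covering an entire PHB chip.

```python
def partitions_losing_io(partition_adapters, frozen_phbs):
    """Return the sorted names of partitions that have no usable I/O path.

    partition_adapters maps a partition name to the list of PHB chips that
    host its adapters. A partition with an empty list is treated as having
    no I/O path.
    """
    lost = []
    for name, phbs in partition_adapters.items():
        # A partition keeps I/O if at least one adapter sits on an unfrozen PHB.
        if all(phb in frozen_phbs for phb in phbs):
            lost.append(name)
    return sorted(lost)


config = {
    "lpar1": ["phb0", "phb1"],  # redundant adapters that span two PHB chips
    "lpar2": ["phb0"],          # single adapter, no redundancy
}
print(partitions_losing_io(config, {"phb0"}))  # prints ['lpar2']
```

In this model, freezing `phb0` leaves `lpar1` with a working path through `phb1`, while `lpar2` loses I/O until the PHB is reset.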
4.2.6 PCI Enhanced Error Handling
IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Although servers that rely on boot-time diagnostics can identify
failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a
more significant problem.
PCI adapters are generally complex designs involving extensive on-board instruction
processing, often on embedded microcontrollers. They tend to use industry-standard-grade
components, with an emphasis on product cost over high reliability. In certain cases, they
might be more likely to encounter internal microcode errors or many of the hardware errors
described for the rest of the server.
The traditional means of handling these problems is through adapter internal error reporting
and recovery techniques in combination with operating system device driver management
and diagnostics. In certain cases, an error in the adapter might cause transmission of bad
data on the PCI bus itself, resulting in a hardware-detected parity error and causing a global
machine-check interrupt, eventually requiring a system reboot to continue.