Specifications

Chapter 4. Continuous availability and manageability 115

Draft Document for Review May 12, 2014 12:46 pm 5102ch04.fm

• The L2 and L3 cache of the POWER8 processor-based systems can hold an unmodified
copy of data in a portion of main memory. In this case, an uncorrectable error simply
triggers a reload of a cache line from main memory.

In cases where the data cannot be recovered from another source, a technique named
Special Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in
memory or cache from immediately causing the system to terminate. That is, the system tags
the data and determines whether it will ever be used again:

• If the error is irrelevant (the data is never used again), SUE does not force a checkstop.

• If the data is used, termination can be limited to the program, kernel, or hypervisor that
owns the data. If the data is about to be transferred to an I/O device, the I/O adapters that
are controlled by an I/O hub controller can be frozen instead.

When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the “standard” ECC is no longer valid. The
service processor is then notified and takes appropriate actions. When running AIX 5.2 or
later, or Linux, and a process attempts to use the data, the operating system is informed of
the error. The operating system might then terminate, or might terminate only the specific
process that is associated with the corrupt data, depending on the operating system and
firmware level and on whether the data was associated with a kernel or non-kernel process.

Only in the case where the corrupt data is used by the POWER Hypervisor must the entire
system be rebooted, thereby preserving overall system integrity.
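The SUE containment rules described above can be summarized as a small decision function. The following Python sketch is illustrative only: the names `Consumer`, `Action`, and `sue_action` are hypothetical and are not part of any IBM firmware or operating system interface. It also assumes the most favorable case for each owner, whereas the actual outcome for a process depends on the operating system and firmware level.

```python
from enum import Enum, auto
from typing import Optional


class Consumer(Enum):
    """Hypothetical categories for who consumes SUE-tagged data."""
    USER_PROCESS = auto()   # non-kernel process
    KERNEL = auto()         # operating system kernel
    HYPERVISOR = auto()     # POWER Hypervisor
    IO_TRANSFER = auto()    # data about to be sent to an I/O device


class Action(Enum):
    """Hypothetical containment actions, from least to most disruptive."""
    NO_ACTION = auto()
    TERMINATE_PROCESS = auto()
    TERMINATE_OS = auto()
    FREEZE_IO_ADAPTERS = auto()
    REBOOT_SYSTEM = auto()


def sue_action(consumer: Optional[Consumer]) -> Action:
    """Map the consumer of tagged data to a containment action.

    A consumer of None means the tagged data is never used again,
    so the error is irrelevant and no checkstop is forced.
    """
    if consumer is None:
        return Action.NO_ACTION
    # Assumed best-case mapping; real behavior varies by OS and firmware level.
    mapping = {
        Consumer.USER_PROCESS: Action.TERMINATE_PROCESS,
        Consumer.KERNEL: Action.TERMINATE_OS,
        Consumer.IO_TRANSFER: Action.FREEZE_IO_ADAPTERS,
        Consumer.HYPERVISOR: Action.REBOOT_SYSTEM,
    }
    return mapping[consumer]
```

The point of the sketch is that the error is acted on only when the data is consumed, and that the scope of the action follows the owner of the data rather than the location of the fault.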

Depending on system configuration and the source of the data, errors encountered during I/O
operations might not result in a machine check. Instead, the incorrect data is handled by the
processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing data from being written to the I/O device.

The PHB then enters a freeze mode, halting normal operations. Depending on the model and
type of I/O being used, the freeze might include the entire PHB chip, or simply a single bridge.
All I/O operations that use the frozen hardware are lost until a power-on reset of the PHB is
done. The impact on partitions depends on how the I/O is configured for redundancy. In a
server configured for failover availability, redundant adapters spanning multiple PHB chips
can enable the system to recover transparently, without partition loss.
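The freeze behavior and the benefit of spreading redundant adapters across PHB chips can be illustrated with a toy model. This Python sketch is a simplification under stated assumptions: the class `Phb` and the function `partition_io_available` are hypothetical names, the model freezes the whole PHB (not a single bridge), and it does not represent real firmware behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phb:
    """Toy model of a processor host bridge (hypothetical, whole-chip freeze only)."""
    name: str
    frozen: bool = False

    def transfer(self, data_is_valid: bool) -> bool:
        """Attempt an I/O transfer; return True only if data reaches the device."""
        if self.frozen:
            # A frozen PHB performs no I/O until it is reset.
            return False
        if not data_is_valid:
            # Bad data is rejected and the PHB enters freeze mode.
            self.frozen = True
            return False
        return True

    def power_on_reset(self) -> None:
        """Clear the freeze so that normal operations can resume."""
        self.frozen = False


def partition_io_available(adapter_phbs: List[Phb]) -> bool:
    """A partition keeps I/O if at least one of its adapters sits behind an unfrozen PHB."""
    return any(not phb.frozen for phb in adapter_phbs)
```

In this model, a partition whose redundant adapters sit behind two different `Phb` instances keeps I/O when one of them freezes, whereas a partition with all adapters behind one PHB loses I/O until `power_on_reset` is called.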

4.2.6 PCI Enhanced Error Handling

IBM estimates that PCI adapters can account for a significant portion of the hardware-based

errors on a large server. Although servers that rely on boot-time diagnostics can identify

failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a