Specifications

Chapter 4. Continuous availability and manageability 111

Draft Document for Review May 12, 2014 12:46 pm 5102ch04.fm

Intermittent errors, which are often caused by cosmic rays or other sources of radiation, are

generally not repeatable.

With the instruction retry function, when an error is encountered in the core, in caches, or in

certain logic functions, the POWER8 processor first automatically retries the instruction. If the

source of the error was truly transient, the instruction succeeds and the system can continue

as before.
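
The retry behavior can be pictured with a small software model. This is only an illustrative sketch: the real function is implemented in processor hardware, and the names CoreFault, run_with_retry, and make_flaky, as well as the single-retry default, are assumptions made for this example.

```python
class CoreFault(Exception):
    """Hypothetical stand-in for an error detected in the core."""


def run_with_retry(instruction, max_retries=1):
    """Run instruction(); on a CoreFault, retry up to max_retries times.

    Returns (result, attempts). Re-raises the fault if it persists,
    which means the error was not transient.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return instruction(), attempts
        except CoreFault:
            if attempts > max_retries:
                raise  # fault persisted across retries


def make_flaky(fail_count, result):
    """Build a callable that raises CoreFault fail_count times, then returns result."""
    remaining = [fail_count]

    def instruction():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise CoreFault("simulated transient bit flip")
        return result

    return instruction
```

In this model, a fault that occurs once (make_flaky(1, 42)) is absorbed by the retry and the caller sees only the result, whereas a fault that keeps recurring is passed on for other handling.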

Alternate processor retry

Hard failures are more difficult; they are permanent errors that are replicated each time that

the instruction is repeated. Retrying the instruction does not help in this situation because the

instruction will continue to fail.

As introduced with POWER6, POWER8 processors can extract the failing instruction from the

faulty core and retry it elsewhere in the system. The failing core is then dynamically

deconfigured and scheduled for replacement.
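
The following sketch models this flow in software, purely as an illustration. The Core class, its hard_fault flag, and the execute_with_alternate_retry function are hypothetical names, and the single local retry before moving the work is an assumption; the actual mechanism is implemented by the processor and firmware.

```python
from dataclasses import dataclass


class CoreFault(Exception):
    """Hypothetical stand-in for an error detected in a core."""


@dataclass
class Core:
    core_id: int
    hard_fault: bool = False  # assumption: models a permanent defect
    configured: bool = True

    def execute(self, operand):
        if self.hard_fault:
            raise CoreFault(f"core {self.core_id} failed")
        return operand + 1  # stand-in for a real instruction


def execute_with_alternate_retry(cores, start, operand):
    """Try the instruction on cores[start]; if it keeps failing, deconfigure
    that core and retry on another configured core.

    Returns (result, id of the core that completed the instruction).
    """
    failing = cores[start]
    for _ in range(2):  # first attempt plus one local retry
        try:
            return failing.execute(operand), failing.core_id
        except CoreFault:
            pass
    failing.configured = False  # deconfigure; flag for replacement
    for core in cores:
        if core.configured:
            try:
                return core.execute(operand), core.core_id
            except CoreFault:
                core.configured = False
    raise RuntimeError("no healthy core available")
```

For example, with cores = [Core(0, hard_fault=True), Core(1)], calling execute_with_alternate_retry(cores, 0, 41) completes the work on core 1 and leaves core 0 deconfigured.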

Dynamic processor deallocation

Dynamic processor deallocation enables automatic deconfiguration of processor cores when

patterns of recoverable core-related faults are detected. Dynamic processor deallocation

prevents a recoverable error from escalating to an unrecoverable system error, which might

otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on

the service processor’s ability to use FFDC-generated recoverable error information to notify

the POWER Hypervisor when a processor core reaches its predefined error limit. The

POWER Hypervisor then dynamically deconfigures the failing core and notifies the system

administrator that a replacement is needed. The entire process is transparent to the partition

owning the failing instruction.
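
The threshold logic described above can be sketched as a per-core error counter. This is a simplified model, not the firmware implementation: the DeallocationMonitor class, its method names, and the error limit of 3 are assumptions chosen for illustration, because the text does not state the actual predefined limit.

```python
from collections import Counter


class DeallocationMonitor:
    """Toy model of threshold-based core deconfiguration."""

    def __init__(self, error_limit=3):  # the limit value is an assumption
        self.error_limit = error_limit
        self.recoverable_errors = Counter()
        self.deconfigured = set()
        self.notifications = []

    def record_recoverable_error(self, core_id):
        """Count one recoverable error for core_id.

        Returns True if this error caused the core to be deconfigured.
        """
        if core_id in self.deconfigured:
            return False  # core is already out of service
        self.recoverable_errors[core_id] += 1
        if self.recoverable_errors[core_id] >= self.error_limit:
            self.deconfigured.add(core_id)
            self.notifications.append(
                f"core {core_id} deconfigured; replacement needed"
            )
            return True
        return False
```

The point of the design is that the core is taken out of service while its errors are still recoverable, before a pattern of faults can escalate into an unrecoverable one.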

Single processor checkstop

As in the POWER6 processor, the POWER8 processor provides single-core checkstopping

for certain processor logic, command, or control errors that cannot be handled by the

availability enhancements in the preceding section.

This approach significantly reduces the probability of any one processor affecting total system

availability by containing most processor checkstops to the partition that was using the

processor at the time that the full checkstop goes into effect.
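
The containment idea can be illustrated with a minimal model in which a checkstop on one core affects only the partition that was using that core. The function name, the partition names, and the "running" and "affected" state labels are hypothetical and serve only this sketch.

```python
def single_core_checkstop(core_owner, partition_state, failed_core):
    """Return the partition states after a checkstop on failed_core.

    core_owner maps a core id to the partition using it (None if idle).
    Only the owning partition is marked "affected"; all others are unchanged.
    """
    new_state = dict(partition_state)  # leave the caller's mapping untouched
    owner = core_owner.get(failed_core)
    if owner is not None:
        new_state[owner] = "affected"
    return new_state
```

With core_owner = {0: "lparA", 1: "lparB"} and both partitions running, a checkstop on core 1 changes only the state of lparB.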

Even with all these availability enhancements to prevent processor errors from affecting

system-wide availability, errors might occur that can result in a system-wide outage.

Before POWER6: On IBM systems prior to POWER6, such an error typically caused a

checkstop