Specifications
Chapter 4. Continuous availability and manageability 111
Draft Document for Review May 12, 2014 12:46 pm 5102ch04.fm
Intermittent errors, which are often caused by cosmic rays or other sources of radiation, are
generally not repeatable.
With the instruction retry function, when an error is encountered in the core, in caches, or in
certain logic functions, the POWER8 processor first automatically retries the instruction. If the
source of the error was truly transient, the retried instruction succeeds and the system
continues as before.
Alternate processor retry
Hard failures are more difficult to handle; they are permanent errors that recur each time that
the instruction is repeated. Retrying the instruction on the same core does not help in this
situation because the instruction continues to fail.
As introduced with POWER6, POWER8 processors can extract the failing instruction from the
faulty core and retry it elsewhere in the system. The failing core is then dynamically
deconfigured and scheduled for replacement.
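The two recovery stages described above (instruction retry, then alternate processor retry) can be illustrated with a small software model. This is only a conceptual sketch, not IBM firmware or hypervisor code: the `Core` class, the `CoreFault` exception, the `run_with_recovery` function, and the choice of a single retry on the home core are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class CoreFault(Exception):
    """Raised by this hypothetical model when a core fails to execute an instruction."""


@dataclass
class Core:
    core_id: int
    transient_faults: int = 0   # number of upcoming executions that fail before the fault clears
    hard_fault: bool = False    # permanent fault: every execution fails
    deconfigured: bool = False

    def execute(self, instruction: Callable[[], int]) -> int:
        if self.hard_fault:
            raise CoreFault(f"hard fault on core {self.core_id}")
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise CoreFault(f"transient fault on core {self.core_id}")
        return instruction()


def run_with_recovery(instruction: Callable[[], int],
                      cores: List[Core],
                      home_index: int = 0) -> Tuple[int, int]:
    """Return (result, id of the core that completed the instruction)."""
    home = cores[home_index]

    # Stage 1: first attempt plus one instruction retry on the same core
    # (the retry count of one is an assumption of this sketch).
    for _ in range(2):
        try:
            return home.execute(instruction), home.core_id
        except CoreFault:
            continue

    # Stage 2: the fault persisted, so treat it as hard. Deconfigure the
    # failing core and retry the instruction on another core.
    home.deconfigured = True
    for core in cores:
        if core is home or core.deconfigured:
            continue
        try:
            return core.execute(instruction), core.core_id
        except CoreFault:
            continue
    raise CoreFault("no healthy core available")
```

In this model, a core with one pending transient fault completes the instruction on the retry and stays configured, whereas a core with a hard fault is deconfigured and the instruction completes on a different core.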
Dynamic processor deallocation
Dynamic processor deallocation enables automatic deconfiguration of processor cores when
patterns of recoverable core-related faults are detected. Dynamic processor deallocation
prevents a recoverable error from escalating to an unrecoverable system error, which might
otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on
the service processor’s ability to use FFDC-generated recoverable error information to notify
the POWER Hypervisor when a processor core reaches its predefined error limit. The
POWER Hypervisor then dynamically deconfigures the failing core and notifies the system
administrator that a replacement is needed. The entire process is transparent to the partition
owning the failing instruction.
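The threshold behavior described above can be sketched as a per-core counter of recoverable errors. This is a minimal illustration under stated assumptions, not the actual service processor or POWER Hypervisor logic: the `DeallocationMonitor` class, its method names, and the numeric error limit are hypothetical, and the real predefined limits are not given in this text.

```python
from collections import defaultdict
from typing import Callable, Set


class DeallocationMonitor:
    """Hypothetical model: counts recoverable errors per core and
    deconfigures a core once it reaches a predefined error limit."""

    def __init__(self, error_limit: int, notify: Callable[[int], None]):
        self.error_limit = error_limit      # assumed value; real limits are not stated here
        self.notify = notify                # stands in for the administrator notification
        self.counts = defaultdict(int)
        self.deconfigured: Set[int] = set()

    def record_recoverable_error(self, core_id: int) -> bool:
        """Record one recoverable error. Return True if this error
        caused the core to be deconfigured."""
        if core_id in self.deconfigured:
            return False
        self.counts[core_id] += 1
        if self.counts[core_id] >= self.error_limit:
            # Act while the errors are still recoverable, before they can
            # escalate to an unrecoverable system error.
            self.deconfigured.add(core_id)
            self.notify(core_id)
            return True
        return False
```

For example, with `error_limit=3`, the first two recoverable errors on a core are only counted; the third deconfigures the core and triggers the notification callback, and errors on other cores are tracked independently.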
Single processor checkstop
As in the POWER6 processor, the POWER8 processor provides single core check-stopping
for certain processor logic, command, or control errors that cannot be handled by the
availability enhancements in the preceding section.
This approach significantly reduces the probability of any one processor affecting total system
availability by containing most processor checkstops to the partition that was using the
processor at the time that full checkstop goes into effect.
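The containment idea can be made concrete by comparing which partitions are affected when a checkstop is contained to a single core with which are affected when it is system-wide. The sketch below is purely illustrative: the `partitions_affected` function, the core-to-partition mapping, and the partition names are assumptions, not part of any IBM interface.

```python
from typing import Dict, Set


def partitions_affected(checkstopped_core: int,
                        core_owner: Dict[int, str],
                        contained: bool) -> Set[str]:
    """Return the partitions that go down when a core checkstops.

    core_owner maps a core id to the partition using that core
    (a hypothetical mapping for this sketch).
    """
    if contained:
        # Single core checkstop: only the partition using the core is affected.
        owner = core_owner.get(checkstopped_core)
        return {owner} if owner is not None else set()
    # System-wide checkstop: every partition is affected.
    return set(core_owner.values())


core_owner = {0: "lpar_a", 1: "lpar_a", 2: "lpar_b", 3: "lpar_c"}
contained_impact = partitions_affected(2, core_owner, contained=True)
system_impact = partitions_affected(2, core_owner, contained=False)
```

With this mapping, a contained checkstop on core 2 affects only `lpar_b`, while a system-wide checkstop affects all three partitions.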
Even with all these availability enhancements to prevent processor errors from affecting
system-wide availability, errors might occur that can result in a system-wide outage.
Before POWER6: On IBM systems prior to POWER6, such an error typically caused a
checkstop