Specifications

5102ch04.fm Draft Document for Review May 12, 2014 12:46 pm
118 IBM Power System S822 Technical Overview and Introduction
(transient) errors in the processor core. Soft failures in the processor core are transient
(intermittent) errors, often caused by cosmic rays or other sources of radiation, and
generally are not repeatable. When an error is encountered in the core, the POWER8
processor first automatically retries the instruction. If the source of the error was truly
transient, the instruction succeeds and the system continues as before. On IBM systems
before POWER6, this error caused a checkstop.
Hard failures are more difficult; they are true logical errors that are replicated each time the
instruction is repeated. Retrying the instruction does not help in this situation. As in
POWER6, POWER6+, POWER7, and POWER7+, all POWER8 processors can extract the
failing instruction from the faulty core and retry it elsewhere in the system for several faults,
after which the failing core is dynamically deconfigured and called out for replacement.
These systems are designed to avoid a full system outage.
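The escalation described above (retry on the same core, then retry elsewhere, then deconfigure the core after repeated faults) can be sketched as a small simulation. This is only an illustrative model, not the processor's actual logic: the `Core` class, the `run` callback, the returned status strings, and the `FAULT_THRESHOLD` value are all assumptions made for this example, because the text says only "several faults".

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical model of a processor core for this sketch."""
    core_id: int
    fault_count: int = 0
    configured: bool = True


# Assumed value; the text says only "several faults".
FAULT_THRESHOLD = 3


def execute_with_recovery(instruction, core, spare, run):
    """Run `instruction` on `core`, escalating through the recovery steps.

    `run(instruction, core)` is a caller-supplied function that returns
    True on success and False when an error is detected.
    Returns a string naming the step that completed the instruction.
    """
    # Step 1: normal execution.
    if run(instruction, core):
        return "ok"
    # Step 2: automatic retry on the same core (covers soft errors).
    if run(instruction, core):
        return "retried"
    # Step 3: treat as a hard fault and count it against the core.
    core.fault_count += 1
    if core.fault_count >= FAULT_THRESHOLD:
        # Too many hard faults: take the core out of the configuration.
        core.configured = False
    # Retry the failing instruction on another core.
    if run(instruction, spare):
        return "moved"
    # What happens past this point is outside the scope of this sketch.
    return "unrecovered"
```

A soft error ends at the "retried" step and leaves the fault counter untouched, whereas a core that fails the same instruction repeatedly accumulates faults until it is marked as deconfigured.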
• Uncorrectable error recovery
The auto-restart (reboot) option, when enabled, can reboot the system automatically
following an unrecoverable firmware error, firmware hang, hardware failure, or
environmentally induced (AC power) failure.
The auto-restart (reboot) option must be enabled from the Advanced System Management
Interface (ASMI) or from the Control (Operator) Panel.
• Partition availability priority
Availability priorities can be assigned to partitions. If an alternate processor recovery event
requires spare processor resources to protect a workload, when no other means of
obtaining the spare resources is available, the system determines which partition has the
lowest priority and attempts to claim the needed resource. On a properly configured
POWER8 processor-based server, this approach allows that capacity to be first obtained from,
for example, a test partition instead of a financial accounting system.
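The selection rule in this item can be illustrated with a minimal sketch. The `Partition` class, its fields, and the convention that a smaller number means a lower availability priority are assumptions for this example; the real selection is made by the system firmware and is not specified in this text.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical partition record for this sketch."""
    name: str
    priority: int  # assumed convention: smaller value = lower priority
    cores: int     # cores currently assigned to the partition


def claim_core(partitions):
    """Take one core from the lowest-priority partition that has one.

    Returns the donor partition, or None if no partition can give up a core.
    """
    donors = [p for p in partitions if p.cores > 0]
    if not donors:
        return None
    donor = min(donors, key=lambda p: p.priority)
    donor.cores -= 1
    return donor
```

With a test partition at priority 10 and a financial accounting partition at priority 200, `claim_core` takes the core from the test partition and leaves the accounting partition untouched.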
• POWER8 cache availability
The L2 and L3 caches in the POWER8 processor are protected with double-bit detect,
single-bit correct error correction code (ECC). In addition, the caches maintain a cache line
delete capability. A threshold of correctable errors detected on a cache line can result in
the data in the cache line being purged and the cache line removed from further operation
without requiring a reboot. An ECC uncorrectable error detected in the cache can also
trigger a purge and delete operation of the cache line. This step results in no loss of
operation if the cache line contained data that is unmodified from what was stored in
system memory. Modified data would be handled through Special Uncorrectable Error
handling. L1 data and instruction caches also have a retry capability for intermittent errors
and a cache set delete mechanism for handling solid failures. In addition, the POWER8
processors can dynamically substitute a faulty bit-line in an L3 cache dedicated to a
processor with a spare bit-line.
For soft errors in caches, key design elements include ECC in the L2 and L3 caches, plus a
retry mechanism to handle L1 cache faults.
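To make "double-bit detect, single-bit correct" concrete, the following toy example implements an extended Hamming (8,4) code, which is a standard textbook SEC-DED code. It is not the code used in the POWER8 caches, whose word width and layout are not described in this text; the function names and bit layout are choices made for this sketch.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SEC-DED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, 1..7 = Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity over bits 1..7
    return code


def decode(code):
    """Return (status, nibble).

    status is 'ok', 'corrected' (single-bit error fixed), or
    'uncorrectable' (double-bit error detected; nibble is None).
    """
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos    # XOR of the positions of all set bits
    overall = sum(code) % 2    # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword still decodes to the original data, while flipping any two bits is reported as uncorrectable instead of being silently miscorrected.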
For some persistent errors in the processor core, alternate processor recovery allows a
workload running on one core to be migrated to another core without taking any
applications down in the process. This technique requires the cooperation of the
PowerVM hypervisor, but with proper virtualization and sufficient spare capacity, it can
be transparent to operating systems and applications.
Persistent recoverable errors in the L1, L2, and L3 caches can be handled first by removing
from use the portion of the cache that contains the error.
Even when uncorrectable errors occur in caches and interrupt an application or other
code, cache line-delete can prevent repeat faults without needing to replace any
hardware.
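The cache line-delete policy can be pictured as simple bookkeeping: count correctable errors per line, and remove a line from use once a threshold is reached or as soon as an uncorrectable error is seen. The sketch below is a hypothetical software model of that idea; the `CacheLineTracker` class and the default threshold of 3 are assumptions, because the text does not state the actual threshold or how the hardware tracks errors.

```python
class CacheLineTracker:
    """Hypothetical bookkeeping for errors reported against cache lines."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value; not given in the text
        self.ce_counts = {}         # line -> correctable error count
        self.deleted = set()        # lines removed from further use

    def report(self, line, correctable=True):
        """Record an error on `line`; return True if the line is deleted."""
        if line in self.deleted:
            return True
        if not correctable:
            # An uncorrectable error triggers purge and delete immediately.
            self.deleted.add(line)
            return True
        self.ce_counts[line] = self.ce_counts.get(line, 0) + 1
        if self.ce_counts[line] >= self.threshold:
            self.deleted.add(line)
            return True
        return False

    def usable(self, line):
        """Return True if `line` is still available for use."""
        return line not in self.deleted
```

Once a line is deleted, later accesses avoid it, which is why a repeat fault on the same location does not recur and no hardware replacement is needed in this model.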