5102ch04.fm Draft Document for Review May 12, 2014 12:46 pm

118 IBM Power System S822 Technical Overview and Introduction

(transient) errors in the processor core. Soft failures in the processor core are transient

(intermittent) errors, often because of cosmic rays or other sources of radiation, and

generally are not repeatable. When an error is encountered in the core, the POWER8

processor first automatically retries the instruction. If the source of the error was truly

transient, the instruction succeeds and the system continues as before. On IBM systems

before POWER6, this error caused a checkstop.

Hard failures are more difficult; they are true logical errors that are replicated each time the

instruction is repeated. Retrying the instruction does not help in this situation. As in

POWER6, POWER6+, POWER7, and POWER7+, all POWER8 processors can extract the

failing instruction from the faulty core and retry it elsewhere in the system for several faults,

after which the failing core is dynamically deconfigured and called out for replacement.

These systems are designed to avoid a full system outage.
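
The recovery order described above (retry on the same core, then move the instruction to another core, then deconfigure a core that keeps failing) can be sketched as a small simulation. This is only an illustrative model, not IBM firmware logic: the `Core` class, the `run_instruction` function, and the `FAULT_THRESHOLD` value are hypothetical, and the source says only that deconfiguration follows "several faults".

```python
from dataclasses import dataclass

FAULT_THRESHOLD = 3  # hypothetical value; the source only says "several faults"


@dataclass
class Core:
    name: str
    hard_fault: bool = False      # a solid defect: every attempt fails
    transient_faults: int = 0     # number of upcoming attempts that fail once
    fault_count: int = 0          # hard faults that had to be recovered elsewhere
    deconfigured: bool = False

    def execute(self, instruction: str) -> bool:
        """Return True if the instruction completes without a detected error."""
        if self.hard_fault:
            return False
        if self.transient_faults > 0:
            self.transient_faults -= 1
            return False
        return True


def run_instruction(instruction: str, core: Core, spare: Core) -> str:
    """Model the recovery order for a single instruction."""
    if core.execute(instruction):
        return "ok"
    # Step 1: retry on the same core; a transient error should not recur.
    if core.execute(instruction):
        return "recovered by retry"
    # Step 2: the error repeated, so treat it as a hard failure and
    # move the instruction to another core.
    core.fault_count += 1
    if core.fault_count >= FAULT_THRESHOLD:
        core.deconfigured = True  # would be called out for replacement
    if spare.deconfigured or not spare.execute(instruction):
        return "unrecovered"
    return "recovered on alternate core"
```

In this model a core with one pending transient fault returns "recovered by retry", while a core with `hard_fault=True` returns "recovered on alternate core" each time and is marked deconfigured on its third fault. A real dispatcher would also stop scheduling work on a deconfigured core, which the sketch leaves out.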

• Uncorrectable error recovery

The auto-restart (reboot) option, when enabled, can reboot the system automatically

following an unrecoverable firmware error, firmware hang, hardware failure, or

environmentally induced (AC power) failure.

The auto-restart (reboot) option must be enabled from the Advanced System Management

Interface (ASMI) or from the Control (Operator) Panel.

• Partition availability priority

Availability priorities can be assigned to partitions. If an alternate processor recovery event

requires spare processor resources to protect a workload, when no other means of

obtaining the spare resources is available, the system determines which partition has the

lowest priority and attempts to claim the needed resource. On a properly configured

POWER8 processor-based server, this allows that capacity to be obtained first from,

for example, a test partition instead of a financial accounting system.

• POWER8 cache availability

The L2 and L3 caches in the POWER8 processor are protected with double-bit detect,

single-bit correct error correction code (ECC). In addition, the caches maintain a cache line

delete capability. A threshold of correctable errors detected on a cache line can result in

the data in the cache line being purged and the cache line removed from further operation

without requiring a reboot. An ECC uncorrectable error detected in the cache can also

trigger a purge and delete operation of the cache line. This step results in no loss of

operation if the cache line contained data that is unmodified from what was stored in

system memory. Modified data would be handled through Special Uncorrectable Error

handling. L1 data and instruction caches also have a retry capability for intermittent errors

and a cache set delete mechanism for handling solid failures. In addition, the POWER8

processors can dynamically substitute a faulty bit-line in an L3 cache

dedicated to a processor with a spare bit-line. For soft errors in caches, key design

elements include ECC in the L2 and L3 caches, plus a retry mechanism to handle L1

cache faults.
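
To make "double-bit detect, single-bit correct" concrete, the following minimal sketch implements a textbook extended Hamming code over 4 data bits. It is an assumption chosen for illustration: the source does not say which code the POWER8 caches use, and real cache ECC protects much wider words. The function names `encode` and `decode` are hypothetical.

```python
from __future__ import annotations

DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2 and 4 hold parity bits


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8
    for shift, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> shift) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Position 0 is an overall parity bit; it adds double-bit detection.
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(codeword: list[int]) -> tuple[str, int | None]:
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    bits = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        status = "clean"
    elif overall == 1:
        # Odd overall parity means one bit flipped; the syndrome names it
        # (a syndrome of 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    nibble = sum(bits[pos] << shift for shift, pos in enumerate(DATA_POSITIONS))
    return status, nibble
```

Flipping any one bit of `encode(n)` decodes back to `n` with status "corrected", while flipping any two bits yields "uncorrectable". In the cache description above, the latter case is what would trigger the purge and delete of the line.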

For some persistent errors in the processor core, alternate processor recovery allows

workload running on one core to be migrated over to another core without taking any

applications down in the process. This technique does require the cooperation of the

PowerVM hypervisor, but with proper virtualization and with sufficient spare capacity can

be transparent to operating systems and applications.
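
When no spare capacity exists, the partition availability priority described earlier decides which partition gives up a core. A minimal sketch of that selection, under stated assumptions, might look like the following. The `Partition` class, the `claim_core` function, and the priority scale (a larger number meaning more important) are hypothetical and are not the PowerVM interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    name: str
    priority: int  # hypothetical scale: a larger number means more important
    cores: int


def claim_core(partitions: List[Partition], spare_cores: int) -> Optional[str]:
    """Return the name of the source that gives up one core, or None."""
    # Prefer genuinely spare capacity; partitions are only a last resort.
    if spare_cores > 0:
        return "spare pool"
    donors = [p for p in partitions if p.cores > 0]
    if not donors:
        return None
    # Take the core from the partition with the lowest availability priority.
    victim = min(donors, key=lambda p: p.priority)
    victim.cores -= 1
    return victim.name
```

With a test partition at priority 10 and a financial accounting partition at priority 200, `claim_core` takes the core from the test partition, which matches the example given in the text.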

Persistent recoverable errors in the L1, L2, and L3 caches can be handled first by removing

the portion of the cache that contains the error from use.

Even when uncorrectable errors occur in caches and may cause some sort of application

or other code interruption, cache line-delete can prevent repeat faults without needing to

replace any hardware.
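
The cache line delete behavior can be modeled as a per-line counter with a threshold. The sketch below is illustrative only: the `CacheModel` class, its method names, the returned strings, and the `CE_THRESHOLD` value are assumptions, since the source does not state the actual threshold or how the hardware tracks errors.

```python
from collections import Counter

CE_THRESHOLD = 3  # hypothetical; the source does not give the real threshold


class CacheModel:
    def __init__(self, threshold: int = CE_THRESHOLD):
        self.threshold = threshold
        self.correctable = Counter()  # line index -> correctable errors seen
        self.deleted = set()          # lines removed from further operation

    def report_error(self, line: int, correctable: bool, modified: bool = False) -> str:
        """Record an ECC event on a cache line and return the action taken."""
        if line in self.deleted:
            return "already deleted"
        if correctable:
            self.correctable[line] += 1
            if self.correctable[line] < self.threshold:
                return "corrected"
            # Too many correctable errors: purge the data and retire the line.
            self.deleted.add(line)
            return "purged and deleted"
        # An uncorrectable error retires the line immediately.
        self.deleted.add(line)
        if modified:
            # Memory holds no good copy, so the data needs special handling.
            return "deleted, special uncorrectable error handling"
        return "purged and deleted, no data lost"
```

In either case the line is retired without a reboot or a hardware replacement, so a repeat fault on the same line cannot occur.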