

5102ch04.fm Draft Document for Review May 12, 2014 12:46 pm

112 IBM Power System S822 Technical Overview and Introduction

4.2.3 Memory protection

A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore, a
variety of protection methods are used in all POWER processor-based systems to avoid
uncorrectable errors in memory.

Memory protection plans must account for many factors, including the following:

• Size

• Desired performance

• Memory array manufacturing characteristics

POWER8 processor-based systems have various protection schemes that are designed to prevent,
protect against, or limit the effect of errors in main memory:

• Chipkill

Chipkill is an enhancement that enables a system to sustain the failure of an entire
DRAM chip. An ECC word uses 18 DRAM chips from two DIMM pairs, and a failure on any
of the DRAM chips can be fully recovered by the ECC algorithm. The system can continue
indefinitely in this state with no performance degradation until the failed DIMM can
be replaced.

• 72-byte ECC

In POWER8, an ECC word consists of 72 bytes of data. Of these, 64 bytes are used to
hold application data. The remaining eight bytes are used to hold check bits and additional
information about the ECC word.

The DIMMs are designed by IBM with a memory buffer on each DIMM, DRAM modules for
holding data and for error checking and correction, and spare DRAM modules. A failed DRAM
module can be replaced with a spare, which avoids replacing the DIMM for such a failure.
This is an improvement over POWER7 processor-based one-socket and two-socket servers,
which had the same level of ECC but no spare DRAMs.

• Hardware scrubbing

Hardware scrubbing is a method used to handle intermittent errors. IBM POWER
processor-based systems periodically address all memory locations. Any memory
locations with a correctable error are rewritten with the correct data.

• Cyclic redundancy check (CRC)

The bus that is transferring data between the processor and the memory uses CRC error
detection with a failed operation-retry mechanism and the ability to dynamically retune the
bus parameters when a fault occurs. In addition, the memory bus has spare capacity to
substitute a data bit-line whenever it is determined to be faulty.

• Memory Channel Repair

The memory channel design includes a CRC error checking capability. This includes the
ability to retry a failed bus operation and to retrain the channel when excessive CRC
errors are seen.

The design includes the ability to dynamically replace one of the bits on the bus (dynamic
bit-lane sparing) based on a hardware-detected error. The firmware and hardware do not
support detection of which bit is at fault when there is a CRC error other than as part of a
retrain operation. Therefore, the POWER8 system does not support dynamic bit-lane
sparing based on firmware detecting too many retraining (or channel init) operations.
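
The CRC detect, retry, and retrain behavior described for the memory channel can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the Channel class, the read_with_retry function, the retry limit, and the retrain threshold are all hypothetical names and values, and zlib.crc32 stands in for whatever CRC the hardware actually uses. It does not model bit-lane sparing.

```python
import zlib


class Channel:
    """Toy memory channel (hypothetical): corrupts the next `faults` transfers."""

    def __init__(self, faults):
        self.faults = faults   # number of upcoming transfers to corrupt
        self.retrains = 0      # how many times the channel was retrained

    def transfer(self, payload):
        """Return (data, crc) as the receiver would see them."""
        crc = zlib.crc32(payload)  # sender computes the CRC over the good data
        if self.faults > 0:
            self.faults -= 1
            # Flip one bit in the first byte to simulate a bus error.
            payload = bytes([payload[0] ^ 0x01]) + payload[1:]
        return payload, crc

    def retrain(self):
        """Stand-in for a channel retrain operation."""
        self.retrains += 1


def read_with_retry(channel, payload, max_retries=3, retrain_threshold=2):
    """Retry on CRC mismatch; retrain once errors reach retrain_threshold.

    Both limits are assumed values chosen for illustration only.
    """
    crc_errors = 0
    for _ in range(max_retries + 1):
        data, crc = channel.transfer(payload)
        if zlib.crc32(data) == crc:
            return data, crc_errors      # clean transfer
        crc_errors += 1
        if crc_errors == retrain_threshold:
            channel.retrain()            # excessive CRC errors seen
    raise IOError("transfer failed after retries")


if __name__ == "__main__":
    ch = Channel(faults=2)
    data, errors = read_with_retry(ch, b"cache line")
    print(data, errors, ch.retrains)
```

In this model, two consecutive corrupted transfers are detected by the CRC comparison, the second one triggers a single retrain, and the third attempt returns the intact data. A channel that keeps failing past the retry limit raises an error instead of returning bad data.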