Specifications

5102ch04.fm Draft Document for Review May 12, 2014 12:46 pm
IBM Power System S822 Technical Overview and Introduction
4.2.3 Memory protection
A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore, a
variety of protection methods are used in all POWER processor-based systems to avoid
uncorrectable errors in memory.
Memory protection plans must account for many factors, including the following:
• Size
• Desired performance
• Memory array manufacturing characteristics
POWER8 processor-based systems have various protection schemes designed to prevent,
protect against, or limit the effect of errors in main memory:
• Chipkill
Chipkill is an enhancement that enables a system to sustain the failure of an entire
DRAM chip. An ECC word uses 18 DRAM chips from two DIMM pairs, and a failure on any
of the DRAM chips can be fully recovered by the ECC algorithm. The system can continue
indefinitely in this state with no performance degradation until the failed DIMM can
be replaced.
• 72-byte ECC
In POWER8, an ECC word consists of 72 bytes of data. Of these, 64 bytes are used to
hold application data. The remaining eight bytes are used to hold check bits and additional
information about the ECC word.
IBM designs the DIMMs with a memory buffer on each DIMM, DRAM modules for holding data
and for error checking and correcting, and spare DRAM modules. A spare can take the place
of a failed DRAM module, which avoids replacing the DIMM for such a failure. This is an
improvement over POWER7 processor-based one-socket and two-socket servers, which had
the same level of ECC but no spare DRAMs.
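The byte counts given for the ECC word imply a fixed storage overhead, which a short calculation makes explicit. This is only an arithmetic sketch; the constant and variable names are illustrative and do not come from IBM documentation.

```python
# Byte counts stated in the text for a POWER8 ECC word.
ECC_WORD_BYTES = 72   # total size of one ECC word
DATA_BYTES = 64       # bytes that hold application data

# The remainder holds check bits and additional information about the word.
check_bytes = ECC_WORD_BYTES - DATA_BYTES

# Fraction of each ECC word that is not application data.
overhead_of_word = check_bytes / ECC_WORD_BYTES
# Extra storage needed per byte of application data.
overhead_of_data = check_bytes / DATA_BYTES

print(check_bytes)                 # 8
print(f"{overhead_of_word:.1%}")   # 11.1%
print(f"{overhead_of_data:.1%}")   # 12.5%
```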
• Hardware scrubbing
Hardware scrubbing is a method used to handle intermittent errors. IBM POWER
processor-based systems periodically address all memory locations. Any memory
locations with a correctable error are rewritten with the correct data.
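The scrubbing loop described above can be sketched as a toy software model. It assumes a Hamming(7,4) code as a stand-in for the real ECC, which is far wider and implemented in hardware; the function names `encode`, `syndrome`, and `scrub` are hypothetical and chosen only for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    bits = [0] * 8                      # index 0 is unused
    bits[3] = (nibble >> 0) & 1         # data bits sit at positions 3, 5, 6, 7
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # parity bits sit at positions 1, 2, 4
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return bits[1:]


def syndrome(word):
    """Return 0 for a clean word, else the 1-based position of a single flipped bit."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def scrub(memory):
    """Visit every location and rewrite any word that has a correctable error."""
    corrected = []
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            word[s - 1] ^= 1            # write the corrected bit back in place
            corrected.append(addr)
    return corrected


# Toy memory: 16 locations, each holding one encoded nibble.
memory = [encode(n) for n in range(16)]
clean = [list(w) for w in memory]

# Inject two single-bit errors, then run one scrub pass.
memory[5][2] ^= 1
memory[11][6] ^= 1
corrected = scrub(memory)
print(corrected)          # [5, 11]
print(memory == clean)    # True
```

A single pass repairs only single-bit errors per word in this model; a word with two flipped bits would be miscorrected, which is one reason real memory ECC uses stronger codes.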
• Cyclic redundancy check (CRC)
The bus that is transferring data between the processor and the memory uses CRC error
detection with a failed operation-retry mechanism and the ability to dynamically retune the
bus parameters when a fault occurs. In addition, the memory bus has spare capacity to
substitute a data bit-line whenever it is determined to be faulty.
• Memory Channel Repair
The memory channel design includes a CRC error checking capability. This includes the
ability to re-try a failed bus operation and to re-train the channel when excessive CRC
errors are seen.
The design includes the ability to dynamically replace one of the bits on the bus (dynamic
bit-lane sparing) based on a hardware-detected error. Outside of a re-train operation,
neither the firmware nor the hardware can identify which bit is at fault when a CRC error
occurs. Therefore, the POWER8 system does not support dynamic bit-lane sparing based on
firmware detecting too many re-training (or channel init) operations.
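The CRC detection, operation retry, and channel re-train behavior described for the memory bus can be illustrated with a small software model. Everything here is an assumption made for illustration: the `ToyChannel` class, the `read_with_retry` function, the use of CRC-32 from Python's `zlib`, and the retry and re-train thresholds are hypothetical and do not reflect the actual POWER8 bus protocol or its parameters.

```python
import zlib


class ToyChannel:
    """Hypothetical bus model that corrupts a transfer whenever the script says so."""

    def __init__(self, fault_script):
        self.fault_script = list(fault_script)   # True means corrupt that transfer
        self.retrains = 0

    def transfer(self, payload):
        crc = zlib.crc32(payload)                # sender attaches a CRC
        faulty = self.fault_script.pop(0) if self.fault_script else False
        if faulty:
            # Flip one bit of the first byte to model a bus fault (payload must be non-empty).
            payload = bytes([payload[0] ^ 0x01]) + payload[1:]
        return payload, crc

    def retrain(self):
        # A real channel would retune its bus parameters; this model only counts calls.
        self.retrains += 1


def read_with_retry(channel, payload, max_retries=3, retrain_threshold=2):
    """Retry a failed transfer and re-train the channel after repeated CRC errors."""
    crc_errors = 0
    for _ in range(max_retries + 1):
        data, crc = channel.transfer(payload)
        if zlib.crc32(data) == crc:              # receiver recomputes and compares
            return data, crc_errors
        crc_errors += 1
        if crc_errors % retrain_threshold == 0:  # assumed threshold, not an IBM value
            channel.retrain()
    raise RuntimeError("transfer failed after all retries")


# Two corrupted transfers, then a clean one.
channel = ToyChannel([True, True, False])
data, errors = read_with_retry(channel, b"cache line")
print(data)              # b'cache line'
print(errors)            # 2
print(channel.retrains)  # 1
```

The model does not attempt bit-lane sparing, because, as noted above, identifying the faulty bit is only possible as part of a re-train operation in the real design.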