Release Notes

3 Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features

A Primer on Memory Errors

To understand the memory RAS capabilities of PowerEdge servers, it helps to first understand the types of memory errors that can occur.

DRAM errors can be broadly classified into two categories:

o Soft Errors

o Soft errors are transient and are often caused by electrical disturbances in memory subsystem components. These disturbances can occur at any of many locations within the memory subsystem, including the processor memory controller, processor-internal buses, processor cache, processor socket or connector, motherboard bus traces, discrete memory buffer chips (if present), DIMM connectors, or individual DRAM components on DIMMs.

o Soft errors may be caused by phenomena such as high-energy particle strikes in the

memory subsystem or electrical noise in the circuits. Single or multiple bits can be

affected, with single-bit errors corrected using demand or patrol scrubbing.

o Hard Errors

o Hard errors are persistent and do not resolve over time, through system resets, or through system power cycles. These errors can result from stuck-at faults (i.e., degradation of a single lane on a bus or a single memory cell in a DRAM component), from failure of an entire device (for example, a connector, processor, memory buffer, or DRAM component), from improper bus initialization, or from memory power issues. A failure within a DRAM component may be an entire-device failure, a bank region failure within a device, or a pin, column, or cell failure.

o Hard errors may be caused by physical part damage, electrostatic discharge (ESD), electrical overcurrent conditions, overtemperature conditions, or irregularities in processor or DRAM fabrication or module assembly.
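
The practical difference between the two categories is persistence: a soft error disappears once the correct data is rewritten, while a hard error returns. The following is a minimal sketch of that distinction using a toy 8-bit memory word. The class `ToyWord`, the helper `error_persists_after_rewrite`, and the bit positions are hypothetical names and values chosen for illustration. This is not how PowerEdge firmware detects or corrects errors; real platforms recover the correct data from ECC check bits rather than from a known expected value.

```python
class ToyWord:
    """Toy model of one 8-bit memory word with an optional stuck-at fault."""

    def __init__(self, value, stuck_bit=None, stuck_value=0):
        self.stuck_bit = stuck_bit      # hypothetical: index of a permanently faulty bit
        self.stuck_value = stuck_value  # level the faulty bit is stuck at (0 or 1)
        self.cell = 0
        self.write(value)

    def _apply_fault(self, value):
        # A stuck-at fault forces one bit to a fixed level on every store.
        if self.stuck_bit is None:
            return value
        mask = 1 << self.stuck_bit
        return (value | mask) if self.stuck_value else (value & ~mask)

    def write(self, value):
        self.cell = self._apply_fault(value & 0xFF)

    def read(self):
        return self.cell

    def upset(self, bit):
        """Simulate a transient disturbance that flips one stored bit."""
        self.cell = self._apply_fault(self.cell ^ (1 << bit))


def error_persists_after_rewrite(word, expected):
    """Rewrite the expected data and read it back; True means the error is persistent."""
    word.write(expected)
    return word.read() != expected


DATA = 0b10101010

# Soft error: a one-time bit flip in an otherwise healthy word.
soft = ToyWord(DATA)
soft.upset(3)
soft_seen = soft.read() != DATA
soft_persists = error_persists_after_rewrite(soft, DATA)

# Hard error: bit 3 is stuck at 0, so every write of DATA is corrupted.
hard = ToyWord(DATA, stuck_bit=3, stuck_value=0)
hard_seen = hard.read() != DATA
hard_persists = error_persists_after_rewrite(hard, DATA)

print("soft:", soft_seen, soft_persists)
print("hard:", hard_seen, hard_persists)
```

In this sketch both words initially read back wrong data, but only the word with the stuck-at bit still fails after the rewrite, which mirrors the transient versus persistent distinction described above.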

The two categories of DRAM errors previously described can ultimately lead to two types of memory

errors:

o Correctable Errors (CEs)

o Correctable errors are errors that can be detected and corrected by the server platform. These are typically single-bit errors, though depending on the CPU and memory configuration, they may also include some types of multi-bit errors (corrected by Advanced ECC). Correctable errors can be caused by both soft and hard errors and will not disrupt operation of PowerEdge servers.

o As DRAM-based memory shrinks in geometry to grow in capacity, an increasing number of correctable errors is expected to occur as a natural part of uniform scaling. Additionally, due to various other DRAM scaling factors (e.g., decreasing cell capacitance), there is an expected increase in the number of error-generating phenomena such as Variable Retention Time (VRT) [1] and Random Telegraph Noise (RTN) [2].

o Within the server industry, it is increasingly accepted, and Dell shares this view, that some correctable errors per DIMM are unavoidable and do not inherently warrant