Release Notes

Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features
Machine Check Architecture Recovery, or MCA Recovery, is an advanced RAS feature which, when used in
conjunction with a supported operating system, can prevent some types of uncorrectable memory errors
from crashing the entire system. Essentially, the processor’s memory controller detects an
uncorrectable error, signals to the OS that the error was detected in a given memory page, and allows
the OS to gracefully contain the issue. The outcome depends on the point of UCE detection and on
whether the impacted memory is associated with kernel space or user space.
If the uncorrectable error is detected in the execution path, the error was detected at the
point of consumption by the processor. These are considered Software Recoverable Action Required
(SRAR) errors. If the corrupted memory was destined for kernel space, the OS will kernel panic
and the system will crash, as with normal UCE behavior. If it was destined for user space, the OS will
kill the associated process without impacting the rest of the system.
If the uncorrectable error is detected in the non-execution path, the error was detected by
memory patrol scrub and was not about to be consumed by the processor. Detection of
these unconsumed uncorrectable errors is recorded in the System Event Log as a critical event,
MEM9072: “The system memory has faced uncorrectable multi-bit memory errors in the non-execution
path of a memory device at the location <location>.”
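The cases described above can be summarized as a small decision function. This is only an illustrative sketch of the behavior as stated in the text, not firmware or operating system code; the names `DetectionPath`, `MemoryOwner`, `Outcome`, and `mca_recovery_outcome` are hypothetical and chosen for this example.

```python
from enum import Enum


class DetectionPath(Enum):
    EXECUTION = "execution"          # detected at the point of consumption (SRAR)
    NON_EXECUTION = "non-execution"  # detected by patrol scrub, not yet consumed


class MemoryOwner(Enum):
    KERNEL = "kernel space"
    USER = "user space"


class Outcome(Enum):
    KERNEL_PANIC = "OS kernel panics; system crashes as with any UCE"
    KILL_PROCESS = "OS kills the affected process; rest of system unaffected"
    LOG_MEM9072 = "critical MEM9072 event recorded in the System Event Log"


def mca_recovery_outcome(path: DetectionPath, owner: MemoryOwner) -> Outcome:
    """Map a detected uncorrectable error to the outcome described in the text."""
    if path is DetectionPath.NON_EXECUTION:
        # The text only describes the logged event for unconsumed errors,
        # so the owner of the memory is not consulted in this sketch.
        return Outcome.LOG_MEM9072
    if owner is MemoryOwner.KERNEL:
        return Outcome.KERNEL_PANIC
    return Outcome.KILL_PROCESS


if __name__ == "__main__":
    for path in DetectionPath:
        for owner in MemoryOwner:
            result = mca_recovery_outcome(path, owner)
            print(f"{path.value:>13} / {owner.value:<12} -> {result.value}")
```

The sketch makes the asymmetry explicit: only a consumed error in kernel space takes the system down, while the other cases are contained or simply logged.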
Other Memory RAS Capabilities on PowerEdge servers
• Memory Map Out: If critical failures (such as uncorrectable errors) are detected in the memory
training and test phase of POST, PowerEdge servers will automatically map out the affected
DIMMs from the system memory pool. This prevents the faulty DIMM from causing potential
service outages. The affected DIMM will not be mapped back into the memory pool until there
is a memory configuration change (such as a DIMM replacement).
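The map-out behavior can be pictured with a toy model. The sketch below is an assumption-laden illustration, not Dell BIOS logic: the `MemoryPool` class, its methods, and the slot names are hypothetical, and it assumes a configuration change clears the map-out state only for the slot that was replaced.

```python
class MemoryPool:
    """Toy model of DIMM map-out across boots (hypothetical, for illustration)."""

    def __init__(self, slots):
        self.installed = set(slots)  # populated DIMM slots
        self.mapped_out = set()      # slots excluded from the memory pool

    def post(self, failed_slots=()):
        """Simulate one POST pass and return the slots available to the OS.

        Slots that fail memory training/test are mapped out and stay
        mapped out on later boots, even if they pass a later test.
        """
        self.mapped_out |= set(failed_slots) & self.installed
        return self.installed - self.mapped_out

    def replace_dimm(self, slot):
        """Model a configuration change: the replaced slot is eligible again."""
        self.mapped_out.discard(slot)


if __name__ == "__main__":
    pool = MemoryPool(["A1", "A2", "B1"])
    print(sorted(pool.post(failed_slots=["A2"])))  # A2 fails during POST
    print(sorted(pool.post()))                     # still mapped out on reboot
    pool.replace_dimm("A2")
    print(sorted(pool.post()))                     # available again after replacement
```

The point of the model is the persistence: a reboot alone does not return the DIMM to the pool, only a configuration change does.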
Achieving Maximum Memory Up Time
Based on the memory RAS features discussed in the previous section, the following is a summary of how
users can configure their systems to achieve maximum memory up time:
• Configure server using genuine Dell DIMMs
  o Benefit: Memory modules are fully validated and assured by Dell; additional self-healing
    (PPR) resources above and beyond industry standards
• Configure server with x4 DRAM based DIMMs
  o Benefit: Single DRAM Device Correction (and ADDDC on Intel platforms)
• Configure server to operate in the following redundancy modes (in descending order of
  protection):
  o Best: Configure server to operate in Memory Mirroring Mode
    ▪ Benefit: RAID1 level memory protection, significantly reduced probability of
      UCEs
    ▪ Downside: 50% memory capacity reduction
  o Better: Configure server to operate in Fault Resilient Mode
    ▪ Benefit: Significantly reduced probability of UCEs in critical portions of memory
      used by operating systems