Whitepaper
Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features
Revision: 1.3
Issue Date: 11/20/2020
Issue Date: 1/22/2021
Introduction
Memory sub-system errors are among the most common types of errors seen on modern computing systems. Understanding how memory errors occur and how to prevent or avoid them is a complex subject, one that has challenged countless industry researchers and developers over the last 30 years.
Revisions Date Description January 3, 2020 • Initial release • Removed content for platforms based on AMD EPYC and Xeon E processors Added more information to primer on uncorrectable errors Added clarification on PPR resources for genuine Dell DIMMs Added MEM8000 SEL event to recommended user actions list Added clarification to MEM9072 SEL event details and recommended user action Added content specific to updates contained in BIOS 2.7.
Huong Nguyen, BIOS Development, Technical Staff, Dell EMC
Ching-Lung Chao, BIOS Development, Technical Staff, Dell EMC
Fred Spreeuwers, IPS Engineering, Technical Staff, Dell EMC
Mark Dykstra, IPS Engineering, Senior Principal Engineer, Dell EMC
Rene Franco, Memory Systems Engineering, Senior Manager, Dell EMC
Mark Farley, Component Quality Engineering, Senior Principal Engineer, Dell EMC
A Primer on Memory Errors
To fully understand the memory RAS response capabilities of PowerEdge servers, it is fir
Uncorrectable Errors (UCEs)
o Uncorrectable errors are multi-bit errors that could not be corrected by the server platform. These can be caused by any combination of soft or hard errors, but typically occur as a result of multiple hard errors.
o Not all multi-bit errors are uncorrectable. CPUs that support Advanced ECC can correct some types of multi-bit errors, depending on the bit error pattern.
Unconsumed Outcome based on OS error containment Poisoned upon detection; error waits to be consumed Error waits to be consumed
A Primer on Dell EMC PowerEdge Server Memory RAS Capabilities
The memory errors discussed previously are mitigated by PowerEdge server memory RAS capabilities, which entail fault avoidance, detection, and correction in hardware and software. These RAS features are all intended to improve system reliability and extend uptime in the event of memory errors.
error correction that covers an entire DRAM device has been branded in various forms, most popularized as Chipkill and Single Device Data Correction (SDDC). Advanced ECC is a highly complex feature that is based on the concept of Single Symbol Correcting – Double Symbol Detecting (SSC-DSD) Reed-Solomon error correcting and detection code [3]. At a high level, SSC-DSD works by breaking up cache line accesses into ‘code words’ which in turn are made up of multi-bit symbols.
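The symbol-level view above can be illustrated with a minimal sketch. It is not a Reed-Solomon decoder and does not represent the CPU's actual algorithm; it only counts how many symbols a given bit-error pattern touches and reports the nominal SSC-DSD guarantee for that count. The 144-bit code word length, the 4-bit symbol width, and the function names are assumptions chosen for illustration.

```python
from typing import Iterable, Set

CODE_WORD_BITS = 144  # ASSUMPTION: illustrative code word length, bits numbered 1..144
SYMBOL_BITS = 4       # ASSUMPTION: illustrative symbol width


def affected_symbols(flipped_bits: Iterable[int],
                     symbol_bits: int = SYMBOL_BITS) -> Set[int]:
    """Return the zero-based indexes of the symbols containing flipped bits."""
    symbols = set()
    for bit in flipped_bits:
        if not 1 <= bit <= CODE_WORD_BITS:
            raise ValueError(f"bit position {bit} is outside the code word")
        symbols.add((bit - 1) // symbol_bits)
    return symbols


def classify(flipped_bits: Iterable[int],
             symbol_bits: int = SYMBOL_BITS) -> str:
    """Map a bit-error pattern to the nominal SSC-DSD outcome."""
    count = len(affected_symbols(flipped_bits, symbol_bits))
    if count == 0:
        return "no error"
    if count == 1:
        return "correctable"  # single symbol: corrected
    if count == 2:
        return "detected, uncorrectable"  # double symbol: detected only
    return "beyond SSC-DSD guarantee"


print(classify([5, 6, 7, 8]))  # four bad bits, all in one symbol
print(classify([4, 5]))        # two bad bits that straddle two symbols
```

The two calls show why the number of failed bits alone does not decide the outcome: under these assumptions, four flipped bits confined to one symbol are correctable, while two flipped bits in adjacent symbols are only detectable.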
Figure 2 - Advanced ECC can correct multi-bit errors in a single symbol…
Adaptive Double Device Data Correction (ADDDC)
ADDDC Feature Support Table: DIMMs Supported, Memory Configuration Required. x4 DIMMs: x8 DIMMs: • Two or more memory ranks per memory channel
Adaptive Double Device Data Correction (ADDDC) is an Intel platform-specific technology that allows for two DRAM devices to sequentially fail before loss of fault-avoidance.
Memory patrol scrubbing is enabled by default and configured to run in the background every 24 hours. It can be disabled or set to run on an accelerated schedule (every four hours) in the BIOS setup under the power management menu. Patrol scrubbing may affect system performance for some workloads while it is running. FYI: Demand scrub occurs when the memory controller encounters a correctable error during a regular run-time read transaction and writes back corrected data.
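To give a rough sense of why the accelerated schedule can be more visible to workloads, the sketch below estimates the average rate at which memory must be read to cover all installed memory once per scrub interval. It assumes the scrubber spreads one full pass evenly across the interval, which is a simplification and not a description of the memory controller's actual behavior. The 768 GiB capacity and the function name are hypothetical.

```python
def patrol_scrub_rate_mib_per_s(installed_gib: float, interval_hours: float) -> float:
    """Average read rate (MiB/s) needed to touch all memory once per interval.

    ASSUMPTION: one full pass is spread evenly over the whole interval.
    """
    total_mib = installed_gib * 1024
    return total_mib / (interval_hours * 3600)


installed_gib = 768  # hypothetical server configuration
standard = patrol_scrub_rate_mib_per_s(installed_gib, 24)  # default schedule
extended = patrol_scrub_rate_mib_per_s(installed_gib, 4)   # accelerated schedule

print(f"24-hour schedule: {standard:.1f} MiB/s")
print(f" 4-hour schedule: {extended:.1f} MiB/s")
print(f"ratio: {extended / standard:.0f}x")
```

Under this model, moving from the 24-hour schedule to the four-hour schedule multiplies the average scrub rate by six, regardless of installed capacity.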
sparing failover. The failover process consists of checking the health of the spare rank(s) through patrol scrubbing and then seamlessly copying the contents of the degraded rank to the spare rank(s). Memory rank sparing is disabled by default and can be enabled in BIOS setup if required.
o E.g. One 32 GB RDIMM (2Rx4) and one 16 GB RDIMM (2Rx8) installed = two 16 GB ranks and two 8 GB ranks. Both 16 GB ranks will be held as spares, so 32 GB of the 48 GB installed is reserved, resulting in a capacity reduction of roughly 66%.
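The arithmetic behind this example can be checked with a short sketch. It assumes that the spare ranks are always the largest ranks present, which matches the example above but is stated here as an assumption and not as a description of the BIOS selection logic. The function name and parameters are hypothetical.

```python
from typing import List, Tuple


def sparing_capacity(rank_sizes_gb: List[int], spare_ranks: int) -> Tuple[int, float]:
    """Return (usable capacity in GB, fraction of capacity held as spare).

    ASSUMPTION: the largest ranks are the ones reserved as spares.
    """
    if not 0 <= spare_ranks <= len(rank_sizes_gb):
        raise ValueError("spare_ranks must be between 0 and the number of ranks")
    ranks = sorted(rank_sizes_gb, reverse=True)
    reserved = sum(ranks[:spare_ranks])
    total = sum(ranks)
    return total - reserved, reserved / total


# One 32 GB 2Rx4 RDIMM (two 16 GB ranks) plus one 16 GB 2Rx8 RDIMM (two 8 GB ranks)
usable_gb, reserved_fraction = sparing_capacity([16, 16, 8, 8], spare_ranks=2)
print(f"usable: {usable_gb} GB, reserved: {reserved_fraction:.1%}")
```

With both 16 GB ranks held as spares, 16 GB of the 48 GB installed remains usable and about two thirds of the capacity is reserved.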
Important: Consult your PowerEdge server installation and service manual for complete memory population guidelines to properly enable Memory Mirroring.
All memory channels must be populated identically, with either one DIMM per channel or two DIMMs per channel (for example, 24 DIMM systems should have 12 DIMMs or 24 DIMMs installed). Fault Resilient Memory is disabled by default and must be enabled through the BIOS setup menu. Important: Consult your PowerEdge server installation and service manual for complete memory population guidelines to properly enable Fault Resilient Memory.
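The channel population rule can be expressed as a small check. This is only a sketch of the single rule stated above, with a hypothetical function name and a hypothetical 12-channel layout for a 24 DIMM system; it does not cover the other population guidelines in the installation and service manual.

```python
from typing import List


def channels_uniformly_populated(dimms_per_channel: List[int]) -> bool:
    """True if every channel holds one DIMM, or every channel holds two DIMMs."""
    if not dimms_per_channel:
        return False
    return (all(n == 1 for n in dimms_per_channel)
            or all(n == 2 for n in dimms_per_channel))


# Hypothetical 24 DIMM system with 12 memory channels
print(channels_uniformly_populated([1] * 12))            # 12 DIMMs installed
print(channels_uniformly_populated([2] * 12))            # 24 DIMMs installed
print(channels_uniformly_populated([2] * 6 + [1] * 6))   # 18 DIMMs, mixed channels
```

The first two layouts satisfy the rule; the mixed layout does not, even though every slot it uses is valid on its own.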
Figure 7 - PPR for a row in a bank group of a 4Gb x4 device
PPR is always available on PowerEdge server platforms that support it and, if deemed necessary by BIOS, will automatically execute after a system cold reboot. For PPR to successfully execute, it is recommended that users do not swap or replace DIMMs between boots when receiving memory error event messages, unless instructed to do so by Dell technical support personnel.
• If the impacted data was in user/application/VM memory, then the OS will terminate the associated process or VM without impacting the rest of the system.
• If the impacted data was in user/application/VM memory but the OS had a redundant copy of the data, then the associated process or VM will recover.
Consult your operating system documentation on error containment for more information on OS behaviors.
o Benefit: Patrol scrub will run every four hours (instead of 24); the increased frequency reduces the accumulation of errors in areas of memory that have low utilization and are therefore not being corrected by demand scrub.
It is also recommended that users keep their PowerEdge server firmware up to date, especially server BIOS. Even after products ship, PowerEdge server development continues to improve its RAS algorithms and behaviors for an optimal customer experience.
location (note that BIOS may initiate more reboots during this process). Do not remove or swap the DIMM at the location specified in the event message.
• MEM0804 – This indicates that the system has successfully performed memory self-healing at the DIMM location specified in the event message.
o Recommended Response Action: No response required. DIMM is operating nominally.
• PowerEdge T440*
• PowerEdge T640
• PowerEdge C4140
• PowerEdge C6420
• PowerEdge XR2*
• PowerEdge R440*
• PowerEdge R540*
• PowerEdge R640
• PowerEdge R740
• PowerEdge R740xd
• PowerEdge R740xd2
• PowerEdge R840
• PowerEdge R940
• PowerEdge R940xa
• PowerEdge FC640
• PowerEdge M640
• PowerEdge MX740c
• PowerEdge MX840c
• PowerEdge XE2420*
• PowerEdge XE7420*
• PowerEdge XE7440*
The following VxRail platforms are leveraged from PowerEdge YX4X servers with Xeon SP processors and are therefore also cov
What’s New in BIOS 2.8.2
• Self-Healing on Uncorrectable Errors – Prior to this update, PowerEdge server BIOS was capable of performing self-healing only whenever its health monitoring algorithms deemed it necessary. With this PowerEdge server BIOS release, if the CPU detects an uncorrectable error, the server will automatically schedule self-healing to occur on the next cold reboot of the server.
Legal Notices THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel and Xeon are trademarks of Intel Corporation or its subsidiaries.