Concept Guide



Whitepaper

Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features

Introduction

Memory subsystem errors are among the most common types of errors seen on modern computing systems. Understanding how memory errors occur and how to prevent or avoid them is a complex subject, one that has challenged industry researchers and developers over the last 30 years. While Dell EMC PowerEdge servers are designed to provide industry-leading Reliability, Availability, and Serviceability (RAS) on memory issues, we realize that many of our technically savvy customers may want to know more about what’s happening ‘under the hood’ of their servers. This technical whitepaper is divided into four sections to help PowerEdge users understand the following memory error topics:

• Types of memory errors and how they may affect a server

• Dell EMC PowerEdge YX4X server memory RAS capabilities

• Configuring a PowerEdge server to achieve maximum memory up-time

• Recommended user actions when encountering memory errors

Important: The content covered in this whitepaper applies to Dell EMC PowerEdge YX4X servers with Intel Xeon SP processors. Customers with YX4X servers that utilize AMD EPYC or Intel Xeon E processors should refer to v1.0 of the RAS whitepaper.

The features described in this document assume the user is running the latest versions of Dell EMC PowerEdge server firmware, such as BIOS and iDRAC.

Revision: 1.3

Issue Date: 11/20/2020

Issue Date: 1/22/2021

Summary of content (20 pages)

PAGE 2
Revisions (Date / Description)
January 3, 2020: Initial release
Removed content for platforms based on AMD EPYC and Xeon E processors
Added more information to primer on uncorrectable errors
Added clarification on PPR resources for genuine Dell DIMMs
Added MEM8000 SEL event to recommended user actions list
Added clarification to MEM9072 SEL event details and recommended user action
Added content specific to updates contained in BIOS 2.7.
PAGE 3
Huong Nguyen, BIOS Development, Technical Staff, Dell EMC
Ching-Lung Chao, BIOS Development, Technical Staff, Dell EMC
Fred Spreeuwers, IPS Engineering, Technical Staff, Dell EMC
Mark Dykstra, IPS Engineering, Senior Principal Engineer, Dell EMC
Rene Franco, Memory Systems Engineering, Senior Manager, Dell EMC
Mark Farley, Component Quality Engineering, Senior Principal Engineer, Dell EMC

A Primer on Memory Errors

To fully understand the memory RAS response capabilities of PowerEdge servers, it is fir
PAGE 4
Uncorrectable Errors (UCEs)
o Uncorrectable errors are multi-bit errors that could not be corrected by the server platform. These can be caused by any combination of soft or hard errors, but typically occur as a result of multiple hard errors.
o Not all multi-bit errors are uncorrectable. CPUs that support Advanced ECC can correct some types of multi-bit errors, depending on the bit error pattern.
PAGE 5
Unconsumed Outcome based on OS error containment Poisoned upon detection; error waits to be consumed Error waits to be consumed

A Primer on Dell EMC PowerEdge Server Memory RAS Capabilities

The memory errors discussed previously are mitigated through PowerEdge server memory RAS capabilities, which entail fault avoidance, detection, and correction in hardware and software. These RAS features are all intended to improve system reliability and extend uptime in the event of memory errors.
PAGE 6
error correction that covers an entire DRAM device has been branded in various forms, most popularized as Chipkill and Single Device Data Correction (SDDC). Advanced ECC is a highly complex feature that is based on the concept of Single Symbol Correcting – Double Symbol Detecting (SSC-DSD) Reed-Solomon error correcting and detection code [3]. At a high level, SSC-DSD works by breaking up cache line accesses into ‘code words’ which in turn are made up of multi-bit symbols.
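The role of symbol boundaries in SSC-DSD can be sketched in a few lines. The snippet below is only an illustration of the single-symbol versus double-symbol distinction, not a Reed-Solomon decoder: the 4-bit symbol width, the 0-based bit numbering, and the function names are assumptions made for the example and are not taken from the whitepaper.

```python
from typing import Iterable


def affected_symbols(bad_bits: Iterable[int], symbol_bits: int = 4) -> set:
    """Return the indices of symbols that contain at least one flipped bit.

    bad_bits holds 0-based bit positions inside one code word.
    symbol_bits is a hypothetical symbol width chosen for illustration.
    """
    return {bit // symbol_bits for bit in bad_bits}


def classify(bad_bits: Iterable[int], symbol_bits: int = 4) -> str:
    """Classify an error pattern by how many symbols it touches."""
    count = len(affected_symbols(bad_bits, symbol_bits))
    if count == 0:
        return "no error"
    if count == 1:
        # All flipped bits fall inside one symbol: single symbol correction.
        return "correctable"
    if count == 2:
        # Two symbols damaged: double symbol detection, no correction.
        return "detectable, not correctable"
    # Three or more symbols: outside what SSC-DSD guarantees.
    return "beyond SSC-DSD guarantee"
```

Under these assumptions, four flipped bits that share one symbol are treated the same as a single flipped bit, while two flipped bits that straddle a symbol boundary are only detected.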
PAGE 7
Figure 2 - Advanced ECC can correct multi-bit errors in a single symbol…
PAGE 8
Adaptive Double Device Data Correction (ADDDC)

ADDDC Feature Support Table
DIMMs Supported: x4 DIMMs: Yes; x8 DIMMs: No
Memory Configuration Required: Two or more memory ranks per memory channel

Adaptive Double Device Data Correction (ADDDC) is an Intel platform-specific technology that allows for two DRAM devices to sequentially fail before loss of fault-avoidance.
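The support conditions in the table can be expressed as a simple check. This is a minimal sketch that assumes the two listed conditions (x4 DRAM devices, and two or more ranks per memory channel) are the only ones that matter; the function and parameter names are hypothetical, and the actual BIOS logic may consider more than this.

```python
def adddc_eligible(dram_width: int, ranks_per_channel: int) -> bool:
    """Return True when a channel meets the two conditions in the support table.

    dram_width is the DRAM device width of the DIMMs (4 for x4, 8 for x8).
    ranks_per_channel is the total number of memory ranks on the channel.
    """
    return dram_width == 4 and ranks_per_channel >= 2


# Example: one dual-rank x4 DIMM on a channel versus one dual-rank x8 DIMM.
print(adddc_eligible(dram_width=4, ranks_per_channel=2))
print(adddc_eligible(dram_width=8, ranks_per_channel=2))
```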
PAGE 9
Memory patrol scrubbing is enabled by default and configured to run in the background every 24 hours. Memory patrol scrub can be disabled or set to run on an accelerated schedule (every four hours) in the BIOS setup under the power management menu. Memory patrol scrub may have an impact on system performance for some workloads while it is running.

FYI: Demand Scrub occurs when the memory controller encounters a correctable error during a regular run-time read transaction and writes back corrected data.
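To get a feel for why the accelerated schedule can affect some workloads, the average scrub throughput can be estimated from the installed capacity and the scrub interval. The sketch below assumes the scrubber sweeps all installed memory exactly once per interval at a uniform rate, which is a simplification; the 384 GiB capacity and the function name are hypothetical.

```python
def patrol_scrub_rate_mb_s(capacity_gib: float, interval_hours: float) -> float:
    """Average scrub throughput in MB/s needed to cover capacity_gib once per interval."""
    if capacity_gib <= 0 or interval_hours <= 0:
        raise ValueError("capacity and interval must be positive")
    total_bytes = capacity_gib * 2**30   # GiB to bytes
    seconds = interval_hours * 3600      # hours to seconds
    return total_bytes / seconds / 1e6   # bytes/s to MB/s


# Hypothetical server with 384 GiB installed.
default_rate = patrol_scrub_rate_mb_s(384, 24)
accelerated_rate = patrol_scrub_rate_mb_s(384, 4)
print(f"24-hour schedule: {default_rate:.2f} MB/s")
print(f"4-hour schedule:  {accelerated_rate:.2f} MB/s")
```

Whatever the capacity, moving from the 24-hour to the four-hour schedule multiplies the average scrub rate by six under this model.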
PAGE 10
sparing failover. The failover process consists of checking the health of the spare rank(s) through patrol scrubbing, then seamlessly copying the contents of the degraded rank to the spare rank(s). Memory rank sparing is disabled by default and can be enabled in BIOS setup if required.
PAGE 11
o E.g. One 32 GB RDIMM (2Rx4) and one 16 GB RDIMM (2Rx8) installed = two 16 GB ranks and two 8 GB ranks. Both 16 GB ranks will be held as spares, resulting in a 66% capacity reduction.
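The arithmetic behind this example can be reproduced with a short sketch. It assumes that the largest ranks are the ones reserved as spares, which matches the example above but is not stated as a general rule in this excerpt; the function and parameter names are hypothetical.

```python
def sparing_capacity_reduction(rank_sizes_gb, spare_ranks):
    """Fraction of installed capacity reserved when the largest ranks are held as spares."""
    if not 0 < spare_ranks < len(rank_sizes_gb):
        raise ValueError("spare_ranks must be between 1 and the rank count minus 1")
    ranks = sorted(rank_sizes_gb, reverse=True)  # largest ranks first
    reserved = sum(ranks[:spare_ranks])          # capacity held back as spares
    return reserved / sum(ranks)


# One 32 GB 2Rx4 RDIMM (two 16 GB ranks) plus one 16 GB 2Rx8 RDIMM (two 8 GB ranks).
reduction = sparing_capacity_reduction([16, 16, 8, 8], spare_ranks=2)
print(f"Reserved fraction: {reduction:.3f}")
```

With two 16 GB ranks reserved out of 48 GB installed, the reserved fraction is 32/48, or about 66.7%, which the example above quotes as 66%.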
PAGE 12
Important: Consult your PowerEdge server installation and service manual for complete memory population guidelines to properly enable Memory Mirroring.
PAGE 13
Every memory channel must be populated with the same number of DIMMs, either one DIMM per channel or two DIMMs per channel (for example, 24 DIMM systems should have 12 DIMMs or 24 DIMMs installed). Fault Resilient Memory is disabled by default and must be enabled through the BIOS setup menu.

Important: Consult your PowerEdge server installation and service manual for complete memory population guidelines to properly enable Fault Resilient Memory.
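The population rule stated above lends itself to a quick sanity check before changing BIOS settings. The following is a minimal sketch that covers only the "all one DIMM or all two DIMMs" rule; it does not replace the full population guidelines in the service manual, and the function name and input format (a list of DIMM counts, one entry per memory channel) are assumptions.

```python
def uniform_population(dimms_per_channel):
    """Return True if every channel holds the same DIMM count and that count is 1 or 2."""
    if not dimms_per_channel:
        return False
    counts = set(dimms_per_channel)
    return len(counts) == 1 and dimms_per_channel[0] in (1, 2)


# A 24-slot system modeled as 12 channels with up to two DIMMs each.
print(uniform_population([1] * 12))            # 12 DIMMs installed
print(uniform_population([2] * 12))            # 24 DIMMs installed
print(uniform_population([2] * 6 + [1] * 6))   # 18 DIMMs, mixed population
```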
PAGE 14
Figure 7 - PPR for a row in a bank group of a 4Gb x4 device

PPR is always available on PowerEdge server platforms that support it and, if deemed necessary by BIOS, will automatically execute after a system cold reboot. For PPR to successfully execute, it is recommended that users do not swap or replace DIMMs between boots when receiving memory error event messages, unless instructed to do so by Dell technical support personnel.
PAGE 15
• If the impacted data was in user/application/VM memory, then the OS will terminate the associated process or VM without impacting the rest of the system.
• If the impacted data was in user/application/VM memory but the OS had a redundant copy of the data, then the associated process or VM will recover.

Consult your operating system documentation on error containment for more information on OS behaviors.
PAGE 16
o Benefit: Patrol scrub will run every four hours (instead of 24); the increased frequency will reduce the accumulation of errors in areas of memory that have low utilization and are therefore not being corrected by demand scrub.

It is also recommended that users keep their PowerEdge server firmware up to date, especially server BIOS. This is because even after products ship, PowerEdge server development continuously works to improve its RAS algorithms and behaviors for an optimal customer experience.
PAGE 17
location (note that BIOS may initiate more reboots during this process). Do not remove or swap the DIMM at the specified location in the event message.
• MEM0804 – This is an indication that the system has successfully performed memory self-healing at the specified DIMM location in the event message.
o Recommended Response Action: No response required. DIMM is operating nominally.
PAGE 18
• PowerEdge T440*
• PowerEdge T640
• PowerEdge C4140
• PowerEdge C6420
• PowerEdge XR2*
• PowerEdge R440*
• PowerEdge R540*
• PowerEdge R640
• PowerEdge R740
• PowerEdge R740xd
• PowerEdge R740xd2
• PowerEdge R840
• PowerEdge R940
• PowerEdge R940xa
• PowerEdge FC640
• PowerEdge M640
• PowerEdge MX740c
• PowerEdge MX840c
• PowerEdge XE2420*
• PowerEdge XE7420*
• PowerEdge XE7440*

The following VxRail platforms are leveraged from PowerEdge YX4X servers with Xeon SP processors and are therefore also cov
PAGE 19
What’s New in BIOS 2.8.2

• Self-Healing on Uncorrectable Errors – Prior to this update, PowerEdge server BIOS was capable of performing self-healing only whenever its health monitoring algorithms deemed it necessary. With this PowerEdge server BIOS release, if the CPU detects an uncorrectable error, the server will automatically schedule self-healing to occur on the next cold reboot of the server.
PAGE 20
Legal Notices

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel and Xeon are trademarks of Intel Corporation or its subsidiaries.