Whitepaper
Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features
Revision: 1.0
Issue Date: 1/3/2020

Introduction
Memory sub-system errors are some of the most common types of errors seen on modern computing systems. Understanding how memory errors occur and how to prevent or avoid them can be a complex subject, one that has challenged countless industry researchers and developers over the last 30 years.
Revisions
Date              Description
January 3, 2020   Initial release

Author
Name          Role
Jordan Chin   Memory Technologist, Distinguished Member Technical Staff, Dell EMC

Acknowledgements
This paper had contributions from the following people:
Name              Role
Stuart Berke      CPU and Memory Technologist, VP, Fellow, Dell EMC
David Chalfant    BIOS Development, Technical Staff, Dell EMC
Huong Nguyen      BIOS Development, Technical Staff, Dell EMC
Ching-Lung Chao   BIOS Development, Technical Staff, Dell EMC
Fred Spreeuwers
A Primer on Memory Errors
To fully understand the memory RAS response capabilities of PowerEdge servers, it helps to first understand the various types of possible memory errors. DRAM issues can be broadly classified into two categories, described below:
o Soft Errors
  o Soft errors are transient in nature and may often be caused by electrical disturbances in the memory sub-system components.
a memory module replacement. However, some server competitors will go as far as to say that an indefinite number of correctable errors is acceptable, a belief that is not shared by Dell Engineering. Instead, PowerEdge server firmware intelligently monitors the health of memory and recommends self-healing action or module replacement based on a variety of factors, including DIMM capacity, rates of correctable errors, and the effectiveness of available self-healing.
bits). This means that any one bit among the 72 bits accessed from DRAM can be incorrect and PowerEdge server hardware will automatically correct it, regardless of cause.
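The single-bit correction property described above can be illustrated with a textbook extended Hamming code. The sketch below is not the code used by PowerEdge hardware or any particular memory controller; it assumes the common split of 64 data bits plus 8 check bits within the 72-bit word, and the bit ordering, function names, and status strings are illustrative choices only.

```python
# Illustrative (72,64) extended Hamming SEC-DED code.
# Assumption: 64 data bits + 8 check bits; this is NOT the actual ECC
# implemented by any specific server or memory controller.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
CODEWORD_BITS = 72  # index 0 = overall parity, indices 1..71 = Hamming code
DATA_POSITIONS = [p for p in range(1, CODEWORD_BITS) if p not in PARITY_POSITIONS]


def encode(data):
    """Encode a 64-bit integer into a list of 72 bits."""
    bits = [0] * CODEWORD_BITS
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    # Each parity bit p covers every position whose index has bit p set.
    for p in PARITY_POSITIONS:
        bits[p] = sum(bits[pos] for pos in range(1, CODEWORD_BITS) if pos & p) % 2
    # Overall parity bit makes the total number of set bits even.
    bits[0] = sum(bits) % 2
    return bits


def decode(received):
    """Return (data, status); data is None when the word is uncorrectable."""
    bits = list(received)
    # The syndrome is the XOR of the indices of all set bits; it is zero
    # for a valid codeword and equals the error index for a single flip.
    syndrome = 0
    for pos in range(1, CODEWORD_BITS):
        if bits[pos]:
            syndrome ^= pos
    odd_parity = sum(bits) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome < CODEWORD_BITS:
        # Odd overall parity means a single flipped bit at index `syndrome`
        # (index 0 is the overall parity bit itself).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome indicates a double-bit error.
        return None, "uncorrectable"
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status
```

Flipping any one of the 72 bits of an encoded word and passing it to `decode` returns the original data with the status `"corrected"`, while flipping two bits is reported as `"uncorrectable"` rather than silently miscorrected.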
[Figure: cache line bit layout diagram; bit-position labels omitted]
As described earlier, SSC-DSD implementations will vary based on CPU platform architecture and generation. This results in different error correction coverage and memory configuration requirements to enable Advanced ECC. The current SSC-DSD coding implementation in PowerEdge YX4X servers with AMD EPYC 7xx1 processors will provide data correction on all error patterns within a single symbol.
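To make "all error patterns within a single symbol" concrete, the sketch below groups flipped bit positions into symbols and classifies the pattern the way a single-symbol-correct, double-symbol-detect code would in principle. The 8-bit symbol width, the linear bit-to-symbol mapping, and the function names are assumptions made for illustration; actual symbol sizes and mappings depend on the CPU platform and are not given here.

```python
# Hypothetical symbol width; real widths vary by platform and ECC scheme.
SYMBOL_BITS = 8


def symbols_touched(flipped_bits, symbol_bits=SYMBOL_BITS):
    """Return the set of symbol indices containing at least one flipped bit."""
    return {bit // symbol_bits for bit in flipped_bits}


def classify(flipped_bits, symbol_bits=SYMBOL_BITS):
    """Classify an error pattern by the number of symbols it spans."""
    count = len(symbols_touched(flipped_bits, symbol_bits))
    if count == 0:
        return "no error"
    if count == 1:
        # Any number of bad bits inside one symbol is a single-symbol error.
        return "correctable (single symbol)"
    if count == 2:
        return "detectable (double symbol)"
    return "outside guaranteed coverage"
```

Under these assumptions, `classify([8, 9, 15])` is a single-symbol pattern because all three bits fall in symbol 1, whereas `classify([7, 8])` spans two symbols even though only two adjacent bits are wrong.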
operate with SDDC coverage. At this point, memory performance will be impacted because the memory controller must perform two reads for every read to the mapped-out cache lines. FYI: ADDDC only provides fault coverage for sequential DRAM failures over time. Two parallel DRAM failures within the same memory access still result in a service outage.
Memory Page Retire (MPR) is a feature implemented by PowerEdge server BIOS that instructs operating systems to stop using memory page locations (4 KB in size) that BIOS has deemed potentially unhealthy, essentially removing them from the operating system’s memory pool. BIOS determines that a memory page is potentially unhealthy using a proprietary PowerEdge server algorithm that takes into account correctable error patterns and error rates at a given memory page location.
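As a rough illustration of the 4 KB page granularity, the sketch below maps a physical address to its page and keeps a per-page count of correctable errors. The PowerEdge algorithm is proprietary and is not reproduced here: the simple fixed threshold, the `PageErrorTracker` class, and all names are hypothetical stand-ins chosen only to show the bookkeeping involved.

```python
from collections import Counter

PAGE_SIZE = 4096   # 4 KB pages, as described in the text
PAGE_SHIFT = 12    # log2(4096)


def page_frame_number(phys_addr):
    """Return the index of the 4 KB page containing a physical address."""
    return phys_addr >> PAGE_SHIFT


def page_bounds(phys_addr):
    """Return the first and last byte address of the containing page."""
    start = phys_addr & ~(PAGE_SIZE - 1)
    return start, start + PAGE_SIZE - 1


class PageErrorTracker:
    """Hypothetical counter-based policy; NOT the proprietary BIOS algorithm."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed fixed count, for illustration only
        self.counts = Counter()
        self.retired = set()

    def record_correctable_error(self, phys_addr):
        """Count an error and report whether its page is now marked retired."""
        pfn = page_frame_number(phys_addr)
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            self.retired.add(pfn)
        return pfn in self.retired
```

A real policy would also weigh error patterns and rates over time, as the text notes; a bare counter is used here only because it keeps the page arithmetic easy to follow.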
[Figure 5 diagram: CPU with DIMM slots A1 through A12; four physical ranks, one marked as the spare rank]
Figure 5 - Example of two 16GB (2Rx8) RDIMMs with one rank held as spare

In order to support single rank sparing, a system must be populated with at least two memory ranks per memory channel. The memory capacity reduction due to rank sparing is based on the memory configuration (number of ranks per channel and size of ranks).
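The capacity cost of single rank sparing can be worked through with simple arithmetic. The sketch below assumes that both RDIMMs in the Figure 5 example share one memory channel, that a 16GB dual-rank DIMM contributes two 8 GB ranks, and that the largest rank in the channel is the one reserved as the spare; the function name and that selection rule are illustrative assumptions rather than a statement of BIOS behavior.

```python
def capacity_with_rank_sparing(rank_sizes_gb):
    """Return (usable_gb, fraction_lost) for one channel with one spare rank.

    rank_sizes_gb lists the size of every rank populated on the channel.
    Assumption: the largest rank is held back as the spare.
    """
    if len(rank_sizes_gb) < 2:
        # The text requires at least two ranks per channel for sparing.
        raise ValueError("rank sparing needs at least two ranks per channel")
    total = sum(rank_sizes_gb)
    spare = max(rank_sizes_gb)
    return total - spare, spare / total


# Two 16GB 2Rx8 RDIMMs on one channel: four ranks of 8 GB each.
usable_gb, fraction_lost = capacity_with_rank_sparing([8, 8, 8, 8])
```

For this configuration, 8 GB of the 32 GB installed is held in reserve, leaving 24 GB usable (a 25% reduction). With only two ranks on the channel, the same rule would reserve half of the installed capacity.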
Memory Mirroring

Memory Mirroring Feature Support Table
Platforms Supported: Intel Platforms: (Xeon SP Families Only) AMD Platforms:
DIMMs Supported: x4 DIMMs: x8 DIMMs:
Memory Configuration Required:
• All identical DIMMs
• Memory channels must be populated as either all one DIMM per channel or two DIMMs per channel

Memory Mirroring is a memory RAS feature available on Intel platforms that provides the highest level of protection against memory errors, including uncorrectable errors, at the cost
Fault Resilient Mode (FRM)

Fault Resilient Mode Feature Support Table
Platforms Supported: Intel Platforms: (Xeon SP Families Only) AMD Platforms:
DIMMs Supported: x4 DIMMs: x8 DIMMs:
Memory Configuration Required:
• Memory channels must be populated as either all one DIMM per channel or two DIMMs per channel

FYI: Dell has published a separate technical whitepaper specifically for Fault Resilient Mode.
available spare rows. This is done to ensure that PowerEdge servers have a robust self-healing memory ecosystem. When the server platform determines that a DRAM row has one or more faulty cells, it can instruct the DRAM to electrically swap out the old row and replace it with a new one. This happens through electrical fusing and is a permanent process. Additionally, the PPR process can only occur at the beginning of a boot process – before memory training and test can occur.
Machine Check Architecture Recovery, or MCA Recovery, is an advanced RAS feature that, when used in conjunction with supported operating systems, can prevent some types of uncorrectable memory errors from crashing the entire system. Essentially, the processor’s memory controller detects an uncorrectable error, signals to the OS that the error was detected in a given memory page, and allows the OS to gracefully contain the issue.
• Downside: 25% memory capacity reduction, available for VMware vSphere 5.
Applicable Platforms The following platforms are considered PowerEdge YX4X servers and are therefore covered by this document: Important: Subsequent to the publication of this document, Dell may continue to add products to its YX4X server lineup. If a product is not listed below, please consult with a Dell sales or support representative to confirm the server generation.
References
[1] P. Restle, J. Park and B. Lloyd, "DRAM Variable Retention Time," IEEE, 1992.
[2] K. Aadithya, A. Demir, S. Venugopalan and J. Roychowdhury, "Accurate Prediction of Random Telegraph Noise Effects in SRAMs and DRAMs," IEEE, 2013.
[3] "Reed–Solomon error correction," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction.
[4] V. Sridharan and D. Liberty, "A study of DRAM failures in the field," IEEE, 2012.
[5] A. Hwang, S. Ioan and B.