Administrator Guide

Alerts
11 Prefailure alerts provided by Dell EMC PowerEdge server systems management | ID 426
2.3 Memory alerts
With the growing importance of memory in today’s compute environments, Dell is taking steps beyond the
standard monitoring and alerting on memory errors. In addition to the standard alerts, Dell has pioneered the
following solutions: Memory Page Retire and Fault Resilient Memory.
2.3.1 Memory Page Retire
All Dell EMC servers ship standard with Error-Correcting Code (ECC), a first line of defense against errors in
system memory. ECC detects single-bit errors in memory and automatically corrects them, keeping the
system running smoothly. Most ECC-corrected errors are not isolated: addresses that experience an error
correction once tend to experience one again. If these errors cascade into nearby bits, they can expand
beyond what ECC can correct. The operating system then encounters an uncorrectable error and the
system fails. With the introduction of iDRAC7-based systems, Dell EMC worked with hypervisor partners Microsoft
and VMware to introduce Memory Page Retire (MPR).
The basic flow of MPR is:
1. The hypervisor monitors baseline ECC memory faults.
2. If a region produces recoverable errors beyond a set threshold, that section, or page, is
retired.
3. After 64 Kb of page retires have occurred, the event is logged in the system event log.
4. The address and adjoining space are mapped off and become unavailable to the hypervisor.
5. The defective memory can be replaced during scheduled service time.
Memory Page Retire is supported on Microsoft Windows Server 2012 R2 and on VMware ESXi 5.1 U1 and later.
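The flow above can be illustrated with a small simulation. This is only a sketch of the general idea, not the hypervisor's actual implementation: the class name `PageRetireTracker`, the page size, the error threshold, and the logging limit are all hypothetical values chosen for illustration (the logging limit is one possible reading of the "64 Kb" figure in step 3).

```python
from collections import Counter

# All constants below are illustrative assumptions, not documented values.
PAGE_SIZE = 4096              # bytes per memory page
ERROR_THRESHOLD = 3           # correctable errors tolerated before a page is retired
LOG_AFTER_BYTES = 64 * 1024   # retired total at which an event is logged


class PageRetireTracker:
    """Toy model of the Memory Page Retire flow (hypothetical helper)."""

    def __init__(self, threshold=ERROR_THRESHOLD, page_size=PAGE_SIZE,
                 log_after=LOG_AFTER_BYTES):
        self.threshold = threshold
        self.page_size = page_size
        self.log_after = log_after
        self.error_counts = Counter()  # correctable errors seen per page
        self.retired = set()           # page numbers mapped off
        self.event_log = []            # stand-in for the system event log

    def record_correctable_error(self, address):
        """Count one ECC-corrected error; return True if its page was just retired."""
        page = address // self.page_size
        if page in self.retired:
            return False
        self.error_counts[page] += 1
        if self.error_counts[page] < self.threshold:
            return False
        # Retiring the whole page also maps off the space adjoining the address.
        self.retired.add(page)
        retired_bytes = len(self.retired) * self.page_size
        if retired_bytes >= self.log_after and not self.event_log:
            self.event_log.append(
                f"memory page retire: {retired_bytes} bytes retired")
        return True

    def is_available(self, address):
        """Return False for any address inside a retired page."""
        return address // self.page_size not in self.retired
```

In this model a page stays in service until it crosses the threshold, and the event log receives a single entry once the retired total reaches the limit, which is the signal an administrator would use to schedule a memory replacement.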
2.3.2 Fault Resilient Memory
Within a virtualization environment, the hypervisor is the brain that sits below the virtual machines, controlling
the server resources and distributing them as needed. Hypervisors are exposed to uncorrectable memory
errors like any other operating system. However, if a hypervisor fails, it generally brings down more than
one application. Fault Resilient Memory (FRM) is a patented technology that Dell EMC introduced to
provide more resilient memory protection for the hypervisor.
FRM works with VMware vSphere v5.5 and higher, which supports it through the Reliable Memory feature.
FRM creates a fault-resilient memory zone for the hypervisor within socket 0 and communicates the address
of that zone to the hypervisor. The ESXi hypervisor in vSphere v5.5 or higher looks for this address
communication and, if it is found, places itself in the protected zone. The protection FRM provides is as robust as
Memory Mirroring, without requiring a full 50% of system memory. An uncorrectable error that occurs in
socket 0 is logged as a System Event. This log event gives administrators time to become aware of the
issue. They can place the system in Maintenance Mode to move running VMs off the host, and then deal with the
memory module showing errors.
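The placement step can be pictured as a lookup over the memory map that the platform reports. The sketch below is a simplified model under stated assumptions: `MemoryRange`, its `reliable` flag, and `find_protected_zone` are hypothetical names, and the real firmware-to-hypervisor interface is not described in this guide.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MemoryRange:
    """One entry of a simplified memory map (hypothetical structure)."""
    start: int      # first byte address of the range
    size: int       # length of the range in bytes
    socket: int     # CPU socket that owns the memory
    reliable: bool  # True if the platform advertises the range as fault resilient


def find_protected_zone(memory_map: Iterable[MemoryRange],
                        needed_bytes: int) -> Optional[MemoryRange]:
    """Return the first advertised fault-resilient range that is large enough.

    None means no suitable protected zone was advertised; ordinary
    placement would then apply and is not modeled here.
    """
    for rng in memory_map:
        if rng.reliable and rng.size >= needed_bytes:
            return rng
    return None
```

A caller would pass the reported memory map and the hypervisor's footprint, then load into the returned range when one exists.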