Administrator Guide

Alerts

11 Prefailure alerts provided by Dell EMC PowerEdge server systems management | ID 426

2.3 Memory alerts

With the growing importance of memory in today’s compute environment, Dell is taking steps beyond
standard monitoring and alerting on memory errors. In addition to the standard alerts, Dell has
pioneered the following solutions: Memory Page Retire and Fault Resilient Memory.

2.3.1 Memory Page Retire

All Dell EMC servers ship standard with Error-Correcting Code (ECC), a first line of defense against
errors in system memory. ECC looks for single-bit errors in memory and automatically corrects them,
keeping the system running smoothly. Most ECC-corrected errors are not isolated: addresses that
experience an error correction once tend to experience one again. If these errors cascade into nearby
bits, they can grow beyond what ECC can correct. The operating system then has to process an
uncorrectable error, and the system fails. With the introduction of iDRAC7-based systems, Dell EMC
worked with hypervisor partners Microsoft and VMware to introduce Memory Page Retire (MPR).
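To make the idea of single-bit correction concrete, the following is a minimal sketch using a Hamming(7,4) code. It is an illustration only: server DIMMs use wider codes over much larger words, and the function names here are hypothetical, not part of any Dell EMC or hypervisor interface. The sketch shows a single flipped bit being located and corrected, which is the case ECC handles.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (0/1 values) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of a single flipped bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


original = [1, 0, 1, 1]
stored = hamming74_encode(original)

# Simulate a single-bit memory fault at codeword position 5.
faulty = list(stored)
faulty[4] ^= 1
recovered, position = hamming74_correct(faulty)
```

In this sketch a second flipped bit in the same codeword would be mis-corrected rather than repaired, which mirrors the point in the text that errors spreading to nearby bits can exceed what ECC can deal with.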

The basic flow of MPR is:

1. The hypervisor monitors baseline ECC memory faults.

2. If a region produces recoverable errors beyond a set threshold, that section, or page, is retired.

3. After 64 Kb of page retires have occurred, the event is logged in the system event log.

4. The address and adjoining space are mapped out and made unavailable to the hypervisor.

5. The defective memory can be replaced during scheduled service time.
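The flow above can be sketched as a small bookkeeping class. This is a hypothetical model, not the hypervisor's actual implementation: the class name, the 4096-byte page size, the per-page error threshold, and the reading of the logging threshold as 64 kilobytes are all assumptions made for illustration.

```python
from collections import Counter


class PageRetireTracker:
    """Hypothetical model of threshold-based memory page retirement."""

    def __init__(self, page_size=4096, error_threshold=3, log_after_bytes=64 * 1024):
        self.page_size = page_size              # assumed page size in bytes
        self.error_threshold = error_threshold  # assumed corrected-error limit per page
        self.log_after_bytes = log_after_bytes  # assumed reading of the logging threshold
        self.error_counts = Counter()
        self.retired_pages = set()
        self.event_log = []
        self._logged = False

    def record_corrected_error(self, address):
        """Count a corrected ECC error; return True if its page was just retired."""
        page = address // self.page_size
        if page in self.retired_pages:
            return False
        self.error_counts[page] += 1
        if self.error_counts[page] < self.error_threshold:
            return False
        # Threshold reached: map the page out so it is no longer used.
        self.retired_pages.add(page)
        retired_bytes = len(self.retired_pages) * self.page_size
        if not self._logged and retired_bytes >= self.log_after_bytes:
            self.event_log.append(
                f"{retired_bytes} bytes of memory retired; schedule memory service"
            )
            self._logged = True
        return True

    def is_available(self, address):
        """Return False for any address that falls inside a retired page."""
        return address // self.page_size not in self.retired_pages


# Example: retire a page after two corrected errors, log once 8 KiB is retired.
tracker = PageRetireTracker(error_threshold=2, log_after_bytes=8192)
tracker.record_corrected_error(0x1000)
first_retired = tracker.record_corrected_error(0x1004)
```

Tracking errors per page rather than per address reflects the observation that errors tend to recur and spread nearby, so the whole page is taken out of service once its count crosses the threshold.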

Memory Page Retire is supported on Microsoft Windows Server 2012 R2 and later, and on VMware ESXi 5.1 U1 and later.

2.3.2 Fault Resilient Memory

Within a virtualization environment, the hypervisor is the brain that sits below the virtual machines,
controlling the server resources and distributing them as needed. Hypervisors are exposed to
uncorrectable memory errors like any other operating system. However, if a hypervisor fails, it
generally brings down more than one application. Fault Resilient Memory (FRM) is a patented technology
that Dell EMC introduced to create more resilient memory protection for the hypervisor.

FRM works with VMware vSphere v5.5 and higher, whose Reliable Memory feature cooperates with FRM.
FRM creates a fault-resilient memory zone for the hypervisor within socket 0 and communicates the
address of that zone to the hypervisor. The ESXi hypervisor in vSphere v5.5 or higher looks for this
address communication and, if it is found, places itself in the protected zone. The protection FRM
provides is as robust as Memory Mirroring, without requiring a full 50% of system memory. An
uncorrectable error that occurs in socket 0 is logged as a System Event. This log event gives
administrators time to become aware of the issue: they can place the system in Maintenance Mode to
clear off running VMs, and then deal with the memory module showing errors.
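The placement decision described for FRM can be sketched as follows. The source does not describe how the protected zone's address is communicated, so the MemoryRegion type, its reliable flag, and the function name are hypothetical stand-ins; the sketch only shows the "look for a protected zone and use it if found" logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryRegion:
    """Hypothetical description of a memory range reported at boot."""
    start: int
    size: int
    reliable: bool  # True if the platform advertised this range as fault resilient


def choose_hypervisor_region(regions: List[MemoryRegion],
                             required_size: int) -> Optional[MemoryRegion]:
    """Prefer an advertised fault-resilient region; otherwise fall back to first fit."""
    for region in regions:
        if region.reliable and region.size >= required_size:
            return region
    for region in regions:
        if region.size >= required_size:
            return region
    return None


GIB = 1024 ** 3
memory_map = [
    MemoryRegion(start=0, size=4 * GIB, reliable=False),
    MemoryRegion(start=4 * GIB, size=8 * GIB, reliable=True),
]
chosen = choose_hypervisor_region(memory_map, required_size=2 * GIB)
```

The fallback path models a platform without FRM enabled, where no region is flagged and the hypervisor loads into ordinary memory as usual.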