White Paper
Prefailure alerts provided by Dell EMC PowerEdge server systems management
Abstract
Discover the various methods by which OpenManage tools can help provide better server uptime with prefailure alerts.
Revisions
Date            Description
September 2015  Initial release
June 2019       First Revision
June 2020       Second Revision

Acknowledgments
Authors: Aparna Giri, Damon Earley, Doug Iler, Jeff Krebs, Lori Matthews, Vish Balakrishnan
Contributors: John Abrams

The information in this publication is provided “as is.” Dell Inc.
Table of contents
Revisions
Acknowledgments
Table of contents
Executive summary
Receiving and reacting to alerts for possible component issues is a critical task for any IT admin. Dell EMC PowerEdge servers provide a wide range of alerts using the integrated Dell Remote Access Controller (iDRAC) and other elements of the OpenManage portfolio. The iDRAC monitors the status of critical subsystems and notifies system administrators about any warning and critical threshold events.
1 Introduction
The acronym PFA stands for prefailure alert or predictive failure analysis. Originally, PFAs focused on hard drives. The goal was, and still is, to avoid unplanned downtime. Over the years, PFAs have grown beyond hard drives, and now include many other components in the server. This expanded coverage has become increasingly important with the rise of virtualization. Today, multiple virtual servers can depend on the same underlying physical hardware.
• Health status
• Warning and Failure alerts
• Redundancy Warning and Failure alerts
• Predictive Failure alerts

Actions in response to these alerts include:
• SNMP traps
• Email alerts
• Redfish eventing
• IPMI events
• All events are logged and can be exported manually or remotely using Remote Syslog.
  o From the iDRAC GUI, IT administrators can export the full Lifecycle Controller log and review it.
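As an illustration of how a receiver might triage events delivered through Redfish eventing, the following minimal sketch filters an event payload by severity. It assumes the payload follows the general DMTF Redfish event layout (an "Events" array whose records carry "MessageId", "Severity", and "Message"). The function name, the sample message IDs, and the sample messages are hypothetical placeholders, not actual iDRAC output.

```python
import json

# Ranking used for filtering and sorting. The strings mirror the
# Redfish health values; anything unrecognized is treated as "OK".
SEVERITY_RANK = {"OK": 0, "Warning": 1, "Critical": 2}


def rank(severity):
    """Map a severity string to a sortable number (unknown -> 0)."""
    return SEVERITY_RANK.get(severity, 0)


def summarize_events(payload, minimum="Warning"):
    """Return events at or above `minimum` severity, most severe first.

    `payload` is the JSON text of a Redfish-style event (assumed layout).
    """
    body = json.loads(payload)
    floor = rank(minimum)
    selected = []
    for record in body.get("Events", []):
        severity = record.get("Severity", "OK")
        if rank(severity) >= floor:
            selected.append({
                "id": record.get("MessageId", ""),
                "severity": severity,
                "message": record.get("Message", ""),
            })
    # Stable sort: events of equal severity keep their arrival order.
    selected.sort(key=lambda event: rank(event["severity"]), reverse=True)
    return selected


if __name__ == "__main__":
    # Hypothetical sample payload; IDs and messages are placeholders.
    sample = json.dumps({
        "Events": [
            {"MessageId": "Example.1.0.FanOk", "Severity": "OK",
             "Message": "Fan is operating normally."},
            {"MessageId": "Example.1.0.DrivePredictive", "Severity": "Warning",
             "Message": "Predictive failure reported for a drive."},
            {"MessageId": "Example.1.0.PsuLost", "Severity": "Critical",
             "Message": "Power supply redundancy is lost."},
        ]
    })
    for event in summarize_events(sample):
        print(event["severity"], event["id"], event["message"])
```

A receiver built this way could forward only the warning and critical records to an on-call channel while still logging the full payload elsewhere.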
2 Alerts
This section reviews the various devices and alerts in greater detail.
o Drives – Hard Disk Drive and Solid-State Drive
o CPU
o Memory
o Temperature
o Fans
o Power supplies
o GPUs (requires iDRAC9 firmware 4.00 or higher)
o SFP I/O (requires iDRAC9 firmware 4.00 or higher)

The iDRAC home page, or dashboard, provides a quick view of the health status of the server and storage.
If there were an issue with the server, the ‘details’ link would show the issue as listed in the Lifecycle Controller log. The “System” page shows an extended view of the various components and their status. This visual provides an IT admin with quick access to key components. For example, if a warning or critical error happens in “cooling,” the icon would change color. The admin can choose that icon to directly access details to pinpoint and correct a warning or critical alert.
2.1 Drive alerts
Drive alerts are based on the SMART industry-standard specification for system drives. SMART drives are engineered to provide early warning of certain drive failure indicators. These indicators are meant to give advance warning of certain types of failures. These warnings do not cover failures caused by defective components, improper handling, or static electricity discharge. However, roughly 60% of drive failures are due to gradual wear and tear.
2.2 System Processor (CPU) alerts
Servers have multiple CPUs, each with multiple cores, and are typically used for virtualization and high-performance applications. As system uptime service level requirements have become increasingly stringent, CPU manufacturing and testing processes have become correspondingly sophisticated. CPU faults are typically unrecoverable errors. If CPU errors occur frequently, certain problems such as L2 cache error corrections can lead to server failure.
2.3 Memory alerts
With the growing importance of memory in today’s compute environment, Dell is taking steps beyond the standard monitoring and alerting on memory errors. In addition to the standard alerts, Dell has pioneered the following solutions: Memory Page Retire and Fault Resilient Memory.

2.3.1 Memory Page Retire
All Dell EMC servers ship standard with Error-Correcting Code (ECC), a first line of defense on errors in system memory.
2.4 Temperature and fan alerts
Temperature alerts provide advance warning that the ambient temperature has reached or exceeded a preset temperature range. Dell offers a wide array of alerts and other technologies that help monitor and manage temperature alerts. Temperatures are monitored at the server CPU and at the system board inlet. Fans and blowers are well placed within the chassis to provide maximum cooling.
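The relationship between a temperature reading and its preset warning and critical ranges can be sketched as follows. This is a minimal illustration only: the `Thresholds` type, the `classify` function, and the numeric limits are hypothetical and are not taken from any PowerEdge specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Lower and upper limits, in degrees Celsius, for one sensor."""
    lower_critical: float
    lower_warning: float
    upper_warning: float
    upper_critical: float


def classify(reading_c, limits):
    """Return "OK", "Warning", or "Critical" for a temperature reading.

    A reading that is at or beyond a limit counts as having crossed it.
    """
    if reading_c <= limits.lower_critical or reading_c >= limits.upper_critical:
        return "Critical"
    if reading_c <= limits.lower_warning or reading_c >= limits.upper_warning:
        return "Warning"
    return "OK"


# Hypothetical limits for a system board inlet sensor (illustrative values).
INLET_LIMITS = Thresholds(
    lower_critical=-7.0,
    lower_warning=3.0,
    upper_warning=38.0,
    upper_critical=42.0,
)

if __name__ == "__main__":
    for reading in (25.0, 38.0, 45.0):
        print(reading, classify(reading, INLET_LIMITS))
```

Checking the critical limits before the warning limits ensures that a reading past both is reported at the higher severity.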
2.5 Power Supply alerts
Power-conditioning uninterruptible power supplies are a highly cost-effective step in defending servers from electrical dangers to their sophisticated and delicate electronic circuits. Once power going to servers has been conditioned properly, the next critical server subsystems to protect are their power supplies. Dell EMC PowerEdge servers are designed to offer redundant power supplies.
3 Beyond Alerts – policy-based actions
Getting an alert is one thing; acting on it is another. Some companies still have IT staff patrol the aisles of their data center, taking note of flashing or amber lights. This process is time-consuming and carries the risk of overlooked or missed information. For customers with a smaller IT shop and only a few servers, the emailed alert option from iDRAC can be an effective solution.
The following image shows the ‘wizard’ that helps create a policy action based on an event. OpenManage Enterprise Power Manager is a plug-in to OpenManage Enterprise. Power Manager uses power capping to ensure power for a group of servers remains within the envelope the customer sets for the server group. An admin defines a group of servers by rack, row, or room of a data center.
Step 4 shows the details of the “Policy Schedule.” Step 5 is the summary slide.

3.2 Alerts and Partner Consoles
For virtualized environments running Microsoft Hyper-V or VMware ESXi, Dell EMC offers OpenManage integrations with both System Center Virtual Machine Manager (SCVMM) and VMware vCenter. This integration allows customers to set different actions that are based on alert type and severity.
Alert integration is also available for VMware customers. The Dell EMC OpenManage Integration for VMware vCenter inserts custom alarm definitions that enable administrators to remediate failures in an automated fashion, as shown below. By integrating into the VMware “events, alarms, and actions” mechanism, administrators can see and react to hardware errors. Options include placing the server in maintenance mode, running a batch file, or sending an email.
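The idea of choosing an action from the alert type and severity can be sketched as a small lookup table. This is an illustrative sketch only: the rule table, the category names, and the action labels are hypothetical, and the code does not call any OpenManage or vCenter API.

```python
# Hypothetical policy: (alert type, severity) -> action label.
# The three labels mirror the options named in the text; the mapping
# itself is an assumption made for illustration.
POLICY = {
    ("storage", "Critical"): "enter_maintenance_mode",
    ("storage", "Warning"): "send_email",
    ("power", "Critical"): "run_batch_file",
}

# Fallback for warning or critical alerts with no explicit rule.
DEFAULT_ACTION = "send_email"


def choose_action(alert_type, severity):
    """Return the action label for an alert, or None if no action is needed."""
    if severity == "OK":
        return None
    return POLICY.get((alert_type, severity), DEFAULT_ACTION)


if __name__ == "__main__":
    print(choose_action("storage", "Critical"))
    print(choose_action("cooling", "Warning"))
    print(choose_action("power", "OK"))
```

Keeping the rules in a table rather than in nested conditionals makes it straightforward to review which alerts trigger disruptive actions such as maintenance mode.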
The following image shows the page for enabling alarms and events in the OpenManage Integration for VMware vCenter console. The next image shows the alarm definitions at the host level in vCenter.
4 Conclusion
Dell EMC delivers extensive PFA and performance monitoring technology through the iDRAC embedded in every PowerEdge server. The iDRAC and the comprehensive OpenManage portfolio provide effective, proactive management that is designed to make IT administrators more effective and efficient. PFA monitoring and alerts from iDRAC are the first steps in this process.
A Technical support and resources
The iDRAC support home page provides access to product documents, technical white papers, how-to videos, and more.
www.dell.com/support/idrac

iDRAC User Guides and other manuals
www.dell.com/idracmanuals

Dell Technical Support
www.Dell.