White Paper on Dynamic Processor Deallocation and Dynamic Processor Resilience


Dynamic Processor Deallocation

And

Dynamic Processor Resilience

White Paper

April 16, 2002

Revision 1.07

Summary of content (11 pages)

PAGE 2
Revision Information

Revision 1.03: Initial revision reflecting the June 1999 IPR release.

Revision 1.04: Reflects changes released in the September 1999 Support Plus release. Dynamic Processor Deallocation was enhanced to keep processors deallocated even if the system is rebooted. The STM System Information Tool was enhanced to reflect the current state of all processors on the system.

Revision 1.05: Information added to reflect changes made in support of the Instant Capacity on Demand (iCOD) product.
PAGE 3
CONTENTS

INTRODUCTION, page 4
TYPES OF ERRORS ADDRESSED, page 4
LPMCS, page 4
FLOATING-POINT ERRORS ...
PAGE 4
Introduction

This white paper provides an overview of a new technology developed by Hewlett-Packard that can significantly reduce system downtime due to processor failures. This technology, called Dynamic Processor Resilience, enables HP-UX systems to monitor the operation of processors, predict failures before they occur, and dynamically deallocate troubled processors before they experience catastrophic errors that result in system failures.
PAGE 5
NOTE: Starting with the HWE 0206 release of Diagnostics, the CPU monitor keeps track of each of these types of LPMCs separately, rather than treating them as one type as earlier versions of the monitor did.

Floating-Point Errors

Besides monitoring cache errors on the processors, the monitor runs tests on the floating-point registers to check that they are functioning properly.
PAGE 6
Dynamic Processor Resilience (DPR)

Beginning with the June 1999 release of the IPR/Diagnostic media, an EMS monitor is provided which monitors the rate of correctable errors in each processor’s on-board cache. These errors are manifested as Low Priority Machine Checks (LPMCs). While occasional correctable errors are to be expected in the on-board cache, too many of these errors in a short period of time indicate an increased likelihood that a noncorrectable cache error could occur.
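The rate check described above can be pictured as a sliding-window counter kept per processor. The following is only a minimal sketch of that idea, not the monitor's actual implementation: the class name, the `threshold` and `window_seconds` parameters, and the use of plain numeric timestamps are all assumptions made for illustration, and the values shown are not HP's thresholds.

```python
from collections import defaultdict, deque


class LpmcRateTracker:
    """Hypothetical sliding-window counter of LPMCs for each processor."""

    def __init__(self, threshold, window_seconds):
        # Both values are illustrative parameters, not values taken from HP.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # cpu_id -> timestamps still in window

    def record(self, cpu_id, timestamp):
        """Record one LPMC and report whether the threshold has been reached.

        Timestamps are assumed to arrive in non-decreasing order per processor.
        """
        events = self._events[cpu_id]
        events.append(timestamp)
        # Drop errors that have aged out of the window.
        while events and timestamp - events[0] >= self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


# Example: flag a processor that logs 3 LPMCs within one day.
tracker = LpmcRateTracker(threshold=3, window_seconds=24 * 60 * 60)
for cpu_id, when in [(0, 0), (0, 100), (1, 150), (0, 200)]:
    if tracker.record(cpu_id, when):
        print(f"processor {cpu_id} reached the LPMC threshold")
```

Keeping only the timestamps that are still inside the window means that occasional, widely spaced correctable errors never accumulate toward the threshold.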
PAGE 7
same type of cache error (I-Cache Data, I-Cache Tag, D-Cache Data and D-Cache Tag), one of two actions will be taken:

1. If the processor IS NOT the monarch, the Dynamic Processor Deallocation facility will be invoked to deallocate it. The monitor will then generate a serious EMS event indicating that the processor was deallocated and should be scheduled for replacement (see example in figure 3).
PAGE 8
NOTE: On N-Class, L-Class and later machines, the monitor will also try to mark the processor for Deconfiguration, whether or not the processor in question is a Monarch CPU.

NOTE: The EMS CPU monitor will detect and prevent most processor failures that are related to cache errors. However, although cache errors account for the majority of processor failures, it is not possible to detect and prevent all processor-related system failures.
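The decision the monitor makes once a processor crosses the error threshold can be summarized in a short sketch. This is an illustration only: the `Processor` type, the `plan_actions` function, and the action names are hypothetical, and the monarch branch is a placeholder because the action taken for a monarch processor is not described in this excerpt.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    cpu_id: int
    is_monarch: bool


def plan_actions(cpu, supports_deconfiguration):
    """Return the (hypothetically named) actions for a processor over threshold.

    supports_deconfiguration stands for "N-Class, L-Class or later machine".
    """
    actions = []
    if not cpu.is_monarch:
        # Non-monarch: deallocate, then raise a serious EMS event.
        actions.append("deallocate")
        actions.append("send_serious_event")
    else:
        # Placeholder: the monarch case is not covered by this excerpt.
        actions.append("monarch_action_unspecified")
    if supports_deconfiguration:
        # Attempted whether or not the processor is the monarch.
        actions.append("mark_for_deconfiguration")
    return actions


print(plan_actions(Processor(cpu_id=3, is_monarch=False), supports_deconfiguration=True))
```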
PAGE 9
provides an array of monitors that continuously assess the health of various hardware and software components on HP-UX systems. These monitors generate EMS events whenever a failure condition is detected. These events are then reported through various customer-configurable reporting mechanisms such as e-mail, syslog, console messages, text logs, and SNMP. Figure 2 shows a high-level diagram of EMS.
PAGE 10
Event Time..........: Wed Feb 20 08:55:50 2002
Severity............: SERIOUS
Monitor.............: lpmc_em
Event #.............: 100624
System..............: hptest17

Summary:
     Module at Hard Physical Address = 0xfffffffffc478000 : Cache Error(s) detected on Processor 16.

Description of Error:
     2 parity errors have been detected in either Instruction or Data memory (I-Cache or D-Cache) in 1 Day(s). These are indicated by 100611 100612 EMS Event(s) generated prior to this event.
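When events like the one above are collected in a text log, the header fields can be pulled out with a few lines of scripting. The sketch below is not an HP-supplied tool. It assumes the event is written one field per line in the form `Name....: value`, with dot leaders before the colon, and the function name `parse_event_header` is hypothetical.

```python
import re

# A field name, one or more dot leaders, a colon, then the value.
HEADER_FIELD = re.compile(r"^\s*([^.:\n]+?)\.+:\s*(.*)$")


def parse_event_header(text):
    """Return the dot-leader header fields of an EMS event as a dict."""
    fields = {}
    for line in text.splitlines():
        match = HEADER_FIELD.match(line)
        if match and match.group(2):
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


sample = """\
Event Time..........: Wed Feb 20 08:55:50 2002
Severity............: SERIOUS
Monitor.............: lpmc_em
Event #.............: 100624
System..............: hptest17
"""

header = parse_event_header(sample)
print(header["Monitor"], header["Severity"])
```

Requiring the dot leaders keeps free-text lines that merely contain a colon, such as the Summary text, from being mistaken for header fields.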
PAGE 11
Dynamic Processor Resilience and HP Predictive Support

HP Predictive Support is a facility provided to HP Support customers that detects and predicts certain failures on customer systems and reports them directly to HP. Dynamic Processor Resilience is indirectly supported by this facility.