White Paper on Dynamic Processor Deallocation and Dynamic Processor Resilience

Introduction
The purpose of this white paper is to provide an overview of an exciting new technology
that Hewlett Packard has developed that can significantly reduce system downtime due to
processor failures. This technology, called Dynamic Processor Resilience, enables
HP-UX systems to monitor the operation of processors, predict failures before they occur,
and dynamically deallocate troubled processors before they experience catastrophic
errors resulting in system failures. Dynamic Processor Resilience is one of the key
technologies that enable HP to deliver industry-leading system availability and is
available on all PA8500, PA8600 and future processors.
As the cache sizes incorporated into processors continue to increase, accounting for
increasingly higher percentages of processor circuitry, it is critical that correctable cache
errors be handled effectively in order to avoid processor-related system failures. On
Hewlett Packard systems based on PA8500 or later processors, single-bit cache errors (a
single erroneous bit in the data at any given cache memory location) are corrected.
However, a double-bit cache error (two erroneous bits in the data) cannot be corrected
and will result in a system failure. Statistically, most double-bit cache errors will be
preceded by a series of single-bit errors over time as the memory cell begins to degrade.
Using Hewlett Packard’s Dynamic Processor Resilience and Dynamic Processor
Deallocation technology, processor cache can be monitored for correctable errors and the
processor dynamically deallocated before correctable errors turn uncorrectable.
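The predictive idea described above can be illustrated with a minimal sketch. It is not HP's implementation: the class name, the threshold of three corrected errors, and the 24-hour observation window are all hypothetical values chosen for illustration, since this paper does not state the actual policy.

```python
from collections import defaultdict, deque

# Hypothetical policy values; the paper does not state the real thresholds.
ERROR_THRESHOLD = 3
WINDOW_SECONDS = 24 * 60 * 60


class CacheErrorMonitor:
    """Tracks corrected single-bit cache errors per processor (illustrative only)."""

    def __init__(self, threshold=ERROR_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.events = defaultdict(deque)  # cpu id -> timestamps of recent errors
        self.deallocated = set()

    def record_corrected_error(self, cpu, timestamp):
        """Record one corrected error; return True if the CPU is deallocated now."""
        if cpu in self.deallocated:
            return False
        recent = self.events[cpu]
        recent.append(timestamp)
        # Drop errors that have aged out of the observation window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        # Too many correctable errors in the window: retire the processor
        # before an uncorrectable double-bit error can occur.
        if len(recent) >= self.threshold:
            self.deallocated.add(cpu)
            return True
        return False
```

Under these assumptions, isolated corrected errors spread over a long period leave the processor in service, while a cluster of errors inside the window triggers deallocation.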
Dynamic Processor Resilience also works hand-in-hand with HP's exciting new "instant
Capacity On Demand" (iCOD) product. iCOD enables customers to purchase systems
that have one or more processors in reserve, which have not yet been purchased. When
additional capacity is required, the reserve processors can be purchased and "instantly"
enabled. On systems that have iCOD enabled, any available reserve processors
automatically replace processors that are deallocated by the EMS CPU monitor
(previously named “LPMC monitor”), thus ensuring that the system continues to
run at full capacity. The faulty processor can then be replaced when convenient,
at which time it will be returned to the reserve pool.
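The interaction between deallocation and the iCOD reserve pool can be modeled in a few lines. This is a sketch of the bookkeeping only; the class and method names are hypothetical and do not correspond to any iCOD or EMS interface.

```python
class ProcessorPool:
    """Toy model of active and reserve processors (illustrative only)."""

    def __init__(self, active, reserve):
        self.active = set(active)
        self.reserve = list(reserve)
        self.deallocated = set()

    def deallocate(self, cpu):
        """Retire a failing CPU; return the reserve CPU activated in its place, or None."""
        if cpu not in self.active:
            raise ValueError(f"CPU {cpu} is not active")
        self.active.remove(cpu)
        self.deallocated.add(cpu)
        if self.reserve:
            # A reserve processor takes over, so capacity is unchanged.
            replacement = self.reserve.pop(0)
            self.active.add(replacement)
            return replacement
        # No reserve processor exists: the system runs with one fewer CPU.
        return None

    def replace_faulty(self, cpu):
        """After the faulty part is swapped out, the slot joins the reserve pool."""
        self.deallocated.remove(cpu)
        self.reserve.append(cpu)
```

In this model, a system with four active processors and one in reserve still has four active processors after a single deallocation; a second deallocation, with the reserve exhausted, reduces capacity to three.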
NOTE: Starting with the HWE 0206 release of Diagnostics, the “LPMC monitor” has
been renamed “CPU monitor” in the documentation to reflect the fact
that it monitors floating point functionality in addition to LPMCs. The
binary name (lpmc_em) remains unchanged.
Types of Errors Addressed
LPMCs
There are four types of single-bit cache parity errors that the processor can
experience: I-Cache Data error, I-Cache Tag error, D-Cache Data error, and
D-Cache Tag error.