White Paper on Dynamic Processor Deallocation and Dynamic Processor Resilience

Introduction

This white paper provides an overview of an exciting new technology developed by
Hewlett Packard that can significantly reduce system downtime caused by processor
failures. This technology, called Dynamic Processor Resilience, enables HP-UX systems
to monitor the operation of processors, predict failures before they occur, and
dynamically deallocate troubled processors before they experience catastrophic errors
that result in system failures. Dynamic Processor Resilience is one of the key
technologies that enable HP to deliver industry-leading system availability, and it is
available on all PA8500, PA8600 and future processors.

As the cache sizes incorporated into processors continue to increase, accounting for an
increasingly high percentage of processor circuitry, it is critical that correctable cache
errors be handled effectively in order to avoid processor-related system failures. On
Hewlett Packard systems based on PA8500 or later processors, single-bit cache errors (a
single erroneous bit in the data at any given cache memory location) are corrected.
However, a double-bit cache error (two erroneous bits in the data) cannot be corrected
and will result in a system failure. Statistically, most double-bit cache errors are
preceded by a series of single-bit errors over time as the memory cell begins to degrade.
Using Hewlett Packard’s Dynamic Processor Resilience and Dynamic Processor
Deallocation technology, the processor cache can be monitored for correctable errors and
the processor dynamically deallocated before correctable errors become uncorrectable.
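
The monitoring idea described above can be illustrated with a minimal sketch. It is
not HP's implementation: the class name, the error threshold and the time window are
all hypothetical, since this paper does not state the policy values the monitor uses.
The sketch only shows the general pattern of counting correctable errors per processor
and flagging a processor for deallocation once errors recur often enough.

```python
from collections import defaultdict, deque

# Hypothetical policy values: the white paper does not state the real
# threshold or time window used by the monitor.
ERROR_THRESHOLD = 3
WINDOW_SECONDS = 24 * 60 * 60


class CorrectableErrorMonitor:
    """Counts correctable cache errors per processor (illustrative only)."""

    def __init__(self, threshold=ERROR_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self._events = defaultdict(deque)  # cpu id -> timestamps of recent errors
        self.deallocated = set()

    def record_error(self, cpu, timestamp):
        """Record one correctable error.

        Returns True if this error causes the processor to be flagged
        for deallocation, otherwise False.
        """
        if cpu in self.deallocated:
            return False
        events = self._events[cpu]
        events.append(timestamp)
        # Discard errors that are older than the sliding window.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        if len(events) >= self.threshold:
            self.deallocated.add(cpu)
            return True
        return False


monitor = CorrectableErrorMonitor(threshold=3, window=100)
for t in (0, 10, 20):
    flagged = monitor.record_error(cpu=2, timestamp=t)
print("cpu 2 flagged for deallocation:", flagged)
```

A sliding window is used here so that isolated, widely spaced errors do not
accumulate indefinitely; whether the actual monitor behaves this way is an assumption.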

Dynamic Processor Resilience also works hand-in-hand with HP's exciting new "instant
Capacity On Demand" (iCOD) product. iCOD enables customers to purchase systems that
include one or more reserve processors that have not yet been purchased. When
additional capacity is required, the reserve processors can be purchased and "instantly"
enabled. For systems that have iCOD enabled, reserve processors, if any exist,
automatically replace processors that are deallocated by the EMS CPU monitor
(previously named “LPMC monitor”), thus ensuring that the system continues to run at
full capacity. The faulty processor can then be replaced when convenient, at which time
it will be returned to the reserve pool.
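
The interaction between deallocation and the iCOD reserve pool can be sketched as a
small bookkeeping model. This is an illustration under stated assumptions, not the
iCOD software: the ProcessorPool class and its method names are hypothetical, and the
handling of a repair when no reserve processor was available is an assumption that
this paper does not cover.

```python
class ProcessorPool:
    """Toy model of active and reserve processors (hypothetical names)."""

    def __init__(self, active, reserve):
        self.active = set(active)
        self.reserve = list(reserve)
        self.faulty = {}  # faulty cpu id -> replacement cpu id, or None

    def deallocate(self, cpu):
        """Take a processor out of service and enable a reserve one if possible.

        Returns the id of the replacement processor, or None if the
        reserve pool is empty and capacity is reduced.
        """
        if cpu not in self.active:
            raise ValueError(f"processor {cpu} is not active")
        self.active.remove(cpu)
        replacement = self.reserve.pop(0) if self.reserve else None
        if replacement is not None:
            self.active.add(replacement)
        self.faulty[cpu] = replacement
        return replacement

    def repair(self, cpu):
        """Return a repaired processor to service bookkeeping."""
        replacement = self.faulty.pop(cpu)
        if replacement is not None:
            # A reserve processor took over, so the repaired one
            # goes back to the reserve pool, as the text describes.
            self.reserve.append(cpu)
        else:
            # Assumption: with no replacement in place, the repaired
            # processor simply rejoins the active set.
            self.active.add(cpu)


pool = ProcessorPool(active=[0, 1, 2, 3], reserve=[4])
print("replacement for cpu 2:", pool.deallocate(2))
pool.repair(2)
print("reserve pool after repair:", pool.reserve)
```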

NOTE: Starting with the HWE 0206 release of Diagnostics, the “LPMC monitor” has
been renamed “CPU monitor” in the documentation to reflect the fact
that it monitors floating point functionality in addition to LPMCs. The
binary name (lpmc_em) remains unchanged.

Types of Errors Addressed

LPMCs

There are four types of single-bit cache parity errors that the processor can
experience: I-Cache Data error, I-Cache Tag error, D-Cache Data error and D-Cache
Tag error.
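
For readers who want to tally these errors in their own tooling, the four types can be
represented as a simple enumeration. The names below are hypothetical and do not
reflect the identifiers or log format used by the CPU monitor itself.

```python
from collections import Counter
from enum import Enum


class CacheErrorType(Enum):
    """The four single-bit cache parity error types (hypothetical names)."""

    ICACHE_DATA = ("I-Cache", "Data")
    ICACHE_TAG = ("I-Cache", "Tag")
    DCACHE_DATA = ("D-Cache", "Data")
    DCACHE_TAG = ("D-Cache", "Tag")

    @property
    def cache(self):
        """Which cache the error occurred in: instruction or data."""
        return self.value[0]

    @property
    def array(self):
        """Which part of the cache was affected: data or tag."""
        return self.value[1]


# Example: count observed errors by the cache they occurred in.
observed = [
    CacheErrorType.DCACHE_DATA,
    CacheErrorType.DCACHE_TAG,
    CacheErrorType.ICACHE_DATA,
]
per_cache = Counter(err.cache for err in observed)
print(dict(per_cache))
```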