White Paper on Dynamic Processor Deallocation and Dynamic Processor Resilience

Dynamic Processor Resilience (DPR)

Beginning with the June 1999 release of the IPR/Diagnostic media, an EMS monitor is

provided which monitors the rate of correctable errors in each processor’s on-board

cache. These errors are manifested as Low Priority Machine Checks (LPMCs). While

occasional correctable errors are to be expected in the on-board cache, too many of

these errors in a short period of time indicate an increased likelihood that a non-

correctable cache error could occur. The EMS CPU monitor will continuously monitor

the rate at which LPMCs are occurring and dynamically deactivate a processor, using

the Dynamic Processor Deactivation facility, if the factory-determined threshold is

exceeded. This technology is referred to as Dynamic Processor Resilience. For

PA8500 processors, for example, the threshold is set at three LPMCs within a 24-hour

time period. The monitor sets the threshold for different processors automatically.
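
The threshold check described above can be sketched as a sliding-window counter. This is a minimal illustration, not the EMS monitor's actual implementation: the class name LpmcRateTracker and its record method are hypothetical, timestamps are plain seconds, and the sketch assumes the processor is flagged when the count inside the window reaches the threshold (the PA8500 values of three LPMCs in 24 hours are used as defaults).

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour window, from the PA8500 example
THRESHOLD = 3                  # three LPMCs, from the PA8500 example


class LpmcRateTracker:
    """Hypothetical sliding-window counter for one processor's LPMCs."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.timestamps = deque()

    def record(self, now):
        """Record an LPMC at time `now` (in seconds).

        Returns True when the number of LPMCs inside the window has
        reached the threshold (assumed trigger condition).
        """
        self.timestamps.append(now)
        # Discard LPMCs that have aged out of the window.
        while now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold
```

With these defaults, three LPMCs one hour apart trip the check on the third call, whereas three LPMCs spread over more than 24 hours do not, because the oldest one has aged out of the window by the time the third arrives.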

NOTE: Starting with the IPR0009 release, this threshold value is no longer configurable.

NOTE: On N-Class, L-Class and later machines, the processor can be

Marked-for-Deconfiguration so that when the system is rebooted, the

processor will be completely removed from system use. This action of

removing the processor from the system is known as Processor

Deconfiguration. On earlier PA8500-based machines, deconfigured

processors will be reconfigured automatically upon reboot. On these

machines, it is necessary to deconfigure processors manually via the Boot

Console Handler (BCH) if they were Marked-for-Deconfiguration when

the machine was rebooted.

NOTE: Starting with the HWE 0206 release of Diagnostics, the monitor will deactivate the

processor with a special O/S option, so that it cannot be re-activated

without rebooting the system. The purpose of the new option is to

prevent system problems caused by continued use of the faulty processor if

the user decides to re-activate the processor using the CPU Expert

Tool in STM.

The current state of all of the processors on the system can be determined via the STM

System Information Tool.

The EMS CPU monitor generates informational EMS events for each correctable

cache error that it detects. To avoid flooding the administrator with these

events when cache errors persist, the monitor stops generating informational

events once the threshold has been met, a serious event has been

generated, and the processor has been deallocated.
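
The event-suppression behavior described above can be sketched as a small state machine. The class name CpuEventFilter, the on_lpmc method, and the severity strings are hypothetical and are not EMS identifiers. The sketch omits the time window for brevity and assumes that the error which meets the threshold produces only the serious event.

```python
class CpuEventFilter:
    """Hypothetical per-processor filter deciding which event to emit."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed: PA8500 example value
        self.count = 0
        self.deallocated = False

    def on_lpmc(self):
        """Return the severity to emit for one LPMC, or None if suppressed."""
        if self.deallocated:
            # Threshold already met: no further informational events.
            return None
        self.count += 1
        if self.count >= self.threshold:
            # Threshold met: emit one serious event and mark the
            # processor as deallocated (the deallocation itself is
            # outside the scope of this sketch).
            self.deallocated = True
            return "serious"
        return "informational"
```

Calling on_lpmc five times on a fresh filter yields two informational events, one serious event, and then nothing, which is the anti-flooding behavior the text describes.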

The CPU monitor receives immediate notification of LPMCs as they occur. Since no

polling delays are involved, the monitor is able to take action the moment the

correctable cache error rate exceeds the threshold. When the threshold is exceeded for