White Paper on Dynamic Processor Deallocation and Dynamic Processor Resilience

Dynamic Processor Resilience (DPR)
Beginning with the June 1999 release of the IPR/Diagnostic media, an EMS monitor is
provided which monitors the rate of correctable errors in each processor’s on-board
cache. These errors are manifested as Low Priority Machine Checks (LPMCs). While
occasional correctable errors are to be expected in the on-board cache, too many of
these errors in a short period of time indicate an increased likelihood that a non-
correctable cache error could occur. The EMS CPU monitor will continuously monitor
the rate at which LPMCs are occurring and dynamically deactivate a processor, using
the Dynamic Processor Deactivation facility, if the factory-determined threshold is
exceeded. This technology is referred to as Dynamic Processor Resilience. For
PA8500 processors, for example, the threshold is set at three LPMCs within a 24-hour
time period. The monitor automatically selects the threshold appropriate to each
processor type.
NOTE: Starting with the IPR0009 release, this threshold value is no longer configurable.
NOTE: On N-Class, L-Class and later machines, the processor can be Marked-
for-Deconfiguration so that, when the system is rebooted, the
processor will be completely removed from system use. This action of
removing the processor from the system is known as Processor
Deconfiguration. On earlier PA8500-based machines, deconfigured
processors will be reconfigured automatically upon reboot. On these
machines, it is necessary to deconfigure processors manually via the Boot
Console Handler (BCH) if they were Marked-for-Deconfiguration when
the machine was rebooted.
NOTE: Starting with the HWE 0206 release of Diagnostics, the monitor will
deactivate the processor with a special O/S option, so that it cannot be
re-activated without rebooting the system. The purpose of this option is
to prevent system problems caused by continued use of the faulty
processor should the user attempt to re-activate it using the CPU Expert
Tool in STM.
The current state of all of the processors on the system can be determined via the STM
System Information Tool.
The EMS CPU monitor generates informational EMS events for each correctable
cache error that it detects. In order to prevent flooding the administrator with these
events when cache errors persist, the monitor stops generating informational events
once the threshold is met; at that point it generates a serious event and the
processor is deallocated.
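The threshold check and event suppression described above can be illustrated with a
minimal Python sketch. It is not the EMS CPU monitor's actual code: the class name,
method name, and event labels are hypothetical. It assumes a sliding 24-hour window,
that reaching the threshold count (three, per the PA8500 example) is what triggers
deallocation, and that the LPMC which meets the threshold still produces an
informational event alongside the serious one.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour period from the PA8500 example
THRESHOLD = 3                  # three LPMCs, per the PA8500 example


class CacheErrorTracker:
    """Hypothetical per-processor tracker (illustration only)."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.timestamps = deque()  # LPMC times still inside the window
        self.deallocated = False

    def on_lpmc(self, now):
        """Handle one LPMC at time `now` (seconds); return events to emit."""
        if self.deallocated:
            return []  # informational events stop after deallocation
        # Discard LPMCs that have aged out of the sliding window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        self.timestamps.append(now)
        events = ["informational"]
        # Assumption: meeting the threshold count triggers deallocation.
        if len(self.timestamps) >= self.threshold:
            self.deallocated = True
            events.append("serious")
        return events
```

In this sketch, three LPMCs one hour apart yield two informational results followed
by an informational-plus-serious result, and later LPMCs on the same tracker yield
no events. Because the check runs on every notification rather than on a timer, it
reacts as soon as the rate crosses the threshold.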
The CPU monitor receives immediate notification of LPMCs as they occur. Since no
polling delays are involved, the monitor is able to take action the moment the
correctable cache error rate exceeds the threshold. When the threshold is exceeded for