HP Insight Management Agents 9.10 Managing ProLiant Servers with Linux HOW TO Whitepaper

A Error messages
Messages logged if an ASR event occurs are listed in Table 14 (page 27).
Table 14 Error messages
Message 1: NMI-Automatic Server Recovery timer expiration Hour %d-%d/%d/%d
Description: This message indicates that the Health Monitor detected an ASR timeout and is attempting to gracefully shut down the Operating System. Absence of this message can indicate a critical hardware failure (such as a non-correctable ECC error on a memory DIMM) or some other severe event. This is the first of a series of messages displayed to the console. This message is not logged to the IML and most likely not listed in any system logs.
Recommended action: Review all the messages logged to the IML to see if any previous errors have been logged. For example, a corrected single-bit memory error might have been logged.

Message 2: ASR Lockup Detected: %s
Description: This message indicates that the Health Monitor detected an ASR timeout and is attempting to gracefully shut down the Operating System. Absence of this message can indicate a critical hardware failure (such as a non-correctable ECC error on a memory DIMM) or some other severe event. This is the first ASR message logged to the IML, if logging is possible.
Recommended action: Review all the messages logged to the IML to see if any previous errors have been logged.

Message 3: casm: ASR performed a successful OS shutdown
Description: This ASR message indicates that the Health Monitor detected an ASR timeout and has gracefully shut down the Operating System. Absence of this message can indicate a critical hardware failure (such as a non-correctable ECC error on a memory DIMM), a high-priority process consuming all the available CPU cycles (software failure), or a device such as a storage or a network controller flooding the system with interrupts. This is the second ASR message logged to the IML, if logging is possible.
Recommended action: This ASR message usually indicates a software error such as a high-priority process consuming all the available CPU cycles. Linux tools such as sar (system activity report) can be used in conjunction with the ASR facility to locate the process causing the problem.

Message 4: ASR Detected by System ROM
Description: This message indicates that the ProLiant Server ROM detected an ASR timeout. This message is almost always present in the IML when an ASR timeout occurs. If this is the only ASR message logged to the IML, this can indicate a hardware failure such as a non-correctable ECC error on a memory DIMM. The ASR feature on a ProLiant server resets the server when the timeout expires with no software intervention required.
Recommended action: If this is the only ASR message present, this usually indicates a hardware error (such as an unrecoverable memory error). Try moving the server memory DIMMs to different slots to see if more information can be logged. Review all IML messages that previously occurred to see if any other component has given an indication of failure or temperature limits that might have exceeded normal operating thresholds.

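
The guidance in Table 14 can be read as a rough triage rule: which ASR messages are present hints at whether hardware or software is the more likely cause. The following Python sketch illustrates that reading only. It assumes the log and console text has already been collected into a list of strings (how it is collected is not covered here), and the names `ASR_PATTERNS`, `find_asr_messages`, and `triage` are hypothetical helpers, not part of any HP tool. The `%d` and `%s` placeholders are matched loosely because their exact contents are not specified in the table.

```python
import re

# Patterns derived from the message formats in Table 14.
# Placeholders (%d, %s) are matched loosely; this is an assumption.
ASR_PATTERNS = {
    1: re.compile(
        r"NMI-Automatic Server Recovery timer expiration Hour \d+-\d+/\d+/\d+"
    ),
    2: re.compile(r"ASR Lockup Detected: .*"),
    3: re.compile(r"casm: ASR performed a successful OS shutdown"),
    4: re.compile(r"ASR Detected by System ROM"),
}


def find_asr_messages(lines):
    """Return the set of Table 14 message numbers found in the given lines."""
    found = set()
    for line in lines:
        for number, pattern in ASR_PATTERNS.items():
            if pattern.search(line):
                found.add(number)
    return found


def triage(lines):
    """Suggest a first direction for investigation (illustrative only)."""
    found = find_asr_messages(lines)
    if not found:
        return "no ASR messages found"
    if found == {4}:
        # Table 14: Message 4 alone usually indicates a hardware error.
        return "hardware suspected: only the System ROM message is present"
    if 3 in found:
        # Table 14: Message 3 usually indicates a software error.
        return "software suspected: the OS was shut down gracefully"
    return "inconclusive: review earlier IML entries"
```

Calling `triage` on the collected lines returns a short hint string. The result is only a starting point; the recommended actions in Table 14 still apply.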
The cpqriisd service acts as an enabler for other ProLiant value-add software, such as the Rack
Agent and the Rack Upgrade Utility. This service is only applicable for p-Class blade systems.
If the service stops a few seconds after starting, it has failed to initiate communication with the
iLO management controller. The reason for the failure is logged in the message log. If the service is