HP Insight Management Agents 9.10 Managing ProLiant Servers with Linux HOW TO Whitepaper

The Health Monitor does the following:
• Displays a message on the console stating the problem
• Makes an entry in the system health log
This server feature is configured using RBSU. On ProLiant servers that do not support AMP mirroring,
an uncorrectable (double bit) memory error causes the operating system to halt abruptly. Logging
of the error might not be possible if the error occurs in memory used by the Health Monitor.
Automatic server recovery
Automatic Server Recovery (ASR) is configured using RBSU, which is available during the initial
boot of the server by pressing the F9 key when prompted. This feature is implemented using a
"heartbeat" timer that continually counts down. The Health Monitor frequently reloads the counter
to prevent it from counting down to zero. If the ASR timer counts down to zero, it is assumed that
the operating system has locked up, and the system automatically attempts to reboot. Events that
can contribute to the operating system locking up include:
• A peripheral device, such as a Peripheral Component Interconnect Specification (PCI) adapter,
  generates numerous spurious interrupts when it fails.
• A high priority software application consumes all the available central processing unit (CPU)
  cycles and does not allow the operating system scheduler to run the ASR timer reset process.
• A software or kernel application consumes all available memory, including the virtual memory
  space (for example, swap). This can cause the operating system scheduler to cease functioning.
• A critical operating system component, such as a file system, fails and causes the operating
  system scheduler to cease functioning.
Any event other than an ASR timeout causes a Non-Maskable Interrupt (NMI) to be generated.
The ASR feature is a hardware-based timer.
If a true hardware failure occurs, the Health Monitor might not be called, but the server resets as
if the power switch were pressed. The ProLiant ROM code might log an event to the IML when
the server reboots.
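The heartbeat mechanism described above can be illustrated with a small software model. This is only a sketch for intuition: the real ASR timer is implemented in hardware, and the class name, function name, and tick granularity below are hypothetical, not part of any HP interface.

```python
class HeartbeatTimer:
    """Toy model of a countdown timer that must be reloaded periodically."""

    def __init__(self, reload_value):
        self.reload_value = reload_value
        self.remaining = reload_value
        self.expired = False

    def reload(self):
        """Reset the counter; models the monitor running while the OS is responsive."""
        if not self.expired:
            self.remaining = self.reload_value

    def tick(self):
        """Count down by one time unit and mark the timer expired at zero."""
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True


def simulate(reload_value, monitor_alive):
    """Run the timer over a sequence of booleans.

    Each entry says whether the monitor managed to reload the counter
    during that tick. Returns the index of the tick at which the timer
    expired, or None if it never expired.
    """
    timer = HeartbeatTimer(reload_value)
    for index, alive in enumerate(monitor_alive):
        if alive:
            timer.reload()
        timer.tick()
        if timer.expired:
            return index
    return None


# The monitor stops reloading after two ticks (a modeled lockup),
# so the counter runs out two ticks later.
print(simulate(3, [True, True, False, False, False, False]))  # 3
```

As long as every entry in the sequence is True, the counter never reaches zero and the function returns None, which corresponds to a healthy system that is never reset.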
The Health Monitor is notified of an ASR timeout through an NMI. If possible, the driver attempts
to perform the following actions:
• Display a message on the console stating the problem
• Make an entry in the IML
• Gracefully shut down the operating system to close the file systems
There is no guarantee that the operating system will shut down gracefully. The outcome depends
on the type of error condition (software or hardware) and its severity. The Health Monitor logs a
series of messages when an ASR event occurs. The presence or absence of these messages can
provide some insight into the reason for the ASR event. The order of the messages is important,
since the ASR event is always a symptom of another error condition.
Console messages
When events occur outside normal operations, the Health Monitor might display a console message
or log a message to the IML. Operational messages, such as fan failures or temperature violations,
are logged to the standard /var/log/messages file. Messages specific to device drivers (such
as NMI type messages) can be viewed using dmesg, if the system is not completely locked up.
The hp-health manpage documents how to interpret the messages produced by the Health
Monitor.
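As a rough illustration of reviewing operational messages, the sketch below scans a log file for lines that mention given keywords. The function names and the default keywords ("fan", "temperature") are assumptions chosen for this example; the exact wording of Health Monitor messages is documented in the hp-health manpage and may not contain these strings.

```python
def filter_messages(lines, keywords):
    """Yield lines that contain any of the keywords, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield line.rstrip("\n")


def scan_log(path="/var/log/messages", keywords=("fan", "temperature")):
    """Return matching lines from a log file.

    Reading /var/log/messages typically requires root privileges.
    The default keywords are illustrative guesses, not documented strings.
    """
    with open(path, errors="replace") as handle:
        return list(filter_messages(handle, keywords))
```

For example, `scan_log()` returns the candidate lines from the standard log file, while `filter_messages` can be applied to any iterable of lines, such as captured dmesg output.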
HP Integrated Management Logging Utility (hplog)
The HP ProLiant Integrated Management Logging utility (hplog) allows system administrators to
view IML pages. Commands are listed in Table 3: hplog options.
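A script can also collect hplog output for later review. The following is a minimal sketch: the function name is hypothetical, and the `-v` option is assumed here to be the one that displays the IML, so confirm it against Table 3 or the hplog manpage before relying on it.

```python
import shutil
import subprocess


def read_iml(option="-v", tool="hplog"):
    """Run the logging utility and return its output as a list of lines.

    Returns None if the tool is not found on PATH. The default option
    is an assumption; verify it against the hplog documentation.
    """
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run(
        [path, option], capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()
```

Returning None when the utility is missing lets a calling script distinguish "hplog is not installed" from "the log is empty".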