Availability Guide for Problem Management

Monitoring Objects
Availability Guide for Problem Management (125509)
When an object goes into an odd state, you need sufficient information to bring the
object back into an up state. This is preventive recovery, because the object is still
providing services, but if the situation is not corrected, a more serious problem can
occur. For example, if an application transaction log file is over 75 percent full and this
is considered to be an odd state for this object, the common corrective action is to create
a new log file or modify the attributes of the existing file. However, if this condition is
not detected, the log file could become full. If that happens, the application might have
to be stopped to correct the problem.
Detecting that an object is in an odd state may require threshold alarm detection for
the object. Either an application or a monitoring subsystem must take responsibility for
tracking the odd state.
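The log file example can be sketched in a few lines of Python. This is only an illustration of threshold alarm detection, not the mechanism of any particular monitoring subsystem: the LogFile class, the classify_state function, and the 0.75 constant are hypothetical names chosen to mirror the 75 percent example in the text.

```python
from dataclasses import dataclass

# Assumption: 75 percent full marks the odd state, as in the example in the text.
ODD_STATE_THRESHOLD = 0.75


@dataclass
class LogFile:
    """Hypothetical stand-in for an application transaction log file."""
    name: str
    used_bytes: int
    capacity_bytes: int

    def fill_ratio(self) -> float:
        # Fraction of the file's capacity that is currently in use.
        if self.capacity_bytes <= 0:
            raise ValueError("capacity_bytes must be positive")
        return self.used_bytes / self.capacity_bytes


def classify_state(log: LogFile, threshold: float = ODD_STATE_THRESHOLD) -> str:
    """Return "odd" when the file is over the threshold, otherwise "up"."""
    if log.fill_ratio() > threshold:
        return "odd"  # still providing service, but preventive recovery is due
    return "up"
```

A monitoring process would call classify_state periodically and raise an alarm the first time the result changes from "up" to "odd", so that an operator can create a new log file before the current one fills.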
When determining which objects to monitor, two types of events are important:
The first type of event tells when an object changes state and requires reactive
recovery.
The second type of event tells when an object exceeds a threshold, which may also
cause a state change, and requires preventive recovery.
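The pairing of event type and recovery type described above can be written down as a small lookup. The names EventType and recovery_for are hypothetical and serve only to make the distinction concrete.

```python
from enum import Enum


class EventType(Enum):
    """The two event types described in the text (names are illustrative)."""
    STATE_CHANGE = "state_change"              # the object changed state
    THRESHOLD_EXCEEDED = "threshold_exceeded"  # the object exceeded a threshold


def recovery_for(event_type: EventType) -> str:
    """Map an event type to the kind of recovery the text associates with it."""
    if event_type is EventType.STATE_CHANGE:
        return "reactive"
    return "preventive"
```

An event handler could use such a mapping to route state-change events to an operator for immediate action and threshold events to a lower-priority queue.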
Performance Monitoring
In addition to monitoring the states of critical objects in your system environment,
another way to ensure increased availability is to monitor performance using the
following measurements:
End-user response time measurement
Throughput measurement
Measuring end-user response time is important because the assessment of system
availability should be from the end-user’s perspective. For example, it is not enough to
simply record that a certain hardware or software component has gone down; you must
also take into consideration the user’s ability to access the service, the quality of the
service provided, and whether or not the response time is acceptable to the user.
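As a minimal sketch of response time measurement from the user's side, the following Python function times a single request and compares the result with an acceptable limit. The function name timed_request and the acceptable_seconds parameter are assumptions made for this illustration; a real measurement would issue the request through the same path that the end user uses.

```python
import time
from typing import Callable, Tuple


def timed_request(request: Callable[[], object],
                  acceptable_seconds: float) -> Tuple[float, bool]:
    """Run one request; return (elapsed seconds, whether that was acceptable)."""
    start = time.perf_counter()   # monotonic, high-resolution clock
    request()                     # the operation as the end user would invoke it
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= acceptable_seconds
```

Collecting these samples over time shows whether the service is not only up but also responding within the limit that users consider acceptable.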
Throughput is the number of transactions that the system can process in a given span
of time. It is usually expressed as transactions per second. As throughput increases, the
cost of each transaction falls proportionally, because the cost of the system is spread
across more transactions.
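The relationship between throughput and cost can be illustrated with a short calculation. The sketch assumes that the system has a fixed cost per second of operation; that assumption, and the function names throughput_tps and cost_per_transaction, are not from the guide.

```python
def throughput_tps(transactions: int, seconds: float) -> float:
    """Throughput in transactions per second over a measurement interval."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return transactions / seconds


def cost_per_transaction(system_cost_per_second: float, tps: float) -> float:
    """Cost of one transaction, assuming a fixed system cost per second."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return system_cost_per_second / tps
```

For example, 1200 transactions completed in 60 seconds is a throughput of 20 transactions per second. Under the fixed-cost assumption, doubling the throughput from 20 to 40 transactions per second halves the cost of each transaction.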