Availability Guide for Problem Management

Preventing Unplanned Outages
Goals and Strategies
An application environment may consist of thousands of objects (processors,
terminals, disk drives, communications lines, files, processes, and so on) that need to
be present and in the correct state to be available to end users. You need to ensure
that critical objects are automatically monitored to keep them available to users.
You also need to understand the dependencies that may exist between these objects,
for example, disk space and processor cycles. You might have a daily batch
application that takes 16 hours to run, but it fails after 14 hours because of a lack of
disk space. As a result, the 16-hour job must be run again in addition to the current
day’s job.
Section 5, “Monitoring Objects,” provides more information on this topic.
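As a minimal sketch of monitoring one such dependency, the following Python example checks free disk space before a long batch job starts, so that a shortage is detected up front rather than 14 hours into the run. The function name enough_disk_space, the 10 percent margin, and the 50 GB estimate are hypothetical illustrations and are not part of any product described in this guide.

```python
import shutil


def enough_disk_space(path, required_bytes, margin_percent=10,
                      usage_fn=shutil.disk_usage):
    """Return True if `path` has room for `required_bytes` plus a margin.

    Hypothetical helper for illustration only. `usage_fn` is injectable
    so the check can be exercised without touching a real disk.
    """
    free = usage_fn(path).free
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    needed = required_bytes + (required_bytes * margin_percent) // 100
    return free >= needed


if __name__ == "__main__":
    # Assumed figure: the batch job writes about 50 GB of output.
    estimated_output = 50 * 1024**3
    if enough_disk_space(".", estimated_output):
        print("Enough space: start the batch job.")
    else:
        print("Insufficient space: alert operations before starting.")
```

Running a check like this immediately before the job, and again at intervals while it runs, is one way to make the disk-space dependency visible to operations staff.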
Preventing Potential Problems
Two important strategies for preventing problems before they occur are:
Automating operations, intervention, recovery, and performance-monitoring tasks.
Automated operations can help alleviate problems of overworked operations
staff and help-desk personnel. By intervening programmatically, automated
systems reduce the burden on operations staff and increase the availability of the
system.
Automated operations are generally faster than human operators. This helps
reduce the period of vulnerability between a single-point failure and its fix.
Automated operations can ensure that a problem has to be resolved only once,
instead of multiple times with an increased chance of errors.
Section 6, “Automating Operations and Recovery Procedures,” provides more
information on this topic.
Auditing your system and applications for fault tolerance.
Ensuring that your system and applications are fault tolerant reduces the likelihood
that a single component failure will contribute to an unplanned outage.
Section 7, “Auditing Systems for Fault Tolerance,” provides more information on
this topic.
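The first strategy above, programmatic intervention, can be sketched as follows. This is a simplified illustration in Python, not the mechanism of any particular automation product: the function monitor_once and the check, recover, and log callbacks are hypothetical names, and a real system would supply its own ways of testing and restarting objects.

```python
def monitor_once(objects, check, recover, log):
    """Check each critical object once; try to recover any that are down.

    Hypothetical sketch. `check(name)` returns True when the object is
    available, `recover(name)` attempts a fix, and `log(message)` records
    what happened. Returns the names still down after the recovery attempt,
    so they can be escalated to operations staff.
    """
    still_down = []
    for name in objects:
        if check(name):
            continue
        log(f"{name} is down; attempting automated recovery")
        recover(name)
        if check(name):
            log(f"{name} recovered")
        else:
            log(f"{name} still down; escalating to operations staff")
            still_down.append(name)
    return still_down


if __name__ == "__main__":
    # Simulated object states; the names are invented for this example.
    state = {"disk-1": True, "line-7": False}

    def check(name):
        return state[name]

    def recover(name):
        state[name] = True

    print(monitor_once(state, check, recover, print))
```

Because the recovery step is encoded once and run the same way every time, operators are involved only for the objects the routine cannot fix.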
Preparing for Problems That May Occur
Three important strategies for preparing for problems before they occur are:
Preparing for environmental problems and disasters.
Having proper disaster recovery plans and backup equipment in place can help to
reduce the impact of unplanned outages caused by air-conditioning failures,
power failures, and disasters.
Section 8, “Planning for Disasters,” provides more information on this topic.
Documenting your operations management procedures.