Availability Guide for Problem Management
Preventing Unplanned Outages
Availability Guide for Problem Management–125509
Common Causes of Unplanned Outages
Tandem studies repeatedly identify four common causes of unplanned outages, listed
here from most to least frequent:
• Operations management errors
• Nonfault-tolerant hardware configuration
• Nonfault-tolerant application design
• Environmental problems
Operations Management Errors
This category is the single most common cause of unplanned outages. Operations
management errors are caused by
• Procedural problems (or lack of procedures)
• Lack of operator training
• Failure of operations staff to take appropriate action when problems occur
• Failure to perform (or perform correctly) basic operations and maintenance tasks
Nonfault-Tolerant Hardware Configuration
An inadequate hardware configuration (that is, a system configuration that cannot
survive a single-processor failure) or a system that uses unmirrored disk drives can
contribute to unplanned outages. Section 7, “Auditing Systems for Fault Tolerance,”
provides more information on this topic.
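The two hardware conditions named above can be checked mechanically once a system's configuration is written down. The following Python sketch is illustrative only: the SystemConfig and Disk types, the audit function, and the disk names are hypothetical and do not correspond to any Tandem utility. It also simplifies "cannot survive a single-processor failure" to "has fewer than two processors," which is a necessary condition but not a complete test.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Disk:
    name: str        # hypothetical volume name
    mirrored: bool   # True if the volume has a mirror


@dataclass
class SystemConfig:
    processors: int                       # number of processors in the system
    disks: List[Disk] = field(default_factory=list)


def audit(config: SystemConfig) -> List[str]:
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    # Simplifying assumption: with fewer than two processors, no processor
    # remains to take over the workload after a single-processor failure.
    if config.processors < 2:
        findings.append("fewer than two processors: cannot survive a single-processor failure")
    # Every unmirrored disk is reported as a single point of failure.
    for disk in config.disks:
        if not disk.mirrored:
            findings.append(f"disk {disk.name} is unmirrored")
    return findings


if __name__ == "__main__":
    example = SystemConfig(processors=1, disks=[Disk("DATA1", True), Disk("DATA2", False)])
    for finding in audit(example):
        print(finding)
```

A real audit would also need to confirm that the remaining processors have the capacity to carry the workload of the failed one, which this sketch does not attempt.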
Nonfault-Tolerant Application Design
Nonfault-tolerant application design can be another source of unplanned outages. Your
applications are not fault-tolerant unless you use one or more of the following
techniques:
• Transaction protection in a Pathway environment
• Process pairs
• Process monitoring with restart capability
• Transaction Monitoring Facility (TMF) database recovery
Section 7, “Auditing Systems for Fault Tolerance,” provides more information on this
subject.
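Of the techniques listed above, process monitoring with restart capability is the simplest to illustrate in general terms. The following Python sketch shows the idea under stated assumptions: it uses the standard subprocess module rather than any Tandem facility, the function name run_with_restart and its parameters are hypothetical, and a nonzero exit status is assumed to mean that the monitored process failed.

```python
import subprocess
import sys
import time


def run_with_restart(command, max_restarts=3, delay_seconds=0.1):
    """Run command and restart it each time it exits with a nonzero status.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError if the process still fails after max_restarts restarts.
    """
    restarts = 0
    while True:
        # Block until the monitored process exits, then inspect its status.
        result = subprocess.run(command)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"process failed after {restarts} restarts "
                f"(last exit status {result.returncode})"
            )
        restarts += 1
        # Pause briefly so that a persistent fault does not cause a tight loop.
        time.sleep(delay_seconds)


if __name__ == "__main__":
    # A child process that exits cleanly needs no restart.
    count = run_with_restart([sys.executable, "-c", "raise SystemExit(0)"])
    print("restarts needed:", count)
```

The restart limit is a deliberate design choice: restarting indefinitely can hide a persistent fault from the operations staff, while stopping after a fixed number of attempts turns the fault into a visible event.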
Environmental Problems
Environmental outages result from failures in power, cooling, or network connections.
Examples include a fiber-optic cable that is accidentally severed and a satellite
transponder that experiences interference.
Section 8, “Planning for Disasters,” provides more information on this topic.