Availability Guide for Application Design
What Is Application Availability?
Availability Guide for Application Design—525637-004
Outage Classes
Design Outages
Design outages are usually caused by malfunctioning software, either system software
or application. Again, deterministic faults are rare throughout the industry. Transient
problems are more common, as users of most personal computers will testify.
Potential causes of a design outage on a NonStop system include a LAN
broadcast storm or a degenerating response time. Good software engineering
practices are the primary means for preventing deterministic design outages. Process
pairs tolerate transient faults well; the backup process can continue where the primary
failed because its memory, queues, and so on, are different from those of the failed
primary. Refer to Section 7, Availability Through Process-Pairs and Monitors, for
details.
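
The takeover behavior described above can be illustrated with a minimal sketch. It is not the NonStop process-pair interface; the names Worker, apply_checkpoint, and run_pair are hypothetical, and the fault is simulated. The sketch assumes the primary sends a checkpoint to the backup after each completed request, so the backup holds its own copy of the state and can resume from the last checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One half of a hypothetical process pair; holds its own private state."""
    name: str
    next_item: int = 0                      # index of the next unprocessed request
    results: list = field(default_factory=list)

    def apply_checkpoint(self, next_item, results):
        # The backup stores a copy of the state, never a shared reference,
        # so damage to the primary's memory cannot reach it.
        self.next_item = next_item
        self.results = list(results)


def run_pair(requests, fail_at=None):
    """Process requests on the primary; switch to the backup on a fault."""
    primary, backup = Worker("primary"), Worker("backup")
    active = primary
    while active.next_item < len(requests):
        i = active.next_item
        if active is primary and i == fail_at:
            # Simulated transient fault: the primary stops, and the backup
            # continues from the last checkpoint it received.
            active = backup
            continue
        active.results.append(requests[i] * 2)   # stand-in for real work
        active.next_item = i + 1
        if active is primary:
            backup.apply_checkpoint(active.next_item, active.results)
    return active.name, active.results
```

Calling run_pair([1, 2, 3, 4], fail_at=2) in this sketch finishes on the backup with the same results the primary would have produced, because the request that triggered the fault is retried against the backup's separate state.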
The transaction model also tolerates faults well through atomicity, consistency,
isolation, and durability. Refer to Section 4, Data Protection and Recovery, for details.
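
As an illustration of atomicity only, the following sketch uses Python's standard sqlite3 module rather than any NonStop transaction facility. The accounts table and the transfer and demo functions are assumptions made for the example. If a fault occurs partway through the transfer, the whole transaction is rolled back and no partial update remains.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                # Simulated fault in the middle of the transaction.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def demo():
    """Run one successful and one failed transfer on an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    ok = transfer(conn, "a", "b", 30)
    failed = transfer(conn, "a", "b", 500)
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return ok, failed, balances
```

In demo, the second transfer fails after the debit has been applied, yet the balances afterward show only the effect of the first transfer.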
Operational Outages
Operational errors occur when an operator or support person performs an incorrect action.
Examples of operational errors on a NonStop system include accidentally pushing the
power-off button, incorrectly installing the operating system, and pulling the good
processor board when intending to replace the faulty one.
Training and automated problem handling provide the best protection against
operational errors. Section 8, Instrumenting an Application for Availability, provides
details on how to design an application to generate event messages when problems
occur. The NonStop Distributed Systems Management (DSM) subsystem can be used
to automate a response or present the information to an operator in a way that clarifies
and simplifies the response procedures. Refer to the Availability Guide for Problem
Management for details on how to handle event messages.
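
The division of work between automated responses and operator-facing messages can be sketched as follows. This is not the DSM programming interface; the Event type, the event codes, and the AUTOMATED_RESPONSES table are hypothetical and exist only to show the routing idea: events with a known response are handled automatically, and everything else reaches the operator as a specific, readable message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical application event message."""
    subsystem: str
    code: str
    detail: str


# Hypothetical table of automated responses, keyed by event code.
AUTOMATED_RESPONSES = {
    "DISK_NEARLY_FULL": "purge old log files",
    "SERVER_STOPPED": "restart server process",
}


def handle(event):
    """Return (who acts, what to do) for an event."""
    action = AUTOMATED_RESPONSES.get(event.code)
    if action is not None:
        return ("automation", action)
    # No automated rule applies: give the operator a clear, specific message.
    return ("operator", f"{event.subsystem}: {event.code} - {event.detail}")
```

Keeping the response table separate from the application code means new automated responses can be added without changing the code that generates the events.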
Environmental Outages
Environmental outages result from an external condition that has nothing to do with the
design or operation of the computer installation.
Examples of environmental outages include major natural disasters such as
earthquakes, electrical storms, and flooding; man-made disasters; and more mundane
events such as power-grid problems. Note that in the U.S.A., which has one of the
most reliable power services in the world, the average computer room experiences 443
power faults per year. In other words, scarcely a day passes without the power quality
being compromised.
Keeping a remote duplicate database enables fast recovery from natural or man-made
disasters; Section 4, Data Protection and Recovery, provides details. Some NonStop
servers are designed to tolerate earthquakes up to a magnitude 8.2 on the Richter
scale.