NonStop Systems Introduction
NonStop Server Architecture
NonStop Systems Introduction—527825-001
Multiple Power Sources and Online Repair
One threat to continuous system operation lies outside the system itself: the danger of
a power failure. No system is immune to a total power failure, but the NonStop server
contains a number of mechanisms to minimize the effects of power failures. These
mechanisms have the added advantage of enabling the operations staff or HP service
personnel to take individual hardware components out of operation for repair without
shutting down the whole system.
Each processor has its own power supply and can be brought up and shut down
independently of all the other processors so that individual repairs can be performed.
The ability to remove and repair an individual component while the rest of the system
continues to operate is known as online repair.
As in the case of processors, the I/O board containing the logic for ServerNet
addressable controllers can be individually powered up or down to allow it to be
replaced while the system continues to operate.
The power supply for each processor includes a battery backup system. When AC power
is lost, the battery provides a ride-through feature in addition to the commonly
implemented power-fail interrupt and memory maintenance function.
The ride-through feature (or power-fail delay) permits the processor to continue
operating for about 20 to 30 seconds without AC power. If the power outage lasts
longer than the ride-through time, then the usual power-fail interrupt occurs to protect
the contents of memory. The battery can maintain the contents of main memory for up
to several hours, depending on the size of memory.
In the case of a full shutdown following a power failure, assuming that power is
restored while the batteries are still maintaining the memory contents, the system
automatically resumes operation within minutes following restoration of power. After
bringing disks and tapes back to full operating speed, the system recovers any files
that might have been compromised and resumes processing transactions against
these files. If the power outage lasts longer than the batteries can maintain the
memory contents, operator intervention is required, possibly with an alternate AC
power source.
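The three outcomes described above (ride-through, automatic resumption, and operator intervention) can be summarized as a decision on the length of the outage. The following Python sketch is illustrative only and is not part of the NonStop software. The names OutageOutcome and classify_outage are hypothetical, the 20-second ride-through default is taken from the low end of the range quoted above, and the 2-hour memory hold time is an assumed placeholder, because the actual figure depends on the size of memory. For simplicity, both limits are measured from the start of the outage.

```python
from enum import Enum


class OutageOutcome(Enum):
    RIDE_THROUGH = "processor keeps running on battery"
    AUTO_RESUME = "power-fail interrupt, memory preserved, automatic resume"
    OPERATOR_INTERVENTION = "memory contents lost, operator intervention required"


def classify_outage(outage_s, ride_through_s=20.0, memory_hold_s=2 * 3600.0):
    """Classify an AC power outage by its duration in seconds.

    ride_through_s and memory_hold_s are assumed values, not
    specifications for any particular server model.
    """
    if outage_s <= ride_through_s:
        # Short outage: the battery carries the processor through.
        return OutageOutcome.RIDE_THROUGH
    if outage_s <= memory_hold_s:
        # Longer outage: the power-fail interrupt occurs and the battery
        # preserves main memory until power returns.
        return OutageOutcome.AUTO_RESUME
    # The battery can no longer maintain the memory contents.
    return OutageOutcome.OPERATOR_INTERVENTION


if __name__ == "__main__":
    for seconds in (5, 600, 10 * 3600):
        print(seconds, classify_outage(seconds).value)
```

With the assumed defaults, a 5-second outage is ridden through, a 10-minute outage ends in an automatic resume, and a 10-hour outage requires operator intervention.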
Detection and Correction of Hardware Errors
As explained in Processor Checking on page 6-12, the operating system running in
each processor in the NonStop server checks the status of all other processors in the
system by sending periodic messages, called “I’m alive” messages, to each processor.
In addition, the processors themselves perform extensive self-checking. When an
error occurs, the processor either reports it to the operating system or takes itself out of
service.
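The idea behind the periodic "I'm alive" messages can be sketched as a simple timeout check. The Python example below is a minimal illustration, not the operating system's actual mechanism. The class name AliveMonitor, its methods, and the timeout value are all hypothetical, and this section does not state the real message interval or timeout. What the operating system does with a processor that stops sending messages is outside the scope of this sketch.

```python
class AliveMonitor:
    """Track the last "I'm alive" message seen from each processor."""

    def __init__(self, processor_ids, timeout_s, start_s=0.0):
        self.timeout_s = timeout_s
        # Every processor is treated as last seen at start-up time.
        self.last_seen = {pid: start_s for pid in processor_ids}

    def record_alive(self, pid, now_s):
        """Note that an "I'm alive" message arrived from processor pid."""
        self.last_seen[pid] = now_s

    def suspects(self, now_s):
        """Return processors whose last message is older than the timeout."""
        return sorted(
            pid
            for pid, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = AliveMonitor(processor_ids=[0, 1, 2], timeout_s=2.0)
    for pid in (0, 1, 2):
        monitor.record_alive(pid, now_s=1.0)
    # Processor 2 falls silent; processors 0 and 1 keep reporting.
    monitor.record_alive(0, now_s=3.0)
    monitor.record_alive(1, now_s=3.0)
    print(monitor.suspects(now_s=3.5))
```

In this run only processor 2 is reported, because its last message is 2.5 seconds old and the assumed timeout is 2 seconds.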
In some instances, processors are able to correct errors and continue running rather
than halt. For example, when a word of main memory has an error, the processor detects
it and, if the error is correctable, uses the error correcting code (ECC) information
to derive the correct data and rewrite the word.
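To show how stored check bits allow a memory word to be corrected, the Python sketch below uses a Hamming(7,4) code, a textbook ECC that protects 4 data bits with 3 check bits and can correct any single-bit error. It is an illustration of the general technique only. This section does not say which code NonStop memory uses, and the function names encode and correct are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) as a 7-bit Hamming codeword.

    The result is a list of bits for positions 1..7. Positions 1, 2,
    and 4 hold check bits; positions 3, 5, 6, and 7 hold data bits.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 is unused so that indexes match positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def correct(word):
    """Correct up to one flipped bit in a 7-bit word.

    Returns (data nibble, corrected word, syndrome). A syndrome of 0
    means no error was seen; otherwise it is the position (1..7) of
    the bit that was flipped back.
    """
    code = [0] + list(word)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if syndrome:
        code[syndrome] ^= 1  # rewrite the word with the bad bit repaired
    data = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(data))
    return nibble, code[1:], syndrome


if __name__ == "__main__":
    stored = encode(0b1011)
    damaged = list(stored)
    damaged[4] ^= 1  # flip the bit at position 5
    nibble, repaired, syndrome = correct(damaged)
    print(nibble == 0b1011, repaired == stored, syndrome)
```

A plain Hamming(7,4) code miscorrects a word with two flipped bits, which is one reason correction is described as happening only "if possible". Codes used in practice typically add a further check bit so that double-bit errors are at least detected.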