NonStop Systems Introduction

NonStop Server Architecture

NonStop Systems Introduction—527825-001

Multiple Power Sources and Online Repair

One threat to continuous system operation lies outside the system itself: the danger of a power failure. No system is immune to a total power failure, but the NonStop server contains a number of mechanisms to minimize the effects of power failures. These mechanisms have the added advantage of enabling the operations staff or HP service personnel to take individual hardware components out of operation for repair without shutting down the whole system.

Each processor has its own power supply and can be brought up and shut down independently of all the other processors so that individual repairs can be performed. The ability to remove and repair an individual component while the rest of the system continues to operate is known as online repair.

As with processors, the I/O board containing the logic for ServerNet addressable controllers can be individually powered up or down, so that it can be replaced while the system continues to operate.

The power supply for each processor includes a battery backup system. In addition to the commonly implemented power-fail interrupt and memory maintenance function, the battery provides a ride-through feature when AC power is lost. The ride-through feature (or power-fail delay) permits the processor to continue operating for about 20 to 30 seconds without AC power. If the power outage lasts longer than the ride-through time, the usual power-fail interrupt occurs to protect the contents of memory. The battery can maintain the contents of main memory for up to several hours, depending on the size of memory.

If a full shutdown follows a power failure and power is restored while the batteries are still maintaining the memory contents, the system automatically resumes operation within minutes of power being restored. After bringing disks and tapes back to full operating speed, the system recovers any files that might have been compromised and resumes processing transactions against these files. If the power outage lasts longer than the batteries can maintain the memory contents, operator intervention is required, possibly with an alternate AC power source.
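
The three outcomes described above (ride-through, automatic resumption, operator intervention) can be summarized as a small decision function. This is only an illustrative sketch, not NonStop code: the names Outcome and classify_outage are hypothetical, the 20-second default is the low end of the "about 20 to 30 seconds" range, and the 2-hour memory hold-up time is an assumed placeholder for "up to several hours". For simplicity, both limits are measured from the moment AC power is lost.

```python
from enum import Enum


class Outcome(Enum):
    RIDE_THROUGH = "processor keeps running on battery"
    AUTO_RESUME = "memory preserved; system resumes automatically"
    OPERATOR_NEEDED = "memory contents lost; operator intervention required"


def classify_outage(outage_s, ride_through_s=20.0, memory_holdup_s=2 * 3600.0):
    """Classify an AC outage by its duration in seconds.

    ride_through_s and memory_holdup_s are assumed values; the real
    figures depend on the hardware and on the size of memory.
    """
    if outage_s < 0:
        raise ValueError("outage duration cannot be negative")
    if outage_s <= ride_through_s:
        # Short outage: the processor never stops.
        return Outcome.RIDE_THROUGH
    if outage_s <= memory_holdup_s:
        # Power-fail interrupt occurred, but the battery kept memory intact.
        return Outcome.AUTO_RESUME
    # The battery could no longer maintain memory contents.
    return Outcome.OPERATOR_NEEDED


if __name__ == "__main__":
    for seconds in (5, 600, 10 * 3600):
        print(seconds, classify_outage(seconds).value)
```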

Detection and Correction of Hardware Errors

As explained in Processor Checking on page 6-12, the operating system running in each processor in the NonStop server checks the status of all other processors in the system by sending periodic messages, called “I’m alive” messages, to each processor. In addition, the processors themselves perform extensive self-checking. When an error occurs, the processor either reports it to the operating system or takes itself out of service.
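
The idea behind periodic "I'm alive" messages can be sketched as a simple liveness table. This is a generic heartbeat sketch under stated assumptions, not the NonStop operating system's actual algorithm: the class LivenessMonitor, its methods, and the timeout value are all hypothetical, and the text above does not specify message intervals or what happens once a processor is suspected to be down.

```python
class LivenessMonitor:
    """Tracks when each peer processor was last heard from."""

    def __init__(self, peers, timeout_s, start=0.0):
        self.timeout_s = timeout_s  # assumed silence limit, in seconds
        self.last_seen = {peer: start for peer in peers}

    def record_alive(self, peer, now):
        # Called whenever an "I'm alive" message arrives from a peer.
        self.last_seen[peer] = now

    def suspected_down(self, now):
        # Peers that have been silent for longer than the timeout.
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = LivenessMonitor(peers=[0, 1, 2], timeout_s=2.0)
    monitor.record_alive(0, now=1.0)
    monitor.record_alive(1, now=1.0)
    # Processor 2 has sent nothing since time 0.0.
    print(monitor.suspected_down(now=2.5))
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.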

In some instances, processors are able to correct errors and continue running rather than halt. For example, if an error occurs in main memory, the processor detects and, if possible, corrects the error using an error correcting code (ECC). When a word of main memory has a correctable error, the processor uses the ECC information to derive the correct data and rewrites the word.
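
To show how an error correcting code lets hardware derive the correct data from a damaged word, here is a minimal sketch using the textbook Hamming(7,4) code. This is a stand-in chosen for brevity: the text above does not say which ECC the NonStop processors use, and real memory ECC works on much wider words. The function names encode and correct_and_decode are hypothetical.

```python
def encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]  # data[0] is the LSB
    code = [0] * 8  # index 0 unused; positions 1..7 as in the textbook layout
    code[3], code[5], code[6], code[7] = data
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def correct_and_decode(bits):
    """Return (nibble, error_position); error_position is 0 if no error.

    Corrects any single flipped bit. Two or more flipped bits are beyond
    what this small code can repair.
    """
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if syndrome:
        # A nonzero syndrome is the position of the flipped bit.
        code[syndrome] ^= 1
    data = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(data))
    return nibble, syndrome


if __name__ == "__main__":
    word = encode(0b1011)
    word[4] ^= 1  # simulate a single-bit memory error at position 5
    value, position = correct_and_decode(word)
    print(bin(value), position)
```

After correction, the repaired codeword would be written back to memory, which corresponds to the "rewrites the word" step described above.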