Availability Guide for Problem Management (125509)

Auditing Systems for Fault Tolerance
Continuous Operations
Fault Tolerance in the Client/Server Environment
Tandem provides fault tolerance in the client/server environment with its new NonStop
Access for Networking (NSAN) product, a joint Tandem and Ungermann-Bass effort.
NSAN is a networking solution that delivers fault tolerance from the server out to the
desktop through the creation of primary and alternate paths between PCs, networking
hubs, and Tandem NonStop and Integrity servers.
Fault tolerance is provided to client and server applications through the use of fully
redundant local area networks (LANs) and LAN connections. The Tandem server
connection is achieved through software that supports the use of two 3615 Ethernet
controllers, paired to support a single LAN communications line.
The PC client fault-tolerant connection is achieved through the use of the UB
Networking MasterLAN II-T2 dual port LAN adapter and associated drivers.
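The primary and alternate path arrangement described above can be illustrated with a small sketch. This is not NSAN code: the Path class, its up flag, and the send_with_failover function are hypothetical stand-ins for a LAN connection, used only to show how traffic might move to the alternate path when the primary path fails.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """Hypothetical stand-in for one LAN path (adapter port, hub, controller)."""
    name: str
    up: bool = True
    delivered: list = field(default_factory=list)

    def send(self, frame: str) -> None:
        # A path that is down cannot carry traffic.
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.delivered.append(frame)


def send_with_failover(frame: str, primary: Path, alternate: Path) -> str:
    """Send on the primary path; fall back to the alternate if it fails.

    Returns the name of the path that carried the frame.
    """
    try:
        primary.send(frame)
        return primary.name
    except ConnectionError:
        alternate.send(frame)  # raises ConnectionError if both paths are down
        return alternate.name


primary = Path("lan-a")
alternate = Path("lan-b")
first = send_with_failover("frame-1", primary, alternate)
primary.up = False  # simulate a failed hub or cable on the primary path
second = send_with_failover("frame-2", primary, alternate)
```

In this sketch the application sees no error as long as one of the two paths is still up, which is the property the redundant LAN connections are meant to provide.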
Tandem TCP/IP and Fault Tolerance
Tandem TCP/IP runs on the NonStop Kernel, and allows heterogeneous systems in a
multinetwork environment to communicate with each other. The Tandem TCP/IP
process provides fault tolerance by running as a NonStop process pair. The primary
process attempts to start the backup process as it completes its initialization. If the
attempt fails, the primary process waits and then attempts to start the backup process
again. After each failed attempt, the primary process waits slightly longer before
retrying, until the delay reaches its maximum of 10 minutes.
The primary TCP/IP process keeps the backup process ready to take over by
checkpointing all configuration changes and socket creation, deletion, and state changes.
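The checkpointing relationship can be sketched in a few lines of Python. This is a simplified illustration, not the Tandem implementation: the PairState and Primary classes, the message kinds, and the socket state names are all assumptions chosen to mirror the categories named above (configuration changes and socket creation, deletion, and state changes).

```python
from dataclasses import dataclass, field


@dataclass
class PairState:
    """State the backup must mirror in order to take over (hypothetical layout)."""
    config: dict = field(default_factory=dict)
    sockets: dict = field(default_factory=dict)  # socket id -> state

    def apply(self, msg: tuple) -> None:
        """Apply one checkpoint message of the form (kind, key, value)."""
        kind, key, value = msg
        if kind == "config":
            self.config[key] = value
        elif kind in ("socket_create", "socket_state"):
            self.sockets[key] = value
        elif kind == "socket_delete":
            self.sockets.pop(key, None)
        else:
            raise ValueError(f"unknown checkpoint kind: {kind}")


class Primary:
    """Applies each change locally, then checkpoints it to the backup."""

    def __init__(self, backup: PairState):
        self.state = PairState()
        self.backup = backup

    def change(self, kind: str, key, value=None) -> None:
        msg = (kind, key, value)
        self.state.apply(msg)
        self.backup.apply(msg)  # the checkpoint: backup sees the same change


backup = PairState()
primary = Primary(backup)
primary.change("config", "example-setting", "on")   # hypothetical setting name
primary.change("socket_create", 1, "LISTEN")
primary.change("socket_state", 1, "ESTABLISHED")
primary.change("socket_create", 2, "LISTEN")
primary.change("socket_delete", 2)
```

Because every change is sent to the backup as it happens, the backup's state matches the primary's at the moment of a takeover.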
Running TCP/IP as a NonStop process pair provides a persistent TCP/IP process, one that
recovers from processor failures and from Subsystem Control Facility (SCF) primary requests.
If the backup process of a TCP/IP NonStop pair abends for any reason, the primary
process waits for a period of time and then tries to restart the backup process in the
configured processor. The delay starts at 5 seconds and increases after each failure until
it reaches a maximum of 10 minutes. It is reset to the initial value each time the backup
processor is reloaded. When the backup process abends, an EMS message is issued that
reports the time after which the primary process will attempt to create the backup.
An abending backup process also creates a saveabend dump file. This file contains only
the stack area of the process, not the full QIO segment. If the backup abends during its
initialization code, no saveabend file is created.
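The restart delay behavior can be sketched as follows. The 5-second starting value, the 10-minute cap, and the reset on processor reload come from the text; the text does not say by how much the delay grows, so the fixed step_s increment is an assumption, as are the class and method names.

```python
INITIAL_DELAY_S = 5      # from the text: the delay starts at 5 seconds
MAX_DELAY_S = 10 * 60    # from the text: the delay is capped at 10 minutes


class BackupRestartTimer:
    """Tracks how long the primary waits before retrying a backup restart."""

    def __init__(self, step_s: int = 5):
        # step_s is an assumption; the source only says the delay increases.
        self.step_s = step_s
        self.delay_s = INITIAL_DELAY_S

    def next_delay(self) -> int:
        """Return the delay for this attempt, then lengthen the next one."""
        current = self.delay_s
        self.delay_s = min(self.delay_s + self.step_s, MAX_DELAY_S)
        return current

    def processor_reloaded(self) -> None:
        """Reset to the initial delay when the backup processor is reloaded."""
        self.delay_s = INITIAL_DELAY_S


timer = BackupRestartTimer()
first_three = [timer.next_delay() for _ in range(3)]
for _ in range(200):          # many consecutive failures
    timer.next_delay()
capped = timer.next_delay()   # the delay no longer grows past the cap
timer.processor_reloaded()
after_reload = timer.next_delay()
```

Under these assumptions the delays run 5, 10, 15 seconds and so on, stop growing at 600 seconds, and return to 5 seconds after a reload.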
When using TCP/IP as a NonStop process pair, it is important to configure the QIO
subsystem in the backup processor with at least the same resources with which the
primary processor has been configured. Otherwise, the TCP/IP backup process may be
unable to process some checkpoint messages, or may even fail to start.
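A pre-flight check of this rule might look like the sketch below. It is only an illustration: the qio_shortfalls function and the resource names segment_mb and buffers are hypothetical and do not correspond to actual QIO configuration attributes.

```python
def qio_shortfalls(primary: dict, backup: dict) -> dict:
    """Return the resources for which the backup has less than the primary.

    Maps resource name -> (primary value, backup value). A resource that is
    missing from the backup configuration is treated as 0.
    """
    return {
        name: (needed, backup.get(name, 0))
        for name, needed in primary.items()
        if backup.get(name, 0) < needed
    }


# Hypothetical resource names and values, for illustration only.
primary_qio = {"segment_mb": 64, "buffers": 4096}
backup_qio = {"segment_mb": 32, "buffers": 4096}
shortfalls = qio_shortfalls(primary_qio, backup_qio)
```

An empty result means the backup processor is configured with at least the resources of the primary; any entry identifies a resource to increase before starting the process pair.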
Refer to the Tandem TCP/IP Configuration and Management Manual for more
information about running TCP/IP as a NonStop process pair.
Note. QIO is a data communications subsystem that performs input and output (I/O) to and
from a local area network (LAN) controller.