BladeCluster Solution Manual

10 Troubleshooting

This chapter provides procedures for troubleshooting and recovering from basic problems that might occur when installing, changing, or migrating to a BladeCluster topology. It includes these topics:

• “Using OSM to Suppress BladeCluster Alarms” (page 83)

• “Restoring Connectivity to a Node” (page 83)

• “Switching the SNETMON Primary and Backup Processes” (page 84)

• “Starting Required Processes and Subsystems” (page 84)

• “Fallback Procedures” (page 84)

Using OSM to Suppress BladeCluster Alarms

OSM T0682 H02^ACZ and later can suppress, on other directly connected nodes, the dial-out of alarms that result from intentionally halting or stopping the SNETMON ($ZZSCL) process on the local node. As part of this feature, OSM provides a Place Local Node in Service action on the BladeCluster object. This action effectively stops the $ZZSCL subsystem on the system and triggers the suppression of alarms on directly connected remote nodes. The suppression state persists until either the OSM CIMOM is restarted or the Place Local Node in Service action is performed again to change the Node in Service attribute value to No.

NOTE: This feature requires that OSM T0682 H02^ACZ or later be running not only on the local node, but also on all directly connected remote nodes; it will not suppress these alarms on remote nodes running earlier OSM versions.

Restoring Connectivity to a Node

Temporary problems sometimes cause a loss of connectivity in a BladeCluster fabric. In this case, direct ServerNet connectivity is restored automatically after approximately 25 seconds multiplied by the number of remote nodes in the BladeCluster. If connectivity is not restored after that interval:

1. Use SCF to gather more information on a node:

• “Checking the External Fabric for All Nodes” (page 70)

• “Checking the Operation of Each Node” (page 70)

2. Use the OSM Service Connection to gather more information on a node: check for alarms, wait several minutes to see whether the alarms clear, and verify that the ServerNet switches are operational.

3. Use SCF to start the fabric on all affected nodes. For example, you can issue this command from the local node for a remote node:

START SERVERNET \Remotenode.$ZSNET.Fabric.*

4. If you continue to have problems, check that all fabrics, required processes, and the ServerNet cluster subsystem are started for that node:

• “Checking MSGMON, SANMAN, and SNETMON” (page 71)

• “Checking the Operation of the Expand Processes and Lines” (page 74)

• “Checking the ServerNet Cluster Subsystem” (page 72)

5. If you continue to have problems, switch the SNETMON primary and backup processes as described in “Switching the SNETMON Primary and Backup Processes” (page 84).

6. If you continue to have problems, connectivity might be down due to a hardware failure. Refer to the NonStop Operations Guide.
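
As a rough illustration of the automatic-recovery interval and the step 3 command described in this section, the following Python sketch estimates how long to wait before starting the procedure and builds the SCF command text for each affected remote node. It is a planning aid only, not part of OSM or SCF: it does not communicate with either product, the function names are hypothetical, and the node names (\NODEB, \NODEC, \NODED) are placeholders. The 25-second figure is the approximate value quoted above, so the result is an estimate.

```python
# Approximate per-remote-node recovery interval quoted in this section.
SECONDS_PER_REMOTE_NODE = 25


def expected_recovery_seconds(remote_node_count: int) -> int:
    """Estimate the time before direct ServerNet connectivity is restored."""
    if remote_node_count < 0:
        raise ValueError("remote_node_count must not be negative")
    return SECONDS_PER_REMOTE_NODE * remote_node_count


def start_fabric_commands(remote_nodes: list[str]) -> list[str]:
    """Build the step 3 SCF command text for each affected remote node."""
    return [f"START SERVERNET {node}.$ZSNET.Fabric.*" for node in remote_nodes]


if __name__ == "__main__":
    # Hypothetical node names, used only for illustration.
    nodes = [r"\NODEB", r"\NODEC", r"\NODED"]
    wait = expected_recovery_seconds(len(nodes))
    print(f"Wait about {wait} seconds before troubleshooting")
    # Print the command text; enter it in SCF on the local node.
    for command in start_fabric_commands(nodes):
        print(command)
```

For a BladeCluster with three remote nodes, the estimate is about 75 seconds. The generated lines follow the form of the example command in step 3, with the placeholder node name substituted.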
