BladeCluster Solution Manual

10 Troubleshooting
This chapter provides procedures for troubleshooting and recovering from basic problems that might
occur when you install, change, or migrate to a BladeCluster topology. It includes these topics:
“Using OSM to Suppress BladeCluster Alarms” (page 83)
“Restoring Connectivity to a Node” (page 83)
“Switching the SNETMON Primary and Backup Processes” (page 84)
“Starting Required Processes and Subsystems” (page 84)
“Fallback Procedures” (page 84)
Using OSM to Suppress BladeCluster Alarms
OSM T0682 H02^ACZ and later can suppress, on other directly-connected nodes, the dial-out of
alarms that result from either intentionally halting or stopping the SNETMON ($ZZSCL) process
on the local node. As part of this feature, OSM provides a Place Local Node in Service action on
the BladeCluster object. This action effectively stops the $ZZSCL subsystem on the system and
triggers the suppression of alarms on directly-connected remote nodes. The suppression state
persists until either the OSM CIMOM is restarted or the Place Local Node in Service action is
performed again to change the Node in Service attribute value to No.
NOTE: This feature requires that OSM T0682 H02^ACZ or later be running not only on the local
node, but also on all directly-connected remote nodes; it will not suppress these alarms on remote
nodes running earlier OSM versions.
Restoring Connectivity to a Node
Temporary problems can sometimes cause a loss of connectivity in a BladeCluster fabric. In this case,
direct ServerNet connectivity is automatically restored after an interval of approximately 25 seconds
times the number of remote nodes in the BladeCluster. For example, with four remote nodes, allow
approximately 100 seconds. If connectivity is not restored:
1. Use SCF to gather more information on a node:
“Checking the External Fabric for All Nodes” (page 70)
“Checking the Operation of Each Node” (page 70)
2. Use the OSM Service Connection to gather more information on a node by checking for
alarms, waiting to see if these alarms clear after several minutes, and verifying that the
ServerNet switches are operational.
3. Use SCF to start the fabric on all affected nodes. For example, you can issue this command
from the local node for a remote node:
START SERVERNET \Remotenode.$ZSNET.Fabric.*
4. If you continue to have problems, check that all fabrics, required processes, and the ServerNet
cluster subsystem are started for that node:
“Checking MSGMON, SANMAN, and SNETMON” (page 71)
“Checking the Operation of the Expand Processes and Lines” (page 74)
“Checking the ServerNet Cluster Subsystem” (page 72)
5. If you continue to have problems, switch the SNETMON primary and backup processes as
described in “Switching the SNETMON Primary and Backup Processes” (page 84).
6. If you continue to have problems, connectivity might be down due to a hardware failure. Refer
to the NonStop Operations Guide.
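
The waiting interval and the step 3 command in this procedure lend themselves to a small planning aid. The following Python sketch is illustrative only and is not part of any NonStop tool. It computes the approximate automatic-recovery interval (25 seconds times the number of remote nodes) and builds one START SERVERNET command string per remote node, following the pattern shown in step 3. The function names and the node names (\NODEA and so on) are hypothetical. The script only prints text and does not run SCF.

```python
from typing import List

# Approximate per-node figure stated in this section.
SECONDS_PER_REMOTE_NODE = 25


def expected_recovery_seconds(remote_node_count: int) -> int:
    """Return the approximate interval after which connectivity is
    automatically restored: 25 seconds times the number of remote nodes."""
    if remote_node_count < 0:
        raise ValueError("remote_node_count must be non-negative")
    return SECONDS_PER_REMOTE_NODE * remote_node_count


def start_fabric_commands(remote_nodes: List[str]) -> List[str]:
    """Build one SCF command string per remote node, using the
    START SERVERNET pattern from step 3 of this procedure."""
    commands = []
    for node in remote_nodes:
        # Accept node names written with or without the leading backslash.
        name = node.lstrip("\\")
        commands.append(f"START SERVERNET \\{name}.$ZSNET.Fabric.*")
    return commands


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    nodes = ["\\NODEA", "\\NODEB", "\\NODEC"]
    print(f"Wait about {expected_recovery_seconds(len(nodes))} seconds first.")
    for command in start_fabric_commands(nodes):
        print(command)
```

If connectivity has not returned after the computed interval, you can paste the printed commands into an SCF session on the local node, after completing the checks in steps 1 and 2.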