Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Troubleshooting Your Cluster

Solving Problems

Chapter 8

Cluster Re-formations

Cluster re-formations may occur from time to time due to current cluster

conditions. Some of the causes are as follows:

• a local switch on an Ethernet LAN, if the switch takes longer than the

cluster NODE_TIMEOUT value to complete. To prevent this problem, you can

increase the cluster NODE_TIMEOUT value.

• excessive network traffic on heartbeat LANs. To prevent this, you

can use dedicated heartbeat LANs, or LANs with less traffic on

them.

• an overloaded system, with too much total I/O and network traffic.

• an improperly configured network, for example, one with a very large

routing table.

In these cases, applications continue running, though they might

experience a small performance impact during cluster re-formation.
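
Before increasing NODE_TIMEOUT, it can help to confirm the value that is
currently configured. The following is a minimal sketch, assuming the
cluster configuration has been saved to a text file (for example, with
cmgetconf) and that the parameter appears on a line of the form
"NODE_TIMEOUT value". The function name node_timeout_value and the file
path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch only: print the NODE_TIMEOUT value from a saved cluster
# configuration file. Assumes a line of the form "NODE_TIMEOUT <value>".
# Comment lines are skipped because their first field is not the keyword.
node_timeout_value() {
    awk '$1 == "NODE_TIMEOUT" { print $2; exit }' "$1"
}

# Usage (the path is a placeholder for your saved configuration file):
#   node_timeout_value /tmp/cluster.conf
```

If the reported value is shorter than the time your LAN needs to complete
a local switch, that is a sign the timeout may need to be raised.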

System Administration Errors

There are a number of errors you can make when configuring

Serviceguard that will not show up when you start the cluster. Your

cluster can be running, and everything appears to be fine, until there is a

hardware or software failure and control of your packages is not

transferred to another node as you would have expected.

These problems are caused specifically by errors in the cluster configuration

file and package configuration scripts. Examples of these errors include:

• Volume groups not defined on adoptive node.

• Mount point does not exist on adoptive node.

• Network errors on adoptive node (configuration errors).

• User information not correct on adoptive node.
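
Because these errors stay hidden until a failover, one option is to check
the adoptive node ahead of time. The following is a minimal sketch that
covers only the mount point case from the list above: it prints every
directory passed to it that does not exist on the node where it runs. The
function name missing_dirs and the example paths are hypothetical; run it
on each adoptive node with the mount points your packages use.

```shell
#!/bin/sh
# Sketch only: print each argument that is not an existing directory.
# Intended to be run on an adoptive node with the package's mount points.
missing_dirs() {
    for dir in "$@"; do
        [ -d "$dir" ] || printf '%s\n' "$dir"
    done
}

# Usage (paths are placeholders for your package's mount points):
#   missing_dirs /mnt/pkg1 /mnt/pkg2
```

No output means every listed mount point exists on that node. The other
items in the list (volume groups, network settings, user information)
need their own checks.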

You can use the following commands to check the status of your disks:

• df - to see if the file systems in your package’s volume group are mounted.

• vgdisplay -v - to see if all volumes are present.

• strings /etc/lvmconf/*.conf - to ensure that the configuration is

correct.
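
The df check above can be scripted. The following is a minimal sketch,
assuming POSIX-format output from df -P, in which the mount point is the
sixth column. It reads that output on standard input and prints every
expected mount point that is not listed. The function name missing_mounts
and the example mount points are hypothetical.

```shell
#!/bin/sh
# Sketch only: read `df -P` output on stdin and print each mount point
# given as an argument that does not appear in column 6 of that output.
missing_mounts() {
    df_output=$(cat)
    for mp in "$@"; do
        # awk exits 0 when the mount point is found, 1 otherwise;
        # NR > 1 skips the header line.
        if ! printf '%s\n' "$df_output" |
            awk -v mp="$mp" 'NR > 1 && $6 == mp { found = 1 } END { exit !found }'
        then
            printf '%s\n' "$mp"
        fi
    done
}

# Usage (mount points are placeholders for your package's file systems):
#   df -P | missing_mounts /mnt/pkg1 /mnt/pkg2
```

No output means every listed file system is mounted. The sketch does not
handle mount points that contain spaces.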