Managing HP Serviceguard for Linux, Sixth Edition, August 2006

Troubleshooting Your Cluster
Solving Problems
Chapter 8
Cluster Re-formations
Cluster re-formations may occur from time to time due to current cluster
conditions. Some of the causes are as follows:
• local switch on an Ethernet LAN if the switch takes longer than the
  cluster NODE_TIMEOUT value. To prevent this problem, you can
  increase the cluster NODE_TIMEOUT value.
• excessive network traffic on heartbeat LANs. To prevent this, you
  can use dedicated heartbeat LANs, or LANs with less traffic on
  them.
• an overloaded system, with too much total I/O and network traffic.
• an improperly configured network, for example, one with a very large
  routing table.
In these cases, applications continue running, though they might
experience a small performance impact during cluster re-formation.
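Before raising NODE_TIMEOUT, it helps to see what the cluster is currently
using. The following is a minimal sketch, not a Serviceguard tool: the
helper name get_cluster_param and the file path are hypothetical, and it
assumes the cluster configuration file uses the usual "NAME value" layout
with "#" comment lines.

```shell
# Print the value of one parameter from a cluster configuration file.
# Assumes one "NAME value" pair per line; lines starting with "#" are
# comments and are never matched.
get_cluster_param() {
    # $1 = path to the configuration file, $2 = parameter name
    awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# Hypothetical usage on a cluster node (cluster and file names are
# examples only):
#   cmgetconf -c mycluster /tmp/mycluster.conf
#   get_cluster_param /tmp/mycluster.conf NODE_TIMEOUT
#   ...edit NODE_TIMEOUT in the file, then verify and apply it:
#   cmcheckconf -C /tmp/mycluster.conf
#   cmapplyconf -C /tmp/mycluster.conf
```

Check the units of NODE_TIMEOUT in the comments of your own configuration
file before changing the value.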
System Administration Errors
You can make a number of errors when configuring Serviceguard that will
not show up when you start the cluster. The cluster can be running, and
everything can appear to be fine, until a hardware or software failure
occurs and control of your packages is not transferred to another node
as you expected.
These problems are caused specifically by mistakes in the cluster
configuration file and package configuration scripts. Examples include:
• Volume groups not defined on adoptive node.
• Mount point does not exist on adoptive node.
• Network errors on adoptive node (configuration errors).
• User information not correct on adoptive node.
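One of these conditions, a missing mount point, can be checked ahead of
time by running a small script on each adoptive node. This is only a
sketch: the function name check_mount_points and the example paths are
hypothetical, and it tests only that each directory exists, not that the
file system can actually be mounted there.

```shell
# Report each mount point directory that is missing on this node.
# Arguments: one or more mount point paths used by the package.
# Returns 0 if every directory exists, 1 if any is missing.
check_mount_points() {
    missing=0
    for mp in "$@"; do
        if [ ! -d "$mp" ]; then
            echo "missing mount point: $mp"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical usage, run on the adoptive node:
#   check_mount_points /mnt/pkg1/data /mnt/pkg1/logs
```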
You can use the following commands to check the status of your disks:
• df - to see if the file systems on your package’s volume group are
  mounted.
• vgdisplay -v - to see if all volumes are present.
• strings /etc/lvmconf/*.conf - to ensure that the configuration is
  correct.
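The df check can be wrapped in a small helper so that a script can test
one mount point at a time. This is a sketch under stated assumptions: the
function name is_mounted, the mount point and the volume group name are
hypothetical, and the helper assumes the mount point path contains no
spaces.

```shell
# Succeed only if a file system is mounted exactly at the given path.
# df -P prints one header line, then one line per file system, with the
# mount point in the last column.
is_mounted() {
    [ "$(df -P "$1" 2>/dev/null | awk 'NR == 2 { print $NF }')" = "$1" ]
}

# Hypothetical usage for a package that mounts /mnt/pkg1/data from
# volume group vgpkg1:
#   is_mounted /mnt/pkg1/data || echo "/mnt/pkg1/data is not mounted"
#   vgdisplay -v vgpkg1
```

If the path is an ordinary directory on some other file system, df
reports that file system's mount point instead, so the comparison fails
and the helper returns a non-zero status.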