Computer Hardware User Manual

7.6 User ID Problems
Within an HACMP cluster, more than one node can potentially offer the same
service to a specific user or user ID.
Because the node providing the service can change, the system administrator
must ensure that the same users and groups are known to all nodes that might
run an application. If one node fails and the application is taken over by
the standby node, a user can continue working because the takeover node
knows that user under exactly the same user and group IDs.
Because access within an NFS mounted file system is granted based on numeric
user IDs, the same requirement applies to NFS mounted file systems.
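The consistency requirement described above can be checked mechanically. The following Python sketch is one possible approach, not a tool shipped with HACMP: it assumes you have already collected the contents of /etc/passwd (or /etc/group) from each node as text, and the function names parse_ids and find_mismatches are hypothetical. It reports every account whose numeric ID differs between nodes or that is missing on a node.

```python
from typing import Dict, Optional


def parse_ids(text: str, id_field: int = 2) -> Dict[str, int]:
    """Map each account name to its numeric ID.

    Works for colon-separated passwd- or group-style text, where the
    numeric ID is the third field (index 2).
    """
    ids: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) <= id_field:
            continue  # too few fields; skip malformed entry
        try:
            ids[fields[0]] = int(fields[id_field])
        except ValueError:
            continue  # non-numeric ID; skip malformed entry
    return ids


def find_mismatches(
    per_node: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, Optional[int]]]:
    """Return accounts whose ID is not identical on every node.

    The result maps account name -> {node name: ID, or None if the
    account does not exist on that node}.
    """
    all_names = set()
    for ids in per_node.values():
        all_names.update(ids)

    mismatches: Dict[str, Dict[str, Optional[int]]] = {}
    for name in sorted(all_names):
        seen = {node: ids.get(name) for node, ids in per_node.items()}
        if len(set(seen.values())) > 1:
            mismatches[name] = seen
    return mismatches
```

To use it, pass one parsed dictionary per node, for example find_mismatches({"nodeA": parse_ids(passwd_a), "nodeB": parse_ids(passwd_b)}), where passwd_a and passwd_b are placeholders for the file contents gathered from each node. An empty result means every account has the same ID on all nodes. Run the same check against the group files to compare group IDs.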
For more information on managing user and group accounts within a cluster,
refer to Chapter 2.7, “User ID Planning” on page 48, or to Chapter 12,
“Managing User and Groups in a Cluster” of the HACMP for AIX, Version 4.3:
Administration Guide, SC23-4279.
7.7 Troubleshooting Strategy
A systematic strategy helps you pinpoint a problem in the cluster and find a
solution quickly. The following guidelines should make the troubleshooting
process more productive:
• Save the log files associated with the problem before they become
  unavailable. Make sure you save the /tmp/hacmp.out and /tmp/cm.log files
  before you do anything else to try to figure out the cause of the problem.
• Attempt to duplicate the problem. Do not rely too heavily on the user’s
  problem report. The user has only seen the problem from the application
  level. If necessary, obtain the user’s data files to recreate the problem.
• Approach the problem methodically. Allow the information gathered from
  each test to guide your next test. Do not jump back and forth between
  tests based on hunches.
• Keep an open mind. Do not assume too much about the source of the
  problem. Test each possibility and base your conclusions on the evidence
  of the tests.
• Isolate the problem. When tracking down a problem within an HACMP
  cluster, isolate each component of the system that can fail and determine
  whether it is working. Work from top to bottom, following the progression
  described in the following section.
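The first guideline, saving the log files before doing anything else, is easy to script so that it is not skipped under pressure. The following Python sketch is a minimal illustration under stated assumptions: the function name save_logs, the destination directory, and the timestamped directory naming are hypothetical choices, and only the two log paths come from the text above.

```python
import shutil
import time
from pathlib import Path
from typing import Iterable, List

# Log files named in the guideline above.
DEFAULT_LOGS = ("/tmp/hacmp.out", "/tmp/cm.log")


def save_logs(dest_root: str, logs: Iterable[str] = DEFAULT_LOGS) -> List[Path]:
    """Copy each existing log into a new timestamped directory.

    Logs that do not exist are skipped. Returns the paths of the copies.
    """
    # Assumption: one directory per incident, named by local time.
    dest = Path(dest_root) / time.strftime("hacmp-logs-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)

    saved: List[Path] = []
    for name in logs:
        src = Path(name)
        if src.is_file():
            # copy2 also preserves the modification time, which helps
            # when correlating log entries with the time of the failure.
            saved.append(Path(shutil.copy2(src, dest / src.name)))
    return saved
```

For example, save_logs("/var/adm/hacmp-incidents") would copy whichever of the two logs exist into a new subdirectory of that (hypothetical) location. Choose a destination on a file system that is not affected by the problem being investigated.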