Managing HP Serviceguard for Linux, Tenth Edition, September 2012

Removing Serviceguard from a System....................................................................288
8 Troubleshooting Your Cluster.................................................................................................289
Testing Cluster Operation .....................................................................................289
Testing the Package Manager ...........................................................................289
Testing the Cluster Manager .............................................................................290
Monitoring Hardware ..........................................................................................290
Replacing Disks....................................................................................................291
Replacing a Faulty Mechanism in a Disk Array....................................................291
Replacing a Lock LUN......................................................................................291
Revoking Persistent Reservations after a Catastrophic Failure.......................................292
Examples........................................................................................................293
Replacing LAN Cards...........................................................................................293
Replacing a Failed Quorum Server System...............................................................294
Troubleshooting Approaches .................................................................................296
Reviewing Package IP Addresses .......................................................................296
Reviewing the System Log File ...........................................................................297
Sample System Log Entries ..........................................................................297
Reviewing Object Manager Log Files .................................................................298
Reviewing Configuration Files ...........................................................................298
Reviewing the Package Control Script ................................................................298
Using the cmquerycl and cmcheckconf Commands...............................................299
Reviewing the LAN Configuration ......................................................................299
Solving Problems .................................................................................................299
Name Resolution Problems................................................................................300
Networking and Security Configuration Errors.................................................300
Cluster Re-formations Caused by Temporary Conditions........................................300
Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low.................300
System Administration Errors .............................................................................301
Package Control Script Hangs or Failures ......................................................302
Package Movement Errors ................................................................................304
Node and Network Failures .............................................................................304
Troubleshooting the Quorum Server....................................................................304
Authorization File Problems...........................................................................304
Timeout Problems........................................................................................305
Messages..................................................................................................305
Lock LUN Messages.........................................................................................305
A Designing Highly Available Cluster Applications ....................................................................306
Automating Application Operation ........................................................................306
Insulate Users from Outages .............................................................................307
Define Application Startup and Shutdown ..........................................................307
Controlling the Speed of Application Failover ..........................................................308
Replicate Non-Data File Systems .......................................................................308
Evaluate the Use of a Journaled Filesystem (JFS)...................................................308