Managing HP Serviceguard for Linux, Tenth Edition, September 2012

revokes them during package shutdown, using the sg_persist command. This command

is available, and has a manpage, on both Red Hat 5 and SUSE SLES 10/11.

Serviceguard makes a PR of type Write Exclusive Registrants Only (WERO) on the

package's LUN devices. This gives read access to any initiator regardless of whether

the initiator is registered or not, but grants write access only to those initiators that are

registered. (WERO is defined in the SPC-3 standard.)
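
The WERO access rule can be sketched as a small model. This is an illustration only, not Serviceguard or SCSI code: the Lun class and its method names are hypothetical, the key 10002 is made up for contrast, and the model assumes a WERO reservation is already held on the LUN.

```python
from dataclasses import dataclass, field


@dataclass
class Lun:
    """Toy model of a LUN on which a WERO reservation is held (hypothetical)."""
    registered_keys: set = field(default_factory=set)

    def can_read(self, key=None):
        # WERO: any initiator may read, whether or not it is registered.
        return True

    def can_write(self, key=None):
        # WERO: only initiators with a registered key may write.
        return key is not None and key in self.registered_keys


lun = Lun(registered_keys={"10001"})
print(lun.can_read())           # an unregistered initiator can still read
print(lun.can_write("10001"))   # a registered key can write
print(lun.can_write("10002"))   # an unregistered key (made up here) cannot write
```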

All initiators on each node running the package register with LUN devices using the same

PR Key, known as the node_pr_key. Each node in the cluster has a unique

node_pr_key, which you can see in the output of cmviewcl -f line; for example:

...

node:bla2|node_pr_key=10001
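
If you need the keys in a script, output of this form can be parsed as in the following sketch. It assumes every relevant line looks like the sample above (node:<name>|node_pr_key=<value>) and ignores everything else; the function name node_pr_keys is hypothetical.

```python
def node_pr_keys(cmviewcl_output):
    """Return {node_name: node_pr_key} from `cmviewcl -f line` text.

    Assumes lines of the form node:<name>|node_pr_key=<value>,
    as in the sample output; all other lines are skipped.
    """
    keys = {}
    for line in cmviewcl_output.splitlines():
        path, sep, rest = line.strip().partition("|")
        if not sep or not path.startswith("node:"):
            continue
        name, eq, value = rest.partition("=")
        if eq and name == "node_pr_key":
            keys[path[len("node:"):]] = value
    return keys


sample = "...\nnode:bla2|node_pr_key=10001\n"
print(node_pr_keys(sample))
```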

When a failover package starts up, any existing PR keys and reservations are cleared

from the underlying LUN devices first; then the node_pr_key of the node that the

package is starting on is registered with each LUN.

In the case of a multi-node package, the PR reservation is made for the underlying LUNs

by the first instance of the package, and the appropriate node_pr_key is registered

each time the package starts on a new node. If a node fails, the instances of the package

running on other nodes will remove the registrations of the failed node.
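
The sequence described above for failover and multi-node packages can be summarized in a toy model. This is a sketch of the documented behavior only, not how Serviceguard implements it: the SharedLun class and its methods are hypothetical, and 10002 is a made-up key for a second node.

```python
class SharedLun:
    """Toy model of PR state on one LUN (hypothetical, for illustration)."""

    def __init__(self):
        self.keys = set()       # registered node_pr_key values
        self.reserved = False   # whether a PR reservation is held

    def start_failover(self, node_key):
        # Existing keys and reservations are cleared first...
        self.keys.clear()
        self.reserved = False
        # ...then the starting node's key is registered. The model assumes
        # the WERO reservation is made again at this point.
        self.keys.add(node_key)
        self.reserved = True

    def start_multi_node(self, node_key):
        # The first instance makes the reservation; every instance registers.
        self.reserved = True
        self.keys.add(node_key)

    def remove_failed_node(self, node_key):
        # Instances on surviving nodes remove the failed node's registration.
        self.keys.discard(node_key)


failover = SharedLun()
failover.start_failover("10001")
failover.start_failover("10002")    # package fails over to a second node
print(sorted(failover.keys))        # only the new node's key remains

multi = SharedLun()
multi.start_multi_node("10001")
multi.start_multi_node("10002")
multi.remove_failed_node("10001")   # first node fails
print(sorted(multi.keys), multi.reserved)
```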

You can use cmgetpkgenv (1m) to see whether PR is enabled for a given package;

for example:

cmgetpkgenv pkg1

...

PKG_PR_MODE="pr_enabled"
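
A script can check this setting by parsing the NAME="value" lines, as in the sketch below. It assumes the output format shown in the sample; the function name pkg_env is hypothetical, and only the pr_enabled value shown above is tested for.

```python
def pkg_env(cmgetpkgenv_output):
    """Return {name: value} from NAME="value" lines; other lines are skipped."""
    env = {}
    for line in cmgetpkgenv_output.splitlines():
        name, eq, value = line.strip().partition("=")
        if not eq or not name.isidentifier():
            continue
        # Drop the surrounding double quotes, if present.
        env[name] = value.strip().strip('"')
    return env


sample = '...\nPKG_PR_MODE="pr_enabled"\n'
env = pkg_env(sample)
print(env.get("PKG_PR_MODE") == "pr_enabled")
```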

Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware

failures, the response is not user-configurable, but for package and service failures, you

can choose the system’s response, within limits.

Reboot When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is a system reboot. This

allows packages to move quickly to another node, protecting the integrity of the data.

A reboot is done if a cluster node cannot communicate with the majority of cluster

members for a predetermined time, or under other circumstances such as a kernel hang

or failure of the cluster daemon (cmcld). When this happens, you may see the following

message on the console:

DEADMAN: Time expired, initiating system restart.

This case is covered in more detail under “What Happens when a Node Times Out”.

See also “Cluster Daemon: cmcld” (page 34).

86 Understanding Serviceguard Software Components