Designing High Availability Solutions with HP Serviceguard and HP Integrity Virtual Machines

at cluster configuration time to extend the quiescence period of the cluster reformation based on whether a VM node
is present in the cluster and the I/O timeout settings on the VM host. It is important to note that:
• The io_timeout_extension parameter is set internally by Serviceguard and is not user-configurable;
however, its value can be viewed with the Serviceguard cmviewconf or cmviewcl -v -f line commands, or
found in the system log file.⁵
• It is highly recommended to install the VM guest management software, especially on VM guests functioning as
Serviceguard nodes, so that Serviceguard can determine an optimal io_timeout_extension value
(otherwise, Serviceguard assumes the most conservative value of 70 seconds, unnecessarily lengthening
the cluster recovery time).
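The fallback behavior described above can be sketched in a few lines. This is an illustrative simulation, not Serviceguard's actual algorithm: the function name and the idea of taking the largest reported guest I/O timeout are assumptions for illustration; only the 70-second conservative default comes from the text.

```python
# Illustrative sketch only -- not Serviceguard's real algorithm.
# Per the text: io_timeout_extension is set internally, and when the VM
# guest management software is absent, Serviceguard falls back to the
# most conservative value of 70 seconds.

CONSERVATIVE_EXTENSION = 70  # seconds (worst-case assumption)

def pick_io_timeout_extension(vm_node_in_cluster, reported_guest_io_timeouts):
    """Return the quiescence-period extension (seconds) for cluster reformation.

    reported_guest_io_timeouts: I/O timeout values (seconds) reported by the
    guest management software for each VM node, if that software is installed.
    """
    if not vm_node_in_cluster:
        return 0  # no VM nodes in the cluster -> no extension needed
    if not reported_guest_io_timeouts:
        # Guest management software not installed: assume the worst case,
        # which unnecessarily lengthens cluster recovery time.
        return CONSERVATIVE_EXTENSION
    # Otherwise use the largest I/O timeout actually in effect (assumption).
    return max(reported_guest_io_timeouts)

print(pick_io_timeout_extension(True, []))        # falls back to 70
print(pick_io_timeout_extension(True, [30, 45]))  # 45
```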
In a failure scenario where the pending I/Os from a VM guest are not cleared within its extended quiescence time
period, the Integrity VM software will perform a transfer of control (TOC, or CPU reset) on the VM host servicing
the guest to ensure data integrity by terminating any outstanding I/O requests from the affected VM guest. There
are no specific recommendations to avoid this as it is not expected to happen often. If it does occur, it means that
the host is heavily loaded and action should be taken to reduce that load.
When performing a Serviceguard cluster consolidation, as with any workload consolidation using Integrity VM,
careful planning of the VM configuration is required to ensure proper performance of the VM guests by having a
sufficient number of processors and available memory, in addition to storage and network I/O connections, to handle
their workloads. Any initial performance problems with a VM guest can be compounded when application workloads
are failed over to it by Serviceguard in response to a failure in one of the other cluster members.
"Cluster in a box" configurations should not be considered for running mission- or business-critical applications, as the
physical VM host system is a SPOF. If the physical system fails, the entire cluster will also fail.
Integrity VM instances are not highly available in VMs as nodes configurations. A failure of a VM guest is similar to a
node failure in a Serviceguard cluster. It is the use of Serviceguard within the VM guest that provides high availability
for the applications running in the VM.
VMs as Serviceguard nodes configurations do have a shortcoming in that the adoptive failover VMs must be
executing and consuming some degree of VM host resources, which could potentially be used by other VMs that are
not part of the Serviceguard cluster. The use of the dynamic memory allocation feature should be considered to better
manage adoptive VM node memory usage during application failovers.
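As a back-of-the-envelope illustration of why dynamic memory allocation helps here, consider an adoptive VM node that idles at a small footprint and grows only when packages fail over to it. The function and the numbers below are hypothetical, not taken from Integrity VM.

```python
# Hypothetical sizing illustration: memory an adoptive VM node holds on
# the VM host while idle vs. after packages fail over to it, with and
# without dynamic memory allocation.

def host_memory_held(idle_mb, failover_mb, dynamic_memory, packages_active):
    """Memory (MB) the adoptive VM node occupies on the VM host."""
    if dynamic_memory:
        # Grow to the failover size only when packages actually run here.
        return failover_mb if packages_active else idle_mb
    # Static allocation: the failover-sized footprint is always reserved.
    return failover_mb

# Example: 2 GB while idle, 8 GB needed after a failover.
print(host_memory_held(2048, 8192, dynamic_memory=True,  packages_active=False))  # 2048
print(host_memory_held(2048, 8192, dynamic_memory=False, packages_active=False))  # 8192
```

With static allocation the adoptive node pins its full failover footprint even when idle; dynamic memory allocation frees that difference for VMs outside the cluster.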
Additional considerations for VM as Serviceguard node configurations
Serviceguard clusters rely on a cluster daemon process called cmcld that determines cluster membership by sending
heartbeat messages to other cmcld daemons on other nodes within the cluster. The cmcld daemon runs at a real-time
priority and is locked in memory. Along with handling the management of Serviceguard packages, cmcld also
updates a safety timer within the kernel to detect kernel hangs, checks the health of networks on the system and
performs local LAN failovers. Status information from cmcld is written to the node's system log file.
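The safety-timer mechanism described above can be mimicked in a few lines: a daemon periodically refreshes a timer, and a failure to refresh it within the allowed window is treated as a hang. This is the generic watchdog pattern, not cmcld's actual implementation; all names and the timeout value are illustrative.

```python
import time

# Generic safety-timer (watchdog) pattern, illustrating -- not
# reproducing -- what the text describes: cmcld updates a safety timer
# in the kernel, and a missed update is taken to mean the kernel or
# daemon has hung.

class SafetyTimer:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_update = time.monotonic()

    def refresh(self):
        # The healthy daemon calls this on every scheduling pass.
        self.last_update = time.monotonic()

    def expired(self):
        # The watchdog-side check: has the daemon gone silent too long?
        return time.monotonic() - self.last_update > self.timeout_s

timer = SafetyTimer(timeout_s=0.05)
timer.refresh()
print(timer.expired())   # False right after a refresh
time.sleep(0.1)
print(timer.expired())   # True once the refresh window is missed
```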
In VMs as Serviceguard node configurations, there are some situations where VM guests defined with multiple vCPUs,
or a single vCPU with insufficient entitlement, can potentially experience cmcld runtime delays under heavy
processing load conditions. If the runtime delay is longer than the configured cluster MEMBER_TIMEOUT⁶ value
(i.e., the time after which a node may decide that another cluster node has become unavailable), cmcld will evict the
node from the cluster just as if a node had failed.
Other factors that may contribute to this situation include vCPU processing entitlement percentages and the number of
vCPUs assigned per VM as they relate to HP-UX kernel time slice processing.
These cmcld runtime delays can be identified by the following warning reported in the system log file:
[date/time VM name] cmcld [PID]: Warning: cmcld process was unable to run for the last <x.yz> seconds
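A small script can scan the system log for this warning and extract the reported stall duration for comparison against MEMBER_TIMEOUT. The regular expression below is keyed to the message format quoted above; the sample line is hypothetical.

```python
import re

# Matches the cmcld warning quoted above and extracts the stall duration.
CMCLD_STALL = re.compile(
    r"cmcld\s*\[?\d*\]?: Warning: cmcld process was unable to run "
    r"for the last (?P<seconds>\d+\.\d+) seconds"
)

def stall_seconds(syslog_line):
    """Return the reported stall in seconds, or None if the line doesn't match."""
    m = CMCLD_STALL.search(syslog_line)
    return float(m.group("seconds")) if m else None

# Hypothetical sample line in the format shown in the text.
sample = ("Jan 10 12:00:01 vmnode1 cmcld [4242]: Warning: cmcld process "
          "was unable to run for the last 2.75 seconds")
print(stall_seconds(sample))  # 2.75
```

A stall value approaching the configured MEMBER_TIMEOUT is the signal to revisit the VM's vCPU count and entitlement, as discussed above.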
⁵ System log file names are /var/adm/syslog/syslog.log on HP-UX systems and /var/log/messages on Linux systems.
⁶ MEMBER_TIMEOUT is used for determining runtime delays in Serviceguard A.11.19 and later. Serviceguard A.11.18 and earlier use a combination of NODE_TIMEOUT and HEARTBEAT_INTERVAL for determining runtime delays.