Designing High Availability Solutions with HP Serviceguard and HP Integrity Virtual Machines

at cluster configuration time to extend the quiescence period of the cluster reformation based on whether a VM node
is present in the cluster and the I/O timeout settings on the VM host. It is important to note that:
• The io_timeout_extension parameter is set internally by Serviceguard and is not user-configurable;
however, its value can be viewed with the Serviceguard cmviewconf or cmviewcl -v -f line commands, or
found in the system log file.⁵
• It is highly recommended to install the VM guest management software, especially on VM guests functioning as
Serviceguard nodes, so that Serviceguard can determine an optimal io_timeout_extension value
(otherwise, Serviceguard assumes the most conservative value of 70 seconds, unnecessarily lengthening
the cluster recovery time).
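The fallback behavior described above can be sketched in a few lines. This is an illustrative simulation, not Serviceguard's actual algorithm: the function name and the idea of taking the largest reported guest I/O timeout are assumptions for illustration; only the 70-second conservative default comes from the text.

```python
# Illustrative sketch only -- not Serviceguard's real algorithm.
# Per the text: io_timeout_extension is set internally, and when the VM
# guest management software is absent, Serviceguard falls back to the
# most conservative value of 70 seconds.

CONSERVATIVE_EXTENSION = 70  # seconds (worst-case assumption)

def pick_io_timeout_extension(vm_node_in_cluster, reported_guest_io_timeouts):
    """Return the quiescence-period extension (seconds) for cluster reformation.

    reported_guest_io_timeouts: I/O timeout values (seconds) reported by the
    guest management software for each VM node, if that software is installed.
    """
    if not vm_node_in_cluster:
        return 0  # no VM nodes in the cluster -> no extension needed
    if not reported_guest_io_timeouts:
        # Guest management software not installed: assume the worst case,
        # which unnecessarily lengthens cluster recovery time.
        return CONSERVATIVE_EXTENSION
    # Otherwise use the largest I/O timeout actually in effect (assumption).
    return max(reported_guest_io_timeouts)

print(pick_io_timeout_extension(True, []))        # falls back to 70
print(pick_io_timeout_extension(True, [30, 45]))  # 45
```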
In a failure scenario where the pending I/Os from a VM guest are not cleared within its extended quiescence time
period, the Integrity VM software will perform a transfer of control (TOC, or CPU reset) on the VM host servicing
the guest to ensure data integrity by terminating any outstanding I/O requests from the affected VM guest. There
are no specific recommendations to avoid this as it is not expected to happen often. If it does occur, it means that
the host is heavily loaded and action should be taken to reduce that load.
When performing a Serviceguard cluster consolidation, as with any workload consolidation using Integrity VM,
careful planning of the VM configuration is required to ensure proper performance of the VM guests by having a
sufficient number of processors and available memory, in addition to storage and network I/O connections, to handle
their workloads. Any initial performance problems with a VM guest can be compounded when application workloads
are failed over to it by Serviceguard in response to a failure in one of the other cluster members.
"Cluster in a box" configurations should not be considered for running mission- or business-critical applications, as the
physical VM host system is a SPOF. If the physical system fails, the entire cluster will also fail.
Integrity VM instances are not highly available in VMs as nodes configurations. A failure of a VM guest is similar to a
node failure in a Serviceguard cluster. It is the use of Serviceguard within the VM guest that provides high availability
for the applications running in the VM.
VMs as Serviceguard nodes configurations do have a shortcoming in that the adoptive failover VMs must be
executing and consuming some degree of VM host resources, which could potentially be used by other VMs that are
not part of the Serviceguard cluster. The use of the dynamic memory allocation feature should be considered to better
manage adoptive VM node memory usage during application failovers.
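As a back-of-the-envelope illustration of why dynamic memory allocation helps here, consider an adoptive VM node that idles at a small footprint and grows only when packages fail over to it. The function and the numbers below are hypothetical, not taken from Integrity VM.

```python
# Hypothetical sizing illustration: memory an adoptive VM node holds on
# the VM host while idle vs. after packages fail over to it, with and
# without dynamic memory allocation.

def host_memory_held(idle_mb, failover_mb, dynamic_memory, packages_active):
    """Memory (MB) the adoptive VM node occupies on the VM host."""
    if dynamic_memory:
        # Grow to the failover size only when packages actually run here.
        return failover_mb if packages_active else idle_mb
    # Static allocation: the failover-sized footprint is always reserved.
    return failover_mb

# Example: 2 GB while idle, 8 GB needed after a failover.
print(host_memory_held(2048, 8192, dynamic_memory=True,  packages_active=False))  # 2048
print(host_memory_held(2048, 8192, dynamic_memory=False, packages_active=False))  # 8192
```

With static allocation the adoptive node pins its full failover footprint even when idle; dynamic memory allocation frees that difference for VMs outside the cluster.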
Additional considerations for VM as Serviceguard node configurations
Serviceguard clusters rely on a cluster daemon process called cmcld that determines cluster membership by sending
heartbeat messages to other cmcld daemons on other nodes within the cluster. The cmcld daemon runs at a real-time
priority and is locked in memory. Along with handling the management of Serviceguard packages, cmcld also
updates a safety timer within the kernel to detect kernel hangs, checks the health of networks on the system and
performs local LAN failovers. Status information from cmcld is written to the node's system log file.
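The safety-timer mechanism described above can be mimicked in a few lines: a daemon periodically refreshes a timer, and a failure to refresh it within the allowed window is treated as a hang. This is the generic watchdog pattern, not cmcld's actual implementation; all names and the timeout value are illustrative.

```python
import time

# Generic safety-timer (watchdog) pattern, illustrating -- not
# reproducing -- what the text describes: cmcld updates a safety timer
# in the kernel, and a missed update is taken to mean the kernel or
# daemon has hung.

class SafetyTimer:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_update = time.monotonic()

    def refresh(self):
        # The healthy daemon calls this on every scheduling pass.
        self.last_update = time.monotonic()

    def expired(self):
        # The watchdog-side check: has the daemon gone silent too long?
        return time.monotonic() - self.last_update > self.timeout_s

timer = SafetyTimer(timeout_s=0.05)
timer.refresh()
print(timer.expired())   # False right after a refresh
time.sleep(0.1)
print(timer.expired())   # True once the refresh window is missed
```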
In VMs as Serviceguard node configurations, there are some situations where VM guests defined with multiple vCPUs,
or a single vCPU with insufficient entitlement, can potentially experience cmcld runtime delays under heavy
processing load conditions. If the runtime delay is longer than the configured cluster MEMBER_TIMEOUT⁶ value
(i.e., the time after which a node may decide that another cluster node has become unavailable), cmcld will evict the
node from the cluster just as if a node had failed.
Other factors that may contribute to this situation include vCPU processing entitlement percentages and the number of
vCPUs assigned per VM as they relate to HP-UX kernel time slice processing.
These cmcld runtime delays can be identified by the following warning reported in the system log file:
[date/time VM name] cmcld [PID]: Warning: cmcld process was unable to run for the last <x.yz> seconds
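A small script can scan the system log for this warning and extract the reported stall duration for comparison against MEMBER_TIMEOUT. The regular expression below is keyed to the message format quoted above; the sample line is hypothetical.

```python
import re

# Matches the cmcld warning quoted above and extracts the stall duration.
CMCLD_STALL = re.compile(
    r"cmcld\s*\[?\d*\]?: Warning: cmcld process was unable to run "
    r"for the last (?P<seconds>\d+\.\d+) seconds"
)

def stall_seconds(syslog_line):
    """Return the reported stall in seconds, or None if the line doesn't match."""
    m = CMCLD_STALL.search(syslog_line)
    return float(m.group("seconds")) if m else None

# Hypothetical sample line in the format shown in the text.
sample = ("Jan 10 12:00:01 vmnode1 cmcld [4242]: Warning: cmcld process "
          "was unable to run for the last 2.75 seconds")
print(stall_seconds(sample))  # 2.75
```

A stall value approaching the configured MEMBER_TIMEOUT is the signal to revisit the VM's vCPU count and entitlement, as discussed above.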
⁵ System log file names are /var/adm/syslog/syslog.log on HP-UX systems and /var/log/messages on Linux systems.
⁶ MEMBER_TIMEOUT is used for determining runtime delays in Serviceguard A.11.19 and later. Serviceguard A.11.18 and earlier use a combination of NODE_TIMEOUT and HEARTBEAT_INTERVAL for determining runtime delays.