HP BladeSystem c-Class Onboard Administrator Failover

Purpose of this White Paper
HP c-Class BladeSystem Onboard Administrators (OA) are frequently configured in pairs to provide fault tolerance for
enclosure management. This white paper describes the interaction of the Active and Standby OAs when
configured in fault-tolerant pairs, and how they behave during failover events.
OA Failover Process Description
In the redundant OA configuration, the Standby OA is initialized to a “hot standby” state, ready to take over from the
Active OA when the situation warrants. The Active and Standby OAs exchange keep-alives over two separate
communication links (Ethernet and Serial) in order to maintain redundancy. The Active OA also sends enclosure
system configuration changes to the Standby OA so that both OAs contain the same configuration information.
OA failover events can occur in three separate scenarios:
(1) An Active OA hardware failure. When the Standby OA cannot communicate with the Active OA over either
of the communication links, the Standby OA initiates a takeover event. Actual OA hardware failures are
extremely rare.
(2) Customer initiated “forced” failover for administrative purposes.
(3) Link Loss Failover (LLF). When the customer has enabled the LLF feature, an automatic OA failover occurs
if the Active OA module loses its network link for the duration of the configured failover interval while the
Standby OA reports a good link over the same time span.
Note: Link Loss Failover settings can be configured even if the enclosure has no management redundancy.
However, the settings will not take effect unless a redundant Onboard Administrator is present.
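The three scenarios above can be summarized as a small decision function. The following is only an illustrative sketch in Python, not the OA firmware logic: the RedundancyState class, its field names, and the returned labels are all hypothetical, and the sketch assumes a redundant OA is present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedundancyState:
    """Hypothetical snapshot of the conditions described in the three scenarios."""
    ethernet_keepalive_ok: bool    # keep-alive received over the Ethernet link
    serial_keepalive_ok: bool      # keep-alive received over the serial link
    forced_by_admin: bool          # customer requested a forced failover
    llf_enabled: bool              # Link Loss Failover feature is enabled
    active_link_down_secs: float   # how long the Active OA network link has been down
    standby_link_ok: bool          # Standby OA reported a good link over that span
    failover_interval_secs: float  # configured LLF failover interval


def failover_reason(state: RedundancyState) -> Optional[str]:
    """Return a label for the scenario that would trigger a failover, or None."""
    # Scenario 1: the Active OA is unreachable over both keep-alive links.
    if not state.ethernet_keepalive_ok and not state.serial_keepalive_ok:
        return "active_oa_failure"
    # Scenario 2: customer-initiated forced failover.
    if state.forced_by_admin:
        return "forced"
    # Scenario 3: LLF enabled, Active link down for the whole interval,
    # and the Standby link good during the same time span.
    if (state.llf_enabled
            and state.standby_link_ok
            and state.active_link_down_secs >= state.failover_interval_secs):
        return "link_loss"
    return None
```

Note that losing only one of the two keep-alive links does not count as scenario 1 in this sketch, which mirrors the requirement that the Standby OA be unable to reach the Active OA over either link.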
In all cases, the actual failover processing is exactly the same. The Standby OA initializes itself as the new Active OA
and resets the previously Active OA to make sure it is not in an indeterminate state (this phase takes approximately
15 seconds). It then proceeds to check the status and configuration of all the devices in the enclosure. During this
process, any interconnect module which was in a powered off state prior to failover will be powered on if sufficient
enclosure power is available. The duration of this phase depends on the configuration complexity; lab
measurements using large enclosure configurations show that it completes within 7 minutes. Users will be able to log into
the GUI/CLI within a minute of the failover initiation while the background enclosure device inventory is conducted.
The original Active OA will initialize itself as the new Standby OA if the failover was not caused by an OA hardware
failure. When the new Active and Standby OAs reestablish redundancy, the Active OA transfers the enclosure
configuration data to the Standby OA in order to make sure any incremental changes are also stored in the new
Standby OA.
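The timing figures in this section can be laid out as a rough timeline. The sketch below is a hypothetical Python helper, not part of any OA tooling: the function name and state labels are invented for illustration, the thresholds are the approximate figures quoted above (about 15 seconds, within a minute, within 7 minutes), and it assumes all three are measured from failover initiation.

```python
RESET_PHASE_SECS = 15        # approximate takeover and reset of the previous Active OA
LOGIN_READY_SECS = 60        # GUI/CLI login expected within a minute of initiation
INVENTORY_MAX_SECS = 7 * 60  # lab upper bound observed for large configurations


def expected_state(elapsed_secs: float) -> str:
    """Map seconds since failover initiation to the phase the text describes."""
    if elapsed_secs < 0:
        raise ValueError("elapsed_secs must be non-negative")
    if elapsed_secs < RESET_PHASE_SECS:
        return "takeover_and_reset"
    if elapsed_secs < LOGIN_READY_SECS:
        # Inventory has started; login may or may not be available yet.
        return "inventory_login_may_be_pending"
    if elapsed_secs < INVENTORY_MAX_SECS:
        # Users can log in while the background inventory continues.
        return "inventory_login_available"
    return "expected_synchronized"
```

Because the actual durations depend on configuration complexity, these boundaries are best read as expectations for planning rather than guarantees.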
OA Failover Testing General Discussion
In normal customer operational situations, OA failures or forced failovers are not commonplace events. When a
failover occurs it is usually because of a specific operational or administrative issue, although it could occasionally
result from an actual hardware failure. In these situations, it would be extremely rare for multiple OA failovers to
occur within a short time. However, it is possible to use the OA CLI to trigger OA failovers in rapid
succession. While this process does not make sense from an operational perspective, some customers may want to
include repetitive OA failover testing as part of their qualification processes and have used the CLI this way to
script back-to-back failovers. However, initiating multiple failover events within a short time span may cause
intermittent undesirable results. The remainder of this white paper addresses these situations and provides
best practices for testing OA failover and recovery in the data center.
As described in the previous section, during an OA failover, the Standby OA will complete the basic failover
operations very quickly. Within a minute, users can log back into the OA via GUI or CLI. Although the OA appears
to be fully operational, there are other processes necessary for the resynchronization of enclosure device data and
status, which may run for several minutes after the GUI and CLI appear operational. Thus, for repetitive OA failover
testing, it is recommended to wait at least 7 minutes from the time a failover is initiated before attempting another OA
failover, in order to ensure that the entire enclosure is fully synchronized.
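For scripted repetitive testing, the 7-minute recommendation can be enforced by the test harness itself. The following Python sketch shows one possible way to pace the cycles. The run_failover_cycles function and its trigger parameter are hypothetical: trigger stands for whatever site-specific mechanism issues the forced failover (for example, a CLI session to the Active OA), which is deliberately not shown here.

```python
import time
from typing import Callable, List

MIN_SETTLE_SECS = 7 * 60  # recommended minimum wait, measured from failover initiation


def run_failover_cycles(trigger: Callable[[], None],
                        cycles: int,
                        settle_secs: float = MIN_SETTLE_SECS,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Initiate `cycles` failovers, keeping initiations at least settle_secs apart.

    Returns the clock reading at which each failover was initiated.
    """
    started: List[float] = []
    for _ in range(cycles):
        t0 = clock()
        trigger()  # hypothetical hook that initiates one forced failover
        started.append(t0)
        # Wait out the rest of the settle window, counted from initiation,
        # so the enclosure can finish resynchronizing (also after the last cycle).
        remaining = settle_secs - (clock() - t0)
        if remaining > 0:
            sleep(remaining)
    return started
```

The wait is measured from the moment the failover is initiated rather than from the moment the GUI or CLI becomes reachable again, since the OA can appear operational well before the background resynchronization has finished. The clock and sleep parameters exist only so the pacing can be exercised without real delays.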