HP BladeSystem c-Class Onboard Administrator Failover

Purpose of this White Paper

HP c-Class BladeSystem Onboard Administrators (OAs) are frequently configured in pairs to provide fault tolerance for enclosure management. This white paper describes how the Active and Standby OAs interact when configured as a fault-tolerant pair, and how they behave during failover events.

OA Failover Process Description

In the redundant OA configuration, the Standby OA is initialized to a “hot standby” state, ready to take over as the Active OA when the situation warrants. The Active and Standby OAs exchange keep-alives over two separate communication links (Ethernet and serial) in order to maintain redundancy. The Active OA also sends enclosure system configuration changes to the Standby OA so that both OAs hold the same configuration information.

OA failover events can occur in three separate scenarios:

(1) An Active OA hardware failure. When the Standby OA cannot communicate with the Active OA over either of the two communication links, the Standby OA initiates a takeover. Actual OA hardware failures are extremely rare.

(2) A customer-initiated “forced” failover for administrative purposes.

(3) Link Loss Failover (LLF). When the customer has enabled the LLF feature, an automatic OA failover occurs if the Active OA module loses its network link for the duration specified by the failover interval while the Standby OA reports a good link over the same time span.

Note: Link Loss Failover settings can be configured even if the enclosure has no management redundancy. However, the settings will not take effect unless a redundant Onboard Administrator is present.
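
The three scenarios above can be summarized as a small decision function. The following Python sketch is illustrative only and is not OA firmware logic: the class name, field names, and reason strings are assumptions introduced here, and the sketch assumes that a hardware-failure takeover requires the loss of keep-alives on both links, as scenario (1) describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandbyView:
    """State as seen by the Standby OA. All field names are hypothetical."""
    ethernet_keepalive_ok: bool          # keep-alive received over Ethernet
    serial_keepalive_ok: bool            # keep-alive received over the serial link
    forced_by_admin: bool = False        # customer requested a forced failover
    llf_enabled: bool = False            # Link Loss Failover feature enabled
    active_link_down_for_interval: bool = False   # Active OA link down for the full failover interval
    standby_link_good_for_interval: bool = False  # Standby OA link good over the same span


def failover_reason(view: StandbyView) -> Optional[str]:
    """Return which scenario (if any) would trigger a failover."""
    # Scenario 1: no communication with the Active OA on either link.
    if not view.ethernet_keepalive_ok and not view.serial_keepalive_ok:
        return "active-oa-failure"
    # Scenario 2: administrative forced failover.
    if view.forced_by_admin:
        return "forced"
    # Scenario 3: LLF enabled, Active link lost, Standby link good.
    if (view.llf_enabled
            and view.active_link_down_for_interval
            and view.standby_link_good_for_interval):
        return "link-loss"
    return None
```

In this sketch, losing only one of the two keep-alive links returns None, which reflects the purpose of maintaining two independent links between the OAs.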

In all cases, the actual failover processing is exactly the same. The Standby OA initializes itself as the new Active OA and resets the previously Active OA to make sure it is not in an indeterminate state (this phase takes approximately 15 seconds). It then proceeds to check the status and configuration of all the devices in the enclosure. During this process, any interconnect module that was powered off prior to the failover will be powered on if sufficient enclosure power is available. The duration of this phase depends on the complexity of the configuration; lab measurements using large enclosure configurations show that it completes within 7 minutes. Users can log in to the GUI or CLI within a minute of failover initiation, while the background enclosure device inventory is still being conducted.

The original Active OA will initialize itself as the new Standby OA if the failover was not caused by an OA hardware failure. When the new Active and Standby OAs reestablish redundancy, the Active OA transfers the enclosure configuration data to the Standby OA in order to make sure any incremental changes are also stored in the new Standby OA.
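
For planning purposes, the timings quoted in this section can be laid out as a rough timeline. The Python sketch below is only a reading aid: the constant names, function name, and phase labels are assumptions introduced here, the 15-second and 1-minute figures are approximate, and the 7-minute figure is an upper bound from lab measurements on large configurations rather than a guarantee.

```python
# Approximate figures taken from the process description above.
RESET_PHASE_S = 15          # takeover and reset of the previous Active OA
LOGIN_READY_S = 60          # GUI/CLI login available within about a minute
INVENTORY_DONE_S = 7 * 60   # device inventory completes within 7 minutes (lab, large configs)


def expected_phase(elapsed_s: float) -> str:
    """Map seconds since failover initiation to the phase one would expect."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < RESET_PHASE_S:
        return "takeover-and-reset"
    if elapsed_s < LOGIN_READY_S:
        return "login-becoming-available"
    if elapsed_s < INVENTORY_DONE_S:
        return "inventory-in-progress"
    return "expected-synchronized"
```

Note that a successful login falls inside the "inventory-in-progress" window, so being able to log in does not by itself indicate that the enclosure inventory has finished.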

OA Failover Testing – General Discussion

In normal customer operations, OA failures and forced failovers are not commonplace events. When a failover occurs, it is usually because of a specific operational or administrative issue, although it could occasionally result from an actual hardware failure. In these situations, it would be extremely rare for multiple OA failovers to occur within a short time. However, it is possible to use the OA CLI to trigger OA failovers in rapid succession. While this does not make sense from an operational perspective, some customers include repetitive OA failover testing in their qualification processes and have used the CLI to script back-to-back failovers. Initiating multiple failover events within a short time span may cause intermittent undesirable results. The remainder of this white paper addresses these situations and provides best practices for testing OA failover and recovery in the data center.

As described in the previous section, during an OA failover the Standby OA completes the basic failover operations very quickly. Within a minute, users can log back in to the OA via the GUI or CLI. Although the OA then appears to be fully operational, other processes needed to resynchronize enclosure device data and status may run for several minutes after the GUI and CLI become available. Thus, for repetitive OA failover testing, it is recommended to wait at least 7 minutes from the time a failover is initiated before attempting another OA failover, in order to ensure that the entire enclosure is fully synchronized.
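
A scripted test can enforce this spacing directly. The following Python sketch shows one possible way to do so; it does not issue any OA command itself. The function and parameter names are hypothetical, and trigger_failover is a caller-supplied callable standing in for whatever mechanism the test environment uses to initiate a failover (for example, a wrapper around an OA CLI session).

```python
import time
from typing import Callable, List

MIN_SETTLE_SECONDS = 7 * 60  # minimum wait recommended above, measured from initiation


def run_failover_cycles(
    trigger_failover: Callable[[], None],
    cycles: int,
    settle_seconds: float = MIN_SETTLE_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Trigger `cycles` failovers, spacing them at least `settle_seconds` apart.

    Returns the clock reading at which each failover was initiated.
    """
    if settle_seconds < MIN_SETTLE_SECONDS:
        raise ValueError("settle_seconds must be at least 7 minutes")
    started: List[float] = []
    for _ in range(cycles):
        start = clock()
        trigger_failover()  # hypothetical hook supplied by the caller
        started.append(start)
        # Wait out the remainder of the settle window, counted from initiation.
        remaining = settle_seconds - (clock() - start)
        if remaining > 0:
            sleep(remaining)
    return started
```

The wait is also applied after the final cycle, so that the enclosure has had time to resynchronize before any follow-on test step runs. The sleep and clock parameters exist only so the loop can be exercised without real delays.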