Users Guide

1. Before running the "promote" task on the backup lead chassis:
a. The "promote" task is a disruptive operation and must be used only when there is no other means to recover the
inaccessible lead chassis. In a partial failure of the lead chassis, for example, if only the management modules are
nonresponsive but the computes are working, running the promote task disrupts workloads that are still running on the
lead chassis computes. For information about relocating working components, that is, computes and network switches,
from the failed lead, see list item 3.c, "Steps that are required to restore the failed lead before putting it into production."
b. After determining that the lead chassis has failed and is inaccessible, you must remotely shut down power to the lead
chassis or physically remove the chassis from the stack before running the "promote" task on the backup. If the lead
chassis is not turned off or removed from the stack before the promote task, the failed or partially failed lead chassis
may revive after the backup is promoted and create a situation with multiple leads. Multiple leads can cause confusion
and interference in managing the group.
2. Running the "promote" task on the backup lead chassis:
a. If the lead chassis is up and running, the backup chassis web interface blocks the "promote" task. Ensure that the lead
has failed and is inaccessible before initiating the promote task on the backup. The backup may erroneously block the
"promote" task when the lead is still accessible through the private network but is not reachable on the public user
management network. In such cases, the OME-Modular RESTful API can be used to run the promote task forcefully. For
more information, see the RESTful API guide.
b. A job is created after the "promote" operation is started. The job may take 10-45 minutes to complete, depending on
the number of chassis in the group and the amount of configuration that has to be restored.
c. If the lead chassis is configured to forward alerts to external destinations (email, trap, system log), any alerts that
components in the group generate while the lead is down are available only locally in their respective hardware or alert
logs. During the lead outage, the alerts cannot be forwarded to the configured external destinations. The outage is the
period between lead failure and successful promotion of the backup.
3. Expected behavior after the "promote" task:
a. The backup chassis becomes the lead and all the member chassis are accessible as they were on the earlier lead chassis.
After the "promote" task, the old lead chassis is still referenced as a member of the same group. The references are
created to prevent any disruption to the working computes in the old lead when only the lead chassis management
modules (MM) have failed.
The "promote" task rediscovers all the members in the group. If any member chassis is inaccessible, the chassis is still
listed on the lead home page with a broken connection and the available repair options. You can use the repair option to
add the member chassis again or remove the chassis from the group.
b. All firmware baselines or catalogs, alert policies, templates or identity-pools, and fabrics settings are restored as they
were on the failed lead chassis. However, the following exceptions and limitations apply:
i. Configuration changes made on the failed lead within the 90-minute window that is needed for copying to the
backup may not be copied completely to the backup and, in that case, are not restored completely after the
"promote" task.
ii. The in-progress and partially copied jobs that are associated with templates/identity-pools continue to run. You can
perform one of the following tasks:
i. Stop the running job.
ii. Reclaim any identity-pool assignments.
iii. Restart the job to redeploy the template.
iii. Any template that was attached to an occupied slot through the lead before the backup takes over as the new lead is
not deployed on the existing sled when the sled is removed and reinserted. For the deployment to work, the
administrator must detach the template from the slot, reattach the template to the slot, and then remove and reinsert
the existing sled. Alternatively, insert a new sled.
iv. Any firmware catalogs that were created to update automatically on a schedule are restored as manual-update
catalogs. Edit the catalog and specify the automatic update method and the update frequency.
v. Alert policies with stale or no references to devices on the old lead are not restored on the new lead.
c. Steps that are required to restore the failed lead before putting it into production:
i. Turn off the failed lead chassis remotely before performing the "promote" task on the backup. If the chassis is
not turned off, the partially failed lead may come online and cause a situation of multiple leads. There is limited
support for automatic detection and recovery of this situation. If the earlier lead comes online and automatic recovery
is possible, the earlier lead is forced to join the group as a member.
ii. On the new lead, remove the earlier lead chassis from the group to remove references to it.
iii. For the old lead, gain physical access to the failed chassis as soon as possible and unstack it from the group. If
any templates with identity-pool assignments were deployed to any computes on the old lead, reclaim the
identity-pool assignments from those computes. Reclaiming the identity-pool assignments is required to
prevent any network identity collision when the old chassis is put back into production.
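
The checks in items 1 and 3.b.i can be sketched as a small pre-flight helper. This is only an illustrative sketch, not part of OME-Modular: the names LeadState, promote_blockers, and changes_at_risk are hypothetical, the state flags are assumed to be filled in manually by the administrator, and the 90-minute value is taken from item 3.b.i.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Copy window described in item 3.b.i (used here as an assumption).
SYNC_WINDOW = timedelta(minutes=90)


@dataclass
class LeadState:
    """Hypothetical snapshot of what the administrator knows about the lead."""
    management_reachable: bool    # lead management modules still respond
    computes_running: bool        # workloads still run on the lead's computes
    powered_off_or_removed: bool  # lead was shut down or unstacked


def promote_blockers(state: LeadState) -> List[str]:
    """Return reasons not to run "promote" yet; an empty list means proceed."""
    blockers = []
    if state.management_reachable:
        blockers.append("lead is still accessible; recover it instead")
    if state.computes_running:
        blockers.append("workloads still run on the lead computes; relocate them first")
    if not state.powered_off_or_removed:
        blockers.append("power off or unstack the lead to avoid multiple leads")
    return blockers


def changes_at_risk(change_times: List[datetime], lead_failed_at: datetime) -> List[datetime]:
    """Return change timestamps inside the copy window before the lead failed."""
    cutoff = lead_failed_at - SYNC_WINDOW
    return [t for t in change_times if cutoff < t <= lead_failed_at]
```

For example, promote_blockers(LeadState(False, False, True)) returns an empty list, which corresponds to a lead that is inaccessible, has no running workloads, and has been powered off. Any timestamps returned by changes_at_risk identify configuration changes to verify manually on the new lead after the promotion.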
Use case scenarios