Managing HP Serviceguard A.11.20.10 for Linux, December 2012

Ideally, if one process fails, the other processes can wait a period of time for that component to

come back online. This is true whether the component is on the same system or a remote system.

The failed component can be restarted automatically on the same system, rejoin the waiting

processes, and continue. This type of failure can be detected, and the component restarted, within a few seconds,

so the end user would never know a failure occurred.
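
The waiting behavior described above can be sketched as a bounded polling loop. This is a minimal illustration, not part of Serviceguard: the function name wait_for_component, its parameters, and the is_available callback are all hypothetical, and a real application would substitute its own availability check.

```python
import time


def wait_for_component(is_available, timeout_s=10.0, interval_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll is_available() until it returns True or timeout_s elapses.

    Returns True if the component came back in time, False otherwise.
    All names here are illustrative assumptions, not a Serviceguard API.
    """
    deadline = clock() + timeout_s
    while True:
        if is_available():
            return True
        if clock() >= deadline:
            # Give up: the caller can now shut down cleanly or escalate.
            return False
        sleep(interval_s)
```

A caller that gets False back knows the component did not return within the allowed period and can fall back to a clean shutdown rather than waiting indefinitely.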

Another alternative is for the failure of one component to still allow the other

components to be brought down cleanly. If a database SQL server fails, it should still be possible to bring the database

down cleanly so that no database recovery is necessary.

The worst case is for a failure of one component to cause the entire system to fail. If one component

fails and all other components need to be restarted, the downtime will be high.

A.5.2 Be Able to Monitor Applications

It should be possible to monitor the health of every component in a system, including applications.

A monitor might be as simple as a display command or as complicated as a SQL query. There

must be a way to ensure that the application is behaving correctly. If the application fails and it

is not detected automatically, it might take hours for a user to determine the cause of the downtime

and recover from it.
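
As a rough illustration of automatic detection, the sketch below wraps an arbitrary probe and reports a failure only after several consecutive bad checks. The names HealthMonitor, tcp_probe, and max_failures are assumptions made for this example; they are not Serviceguard interfaces. The probe can be any callable, from a simple connection test to a SQL query.

```python
import socket


def tcp_probe(host, port, timeout_s=2.0):
    """Example probe: return True if a TCP connection to host:port opens."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


class HealthMonitor:
    """Track consecutive probe failures (illustrative, hypothetical class)."""

    def __init__(self, probe, max_failures=3):
        self.probe = probe
        self.max_failures = max_failures
        self.failures = 0

    def check(self):
        """Run the probe once; return 'healthy', 'degraded', or 'failed'."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        if ok:
            self.failures = 0
            return "healthy"
        self.failures += 1
        if self.failures >= self.max_failures:
            return "failed"
        return "degraded"
```

Requiring several consecutive failures before reporting "failed" is one possible design choice; it trades a slightly slower detection time for fewer false alarms from a single slow response.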

A.6 Minimizing Planned Downtime

Planned downtime (as opposed to unplanned downtime) is scheduled; examples include backups,

systems upgrades to new operating system revisions, or hardware replacements. For planned

downtime, application designers should consider:

• Reducing the time needed for application upgrades/patches.

Can an administrator install a new version of the application without scheduling downtime?

Can different revisions of an application operate within a system? Can different revisions of

a client and server operate within a system?

• Providing for online application reconfiguration.

Can the configuration information used by the application be changed without bringing down

the application?

• Documenting maintenance operations.

Does an operator know how to handle maintenance operations?
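
One way an application might support online reconfiguration is to re-read its configuration file whenever the file changes, without restarting. The sketch below assumes a JSON configuration file and a hypothetical ReloadableConfig class; neither comes from this manual, and a real application may prefer a signal or an administrative command as the trigger.

```python
import json
import os


class ReloadableConfig:
    """Re-read a JSON config file when its modification time changes.

    Hypothetical helper for illustration; the file format is an assumption.
    """

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.values = {}
        self.reload_if_changed()

    def reload_if_changed(self):
        """Return True if new settings were loaded, False otherwise."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                new_values = json.load(f)
        except json.JSONDecodeError:
            # Keep running with the previous settings if the edit is invalid.
            return False
        self.values = new_values
        self._mtime = mtime
        return True
```

The application would call reload_if_changed() periodically from its main loop. Keeping the previous settings when the new file does not parse means a bad edit does not bring the application down.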

When discussing highly available systems, unplanned failures are often the main point of discussion.

However, if it takes two weeks to upgrade a system to a new revision of software, there are bound

to be a large number of complaints.

The following sections discuss ways of handling the different types of planned downtime.

A.6.1 Reducing Time Needed for Application Upgrades and Patches

Once a year or so, a new revision of an application is released. How long does it take for the

end-user to upgrade to this new revision? The answer is the amount of planned downtime a user

must take to upgrade their application. The following guidelines reduce this time.

A.6.1.1 Provide for Rolling Upgrades

Provide for a “rolling upgrade” in a client/server environment. For a system with many components,

the typical scenario is to bring down the entire system, upgrade every node to the new version of

the software, and then restart the application on all the affected nodes. For large systems, this

could result in a long downtime. An alternative is to provide for a rolling upgrade. A rolling upgrade

rolls out the new software in a phased approach by upgrading only one component at a time. For

example, the database server is upgraded on Monday, causing a 15-minute downtime. Then on

Tuesday, the application server on two of the nodes is upgraded, which leaves the application

servers on the remaining nodes online and causes no downtime. On Wednesday, two more

application servers are upgraded, and so on. With this approach, you avoid the problem where

everything changes at once, and you keep each outage short.
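
The phased approach can be sketched as a loop that upgrades a small batch of nodes, verifies them, and only then moves on. This is an illustrative outline under stated assumptions: rolling_upgrade, upgrade, and is_healthy are hypothetical names, and the actual upgrade and verification steps depend on the application.

```python
def rolling_upgrade(nodes, upgrade, is_healthy, batch_size=2):
    """Upgrade nodes batch by batch; stop at the first unhealthy node.

    Returns the nodes that were upgraded and verified. The upgrade and
    is_healthy callables are supplied by the caller (hypothetical hooks).
    """
    if batch_size < 1 or batch_size >= len(nodes):
        raise ValueError("batch_size must leave at least one node online")
    verified = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            upgrade(node)
        for node in batch:
            if not is_healthy(node):
                # Halt the rollout; later nodes stay on the old version.
                return verified
            verified.append(node)
    return verified
```

Stopping at the first unhealthy node limits the damage of a bad release to one batch, while the nodes that have not yet been touched continue to serve users on the old revision.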

266 Designing Highly Available Cluster Applications