Datasheet

Intel

E8870 Scalable Node Controller (SNC) Datasheet 6-9

Reliability, Availability, and Serviceability

An entry times out if the counter wraps around (toggles the high-order bit) twice. As a result, the time-out period can range from 1x to 2x the timer value.
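The 1x to 2x spread can be illustrated with a small model. This is only a sketch under stated assumptions, not the SNC implementation: it assumes a shared free-running counter whose high-order bit toggles once every `timer_value` ticks, and an entry that times out on the second toggle it observes after allocation. The function name and the tick-based model are hypothetical.

```python
def ticks_until_timeout(alloc_tick: int, timer_value: int) -> int:
    """Return how many ticks after allocation an entry times out.

    Assumption (not from the datasheet): the high-order bit of a shared
    free-running counter toggles at every tick that is a multiple of
    timer_value, and an entry times out on the second toggle it sees.
    """
    if timer_value <= 0:
        raise ValueError("timer_value must be positive")
    # First toggle that occurs strictly after the allocation tick.
    first_toggle = (alloc_tick // timer_value + 1) * timer_value
    # The entry times out on the second toggle it observes.
    second_toggle = first_toggle + timer_value
    return second_toggle - alloc_tick
```

Under this model, an entry allocated just before a toggle waits only slightly longer than one timer value, while an entry allocated right at a toggle waits the full two timer values.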

With such a mechanism, multiple entries in a queue can time out simultaneously. When a time-out occurs, the hardware selects one entry as the “first error” for logging in the FERRST. The presence of more than one error is indicated in the SERRST register.
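The first-error logging described above can be sketched as follows. The register fields are modeled loosely and the names `ErrorLog` and `log_timeouts` are hypothetical. The datasheet does not say how the hardware picks the "first" entry among simultaneous time-outs, so this sketch simply assumes the first entry presented is chosen.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ErrorLog:
    # Hypothetical model: ID of the entry logged as the "first error".
    ferrst: Optional[int] = None
    # Hypothetical model: set when more than one error is present.
    serrst: bool = False


def log_timeouts(log: ErrorLog, timed_out_entries: Iterable[int]) -> ErrorLog:
    """Record simultaneous time-outs: one first error, a flag for the rest."""
    for entry_id in timed_out_entries:
        if log.ferrst is None:
            # Assumed selection policy: the first entry presented wins.
            log.ferrst = entry_id
        else:
            # Any further error only marks that additional errors exist.
            log.serrst = True
    return log
```

For example, `log_timeouts(ErrorLog(), [5, 2, 9])` records entry 5 as the first error and sets the flag indicating that more errors were present.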

6.2 RAS: System Components Roles and Responsibilities

The fault isolation process is greatly enhanced by active monitoring and logging of error conditions by the server management and the MCA handler. High RAS systems based on this chipset are not possible without well-designed server management and a well-designed machine check architecture (MCA) handler.

Upon booting/re-booting, the error logs and SP interfaces can be checked by firmware (SAL/PAL/System Management) to identify faulty nodes. Faulty nodes can be isolated/disabled by the firmware, and the system can be rebooted with the revised hardware configuration. Faulty modules can be hot-replaced, and a new boot can be scheduled at a convenient time to integrate the modules into the system.
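The boot-time isolation step can be sketched as a simple filter over per-node error logs. This is an illustrative model only: the function name `plan_boot`, the node names, and the dictionary-based log format are assumptions, and real firmware would read the chipset error registers and SP interfaces rather than Python lists.

```python
from typing import Dict, List


def plan_boot(error_logs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Split nodes into enabled and disabled sets for the next boot.

    error_logs maps a (hypothetical) node name to the errors logged for it;
    a node with any logged error is treated as faulty in this sketch.
    """
    faulty = sorted(node for node, errors in error_logs.items() if errors)
    healthy = sorted(node for node, errors in error_logs.items() if not errors)
    # The revised hardware configuration boots only the healthy nodes.
    return {"enabled": healthy, "disabled": faulty}
```

Calling `plan_boot({"node0": [], "node1": ["time-out"], "node2": []})` would leave `node0` and `node2` enabled and mark `node1` as disabled until it is replaced and reintegrated at a later boot.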

6.2.1 Machine Check Architecture (MCA)

MCA provides “in-band” error handling features for high RAS systems on the Itanium processor family. Some of the highlights of MCA are listed below. For details on MCA, refer to the Itanium™ Processor Family Error Handling Guide.

• Error containment.

• Error correction.

• Error logging (this log may be combined with the SM's error log).

• Error classification. This classification scheme assists firmware and O/S handler development. Note that the error classification scheme used by MCA may be different from the error types and classification used by the chipset.

• Platform error signaling and escalation:

— BERR#: It is recommended that hardware errors that are uncorrectable or fatal be reported

as BERR#. The system hardware may choose to assert BERR# locally to a particular SNC

node or globally to all SNC nodes (using BERRIN#). A processor assertion of BERR# is

observable on SNC BERROUT#.

— BINIT#: It is strongly recommended that the system hardware let software promote an

error from BERR# to BINIT#. A processor assertion of BINIT# is observable on the SNC

BINITOUT#. System hardware may choose to assert BINIT# globally to all SNC nodes

(via BINITIN#).

— PMI: Although platform events including errors can be reported and logged through PMI

(Itanium processor family), this is not recommended because it violates the Developer’s

Interface Guide for IA-64 Servers requirement that errors not be reported via PMI.
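The signaling recommendations in the list above can be summarized in a short sketch. It is a simplified model, not a description of the hardware: the `Severity` classes and the `platform_signal` function are hypothetical, and the choice to raise no platform signal for corrected errors is an assumption made for illustration. PMI is deliberately never returned, in line with the recommendation against reporting errors through it.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Hypothetical error classes for this sketch only.
    CORRECTED = "corrected"
    UNCORRECTABLE = "uncorrectable"
    FATAL = "fatal"


def platform_signal(severity: Severity, promote_to_binit: bool = False) -> Optional[str]:
    """Pick the platform signal used to report a hardware error."""
    if severity is Severity.CORRECTED:
        # Assumption: corrected errors are logged without BERR# or BINIT#.
        return None
    # Uncorrectable or fatal errors are reported as BERR#, and software
    # may promote the error from BERR# to BINIT#.
    return "BINIT#" if promote_to_binit else "BERR#"
```

For instance, `platform_signal(Severity.FATAL)` yields `"BERR#"`, and passing `promote_to_binit=True` models software escalating the same error to `"BINIT#"`.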