Intel® E8870 Scalable Node Controller (SNC) Datasheet
Reliability, Availability, and Serviceability
An entry times out if the counter wraps around (toggles the high-order bit) twice. As a result, the
time-out period can be from 1x to 2x the timer value.
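The 1x to 2x window can be illustrated with a small model. This is only a sketch under stated assumptions: it supposes a single free-running counter that wraps every `timer_value` ticks and an entry that is allocated at an arbitrary counter phase. The function name and parameters are hypothetical and do not correspond to any SNC register.

```python
def ticks_until_timeout(timer_value: int, alloc_phase: int) -> int:
    """Return the ticks between allocation and time-out for one queue entry.

    Assumptions (not from the datasheet): a single free-running counter wraps
    every `timer_value` ticks, and `alloc_phase` is its value when the entry
    is allocated. The entry times out on the second wrap it observes.
    """
    if timer_value <= 0:
        raise ValueError("timer_value must be positive")
    if not 0 <= alloc_phase < timer_value:
        raise ValueError("alloc_phase must be in [0, timer_value)")
    ticks_to_first_wrap = timer_value - alloc_phase  # between 1 and timer_value
    # The second wrap follows exactly one full counter period after the first.
    return ticks_to_first_wrap + timer_value


# Sweep every allocation phase for a 16-tick timer.
delays = [ticks_until_timeout(16, phase) for phase in range(16)]
shortest, longest = min(delays), max(delays)
```

In this model, an entry allocated just before a wrap times out after slightly more than one timer period, while an entry allocated just after a wrap waits the full two periods.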
Using such a mechanism, it is possible for multiple entries in a queue to time out simultaneously.
When a time-out occurs, the hardware selects one entry as the first error for logging in the
FERRST. The presence of more than one error is indicated in the SERRST register.
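The first-error logging behavior can be sketched as follows. The field layout is hypothetical: `ferrst` and `serrst` here are simplified stand-ins for the FERRST and SERRST registers, not their real bit definitions, and the "lowest entry id wins" selection policy is an assumption, since the text does not say how the hardware picks the first error.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ErrorLog:
    """Simplified stand-in for the first-error / subsequent-error registers."""

    ferrst: Optional[int] = None  # hypothetical: id of the entry logged first
    serrst: bool = False          # hypothetical: set once any further error is seen

    def report(self, timed_out_entries: Iterable[int]) -> None:
        """Log a batch of entries that timed out in the same cycle."""
        # Assumed selection policy: the lowest entry id is treated as first.
        for entry in sorted(timed_out_entries):
            if self.ferrst is None:
                self.ferrst = entry   # first error is captured and kept
            else:
                self.serrst = True    # later errors only raise the indicator


log = ErrorLog()
log.report([5, 2, 9])  # three entries time out simultaneously
```

After this call, `log.ferrst` holds entry 2 and `log.serrst` is set, mirroring the idea that only one error is logged in detail while the presence of others is still recorded.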
6.2 RAS: System Components Roles and Responsibilities
The fault isolation process is greatly enhanced by active monitoring and logging of error
conditions by server management and the MCA handler. High RAS systems based on this
chipset are not possible without well-designed server management and a well-designed machine
check architecture (MCA) handler.
Upon booting/re-booting, the error logs and SP interfaces can be checked by firmware
(SAL/PAL/System Management) to identify faulty nodes. Faulty nodes can be isolated/disabled by the
firmware, and the system can be rebooted with the revised hardware configuration. Faulty modules
can be hot-replaced, and a new boot can be scheduled at a convenient time to integrate the modules
into the system.
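The boot-time isolation step described above might look like the following sketch. It is not firmware code: the node names, the log format, and the rule "any logged error marks a node as faulty" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def plan_boot_configuration(
    node_error_logs: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    """Split nodes into (enabled, disabled) lists for the next boot.

    Assumption: a node with any logged error is treated as faulty. Real
    firmware would apply a platform-specific policy instead.
    """
    enabled = sorted(n for n, errors in node_error_logs.items() if not errors)
    disabled = sorted(n for n, errors in node_error_logs.items() if errors)
    return enabled, disabled


# Hypothetical error logs gathered by firmware at boot.
logs = {"node0": [], "node1": ["uncorrectable memory error"], "node2": []}
enabled, disabled = plan_boot_configuration(logs)
```

Here `node1` would be left out of the revised configuration until it is replaced and integrated at a later, scheduled boot.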
6.2.1 Machine Check Architecture (MCA)
MCA provides in-band error handling features for high RAS systems on the Itanium processor
family. Some of the highlights of MCA are listed below. For details on MCA, refer to the Itanium
Processor Family Error Handling Guide.
• Error containment.
• Error correction.
• Error logging (this log may be combined with the SM's error log).
• Error classification. This classification scheme assists firmware and O/S handler development.
  Note that the error classification scheme used by MCA may differ from the error types
  and classification used by the chipset.
• Platform error signaling and escalation:
  - BERR#: It is recommended that hardware errors that are uncorrectable or fatal be reported
    as BERR#. The system hardware may choose to assert BERR# locally to a particular SNC
    node or globally to all SNC nodes (using BERRIN#). A processor assertion of BERR# is
    observable on SNC BERROUT#.
  - BINIT#: It is strongly recommended that the system hardware let software promote an
    error from BERR# to BINIT#. A processor assertion of BINIT# is observable on the SNC
    BINITOUT#. System hardware may choose to assert BINIT# globally to all SNC nodes
    (via BINITIN#).
  - PMI: Although platform events including errors can be reported and logged through PMI
    (Itanium processor family), this is not recommended because it violates the Developer's
    Interface Guide for IA-64 Servers requirement that errors not be reported via PMI.
2x ECC or Hardfail response.