Intel® E8870 Scalable Node Controller (SNC) Datasheet
Reliability, Availability, and Serviceability
6.2.5 Summary
Table 6-2 summarizes the roles played by the different RAS components. These components
must be designed to work closely together to provide good system RAS.
6.3 Availability
This chipset supports systems in which modules and interfaces are duplicated to increase
availability. If a fatal error occurs, many chipset features are designed to make a fast
reboot in a degraded configuration possible. System availability increases because no single
point of failure prevents a fast reboot. Firmware can interrogate the chipset's error status
and log registers, together with the system error logs, to determine whether a reboot in
degraded mode is needed. Many degraded system configurations are possible. Some are listed
below.
• Single SPS. One of the two SPSs has failed. Every component connected to the failed SPS
disables the corresponding SP.
• Failed SNC node. The whole node is bad and cannot be salvaged. Each SPS connected to the
failed node disables the corresponding SP.
• Some of the CPUs on a node have failed. Each failed CPU is “killed” by tri-stating (not hanging).
• All of the CPUs on a node have failed (memory-only node).
• Failed DIMMs. Part of the memory system may still be salvaged.
• Memory subsystem failed (CPU-only node).
• Failed SIOH node. The whole node has failed and cannot be salvaged. Each connecting
component disables the corresponding SP.
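The configurations listed above can be illustrated with a minimal sketch of the decision that firmware might make before a degraded reboot. This is not chipset code: the function names, the link-list representation of the SP topology, and the node labels are all assumptions made for illustration, and a real implementation would derive the failure set from the chipset error status and log registers.

```python
def sps_to_disable(links, failed):
    """Return which SPs each surviving component should disable.

    links  : iterable of (component, sps) name pairs, one per SP link
             (hypothetical topology description, not a chipset structure).
    failed : set of names of components or SPSs that cannot be salvaged.

    Result maps each surviving endpoint to the sorted list of failed
    peers whose SP it should disable.
    """
    plan = {}
    for a, b in links:
        # Only links with exactly one failed endpoint need action.
        if (a in failed) != (b in failed):
            survivor, dead = (b, a) if a in failed else (a, b)
            plan.setdefault(survivor, []).append(dead)
    return {name: sorted(peers) for name, peers in plan.items()}


def classify_node(cpus_ok, dimms_ok, node_ok=True):
    """Label a node from per-CPU and per-DIMM health flags.

    The labels are illustrative. Treating a node with no working CPU
    and no working DIMM as "failed" is an assumption of this sketch.
    """
    any_cpu, any_mem = any(cpus_ok), any(dimms_ok)
    if not node_ok or (not any_cpu and not any_mem):
        return "failed"
    if not any_cpu:
        return "memory-only"
    if not any_mem:
        return "cpu-only"
    if all(cpus_ok) and all(dimms_ok):
        return "healthy"
    return "degraded"


# Hypothetical two-node, two-SPS topology used only as an example.
example_links = [
    ("SNC0", "SPS0"), ("SNC0", "SPS1"),
    ("SNC1", "SPS0"), ("SNC1", "SPS1"),
]
single_sps_plan = sps_to_disable(example_links, {"SPS1"})
failed_node_plan = sps_to_disable(example_links, {"SNC1"})
```

In the single-SPS case each SNC disables its SP to the failed switch. In the failed-node case each SPS disables its SP to the failed node. Both follow the rule stated in the list: the surviving side of a link disables the corresponding SP.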
Table 6-2. RAS Roles of Different System Components

| RAS Tasks | Hardware | MCA | SM | Device Driver | OS/BIOS |
|---|---|---|---|---|---|
| Error Logging | 1 instance | NVRAM | NVRAM | Report to OS. | Report to MCA. |
| Error Containment | Data poisoning, hard fail response, machine check via BERR#/MCERR#. | Rendezvous | N/A | Discard uncorrectable error. | Kill processes/threads/application. |
| Error Recovery or Correction | Correct SBECC; detect DBECC; Link Level Retry. | Rendezvous; retry. | Re-configuration during reset/boot. | Retry; fail-over; hot plug. | Retry; re-configuration; kill processes/threads/application. |
| Error Signaling From | ERR[2:0]#; BERR#; BINIT#; PMI; hard fail response; regular interrupts. | BERR#; BINIT#; RESET# | PMI; BERR#; BINIT#; regular interrupts. | Regular interrupts; NMI; report to OS. | Transfer to MCA or device driver. Issue reset. |
| Error Signaling To | Error detection circuit; BINITIN#; BERRIN#. | BERR#; BINIT#; hard fail response; CPU internal errors. | ERR[2:0]#; chipset registers; system sensors. | Regular interrupts; bad software CRC. | Regular interrupts; PMI; MCA; device driver. |
| Remote Management | SMBus ports on all chipset components. | Detailed error logs. | Out-of-band access for the remote node. | N/A | In-band access for the remote node. |
| Re-configuration | Interface enabling/disabling. Hot plug on SP, PCI and InfiniBand. | Diagnose the problem and then transfer to OS. | During reset/boot. | Fail-over. Online repair and upgrade (PCI hot-plug). | Online repair and upgrade (hot-plug). |