Intel® E8870 Scalable Node Controller (SNC) Datasheet 6-11
Reliability, Availability, and Serviceability
6.2.5 Summary
Table 6-2 summarizes the roles played by the different RAS components. These components
must be designed to work closely with each other to provide good system-level RAS.
6.3 Availability
This chipset supports systems in which modules and interfaces can be duplicated to increase
availability. If a fatal error occurs, many of the chipset's features are designed to allow a
fast reboot in a degraded configuration. Availability is increased because no single point of
failure prevents a fast reboot. Firmware can interrogate the chipset's error status and log
registers, together with the system error logs, to determine whether a reboot in degraded mode
is needed. Many degraded system configurations are possible; some are listed below.
• Single SPS. One of two SPSs failed. Every component that is connected to the failed SPS
  disables the corresponding SP.
• Failed SNC nodes. The whole node is bad and cannot be salvaged. Each SPS connecting to the
  failed node disables the corresponding SP.
• Some of the CPUs on a node failed. The failed CPU is killed by tri-stating (not hanging).
• All of the CPUs on a node have failed (memory-only node).
• Failed DIMM modules. Part of the memory system may still be salvaged.
• Memory subsystem failed (CPU-only node).
• Failed SIOH nodes. The whole node has failed and cannot be salvaged. Each connecting
  component disables the corresponding SP.
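The degraded configurations listed above can be pictured as a lookup from a detected failure class to a reboot action. The following is only a minimal sketch under stated assumptions: the `Failure` enum, the action strings, and `plan_degraded_boot` are hypothetical names invented for illustration, and real firmware would read the chipset error status and log registers and program interface-disable controls rather than return strings.

```python
from enum import Enum, auto


class Failure(Enum):
    """Failure classes mirroring the degraded-configuration list (hypothetical names)."""
    SPS = auto()               # one of the two SPSs failed
    SNC_NODE = auto()          # whole SNC node is bad
    SOME_CPUS = auto()         # a subset of the CPUs on a node failed
    ALL_CPUS = auto()          # every CPU on a node failed
    DIMM = auto()              # some DIMM modules failed
    MEMORY_SUBSYSTEM = auto()  # the node's whole memory subsystem failed
    SIOH_NODE = auto()         # whole SIOH node is bad


# Illustrative action text only; these strings are assumptions, not register names.
DEGRADED_ACTION = {
    Failure.SPS: "disable the SP toward the failed SPS on each attached component",
    Failure.SNC_NODE: "disable the SP toward the failed SNC node on each attached SPS",
    Failure.SOME_CPUS: "tri-state the failed CPUs",
    Failure.ALL_CPUS: "run the node as memory-only",
    Failure.DIMM: "exclude the failed DIMMs and keep the remaining memory",
    Failure.MEMORY_SUBSYSTEM: "run the node as CPU-only",
    Failure.SIOH_NODE: "disable the SP toward the failed SIOH node on each attached component",
}


def plan_degraded_boot(failures):
    """Return one action per distinct failure, in first-seen order."""
    plan = []
    for failure in failures:
        action = DEGRADED_ACTION[failure]
        if action not in plan:
            plan.append(action)
    return plan


if __name__ == "__main__":
    # Example: an SPS failure reported twice plus a DIMM failure yields two actions.
    for step in plan_degraded_boot([Failure.SPS, Failure.DIMM, Failure.SPS]):
        print(step)
```

Keeping the mapping in a table rather than in branching code makes it easy to see that every listed failure class has a defined degraded action.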
Table 6-2. RAS Roles of Different System Components
| RAS Tasks | Hardware | MCA | SM | Device Driver | OS/BIOS |
|---|---|---|---|---|---|
| Error Logging | 1 instance | NVRAM | NVRAM | Report to OS. | Report to MCA. |
| Error Containment | Data poisoning, hard fail response, machine check via BERR#/MCERR#. | Rendezvous | N/A | Discard uncorrectable error. | Kill processes/threads/application. |
| Error Recovery or Correction | Correct SBECC; detect DBECC; link level retry. | Rendezvous; retry. | Re-configuration during reset/boot. | Retry; fail-over; hot plug. | Retry; re-configuration; kill processes/threads/application. |
| Error Signaling From: | ERR[2:0]#; BERR#; BINIT#; PMI; hard fail response; regular interrupts. | BERR#; BINIT#; RESET# | PMI; BERR#; BINIT#; regular interrupts. | Regular interrupts; NMI; report to OS. | Transfer to MCA or device driver. Issue reset. |
| Error Signaling To: | Error detection circuit; BINITIN#; BERRIN#. | BERR#; BINIT#; hard fail response; CPU internal errors. | ERR[2:0]#; chipset registers; system sensors. | Regular interrupts; bad software CRC. | Regular interrupts; PMI; MCA; device driver. |
| Remote Management | SMBus ports on all chipset components. | Detailed error logs. | Out-of-band access for the remote node. | N/A | In-band access for the remote node. |
| Re-configuration | Interface enabling/disabling. Hot plug on SP, PCI and InfiniBand. | Diagnose the problem and then transfer to OS. | During reset/boot. | Fail-over; online repair and upgrade (PCI hot-plug). | Online repair and upgrade (hot-plug). |
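The Hardware entry "Correct SBECC, Detect DBECC" in Table 6-2 presumably refers to correcting single-bit ECC errors and detecting double-bit ECC errors. The sketch below illustrates that behavior with a textbook extended Hamming (8,4) code. This is an assumption chosen for brevity: this section does not state which ECC code the chipset uses, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    The codeword is a list of bits: index 0 is the overall parity bit,
    indices 1, 2 and 4 are Hamming parity bits, and indices 3, 5, 6 and 7
    carry the data (least significant data bit at index 3).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    # XOR of the positions of all set bits is zero for a valid codeword
    # and equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: treat as a single-bit error and correct it.
        # A zero syndrome here means the overall parity bit (index 0) flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: double-bit error, detect only.
        return "uncorrectable", None

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                 # inject a single-bit error
    print(decode(word))          # single-bit error is corrected
    word[2] ^= 1                 # inject a second error in the same word
    print(decode(word))          # double-bit error is only detected
```

In Table 6-2 terms, the corrected case is handled entirely in hardware, while the detected-but-uncorrectable case is the kind of event that falls to the containment and signaling rows of the table.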