Intel® E8870 Scalable Node Controller (SNC) Datasheet
6 Reliability, Availability, and Serviceability
This chapter describes the features of the E8870 chipset that support the design and development of
high Reliability, Availability, and Serviceability (RAS) systems. It first covers the chipset's
support for system integrity: the E8870 chipset provides error logging and employs a method called
end-to-end error detection for data errors. These features make it possible to identify the first
system error and its source. A summary of all errors, their classification, the chipset response,
and the associated logging registers is provided.
This chapter also describes the features that support availability and serviceability of systems
built with E8870 chipset components, and how to use them. It gives an overview of the roles and
responsibilities of firmware and system software in supporting high RAS systems. It also describes
how the chipset supports high availability through a fast re-boot in a degraded mode after a fatal
hardware error. Finally, it describes hot-plug support for PCI and SP.
6.1 Data Integrity
Errors are classified into two basic types: fatal (or non-recoverable) and non-fatal (or
recoverable).¹ Fatal errors include protocol errors, parity errors on header fields, time-outs, and
failed link-level retries. After a fatal error, continued operation of the chipset may be
compromised. After a non-fatal error, chipset operations can continue (transactions are completed,
resources are deallocated, and so on). Non-fatal errors are further classified into correctable and
non-correctable errors. Non-correctable errors are those that the chipset does not correct; they
may or may not be correctable by software. Correctable errors include single-bit ECC errors,
successful link-level retries, and transactions for which the chipset performs a master abort.
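The classification above can be sketched as a small lookup table. This is an illustrative model only, not chipset or firmware code: the names ErrorClass, EXAMPLE_ERRORS, and the string keys are hypothetical, and the table holds only the example errors named in the text. The text gives no example of a non-correctable error, so none is listed.

```python
from enum import Enum


class ErrorClass(Enum):
    """Hardware error classes as described in the text (names are hypothetical)."""
    FATAL = "fatal (non-recoverable)"
    NON_CORRECTABLE = "non-fatal, not corrected by the chipset"
    CORRECTABLE = "non-fatal, correctable"


# Only the examples the text names; the keys are illustrative labels,
# not register or signal names from the chipset.
EXAMPLE_ERRORS = {
    "protocol_error": ErrorClass.FATAL,
    "header_parity_error": ErrorClass.FATAL,
    "time_out": ErrorClass.FATAL,
    "failed_link_level_retry": ErrorClass.FATAL,
    "single_bit_ecc_error": ErrorClass.CORRECTABLE,
    "successful_link_level_retry": ErrorClass.CORRECTABLE,
    "master_abort": ErrorClass.CORRECTABLE,
}


def operation_may_be_compromised(kind: str) -> bool:
    """True for fatal errors, after which continued operation may be compromised."""
    return EXAMPLE_ERRORS[kind] is ErrorClass.FATAL


def chipset_can_continue(kind: str) -> bool:
    """True for non-fatal errors, after which chipset operations can continue."""
    return not operation_may_be_compromised(kind)


if __name__ == "__main__":
    for kind, error_class in EXAMPLE_ERRORS.items():
        print(f"{kind}: {error_class.value}")
```

Note that a link-level retry appears in two classes: a retry that succeeds is correctable, while a retry that fails is fatal.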
Each component in the chipset indicates an error condition on external pins. An open-drain pin is
provided for each error type (fatal, uncorrectable, and correctable). The system decides the best
course of action when an error is detected.
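As a rough illustration of the per-type error pins, the sketch below models each component as having one open-drain output per error type. It rests on two assumptions that the text does not state: that the pins are active-low, and that the system ties the same-type pins of several components to one shared, pulled-up line. The class and function names are hypothetical.

```python
ERROR_TYPES = ("fatal", "uncorrectable", "correctable")


class Component:
    """One chipset component with an open-drain error pin per error type."""

    def __init__(self, name: str):
        self.name = name
        # An open-drain output either pulls the line low or leaves it floating.
        self.pulling_low = {error_type: False for error_type in ERROR_TYPES}

    def signal(self, error_type: str) -> None:
        """Indicate an error of the given type on the matching pin."""
        self.pulling_low[error_type] = True


def shared_line_level(components, error_type: str) -> int:
    """Level of a shared, pulled-up line (assumed wiring): 0 if any driver pulls low."""
    return 0 if any(c.pulling_low[error_type] for c in components) else 1


if __name__ == "__main__":
    parts = [Component("A"), Component("B")]
    parts[1].signal("correctable")
    for error_type in ERROR_TYPES:
        print(error_type, shared_line_level(parts, error_type))
```

Under this wiring assumption the system sees that some component reported an error of a given type, but must read each component's error status to find out which one.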
Each E8870 chipset component provides error logging and error status for the first error it
detects, and error status for subsequent errors. Errors are detected and logged at intermediate
entry points (on the inbound SP interface, for example). Errors are detected but not logged at the
end points (where the packet is consumed or translated to another interface with different error
coverage/detection). This method of error correction and error logging is called end-to-end error
correction.
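The first-error behavior described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the chipset's register interface: ErrorRecorder, its fields, and the example error names are hypothetical. It shows only the rule from the text: the first error a component detects gets a log entry and a status indication, and later errors get a status indication only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class ErrorRecorder:
    """Per-component model: one log for the first error, status for all errors."""
    first_error_log: Optional[Tuple[str, str]] = None
    status: Set[str] = field(default_factory=set)

    def record(self, kind: str, detail: str) -> None:
        # Detail is captured only for the first error the component detects.
        if self.first_error_log is None:
            self.first_error_log = (kind, detail)
        # Status is set for the first error and for every subsequent one.
        self.status.add(kind)


if __name__ == "__main__":
    recorder = ErrorRecorder()
    recorder.record("single_bit_ecc_error", "hypothetical detail for first error")
    recorder.record("time_out", "hypothetical detail, not retained")
    print(recorder.first_error_log)
    print(sorted(recorder.status))
```

Keeping the log tied to the first error preserves the information most useful for finding the source of a failure, since later errors are often consequences of the first.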
Table 6-1 summarizes all errors detected by the E8870 chipset components. For each error, the table
lists the error type and the chipset response. If an error is the first error on the component (see
Section 6.1.3, “Error Reporting”), information may be logged for it. Where a log exists for the
error, the table gives the information that is logged and the name of the error log. Some errors
may be detected in more than one component.
1. These are hardware definitions used by the E8870 chipset, and are not the same error types that are used by software (MCA).