Intel

E8870 Scalable Node Controller (SNC) Datasheet 6-1

6 Reliability, Availability, and Serviceability

This chapter describes the features of the E8870 chipset that play a role in the design and development of high Reliability, Availability, and Serviceability (RAS) systems, beginning with the chipset's support for system integrity. The E8870 chipset provides error logging and employs a method called end-to-end error detection for data errors. These features support identification of the first system error and its source. A summary of all errors, their classification, and the chipset response and logging registers is provided.

This chapter also describes the features that support availability and serviceability of systems built using E8870 chipset components, and how to use them. An overview of the roles and responsibilities of firmware and system software in supporting high RAS systems is provided. There is also a description of how the chipset supports high availability through a fast re-boot in a degraded mode after a fatal hardware error. Finally, hot-plug support for PCI and SP is described.

6.1 Data Integrity

Errors are classified into two basic types: fatal (or non-recoverable) and non-fatal (or recoverable).

Fatal errors include protocol errors, parity errors on header fields, time-outs, failed link-level retry, etc. For fatal errors, continued operation of the chipset may be compromised.

For non-fatal errors, chipset operations can continue (transactions are completed, resources deallocated, etc.). Non-fatal errors are further classified into correctable and non-correctable errors. Non-correctable errors are those that are not “corrected” by the chipset; they may or may not be correctable by software. Correctable errors include single-bit ECC errors, successful link-level retry, and those transactions where the chipset performs a master abort of the transaction.
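The classification above can be written down as a small lookup. This is a minimal sketch, not the chipset's own encoding: the enum, the dictionary keys, and the helper function are hypothetical names chosen for illustration, and only the example conditions named in the text are included.

```python
from enum import Enum


class ErrorClass(Enum):
    FATAL = "fatal"                  # non-recoverable
    UNCORRECTABLE = "uncorrectable"  # non-fatal, not corrected by the chipset
    CORRECTABLE = "correctable"      # non-fatal, corrected by the chipset


# Example conditions taken from the text. The key names are illustrative
# and do not correspond to any register or field name in the datasheet.
EXAMPLES = {
    "protocol_error": ErrorClass.FATAL,
    "header_parity_error": ErrorClass.FATAL,
    "time_out": ErrorClass.FATAL,
    "failed_link_level_retry": ErrorClass.FATAL,
    "single_bit_ecc_error": ErrorClass.CORRECTABLE,
    "successful_link_level_retry": ErrorClass.CORRECTABLE,
    "master_abort": ErrorClass.CORRECTABLE,
}


def is_recoverable(error_class: ErrorClass) -> bool:
    """Non-fatal errors are recoverable: chipset operation can continue."""
    return error_class is not ErrorClass.FATAL
```

The text gives no concrete example of a non-correctable error, so none is listed here.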

Each component in the chipset indicates an error condition on external pins. One open-drain pin is provided for each error type (fatal, uncorrectable, and correctable). The system decides the best course of action when an error is detected.
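As a rough illustration of how open-drain error pins behave, the sketch below models one line per error type. It rests on two assumptions the text does not state: that the board ties the same-type pins of several components together, and that the pins are active low (idle high through a pull-up, driven low to signal an error). All names are hypothetical.

```python
ERROR_PINS = ("fatal", "uncorrectable", "correctable")


def line_level(pulling_low):
    """Open-drain line: 0 if any driver pulls it low, otherwise 1 (pulled up)."""
    return 0 if any(pulling_low) else 1


def shared_error_lines(components):
    """Return the level of each shared error line.

    `components` is a list of sets; each set holds the error types that
    one component is currently signalling.
    """
    return {
        pin: line_level(pin in signalled for signalled in components)
        for pin in ERROR_PINS
    }
```

For example, `shared_error_lines([set(), {"correctable"}])` leaves the fatal and uncorrectable lines high and pulls the correctable line low.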

Each E8870 chipset component provides error logging and error status for the first error detected by the component and error status for subsequent errors. Errors are detected and logged at intermediate entry points (on the inbound SP interface, for example). Errors are detected but not logged at the end points (where the packet is consumed or translated to another interface with different error coverage/detection). This method of error correction and error logging is called end-to-end error correction.
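The first-error behavior described above (a log captured only for the first error, status kept for every error) can be sketched as follows. This is an illustrative model only: the class, its fields, and the shape of the log record are assumptions and do not reflect the actual E8870 register layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ComponentErrorState:
    # Log captured once, for the first error the component detects.
    first_error_log: Optional[dict] = None
    # Status recorded for every detected error, first or subsequent.
    status: set = field(default_factory=set)

    def report(self, name: str, details: dict) -> bool:
        """Record an error; return True if it was logged as the first error."""
        self.status.add(name)
        if self.first_error_log is None:
            self.first_error_log = {"error": name, **details}
            return True
        return False
```

A second call to `report` on the same object still updates `status`, but leaves `first_error_log` untouched, which is what lets the first error and its source be identified later.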

Table 6-1 provides a summary of all errors detected by the E8870 chipset components. The table lists the error type and the chipset response. If an error is the first error on the component (see Section 6.1.3, “Error Reporting”), information may be logged for the error. If a log exists for the error, the table gives the information that is logged and the name of the error log. Some errors may be detected in more than one component.

1. These are hardware definitions used by the E8870 chipset, and are not the same error types that are used by software (MCA).