Intel® E8870 Scalable Node Controller (SNC) Datasheet
Reliability, Availability, and Serviceability
These errors do not compromise further chipset operation. They leave an error status
and log trail as the error propagates from one component to its destination. There are three
types of trailing errors detected by the chipset: 2x ECC, 1x ECC, and master abort. Each CT
error can be an error source, midpoint, and/or endpoint.
Table 6-4 provides general information for all of the errors detected by the E8870 chipset. An error
may or may not be associated with a transaction. For example, an inbound 2x ECC error detected
by the SNC can only be the result of a processor read from memory or I/O. On the other hand, an
assertion of BINIT# on the node would not occur as a result of any specific transaction.
Table 6-4 provides information as to the type of transaction that is associated with an error. Also
provided is the error class. For CT errors, the trailing error type is provided, along with the
detection point (source, mid or end). The type of trailing error along with an understanding of
whether an error can be a source, midpoint, or endpoint can be used in determining the error trail.
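The trail reconstruction described above can be sketched in a few lines of Python. This is only an illustration, not chipset-defined software: the `CtError` record, the `ROLE_ORDER` mapping, and the `build_trail` helper are hypothetical names, and the two sample entries reuse the F6 (source) and M1 (endpoint) 2xECC example given later in this section rather than the full contents of Table 6-4.

```python
from dataclasses import dataclass

# Order in which detection points appear along an error trail.
ROLE_ORDER = {"src": 0, "mid": 1, "end": 2}


@dataclass(frozen=True)
class CtError:
    code: str           # error code as logged, e.g. "F6"
    trailing_type: str  # "2xECC", "1xECC", or "master abort"
    roles: tuple        # detection points this error can represent (non-empty)


def build_trail(errors, trailing_type):
    """Return logged CT error codes of one trailing type, ordered source to endpoint."""
    matching = [e for e in errors if e.trailing_type == trailing_type]
    # Rank each error by the earliest detection point it can represent.
    matching.sort(key=lambda e: min(ROLE_ORDER[r] for r in e.roles))
    return [e.code for e in matching]


logged = [
    CtError("M1", "2xECC", ("end",)),
    CtError("F6", "2xECC", ("src",)),
]
print(build_trail(logged, "2xECC"))  # ['F6', 'M1']
```

Filtering on the trailing type first keeps unrelated trails apart, since errors with different trailing types belong to different error events.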
6.5.3.1 Error Parsing
The error record can be viewed as the raw data. Identifying the error source is useful for scheduling
system maintenance, recovery, reconfiguration on boot after error, etc. For example, a frequently
occurring parity error on a scalability port (corrected using link-level retry) may be an indication of
problems with the connector on that port.
The FERRST register of each component has a Last_Err_Value field for each error type (fatal, unc,
cor). As each component detects a fatal and/or non-fatal error, it latches the value of its error pin
associated with the error type. As a result, the value of the Last_Err_Value can be used to help
identify the error source for fatal and non-fatal errors. This relies on platform-specific support for
this feature (see footnote 1).
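One way software might use the latched value is sketched below in Python. The sketch assumes that a component which latched its error pin while the pin was still deasserted was the first to detect an error of that type. The `FerrstEntry` record, its boolean `pin_asserted_at_latch` field (standing in for a decoded Last_Err_Value), and the `component_B` label are all hypothetical; the actual field encoding and pin polarity are not described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FerrstEntry:
    component: str               # hypothetical component label
    error_type: str              # "fatal", "unc", or "cor"
    pin_asserted_at_latch: bool  # assumed decode of Last_Err_Value


def first_detectors(entries, error_type):
    """Components whose error pin was still deasserted when they logged the error."""
    return [
        e.component
        for e in entries
        if e.error_type == error_type and not e.pin_asserted_at_latch
    ]


entries = [
    FerrstEntry("SNC", "unc", pin_asserted_at_latch=False),
    FerrstEntry("component_B", "unc", pin_asserted_at_latch=True),
]
print(first_detectors(entries, "unc"))  # ['SNC']
```

The function returns a list rather than a single name because more than one component could, in principle, latch a deasserted pin before the assertion becomes visible to the others.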
There are cases where parsing of the error record cannot be accomplished in a reliable way. For the
purposes of these guidelines, parsing an error means to analyze the contents of the error record
(using an algorithm) to determine the source of the error in the platform. Some error parsing
guidelines are:
• When the first fatal error in the platform is NC, parsing of any other errors is unreliable.
• If there is an NC error reported in any of the components, parsing of any error in the platform
may be unreliable.
• If the first non-fatal error is CT, parsing the CT error may be unreliable if there are NCS errors
in the system.
• When the first non-fatal error is CT, multiple errors have been detected if there are other CT
errors detected by the chipset that have different trailing types.
• Since CS errors do not propagate, determining the error source of CS errors is implied (source
is the component that detected the error). Allowing CS errors to populate the FERRST may
compromise the parsing of an error trail for subsequent CT errors.
• For CT errors, the trail can be recreated using the src, mid, and endpoint attributes of each
error. For example, a 2xECC error on the processor bus for a write to local memory will have
F6 reported in the FERRST, and M1 reported in the SERRST. F6 is a source of a CT error, and
M1 is shown as an endpoint. In addition, the value for LastERR2 in the FERRST will show the
SNC detected the first uncorrectable error in the system. Note that the error trail in a
multi-node E8870 platform can have up to two branches (this situation can occur on an implicit
writeback that has a 2xECC error where the requesting node that receives the poisoned data is
different from the home node, which gets the memory update).
1. A larger system may provide this information in external logic to support this feature.
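The parsing guidelines above can be expressed as a small set of checks. The following Python sketch is illustrative only: the `Logged` record and the `parsing_caveats` function are hypothetical, the class labels ("NC", "NCS", "CS", "CT") are taken as plain strings from the text, and the error code "X1" in the usage example is invented for the demonstration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Logged:
    code: str                            # error code as logged
    err_class: str                       # class label as used in the text
    trailing_type: Optional[str] = None  # set only for CT errors


def parsing_caveats(first_fatal, first_nonfatal, others) -> List[str]:
    """Return the caveats from the guidelines that apply to this set of logged errors."""
    notes = []
    firsts = [e for e in (first_fatal, first_nonfatal) if e is not None]
    everything = firsts + list(others)

    if first_fatal is not None and first_fatal.err_class == "NC":
        notes.append("first fatal error is NC: parsing of other errors is unreliable")
    elif any(e.err_class == "NC" for e in everything):
        notes.append("NC error present: parsing may be unreliable")

    if first_nonfatal is not None and first_nonfatal.err_class == "CT":
        if any(e.err_class == "NCS" for e in everything):
            notes.append("NCS error present: parsing the CT error may be unreliable")
        # Trailing types of other CT errors that differ from the first one.
        other_types = {
            e.trailing_type for e in others if e.err_class == "CT"
        } - {first_nonfatal.trailing_type}
        if other_types:
            notes.append("CT errors with different trailing types: multiple errors detected")
    return notes


first = Logged("F6", "CT", "2xECC")
same_trail = [Logged("M1", "CT", "2xECC")]
print(parsing_caveats(None, first, same_trail))  # []
print(parsing_caveats(None, first, same_trail + [Logged("X1", "CT", "1xECC")]))
```

An empty result means that none of the listed caveats apply; it does not by itself guarantee that the error record can be parsed reliably.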