PCI / PCIe Error Recovery Product Note HP-UX 11i v3 HP Part Number: 5900-0584 Published: September 2010
Legal Notices © Copyright 2003-2010 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license. The information contained herein is subject to change without notice.
Table of Contents
1 PCI / PCIe Error Recovery Product Note
    Confirm PCI Error Recovery is Supported
    Using ioscan to identify PCI Error Recovery Capability
        Example:
List of Tables
1-1 Utility Subsystem FW Revision Level: 15.22
1-2 Error Recovery Attributes
1-3 Events Generated on Legacy Platforms due to PCI / PCIe Errors
1-4 Events Generated on HP Superdome 2 Platform due to PCIe Errors
1 PCI / PCIe Error Recovery Product Note
The PCI / PCIe Error Recovery feature provides the ability to detect, isolate, and automatically recover from a PCI / PCIe error, avoiding a system crash. PCI Error Recovery is included with the HP-UX 11i v3 operating system and is enabled by default.
NOTE: PCI / PCIe Error Recovery is not supported on all platforms. To determine whether PCI / PCIe Error Recovery is supported on your system, see the PCI Error Recovery Support Matrix, available at http://www.hp.
Confirm PCI Error Recovery is Supported
1. To confirm that PCI Error Recovery (ER) is supported with your configuration and system firmware version, see the PCI Error Recovery Support Matrix, HP-UX 11i v3 at: http://docs.hp.com/en/ha.html
NOTE: PCI Express ER functionality can be enabled on legacy platforms only if the following patch set is installed on the HP-UX 11i v3 OS: PHKL_37099, PHKL_37329, PHKL_37330, PHKL_37331, PHKL_37648, PHKL_37405, and PHKL_37510.
Table 1-1 Utility Subsystem FW Revision Level: 15.22 (continued)
CLU                     15.2  15.2  15.2  15.2
PM                      15.0  15.0  15.0  15.0
CIO (bay 0, chassis 1)  15.0  15.0  15.0  15.0
CIO (bay 0, chassis 3)  15.0  15.0  15.0  15.0
CIO (bay 1, chassis 1)  15.0  15.0  15.0  15.0
15.0 15.0 CIO (bay 15.
Cell 3 PDHC : A.003.027 Pri SFW : 23.001 (PA) Sec SFW : 23.001
NOTE: The sysrev command output on some systems includes extra zeros in the system firmware version number. These zeros can be ignored. For example, 3.88 and 3.088 are the same firmware version on Integrity systems, and 23.1 and 23.001 are the same firmware version on HP 9000 systems.
3. The system firmware is the main component of the firmware recipe required to support PCI Error Recovery.
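The note about extra zeros in sysrev output can be illustrated with a small comparison helper. This is a minimal sketch, not an HP tool: the function names are hypothetical, and it assumes purely numeric dotted versions such as the SFW values shown (it does not handle strings with letters, such as the PDHC value A.003.027).

```python
from typing import Tuple


def normalize_fw_version(version: str) -> Tuple[int, ...]:
    """Turn a numeric dotted version such as '3.088' into (3, 88).

    Converting each component to an integer drops the extra leading
    zeros that sysrev prints on some systems.
    """
    return tuple(int(part) for part in version.strip().split("."))


def same_fw_version(a: str, b: str) -> bool:
    """Return True if two version strings name the same firmware version."""
    return normalize_fw_version(a) == normalize_fw_version(b)


if __name__ == "__main__":
    print(same_fw_version("3.88", "3.088"))
    print(same_fw_version("23.1", "23.001"))
```

A helper like this may be convenient when comparing sysrev output against the versions listed in the support matrix.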
Using ioscan to identify PCI Error Recovery Capability
The command ioscan -P error_recovery can be used to determine whether the Local Bus Adapters (LBAs) in a system support the PCI Error Recovery feature. The capability of an LBA is in turn determined by the capability of the hardware platform and of the driver controlling the PCI adapter in the slot under that LBA.
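For scripted checks, the ioscan command above could be wrapped as follows. This is a sketch under stated assumptions: the function name is hypothetical, the output is returned unparsed because its format is not shown here, and the function returns None on hosts where ioscan is not installed.

```python
import shutil
import subprocess

# Command taken from the text above; nothing else is assumed about ioscan.
IOSCAN_ARGV = ["ioscan", "-P", "error_recovery"]


def run_ioscan_error_recovery():
    """Return the raw text output of `ioscan -P error_recovery`.

    Returns None when ioscan is not found on PATH (for example, when
    this sketch is run on a non-HP-UX host).
    """
    if shutil.which(IOSCAN_ARGV[0]) is None:
        return None
    result = subprocess.run(
        IOSCAN_ARGV, capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = run_ioscan_error_recovery()
    print(output if output is not None else "ioscan not available on this host")
```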
Tunable Kernel Parameters
There are two PCI Error Recovery tunables that you can configure:
• pci_eh_enable
This tunable enables or disables the PCI Error Recovery feature. On HP-UX 11i v3, PCI Error Recovery is enabled by default. pci_eh_enable is not a dynamic tunable; a reboot is required for changes to take effect. For more information about kernel tunable parameters, see the pci_eh_enable(5) manpage.
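Kernel tunables on HP-UX 11i are normally queried and set with the kctune command, which this note does not itself mention. The sketch below only builds the argument lists and runs nothing; the helper names are hypothetical, and the value 1 in the usage example is an assumption to be checked against the pci_eh_enable(5) manpage.

```python
def kctune_query_argv(name):
    """Build the argument list to display a tunable (assumed: `kctune name`)."""
    return ["kctune", name]


def kctune_set_argv(name, value):
    """Build the argument list to set a tunable (assumed: `kctune name=value`).

    For a non-dynamic tunable such as pci_eh_enable, the new value
    takes effect only after a reboot.
    """
    return ["kctune", "{}={}".format(name, value)]


if __name__ == "__main__":
    print(" ".join(kctune_query_argv("pci_eh_enable")))
    # The value 1 is an illustrative assumption, not taken from this note.
    print(" ".join(kctune_set_argv("pci_eh_enable", 1)))
```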
Table 1-3 Events Generated on Legacy Platforms due to PCI / PCIe Errors (continued)
Event ID  Summary
100160    A recovered platform or I/O error was detected
100161    An unrecoverable platform or I/O error was detected

Table 1-4 Events Generated on HP Superdome 2 Platform due to PCIe Errors
Error ID  Summary
100143    Link Timeout to PCIe Device
100144    Malformed Transaction Layer Packet (TLP) Error
100145    Gross PCIe Link Failure
100146    PCIe Link Failure - Packet marked as Poisoned
100147    Surprise Dow
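When scanning logs, the event IDs in these tables can be kept in a lookup table. The following is a minimal sketch with hypothetical names. It contains only the rows whose summaries are fully legible in the tables above, so it is not a complete list of events.

```python
# Rows from Table 1-3 (legacy platforms); partial, as only these rows are shown.
LEGACY_EVENTS = {
    100160: "A recovered platform or I/O error was detected",
    100161: "An unrecoverable platform or I/O error was detected",
}

# Rows from Table 1-4 (HP Superdome 2); partial, truncated rows are omitted.
SUPERDOME2_EVENTS = {
    100143: "Link Timeout to PCIe Device",
    100144: "Malformed Transaction Layer Packet (TLP) Error",
    100145: "Gross PCIe Link Failure",
    100146: "PCIe Link Failure - Packet marked as Poisoned",
}


def describe_event(event_id):
    """Return the summary for a known event ID, or a placeholder string."""
    for table in (LEGACY_EVENTS, SUPERDOME2_EVENTS):
        if event_id in table:
            return table[event_id]
    return "Unknown event ID {}".format(event_id)


if __name__ == "__main__":
    print(describe_event(100145))
```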
Automatic Recovery from a PCI Error
With the PCI Error Recovery feature enabled, if an error occurs on a PCI bus containing an I/O card that supports PCI Error Recovery, the following sequence of events occurs during automatic error recovery:
1. The PCI bus is isolated from further I/O.
2. The I/O devices are quiesced.
3. The error is cleared.
4. The bus is reset.
5. The devices are resumed.
The following example illustrates what you can expect if automatic recovery from a PCI error occurs:
1. 2.
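The five-step automatic recovery sequence described in this section can be modeled as an ordered list of steps. This is an illustrative sketch only, not the kernel implementation: the names are hypothetical, and stopping at the first failed step is an assumption made for the model.

```python
from enum import Enum


class Step(Enum):
    ISOLATE_BUS = "The PCI bus is isolated from further I/O"
    QUIESCE_DEVICES = "The I/O devices are quiesced"
    CLEAR_ERROR = "The error is cleared"
    RESET_BUS = "The bus is reset"
    RESUME_DEVICES = "The devices are resumed"


# Enum members iterate in definition order, which matches the documented order.
RECOVERY_SEQUENCE = list(Step)


def run_recovery(handlers):
    """Call one handler per step, in order, and return the completed steps.

    `handlers` maps each Step to a callable returning True on success.
    Assumption: the sequence stops at the first step that fails.
    """
    completed = []
    for step in RECOVERY_SEQUENCE:
        if not handlers[step]():
            break
        completed.append(step)
    return completed


if __name__ == "__main__":
    always_ok = {step: (lambda: True) for step in Step}
    for done in run_recovery(always_ok):
        print(done.value)
```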
Manual Recovery from a PCI Error
After a successful automatic PCI error recovery, if another PCI error is detected within the time interval specified by the pci_error_tolerance_time tunable, the card in the I/O slot will be suspended. A manual PCI Error Recovery operation is required to restore the card.
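The suspension rule above amounts to a simple time comparison. The sketch below is a model of that rule, not the kernel logic: the function name is hypothetical, the timestamps and the tolerance are assumed to share the same time unit, and treating an error exactly at the boundary as "within" the interval is an assumption.

```python
def should_suspend(last_recovery_time, new_error_time, tolerance_time):
    """Decide whether a new PCI error leads to suspension of the card.

    last_recovery_time: time of the last successful automatic recovery,
        or None if no recovery has happened yet.
    new_error_time: time at which the new error was detected.
    tolerance_time: value of pci_error_tolerance_time, in the same unit.
    """
    if last_recovery_time is None:
        # No earlier recovery: automatic recovery is attempted as usual.
        return False
    return (new_error_time - last_recovery_time) <= tolerance_time


if __name__ == "__main__":
    print(should_suspend(100, 130, 60))
    print(should_suspend(100, 200, 60))
```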
7. After the card has been resumed, a recovery message is displayed on the console, for example:
Hardware path 0/0/0 Successfully recovered from PCI Error
8. If the olrad -R command does not succeed, you have a persistent PCI error condition. There is a high probability that the I/O card is defective. A failure message is displayed on the console, for example:
Automatic PCI Error Recovery Operation failed at Hardware path 0/0/0. Path may be recovered using a Manual Error Recovery operation.
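The two console messages quoted in steps 7 and 8 can be recognized in captured console output with regular expressions. This is a minimal sketch with hypothetical names. The patterns are derived only from the two example messages above, so other wordings may exist that they do not match.

```python
import re

# Patterns based solely on the two example console messages in this section.
_RECOVERED = re.compile(
    r"Hardware path (\S+) Successfully recovered from PCI Error"
)
_FAILED = re.compile(
    r"PCI Error Recovery Operation failed at Hardware path (\S+)\.(?:\s|$)"
)


def classify_console_line(line):
    """Return ('recovered', path), ('failed', path), or None for other lines."""
    match = _RECOVERED.search(line)
    if match:
        return ("recovered", match.group(1))
    match = _FAILED.search(line)
    if match:
        return ("failed", match.group(1))
    return None


if __name__ == "__main__":
    print(classify_console_line(
        "Hardware path 0/0/0 Successfully recovered from PCI Error"
    ))
```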
PCI Error Recovery Documentation
The documentation that supports this release of the PCI Error Recovery feature consists of:
• PCI Error Recovery Support Matrix, available at http://www.hp.com/go/hpux-networking-docs in the HP-UX 11i v3 Networking Software category.
• Interface Card OL* Support Guide, available at http://www.hp.com/go/hpux-networking-docs in the HP-UX 11i v3 Networking Software category.
• Patch Management User Guide for HP-UX 11.x Systems, available at http://www.hp.
Terms and Definitions
HPMC
    High Priority Machine Check. Highest priority interruption on PA-RISC based systems.
MCA
    Machine Check Abort. Highest priority interruption on Itanium based systems.
Post Replace Operation
    Issuing the olrad -R slot_id command after an I/O card is replaced turns on slot power, resumes suspended drivers, runs the driver scripts (post_replace) for the slot (slot_id) and any affected slots, and sets the attention LED for the slot (slot_id) to OFF.
PCI / PCIe Erro