PCI Error Handling Product Note
HP-UX Servers and Workstations
Second Edition
Manufacturing Part Number: 5991-5308
April 2006
United States
© Copyright 2001-2006 Hewlett-Packard Development Company LP. All rights reserved.
Legal Notices The information in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be held liable for errors contained herein or direct, indirect, special, incidental or consequential damages in connection with the furnishing, performance, or use of this material.
Publishing History New editions of this manual will incorporate information that is new or has changed since the previous edition was published (minor typographical or formatting corrections do not result in the publication of a new edition). The publishing date, manufacturing part number, and edition number all change each time a new edition is published, providing unique identification for each edition.
Contents
PCI Error Handling Product Note
    What is PCI Error Handling?
    Accessing and Installing the PCI Error Handling Feature
        Confirm PCI Error Handling is Supported
What is PCI Error Handling?
The PCI Error Handling feature allows an HP-UX system to avoid a Machine Check Abort (MCA) or a High Priority Machine Check (HPMC) when a PCI error occurs (for example, a parity error). Without the PCI Error Handling feature installed, PCI slots are set in hardfail mode. If a PCI error occurs while a slot is in hardfail mode, an MCA or HPMC occurs and the system crashes.
Accessing and Installing the PCI Error Handling Feature
MP:CM> sysrev

Utility Subsystem FW Revision Level: 15.22

                       |   Cabinet #0    |   Cabinet #1    | Cab #8 | Cab #9 |
-----------------------+-----------------+-----------------+--------+--------+
                       | SYS FW |  PDHC  |                 |        |        |
Cell (slot 0)          |  3.64  | 15.12  |  3.82  | 15.12  |        |        |
Cell (slot 1)          |  3.82  | 15.12  |  3.66  | 15.12  |        |        |
Cell (slot 2)          |  3.88  | 15.14  |  3.66  | 15.12  |        |        |
Cell (slot 3)          |  3.82  | 15.
                     GPM      FM       OSP
                     -------  -------  -------
System Backplane  :  1.002    1.002    1.002

                     LPM      HS
                     -------  -------
PCI-X Backplane   :  2.000    1.000

                     Master   Slave
                     -------- -------
Core IO           :  2.010    2.010

                     LPM      PDHC
                     -------  -------
Cell 0            :  1.002    1.010
Cell 1            :  1.002    1.010
Cell 2            :  1.002    1.010
Cell 3            :  1.002    1.010

FIRMWARE:
Core IO
    Master      : A.007.008
    Event Dict. : 0.009
    Slave       : A.007.
    Event Dict. :
Cell 1
    PDHC    : A.003.027
    Pri SFW : 23.001 (PA)
    Sec SFW : 23.001 (PA)
Cell 2
    PDHC    : A.003.027
    Pri SFW : 23.001 (PA)
    Sec SFW : 23.001 (PA)
Cell 3
    PDHC    : A.003.027
    Pri SFW : 23.001 (PA)
    Sec SFW : 23.001

NOTE  The sysrev command output on some systems includes extra zeros in the system firmware version number. These zeros can be ignored. For example, 3.88 and 3.
Installing PCI Error Handling from the Software Pack CD-ROM
To install PCI Error Handling from the Software Pack CD-ROM:
Step 1. Log in as root.
Step 2. Mount the CD drive to the desired directory. You can find the CD drive device file by using the ioscan -fnC disk command. The following example uses the /cdrom directory:
    # mount -r /dev/dsk/c1t2d0 /cdrom
Step 3.
Installing PCI Error Handling from the Software Depot
To install PCI Error Handling from the Software Depot:
Step 1. Go to the HP Software Depot at http://h20293.www2.hp.com
Step 2. Select “Enhancement releases and patch bundles”
Step 3. Select HP-UX Software Pack (Optional HP-UX 11i v2 Core Enhancements)
Step 4.
PCI Error Handling Support Matrix
For the March 2006 release, the PCI Error Handling feature is supported on HP-UX 11i v2 with four I/O card drivers and six systems, as detailed in Table 1.

Table 1    PCI Error Handling Support Matrix

Supported OS Versions    Supported HP Systems
HP-UX 11i v2             Integrity Superdome    3.88    igelan (IPF Gigabit Ethernet - networking)
                         rx8620                 3.88
                         rx7620                 3.88
                         PA RISC Superdome      23.1
                         rp8420                 23.1
                         rp7420                 23.
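As an illustration only, the following shell sketch compares a system firmware revision reported by sysrev with a revision listed in Table 1. It assumes that the numbers in Table 1 are system firmware revisions that a system must meet or exceed, and that the extra zeros in sysrev output are leading zeros within a dot-separated field (for example, 23.001 for 23.1). This note does not confirm either assumption. The function names fw_normalize and fw_at_least are hypothetical and are not part of HP-UX.

```shell
# Hypothetical helpers, not HP-UX commands.
# Assumption: revisions are purely numeric, dot-separated fields.

# fw_normalize REV: drop extra leading zeros in each field (23.001 -> 23.1).
fw_normalize() {
    printf '%s\n' "$1" | awk -F. '{
        out = ""
        for (i = 1; i <= NF; i++) {
            sep = (i > 1) ? "." : ""
            n = $i + 0          # numeric conversion removes leading zeros
            out = out sep n
        }
        print out
    }'
}

# fw_at_least HAVE WANT: succeed if HAVE is the same as or newer than WANT.
# Fields are compared numerically from left to right.
fw_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        nh = split(have, h, ".")
        nw = split(want, w, ".")
        n = (nh > nw) ? nh : nw
        for (i = 1; i <= n; i++) {
            a = h[i] + 0
            b = w[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}
```

With these definitions, fw_at_least 3.82 3.88 fails and fw_at_least 23.001 23.1 succeeds.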
New Error Messages for PCI Error Handling
When the PCI Error Handling feature is installed, new error messages are included for each of the drivers that support PCI Error Handling.
- Error messages for the btlan, igelan, and iether drivers appear in the console log only; they are not logged in syslog.
- Error messages for the fcd driver are logged in syslog only; they do not appear in the console log.
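The two rules above can be expressed as a small lookup. The following shell sketch is illustrative only; the function name pci_eh_log_location is hypothetical, and the function covers only the four drivers named in this section.

```shell
# Hypothetical helper, not an HP-UX command.
# Prints where a driver's PCI Error Handling messages are expected,
# according to the two rules stated in this section.
pci_eh_log_location() {
    case "$1" in
        btlan|igelan|iether) echo "console" ;;   # console log only
        fcd)                 echo "syslog" ;;    # syslog only
        *)                   echo "unknown"; return 1 ;;
    esac
}
```

For example, pci_eh_log_location fcd prints syslog, and a driver not named in this section prints unknown with a nonzero exit status.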
-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Fri Dec 02 PST 2005 11:27:44.753351 DISASTER Subsys:IGELAN Loc:00000
<1004> 1000Base-T in path 1/0/8/1/0/6/1 Is being suspended due to a PCI Error.
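A message like the one above names the subsystem and the hardware path of the suspended card. The following shell sketch shows one way to pull both out of saved log text supplied on standard input. It assumes that other suspend messages use the same wording as this sample, and that the hardware path and the words "suspended due to a PCI Error" appear on the same line. The function name pci_eh_suspended is hypothetical, and how the log text is captured is outside the scope of this sketch.

```shell
# Hypothetical helper, not an HP-UX command.
# Reads log text on standard input and prints "SUBSYSTEM PATH" for each
# suspend message that matches the sample format shown in this note.
pci_eh_suspended() {
    awk '
        # Remember the most recent subsystem name (text after "Subsys:").
        match($0, "Subsys:[A-Za-z0-9_]+") {
            subsys = substr($0, RSTART + 7, RLENGTH - 7)
        }
        # On a suspend message, print the subsystem and the hardware path.
        /suspended due to a PCI Error/ && match($0, "in path [0-9/]+") {
            print subsys, substr($0, RSTART + 8, RLENGTH - 8)
        }
    '
}
```

Given the sample message above as input, the function prints IGELAN 1/0/8/1/0/6/1.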
How to Online Recover from a PCI Error
The olrad command and the Attention Button can be used to attempt online recovery from a PCI error without requiring a system reboot.

Recovery Using the olrad Command
Step 1. If the PCI slot remains powered ON, use the olrad -p OFF slot_id command to power it OFF.
Step 2. If power OFF succeeds, try a Post Replace operation at the slot using the olrad -R slot_id command.
Step 3.
The following example shows how the PCI Error Handling feature is used to handle a PCI error involving the iether driver:

NOTE  The PCI Error Handling procedure detailed in this example may vary slightly from what you experience, depending on the platform and I/O card driver.

A.
B.
0-0-1-7   1/0/2/1   56   133   133   Off   No    N/A   N/A   N/A   PCI-X   PCI-X
0-0-1-8   1/0/1/1   28   133   66    On    Yes   No    No    N/A   PCI-X   PCI
================================================================================

F. Use the olrad -R command to resume the card:

root [hpfcs774;ia64 hp;1123] olrad -R 0-0-1-1
Activity    : Start of Post Replace
Target slot : 0-0-1-1
Activity    : post_replace:/usr/sbin/olrad.
Recovery Using the Attention Button
To use the Attention Button to recover from a PCI error, refer to the Interface Card OL* Support Guide, September 2004, Manufacturing Part Number B2355-90862, for instructions on using the Attention Button. Then use the Attention Button to complete the same steps that are illustrated in “Recovery Using the olrad Command” on page 16:
Step 1. Confirm the driver/card is suspended.
Step 2.
Known Problems
IMPORTANT  If you use Serviceguard, HP recommends that the PCI Error Handling feature be enabled only if your storage devices are configured with multiple paths and are protected by high availability storage software such as PVLink, SecurePath, or MirrorDisk/UX. If PCI Error Handling is enabled but your storage devices are configured with only a single path, Serviceguard may not detect the loss of connectivity and therefore may not trigger a failover.
Terms and Definitions
HPMC  High Priority Machine Check - Highest priority interruption on PA-RISC based systems
MCA  Machine Check Abort - Highest priority interruption on Itanium-based systems
Post Replace Operation - By issuing the olrad -R slot_id command after an I/O card is replaced, slot power is turned on, suspended drivers are resumed, driver scripts (post_replace) for the slot (slot_id) and affected slots (if any) are run, and the attention LED