System Event Log Troubleshooting Guide for Intel® S5500/S3420 Series Server Boards
Intel order number G74211-002
Revision 1.
Revision History

Date            Revision Number   Modifications
August 2012     1.0               Initial draft.
December 2013   1.1               Corrected IPMI Watchdog and PEF Sensors Typical Characteristics tables.
                                  Clarified Channel designators for DIMM memory errors.
                                  Added ME sensor 17h.
Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
Table of Contents
1. Introduction
1.1 Purpose
1.2 Industry Standard
6.3 CPU Missing Sensor
6.3.1 CPU Missing Sensor – Next Steps
6.4 QuickPath Interconnect Error Sensors
6.4.1 QPI Correctable Error Sensor
12.1 HSC Backplane Temperature Sensor
12.2 HSC Drive Slot Status Sensor
12.2.1 HSC Drive Slot Status Sensor – Next Steps
12.3 HSC Drive Presence Sensor
List of Tables
Table 1: SEL Record Format
Table 2: Event Request Message Event Data Field Contents
Table 3: OEM SEL Record (Type C0h-DFh)
Table 40: Processor Thermal Control % Sensors Event Triggers – Description
Table 41: Processor Thermal Control % Sensors – Next Steps
Table 42: Discrete Thermal Sensors Typical Characteristics
Table 43: Discrete Thermal Sensors – Next Steps
Table 81: HSC Backplane Temperature Sensor Typical Characteristics
Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Table 83: HSC Drive Slot Status Sensor Typical Characteristics
Table 84: HSC Drive Presence Sensor Typical Characteristics
1. Introduction
The server management hardware that is part of Intel® Server Boards and Intel® Server Platforms is a vital part of the overall server management strategy. It provides essential information to the system administrator and lets the administrator control the server remotely, even when the operating system is not running.
- The baseboard management controller and chassis
- The baseboard management controller and systems management software
- Between servers

IPMI enables the following:
- Common access to platform management information, consisting of:
  - Local access from systems management software
  - Remote access from LAN
  - Inter-chassis access from Intelligent Chassis Management Bus
  - Access from LAN, serial/modem, IPMB, PC
The BMC allows access to the SEL through in-band and out-of-band mechanisms. Various tools and utilities can be used to access the SEL, including the Intel® SELViewer and multiple open source IPMI tools.

1.2.3 Intel® Intelligent Power Node Manager Version 1.5
Intel® Intelligent Power Node Manager version 1.
2. Basic Decoding of a SEL Record
The System Event Log (SEL) record format is defined in the IPMI Specification. The following section provides a basic definition for each of the fields in a SEL record. For more details, see the IPMI Specification. The definitions for the standard SEL can be found in Table 1. The definitions for the OEM defined event logs can be found in Table 3 and Table 4.

2.
Byte   Field                Description
8, 9   Generator ID (GID)   RqSA and LUN if event was generated from IPMB.
                            Software ID if event was generated from system software.
Table 2: Event Request Message Event Data Field Contents

Sensor Class   Event Data
Threshold      Event Data 1
                 [7:6] – 00b = Unspecified Event Data 2
                         01b = Trigger reading in Event Data 2
                         10b = OEM code in Event Data 2
                         11b = Sensor-specific event extension code in Event Data 2
                 [5:4] – 00b = Unspecified Event Data 3
                         01b = Trigger threshold value in Event Data 3
                         10b = OEM code in Event Data 3
                         11b = Sen
Table 3: OEM SEL Record (Type C0h-DFh)

Byte         Field              Description
1, 2         Record ID (RID)    ID used for SEL Record access.
3            Record Type (RT)   [7:0] – Record Type
                                  C0h-DFh = OEM timestamped, bytes 8-16 OEM defined
4, 5, 6, 7   Timestamp (TS)     Time when event was logged. LS byte first.
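As a rough illustration of the byte layout in Table 3, the following Python sketch unpacks the first seven bytes of a raw record. It is not taken from the IPMI Specification or from any Intel tool: the function name `decode_sel_header` and the dictionary keys are hypothetical. It assumes a 16-byte record, assumes the Record ID is stored LS byte first like the timestamp, and assumes the timestamp counts seconds since 1970-01-01 and can be read as UTC, which may not match how a particular BMC clock was set.

```python
import struct
from datetime import datetime, timezone


def decode_sel_header(record: bytes) -> dict:
    """Decode bytes 1-7 of a SEL record: Record ID, Record Type, Timestamp."""
    if len(record) != 16:
        raise ValueError("expected a 16-byte SEL record")
    # '<' = little-endian (LS byte first), no padding.
    # H = 2-byte Record ID, B = 1-byte Record Type, I = 4-byte Timestamp.
    record_id, record_type, timestamp = struct.unpack_from("<HBI", record, 0)
    is_oem = 0xC0 <= record_type <= 0xDF  # OEM timestamped range from Table 3
    return {
        "record_id": record_id,
        "record_type": record_type,
        "is_oem_timestamped": is_oem,
        # Assumption: seconds since 1970-01-01, interpreted as UTC.
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        # Bytes 8-16 are OEM defined for record types C0h-DFh.
        "oem_data": record[7:16] if is_oem else None,
    }


if __name__ == "__main__":
    raw = bytes([0x34, 0x12, 0xC0, 0x00, 0x00, 0x00, 0x50]) + bytes(9)
    print(decode_sel_header(raw))
```

Using `struct` with an explicit `<` prefix keeps the byte order visible in one place, so it is easy to change if a record source turns out to use a different layout.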
3. Sensor Cross Reference List
This section contains a cross reference to help find details on any specific SEL entry.

3.1 BMC owned Sensors (GID = 0020h)
The following table can be used to find the details of sensors owned by the BMC.
Sensor Number   Sensor Name                           Details Section   Next Steps
10h             BB +1.1V IOH (BB +1.1V IOH)           Voltage Sensors   Table 14: Voltage Sensors – Next Steps
11h             BB +1.1V P1 Vccp (BB +1.1V P1 Vccp)   Voltage Sensors   Table 14: Voltage Sensors – Next Steps
12h             BB +1.1V P2 Vccp (BB +1.1V P2 Vccp)   Voltage Sensors   Table 14: Voltage Sensors – Next Steps
13h             BB +1.5V P1 DDR3 (BB +1.
Sensor Number   Sensor Name                    Details Section   Next Steps
1Eh             BB +1.35V P2 LV DDR3 (BB +1.
Sensor Number   Sensor Name                                                        Details Section                         Next Steps
54h             Power Supply 1 +12V % of Maximum Current Output (PS1 Curr Out %)   Power Supply Current Output % Sensors   Table 24: Power Supply Current Output % Sensor – Event Trigger Offset – Next Steps
55h             Power Supply 2 +12V % of Maximum Current Output (PS2 Curr Out %)   Power Supply Current Output % Sensors   Table 24: Power Supply Current Output %
Sensor Number   Sensor Name                           Details Section            Next Steps
6Ah             IOH Thermal Trip (IOH Thermal Trip)   Discrete Thermal Sensors   Table 43: Discrete Thermal Sensors

3.2 BIOS POST owned Sensors (GID = 0001h)
The following table can be used to find the details of sensors owned by BIOS POST.
Table 7: BIOS SMI owned Sensors

Sensor Number   Sensor Name               Details Section                                  Next Steps
02h             Memory ECC Error          Memory Correctable and Uncorrectable ECC Error   Table 61: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
03h             Legacy PCI Error          Legacy PCI Errors                                Table 68: Legacy PCI Error Sensor Event Trigger Offset – Next Steps
04h             PCI Express Fatal Error   PCI Exp
3.4 Hot Swap Controller Firmware owned Sensors (GID = 00C0h/00C2h)
The following table can be used to find the details of sensors owned by the Hot Swap Controller (HSC) firmware. The HSC firmware resides on a Hot Swap Back Plane (HSBP). There can be up to two HSBPs in a system. Each HSBP will have its own GID.
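When scanning a log, the first step in using this cross reference is to turn the Generator ID into an owner. A minimal Python sketch of that lookup follows, using only the GID values printed in this guide (0020h, 0001h, 0033h, 00C0h/00C2h, and 0041). The names `GENERATOR_OWNERS` and `generator_owner` are hypothetical, and treating 0041 as a hexadecimal value is an assumption, because the guide prints it without the `h` suffix.

```python
# Generator ID (GID) values as printed in this guide.
GENERATOR_OWNERS = {
    0x0020: "BMC",
    0x0001: "BIOS POST",
    0x0033: "BIOS SMI Handler",
    0x00C0: "Hot Swap Controller firmware",
    0x00C2: "Hot Swap Controller firmware",
    0x0041: "Microsoft OS",  # assumption: 0041 is hexadecimal
}


def generator_owner(gid: int) -> str:
    """Return the owner name for a Generator ID, or 'unknown' if not listed."""
    return GENERATOR_OWNERS.get(gid, "unknown")


if __name__ == "__main__":
    for gid in (0x0020, 0x00C2, 0x1234):
        print(f"{gid:04X}h -> {generator_owner(gid)}")
```

The two HSC entries are kept separate because each HSBP has its own GID; the guide text shown here does not say which backplane uses which value, so the sketch does not distinguish them.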
Sensor Number   Sensor Name             Details Section                Next Steps
09h             Drive Slot 7 Status     HSC Drive Slot Status Sensor   HSC Drive Slot Status Sensor – Next Steps
0Ah             Drive Slot 0 Presence   HSC Drive Presence Sensor      HSC Drive Presence Sensor – Next Steps
0Bh             Drive Slot 1 Presence   HSC Drive Presence Sensor      HSC Drive Presence Sensor – Next Steps
0Ch             Drive Slot 2 Presence   HSC Drive Presence Sensor
3.6 Microsoft* OS owned Events (GID = 0041)
The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).

Table 10: Microsoft* OS owned Events

Sensor Name
Boot Event
Shutdown Event
Bug Check / Blue Screen

3.
4. Power Subsystems
The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.

4.1 Voltage Sensors
The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI-compliant analog/threshold sensors.
Table 13: Voltage Sensors Event Triggers – Description

Hex   Event Trigger Description      Assertion Severity   Deassert Severity   Description
00h   Lower non-critical going low   Degraded             OK                  The voltage has dropped below its lower non-critical threshold.
02h   Lower critical going low       Non-fatal            Degraded            The voltage has dropped below its lower critical threshold.
Sensor Number   Sensor Name        Next Steps
13h             BB +1.5V P1 DDR3   This 1.5V line is supplied by the main board and is used by the memory on processor 1.
                                   1. Ensure all cables are connected correctly.
                                   2. Check that the DIMMs are seated properly.
                                   3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise, replace the DIMM.
14h             BB +1.5V P2 DDR3   This 1.
Sensor Number   Sensor Name     Next Steps
18h             BB +3.3V Vbat   +3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on. +3.3V Vbat is used by the CMOS and related circuits.
                                1. Replace the CMOS battery. Any battery of type CR2032 can be used.
                                2. If the error remains (unlikely), replace the board.
19h             BB +5.0V        +5.0V is supplied by the power supplies. +5.
Sensor Number   Sensor Name       Next Steps
1Dh             BB +1.35 P1 Mem   This 1.35V line is supplied by the main board and is used by low voltage memory on processor 1.
                                  1. Ensure all cables are connected correctly.
                                  2. Check that the DIMMs are seated properly.
                                  3. Cross test the DIMMs.
                                  4. If the issue remains with the DIMMs on this socket, replace the main board; otherwise, replace the DIMM.
1Eh             BB +1.35 P2 Mem   This 1.
Byte   Field                            Description
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] = Sensor Specific offset as described in Table 9
15     Event Data 2                     Not used
16     Event Data 3                     Not used

Table 16: Power Unit Status Sensor
Table 17: Power Unit Redundancy Sensors Typical Characteristics

Byte   Field                            Description
11     Sensor Type                      09h = Power Unit
12     Sensor Number                    02h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 0Bh (Generic Discrete)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger O
4.3 Power Supply
The BMC monitors the power supply subsystem.

4.3.1 Power Supply Status Sensors
These sensors report the status of the power supplies in the system. An event can be logged when AC is first applied to the system or removed. An event can also be logged on a failure, a predictive failure, or a configuration error.
Sensor Specific Offset (Hex and Description), Description, Next Steps

01h  Failure
     Description: Power supply failed.
     Next Steps: Indicates a power supply failed.
       1) Remove and reapply AC.
       2) If the power supply still fails, replace it.
02h  Predictive Failure
     Description: Typically means a fan inside the power supply is not cooling the power supply. It may indicate the fan is failing.
     Next Steps: Replace the power supply.
03h  AC lost
     Description: AC removed.
Byte   Field          Description
15     Event Data 2   Reading that triggered event
16     Event Data 3   Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.
Byte   Field                            Description
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 01h (Threshold)
14     Event Data 1                     [7:6] – 01b = Trigger reading in Event Data 2
                                        [5:4] – 01b = Trigger threshold in Event Data 3
                                        [3:0] – Event Trigger Offset as described in Table 24
15     Event Data 2                     Reading that triggered event
16     Event Data 3                     Threshold value
Table 25: Power Supply Temperature Sensors Typical Characteristics

Byte   Field                            Description
11     Sensor Type                      01h = Temperature
12     Sensor Number                    56h = Power Supply 1 Temperature
                                        57h = Power Supply 2 Temperature
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 01h (Threshold)
14     Event Data 1                     [7:6] – 01b = Trigger reading in Event Data
5. Cooling Subsystem

5.1 Fan Sensors
There are three types of fan sensors that can be present on Intel® Server Systems: speed, presence, and redundancy. The last two are only present in systems with hot-swap redundant fans.

5.1.1 Fan Speed Sensors
Fan speed sensors monitor the rpm signal on the relevant fan headers on the platform. Fan speed sensors are threshold-based sensors.
Table 28: Fan Speed Sensor – Event Trigger Offset – Next Steps

Hex   Event Trigger Offset Description   Assertion Severity   Deassert Severity   Description                                                           Next Steps
00h   Lower non-critical going low       Degraded             OK                  The fan speed has dropped below its lower non-critical threshold.
02h   Lower critical going low           Non-fatal            Degraded            The fan speed has dropped below its lower critical threshold.

5.1.2
Byte   Field          Description
14     Event Data 1   [7:6] – 00b = Unspecified Event Data 2
                      [5:4] – 00b = Unspecified Event Data 3
                      [3:0] – Event Trigger Offset as described in Table 30
15     Event Data 2   Not used
16     Event Data 3   Not used

The following table describes the severity of each of the event triggers for both assertion and deassertion.
Byte   Field                            Description
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 0Bh (Generic Discrete)
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset as described in Table 32
15     Event Data 2                     Not used
16     Event Data 3                     Not used

The following table describes the
5.2 Temperature Sensors
There are a variety of temperature sensors that can be implemented on Intel® Server Systems. They are split into three types: regular temperature sensors, thermal margin sensors, and discrete temperature sensors. Each of them has its own types of events that can be logged.

5.2.
Table 34: Temperature Sensors Event Triggers – Description

Hex   Event Trigger Description      Assertion Severity   Deassert Severity   Description
00h   Lower non-critical going low   Degraded             OK                  The temperature has dropped below its lower non-critical threshold.
02h   Lower critical going low       Non-fatal            Degraded            The temperature has dropped below its lower critical threshold.
5.2.2 Thermal Margin Sensors
Margin sensors are also linear sensors but typically report a negative value. This is not an actual temperature but an offset to a critical temperature. Example sensors are Processor Thermal Margin, Memory Thermal Margin, and IOH Thermal Margin. Values reported should be seen as the number of degrees below a critical temperature for the particular component.
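For example, a margin reading of -20 would mean the component is 20 degrees below its critical temperature. The short Python sketch below expresses that interpretation. The function name `describe_thermal_margin` is hypothetical, the input is assumed to be an already converted signed reading, and the handling of zero or positive readings is an assumption that this guide does not spell out.

```python
def describe_thermal_margin(margin: int) -> str:
    """Interpret a thermal margin reading as distance to the critical temperature."""
    if margin < 0:
        # Typical case: a negative offset, i.e. degrees below critical.
        return f"{-margin} degrees below the critical temperature"
    # Assumption: zero or a positive value means the critical
    # temperature has been reached or exceeded.
    return "at or above the critical temperature"


if __name__ == "__main__":
    for reading in (-20, -3, 0):
        print(reading, "->", describe_thermal_margin(reading))
```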
Table 38: Thermal Margin Sensors – Next Steps

Sensor Number   Sensor Name
22h             IOH Therm Margin
23h             Mem P1 Therm Margin
24h             Mem P2 Therm Margin
62h             P1 Therm Margin
63h             P2 Therm Margin

Next Steps
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4.

5.2.3
Byte   Field          Description
16     Event Data 3   Threshold value that triggered event.

Table 40: Processor Thermal Control % Sensors Event Triggers – Description

Hex   Event Trigger Description       Assertion Severity   Deassert Severity   Description
07h   Upper non-critical going high   Degraded             OK                  The thermal margin has gone over its upper non-critical threshold.
Table 42: Discrete Thermal Sensors Typical Characteristics

Byte   Field                            Description
11     Sensor Type                      01h = Temperature
12     Sensor Number                    See Table 43
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = See Table 43
14     Event Data 1                     [7:6] – 00b = Unspecified Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset
6. Processor Subsystem
Intel® servers report several processor-centric sensors in the SEL.

6.1 Processor Status Sensor
The status sensor reports processor presence or a thermal trip condition. Each processor has a status sensor.

Table 44: Processor Status Sensors Typical Characteristics

Byte
Table 45: Processor Status Sensors – Next Steps

Sensor Number   Sensor Name   Event Trigger Offset (Hex)   Description      Description
60h             P1 Status     01h                          Thermal trip     The processor exceeded the maximum temperature.
                              07h                          State Asserted   Indicates the processor is present.
61h             P2 Status     01h                          Thermal trip     The processor exceeded the maximum temperature.

6.2
Byte   Field          Description
16     Event Data 3   Not used.

6.2.1 Catastrophic Error Sensor – Next Steps
This error is typically caused by other platform components.
1. Check for other errors near the time of the CATERR event.
2. Verify all peripherals are plugged in and operating correctly, particularly Hard Drives, Optical Drives, and I/O.
3. Update system firmware and drivers.

6.
6.3.1 CPU Missing Sensor – Next Steps
Verify the processor is installed in the correct slot.

6.4 QuickPath Interconnect Error Sensors
The Intel® QuickPath Interconnect (QPI) bus on Intel® S5500/S3420 series server boards is the interconnection between processors and between the processors and the chipset. The QPI Error sensors are all reported by the BIOS SMI Handler to the BMC, so the Generator ID will be 33h.

6.4.
6.4.1.1 QPI Correctable Error Sensor – Next Steps
This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor if possible.

6.4.
6.4.2.1 QPI Non-Fatal Error Sensor – Next Steps
This is an Informational event only. Non-Fatal errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor if possible.

6.4.3 QPI Fatal and Fatal #2
The system detected a QPI fatal or non-recoverable error.
Table 51: QPI Fatal #2 Error Sensor Typical Characteristics

Byte

6.4.3.
7. Memory Subsystem
Intel® servers report memory errors, status, and configuration in the SEL.

7.1 Memory RAS Mirroring and Sparing
“Memory RAS Configuration Status” refers to the BIOS sending the current RAS mode and RAS operational state to the BMC, which logs them as a SEL record. This allows remote software to query and retrieve the system memory state.
Byte   Field                            Description
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 09h (Digital Discrete)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset as described in Table 53
15     Event Data 2                     Not used
16     Event Data 3                     Not used

Table 53: Mirroring Configuration S
Table 54: Mirrored Redundancy State Sensor Typical Characteristics

Byte   Field                            Description
8, 9   Generator ID                     0001h = BIOS POST
11     Sensor Type                      0Ch = Memory
12     Sensor Number                    01h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 0Bh (Generic Discrete)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Data 2
                                        [5:4] – 10b = OEM code
Byte   Field          Description
16     Event Data 3   [7] – Domain Instance Type
                        0b: Local memory mirroring domain instance. This SEL record pertains to a local memory mirroring domain that is restricted to memory mirroring pairs within a processor socket only.
                        1b: Global memory mirroring domain instance. This SEL record pertains to a global memory mirroring domain that covers memory mirroring between processor sockets.
Byte   Field                            Description
12     Sensor Number                    13h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 09h (Digital Discrete)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Data 2
                                        [5:4] – 00b = Unspecified Event Data 3
                                        [3:0] – Event Trigger Offset as described in Table 57
15     Event Data 2                     Not used
16     Event Data 3                     Not used

Table 57: S
Table 58: Sparing Redundancy State Sensor Typical Characteristics

Byte
Byte   Field          Description
16     Event Data 3   [7] – Domain Instance Type
                        0b: Local memory sparing domain instance. This SEL record pertains to a local memory sparing domain that is restricted to memory sparing pairs within a processor socket only.
                        1b: Global memory sparing domain instance. This SEL record pertains to a global memory sparing domain that covers memory sparing between processor sockets.
7.2 ECC and Address Parity
1. Memory data errors are logged as correctable or uncorrectable.
2. Uncorrectable errors are fatal.
3. Memory addresses are protected with parity bits and a parity error is logged. This is a fatal error.

7.2.1 Memory Correctable and Uncorrectable ECC Error
ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors.
Byte   Field          Description
16     Event Data 3   [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
                        000b = Processor Socket 1
                        001b = Processor Socket 2
                        All other values are reserved.
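The socket field described above can be extracted with a shift and a mask. The following Python sketch is one possible way to do it; the function name `ecc_processor_socket` is hypothetical, and raising an error for reserved values is a design choice of this example, not something the guide prescribes.

```python
def ecc_processor_socket(event_data_3: int) -> int:
    """Return the processor socket (1 or 2) encoded in Event Data 3 bits [7:5]."""
    field = (event_data_3 >> 5) & 0b111  # keep only bits [7:5]
    sockets = {0b000: 1, 0b001: 2}
    if field not in sockets:
        # All other values are reserved per the table above.
        raise ValueError(f"reserved socket value {field:03b}b")
    return sockets[field]


if __name__ == "__main__":
    print(ecc_processor_socket(0x00))
    print(ecc_processor_socket(0x20))
```

Bits [4:0] of Event Data 3 are ignored here because their meaning is not covered by the rows shown above.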
Hex   Event Trigger Offset Description          Description
00h   Correctable ECC Error threshold reached   There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event in itself does not pose any direct problems as the ECC errors are still being corrected. Depending on the RAS configuration of the memory, the IMC may take the affected DIMM offline.

7.2.2
Byte   Field          Description
14     Event Data 1   [7:6] – 10b = OEM code in Event Data 2
                      [5:4] – 10b = OEM code in Event Data 3
                      [3:0] – Event Trigger Offset = 02h
15     Event Data 2   [7:5] – Reserved. Set to 0.
7.2.2.1 Memory Address Parity Error Sensor Next Steps
These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn.
8. PCI Express* and Legacy PCI Subsystem
The PCI Express* (PCIe) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The BIOS logs AER events into the SEL. The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.

8.
Byte   Field          Description
14     Event Data 1   [7:6] – 10b = OEM code in Event Data 2
                      [5:4] – 10b = OEM code in Event Data 3
                      [3:0] – Event Trigger Offset as described in Table 64
15     Event Data 2   PCI Bus number
16     Event Data 3   [7:3] – PCI Device number
                      [2:0] – PCI Function number

Table 64: PCI Express* Correctable Error Sensor Event Trigger Offset – Next Steps

Event Trigger Offset Hex
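Event Data 2 and Event Data 3 of this sensor together identify the PCI device that reported the error. As a hedged sketch, the Python function below splits the two bytes into bus, device, and function numbers. The function name `decode_pci_address` and the `bus:device.function` output format are choices made for this example, not something defined by the guide.

```python
def decode_pci_address(event_data_2: int, event_data_3: int) -> str:
    """Format the PCI bus/device/function carried in Event Data 2 and 3."""
    bus = event_data_2 & 0xFF            # Event Data 2: PCI Bus number
    device = (event_data_3 >> 3) & 0x1F  # Event Data 3 [7:3]: PCI Device number
    function = event_data_3 & 0x07       # Event Data 3 [2:0]: PCI Function number
    return f"{bus:02x}:{device:02x}.{function}"


if __name__ == "__main__":
    print(decode_pci_address(0x03, 0x19))
```

The resulting string can be compared against the device listing of the operating system to locate the affected adapter or slot.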
Table 65: PCI Express* Fatal Error Sensor Typical Characteristics

Byte   Field                            Description
8, 9   Generator ID                     0033h = BIOS SMI Handler
11     Sensor Type                      13h = Critical Interrupt
12     Sensor Number                    04h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 70h (OEM Specific)
14     Event Data 1                     [7:6] – 10b = OEM code in Even
Hex   Event Trigger Offset Description   Description
04h   Poisoned TLP Error                 Typically indicates a parity error in a TLP transaction. This means the data received is not correct.
05h   Flow Control Protocol Error        Indicates an error during initialization with the device not providing enough flow control credits. This means the bus configuration is incorrect and it cannot continue.
Table 67: Legacy PCI Error Sensor Typical Characteristics

Byte   Field                            Description
8, 9   Generator ID                     0033h = BIOS SMI Handler
11     Sensor Type                      13h = Critical Interrupt
12     Sensor Number                    03h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 10b = OEM code in Event Dat
9. System BIOS Events
There are a number of events that are owned by the system BIOS. These events can occur during Power On Self Test (POST) or when coming out of a sleep state. Not all of these events signify errors. Some events are described in other chapters in this document (for example, memory events).

9.1 System Events
These events can occur during POST or when coming out of a sleep state.
Table 69: System Event Sensor Typical Characteristics

Byte

9.
Table 70: POST Error Sensor Typical Characteristics

Byte

9.2.
Error Code   Error Message                                       Response
0113         Fixed Media The SAS RAID firmware cannot run        Major
             properly. The user should attempt to reflash
             the firmware.
0140         PCI component encountered a PERR error.             Major
0141         PCI resource conflict                               Major
0146         PCI out of resources error                          Major
0192         Processor 0x cache size mismatch detected.          Fatal
0193         Processor 0x stepping mismatch.
Error Code   Error Message                                       Response
8500         Memory component could not be configured in the     Major
             selected RAS mode.
8501         DIMM Population Error.                              Major
8502         CLTT Configuration Failure Error.                   Major
8520         DIMM_A1 failed Self-Test (BIST).                    Major
8521         DIMM_A2 failed Self-Test (BIST).                    Major
8522         DIMM_B1 failed Self-Test (BIST).                    Major
8523         DIMM_B2 failed Self-Test (BIST).
Error Code   Error Message                                       Response
854A         DIMM_F1 Disabled.                                   Major
854B         DIMM_F2 Disabled.                                   Major
8560         DIMM_A1 Component encountered a Serial Presence     Major
             Detection (SPD) fail error.
8561         DIMM_A2 Component encountered a Serial Presence     Major
             Detection (SPD) fail error.
8562         DIMM_B1 Component encountered a Serial Presence
             Detection (SPD) fail error.
Error Code   Error Message                                       Response
85AB         DIMM_F2 Uncorrectable ECC error encountered.        Major
8604         Chipset Reclaim of non-critical variables           Minor
             complete.
9000         Unspecified processor component has encountered     Major
             a non-specific error.
9223         Keyboard component was not detected.                Minor
9226         Keyboard component encountered a controller
             error.
Error Code   Error Message                                       Response
9641         PEI Core component encountered a load error.        Minor
9667         PEI module component encountered an illegal         Fatal
             software state error.
9687         DXE core component encountered an illegal           Fatal
             software state error.
96A7         DXE boot services driver component encountered
             an illegal software state error.
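The DIMM-related POST error codes listed above appear to follow a regular pattern: the upper bits select the error class and the low nibble selects the DIMM (0h = DIMM_A1, 1h = DIMM_A2, 2h = DIMM_B1, and so on up to Bh = DIMM_F2). The sketch below illustrates that pattern. It is an assumption inferred only from the rows visible in this section (8520 to 8523, 854A, 854B, 8560 to 8562, and 85AB), not a documented encoding, and the function and table names are hypothetical.

```python
# Base codes inferred from the visible table rows (assumption, not a
# documented encoding). Other DIMM error classes may exist.
DIMM_ERROR_BASES = {
    0x8520: "failed Self-Test (BIST)",
    0x8540: "Disabled",
    0x8560: "Component encountered a Serial Presence Detection (SPD) fail error",
    0x85A0: "Uncorrectable ECC error encountered",
}


def decode_dimm_post_code(code: int):
    """Return (dimm_name, error_text) for a DIMM POST code, else None."""
    base = code & 0xFFF0   # error class
    index = code & 0x000F  # DIMM index within the class
    if base not in DIMM_ERROR_BASES or index > 0xB:
        return None        # not one of the DIMM codes covered by this sketch
    channel = chr(ord("A") + index // 2)  # two DIMMs per channel, A..F
    slot = index % 2 + 1                  # slot 1 or 2
    return f"DIMM_{channel}{slot}", DIMM_ERROR_BASES[base]
```

Under these assumptions, `decode_dimm_post_code(0x85AB)` yields DIMM_F2 with the uncorrectable ECC text, matching the table row for 85AB. Codes outside the four inferred classes return `None` so that they can be looked up in the table directly.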
10. Chassis Subsystem

The BMC monitors several aspects of the chassis. In addition to logging presses of the power and reset buttons, the BMC monitors chassis intrusion (if the chassis includes a chassis intrusion switch) and the network connections, logging an event whenever the physical network link is lost.

10.
Byte   Field          Description
15     Event Data 2   Not used
16     Event Data 3   Not used

Table 73: Physical Security Sensor Event Trigger Offset – Next Steps

Event Trigger Offset
Hex   Description         Description                                   Next Steps
00h   Chassis intrusion   Somebody has opened the chassis (or the       1. 2. 3.
                          chassis intrusion sensor is not connected).
10.2 FP (NMI) Interrupt

The front panel interrupt button (also referred to as the NMI button) is a recessed button on the front panel that lets the user force a critical interrupt, which causes a crash error or kernel panic.

Table 74: FP (NMI) Interrupt Sensor Typical Characteristics Byte 10.2.
10.3 Button Press Events

The BMC logs presses of the front panel power and reset buttons. These events are purely informational and do not indicate errors.
11. Miscellaneous Events

The miscellaneous events section addresses sensors not easily grouped with other sensor types.

11.1 IPMI Watchdog

PCSD server systems support an IPMI watchdog timer, which can check whether the OS is still responsive. The timer is disabled by default and has to be enabled manually.
Byte   Field          Description
15     Event Data 2   [7:4] – Interrupt type
                          0h = None
                          1h = SMI
                          2h = NMI
                          3h = Messaging Interrupt
                          Fh = Unspecified
                          All other = Reserved
                      [3:0] – Timer use at expiration
                          0h = Reserved
                          1h = BIOS FRB2
                          2h = BIOS/POST
                          3h = OS Load
                          4h = SMS/OS
                          5h = OEM
                          Fh = Unspecified
                          All other = Reserved
16     Event Data 3   Not used

Table 77: IPMI Watchdog Sensor Event Trigger Offset – Next Steps Event T
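The Event Data 2 byte of an IPMI Watchdog event carries the interrupt type in its upper nibble and the timer use in its lower nibble, as listed in the table above. The following is a minimal sketch of decoding that byte. The lookup values come from the table; the function and dictionary names are hypothetical.

```python
# Values from the Event Data 2 description above; anything not listed is
# reported as "Reserved".
INTERRUPT_TYPES = {
    0x0: "None",
    0x1: "SMI",
    0x2: "NMI",
    0x3: "Messaging Interrupt",
    0xF: "Unspecified",
}
TIMER_USES = {
    0x0: "Reserved",
    0x1: "BIOS FRB2",
    0x2: "BIOS/POST",
    0x3: "OS Load",
    0x4: "SMS/OS",
    0x5: "OEM",
    0xF: "Unspecified",
}


def decode_watchdog_event_data2(event_data2: int):
    """Return (interrupt_type, timer_use) for a watchdog Event Data 2 byte."""
    if not 0 <= event_data2 <= 0xFF:
        raise ValueError("event_data2 must fit in one byte")
    interrupt = INTERRUPT_TYPES.get(event_data2 >> 4, "Reserved")    # bits [7:4]
    timer_use = TIMER_USES.get(event_data2 & 0x0F, "Reserved")       # bits [3:0]
    return interrupt, timer_use
```

For example, an Event Data 2 value of 24h decodes to an NMI interrupt type with the SMS/OS timer use, which points at the OS (rather than BIOS or POST) as the component that stopped resetting the watchdog.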
11.2 SMI Timeout

SMI stands for system management interrupt. It is generated so that the processor can service server management events (typically memory or PCI errors, or other forms of critical interrupts) and log them to the SEL. If this interrupt times out, the system is frozen.

Table 78: SMI Timeout Sensor Typical Characteristics Byte 11.2.
11.3 System Event Log Cleared

The BMC logs a SEL clear event. This is only ever the first event in the SEL. The event is caused either by a manual SEL clear using Intel® SEL Viewer or another IPMI-aware utility, or by the SEL clear performed in the factory as one of the last steps in the manufacturing process. This is an informational event only.
This functionality is built into the BMC to allow it to send alerts (SNMP or other) for any event that gets logged to the SEL. PEF filters are turned off by default and have to be enabled manually using Intel® Deployment Assistant, the Intel® syscfg utility, or an IPMI-aware utility.

Table 80: System Event – PEF Action Sensor Typical Characteristics Byte 11.4.
12. Hot Swap Controller Events

The Hot Swap Controller (HSC) implements the same basic sensor model used by the other management controllers in the system. Sensor model information is contained in the Intelligent Platform Management Interface Specification. A common set of IPMI commands is used for configuring the sensors and returning threshold status.

12.
Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps

Event Trigger Hex:           00h
Event Trigger Description:   Lower non-critical going low
Assertion Severity:          Degraded
Deassert Severity:           OK
Description:                 The temperature has dropped below its lower non-critical threshold.
Next Steps:                  1.
Byte   Field   Description
               02h = Drive Slot 0 Status
               03h = Drive Slot 1 Status
               04h = Drive Slot 2 Status
               05h = Drive Slot 3 Status
               06h = Drive Slot 4 Status
               07h = Drive Slot 5 Status

12.2.
Table 84: HSC Drive Presence Sensor Typical Characteristics Byte 12.3.
If a drive is removed or installed during normal operation, an event is also logged. If a drive removal or installation is logged without operator intervention, ensure that the drive is seated properly and the drive carrier is properly latched.
13. Manageability Engine (ME) Events

The Manageability Engine controls the PECI interface and also contains the Node Manager functionality.

13.1 Node Manager Exception Event

A Node Manager Exception Event is sent each time the maintained policy power limit is exceeded for longer than the Correction Time Limit.

Table 85: Node Manager Exception Sensor Typical Characteristics Byte
13.1.1 Node Manager Exception Event – Next Steps

This is an informational event. Next steps depend on the policy that was set. See the Node Manager Specification for more details.

13.2 Node Manager Health Event

A Node Manager Health Event message provides a runtime error indication about the health of Intel® Intelligent Power Node Manager.
13.3 Node Manager Operational Capabilities Change

This message provides a runtime error indication about Intel® Intelligent Power Node Manager’s operational capabilities. This applies to all domains. Both assertion and deassertion of these events are supported.
Byte   Field          Description
16     Event Data 3   Not used

13.3.1 Node Manager Operational Capabilities Change – Next Steps

Policy Interface available indicates that Intel® Intelligent Power Node Manager is able to respond to the external interface for querying and setting Intel® Intelligent Power Node Manager policies. This is generally available as soon as the microcontroller is initialized.
13.4 Node Manager Alert Threshold Exceeded

A Policy Correction Time Exceeded event is sent each time the maintained policy power limit is exceeded for longer than the Correction Time Limit.
13.4.1 Node Manager Alert Threshold Exceeded – Next Steps

The first occurrence of an unacknowledged event is retransmitted no faster than every 300 milliseconds. The first occurrence of a Threshold Exceeded event assertion or deassertion is likewise retransmitted no faster than every 300 milliseconds. Next steps depend on the policy that was set. See the Node Manager Specification for more details.
13.5.1 ME Firmware Health Event – Next Steps

In the following table, Event Data 3 is noted only for specific errors. If the issue persists, provide the content of Event Data 3 to the Intel support team for interpretation. Event Data 3 codes are generally not documented because their meaning varies, provides only some clues, and usually needs to be interpreted individually.
14. Microsoft Windows* Records

With Microsoft Windows Server 2003* R2 and later versions, an Intelligent Platform Management Interface (IPMI) driver was added. This added the capability of logging some OS events to the SEL. The driver can write multiple records to the SEL for the following events:

- Boot-up
- Shutdown
- Bug Check / Blue Screen

14.
Table 92: Boot-up OEM Event Record Typical Characteristics

Byte         Field         Description
1, 2         Record ID     ID used for SEL Record access.
3            Record Type   [7:0] – DCh = OEM timestamped, bytes 8-16 OEM defined
4, 5, 6, 7   Timestamp     Time when event was logged. LS byte first.
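The first seven bytes of the record described above can be unpacked with a few lines of code. The following is a minimal sketch, assuming the record is available as a raw 16-byte buffer, that the Record ID is stored least significant byte first like the timestamp, and that the timestamp counts seconds since 1 January 1970 (the usual IPMI convention). Treating the timestamp as UTC is also an assumption; the function name is hypothetical.

```python
import struct
from datetime import datetime, timezone


def parse_oem_timestamped_header(record: bytes) -> dict:
    """Unpack bytes 1-7 of a 16-byte SEL record (Record ID, type, timestamp)."""
    if len(record) != 16:
        raise ValueError("a SEL record is expected to be 16 bytes long")
    # "<HBI": little-endian 16-bit Record ID, 8-bit Record Type,
    # 32-bit timestamp (LS byte first), with no padding between fields.
    record_id, record_type, timestamp = struct.unpack_from("<HBI", record, 0)
    return {
        "record_id": record_id,
        "record_type": record_type,
        "is_boot_up_type": record_type == 0xDC,  # DCh per the table above
        "timestamp": timestamp,
        # Assumption: seconds since the Unix epoch, interpreted as UTC.
        "logged_at": datetime.fromtimestamp(timestamp, tz=timezone.utc),
    }
```

The remaining bytes (8-16) are OEM defined for this record type, so the sketch leaves them untouched.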
Table 93: Shutdown Reason Code Event Record Typical Characteristics

Byte   Field                            Description
8, 9   Generator ID                     0041h – System Software with an ID = 20h
11     Sensor Type                      20h = OS Stop/Shutdown
12     Sensor Number                    00h
13     Event Direction and Event Type   [7] Event direction
                                            0b = Assertion Event
                                            1b = Deassertion Event
                                        [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1                     [7:6] – 00b = Unspeci
Byte             Field             Description
11               Record ID         Sequential number reflecting the order in which the records are read.
                                   The numbers start at 1 for the first entry in the SEL and continue
                                   sequentially to n, the number of entries in the SEL.
12, 13, 14, 15   Shutdown Reason   Shutdown Reason code from the registry (LSB first.
14.3 Bug Check / Blue Screen Event Records

When the system experiences a bug check (blue screen), multiple records are written to the event log. The first is a Bug Check / Blue Screen OS Stop/Shutdown Event Record; it can be followed by multiple Bug Check / Blue Screen code OEM records that contain the Bug Check / Blue Screen codes.
Byte         Field                  Description
4, 5, 6, 7   Timestamp              Time when event was logged. LS byte first.
8, 9, 10     IPMI Manufacturer ID   0137h (311) = IANA enterprise number for Microsoft
                                    0157h (343) = IANA enterprise number for Intel
                                    The value logged depends on the Intelligent Management
                                    Bus Driver (IMBDRV) that is loaded.
11           Sequence Number        Sequential number reflecting the order in which the
                                    records are read.
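The IPMI Manufacturer ID in bytes 8 to 10 identifies which driver wrote the record. The sketch below reads that field and the sequence number from a raw 16-byte record. It assumes the Manufacturer ID is stored least significant byte first, as the IPMI specification defines for OEM timestamped records; the function and dictionary names are hypothetical.

```python
# IANA enterprise numbers listed in the table above.
KNOWN_MANUFACTURERS = {0x000137: "Microsoft", 0x000157: "Intel"}


def decode_manufacturer_and_sequence(record: bytes):
    """Return (iana_number, manufacturer_name, sequence_number)."""
    if len(record) != 16:
        raise ValueError("a SEL record is expected to be 16 bytes long")
    # Bytes 8-10 in the table are 1-based, so they map to record[7:10].
    iana = int.from_bytes(record[7:10], "little")
    sequence = record[10]  # byte 11 in the table
    return iana, KNOWN_MANUFACTURERS.get(iana, "unknown"), sequence
```

A record whose bytes 8 to 10 read 37h 01h 00h therefore decodes to 311 (Microsoft), and 57h 01h 00h decodes to 343 (Intel).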
15. Linux* Kernel Panic Records

The OpenIPMI driver can put semi-custom and custom events in the system event log if a panic occurs. If you enable the “Generate a panic event to all BMCs on a panic” option, you get one event per panic in the standard IPMI event format.
Table 99: Linux* Kernel Panic String Extended Record Characteristics

Byte     Field               Description
1, 2     Record ID           ID used for SEL Record access.
3        Record Type         [7:0] – F0h = OEM non-timestamped, bytes 4-16 OEM defined
4        Slave Address       The slave address of the card saving the panic.
5        Sequence Number     A sequence number (starting at zero).
6 … 16   Kernel Panic Data   These hold the panic string.
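Because each extended record holds only 11 bytes of the panic string, the full string has to be reassembled from several records in sequence-number order. The following is a minimal sketch of that reassembly, assuming the records are available as raw 16-byte buffers, that they all come from the same slave address, and that unused bytes in the final record are zero-filled. The function name is hypothetical.

```python
def assemble_panic_string(records) -> str:
    """Rebuild the kernel panic string from F0h extended SEL records."""
    chunks = {}
    for rec in records:
        # Skip anything that is not a 16-byte OEM non-timestamped F0h record.
        if len(rec) != 16 or rec[2] != 0xF0:
            continue
        sequence = rec[4]             # byte 5: sequence number, starts at zero
        chunks[sequence] = rec[5:16]  # bytes 6-16: 11 bytes of panic data
    data = b"".join(chunks[seq] for seq in sorted(chunks))
    # Assumption: the string ends at the first zero byte (padding).
    return data.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```

Sorting by sequence number rather than by Record ID keeps the result correct even if the records are read out of order. If records from several slave addresses are mixed in one log, they would need to be grouped by byte 4 before calling this function.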