SPARC® Enterprise T1000 Server Service Manual Manual Code : C120-E384-01EN Part No.
Copyright 2007 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. FUJITSU LIMITED provided technical input and review on portions of this material. Sun Microsystems, Inc. and Fujitsu Limited each own or control intellectual property rights relating to products and technology described in this document, and such products, technology and this document are protected by copyright laws, patents and other intellectual property laws and international treaties.
Contents

Preface

1. Safety Information
    1.1 Safety Information
    1.2 Safety Symbols
    1.3 Electrostatic Discharge Safety
        1.3.1 Using an Antistatic Wrist Strap
        1.3.2 Using an Antistatic Mat
2. Server Overview
    2.1 Server Overview
    2.2 Obtaining the Chassis Serial Number
3. Server Diagnostics
    3.1 Overview of Server Diagnostics
        3.1.1 Memory Configuration and Fault Handling
            3.1.1.1 Memory Configuration
    ...
        3.2.2 Power Supply LEDs
    ...
            3.3.1.1 Connecting to ALOM
            3.3.1.2 Switching Between the System Console and ALOM
            3.3.1.3 Service-Related ALOM CMT Commands
        ...
        3.3.3 Running the showenvironment Command
        3.3.4 Running the showfru Command
    3.4 Running POST
        3.4.1 Controlling How POST Runs
        3.4.2 Changing POST Parameters
        3.4.3 Reasons to Run POST
            3.4.3.1 Verifying Hardware Functionality
            3.4.3.2 Diagnosing the System Hardware
        3.4.4 Running POST in Maximum Mode
    ...
        3.7.1 Displaying System Components
        3.7.2 Disabling Components
        3.7.3 Enabling Disabled Components
    ...
        3.8.1 Checking Whether SunVTS Software Is Installed
        3.8.2 Exercising the System Using SunVTS Software
        3.8.3 Using SunVTS Software
4. Preparing for Servicing
    4.1 Common Procedures for Parts Replacement
        4.1.1 Required Tools
        4.1.2 Shutting the System Down
        4.1.3 Removing the Server From a Rack
    ...
5. Replacing Field-Replaceable Units
    ...
        5.4.1 Removing the Single-Drive Assembly
        5.4.2 Installing the Dual-Drive Assembly
    5.5 Replacing a Hard Drive
        ...
            5.5.1.1 Removing the Hard Drive in a Single-Drive Assembly
            5.5.1.2 Installing the Hard Drive in a Single-Drive Assembly
        5.5.2 Replacing a Hard Drive in a Dual-Drive Assembly
            5.5.2.1 Removing a Hard Drive in a Dual-Drive Assembly
            5.5.2.2 Installing the Hard Drive in a Dual-Drive Assembly
    5.6 Replacing DIMMs
    ...
Index
x SPARC Enterprise T1000 Server Service Manual • April 2007
Figures

FIGURE 2-1 Server
FIGURE 2-2 Server Components
FIGURE 2-3 Server Front Panel
FIGURE 2-4 Server Rear Panel
FIGURE 3-1 Diagnostic Flow Chart
FIGURE 3-2 LEDs on the Server Front Panel
FIGURE 3-3 LEDs on the Server Rear Panel
FIGURE 3-4 ALOM CMT Fault Management
FIGURE 3-5 Flow Chart of ALOM CMT Variables for POST Configuration
FIGURE 3-6 SunVTS GUI
FIGURE 3-7 SunVTS Test Selection Panel
FIGURE 4-1 Unlocking a Mounting Bracket
FIGURE 4-2 Loca
...
FIGURE 5-7 Location of Drive Power and Data Connectors on the Motherboard
FIGURE 5-8 Installing the Drive Assembly
FIGURE 5-9 Removing the Single-Drive Assembly
FIGURE 5-10 Installing the Single-Drive Assembly
FIGURE 5-11 Location of Drive Power and Data Connectors on the Motherboard
FIGURE 5-12 Removing the Dual-Drive Assembly
FIGURE 5-13 Installing the Dual-Drive Assembly
FIGURE 5-14 DIMM Locations
FIGURE 5-15 Removing the Clock Battery From the Motherboard
FIGURE 5-16 Installing the Cl
...

Tables

TABLE 3-1 Diagnostic Flow Chart Actions
TABLE 3-2 Front and Rear Panel LEDs
TABLE 3-3 Power Supply LEDs
TABLE 3-4 Service-Related ALOM CMT Commands
TABLE 3-5 ALOM CMT Parameters Used for POST Configuration
TABLE 3-6 ALOM CMT Parameters and POST Modes
TABLE 3-7 ASR Commands
TABLE 3-8 Useful SunVTS Tests to Run on This Server
TABLE 5-1 DIMM Names and Socket Numbers
TABLE A-1 Server FRU List
Preface The SPARC Enterprise T1000 Server Service Manual provides information to aid in troubleshooting problems with and replacing components within SPARC Enterprise T1000 servers. This manual is written for technicians, service personnel, and system administrators who service and repair computer systems.
Structure and Contents of This Manual

This manual is organized as described below:

■ Chapter 1 Safety Information
  Provides important safety information for servicing the server.
■ Chapter 2 Server Overview
  Describes the main features of the server.
■ Chapter 3 Server Diagnostics
  Describes the diagnostics that are available for monitoring and troubleshooting the server.
■ Chapter 4 Preparing for Servicing
  Describes how to prepare for servicing the server.
Title: SPARC Enterprise T1000 Server Product Notes
Description: Information about the latest product updates and issues
Manual Code: C120-E381

Title: SPARC Enterprise T1000 Server Site Planning Guide
Description: Server specifications for site planning
Manual Code: C120-H018

Title: SPARC Enterprise T1000 Server Getting Started Guide
Description: Information about where to find documentation to get your system installed and running quickly
Manual Code: C120-E379

Title: SPARC Enterprise T1000 Server Overview Guide
Description: Provides an overview of the features of this server
Manual Code: C120
Using UNIX Commands

This document might not contain information about basic UNIX® commands and procedures such as shutting down the system, booting the system, and configuring devices. Refer to the following for this information:
■ Software documentation that you received with your system
■ Solaris™ Operating System documentation, which is at: http://docs.sun.com

Text Conventions

This manual uses the following fonts and symbols to express specific types of information.
Prompt Notations

The following prompt notations are used in this manual.

Shell                                      Prompt Notation
C shell                                    machine-name%
C shell superuser                          machine-name#
Bourne shell and Korn shell                $
Bourne shell and Korn shell superuser      #

Conventions for Alert Messages

This manual uses the following conventions to show alert messages, which are intended to prevent injury to the user or bystanders as well as property damage, and important messages that are useful to the user.
Caution – The following tasks regarding this product and the optional products provided from Fujitsu should only be performed by a certified service engineer. Users must not perform these tasks. Incorrect operation of these tasks may cause malfunction. ■ Unpacking optional adapters and such packages delivered to the users Also, important alert messages are shown in “Important Alert Messages” on page xx.
Product Handling Maintenance Warning – Certain tasks in this manual should only be performed by a certified service engineer. User must not perform these tasks. Incorrect operation of these tasks may cause electric shock, injury, or fire.
Alert Labels

The following labels are attached to this product:
■ Never peel off the labels.
■ The following labels provide information to the users of this product.

Sample of SPARC Enterprise T1000

Fujitsu Welcomes Your Comments

We would appreciate your comments and suggestions to improve this document.
CHAPTER 1 Safety Information

This chapter provides important safety information for servicing the server. The following topics are covered:
■ Section 1.1, “Safety Information” on page 1-1
■ Section 1.2, “Safety Symbols” on page 1-1
■ Section 1.3, “Electrostatic Discharge Safety” on page 1-2

1.1 Safety Information

This section describes safety information you need to know prior to removing or installing parts in the server.
Caution – There is a risk of personal injury and equipment damage. To avoid personal injury and equipment damage, follow the instructions. Caution – Hot surface. Avoid contact. Surfaces are hot and might cause personal injury if touched. Caution – Hazardous voltages are present. To reduce the risk of electric shock and danger to personal health, follow the instructions. 1.
CHAPTER 2 Server Overview

This chapter provides an overview of the server. Topics include:
■ Section 2.1, “Server Overview” on page 2-1
■ Section 2.2, “Obtaining the Chassis Serial Number” on page 2-3

2.1 Server Overview

The server is a high-performance, entry-level server that is highly scalable and very reliable (FIGURE 2-1).
FIGURE 2-2 shows the major components in the server, and FIGURE 2-3 and FIGURE 2-4 show the front and rear panels of the server.
FIGURE 2-4 Server Rear Panel (callouts: Power supply LEDs, Ethernet ports, Locator LED/button, Service Required LED, PCI-E slot, SC network management port, Power OK LED, SC serial management port, DB9 serial port)

2.2 Obtaining the Chassis Serial Number

To obtain support for your system, you need your chassis serial number. On the server, the chassis serial number is located on a sticker that is on the front of the server and another sticker at the rear of the server, below the AC power connector.
CHAPTER 3 Server Diagnostics This chapter describes the diagnostics that are available for monitoring and troubleshooting the server. This chapter does not provide detailed troubleshooting procedures, but instead describes the server diagnostics facilities and how to use them. This chapter is intended for technicians, service personnel, and system administrators who service and repair computer systems. The following topics are covered: 3.1 ■ Section 3.
■ ALOM CMT firmware – Is the system firmware that runs on the system controller. In addition to providing the interface between the hardware and OS, ALOM CMT also tracks and reports the health of key server components. ALOM CMT works closely with POST and Solaris Predictive Self-Healing technology to keep the system up and running even when there is a faulty component. ■ Power-on self-test (POST) – Performs diagnostics on system components upon system reset to ensure the integrity of those components.
FIGURE 3-1 Diagnostic Flow Chart
TABLE 3-1 Action No. Diagnostic Flow Chart Actions For more information, see these sections Diagnostic Action Resulting Action 1. Check Power OK and AC OK LEDs on the server. The Power OK LED is located on the front and rear of the chassis. The AC OK LED is located on the rear of the server on each power supply. If these LEDs are not on, check the power source and power connections to the server. Section 3.2, “Using LEDs to Identify the State of Devices” on page 3-8 2.
TABLE 3-1 Diagnostic Flow Chart Actions (Continued) Action No. Diagnostic Action Resulting Action 5. Run POST. POST performs basic tests of the server components and reports faulty FRUs. Note - diag_level=min is the default ALOM CMT setting, which tests devices required to boot the server. Use diag_level=max for troubleshooting and hardware replacement. • If POST indicates a faulty FRU while diag_level=min, replace the FRU.
TABLE 3-1 Action No. 7. Diagnostic Flow Chart Actions (Continued) Diagnostic Action Resulting Action Determine if the fault was detected by PSH. If the fault message displays the following text, the fault was detected by the Solaris Predictive SelfHealing software: Host detected fault If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU. After the FRU is replaced, perform the procedure to clear PSH detected faults.
3.1.1.1 Memory Configuration

The server has eight slots that hold DDR-2 memory DIMMs in the following sizes:
■ 512 MB (maximum of 4 GB)
■ 1 GB (maximum of 8 GB)
■ 2 GB (maximum of 16 GB)
■ 4 GB (maximum of 32 GB)

All DIMMs installed must be the same size, and DIMMs must be added four at a time. In addition, Rank 0 memory must be fully populated for the server to function. See Section 5.6.2, “Installing DIMMs” on page 5-21, for instructions about adding memory to the server.

3.1.1.
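The population rules in Section 3.1.1.1 (same size, four at a time, Rank 0 fully populated) can be checked mechanically before powering on. The following Python sketch is illustrative only and is not part of any Sun or Fujitsu tool. The function name check_dimm_config is hypothetical, and the slot layout (channels 0 and 3, ranks 0 and 1, DIMMs 0 and 1) is an assumption inferred from the FRU names that appear in this chapter's sample output; TABLE 5-1 is the authoritative list.

```python
# Illustrative check of the DIMM population rules in Section 3.1.1.1.
# Assumption: the eight slots are CH0/CH3 x R0/R1 x D0/D1, inferred from
# the FRU names in this chapter's sample output (TABLE 5-1 is authoritative).
ALLOWED_SIZES_MB = {512, 1024, 2048, 4096}
SLOTS = [f"MB/CMP0/CH{ch}/R{r}/D{d}" for r in (0, 1) for ch in (0, 3) for d in (0, 1)]
RANK0_SLOTS = [s for s in SLOTS if "/R0/" in s]


def check_dimm_config(installed):
    """Return a list of rule violations for a {slot_name: size_mb} mapping."""
    problems = []
    unknown = sorted(set(installed) - set(SLOTS))
    if unknown:
        problems.append(f"unknown slots: {unknown}")
    sizes = set(installed.values())
    if not sizes <= ALLOWED_SIZES_MB:
        problems.append(f"unsupported DIMM sizes (MB): {sorted(sizes - ALLOWED_SIZES_MB)}")
    if len(sizes) > 1:
        problems.append("all DIMMs must be the same size")
    if len(installed) % 4 != 0 or not installed:
        problems.append("DIMMs must be installed four at a time")
    missing_rank0 = [s for s in RANK0_SLOTS if s not in installed]
    if missing_rank0:
        problems.append(f"Rank 0 not fully populated: {missing_rank0}")
    return problems


if __name__ == "__main__":
    rank0_only = {slot: 2048 for slot in RANK0_SLOTS}
    print(check_dimm_config(rank0_only))         # empty list: no violations
    print(sum(rank0_only.values()), "MB total")  # four 2048 MB DIMMs
```

An empty list means the planned configuration satisfies the three rules as stated here; it says nothing about DIMM compatibility beyond size.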
■ Solaris Predictive Self-Healing (PSH) technology – A feature of the Solaris OS that uses the fault manager daemon (fmd) to watch for various kinds of faults. When a fault occurs, the fault is assigned a unique fault ID (UUID) and logged. PSH reports the fault and provides a recommended proactive replacement for the DIMMs associated with the fault.

3.1.1.3 Troubleshooting Memory Faults

If you suspect that the server has a memory problem, follow the flow chart (see TABLE 3-1).
FIGURE 3-3 LEDs on the Server Rear Panel (callouts: Activity LED, Fault LED, DC OK LED, AC OK LED, Link LED, Power OK LED, Service Required LED, Locator LED/button)
3.2.1 Front and Rear Panel LEDs Two LEDs and one LED/button are located in the upper left corner of the front panel (TABLE 3-2). The LEDs are also provided on the rear panel. TABLE 3-2 Front and Rear Panel LEDs LED Location Color Description Locator LED/button Front and rear panels White Enables you to identify a particular server. Activate the LED using one of the following methods: • Issuing the setlocator on or off command. • Pressing the button to toggle the indicator on or off.
3.2.2 Power Supply LEDs

The power supply LEDs (TABLE 3-3) are located on the back of the power supply.

TABLE 3-3 Power Supply LEDs

Name     Color    Description
Fault    Amber    • On – Power supply has detected a failure.
                  • Off – Normal operation.
DC OK    Green    • On – Normal operation. DC output voltage is within normal limits.
                  • Off – Power is off.
AC OK    Green    • On – Normal operation. Input power is within normal limits.
                  • Off – No input voltage, or input voltage is below limits.

3.3
Faults detected by ALOM CMT, POST, and the Solaris Predictive Self-Healing (PSH) technology are forwarded to ALOM CMT for fault handling (FIGURE 3-4). In the event of a system fault, ALOM CMT ensures that the Service Required LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A.
Many environmental faults can recover automatically. A temperature that exceeds a threshold might return to normal limits. An unplugged power supply can be plugged in, and so on. Recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:
■ fru at location is OK.
■ sensor at location is within normal range.

Environmental faults can be repaired through the removal of the faulty FRU.
3.3.1.2 Switching Between the System Console and ALOM

■ To switch from the console output to the ALOM CMT sc> prompt, type #. (hash-period). Note that this escape sequence is user-configurable. Refer to the Advanced Lights Out Management (ALOM) CMT Guide for more information.
■ To switch from the sc> prompt to the console, type console.

3.3.1.3 Service-Related ALOM CMT Commands

TABLE 3-4 describes the typical ALOM CMT commands for servicing the server.
TABLE 3-4 Service-Related ALOM CMT Commands (Continued)

ALOM CMT Command: powercycle [-f]
Description: Performs a poweroff followed by poweron. The -f option forces an immediate poweroff, otherwise the command attempts a graceful shutdown.

ALOM CMT Command: poweroff [-y] [-f]
Description: Powers off the host server. The -y option enables you to skip the confirmation question. The -f option forces an immediate shutdown.

ALOM CMT Command: poweron [-c]
Description: Powers on the host server.
Note – See TABLE 3-7 for the ALOM CMT ASR commands.

3.3.2 Running the showfaults Command

The ALOM CMT showfaults command displays the following kinds of faults:
■ Environmental faults – temperature or voltage problems that might be caused by faulty FRUs (a power supply or fan tray), or by room temperature or blocked air flow to the server.
■ POST detected faults – faults on devices detected by the power-on self-test diagnostics.
■ Example showing a fault that was detected by POST. These kinds of faults are identified by the message deemed faulty and disabled and by a FRU name.

sc> showfaults -v
ID Time            FRU               Fault
 1 OCT 13 12:47:27 MB/CMP0/CH0/R1/D0 MB/CMP0/CH0/R1/D0 deemed faulty and disabled

■ Example showing a fault that was detected by the PSH technology. These kinds of faults are identified by the text Host detected fault and by a UUID.
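The two marker strings described in Section 3.3.2 make it possible to sort captured showfaults -v output by detection source. The following Python sketch is a minimal illustration under stated assumptions: the function name classify_fault is hypothetical, the UUID pattern is the generic 8-4-4-4-12 hexadecimal form, and anything matching neither marker (for example, an environmental fault) is simply reported as "other". It operates on text you have already captured and does not talk to ALOM CMT.

```python
import re

# Marker strings quoted in Section 3.3.2.
POST_MARKER = "deemed faulty and disabled"
PSH_MARKER = "Host detected fault"
# Generic 8-4-4-4-12 hexadecimal UUID pattern (an assumption about layout).
UUID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b", re.IGNORECASE)


def classify_fault(text):
    """Classify one fault entry from captured `showfaults -v` output.

    Returns (source, uuid): source is "POST", "PSH", or "other";
    uuid is None unless a UUID-shaped string is present.
    """
    # Collapse whitespace so a message wrapped across lines still matches.
    flat = " ".join(text.split())
    match = UUID_RE.search(flat)
    uuid = match.group(0) if match else None
    if PSH_MARKER in flat:
        return "PSH", uuid
    if POST_MARKER in flat:
        return "POST", uuid
    return "other", uuid


if __name__ == "__main__":
    entry = ("1 OCT 13 12:47:27 MB/CMP0/CH0/R1/D0 MB/CMP0/CH0/R1/D0 deemed\n"
             "faulty and disabled")
    print(classify_fault(entry))
```

Whitespace is collapsed before matching because the console can wrap a long fault message across two lines.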
SYS/LOCATE    SYS/SERVICE    SYS/ACT
OFF           OFF            ON
--------------------------------------------------------

----------------------------------------------------------
Fans (Speeds Revolution Per Minute):
----------------------------------------------------------
Sensor     Status    Speed    Warn    Low
----------------------------------------------------------
FT0/F0     OK        6762     2240    1920
FT0/F1     OK        6762     2240    1920
FT0/F2     OK        6762     2240    1920
FT0/F3     OK        6653     2240    1920
----------------------------------------------------------
PS0 OK OFF OFF OFF OFF OFF
sc>

Note – Some environmental information might not be available when the server is in Standby mode.

3.3.4 Running the showfru Command

The showfru command displays information about the FRUs in the server. Use this command to see information about an individual FRU, or for all the FRUs.

Note – By default, the output of the showfru command for all FRUs is very long.

● At the sc> prompt, enter the showfru command.
/ManR/Shortname: PS
/SpecPartNo: 885-0407-02
FRU_PROM at MB/CMP0/CH0/R0/D0/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No: 72T256220HR3.
FRU_PROM at MB/CMP0/CH3/R0/D1/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No: 72T256220HR3.7A
/SPD/Vendor Serial No: d040920
FRU_PROM at MB/CMP0/CH3/R1/D0/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No: 72T256220HR3.
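Because the full showfru output is very long, it can help to group a captured copy by FRU before reading it. The following Python sketch is one possible approach, not a supported tool. It assumes only what the excerpt above shows: each record starts with a line beginning "FRU_PROM at" and is followed by "/path: value" lines. Other record headers that showfru might print are not handled, and the name parse_showfru is hypothetical.

```python
def parse_showfru(output):
    """Group captured showfru text into {fru_location: {field: value}}.

    Only "FRU_PROM at <location>" headers, as seen in the sample output,
    are recognized; field lines before the first header are ignored.
    """
    frus = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("FRU_PROM at "):
            current = frus.setdefault(line[len("FRU_PROM at "):], {})
        elif line.startswith("/") and current is not None:
            # Split on the first colon only; timestamps contain more colons.
            field, _, value = line.partition(":")
            current[field.strip()] = value.strip()
    return frus


if __name__ == "__main__":
    sample = """FRU_PROM at MB/CMP0/CH3/R0/D1/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No: 72T256220HR3.7A
/SPD/Vendor Serial No: d040920
"""
    for location, fields in parse_showfru(sample).items():
        print(location, fields["/SPD/Description"])
```

Splitting on the first colon keeps values such as timestamps intact, and an empty field (for example, /SPD/Manufacture Location) is stored as an empty string.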
3.4 Running POST Power-on self-test (POST) is a group of PROM-based tests that run when the server is powered on or reset. POST checks the basic integrity of the critical hardware components in the server (CPU, memory, and I/O buses). If POST detects a faulty component, the component is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete.
TABLE 3-5 lists the ALOM CMT variables used to configure POST and FIGURE 3-5 shows how the variables work together. Note – Use the ALOM CMT setsc command to set all the parameters in TABLE 3-5 except setkeyswitch. TABLE 3-5 ALOM CMT Parameters Used for POST Configuration Parameter Values Description setkeyswitch normal The system can power on and run POST (based on the other parameter settings). For details see TABLE 3-6. This parameter overrides all other commands.
TABLE 3-5 Parameter 3-24 ALOM CMT Parameters Used for POST Configuration (Continued) Values Description min POST output displays functional tests with a banner and pinwheel. normal POST output displays all test and informational messages. max POST displays all test, informational, and some debugging messages.
FIGURE 3-5 Flow Chart of ALOM CMT Variables for POST Configuration
TABLE 3-6 shows combinations of ALOM CMT variables and associated POST modes.
To change the POST parameters, first set the setkeyswitch parameter to normal, then change the POST parameters using the setsc command:

sc> setkeyswitch normal
sc> setsc value

Example:

sc> setkeyswitch normal
sc> setsc diag_mode service

3.4.3 Reasons to Run POST

You can use POST for basic hardware verification and diagnosis, and for troubleshooting as described in the following sections.

3.4.3.
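The ordering rule from Section 3.4.2 (set setkeyswitch to normal first, then issue setsc for each POST parameter) can be captured in a small helper that produces the command lines to type at the sc> prompt. This Python sketch is an illustration only. The function name post_parameter_commands is hypothetical, and the table of accepted values lists just the parameter values that appear in this chapter; it is not a complete list of what ALOM CMT accepts. The helper only builds strings and does not connect to the system controller.

```python
# Parameter values that appear in this chapter; ALOM CMT may accept others.
KNOWN_VALUES = {
    "diag_mode": {"normal", "service"},
    "diag_level": {"min", "max"},
}


def post_parameter_commands(**params):
    """Return the sc> command lines that set the given POST parameters."""
    # setkeyswitch normal must come first (Section 3.4.2).
    commands = ["setkeyswitch normal"]
    for name, value in params.items():
        if name not in KNOWN_VALUES:
            raise ValueError(f"parameter not covered by this sketch: {name}")
        if value not in KNOWN_VALUES[name]:
            raise ValueError(f"value not covered by this sketch: {name}={value}")
        commands.append(f"setsc {name} {value}")
    return commands


if __name__ == "__main__":
    for command in post_parameter_commands(diag_mode="service"):
        print("sc>", command)
```

Rejecting unknown names and values keeps a typing mistake from turning into a command line that looks plausible but was never described in this manual.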
3.4.3.2 Diagnosing the System Hardware

You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in maximum mode (diag_mode=service, setkeyswitch=diag, diag_level=max) for thorough test coverage and verbose output.

3.4.4 Running POST in Maximum Mode

This procedure describes how to run POST when you want maximum testing, for example when you are troubleshooting a server or verifying a hardware upgrade or repair.

1.
4. Switch to the system console to view the POST output:

sc> console

Example of POST output:

SC: Alert: Host system has reset
Note: Some output omitted.
0:0>
0:0>@(#) ERIE Integrated POST 4.x.0.build_17 2005/08/30 11:25
/export/common-source/firmware_re/ontariofireball_fio/build_17/post/Niagara/erie/integrated (firmware_re)
0:0>Copyright © 2005 Sun Microsystems, Inc. All rights reserved
SUN PROPRIETARY/CONFIDENTIAL.
Use is subject to license terms.
0:0>VBSC selecting POST IO Testing.
0:0> Niagara, Version 2.0
0:0> Serial Number 00000098.00000820 = fffff238.6b4c60e9
0:0>Init JBUS Config Regs
0:0>IO-Bridge unit 1 init test
0:0>sys 200 MHz, CPU 1000 MHz, mem 200 MHz.
0:0>Integrated POST Testing
0:0>L2 Tests.....
0:0>Setup L2 Cache
0:0>L2 Cache Control = 00000000.00300000
0:0>Scrub and Setup L2 Cache
0:0>L2 Directory clear
0:0>L2 Scrub VD & UA
0:0>L2 Scrub Tags
0:0>Test Memory Basic.....
0:0>Address Bitwalk
0:0> Testing Memory Channel 0 Rank 0 Stack 0
0:0> Testing Memory Channel 3 Rank 0 Stack 0
0:0> Testing Memory Channel 0 Rank 0 Stack 1
0:0> Testing Memory Channel 3 Rank 0 Stack 1
0:0>Test Slave Threads Basic.....
0:0>Set Mailbox
0:0>Setup Final DMMU Entries
0:0>Post Image Region Scrub
0:0>Run POST from Memory
0:0>Verifying checksum on copied image.
0:0>The Memory’s CHECKSUM value is cc1e.
0:0>The Memory’s Content Size value is 7b192.
0:0>Success... Checksum on Memory Validated.
0:0>Enable Icache
0:0>Enable Dcache
0:0>Scrub Memory.....
0:0>Scrub Memory
0:0>Scrub 00000000.00600000->00000001.00000000 on Memory Channel [0 3 ] Rank 0 Stack 0
0:0>Scrub 00000001.00000000->00000002.00000000 on Memory Channel [0 3 ] Rank 0 Stack 1
0:0>IMMU Functional
0:0>DMMU Functional
0:0>Extended Memory Tests.....
0:0>Print Mem Config
0:0>Caches : Icache is ON, Dcache is ON.
0:0> Bank 0 4096MB : 00000000.00000000 -> 00000001.00000000.
0:0> Bank 1 4096MB : 00000001.00000000 -> 00000002.00000000.
0:0>IO-Bridge unit 1 jbus perf test
0:0>IO-Bridge unit 1 int init test
0:0>IO-Bridge unit 1 msi init test
0:0>IO-Bridge unit 1 ilu init test
0:0>IO-Bridge unit 1 tlu init test
0:0>IO-Bridge unit 1 lpu init test
0:0>IO-Bridge unit 1 link train port B
0:0>IO-Bridge unit 1 interrupt test
0:0>IO-Bridge unit 1 Config MB bridges
0:0>Config port B, bus 2 dev 0 func 0, tag 5714 BRIDGE
0:0>Config port B, bus 3 dev 8 func 0, tag PCIX BRIDGE
0:0>IO-Bridge unit 1 PCI id test
0:0> INFO:10 count read passed for MB/IOB_PC
a. Interpret the POST messages:

POST error messages use the following syntax:

c:s > ERROR: TEST = failing-test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test above
c:s > MSG = test-error-message
c:s > END_ERROR

In this syntax, c = the core number and s = the strand number.

Warning and informational messages use the following syntax:

INFO or WARNING: message

The following example shows a POST error message.

. . .
b. Run the showfaults command to obtain additional fault information.

The fault is captured by ALOM, where the fault is logged, the Service Required LED is lit, and the faulty component is disabled.

Example:

ok #.
sc> showfaults -v
ID Time            FRU               Fault
 1 APR 24 12:47:27 MB/CMP0/CH0/R1/D0 MB/CMP0/CH0/R1/D0 deemed faulty and disabled

In this example, MB/CMP0/CH0/R1/D0 is disabled. The system can boot using memory that was not disabled until the faulty component is replaced.
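The POST error message syntax given in step a is regular enough to extract from a saved console log. The following Python sketch shows one way to do so. It assumes the five-line block structure shown in step a (ERROR: TEST through END_ERROR), tolerates the prefix with or without spaces around ">" (the sample output prints 0:0> without spaces), and ignores every other line. The function name parse_post_errors is hypothetical, and the sample input reuses the placeholder words from the syntax description rather than real POST output.

```python
import re

# "c:s >" prefix: core number, strand number, then the message body.
LINE_RE = re.compile(r"^\s*(?P<core>\d+):(?P<strand>\d+)\s*>\s*(?P<body>.*)$")


def parse_post_errors(log_text):
    """Extract ERROR: TEST ... END_ERROR blocks from captured POST output."""
    errors, current = [], None
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        body = m.group("body").strip()
        if body.startswith("ERROR: TEST ="):
            current = {
                "core": int(m.group("core")),
                "strand": int(m.group("strand")),
                "test": body.split("=", 1)[1].strip(),
            }
        elif current is None:
            continue  # not inside an error block
        elif body.startswith("H/W under test ="):
            current["hw_under_test"] = body.split("=", 1)[1].strip()
        elif body.startswith("MSG ="):
            current["msg"] = body.split("=", 1)[1].strip()
        elif body.startswith("END_ERROR"):
            errors.append(current)
            current = None
    return errors


if __name__ == "__main__":
    # Placeholder values taken from the syntax description, not real output.
    sample = """0:0>ERROR: TEST = failing-test
0:0>H/W under test = FRU
0:0>Repair Instructions: Replace items in order listed by H/W under test above
0:0>MSG = test-error-message
0:0>END_ERROR
"""
    for error in parse_post_errors(sample):
        print(error)
```

The H/W under test field is the one to compare against the FRU name that showfaults reports in step b.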
3.4.5.1 Correctable Errors for Single DIMMs

If POST faults a single DIMM (CODE EXAMPLE 3-1) that was not part of a hardware upgrade or repair, it is likely that POST encountered a correctable error that can be handled by PSH.

CODE EXAMPLE 3-1 POST Fault for a Single DIMM

sc> showfaults -v
ID Time            FRU               Fault
 1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

In this case, reenable the DIMM and run POST in minimum mode as follows:

1. Reenable the DIMM.
3.4.5.2 Determining When to Replace Detected Devices Note – This section assumes faults are detected by POST in maximum mode. If a detected device is part of a hardware upgrade or repair, or if POST detects multiple DIMMs (CODE EXAMPLE 3-2), replace the detected devices.
3. If a device detected by POST is a single DIMM and the same DIMM is not detected by PSH, follow the procedure in Section 3.4.5.1, “Correctable Errors for Single DIMMs” on page 3-36.

After the detected devices are repaired or replaced, return POST to the default minimum level.

sc> setkeyswitch normal
sc> setsc diag_mode normal
sc> setsc diag_level min

3.4.
2. Use the enablecomponent command to clear the fault and remove the component from the ASR blacklist. Use the FRU name that was reported in the fault in the previous step. Example: sc> enablecomponent MB/CMP0/CH0/R1/D0 The fault is cleared and should not appear when you run the showfaults command. Additionally, if there are no other faults remaining, the Service Required LED should be extinguished. 3. Power cycle the server. You must reboot the server for the enablecomponent command to take effect. 4.
provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.
The following is an example of the ALOM CMT alert for the same PSH diagnosed fault:

SC Alert: Host detected fault, MSGID: SUN4V-8000-DX

Note – The Service Required LED also turns on for PSH diagnosed faults.

3.5.1.1 Using the fmdump Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID).
Note – fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired. 2. Use the message ID to obtain more information about this type of fault. a. In a browser, go to the Predictive Self-Healing Knowledge Article web site: http://www.sun.com/msg b. Obtain the message ID from the console output or the ALOM CMT showfaults command. c. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.
rsrc: mem:///component=MB/CMP0/CH0:R0/D0/J0601 In this example, the DIMM location is: MB/CMP0/CH0:R0/D0/J0601 Refer to the Service Manual or the Service Label attached to the server chassis to find the physical location of the DIMM. Once the DIMM has been replaced, use the Service Manual for instructions on clearing the fault condition and validating the repair action. NOTE - The server Product Notes may contain updated service procedures.
3. Run the clearfault command with the UUID provided in the showfaults output:

sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
Clearing fault from all indicted FRUs...
Fault cleared.

4. Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time.
2. Issue the dmesg command:

# dmesg

The dmesg command displays the most recent messages generated by the system.

3.6.2 Viewing System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file.
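When the /var/adm/messages file is large, it is often enough to look at the last few entries, optionally filtered by a keyword. The following Python sketch is a convenience illustration, not a replacement for the standard Solaris tools. It assumes a Python 3 interpreter is available on the host, or that the log file has been copied to a workstation that has one. The function name recent_messages, the default of 20 lines, and the keyword filter are all choices made for this example.

```python
from collections import deque


def recent_messages(path="/var/adm/messages", count=20, keyword=None):
    """Return the last `count` lines of a message file.

    If `keyword` is given, only lines containing it (case-insensitive)
    are considered before the last `count` are kept.
    """
    with open(path, "r", errors="replace") as handle:
        lines = (line.rstrip("\n") for line in handle)
        if keyword is not None:
            needle = keyword.lower()
            lines = (line for line in lines if needle in line.lower())
        # deque with maxlen keeps only the final `count` items in memory.
        return list(deque(lines, maxlen=count))


if __name__ == "__main__":
    for entry in recent_messages(count=10):
        print(entry)
```

Reading through a bounded deque keeps memory use constant regardless of the size of the log, which matters on a system that has been logging faults for a long time.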
The database that contains the list of disabled components is called the ASR blacklist (asr-db). In most cases, POST automatically disables a faulty component. After the cause of the fault is repaired (FRU replacement, loose connector reseated, and so on), you must remove the component from the ASR blacklist. The ASR commands (TABLE 3-7) enable you to view, and manually add or remove components from the ASR blacklist. These commands are run from the ALOM CMT sc> prompt.
Example with no disabled components:

sc> showcomponent
Keys:
. . .
ASR state: clean

Example showing a disabled component:

sc> showcomponent
Keys:
. . .
ASR state: Disabled Devices
MB/CMP0/CH3/R1/D1 : dimm8 deemed faulty

3.7.2 Disabling Components

The disablecomponent command disables a component by adding it to the ASR blacklist.

1. At the sc> prompt, enter the disablecomponent command.

sc> disablecomponent MB/CMP0/CH3/R1/D1
SC Alert:MB/CMP0/CH3/R1/D1 disabled

2.
3.7.3 Enabling Disabled Components

The enablecomponent command enables a disabled component by removing it from the ASR blacklist.

1. At the sc> prompt, enter the enablecomponent command.

sc> enablecomponent MB/CMP0/CH3/R1/D1
SC Alert:MB/CMP0/CH3/R1/D1 reenabled

2. After receiving confirmation that the enablecomponent command is complete, reset the server so that the ASR command takes effect.

sc> reset

3.
■ If SunVTS software is not installed, you see an error message for each missing package.

ERROR: information for "SUNWvts" was not found
ERROR: information for "SUNWvtsr" was not found
...

The following table lists the SunVTS packages:

Package      Description
SUNWvts      SunVTS framework
SUNWvtsr     SunVTS framework (root)
SUNWvtsts    SunVTS for tests
SUNWvtsmn    SunVTS man pages

If SunVTS is not installed, you can obtain the installation packages from the Solaris Operating System DVDs. The SunVTS 6.
This procedure also assumes that the server is headless, that is, it is not equipped with a monitor capable of displaying bitmap graphics. In this case, you access the SunVTS GUI by logging in remotely from a machine that has a graphics display. Finally, this procedure describes how to run SunVTS tests in general. Individual tests may presume the presence of specific hardware, or might require specific drivers, cables, or loopback connectors.
FIGURE 3-6 SunVTS GUI 5. Expand the test lists to see the individual tests. The test selection area lists tests in categories, such as Network, as shown in FIGURE 3-7. To expand a category, left-click the + icon (expand category icon) to the left of the category name.
FIGURE 3-7 SunVTS Test Selection Panel (test categories shown: Processor(s), Memory, Cryptography, SCSI - Devices(mpt0), and Network with e1000g3(netlbtest), e1000g1(netlbtest), e1000g2(netlbtest), e1000g0(nettest))

6. (Optional) Select the tests you want to run.

Certain tests are enabled by default, and you can choose to accept these. Alternatively, you can enable and disable individual tests or blocks of tests by clicking the checkbox next to the test name or test category name. Tests are enabled when checked, and disabled when not checked.
8. Start testing. Click the Start button that is located at the top left of the SunVTS window. Status and error messages appear in the test messages area located across the bottom of the window. You can stop testing at any time by clicking the Stop button. During testing, SunVTS software logs all status and error messages. To view these messages, click the Log button or select Log Files from the Reports menu.
CHAPTER 4 Preparing for Servicing

This chapter describes how to prepare the server for servicing. The following topics are covered:
■ Section 4.1, “Common Procedures for Parts Replacement” on page 4-1

For a list of FRUs, see Appendix A.

Note – Never attempt to run the system with the cover removed. The cover must be in place for proper air flow. The cover interlock switch immediately shuts the system down when the cover is removed.

4.
4.1.1 Required Tools

The server can be serviced with the following tools:

■ Antistatic wrist strap
■ Antistatic mat
■ No. 2 Phillips screwdriver

4.1.2 Shutting the System Down

Performing a graceful shutdown ensures that all of your data is saved and the system is ready for restart.

1. Log in as superuser or equivalent.
Depending on the nature of the problem, you might want to view the system status or the log files, or run diagnostics before you shut down the system.
6. Using the SC console, issue the poweroff command.

sc> poweroff -fy
SC Alert: SC Request to Power Off Host Immediately.

Note – You can also use the Power On/Off button on the front of the server to initiate a graceful system shutdown.

Refer to the SPARC Enterprise T1000 Server Administration Guide for more information about the ALOM poweroff command.

4.1.
FIGURE 4-1 Unlocking a Mounting Bracket

6. Press the gray release tab on both mounting brackets to release the right and left mounting brackets, then pull the server chassis out of the rails (FIGURE 4-2).
The mounting brackets slide approximately 4 in. (10 cm) farther before disengaging.

FIGURE 4-2 Location of the Mounting Bracket Release Buttons

7. Set the chassis on a sturdy work surface.
4.1.4 Performing Electrostatic Discharge (ESD) Prevention Measures

1. Prepare an antistatic surface to set parts on during removal and installation.
Place ESD-sensitive components, such as the printed circuit boards, on an antistatic mat. The following items can be used as an antistatic mat:
■ Antistatic bag used to wrap a replacement part
■ ESD mat, part number 250-1088
■ Disposable ESD mat (shipped with some replacement parts or optional system components)

2. Use an antistatic wrist strap.

4.1.
FIGURE 4-3 Location of Top Cover Release Button (callouts: cover release button, top cover)
CHAPTER 5

Replacing Field-Replaceable Units

This chapter describes how to remove and replace customer-replaceable field-replaceable units (FRUs) in the server. The following topics are covered:

■ Section 5.1,
■ Section 5.2,
■ Section 5.3,
■ Section 5.4,
■ Section 5.5,
■ Section 5.6,
■ Section 5.7,
■ Section 5.
5.1 Replacing the Optional PCI-Express Card

5.1.1 Removing the Optional PCI-Express Card

Use this procedure to remove the optional low-profile PCI-Express (PCI-E) card from the server.

1. Perform the procedures described in Chapter 4.
2. Remove any cables that are attached to the card.
3. On the rear of the chassis, pull the release lever that secures the PCI-Express card to the chassis (FIGURE 5-1).
4. Carefully pull the PCI-Express card out of the connector on the PCI-Express card riser board and the note slot (FIGURE 5-2).

FIGURE 5-2 Removing and Installing the PCI-Express Card (callouts: note slot, connector, PCI-E riser board)

5. Place the PCI-Express card on an antistatic mat.

5.1.2 Installing the Optional PCI-Express Card

Use this procedure to replace the PCI-Express cards.

1. Unpack the replacement PCI-Express card and place it on an antistatic mat.
3. On the rear of the chassis, engage the release lever to secure the card to the chassis (FIGURE 5-1).
4. Perform the procedures described in Chapter 6.

5.2 Replacing the Fan Tray Assembly

5.2.1 Removing the Fan Tray Assembly

1. Perform the procedures described in Chapter 4.
2. Disconnect the fan power cable from the motherboard.
3. Push in on the clasps on both sides of the fan assembly (FIGURE 5-3).

FIGURE 5-3 Removing the Fan Tray Assembly (callout: fan tray assembly)

4.
5.2.2 Installing the Fan Tray Assembly

1. Unpack the replacement fan tray assembly and place it on an antistatic mat.
2. Align the fan tray assembly with the sheet metal mounting brackets and slide it into place until the clasps on each side lock it into place.
3. Reconnect the fan power cable to the motherboard.
4. Perform the procedures described in Chapter 6.

5.3 Replacing the Power Supply

5.3.1 Removing the Power Supply

1. Perform the procedures described in Chapter 4.
2.
FIGURE 5-4 Removing the Power Supply (callouts: fastener, power supply)

5.3.2 Installing the Power Supply

1. Unpack the replacement power supply.
2. Slide the power supply into the chassis and engage the two alignment pins in the rear of the chassis that mate with the power supply.
3. Push the fastener down on the front of the power supply to lock it into place in the chassis (FIGURE 5-5).
FIGURE 5-5 Installing the Power Supply (callouts: power supply, fastener)

4. Redress the power cable through the midwall in the chassis and connect the cable to the motherboard.
5. Perform the procedures described in Chapter 6.
6. At the sc> prompt, issue the showenvironment command to verify the status of the power supply.

5.4 Replacing the Hard Drive Assembly

5.4.1 Removing the Single-Drive Assembly

1. Disconnect the drive cable from the data/power connector at the rear of the hard drive (FIGURE 5-6).
2.
FIGURE 5-6 Removing the Single-Drive Assembly

5.4.2 Installing the Dual-Drive Assembly

1. Unpack the drive assembly and the dual-drive cable.
The drive assembly should be shipped to you with one or two drives already installed in the assembly, depending on the type of drive assembly that you ordered.
2. Disconnect the drive cable from the data and power connectors on the motherboard and remove the drive cable from your server (FIGURE 5-7).
FIGURE 5-7 Location of Drive Power and Data Connectors on the Motherboard (callouts: data connector (J5002), data connector (J5003), power connector)
3. Get the dual-drive cable that was shipped with the new drive assembly.
4. Plug the drive connectors into the data/power connectors at the rear of the hard drives.

Note – Make sure the connector is correctly oriented before plugging it into the data/power connector on the drives. When connecting the cable to the data/power connector on the lower drive in a dual-drive configuration, it may be easier to first remove the upper drive to get a clear view of the data/power connector on the lower drive.
6. Push the fasteners down to lock the drive assembly into place in the chassis (FIGURE 5-8).
7. Redress the cable through the midwall in the chassis.
8. Route the drive data cables underneath the power supply cable.
9. Plug the power connector on the dual-drive cable to the power connector on the motherboard (FIGURE 5-7).
10. Plug the data connector marked J5003 on the cable to the J5003 data connector on the motherboard (the connector furthest from the power supply).
You should see output similar to the following:

Jun 7 13:23:16 wgs57-57 genunix: [ID 540533 kern.notice] SunOS Release 5.10 Version Generic_118833-08 64-bit
Jun 7 13:23:16 wgs57-57 mpt0 Firmware version v1.a.0.0 (IR)

■ If you see the following output:
    ■ Firmware version v1.a.0.0 or higher (for example, v1.b.0.0, v1.c.0.0, and so on), or
    ■ Firmware version v1.10.0.0 or higher (for example, v1.11.0.0, v1.12.0.0, and so on)
  then you have the latest drive controller firmware. Go to Step 17.
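
The version comparison described above can be sketched in a few lines of Python. This is an illustration only, not a Sun-supplied tool. It assumes that the second field of the version string is decimal when it consists only of digits and hexadecimal otherwise (so that both "a" and "10" mean ten), and that a larger first field also counts as newer. The function names are hypothetical.

```python
import re

# Matches the "Firmware version v1.a.0.0" text shown in the sample log line.
FIRMWARE_RE = re.compile(r"Firmware version v(\d+)\.([0-9a-f]+)\.", re.IGNORECASE)


def second_field_value(field):
    """Assumption: all-digit fields are decimal, anything else is hex."""
    return int(field, 10) if field.isdigit() else int(field, 16)


def firmware_is_current(log_line):
    """Return True/False for a firmware log line, or None if no version is found."""
    match = FIRMWARE_RE.search(log_line)
    if match is None:
        return None
    major = int(match.group(1))
    minor = second_field_value(match.group(2))
    # v1.a.0.0 and v1.10.0.0 both map to (1, 10) under the assumption above.
    return (major, minor) >= (1, 10)


if __name__ == "__main__":
    line = "Jun 7 13:23:16 wgs57-57 mpt0 Firmware version v1.a.0.0 (IR)"
    print(firmware_is_current(line))
```

A result of None means the line did not contain a firmware version at all, which is kept distinct from an older version.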
2. Disconnect the drive cable from the data/power connector at the rear of the hard drive (FIGURE 5-9).
3. Pull the fasteners up on the rear of the single-drive assembly and remove the assembly from the chassis (FIGURE 5-9).

FIGURE 5-9 Removing the Single-Drive Assembly

5.5.1.2 Installing the Hard Drive in a Single-Drive Assembly

1. Unpack the replacement single-drive assembly.
2. Slide the single-drive assembly into the chassis until it mates with the front of the chassis (FIGURE 5-10).
FIGURE 5-10 Installing the Single-Drive Assembly

3. Push the fasteners down to lock the drive assembly into place in the chassis.
4. Redress the cable through the midwall in the chassis.
5. Reconnect the data cable to the data/power connector on the drive (FIGURE 5-10).
If you have a dual-drive cable installed in your system, connect the DRIVE 0 connector on the cable to the data/power connector at the rear of the drive.
5.5.2 Replacing a Hard Drive in a Dual-Drive Assembly

5.5.2.1 Removing a Hard Drive in a Dual-Drive Assembly

1. Perform the procedures described in Chapter 4.
2. Disconnect the drive cable from the data and power connectors on the motherboard (FIGURE 5-11).
3. Pull the fasteners up on the rear of the dual-drive assembly and remove the dual-drive assembly from the chassis (FIGURE 5-12).

FIGURE 5-12 Removing the Dual-Drive Assembly (callout: fasteners)

4. Determine which of the two hard drives you want to remove.
■ The upper drive (drive 1) is typically the data drive or mirror drive.
■ The lower drive (drive 0) is typically the boot drive.
5. Remove the drive from the drive bracket.
5.5.2.2 Installing the Hard Drive in a Dual-Drive Assembly

1. Unpack the replacement hard drive.
2. Install the replacement drive in the drive bracket.
■ To replace the lower drive (drive 0):
    a. Install the replacement drive in the lower drive slot in the drive bracket.
    b. Push the drive firmly toward the front of the drive bracket until the hard drive is completely seated.
    c. Plug the DRIVE 0 connector on the drive cable into the data/power connector on the lower drive.
FIGURE 5-13 Installing the Dual-Drive Assembly (callout: fasteners)

4. Push the fasteners down to lock the drive assembly into place in the chassis (FIGURE 5-13).
5. Redress the cable through the midwall in the chassis.
6. Route the drive data cables underneath the power supply cable.
7. Plug the power connector on the dual-drive cable to the power connector on the motherboard (FIGURE 5-11).
8.
5.6 Replacing DIMMs

5.6.1 Removing DIMMs

Note – Not all DIMMs detected as faulty and offlined by POST must be replaced. In service (maximum) mode, POST detects memory devices with errors that might be corrected with Solaris PSH. See Section 3.4.5, “Correctable Errors Detected by POST” on page 3-35.

Caution – This procedure requires that you handle components that are sensitive to static discharges that can cause the component to fail.
FIGURE 5-14 DIMM Locations

TABLE 5-1 maps the DIMM names that are displayed in faults to the socket numbers that identify the location of the DIMM on the motherboard. The Channel/Rank/DIMM locations (for example, CH0/R0/D0) are silkscreened on the board and on a label near the board.
5. Grasp the top corners of the DIMM and remove it from the motherboard.
6. Place the DIMM on an antistatic mat.

5.6.2 Installing DIMMs

Use the following guidelines and FIGURE 5-14 and TABLE 5-1 to plan the memory configuration of your server.

■ Eight slots hold industry-standard DDR-2 memory DIMMs.
■ The server accepts the following DIMM sizes:
    ■ 512 MB
    ■ 1 GB
    ■ 2 GB
    ■ 4 GB
■ All DIMMs installed must be the same size.
■ DIMMs must be added four at a time.
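
As a planning aid, the guidelines above can be expressed as a small check. The following Python sketch is illustrative only: the function name and message wording are hypothetical, and it encodes just the four rules listed above (slot count, accepted sizes, uniform size, groups of four). It does not model which sockets each group must occupy; use FIGURE 5-14 and TABLE 5-1 for that.

```python
ALLOWED_SIZES_MB = {512, 1024, 2048, 4096}  # 512 MB, 1 GB, 2 GB, 4 GB
SLOT_COUNT = 8                              # eight DIMM slots
GROUP_SIZE = 4                              # DIMMs are added four at a time


def check_dimm_plan(sizes_mb):
    """Return a list of guideline violations; an empty list means none found."""
    problems = []
    count = len(sizes_mb)
    if count > SLOT_COUNT:
        problems.append(f"{count} DIMMs exceed the {SLOT_COUNT} available slots")
    if count % GROUP_SIZE != 0:
        problems.append(f"{count} DIMMs is not a multiple of {GROUP_SIZE}")
    unsupported = sorted(set(sizes_mb) - ALLOWED_SIZES_MB)
    if unsupported:
        problems.append(f"unsupported DIMM sizes (MB): {unsupported}")
    if len(set(sizes_mb)) > 1:
        problems.append("all installed DIMMs must be the same size")
    return problems


if __name__ == "__main__":
    print(check_dimm_plan([1024] * 4))             # four 1 GB DIMMs
    print(check_dimm_plan([512, 512, 1024, 1024])) # mixed sizes
```

Returning a list of messages rather than a single pass/fail value lets a planner see every violated guideline at once.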
■ If the fault is a host-detected fault (displays a UUID), continue to Step 8.
b. Issue the poweron command.

sc> poweron

c. Switch to the system console to view the POST output.

sc> console

Watch the POST output for possible fault messages. The following output is a sign that POST did not detect any faults:

.
.
.
0:0>POST Passed all devices.
0:0>
0:0>DEMON: (Diagnostics Engineering MONitor)
0:0>Select one of the following functions
0:0>POST:Return to OBP.
0:0>INFO:
0:0>POST Passed all devices.
0:0>Master set ACK for vbsc runpost command and spin...
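
If the console session is captured to a text file, the check for the pass message can be automated. The following Python sketch is a minimal illustration under that assumption; the function names are hypothetical, and the sample text is copied from the output above. Note that the manual states only that the message indicates no faults were detected, so the absence of the message should be treated as a prompt to read the full POST output, not as proof of a fault.

```python
import re

PASS_MARKER = "POST Passed all devices"
# POST lines in the sample output start with a prefix such as "0:0>".
PREFIX_RE = re.compile(r"^\d+:\d+>")


def strip_post_prefix(line):
    """Remove the leading "N:N>" prefix from one line of POST output."""
    return PREFIX_RE.sub("", line.strip())


def post_passed(console_text):
    """Return True if any line of the captured output carries the pass message."""
    return any(
        PASS_MARKER in strip_post_prefix(line)
        for line in console_text.splitlines()
    )


SAMPLE = """0:0>POST Passed all devices.
0:0>
0:0>DEMON: (Diagnostics Engineering MONitor)
0:0>Select one of the following functions
0:0>POST:Return to OBP.
0:0>INFO:
0:0>POST Passed all devices.
"""

if __name__ == "__main__":
    print(post_passed(SAMPLE))
```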
9. Obtain the ALOM CMT sc> prompt.
10. Run the showfaults command.
If the fault was detected by the host and the fault information persists, the output will be similar to the following example:

sc> showfaults -v
ID Time            FRU               Fault
0  SEP 09 11:09:26 MB/CMP0/CH0/R0/D0 Host detected fault MSGID: SUN4V-8000-DX UUID: f92e9fbe-735e-c218-cf87-9e1720a28004

If the showfaults command does not report a fault with a UUID, then you do not need to proceed with the following steps because the fault is cleared.

11.
5.7 Replacing the Motherboard and Chassis

5.7.1 Removing the Motherboard and Chassis

The motherboard and chassis are replaced as a unit. Therefore, you must remove all FRUs and associated cables from your chassis, and install them in the new chassis.

1. Perform the procedures described in Chapter 4.
2. Remove the PCI-Express card.
See Section 5.1, “Replacing the Optional PCI-Express Card” on page 5-2.
3. Remove the fan tray assembly and cable.
See Section 5.
2. Replace the fan tray assembly and cable.
See Section 5.2, “Replacing the Fan Tray Assembly” on page 5-4.
3. Replace the power supply and cable.
See Section 5.3, “Replacing the Power Supply” on page 5-5.
4. Replace the hard drive and cable.
See Section 5.5, “Replacing a Hard Drive” on page 5-12.
5. Replace the memory DIMMs.
See Section 5.6, “Replacing DIMMs” on page 5-19.
6. Replace the socketed system configuration SEEPROM.
The location of this SEEPROM is shown in Appendix A.
7.
5.8 Replacing the Clock Battery

5.8.1 Removing the Clock Battery on the Motherboard

1. Perform the procedures described in Chapter 4.
2. Using a small flathead screwdriver, carefully pry the battery from the motherboard (FIGURE 5-15).

FIGURE 5-15 Removing the Clock Battery From the Motherboard

5.8.2 Installing the Clock Battery on the Motherboard

1. Unpack the replacement battery.
2. Press the new battery into the motherboard with the + facing upward (FIGURE 5-16).
FIGURE 5-16 Installing the Clock Battery on the Motherboard

3. Perform the procedures described in Chapter 6.
4. Use the ALOM setdate command to set the day and time.
Use the setdate command before you power on the host system. For details about this command, refer to the Advanced Lights Out Management (ALOM) CMT Guide.
CHAPTER 6

Finishing Up Servicing

This chapter describes how to finish up servicing the server. The following topics are covered:

■ Section 6.1.1, “Replacing the Top Cover” on page 6-1
■ Section 6.1.2, “Reinstalling the Server Chassis in the Rack” on page 6-1
■ Section 6.1.3, “Applying Power to the Server” on page 6-2

6.1 Final Service Procedures

This section describes the final tasks for servicing your server.

6.1.1 Replacing the Top Cover

1. Place the top cover on the chassis.
6.1.3 Applying Power to the Server

Note – If you have just disconnected the power cord from the power supply, you must wait about five seconds before reconnecting the power cord to the power supply.

● Reconnect the power cord to the power supply.

Note – As soon as the power cord is connected, standby power is applied. Depending on the configuration of the firmware, the system might boot.
APPENDIX A

Field-Replaceable Units

FIGURE A-1 shows the locations of the field-replaceable units (FRUs) in the server. TABLE A-1 lists the FRUs.

Note that item number 4 in FIGURE A-1 is a 3.5-inch SATA drive used in the single-drive configuration. The 2.5-inch SAS drives used in the dual-drive configuration look different, but would be installed in the same location in the server.
TABLE A-1 Server FRU List

Item No.: 1
FRU: Motherboard and chassis assembly
Replacement Instructions: Section 5.7, “Replacing the Motherboard and Chassis” on page 5-25
Description: The motherboard and chassis are replaced as a single assembly. The motherboard is provided in different configurations to accommodate the different processor models (6 core and 8 core).
Location: MB

Item No.: 2
FRU: DIMMs
Replacement Instructions: Section 5.
Index

A
AC OK LED, 3-4
Advanced ECC technology, 3-7
Advanced Lights Out Management (ALOM) CMT
    connecting to, 3-13
    diagnosis and repair of server, 3-11
    POST, and, 3-23
    prompt, 3-13
    service related commands, 3-13
airflow, blocked, 3-5
ALOM CMT see Advanced Lights Out Management (ALOM) CMT
antistatic mat, 1-2
antistatic wrist strap, 1-2
ASR blacklist, 3-46, 3-47
asrkeys, 3-46
Automatic System Recovery (ASR), 3-45

B
blacklist, ASR, 3-46
bootmode command, 3-14
break command, 3-14

C
chipkill, 3-7
clearasrdb co
enablecomponent command, 3-39, 3-46, 3-48
environmental faults, 3-4, 3-5, 3-13, 3-16
event log, checking the PSH, 3-41
exercising the system with SunVTS, 3-49

L
LEDs
    AC OK, 3-4
    Power OK, 3-4
log files, viewing, 3-45

F
fan status, displaying, 3-17
fan tray assembly
    installing, 5-5
    removing, 5-4
fault manager daemon, fmd(1M), 3-39
fault message ID, 3-16
fault records, 3-44
faults, 3-12, 3-16
    environmental, 3-4, 3-5
    recovery, 3-12
    repair, 3-12
    types of, 3-16
fmadm command, 3-44
fmdump command, 3-41
front pan
    about, 3-39
    clearing faults, 3-43
    memory faults, and, 3-8
PSH detected faults, 3-16
PSH see also Predictive Self-Healing (PSH), 3-39

R
removing
    clock battery, 5-27
    DIMMs, 5-19, 5-25
    fan tray assembly, 5-4
    hard drive, 5-12, 5-15
    motherboard and chassis, 5-25
    PCI-Express card, 5-2
    power supply, 5-5
    top cover, 4-5
removing the server from the rack, 4-3
required tools, 4-2
reset command, 3-15
resetsc command, 3-15

S
safety information, 1-1
safety symbols, 1-1
Service Required LED, 3-12, 3-39
setkeyswitch para