Sun StorEdge™ 3900 and 6900 Series Troubleshooting Guide Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A. 650-960-1300 Part No. 816-4290-11 March 2002, Revision A Send comments about this document to: docfeedback@sun.com
Copyright 2002 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, CA 95054 U.S.A. All rights reserved. This product or document is distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Contents

1. Introduction
     Predictive Failure Analysis Capabilities
2. General Troubleshooting Procedures
     Troubleshooting Overview Tasks
     Multipathing Options in the Sun StorEdge 6900 Series
     Fibre Channel Links
     Command Line Test Examples
       qlctest(1M)
       switchtest(1M)
     Storage Automated Diagnostic Environment Event Grid
3. Troubleshooting the Fibre Channel Links
     A1/B1 Fibre Channel (FC) Link
     A2/B2 Fibre Channel (FC) Link
     A3/B3 Fibre Channel (FC) Link
     A4/B4 Fibre Channel (FC) Link
4. Configuration Settings
     Verifying Configuration Settings
     ▼ To Verify Configuration Settings
     ▼ To Clear the Lock File
5. Troubleshooting Host Devices
     Host Event Grid
     Using the Host Event Grid
     Replacing the Master, Alternate Master, and Slave Monitoring Host
     ▼ To Replace the Master Host
     ▼ To Replace the Alternate Master or Slave Monitoring Host
     Conclusion
6. Troubleshooting Sun StorEdge FC Switch-8 and Switch-16 Devices
     ▼ To Diagnose and Troubleshoot Switch Hardware
     Replacing the Master Midplane
     ▼ To Replace the Master Midplane
7. Troubleshooting Virtualization Engine Devices
     Virtualization Engine LEDs
     Power LED Codes
     Interpreting LED Service and Diagnostic Codes
     Back Panel Features
     Ethernet Port LEDs
     Fibre Channel Link Error Status Report
     ▼ To Check Fibre Channel Link Error Status Manually
     Translating Host Device Names
     ▼ To Display the VLUN Serial Number
     Devices That Are Not Sun StorEdge Traffic Manager-Enabled
     Sun StorEdge Traffic Manager-Enabled Devices
     ▼ To View the Virtualization Engine Map
     ▼ To Failback the Virtualization Engine
8. Troubleshooting Sun StorEdge T3+ Array Devices
     Troubleshooting the T1/T2 Data Path
     T1/T2 Notification Events
     Sun StorEdge T3+ Array Storage Service Processor Verification
     T1/T2 FRU Tests Available
     Sun StorEdge T3+ Array Event Grid
     Conclusion
9. Troubleshooting Ethernet Hubs
     setupswitch Exit Values
viii Sun StorEdge 3900 and 6900 Series Troubleshooting Guide • March 2002
List of Figures

FIGURE 2-1 Sun StorEdge 3900 Series Fibre Channel Link Diagram
FIGURE 2-2 Sun StorEdge 6900 Series Fibre Channel Link Diagram
FIGURE 3-1 Data Host Notification of Intermittent Problems
FIGURE 3-2 Data Host Notification of Severe Link Error
FIGURE 3-3 Storage Service Processor Notification
FIGURE 3-4 A2/B2 FC Link Host Side Event
FIGURE 3-5 A2/B2 FC Link Storage Service Processor Side Event
FIGURE 3-6 A3/B3 FC Link Host-Side Event
FIGURE 3-7 A3/B3 FC Link Storage Service Processor Side Event
FIGURE 7-6 Path Failure—I/O Routed through Both HBAs
FIGURE 7-7 Virtualization Engine Event Grid
FIGURE 8-1 Storage Service Processor Event
FIGURE 8-2 Virtualization Engine Alert
FIGURE 8-3 Manage Configuration Files Menu
FIGURE 8-4 Example Link Test Text Output from the Storage Automated Diagnostic Environment
FIGURE 8-5 Sun StorEdge T3+ Array Event Grid
Preface The Sun StorEdge 3900 and 6900 Series Troubleshooting Guide provides guidelines for isolating problems in supported configurations of the Sun StorEdge™ 3900 and 6900 series. For detailed configuration information, refer to the Sun StorEdge 3900 and 6900 Series Reference Manual.
Chapter 7 provides detailed information for troubleshooting the virtualization engines. Chapter 8 describes how to troubleshoot the Sun StorEdge T3+ array devices. Also included in this chapter is information about the Explorer Data Collection Utility. Chapter 9 discusses Ethernet hub troubleshooting. Information about the 3COM Ethernet hubs is limited in this guide, however, because they are third-party equipment.
Typographic Conventions

AaBbCc123: The names of commands, files, and directories; on-screen computer output. Examples: Edit your .login file. Use ls -a to list all files. % You have mail.

AaBbCc123: What you type, when contrasted with on-screen computer output. Example: % su Password:

AaBbCc123: Book titles, new words or terms, words to be emphasized. Examples: Read Chapter 6 in the User’s Guide. These are called class options. You must be superuser to do this.
Related Documentation

Late-breaking news:
• Sun StorEdge 3900 and 6900 Series Release Notes, part number 816-3247

Sun StorEdge 3900 and 6900 series hardware information:
• Sun StorEdge 3900 and 6900 Series Site Preparation Guide
• Sun StorEdge 3900 and 6900 Series Regulatory and Safety Compliance Manual
• Sun StorEdge 3900 and 6900 Series Hardware Installation and Service Manual
(part numbers 816-3242, 816-3243)
• Sun StorEdge • Sun StorEdge Service Manual • Sun StorEdge • Sun StorEdge • Sun StorEdge •
Accessing Sun Documentation Online

A broad selection of Sun system documentation is located at:
http://www.sun.com/products-n-solutions/hardware/docs

A complete set of Solaris documentation and many other titles are located at:
http://docs.sun.com

Sun Welcomes Your Comments

Sun is interested in improving its documentation and welcomes your comments and suggestions. You can email your comments to Sun at: docfeedback@sun.com
CHAPTER 1 Introduction The Sun StorEdge 3900 and 6900 series storage subsystems are complete preconfigured storage solutions. The configurations for each of the storage subsystems are shown in TABLE 1-1.
Predictive Failure Analysis Capabilities The Storage Automated Diagnostic Environment software provides the health and monitoring functions for the Sun StorEdge 3900 and 6900 series systems. This software provides the following predictive failure analysis (PFA) capabilities. ■ FC links—Fibre Channel links are monitored at all end points using the FC ELS link counters. When link errors surpass the threshold values, an alert is sent.
CHAPTER 2 General Troubleshooting Procedures This chapter contains the following sections: ■ “Troubleshooting Overview Tasks” on page 3 ■ “Multipathing Options in the Sun StorEdge 6900 Series” on page 7 ■ “Fibre Channel Links” on page 15 ■ “Storage Automated Diagnostic Environment Event Grid” on page 21 Troubleshooting Overview Tasks This section lists the high-level steps to isolate and troubleshoot problems in the Sun StorEdge 3900 and 6900 series.
1. Discover the error by checking one or more of the following messages or files:
■ Storage Automated Diagnostic Environment alerts or email messages
■ /var/adm/messages
■ Sun StorEdge T3+ array syslog file
■ Storage Service Processor messages:
  ■ /var/adm/messages.t3 messages
  ■ /var/adm/log/SEcfglog file
2.
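The log checks in step 1 can be partially automated. The sketch below is a minimal illustration that scans a saved copy of /var/adm/messages for multipathing offline reports; the sample log lines and the /tmp path are invented for the example, not taken from a real system.

```shell
# Illustrative copy of the kind of lines MPxIO writes to /var/adm/messages.
# Real entries come from the data host; these are samples for the sketch.
cat > /tmp/messages.sample <<'EOF'
Jan  8 14:47:07 diag mpxio: path /pci@6,4000/SUNW,qlc@2/fp@0,0 (fp0) to target 2b000060220041f9,0 is offline
Jan  8 14:47:08 diag mpxio: path /pci@6,4000/SUNW,qlc@3/fp@0,0 (fp1) to target 2b000060220041fa,0 is online
EOF

# A nonzero count of offline reports is the cue to continue with the
# link-isolation steps in this guide.
offline=$(grep -c 'is offline' /tmp/messages.sample)
echo "offline path reports: $offline"
```

On a live data host the same filter would be pointed at /var/adm/messages itself.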
4.
8. Verify the fix using the following tools: ■ Storage Automated Diagnostic Environment GUI Topology View and Diagnostic Tests ■ /var/adm/messages on the data host 9.
Multipathing Options in the Sun StorEdge 6900 Series

Using the virtualization engines presents several challenges in how multipathing is handled in the Sun StorEdge 6900 series. Unlike Sun StorEdge T3+ array and Sun StorEdge network FC switch-8 and switch-16 switch installations, which present primary and secondary pathing options, the virtualization engines present only primary pathing options to the data host.
Note that in the Class and State fields, the virtualization engines are presented as two primary/ONLINE devices. The current Sun StorEdge Traffic Manager design does not enable you to manually halt the I/O (that is, you cannot perform a failover to the secondary path) when only primary devices are present.

Alternatives to Sun StorEdge Traffic Manager

As an alternative to using Sun StorEdge Traffic Manager, you can manually halt the I/O using one of two methods: quiescing the I/O or unconfiguring the c2 path.
2. Using Storage Automated Diagnostic Environment Topology GUI, determine which virtualization engine is in the path you need to disable. 3.
▼ To Suspend the I/O Use one of the following methods to suspend the I/O while the failover occurs: 1. Stop all customer applications that are accessing the Sun StorEdge T3+ array. 2. Manually pull the link from the Sun StorEdge T3+ array to the switch and wait for a Sun StorEdge T3+ array LUN failover. ■ After the failover occurs, replace the cable and proceed with testing and FRU isolation.
▼ To View the VxDisk Properties

1. Type the following:

# vxdisk list Disk_1
Device:    Disk_1
devicetag: Disk_1
type:      sliced
hostid:    diag.xxxxx.xxx.COM
disk:      name=t3dg02 id=1010283311.1163.diag.xxxxx.xxx.com
group:     name=t3dg id=1010283312.1166.diag.xxxxx.xxx.com
flags:     online ready private autoconfig nohotuse autoimport imported
pubpaths:  block=/dev/vx/dmp/Disk_1s4 char=/dev/vx/rdmp/Disk_1s4
privpaths: block=/dev/vx/dmp/Disk_1s3 char=/dev/vx/rdmp/Disk_1s3
version:   2.
2. Use the luxadm(1M) command to display further information about the underlying LUN.

# luxadm display /dev/rdsk/c20t2B000060220041F4d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c20t2B000060220041F4d0s2
Status(Port A):       O.K.
Vendor:               SUN
Product ID:           SESS01
WWN(Node):            2a000060220041f4
WWN(Port A):          2b000060220041f4
Revision:             080C
Serial Num:           Unsupported
Unformatted capacity: 102400.
▼ To Quiesce the I/O on the A3/B3 Link 1. Determine the path you want to disable. 2. Disable the path by typing the following: # vxdmpadm disable ctlr= 3. Verify that the path is disabled: # vxdmpadm listctlr all Steps 1 and 2 halt I/O only up to the A3/B3 link. I/O will continue to move over the T1 & T2 paths, as well as the A4/B4 links to the Sun StorEdge T3+ array. ▼ To Suspend the I/O on the A3/B3 Link Use one of the following methods to suspend I/O while the failover occurs: 1.
▼ To Return the Path to Production 1. Type: # vxdmpadm enable ctlr= 2.
Fibre Channel Links The following sections provide troubleshooting information for the basic components and Fibre Channel links, listed in TABLE 2-1.
Fibre Channel Link Diagrams

FIGURE 2-1 shows the basic components and the Fibre Channel links for a Sun StorEdge 3900 series system:
■ A1 to B1—HBA to Sun StorEdge FC network switch-8 and switch-16 switch link
■ A4 to B4—Sun StorEdge FC network switch-8 and switch-16 switch to Sun StorEdge T3+ array link

FIGURE 2-1 Sun StorEdge 3900 Series Fibre Channel Link Diagram
FIGURE 2-2 shows the basic components and the Fibre Channel links for a Sun StorEdge 6900 series system:
■ A1 to B1—HBA to Sun StorEdge network FC switch-8 and switch-16 switch link
■ A2 to B2—Sun StorEdge network FC switch-8 and switch-16 switch to virtualization engine link on the host side
■ A3 to B3—Sun StorEdge network FC switch-8 and switch-16 switch to the virtualization engine link on the device side
■ A4 to B4—Sun StorEdge network FC switch-8 and switch-16 switch to Sun StorEdge T3+ array link

FIGURE 2-2 Sun StorEdge 6900 Series Fibre Channel Link Diagram
Host Side Troubleshooting

Host-side troubleshooting refers to the messages and errors the data host detects. Usually, these messages appear in the /var/adm/messages file.

Storage Service Processor Side Troubleshooting

Storage Service Processor-side troubleshooting refers to messages, alerts, and errors that the Storage Automated Diagnostic Environment, running on the Storage Service Processor, detects.
Command Line Test Examples To run a single Sun StorEdge diagnostic test from the command line rather than through the Storage Automated Diagnostic Environment interface, you must log into the appropriate Host or Slave for testing the components. The following two tests, the qlctest(1M) and the switchtest(1M) are provided as examples. qlctest(1M) The qlctest(1M) comprises several subtests that test the functions of the Sun StorEdge PCI dual Fibre Channel (FC) host adapter board.
switchtest(1M) switchtest(1M) is used to diagnose the Sun StorEdge network FC switch-8 and switch-16 switch devices. The switchtest process also provides command line access to switch diagnostics. switchtest supports testing on local and remote switches. switchtest runs the port diagnostic on connected switch ports. While switchtest is running, the port statistics are monitored for errors, and the chassis status is checked.
Storage Automated Diagnostic Environment Event Grid The Storage Automated Diagnostic Environment generates component-specific event grids that describe the severity of an event, whether action is required, a description of the event, and recommended action. Refer to Chapters 5 through 9 of this troubleshooting guide for component-specific event grids. ▼ To Customize an Event Report 1. Click the Event Grid link on the Storage Automated Diagnostic Environment Help menu. 2.
CHAPTER 3 Troubleshooting the Fibre Channel Links A1/B1 Fibre Channel (FC) Link If a problem occurs with the A1/B1 FC link: ■ In a Sun StorEdge 3900 series system, the Sun StorEdge T3+ array will fail over. ■ In a Sun StorEdge 6900 series system, no Sun StorEdge T3+ array will fail over, but a severe problem can cause a path to go offline. FIGURE 3-1, FIGURE 3-2, and FIGURE 3-3 are examples of A1/B1 Fibre Channel Link Notification Events.
Site     : FSDE LAB Broomfield CO
Source   : diag.xxxxx.xxx.com
Severity : Normal
Category : Message
Key      : message:diag.xxxxx.xxx.com
EventType: LogEvent.driver.MPXIO_offline
EventTime: 01/08/2002 14:48:02

Found 2 ’driver.MPXIO_offline’ warning(s) in logfile: /var/adm/messages on diag.xxxxx.xxx.com (id=80fee746):
Jan 8 14:47:07 WWN:2b000060220041f9 diag.xxxxx.xxx.com mpxio: [ID 779286 kern.
▼ To Verify the Data Host An error in the A1/B1 FC link can cause a path to go offline in the multipathing software. CODE EXAMPLE 3-1 luxadm(1M) Display # luxadm display /dev/rdsk/c6t29000060220041F96257354230303052d0s2 DEVICE PROPERTIES for disk: /dev/rdsk/ c6t29000060220041F96257354230303052d0s2 Status(Port A): O.K. Status(Port B): O.K.
An error in the A1/B1 FC link can also cause a device to enter the “unusable” state in cfgadm. In this case, the output for luxadm -e port will show that a device that was “connected” changed to an “unconnected” state. CODE EXAMPLE 3-2 cfgadm -al Display ...
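The “connected” versus “unconnected” check described above can be scripted. This is an illustrative sketch only: the sample output below mimics the luxadm -e port format shown in this guide, and the device paths are examples, not from a real host.

```shell
# Sample output mimicking the format of `luxadm -e port` (see the examples
# in this guide); saved to a file so the filter below is self-contained.
cat > /tmp/luxadm_port.sample <<'EOF'
/devices/pci@6,4000/SUNW,qlc@2/fp@0,0:devctl   CONNECTED
/devices/pci@6,4000/SUNW,qlc@3/fp@0,0:devctl   NOT CONNECTED
EOF

# Print only the HBA ports that are not in the CONNECTED state; on a live
# host this would read from `luxadm -e port` run as root instead.
awk '/NOT CONNECTED/ {print $1}' /tmp/luxadm_port.sample
```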
CODE EXAMPLE 3-3 switchtest(1M) called with options # ./switchtest -v -o "dev=2:192.168.0.30:0" "switchtest: called with options: dev=2:192.168.0.30:0" "switchtest: Started." "Testing port: 2" "Using ip_addr: 192.168.0.30, fcaddr: 0x0 to access this port." "Chassis Status for Device: Switch Power: OK Temp: OK 23.0c Fan 1: OK Fan 2: OK " 02/06/02 15:09:45 diag Storage Automated Diagnostic Environment MSGID 4001 switchtest.WARNING switch0: "Maximum transfer size for a FABRIC port is 200.
▼ To Isolate the A1/B1 FC Link 1. Quiesce the I/O on the A1/B1 FC link path. 2. Run switchtest or qlctest to test the entire link. 3. Break the connection by uncabling the link. 4. Insert a loopback connector into the switch port. 5. Rerun switchtest. a. If switchtest fails, replace the GBIC and rerun switchtest. b. If switchtest fails again, replace the switch. 6. Insert a loopback connector into the HBA. 7. Run qlctest. ■ If the test fails, replace the HBA. ■ If the test passes, replace the cable.
A2/B2 Fibre Channel (FC) Link

If a problem occurs with the A2/B2 FC link:
■ In a Sun StorEdge 3900 series system, the Sun StorEdge T3+ array will fail over.
■ In a Sun StorEdge 6900 series system, no Sun StorEdge T3+ array will fail over, but a severe problem can cause a path to go offline.

FIGURE 3-4 and FIGURE 3-5 are examples of A2/B2 FC Link Notification Events.

From root Tue Jan 8 18:39:48 2002
Date: Tue, 8 Jan 2002 18:39:47 -0700 (MST)
Message-Id: <200201090139.g091dlg07015@diag.xxxxx.xxx.
Site     : FSDE LAB Broomfield CO
Source   : diag.xxxxx.xxx.com
Severity : Normal
Category : Switch
Key      : switch:100000c0dd0061bb
EventType: StateChangeEvent.X.port.1
EventTime: 01/08/2002 17:38:32

’port.1’ in SWITCH diag-sw1b (ip=192.168.0.31) is now Unknown (status-state changed from ’Online’ to ’Admin’):
----------------------------------------------------------------
Site : Source : Severity : Category : EventType: EventTime: FSDE LAB Broomfield CO diag.xxxxx.xxx.
▼ To Verify the Host Side An error in the A2/B2 FC link can result in a device being listed as in an “unusable” state in cfgadm, but no HBAs are listed as in the “unconnected” state in luxadm output. The multipathing software will note an OFFLINE path.
CODE EXAMPLE 3-4 cfgadm -al

# cfgadm -al
Ap_Id  Type      Receptacle  Occupant    Condition
c0     scsi-bus  connected   configured  unknown

# luxadm -e port
Found path to 2 HBA ports
/devices/pci@6,4000/SUNW,qlc@2/fp@0,0:devctl   CONNECTED
/devices/pci@6,4000/SUNW,qlc@3/fp@0,0:devctl   CONNECTED

# luxadm display /dev/rdsk/c6t29000060220041F96257354230303052d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c6t29000060220041F96257354230303052d0s2
Status(Port A): O.K.
Status(Port B): O.K.
Note – You can find procedures for restoring virtualization engine settings in the Sun StorEdge 3900 and 6900 Series Reference Manual. ▼ To Verify the A2/B2 FC Link You can check the A2/B2 FC link using the Storage Automated Diagnostic Environment, Diagnose—Test from Topology functionality. The Storage Automated Diagnostic Environment’s implementation of diagnostic tests verifies the operation of user-selected components. Using the Topology view, you can select specific tests, subtests, and test options.
5. If the switch or the GBIC shows no errors, replace the remaining components in the following order:
a. Replace the virtualization engine-side GBIC, recable the link, and monitor the link for errors.
b. Replace the cable, recable the link, and monitor the link for errors.
c. Replace the virtualization engine, restore the virtualization engine settings, recable the link, and monitor the link for errors.
6. Return the path to production.
A3/B3 Fibre Channel (FC) Link If a problem occurs with the A3/B3 FC link: ■ In a Sun StorEdge 3900 series system, the Sun StorEdge T3+ array will fail over. ■ In a Sun StorEdge 6900 series system, no Sun StorEdge T3+ array will fail over, but a severe problem can cause a path to go offline. FIGURE 3-6, FIGURE 3-7, and FIGURE 3-8 are examples of A3/B3 FC link Notification Events. Site : Source : Severity : Category : EventType: EventTime: FSDE LAB Broomfield CO diag.xxxxx.xxx.
Site     : FSDE LAB Broomfield CO
Source   : diag.xxxxx.xxx.com
Severity : Normal
Category : Switch
Key      : switch:100000c0dd0057bd
EventType: StateChangeEvent.M.port.1
EventTime: 01/08/2002 18:28:38

’port.1’ in SWITCH diag-sw1a (ip=192.168.0.30) is now Not-Available (status-state changed from ’Online’ to ’Offline’):
Info: A port on the switch has logged out of the fabric and gone offline
Action:
1. Verify cables, GBICs and connections along Fibre Channel path
2.
▼ To Verify the Host Side An error in the A3/B3 FC link results in a device being listed as in an “unusable” state in cfgadm, but no HBAs are listed as in the “unconnected” state in luxadm output. The multipathing software will note an “offline” path.
CODE EXAMPLE 3-6 VxDMP Error Message Jan 8 18:26:38 diag.xxxxx.xxx.com vxdmp: [ID 619769 kern.notice] NOTICE: vxdmp: Path failure on 118/0x1f8 Jan 8 18:26:38 diag.xxxxx.xxx.com vxdmp: [ID 997040 kern.notice] NOTICE: vxvm:vxdmp: disabled path 118/0x1f8 belonging to the dmpnode 231/0xd0 ▼ To Verify the Storage Service Processor You can check the A3/B3 FC link using the Storage Automated Diagnostic Environment, Diagnose—Test from Topology functionality.
▼ To Isolate the A3/B3 FC Link

1. Quiesce the I/O on the A3/B3 FC link path.
2. Break the connection by uncabling the link.
3. Insert the loopback connector into the switch port.
4. Run switchtest:
a. If the test fails, replace the GBIC and rerun switchtest.
b. If the test fails again, replace the switch.
5. If the switch or the GBIC shows no errors, replace the remaining components in the following order:
a. Replace the virtualization engine-side GBIC, recable the link, and monitor the link for errors.
b.
A4/B4 Fibre Channel (FC) Link

If a problem occurs with the A4/B4 FC link:
■ In a Sun StorEdge 3900 series system, the Sun StorEdge T3+ array will fail over.
■ In a Sun StorEdge 6900 series system, no Sun StorEdge T3+ array will fail over, but a severe problem can cause a path to go offline.

FIGURE 3-9 and FIGURE 3-10 are examples of A4/B4 Link Notification Events.

Site     : FSDE LAB Broomfield CO
Source   : diag.xxxxx.xxx.com
Severity : Warning
Category : Message
DeviceId : message:diag.xxxxx.
Site     : FSDE LAB Broomfield CO
Source   : diag
Severity : Warning
Category : Switch
DeviceId : switch:100000c0dd0061bb
EventType: LogEvent.MessageLog
EventTime: 01/29/2002 14:25:05

Change in Port Statistics on switch diag-sw1b (ip=192.168.0.31):
Port-1: Received 16289 ’InvalidTxWds’ in 0 mins (value=365972)
----------------------------------------------------------------------
Site     : FSDE LAB Broomfield CO
Source   : diag
Severity : Warning
Category : T3message
DeviceId : t3message:83060c0c
EventType: LogEvent.
▼ To Verify the Data Host A problem in the A4/B4 FC link appears differently on the data host, depending on whether the array is a Sun StorEdge 3900 series or a Sun StorEdge 6900 series device. Sun StorEdge 3900 Series In a Sun StorEdge 3900 series device, the data host multipathing software is responsible for initiating the failover and reports it in /var/adm/messages, with entries such as those reported in the Storage Automated Diagnostic Environment email notifications.
To verify the failover, use luxadm display; the failed path will be marked OFFLINE, as shown in CODE EXAMPLE 3-7.

CODE EXAMPLE 3-7 Failed Path marked OFFLINE

# luxadm display /dev/rdsk/c26t60020F200000644>
DEVICE PROPERTIES for disk: /dev/rdsk/c26t60020F20000064433C3352A60003E82Fd0s2
Status(Port A): O.K.
Status(Port B): O.K.
CODE EXAMPLE 3-8 Failed Path marked “unusable”

# cfgadm -al
Ap_Id                  Type        Receptacle  Occupant      Condition
ac0:bank0              memory      connected   configured    ok
ac0:bank1              memory      empty       unconfigured  unknown
c1                     scsi-bus    connected   configured    unknown
c16                    scsi-bus    connected   unconfigured  unknown
c18                    scsi-bus    connected   unconfigured  unknown
c19                    scsi-bus    connected   unconfigured  unknown
c1::dsk/c1t6d0         CD-ROM      connected   configured    unknown
c20                    fc-private  connected   unconfigured  unknown
c21                    fc-fabric   connected   configu
c21::50020f2300006355
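When many attachment points are listed, the “unusable” rows can be filtered out mechanically. The sketch below runs over sample rows laid out in cfgadm -al column order; the attachment-point IDs are illustrative, not from a real system.

```shell
# Sample rows in `cfgadm -al` column order (Ap_Id, Type, Receptacle,
# Occupant, Condition); the IDs here are illustrative.
cat > /tmp/cfgadm.sample <<'EOF'
c20                   fc-private  connected  unconfigured  unknown
c21                   fc-fabric   connected  configured    unknown
c21::50020f2300006355 disk        connected  configured    unusable
EOF

# Print attachment points whose Condition column reads "unusable", the
# state an A4/B4 link fault can leave a device in.
awk '$NF == "unusable" {print $1}' /tmp/cfgadm.sample
```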
5. Rerun switchtest. a. If switchtest fails, replace the GBIC and rerun switchtest. b. If the test fails again, replace the switch. 6. If switchtest passes, assume that the suspect components are the cable and the Sun StorEdge T3+ array controller. a. Replace the cable. b. Rerun switchtest. 7. If the test fails again, replace the Sun StorEdge T3+ array controller. 8. Return the path to production. 9.
CHAPTER 4 Configuration Settings This chapter contains the following sections: ■ “Verifying Configuration Settings” on page 47 ■ “To Clear the Lock File” on page 50 For a complete listing of SUNWsecfg Error Messages and recommended action, refer to Appendix B. Verifying Configuration Settings During the course of troubleshooting, you might need to verify configuration settings on the various components in the Sun StorEdge 3900 or 6900 series.
Note – For cluster configurations and systems that are attached to Windows NT, the default configurations may not match the current installed configuration. Be aware of this when running the verification scripts. Certain items may be flagged as FAIL in these special circumstances. CODE EXAMPLE 4-1 /opt/SUNWsecfg/checkdefaultconfig output # /opt/SUNWsecfg/checkdefaultconfig Checking all accessible components.....
10. If anything is marked FAIL, check the /var/adm/log/SEcfglog file for the details of the failure. Mon Jan 7 18:07:51 PST 2002 checkt3config: t3b0 INFO : ----------SAVED CONFIGURATION--------------. Mon Jan 7 18:07:51 PST 2002 checkt3config: Mon Jan 7 18:07:51 PST 2002 checkt3config: Mon Jan 7 18:07:51 PST 2002 checkt3config: Mon Jan 7 18:07:51 PST 2002 checkt3config: Mon Jan 7 18:07:51 PST 2002 checkt3config: Mon Jan 7 18:07:51 PST 2002 checkt3config: Mon Jan 7 18:07:51 PST 2002 checkt3config: MBytes.
11. Fix the FAIL condition, and then verify the settings again.

# /opt/SUNWsecfg/bin/checkt3config -n t3b0
Checking : t3b0 Configuration.......
Checking command ver          : PASS
Checking command vol stat     : PASS
Checking command port list    : PASS
Checking command port listmap : PASS
Checking command sys list     : PASS

If you interrupt any of the SUNWsecfg scripts (by typing Control-C, for example), a lock file might remain in the /opt/SUNWsecfg/etc directory, causing subsequent commands to fail.
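A minimal sketch of the lock-file cleanup follows. The .lock suffix and the demo directory are assumptions made for illustration; confirm the actual file names left in /opt/SUNWsecfg/etc before removing anything on a real Storage Service Processor.

```shell
# Sketch of clearing a stale SUNWsecfg lock. The directory below stands in
# for /opt/SUNWsecfg/etc, and the .lock suffix is an assumption for the
# example; confirm the real lock file name before removing anything.
LOCKDIR=/tmp/secfg-demo/etc
mkdir -p "$LOCKDIR"
touch "$LOCKDIR/t3b0.lock"        # simulate a lock left by an interrupt

for f in "$LOCKDIR"/*.lock; do
    [ -e "$f" ] || continue       # nothing matched the glob
    echo "removing stale lock: $f"
    rm -f "$f"
done
```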
CODE EXAMPLE 4-2 Tue Tue Tue Tue Jan Jan Jan Jan 29 29 29 29 savevemap output 16:12:34 16:12:34 16:12:42 16:14:01 MST MST MST MST 2002 2002 2002 2002 savevemap: v1 ENTER. checkslicd: v1 ENTER. checkslicd: v1 EXIT. savevemap: v1 EXIT. When savevemap: EXIT is displayed, the savevemap process has successfully exited.
CHAPTER 5 Troubleshooting Host Devices This chapter describes how to troubleshoot components associated with a Sun StorEdge 3900 or 6900 series Host. This chapter contains the following sections: ■ “Using the Host Event Grid” on page 53 ■ “To Replace the Master Host” on page 57 ■ “To Replace the Alternate Master or Slave Monitoring Host” on page 58 Host Event Grid The Storage Automated Diagnostic Environment Event Grid enables you to sort host events by component, category, or event type.
FIGURE 5-1 Host Event Grid
TABLE 5-1 lists all the host events in the Storage Automated Diagnostic Environment. TABLE 5-1 Storage Automated Diagnostic Environment Event Grid for the Host Category Component EventType Sev host hba Alarm+ Yellow host hba Alarm- Red host lun.t300 Alarm- Red Action Description Information [ Info ] status of hba /devices/ sbus@9,0/ SUNW,qlc@0,30000 /fp@0,0:devctl on diag.xxxxx.xxx.com changed from NOT CONNECTED to CONNECTED Monitors changes in the output of the luxadm -e port.
TABLE 5-1 Storage Automated Diagnostic Environment Event Grid for the Host (Continued) luxadm display reported a change in the port status of one of its paths. The Storage Automated Diagnostic Environment then tries to find to which enclosure this path corresponds by reviewing its database of Sun StorEdge T3+ arrays and virtualization engines. host lun.VE Alarm- Red Y [ Info ] The state of lun.VE.c14t50020 F2300003EE5d0s2. statusA on diag.xxxxx.xxx.
Replacing the Master, Alternate Master, and Slave Monitoring Host The following procedures are a high-level overview of the procedures that are detailed in the Storage Automated Diagnostic Environment User’s Guide. Follow these procedures when replacing a master, alternate master, or slave monitoring host. Note – The procedures for replacing the master host are different from the procedures for replacing an alternate master or slave monitoring host.
5. Choose Utilities -> System -> Recover Config. Refer to Chapter 7 of the Storage Automated Diagnostic Environment User’s Guide for detailed instructions. a. In the Recover Config window, enter the IP address of any alternate master or slave monitoring host (all hosts keep a copy of the configuration). b. Make sure the Recover Config and Reset slave to this master checkboxes are checked. c. Click Recover. 6. Choose Maintenance -> General Maintenance.
7. Choose Maintenance -> General Maintenance -> Maintain Hosts. Refer to Chapter 3, “Maintenance,” of the Storage Automated Diagnostic User’s Guide for detailed instructions. 8. In the Maintain Hosts window, select the new host. 9. Configure the options as needed. 10. Choose Maintenance -> Topology Maintenance -> Topology Snapshot. a. In the Topology Snapshot window, select the new host. b. Click Create and Retrieve Selected Topologies. c. Click Merge and Push Master Topology.
CHAPTER 6 Troubleshooting Sun StorEdge FC Switch-8 and Switch-16 Devices This chapter describes how to troubleshoot the switch components associated with a Sun StorEdge 3900 or 6900 series system.
These switches can be monitored through the SANSurfer GUI, which is available on the Storage Service Processor. You configure and modify the switches using the SUNWsecfg configuration utilities. Do not configure or modify the switches using any method other than the SUNWsecfg tools.

▼ To Diagnose and Troubleshoot Switch Hardware

1. To diagnose and troubleshoot the switch hardware, begin by running the SUNWsecfg checkswitch utility.
2.
FIGURE 6-1 Switch Event Grid
TABLE 6-1 lists the switch events. TABLE 6-1 Storage Automated Diagnostic Environment Event Grid for Switches Cat Component EventType Sev Action Description Information/Action switch port statistics Log Yellow Y [ Info/Action ] Information: The switch has reported a change in an error counter. This could indicate a failing component in the link. Change in port statistics on switch diag156-sw1b (ip=192.168.0.31) Action: Check the Topology GUI for any link errors.
TABLE 6-1 Storage Automated Diagnostic Environment Event Grid for Switches (Continued) Cat Component EventType Sev Action Description Information/Action switch enclosure Audit Auditing a new switch called ras d2-swb1 (ip=xxx.0.0.41) 10002000007a609 switch oob Comm_Established Communication regained with sw1a (ip=xxx.20.67.213) switch oob Comm_Lost Down Yes [ Info/Action ] Lost communication with sw1a (ip=xxx.20.67.213) Information: Ethernet connectivity to the switch has been lost.
TABLE 6-1 Storage Automated Diagnostic Environment Event Grid for Switches (Continued) Cat Component EventType switch enclosure switch enclosure 66 Sev Action Description Information/Action Discovery [ Info ] Discovered a new switch called ras d2-swb1 (ip=xxx.0.0.41) 10002000007a609 Discovery events occur the very first time the agent probes a storage device. It creates a detailed description of the device monitored and sends it using any active notifier (NetConnect, Email).
TABLE 6-1 Storage Automated Diagnostic Environment Event Grid for Switches (Continued) Cat Component EventType switch port StateChange+ switch port StateChange- switch enclosure Sev Red Statistics Chapter 6 Action Y Description Information/Action [ Info/Action ] port.1 in SWITCH diag185 (ip= xxx.20.67.185) is now Available (statusstate changed from OFFLINE to ONLINE) Port on switch is now available. [ Info/Action ] port.1 in SWITCH diag185 (ip=xxx.20.67.
Replacing the Master Midplane

Follow this procedure when replacing the master midplane in a Sun StorEdge network FC switch-8 or switch-16 switch or a Brocade Silkworm switch. This procedure is detailed in the Storage Automated Diagnostic Environment User’s Guide.

▼ To Replace the Master Midplane

1. Choose Maintenance --> General Maintenance --> Maintain Devices. Refer to Chapter 3 of the Storage Automated Diagnostic Environment User’s Guide.
2.
CHAPTER 7 Troubleshooting Virtualization Engine Devices This chapter describes how to troubleshoot the virtualization engine component of a Sun StorEdge 6900 series system.
Virtualization Engine Diagnostics

The virtualization engine monitors the following components:
■ Virtualization engine router
■ Sun StorEdge T3+ array
■ Cabling between the router and storage

Service Request Numbers

The service request numbers are used to inform the user of storage subsystem activities.

Service and Diagnostic Codes

The virtualization engine’s service and diagnostic codes inform the user of subsystem activities. The codes are presented as an LED readout.
▼ To Display Log Files and Retrieve SRNs

Use the /opt/svengine/sduc/sreadlog command to display log files and retrieve the Service Request Numbers (SRNs) for errors that need action. Data is returned in the following format:

TimeStamp:nnn:Txxxxx.uuuuuuuu SRN=mmmmm
TimeStamp:nnn:Txxxxx.uuuuuuuu SRN=mmmmm
TimeStamp:nnn:Txxxxx.uuuuuuuu SRN=mmmmm

Item       Description
TimeStamp  Time and date when the error occurred
nnn        The name of the virtualization engine pair (v1 or v2)
Txxxxx     The LUN where the error occurred.
Item Description (example values)
TimeStamp January 3, 2002 10:13
nnn v1 (virtualization engine pair v1)
uuuuuuuu 29000060-220041F9 (v1a, obtained by checking the virtualization engine map from the SEcfg utility)
SRN=mmmmm SRN=70030: SAN Configuration Changed (Refer to Appendix A for codes.)
▼ To Clear the Log ● Use the /opt/svengine/sduc/sclrlog command.
Virtualization Engine LEDs TABLE 7-1 describes the LEDs on the back of the virtualization engine.
Power LED Codes The virtualization engine LEDs are shown in FIGURE 7-1. FIGURE 7-1 Virtualization Engine Front Panel LEDs Interpreting LED Service and Diagnostic Codes The Status LED communicates the status of the virtualization engine in decimal numbers. Each decimal digit is represented by that number of blinks, followed by a medium-duration (two-second) LED-off period. TABLE 7-2 lists the status LED code descriptions.
Back Panel Features The back panel of the virtualization engine contains the connections to the Sun StorEdge network FC switch-8 or switch-16 switches, a socket for the AC power input, and various data ports and LEDs. Ethernet Port LEDs The Ethernet port LEDs indicate the speed, activity, and validity of the link, as shown in TABLE 7-3.
Fibre Channel Link Error Status Report The virtualization engine’s host-side and device-side interfaces provide statistical data for the counts listed in TABLE 7-4. TABLE 7-4 Virtualization Engine Statistical Data Count Type Description Link Failure Count The number of times the virtualization engine’s frame manager detects a non-operational state or other failure of N_Port initialization protocol.
▼ To Check Fibre Channel Link Error Status Manually The Storage Automated Diagnostic Environment, which runs on the Storage Service Processor, monitors the Fibre Channel link status of the virtualization engine. Because the virtualization engine must be power-cycled to reset the counters, you should manually check the accumulation of errors over a fixed period of time. To check the status manually, follow these steps: 1. Use the svstat command to take a reading, as shown in CODE EXAMPLE 7-1.
CODE EXAMPLE 7-1 Fibre Channel Link Error Status Example
# /opt/svengine/sduc/svstat -d v1
I00001 Host Side FC Vital Statistics:
Link Failure Count 0
Loss of Sync Count 0
Loss of Signal Count 0
Protocol Error Count 0
Invalid Word Count 8
Invalid CRC Count 0
I00001 Device Side FC Vital Statistics:
Link Failure Count 0
Loss of Sync Count 0
Loss of Signal Count 0
Protocol Error Count 0
Invalid Word Count 139
Invalid CRC Count 0
I00002 Host Side FC Vital Statistics:
Link Failure Count 0
Loss of Sync Count 0
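Because the counters persist until a power cycle, a practical check is to take two svstat readings some time apart and compare the counts. The following sketch is illustrative only: the get_count helper and the /tmp file names are not part of the SUNWsecfg tools, and the sample data stands in for real svstat output.

```shell
# Extract a named counter from saved svstat output and compare two
# readings taken some time apart. get_count and the /tmp file names
# are illustrative helpers, not part of the product software.
get_count() {
    # $1 = counter name, $2 = file containing saved svstat output
    awk -v name="$1" '$0 ~ name { print $NF; exit }' "$2"
}

# Sample data standing in for two svstat readings:
cat > /tmp/svstat.before <<'EOF'
Invalid Word Count 8
EOF
cat > /tmp/svstat.after <<'EOF'
Invalid Word Count 139
EOF

before=$(get_count "Invalid Word Count" /tmp/svstat.before)
after=$(get_count "Invalid Word Count" /tmp/svstat.after)
echo "Invalid Word Count delta: $((after - before))"
```

A steadily growing delta between readings points at a degrading link, even though each individual reading looks static.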
Translating Host Device Names You can translate host device names to VLUN, disk pool, and physical Sun StorEdge T3+ array LUNs. The luxadm output for a host device, shown in CODE EXAMPLE 7-2, does not include the unique VLUN serial number that is needed to identify this LUN.
CODE EXAMPLE 7-2 luxadm Output for a Host Device
# luxadm display /dev/rdsk/c4t2B00006022004186d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c4t2B00006022004186d0s2
Status(Port A): O.K.
▼ To Display the VLUN Serial Number
Devices That Are Not Sun StorEdge Traffic Manager-Enabled
1. Use the format -e command.
2. At the format prompt, specify the disk you are working on.
3. Type inquiry at the scsi prompt.
4. Find the VLUN serial number in the displayed Inquiry list.
# format -e c4t2B00006022004186d0
format> scsi
...
Sun StorEdge Traffic Manager-Enabled Devices
If the devices support the Sun StorEdge Traffic Manager software, you can use this shortcut.
● Type:
# luxadm display /dev/rdsk/c6t29000060220041956257334B30303148d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c6t29000060220041956257334B30303148d0s2
Status(Port A): O.K.
Status(Port B): O.K.
▼ To View the Virtualization Engine Map The virtualization engine map is stored on the Storage Service Processor.
1. To view the virtualization engine map, type:
# showvemap -n v1 -f
VIRTUAL LUN SUMMARY
Disk pool  VLUN Serial Number  MP Drive Target  VLUN Target  VLUN Name  VLUN Size GB  Slic Zones
t3b00  6257334B30303148  T49152  T16384  VDRV000  55.0
t3b00  6257334B30303149  T49152  T16385  VDRV001  55.
2. You can optionally establish a telnet connection to the virtualization engine and run the runsecfg utility to poll a live snapshot of the virtualization engine map. Refer to “To Replace a Failed Virtualization Engine” on page 84 for telnet instructions. Determining the virtualization engine pairs on the system .........
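When scripting against a saved copy of the map, the VIRTUAL LUN SUMMARY columns can be picked apart with awk. This is a hedged sketch: the sample lines and field positions are taken from the showvemap listing above, and /tmp/vemap.txt is an illustrative saved copy, not a file the tools create.

```shell
# Look up which disk pool and VLUN name correspond to a given VLUN
# serial number in a saved showvemap listing. Field positions follow
# the VIRTUAL LUN SUMMARY layout above; /tmp/vemap.txt is an
# illustrative saved copy of that output.
cat > /tmp/vemap.txt <<'EOF'
t3b00 6257334B30303148 T49152 T16384 VDRV000 55.0
t3b00 6257334B30303149 T49152 T16385 VDRV001 55.0
EOF

awk -v serial=6257334B30303149 \
    '$2 == serial { print "disk pool " $1 ", VLUN name " $5 }' /tmp/vemap.txt
```

The serial number is the same value reported by the format inquiry and luxadm shortcuts earlier in this section, which makes this lookup a convenient bridge from host device name to disk pool.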
▼ To Failback the Virtualization Engine In the event of a Sun StorEdge T3+ array LUN failover, use the following procedure to fail the LUN back to its original controller. 1.
For detailed information about the SUNWsecfg scripts, refer to the Sun StorEdge 3900 and 6900 Series Reference Manual. ▼ To Replace a Failed Virtualization Engine 1. Replace the old (failed) virtualization engine unit with a new unit. 2. Identify the MAC address of the new unit and replace the old MAC address with the new one in the /etc/ethers file: 8:0:20:7d:82:9e virtualization engine-name 3. Verify that RARP is running on the Storage Service Processor. 4.
11. Enable the switch port:
# /opt/SUNWsecfg/flib/setveport -v virtualization engine-name -e
12. Reset the virtualization engine:
# resetve -n virtualization engine-name
13. Find the initiator numbers for the new and the old unit:
# showvemap -n virtualization engine-pairname -l
The new unit will not have any zones defined.
14. If zones were present before the replacement, type the following:
# restorevemap -n virtualization engine-pairname -z \
-c old-ve-initiator-number -d new-ve-initiator-number
15.
▼ To Manually Clear the SAN Database It is occasionally necessary to manually clear the SAN database on the virtualization engine routers.
Caution – This procedure wipes out the SAN database and removes the configuration of disk pools, multipath drives, zoning, and VLUNs. After performing this procedure, the virtualization map must be restored to the virtualization engine pair using /opt/SUNWsecfg/bin/restorevemap. This requires a valid copy of the /opt/SUNWsecfg/etc/v1.san or v2.san file.
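Since restorevemap depends on a valid saved map, it is worth confirming that the .san files exist and are non-empty before clearing the database. A sketch follows; the check_san_maps function and the CFG_DIR-style override are illustrative conveniences, not SUNWsecfg features.

```shell
# Verify that saved virtualization maps exist and are non-empty before
# clearing the SAN database. cfg_dir defaults to the documented
# location; overriding it (as done here for testing) is illustrative.
cfg_dir="${CFG_DIR:-/opt/SUNWsecfg/etc}"

check_san_maps() {
    ok=yes
    for f in "$cfg_dir/v1.san" "$cfg_dir/v2.san"; do
        if [ -s "$f" ]; then
            echo "OK: $f"
        else
            echo "MISSING or empty: $f"
            ok=no
        fi
    done
    [ "$ok" = yes ]
}

check_san_maps || echo "Do not clear the SAN database yet."
```

Running the check first turns a destructive mistake (clearing the database with no map to restore) into a one-line warning.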
Stopping and Restarting the SLIC Daemon Follow this procedure to restart the SLIC daemon if it becomes unresponsive, or if messages such as the following are displayed: connect: Connection refused or Socket error encountered.
▼ To Restart the SLIC Daemon 1. Check whether the slicd daemon is running: # ps -eaf | grep slicd 2.
3. Remove the segments by typing the following:
# ipcrm -m 301 -m 302 -m 303 -s 196608 -s 196609 -s 196610
Check the ipcrm(1M) man page for details.
4. Restart the SLIC daemon:
# /opt/SUNWsecfg/bin/startslicd -n v1 (or v2, depending on configuration)
5.
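Typing the ipcrm argument list by hand is error-prone; the IDs can instead be collected from ipcs output. The sketch below is hedged: the sample ipcs lines and their column layout are illustrative, and column positions can vary between releases, so verify against real ipcs output on your system before running ipcrm.

```shell
# Collect shared-memory IDs from (sample) `ipcs -m` style output and
# build an ipcrm argument list. The sample lines and column layout are
# illustrative; check real ipcs output before running ipcrm.
cat > /tmp/ipcs.out <<'EOF'
m 301 0x4e --rw-r--r-- root root
m 302 0x4f --rw-r--r-- root root
m 303 0x50 --rw-r--r-- root root
EOF

args=$(awk '$1 == "m" { printf "-m %s ", $2 }' /tmp/ipcs.out)
echo "would run: ipcrm $args"
```

Echoing the command before running it gives a chance to confirm that only the slicd segments are being removed.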
Sun StorEdge 6900 Series Multipathing Example One Sun StorEdge T3+ array partner pair with one 500-GB RAID 5 LUN per brick (two LUNs total). Currently, there is one 10-GB VLUN created from each physical LUN, for a total of two VLUNs. In a Sun StorEdge 6900 series, there are four possible physical paths to each Sun StorEdge T3+ array volume (LUN). Refer to FIGURE 7-4 and FIGURE 7-3.
In the event of a path failure after the second tier of Sun StorEdge network FC switch-8 and switch-16 switches (or in the event of both T Ports failing between the switches), the virtualization engines force a LUN failover of the affected Sun StorEdge T3+ array and route all I/O to its secondary path. From the host side, nothing has changed; all I/O is routed through both HBAs (refer to FIGURE 7-6).
FIGURE 7-3 Primary Data Paths to the Alternate Master Chapter 7 For Internal Use Only Troubleshooting Virtualization Engine Devices 91
FIGURE 7-4 92 Primary Data Paths to the Master Sun StorEdge T3+ Array Sun StorEdge 3900 and 6900 Series Troubleshooting Guide — March 2002
FIGURE 7-5 Path Failure—Before the Second Tier of Switches
FIGURE 7-6 Path Failure—I/O Routed Through Both HBAs
Virtualization Engine Event Grid The Storage Automated Diagnostic Environment Event Grid enables you to sort virtualization engine events by component, category, or event type. The Storage Automated Diagnostic Environment GUI displays an event grid that describes the severity of the event, whether action is required, a description of the event, and the recommended action. Refer to the Storage Automated Diagnostic Environment User’s Guide Help section for more information.
TABLE 7-5 lists the Virtualization Engine Events. TABLE 7-5 Storage Automated Diagnostic Environment Event Grid for Virtualization Engine Category Component EventType Sev Action virtualization engine enclosure Alarm Yellow Volume E00012 on v1a changed mapping. virtualization engine enclosure Alarm.
TABLE 7-5 Storage Automated Diagnostic Environment Event Grid for Virtualization Engine (Continued) Category Component EventType Sev Action virtualization engine ve_diag Diagnostic Test- Red ve_diag (diag240) on ve-1 (ip=xxx.20.67.213) failed virtualization engine veluntest Diagnostic Test- Red veluntest (diag240) on ve-1 (ip=xxx.20.67.
CHAPTER 8 Troubleshooting the Sun StorEdge T3+ Array Devices This chapter contains the following sections: ■ “Explorer Data Collection Utility” on page 99 ■ “Sun StorEdge T3+ Array Event Grid” on page 109 Explorer Data Collection Utility The Explorer Data Collection Utility script is included on the Storage Service Processor in the /export/packages directory. The Explorer Data Collection Utility is not installed by default, but can be installed during rack setup.
Do not accept automatic emailing of the Explorer Data Collection Utility output unless the Storage Service Processor is properly set up to handle mail. Automatic Email Submission Would you like all explorer output to be sent to: explorer-database-americas@sun.com at the completion of explorer when -mail or -e is specified? [y,n] n Before running the Explorer Data Collection Utility, make sure that the switch and Sun StorEdge T3+ array information is added to the proper /opt/SUNWexplo/etc files.
CODE EXAMPLE 8-2 Editing Sun StorEdge T3+ array information using vi # vi t3input.txt # Input file for extended data collection # Format is HOST PASSWORD t3b0 t3b2 t3b3 XXXX XXXX XXXX :wq! Note – xxxx represents Sun StorEdge T3+ array passwords.
Troubleshooting the T1/T2 Data Path Notes
■ There are two T Port links for redundancy.
■ If one of the two links is lost, no Sun StorEdge T3+ array LUN failover occurs, and no pathing failures are noted.
■ If both T Port links fail, a Sun StorEdge T3+ array LUN failover occurs as one of the virtualization engines takes control of the I/O operations; all I/O is routed to the controlling virtualization engine.
T1/T2 Notification Events The example below shows a typical port failure event:
Site     : Lab 3286 - DSQA1 Broomfield
Source   : diag.xxxxx.xxx.com
Severity : Error (Actionable)
Category : Switch
DeviceId : switch:100000c0dd00b682
EventType: StateChangeEvent.M.port.8
EventTime: 01/30/2002 11:17:22
’port.8’ in SWITCH diag209-sw2a (ip=192.168.0.
If both T Ports go offline, you might see messages like the following. Note the virtualization engine event alerting the LUN failover.
Site     : Lab 3286 - DSQA1 Broomfield
Source   : diag.xxxxx.xxx.com
Severity : Warning (Actionable)
Category : Ve
DeviceId : ve:6257335A-30303142
EventType: AlarmEvent.
...continued from previous page...
----------------------------------------------------------------------
Site     : Lab 3286 - DSQA1 Broomfield
Source   : diag.xxxxx.xxx.com
Severity : Warning
Category : Message
DeviceId : message:diag.xxxxx.xxx.com
EventType: LogEvent.driver.Fabric_Warning
EventTime: 01/30/2002 11:50:07
Found 1 ’driver.Fabric_Warning’ warning(s) in logfile: /var/adm/messages on diag.xxxxx.xxx.com (id=809f76b4):
INFORMATION: Fabric warning
Jan 30 11:46:37 WWN:2b00006022004186 diag.xxxxx.xxx.
Sun StorEdge T3+ Array Storage Service Processor Verification
1. Run port listmap on the Sun StorEdge T3+ array to see the failover event.
t3b0:/:<1> port listmap
port  targetid  addr_type  lun  volume  owner  access
u1p1  0         hard       0    vol1    u1     primary
u1p1  0         hard       1    vol2    u1     failover
u2p1  1         hard       0    vol1    u1     failover
u2p1  1         hard       1    vol2    u1     primary
2. Compare the virtualization engine configuration to a saved configuration by running /opt/SUNWsecfg/runsecfg and choosing Verify Virtualization Engine Map.
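The failover is visible in the listmap output: a volume whose owner no longer matches the unit of its primary port has failed over. The awk sketch below is a hedged illustration; the field positions are assumed from the port listmap listing above, and /tmp/listmap.out stands in for a saved copy of that output.

```shell
# Flag failed-over volumes in saved `port listmap` output: a volume
# has failed over when the unit owning it differs from the unit of
# the port marked "primary" for it. Field positions are assumed from
# the listing above; /tmp/listmap.out is an illustrative saved copy.
cat > /tmp/listmap.out <<'EOF'
u1p1 0 hard 0 vol1 u1 primary
u1p1 0 hard 1 vol2 u1 failover
u2p1 1 hard 0 vol1 u1 failover
u2p1 1 hard 1 vol2 u1 primary
EOF

awk '$7 == "primary" && substr($1, 1, 2) != $6 {
    print $5 " has failed over to " $6
}' /tmp/listmap.out
```

With the sample data, vol2 is reported as failed over: its primary port is on unit u2, yet the owner column shows u1, which matches the LUN failover scenario described in the notes above.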
T1/T2 FRU Tests Available
■ Switch - switchtest
■ Link - linktest
Running linktest from the Storage Automated Diagnostic Environment GUI guides the service engineer in discovering the failed FRU. Once the test has completed its run, an email message similar to the following is sent to the email recipient specified in linktest. running on diag.xxxxx.xxx.
Notes
■ When inserting a loopback connector into the T Port, there will be NO green light indicating a proper insertion. However, the test will run and be valid. There is currently an RFE to address this issue.
■ If only one of the links has failed and the I/O is traveling over the remaining link, then once the failed link is replaced and recabled, I/O will automatically be routed over the repaired link by the switch. No manual intervention is required.
Sun StorEdge T3+ Array Event Grid The Storage Automated Diagnostic Environment Event Grid enables you to sort Sun StorEdge T3+ array events by component, category, or event type. The Storage Automated Diagnostic Environment GUI displays an event grid that describes the severity of the event, whether action is required, a description of the event, and the recommended action. Refer to the Storage Automated Diagnostic Environment User’s Guide for more information.
The following table lists all of the events for the Sun StorEdge T3+ array.
Category Component EventType Sev Action Description Information
t3 power.temp Alarm+ The state of power.u1pcu1.PowTemp on diag213 (ip=xxx.20.67.213) is Normal
t3 disk.port Alarm- Red Y [ Info/Action ] The state of disk.u1d1.Port1State on Sun StorEdge T3+ array t300 changed from OK to failed. Information: The Sun StorEdge T3+ array has reported that one port of a dual-ported disk has failed. Recommended action: 1.
Category Component EventType Sev Action Description Information
t3 power.battery Alarm- Red Y [ Info/Action ] The state of power.u1pcu1.BatState on diag213 (ip=xxx.20.67.213) is Fault Information: The state of the batteries in the Sun StorEdge T3+ array is not optimal. Possible causes are: 1. Voltage level on power supply and battery have moved out of acceptable thresholds. 2. The internal PCU temp has exceeded acceptable thresholds. 3. A PCU fan has failed. 1.
Category Component EventType Sev Action Description Information
t3 power.output Alarm- Red Y [ Info/Action ] The state of power.u1pcu1.PowOutput on diag213 (ip=xxx.20.67.213) is Fault Information: The state of the power in the Sun StorEdge T3+ array power cooling unit is not optimal. Recommended action: 1. Telnet to affected Sun StorEdge T3+ array. 2. Verify power cooling unit state in fru stat. 3. Replace PCU, if necessary.
t3 power.
Category Component EventType Sev Action Description Information
t3 enclosure Alarm.timeDiscrepancy Yellow [ Action ] Time of T3 diag213 (ip=xxx.20.67.213) is different from host: T3=Fri Oct 26 10:16:17 200, Host=2001-10-26 12:21:04 Recommended action: Fix the date and time on the Sun StorEdge T3+ array using the date command. Date and time should be the same as the monitoring host.
t3 enclosure Audit [ Info ] Auditing a new Sun StorEdge T3+ array called ras d2-t3b1 (ip=xxx.0.0.41) slr-mi.
Category Component EventType Sev Action Description Information
t3 ib Comm_Lost Down Y [ Info/Action ] Lost communication (InBand) with diag213 (ip=xxx.20.67.213) (last reboot was 2001-09-27 15:22:00) Recommended action: 1. Verify luxadm via command line (luxadm probe, luxadm display). 2. Verify cables, GBICs and connections along data path. 3. Check the Storage Automated Diagnostic Environment SAN Topology GUI to identify the failing segment of the data path. 4.
Category Component EventType Sev Action Description t3 t3ofdg Diagnostic Test- Red t3ofdg (diag240) on diag213 (ip=xxx.20.67.213) failed t3 t3test Diagnostic Test- Red t3test (diag240) on diag213 (ip=xxx.20.67.213) failed t3 t3volverify Diagnostic Test- Red t3volverify (diag240) on diag213 (ip=xxx.20.67.213) failed t3 enclosure Discovery [ Info ] Discovered a new Sun StorEdge T3+ array called ras d2-t3b1 (ip=xxx.0.0.41) slr-mi.370-399001-e-e1.
Category Component EventType Sev Description
t3 power Insert Component [ Info ] ’power.u1pcu2’ (TECTROL-CAN.300145401(50).008275) was added to T3 diag213 (ip=xxx.20.67.213)
t3 enclosure Location Change Location of t3 rasd2-t3b0 (ip=xxx.0.0.40) was changed
t3 enclosure QuiesceEnd Quiesce End on t3 d2-t3b1 (ip=xxx.0.0.41)
t3 enclosure QuiesceStart Quiesce Start on t3 d2-t3b1 (ip=xxx.0.0.41)
t3 enclosure Removal Monitoring of t3 d2t3b1 (ip=xxx.0.0.
Category Component EventType Sev Action Description Information
t3 disk Remove Component Red Y [ Info/Action ] disk.u2d3 (SEAGATE.ST318203FSUN18G.LRG07139) was removed from diag158 (ip=xxx.20.67.158) Information: The Sun StorEdge T3+ array has reported a disk has been removed from the chassis. Recommended action: Replace the disk within the 30-minute power shutdown window.
t3 interface.
Category Component EventType Description Information
t3 disk State Change+ disk.u1d5 in Sun StorEdge T3+ array rasd3-t3b1 (ip=xxx.0.0.41) is now Available (status-state changed from fault-disabled to ready-enabled)
t3 interface.loopcard State Change+ [ Info ] loopcard.u1l1 (SLR-MI.375-0085-01G-G4.070924) in T3 msp0-t3b0
t3 volume State Change+ ’volume.u1vol1 (slr-mi.370-399001-ef0.022542.u1vol1) in T3 dvt2-t3b0 (ip=192.168.0.
Category Component EventType Sev Action t3 controller State Change- Red Y Description [ Info/Action ] controller.u1ctr in T3 diag213 (ip=xxx.20.67.213) is now Not-Available (status-state changed from unknown to ready-disabled) Information: The Sun StorEdge T3+ array controller has been disabled. t3 disk StateChange- Red Y [ Info/Action ] disk.u1d5 in T3 rasd3-t3b1 (ip=xxx.0.0.41) is now Not-Available (status-state changed from unknown to fault-disabled).
Category Component EventType Sev Action Description Information t3 interface. loopcard StateChange- Red Y [ Info/Action ] Recommended action: Information: 1. Telnet to the affected Sun StorEdge T3+ array. 2. Verify loopcard state with fru stat 3. Verify matching firmware with other loopcard. 4. Re-enable loopcard if possible (enable u(encid)|[1|2|] 5. Replace the loopcard if necessary. The Sun StorEdge T3+ array has indicated that the loopcard is no longer in an optimal state.
Category Component EventType Sev Action Description Information
t3 volume StateChange- Red Y [ Info/Action ] Information: The Sun StorEdge T3+ array has reported that a power cooling unit has been disabled. Recommended action: 1. Check the Sun StorEdge T3+ array syslog for battery hold times. 2. If < 6 minutes, replace the battery, or the entire PCU, as required.
t3 power StateChange- Red Y [ Info/Action ] power.u1pcu2 (TECTROL-CAN.300145401(50).
t3 enclosure Statistics
Replacing the Master Midplane Follow this procedure when replacing the master midplane in a Sun StorEdge T3+ array. This procedure is detailed in the Storage Automated Diagnostic Environment User’s Guide. ▼ To Replace the Master Midplane 1. Choose Maintenance --> General Maintenance --> Maintain Devices. Refer to Chapter 3 of the Storage Automated Diagnostic Environment User’s Guide. 2. In the Maintain Devices window, delete the device that is to be replaced. 3.
CHAPTER 9 Troubleshooting Ethernet Hubs The Sun StorEdge 3900 and 6900 series uses an Ethernet hub as the backbone for the internal service network.
APPENDIX A Virtualization Engine References This Appendix contains the following Tables: ■ Table A-1 “SRN and SNMP Reference” ■ Table A-2 “SRN/SNMP Single Point of Failure Table” ■ Table A-3 “Port Communication” ■ Table A-4 “Service Codes” TABLE A-1 provides an explanation of Service Request Numbers for the virtualization engine. TABLE A-1 SRN and SNMP Reference SRN Description Corrective Action 1xxxx Disk drive Check Condition status. xxxx is the Unit Error Code.
TABLE A-1 126 SRN and SNMP Reference SRN Description Corrective Action 70005 Write error is detected by master. If the initiator is master, then it has detected a write error on a member within a mirror drive. If a spare drive is available, it will be brought in and used to replace the failed drive. If no spare is available, replace the failed drive with a new drive. 70006 virtualization engine-to-virtualization engine communication has failed. Internal error. Update firmware.
TABLE A-1 SRN and SNMP Reference SRN Description Corrective Action 7009A Read degrade recorded. A mirror drive was written to, causing it to enter the degrade state. Reinsert the missing drive, or replace it with a drive of equal or greater capacity. 7009B Write degrade recorded. If a spare drive is available, it will be brought in and used to replace the failed drive. The removed drive needs to be (if good) reinserted or (if bad) replaced. 7009C Last primary failed during rebuild.
TABLE A-1 SRN and SNMP Reference (Continued)
SRN Description Corrective Action
72005 Failed to check for SAN changes.
72006 Failed to read SAN event log.
72007 SLIC daemon connection is down. Wait 1-5 minutes for the backup daemon to come up. If it does not, check the network connection for a virtualization engine halt or hardware failure.
TABLE A-2 SRN/SNMP Single Point of Failure Table
TABLE A-3 Port Communication Port Port Port Number Daemon Management Programs 20000 Daemon Daemon 20001 Daemon virtualization engine 25000 virtualization engine virtualization engine 25001 TABLE A-4 provides service codes for the virtualization engine. TABLE A-4 Service Codes Code Number Cause Corrective Action 005 PCI bus parity error. • Replace virtualization engine. 24 The attempt to report one error resulted in another error. • Cycle power to the virtualization engine.
TABLE A-4 Service Codes 54 Unauthorized cabling configuration. • Check cabling. 57 Too many HBAs attempting to log in. • Check cabling. 60 Node mapping table cleared using SW2. • No action required. 62 Improper SW2 setting. • Correct SW2 setting. • Cycle virtualization engine power. 126 Too many virtualization engines in SAN. • Remove the extra virtualization engine. • Cycle virtualization engine power. 130 Heartbeat connection between virtualization engines is down. • Correct problem.
APPENDIX B SUNWsecfg Error Messages The Sun StorEdge 3900 and 6900 Series Reference Manual lists and defines the command utilities that configure the various components of the Sun StorEdge 3900 and 6900 series storage systems. The information in this appendix expands on that information by providing recommendations for corrective action, should you encounter errors with the command utilities.
. TABLE B-1 Virtualization Engine SUNWsecfg Error Messages Message Description and Cause of Error Suggested Action Common to virtualization engines Invalid virtualization engine pair name $vepair, or virtualization engine is unavailable. Confirm that the configuration locks are set. This is usually due to the savevemap command running. Try ps -ef | grep savevemap or listavailable -v (which returns the status of individual virtualization engines).
TABLE B-1 Virtualization Engine SUNWsecfg Error Messages (Continued) Message Description and Cause of Error Suggested Action Common to virtualization engine 1. Device-side operating mode is not set properly. 2. Device-side UID reporting scheme is not set properly. 3. Host-side operating mode is not set properly. 4. Host-side LUN mapping mode is not set properly. 5. Host-side Command Queue Depth is not set properly. 6. Host-side UID distinguish is not set properly. 7. IP is not set properly. 8.
TABLE B-1 Virtualization Engine SUNWsecfg Error Messages (Continued) Message Description and Cause of Error Suggested Action createvezone Invalid WWN $wwn on $vepair initiator $init, or virtualization engine is unavailable. WWN that has already been specified has a SLIC zone and/or an HBA alias assigned. Note that for a WWN to be available for createvezone, the zone name in the map file (showvemap -n ve_pairname) must be “undefined” and the online status should be “yes.
TABLE B-2 Sun StorEdge Network FC Switch-8 and Switch-16 Switch SUNWsecfg Error Messages Message Description and Cause of Error Suggested Action Common Switch Sun StorEdge system type entered, ${cab_type}, does not match system type discovered, ${boxtype}. Either call the command with the -f force option to force the series type, or do not specify the cabinet type (no -c option). Common Switch 1. Unable to obtain lock on switch ${switch}. Another command is running. 1.
TABLE B-2 Sun StorEdge Network FC Switch-8 and Switch-16 Switch SUNWsecfg Error Messages (Continued) Message Description and Cause of Error Suggested Action setswitchflash Invalid flash file $flashfile. Check the number of ports on switch $switch. You might be attempting to download a flash file for an 8-port switch to a 16-port switch. Check showswitch -s $switch and look for “number of ports.” Ensure that this matches the second and third characters of the flash file name; for example: m08030462.fls.
TABLE B-3 Sun StorEdge T3+ Array SUNWsecfg Error Messages Message Description and Cause of Error Suggested Action Common to Sun StorEdge T3+ array Present configuration does not match Reference configurations Check the present Sun StorEdge T3+ array configuration with showt3 -n command and verify whether the configuration is corrupted or has changed. If it is not one of the standard configurations, restore the configuration using the restoret3config command. Common to Sun StorEdge T3+ array 1.
TABLE B-3 Sun StorEdge T3+ Array SUNWsecfg Error Messages (Continued) Message Description and Cause of Error Suggested Action checkt3config Snapshot configuration files are not present. Unable to check configuration. Make sure that the snapshot files are saved and have read permissions in the /opt/SUNWsecfg/etc/t3name/ directory. If the snapshot files are not available, create them by using the savet3config command. checkt3mount 1. The $lun status reported a bad or nonexistent LUN.
TABLE B-3 Sun StorEdge T3+ Array SUNWsecfg Error Messages (Continued) Message Description and Cause of Error Suggested Action restoret3config Error while the block size compare command is executing. The $BRICK_IP{$IPADD} command is aborted. The Sun StorEdge T3+ array block size parameter is different from the snapshot file. The Sun StorEdge T3+ array may have been reconfigured. Run restoret3config.
TABLE B-4 Other SUNWsecfg Error Messages Message Description and Cause of Error Suggested Action Common to all components If the Sun StorEdge 3900 or 6900 series has multiple (more than two) failures (for example, both virtualization engines and two switches are down), the getcabinet tool might not determine the correct cabinet type. In this example, the getcabinet script might determine the device to be a Sun StorEdge 3900 series when, in reality, it is a Sun StorEdge 6900 series.
setupswitch Exit Values TABLE 9-1 lists the setupswitch exit values. The associated messages are logged in the /var/adm/log/SEcfglog log file.
TABLE 9-1 setupswitch Exit Values
Exit Value  Severity Level (Message Type)  Message  Meaning
0  INFO  All switch settings are properly set.  The switch setting matches the default configuration.
1  ERROR  Errors occurred while trying to set the proper switch settings.  The switch setting does not match the default configuration or any valid alternatives.
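A small wrapper can turn the exit value into its logged severity, which is handy in scripts that call setupswitch. The sketch below is illustrative: the message strings paraphrase TABLE 9-1, and because setupswitch exists only on the Storage Service Processor, the function takes the exit code as a parameter rather than invoking the command.

```shell
# Map a setupswitch exit value to the severity logged in
# /var/adm/log/SEcfglog. Message strings paraphrase TABLE 9-1; the
# function takes the exit code as a parameter since setupswitch is
# only present on the Storage Service Processor.
describe_setupswitch_exit() {
    case "$1" in
        0) echo "INFO: switch settings match the default configuration" ;;
        1) echo "ERROR: settings do not match the default configuration" ;;
        *) echo "UNKNOWN: unexpected exit value $1" ;;
    esac
}

describe_setupswitch_exit 0
describe_setupswitch_exit 1
```

In a real script, the call would follow the command itself, for example: setupswitch; describe_setupswitch_exit $?.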
Index A accessing documentation online, xv C checkswitch used to diagnose and troubleshooting switch, 62 comments sending documentation comments, xv configuration settings, 47 verification of, 47 D data host verification for Sun StorEdge 39x0 series, 42 for Sun StorEdge 69x0 series, 42 diagrams fibre channel link, 15, 16 documentation how book is organized, xi shell prompts, xiii using UNIX commands, xii E ethernet hubs related documentation, 123 troubleshooting, 123 event grid host, 53 Explorer Data C
H health functions for Sun StorEdge 3900 and 6900 series, 2 host device names translating, 78 host devices troubleshooting, 53 host event grid, 53 host side troubleshooting, 18 Predictive Failure Analysis, 2 problem isolation, 15 Q quiesce IO, 13 S I IO suspension of, 10, 13 isolation procedures for A2/B2 link, 33 L link error example of severe data host error, 24 lock file how to clear, 50 luxadm(1M) used to display information, 12 M monitoring functions for Sun StorEdge 3900 and 6900 Series, 2 N not
notification events, 103 T1/T2 data path troubleshooting, 102 test examples command line, 19 qlctest(1M), 19 switchtest(1M), 20 thresholds used in PFA, 2 troubleshooting broad steps, 3 check status of Sun StorEdge T3+ array, 4 check status of the Sun StorEdge FC Network Switch-8 and Switch-16, 5 check status of the virtualization engine, 5 determine extent of the problem, 4 discovering the error, 4 ethernet hubs, 123 general procedures, 3 host side and service processor side, 18 quiesce IO, 5 Sun StorEdge T