Chapter 08 Crash Dumps HP-UX Handbook Revision 13.
October 29, 2013

TERMS OF USE AND LEGAL RESTRICTIONS FOR THE HP-UX RECOVERY HANDBOOK

ATTENTION: PLEASE READ THESE TERMS CAREFULLY BEFORE USING THE HP-UX HANDBOOK. USING THESE MATERIALS INDICATES THAT YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT THESE TERMS, DO NOT USE THE HP-UX HANDBOOK. THE HP-UX HANDBOOK HAS BEEN COMPILED FROM THE NOTES OF HP ENGINEERS AND CONTAINS HP CONFIDENTIAL INFORMATION.
TABLE OF CONTENTS

A little bit of theory
    Crash events
    What happens when a system crashes?
How to configure dump devices
    Choosing dump devices
Have you ever had a system hang or crash unexpectedly? This chapter explains how to configure a system for crash dumps, how to install the dump analysis tools, and how to use them to quickly isolate the cause of the problem.

A little bit of theory

When the system crashes, HP-UX tries to save the image of physical memory (core), or certain portions of it, to predefined locations called dump devices.
There are three types of crash events: PANIC, TOC and HPMC.

PANIC

The crash event type panic refers to crashes initiated by the HP-UX operating system (software crash event). We differentiate between direct and indirect panics.
Getting an HPMC does not always mean that the hardware is at fault. The HPMC tombstone needs to be analyzed to determine whether the hardware was really at fault. Software defects can also result in HPMC crash events, but this is very rare in production-quality software.
executing at the time of the HPMC/TOC event. Once the state has been saved, the operating system goes on to dump physical memory to the dump device.

Software crash events

A software crash event occurs when the panic() routine is called. This can be either a direct or an indirect panic. For a software crash event, the PDC and PIM are not involved at all. As such, the first thing the panic() routine does is save the processor state into the RPB structure.
reflection of the register state in PIM, since the information was copied from it. On rare occasions the RPB values may not seem 'right'. If this is the case, it is better to use the register values in the PIM data as the starting point for analysis.
How to configure dump devices

In order to understand the following text you should be familiar with the basic concepts of the Logical Volume Manager (LVM). I make use of these abbreviations:

VG = Volume Group
LV = Logical Volume
PV = Physical Volume

Choosing dump devices

Dump devices are volumes on disk that are used to hold the entire memory image when the system crashes.
local filesystem (by the rc command savecrash). If the dump device is also the primary swap, savecrash cannot run in the background because the swap area may be used during further startup. 2) If there were any problems with savecrash (for example, lack of space in the crash directory), you can still run it again after the system boot has completed (option -r, resave dump).
/dev/dsk/c0t6d0 (10/0.6.0) -- Boot Disk
/dev/dsk/c0t5d0 (10/0.5.
NOTE: Whenever you have dump devices that are not also used for swap, make sure that they are configured last. This causes them to be used first (the dump is written from the end backward), which minimizes the chance of writing into an area shared with swap. Writing into swap space is undesirable because it slows down reboot processing (see section above).
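The ordering rule in this note can be illustrated with a small Python sketch. It assumes, as the note states, that the dump fills the configured devices from the last entry backward. The device name /dev/vg01/lvdump and the shared_with_swap flag are hypothetical and only serve the illustration.

```python
# Sketch only: models the rule "dedicated dump devices are configured last".
# /dev/vg01/lvdump and the shared_with_swap flag are hypothetical examples.

def configuration_order(devices):
    """Return devices ordered so that dedicated dump devices come last.

    devices: list of (name, shared_with_swap) tuples.
    sorted() is stable, so the relative order inside each group is kept.
    """
    # False sorts before True, so swap-shared devices end up first.
    return sorted(devices, key=lambda d: not d[1])


def dump_fill_order(configured):
    """The dump is written from the end of the configured list backward."""
    return list(reversed(configured))


devices = [
    ("/dev/vg01/lvdump", False),  # dedicated dump device (hypothetical)
    ("/dev/vg00/lvol2", True),    # primary swap that is also a dump device
]
configured = configuration_order(devices)
fill_order = dump_fill_order(configured)
print([name for name, _ in fill_order])
```

In this model the dedicated device is filled first, and the swap-shared device is touched only if the dump does not fit on the dedicated one.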
COREDIR (Version 1)

This format, used in HP-UX 10.10, 10.20, and 10.30, consists of a core.n directory containing an INDEX file, the kernel (vmunix) file, and numerous core.n.m files, which contain portions of the physical memory image.

CRASHDIR (Version 2)

This format, used in HP-UX 11.00, consists of a crash.n directory containing an INDEX file, the kernel and all dynamically loaded kernel module files, and numerous image.X.
------------  ----------  ----------  ------------  ----------------
 31:0x006000       72544      524288   64:0x000002  /dev/vg00/lvol2
                          ----------
                              524288

Compressed dumps

Even with the selective dump feature, a Superdome equipped with 256 GB of RAM would take hours to write the dump to the dump devices. The bottleneck when copying system memory to disk is the I/O path.
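A rough back-of-the-envelope calculation shows why the I/O path dominates. The Python sketch below estimates the write time for a 256 GB memory image. The throughput of 20 MB/s and the 4:1 compression ratio are assumed values chosen only for illustration; they are not measurements from this handbook.

```python
def dump_time_hours(memory_gb, throughput_mb_per_s, compression_ratio=1.0):
    """Estimate the hours needed to write a dump.

    memory_gb: amount of memory to write, in GB (1 GB = 1024 MB here).
    throughput_mb_per_s: sustained write rate of the dump path (assumed).
    compression_ratio: uncompressed size / compressed size (assumed).
    """
    if throughput_mb_per_s <= 0 or compression_ratio <= 0:
        raise ValueError("throughput and compression ratio must be positive")
    megabytes_to_write = memory_gb * 1024 / compression_ratio
    return megabytes_to_write / throughput_mb_per_s / 3600


# 256 GB at an assumed 20 MB/s, without and with an assumed 4:1 compression.
full = dump_time_hours(256, 20)
compressed = dump_time_hours(256, 20, compression_ratio=4.0)
print(f"uncompressed: {full:.1f} h, compressed: {compressed:.1f} h")
```

Under these assumptions the write time scales inversely with the compression ratio, which is why reducing the amount of data sent down the I/O path pays off directly.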
CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  ------------------------------
UNUSED      3645411  no,  by default   unused pages
USERPG         7113  no,  by default   user process pages
BCACHE       210990  no,  by default   buffer cache pages
KCODE          2670  no,  by default   kernel code pages
USTACK          264  yes, by default   user process stacks
FSDATA          116  yes, by default   file system metadata
KDDATA        68736  yes, by default   kernel dynamic data
KSDATA       259004  yes, by default   kernel static
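The effect of the default class selection can be worked out from the page counts in the table above. The following Python sketch sums the classes that are included by default and converts the result to Kbytes. The page size of 4096 bytes is an assumption here (it matches the page size reported in the crashinfo example later in this chapter, which comes from a different system).

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KB physical pages

# (class, pages, included in dump by default), copied from the table above
PAGE_CLASSES = [
    ("UNUSED", 3645411, False),
    ("USERPG", 7113, False),
    ("BCACHE", 210990, False),
    ("KCODE", 2670, False),
    ("USTACK", 264, True),
    ("FSDATA", 116, True),
    ("KDDATA", 68736, True),
    ("KSDATA", 259004, True),
]


def dump_summary(classes, page_size=PAGE_SIZE_BYTES):
    """Sum all pages and the pages selected for the dump."""
    total_pages = sum(pages for _, pages, _ in classes)
    dumped_pages = sum(pages for _, pages, included in classes if included)
    return {
        "total_pages": total_pages,
        "dumped_pages": dumped_pages,
        "dumped_kb": dumped_pages * page_size // 1024,
        "dumped_fraction": dumped_pages / total_pages,
    }


summary = dump_summary(PAGE_CLASSES)
print(summary)
```

With these figures the default selection covers 328120 of 4194304 pages, roughly 8% of physical memory.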
Saving the dump to the filesystem

After the system has finished writing the whole dump, or only parts of it, to the dump devices, it reboots automatically. During boot, the system runs an rc script to copy the dump into the file system. As of UX 11.00 the rc script is /sbin/init.d/savecrash. Its configuration file is /etc/rc.config.d/savecrash. The default location is /var/adm/crash with sub directories named crash.
There is also the possibility to save the dump directly to a DDS tape:

# savecrash -v [-r] -t /dev/rmt/0m
Analysis of the dump

A complete analysis of a crash dump requires deep internal knowledge and much experience, which goes beyond the scope of this document. Here I'd like to explain how to use the utility crashinfo in order to narrow down the cause of the crash. If you would like to examine the dump yourself, please refer to the excellent online web course offered by the Expert Center. This course should be considered the starting point for any dump analysis.
that) they get decompressed automatically during the execution of crashinfo. This can take a while. Be sure to have enough space left in the crash directory. With the help of the web course mentioned above it should be possible to solve most problems. However, in some cases you might need information that goes beyond the standard output of crashinfo.
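The advice to keep enough free space in the crash directory can be checked before starting crashinfo. The Python sketch below uses the standard library call shutil.disk_usage. The helper name has_enough_space and the required size are illustrative assumptions, since the size of the decompressed dump depends on the system, and whether a Python interpreter is available on the analysis host is not covered by this chapter.

```python
import shutil


def has_enough_space(directory, required_bytes):
    """Return True if the filesystem holding directory has required_bytes free."""
    free = shutil.disk_usage(directory).free
    return free >= required_bytes


# /var/adm/crash is the default location named in this chapter. "." is used
# here so that the sketch also runs on systems without that directory.
print(has_enough_space(".", 1024 * 1024))
```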
Provide the following:

swlist -l product >swlist.out    (currently installed software & patches)
/var/adm/syslog/OLDsyslog.log    (the syslog from the previous boot)

Additionally in case of a TOC, i.e.
for Serviceguard TOCs:

    Send_Monarch_TOC+0x58
    safety_time_check+0x188
    per_spu_hardclock+0x318
    clock_int+0x60
    mp_ext_interrupt+0x130
    ihandler+0x904

the other CPUs are usually spinning on the safety timer lock and have this stack trace:

    preArbitration+0x2ec
    wait_for_lock+0x120
    sl_retry+0x1c
    safety_time_check+0xfc
    per_spu_hardclock+0x4f8
    clock_int+0x10c
    mp_ext_interrupt+0x180
    ihandler+0x90c

for "kalloc" panics:

    panic+0x10
    kalloc+0x174
    kmalloc+0x1a8

or

    panic+0x10
    kal
crash event was a TOC

    PCM_wait_for_TOC+0x0
    printf+0x6c
    too_much_time+0x2e0
    wait_for_lock+0x14c
    sl_retry+0x1c
    unselect+0x1c
    invoke_callouts_for_self+0xc0
    sw_service+0xb0
    mp_ext_interrupt+0x144
    ivti_patch_to_nop3+0x0
    idle+0x6a8
    swidle_exit+0x0

crash event was a TOC

    preArbitration+0x280
    wait_for_lock+0x110
    sl_retry+0x1c
    issig+0x64
    _sleep_one+0x678
    semop+0x304
    syscall+0x200
    $syscallrtn+0x0

Analysis beyond standard crashinfo output

crashinfo’s options

crashinfo has som
flags: bucket= arena= count= leak cor log parse -kmeminfo

Refer to the crashinfo homepage in order to get more information on the usage.

Working with the P4 debugger

From within the dump directory execute p4:

$ p4
Send bugs, remarks, ideas, and enhancements regarding ktools at
http://ktools.hp.com/~ktools/wrts/bin/wrts_forms.pl?PROD=ktools/dump_access/p4
Web based p4 at http://ktools.france.hp.
suspicious trap addr, try to resync with ss_rp=0x277a48
sendfile_rele+0x318
...
...
$ Time
0x3d2d41de : Thu Jul 11 10:29:18 2002

$ Crashconf -v
CLASS     PAGES      INCLUDED IN DUMP
--------  ---------  ----------------
UNUSED        24611  no,  by default
USERPG        95002  no,  by default
BCACHE       162582  no,  by default
KCODE          1908  no,  by default
USTACK         1440  yes, by default
FSDATA         1258  yes, by default
KDDATA        25286  yes, by default
KSDATA        15593  yes, by default

Total pages on system:
Total pages included in dump:

DEVICE        OFFSET(Kb)
------------  ----------
 28:0x030000      101216
Total
Shared Memory:
m 0 0x411057d6
m 1 0x4e100002
m 2 0x41142787
m 3 0x5011e167
m 9220 0x0c6629c9
...
--rw-rw-rw--rw-rw-rw--rw-rw-rw--r--r--r---rw-r-----
root root root root root
root root root other root

$ Processes
Loaded 4116 proc_t entries in 'DefaultView'

$ keep p_stat (UX 10.X and 11.
Print value at address 0x023ff070:

$ p i4 0x023ff070 0x023ff070 0x023ff070 : 0x023e95f0

I.e.
6420
$ d nproc
6420
$ d vxfs_ninode
128000

NOTE: dec, hex and Let are aliases for the p4_let(1) command.
crashinfo output example

crashinfo (3.
0x000000000b7e22a0 0x00126348 idle+0x1000
0x000000000b7e2050 0x00128adc swidle+0x20

Stack Traces for other processors
=================================

Processor #1
==============

EVENT
============================
= Event #1 is TOC on CPU #1
= p crash_event_t 0x22030
= p rpb_t 0xcac370
= Using pc from pim.wide.
Create STCP device files
Starting the STREAMS daemons-phase 2
B2352B/9245XB HP-UX (B.11.00) #1: Wed Nov 5 22:38:19 PST 1997

Memory Information:
    physical page size = 4096 bytes, logical page size = 4096 bytes
    Physical: 3145728 Kbytes, lockable: 2374088 Kbytes, available: 2731304 Kbytes

==================
= Memory Globals =
==================
Physical Memory      = 786432 pages (3.00 GB)
Free Memory          = 676440 pages (2.
Average Free Memory  =
desfree              =
minfree              =
Name Description Address Link IP Address
--------------------------------------------------------------------------------
lan0 0 btlan3 100BT PCI Built-in 0x00306e26c1ac UP n/c n/c

n/c : means "Not Configured", ifconfig has not been done on this interface
If you want more information, you can use : "lanshow -f"

====================
= IOVA Usage Check =
====================
99% of IOVA still available/free.
=================
= Syswait Array =
=================
cpu iowait
--- ------
  1      1

Note: This shows the number of threads waiting on buffer I/O. First figure out how long the I/O is outstanding. A good way to do so is by searching in the threads list for processes that have a waitchannel like biowait, ogetblk or swbuf. As a rule of thumb, only consider I/O's outstanding longer than 30 seconds (your mileage may vary). For more information go to: "http://teams3.
--------------------  ----  ----------  ----------
vx_inactive_thread()    50        6271         171
lvmkd_daemon()           6         225         225
wait1()                  3         239           0
biowait()                2           0           0

TICKS SINCE MIGR
---------
171 203

NREADY FR LO AL COMMAND
-- -- -- -------
0 0 0 0 0 0

Idle Globals
============
candidate_idle_spu = 0
migration_cycles = 0

Running Threads (TSRUNPROC) and idle Processors
===============================================
TICKS SINCE TID PID PPID RUN
------- ----- ----- ----------
TICKS SINCE IDLE PRI SPU STAT
487 430 100 226
486 429   1 237
157 100   1 239
 52  33   0 274
 73  33   0 276 vx_inactive_thread_sv+0x8
 50  33   0 277
  0   0   0 284
 71  33   0 316 vx_inactive_thread_sv+0x8
 69  33   0 319 vx_inactive_thread_sv+0x8
... ... ...
PHKL_20989 - 11.00 Cumulative dump device, dump size patch
PHKL_20173 - 11.00 Include zero page in dumps
PHKL_20915 - 11.00 trap-related panics/hangs
PHCO_26188 - 11.00 savecrash(1M) cumulative patch
PHCO_20196 - 11.00 savecrash startup files cumulative patch
PHCO_19726 - 11.00 crashconf(1M) cumulative patch

UX 11.11:

PHKL_27918 - 11.11 EPIC debug info
PHKL_32715 - 11.11 crash,vpars,timeout;SG TOC,nParCnfg,shutdown
PHKL_28237 - 11.
Additional information

Dump reading web course:
http://teams3.sharepoint.hp.com/teams/esssupport/InsideESSSupport/InsideWTEC/HPUXKERNEL/crash/FirstPassWeb/index.htm (HP internal)

Dump reading web course for Itanium systems:
http://teams3.sharepoint.hp.com/teams/esssupport/InsideESSSupport/InsideWTEC/HPUXKERNEL/crash/FirstPassWeb/IA/contents.htm (HP internal)

Dump reading web course for PA-RISC systems:
http://teams3.sharepoint.hp.