Basic System Problem Analysis
Bill Cadier
Hewlett-Packard Co.
Table of Contents
Commonly Used Macros
Process Management Structures
Process Management Structures: continued
Job/Session Management Structures
Case Study: Hang Conclusion
Introduction
• commonly used DAT macros
• some OS structures and their types
• PA-RISC registers
• short vs.
Introduction This paper is being presented at the West Coast HP3000 Solution Symposium in San Jose, 25 April 2003. The purpose of this paper is to provide basic information on how to diagnose system aborts and hangs. As the HP3000 winds down it will be advantageous for owners of this system to be able to perform as much troubleshooting as possible. The amount of troubleshooting will be limited because source code for the OS is not available outside HP.
Commonly Used Macros
• sys_abort
• pm_ptree
• pm_family
• pm_errors
• pm_fpib
• ui_showjob
• ui_cihistory
• ui_showvar
• fs_open_files
• fs_file
• fs_find_gufd_entry
• dcx
• io_ios_diag_log
• rm_format_sirs
• rm_semaphore
• rm_sem_deadlock
• process_dispatcher
• process_wait
• vs_page_info
• mm_page_info
• mm_active_io
• mm_completed_io
• tbl_info
3/17/2003 Basic System Problem Analysis
Commonly Used Macros This is by no means a comprehensive list of the macros available in the OS macro set, but these are some of the more commonly used ones. The MACLIST (MACL) command can be used to list all current macros once they have been restored. Many of the macros listed will be second-level macros, that is, macros called by other macros, and so would be of limited value on their own. Use the HELP command to see the source for a given macro, for example HELP PM_FPIB.
Process Management Structures
• PIB: process information block, type “pib_type”
• PIBX: process information block extension, type “pibx_type”
• PCB: process control block (CM), type “pcb_type”
• PCBX: process control block extension (CM), type “pcbx_type”
Process Management Structures These are the fundamental process management structures and their types. DEBUG, DAT and SAT provide functions that return pointers to these structures.
Process Management Structures: continued Two other fields in the PIB worth noting are the PIB_TRAP_PC and PIB_TRAP_ISM. These two fields are used for certain types of process traps. The PC (program counter) of the trap and the interrupt stack marker (ISM) active at the time of the trap are loaded into these fields.
Job/Session Management Structures
• JMAT: job master table, type “jmat_entry_type”
• JIT: job information table, type “jit_entry_type”
• JDT: job directory table, type “jdt_header_type”
Job/Session Management Structures The JMAT, or job master table, is what is displayed by the SHOWJOB CI command, and there is an equivalent OS macro, UI_SHOWJOB. Like its CI counterpart, the macro displays all jobs and sessions, or a specific job or session when a string with the “#Jnnn” or “#Snnn” value is supplied. The JIT and JDT are compatibility mode data segments (DST), but all CM DSTs are objects and have NM virtual addresses.
Job/Session Management Structures: continued The same thing can be accomplished using the native mode types: $1e1 ($21d) nmdat > fv pcbx(pin) 'pcbx_type.
File System Structures
• PLFD: process local file descriptor (file handle), type “plfd_type”
• GDPD: global data pointer descriptor (file pointers), type “gdpd_t”
• GUFD: global unique file descriptor, type “gufd_t”
• FLAB: file label, type “flab_t”
File System Structures These structures are but the tip of the iceberg when it comes to the file system! The PLFD is a file handle: whenever a process has a file, socket or pipe opened, that entity will occupy a slot in the PLFD table. The PLFD structure will contain pointers to the GDPD for the file and to the GUFD for the file, if there is one. The PLFD is also where we keep the “type manager control block”, which is an area used by the type manager bound to the file at open time.
File System Structures: continued The GUFD structure is also retained in most cases when a process closes a file. In other words, if a process is the last accessor of a disk file and closes it, we do not release the GUFD; rather, it is appended to a least recently used (LRU) list. If the file is re-opened, chances are the GUFD will still be on that list and we can simply pull it off the LRU and use it, making the file open process quicker. The GUFD structure contains the virtual address of the file.
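The retain-and-reuse idea described above can be sketched in a few lines of Python. This is only a toy model and not the MPE/iX implementation: the class name ClosedDescriptorLru, its method names and the fixed capacity are assumptions made for illustration (the text does not say how the real list is bounded).

```python
from collections import OrderedDict


class ClosedDescriptorLru:
    """Toy model of an LRU list holding descriptors of closed files."""

    def __init__(self, capacity):
        self.capacity = capacity          # hypothetical bound on the list
        self._lru = OrderedDict()         # file name -> descriptor, oldest first

    def close_file(self, name, descriptor):
        """The last accessor closed the file: keep the descriptor around."""
        self._lru[name] = descriptor
        self._lru.move_to_end(name)       # newest entry goes to the tail
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)  # discard the least recently used

    def open_file(self, name):
        """Return the retained descriptor, or None if it must be rebuilt."""
        return self._lru.pop(name, None)


lru = ClosedDescriptorLru(capacity=2)
lru.close_file("TESTFILE.PUB.AP", {"file_vir_addr": "2e4.0"})
print(lru.open_file("TESTFILE.PUB.AP"))   # found on the list: fast re-open
print(lru.open_file("TESTFILE.PUB.AP"))   # already taken off the list: None
```

The point of the sketch is only that a close does not discard the descriptor, and that a later open first looks for it on the list before doing any more expensive work.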
File System Structures: continued Finding the GUFD of an opened or closed file: GUFDs for opened files are kept on a HASH_LINK from Storage Management Globals (KSO #210, type “SM_GLOBAL_REC”), and GUFDs for closed files are kept on a least recently used (LRU) list, also from SM Globals.
File System Structures: continued
:listfile XL.PUB.SYS,-3
********************
FILE: XL.PUB.SYS
FILE CODE : 1032
BLK FACTOR: 1
REC SIZE: 256(BYTES)
BLK SIZE: 256(BYTES)
EXT SIZE: 0(SECT)
NUM REC: 77293
NUM SEC: 77824
NUM EXT: 41
MAX REC: 4096000
NUM LABELS:
MAX LABELS:
DISC DEV #:
SEC OFFSET:
VOLNAME :
FOPTIONS: BINARY,FIXED,NOCCTL,STD
CREATOR : MANAGER.
Virtual Space Management Structures
• VSOD: virtual space object descriptor, type “vs_od_type”
• Cache entry: type “cache_entry_type”
• B-tree’s:
– extent b-tree, type “b_tree_root_type”
– extent A/R (variable access rights) b-tree, same type
Virtual Space Management Structures There are two VSOD tables, one for files and one for everything else. Everything that has a virtual address has an entry in one of the two VSOD tables. These tables are:
VSOD/GUFD table, KSO #201
VSOD table, KSO #53
Both tables consist of entries whose type is “VS_OD_TYPE”. The difference is that the VSOD/GUFD table, KSO #201, contains both VSOD entries and GUFD entries adjacent to one another.
Virtual Space Management Structures: continued VSM uses VPN (virtual page number) cache entries for portions of objects that are either in memory or on the way into memory. These cache entries provide a fast means of linking the VSM structures, such as the VSOD, with the physical page addresses that the object occupies in real memory. The VAINFO function is useful for finding out information about objects; however, certain information is unavailable in DEBUG. See HELP VAINFO for more details.
Memory Management Structures
• HPDIR: hashed page directory, type “hpdir_rec”
• IPDIR: indexed page directory, known system object (KSO) 3, type “ipdir_rec”
• MIB: memory management information block, type “mib_type”
• Memory Management Globals, known system object (KSO) 4, type “mm_global_info_rec”
Memory Management Structures Memory management dovetails with Virtual Space management and keeps track of the real memory pages in use (among a lot of other things). The hashed page directory or HPDIR is the structure used at the lowest levels of the OS to load the TLB (translation look-aside buffer) with a virtual-to-real translation.
Memory Management Structures: continued Some useful memory management macros are:
MM_ACTIVE_IO – lists all active I/O at the time of the dump. This macro probably won’t work very well on a live system, although you can try!
MM_COMPLETE_IO – lists all completed I/O. Often you may want to set a filter on a specific virtual address, for example ENV FILTER “a.c0000000”, because the list can be quite lengthy.
MM_PAGE_INFO – like VS_PAGE_INFO, lists memory manager specific information about the virtual address.
Dispatcher Structures
• Dispatcher Globals, KSO 127, type “disp_globals_type”
• TCB, task control block, type “tcb_type”
Dispatcher Structures These are not the only dispatcher structures but they are worth mentioning in this context. The TCB, or task control block, rates a special mention because it is the structure used to save process state when a process loses the CPU. It is kept in real memory but is “equivalently mapped”, meaning that it is given an address in space 0 at the same offset it occupies in real memory. The TCB function will return the real memory address of a TCB for a given PIN.
Table Management
• used extensively in the OS, table header type “tbl_hdr”
• characteristic of a “table management” table is that the first two words of the table point to itself
• various types of tables are used, FIFO, LIFO, monotonic etc.
Table Management The use of table management in the MPE/iX OS is so pervasive that it warrants this mention. Table management is essentially a centralized method for managing an object. An object is created and then transformed into a table. The table consists of a “header” and a “body”. The header is formatted using the type “TBL_HDR”. Each entry in the body portion will be whatever type the owner decides to use.
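The slide notes that the first two words of a managed table point back to the table itself. A rough way to picture that check is sketched below in Python. Everything here is hypothetical: memory is modeled as a dictionary of 32-bit words, and the layout of the self pointer (space ID in the first word, offset in the second) is an assumption, not something taken from the TBL_HDR definition.

```python
def looks_like_managed_table(read_word, sid, offset):
    """Return True if the two words at sid.offset point back to sid.offset.

    read_word(sid, offset) is a caller-supplied function returning the
    32-bit word at that virtual address (a stand-in for reading a dump).
    Assumption: word 0 holds the space ID and word 1 holds the offset.
    """
    first = read_word(sid, offset)
    second = read_word(sid, offset + 4)
    return (first, second) == (sid, offset)


# Hypothetical dump contents: a table at b.1000 and unrelated data at b.2000.
fake_memory = {
    (0xB, 0x1000): 0xB, (0xB, 0x1004): 0x1000,
    (0xB, 0x2000): 0x0, (0xB, 0x2004): 0x41854500,
}


def read_fake(sid, offset):
    return fake_memory.get((sid, offset), 0)


print(looks_like_managed_table(read_fake, 0xB, 0x1000))
print(looks_like_managed_table(read_fake, 0xB, 0x2000))
```

In practice the same test is done by eye: display the first two words at an address and see whether they repeat the address you typed.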
System Globals
System globals is ALWAYS found at address $a.c0000000
The type is “SYSTEM_GLOBALS_TYPE”
The macro SYSGLOB will return a field within the object.
System Globals System globals is a centralized table of information used by all parts of the OS. As was mentioned earlier, the KSO table is located at the top of System Globals. At the end of the system globals structure is an array of 32 entries, one for each active processor (type “SPSD_ENTRY_TYPE”), that tracks information about the process active on each CPU. The macro “SYSGLOB” is an easy way to get information out of the system globals structure, but you need to know the name of the field you want to see.
PA-RISC General Registers
PA-RISC uses 32 general registers. The procedure calling convention defines:
• R30 is “SP” or the stack pointer
• R27 is “DP” or the data pointer (global variables)
• R2 is “RP” or the procedure return pointer
• R28 and R29 are function return (ret0 and ret1)
• R26, R25, R24 and R23 can contain the first four arguments passed to procedures (arg0..
PA-RISC General Registers The PA-RISC Instruction Set Reference Manual and the Procedure Calling Convention manual are pretty hard to come by. They are not at the docs.hp.com web site, so it is worth spending a little time going over some of the basics of the hardware. DEBUG, DAT and SAT use aliases for certain of the registers: SP, the stack pointer, will always be R30. DP, the data pointer (global variables in a program context), will always be R27. RP, the procedure return pointer, is R2.
PA-RISC Space Registers
• There are 8 space registers, SR0 to SR7
• SR0 saves space ID for external branches
• SR1 to SR3 loaded by software as needed
• SR4, SR5, SR6 and SR7 are defined by the calling convention
– SR4 is code, typically the space ID of your program
– SR5 is data, the space ID of a process STACK
– SR6 is always $b (#11), OS structures and short mapped files
– SR7 is alway
PA-RISC Space Registers One point that the illustration did not mention is that SR5, SR6 and SR7 can only be written by code running at the highest privilege level, which is 0 (user mode being 3).
Short vs. Long Pointers
Load and Store instructions that specify a space register of zero intend that the hardware will derive the space register by using the first 2 bits of the offset portion of the address and adding 4 to that, giving the SR number to use.
LDW -296(0,30),22
If R30 contains 418432f0 the ‘4’ is 0100 in binary. The first 2 bits are 01, and 01 + 4 = 5.
Short vs. Long Pointers Here’s a table of where various addresses would be resolved using short pointer references:
Address Range          Space ID used
--------------------   -------------
00000000 to 3fffffff   SR4
40000000 to 7fffffff   SR5
80000000 to bfffffff   SR6
c0000000 to ffffffff   SR7
These address ranges are also called “QUADS” as they represent ¼ of a 4GB space so each QUAD is 1GB of address space. The OS uses SR6 and SR7 for resident and non-resident OS structures as well as NL.PUB.SYS.
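The rule behind the table (take the top 2 bits of the 32-bit offset and add 4) is easy to check with a small Python sketch. The function names are made up for this illustration; only the arithmetic comes from the text.

```python
def space_register_for(offset):
    """Return the space register number a short pointer offset selects."""
    if not 0 <= offset <= 0xFFFFFFFF:
        raise ValueError("offset must fit in 32 bits")
    return (offset >> 30) + 4   # top 2 bits select the quad, then add 4


def quad_bounds(space_register):
    """Return the (lowest, highest) offset resolved through SR4..SR7."""
    if space_register not in (4, 5, 6, 7):
        raise ValueError("short pointers only select SR4 to SR7")
    low = (space_register - 4) << 30
    return low, low + (1 << 30) - 1


# The stack pointer value used in the LDW example.
print("SR%d" % space_register_for(0x418432F0))
for sr in (4, 5, 6, 7):
    low, high = quad_bounds(sr)
    print("SR%d: %08x to %08x" % (sr, low, high))
```

Running the loop reproduces the four address ranges listed in the table, one quad per space register.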
Short vs. Long Pointers
In short pointer addressing the high order 2 bits of an offset are used to denote the space register, therefore they are NOT used as part of the address. This means that using short pointers limits addressability to 2^30 - 1, or 1GB.
Short vs.
Procedure Calling Convention
• Stack Frames are built by non-leaf procedures so that when they call other procedures registers can be spilled into the frame and restored from there on return.
• GR3 through GR18 are “callee save” registers, they are spilled, if necessary, by the procedure that is being called.
Procedure Calling Convention There is considerably more to the procedure calling convention than is represented on the previous page but those are some of the more important points. A stack frame only needs to be built if the current procedure will call other procedures. A leaf procedure would be one that makes no calls so there is no need for it to allocate space to spill registers.
Procedure Calling Convention: Registers
len = FREAD(MPE_fd, &buf, -32767);
R0 =00000000 40100480 013dc097 41845630
R4 =d66e8018 d66ea018 00000000 00000000
R8 =00000000 00000000 00000000 00000000
R12=00000000 00000000 00000000 00000000
R16=00000000 00000000 00000000 00000000
R20=0000000a 013dc08c c010
Procedure Calling Convention: Registers The previous page illustrates how parameters for a call to FREAD would be passed. The convention says that R26 to R23 are used for the first 4 arguments. R26 is referred to as “arg0” with R25 “arg1”, R24 “arg2” and R23 “arg3”. The second parameter to FREAD is defined as a long pointer, which takes 64 bits (2 words). The procedure calling convention specifies that 64 bit quantities be passed with the high order word in an ODD argument register.
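To see what the odd-register rule does to the FREAD call, here is a small Python sketch of how argument words might be assigned. It is a simplified model and not the full convention: it assumes that a 64-bit parameter starts on an even argument word (so its high-order word lands in arg1 or arg3), that an argument word is skipped when needed to get that alignment, and it reports anything beyond arg3 simply as "stack". The function name is invented for the example.

```python
# arg0..arg3 map to R26..R23, as described in the text.
ARG_REGISTERS = ["R26", "R25", "R24", "R23"]


def place_arguments(sizes_in_words):
    """Return, per parameter, where its words go (high-order word first)."""
    placements = []
    next_word = 0
    for size in sizes_in_words:
        if size not in (1, 2):
            raise ValueError("only 1-word and 2-word parameters are modeled")
        if size == 2 and next_word % 2 == 1:
            next_word += 1   # assumed: skip one argument word to align the pair
        words = range(next_word, next_word + size)
        next_word += size
        # Highest-numbered argument word first, i.e. the high-order word first.
        placements.append(
            [ARG_REGISTERS[w] if w < len(ARG_REGISTERS) else "stack"
             for w in reversed(words)]
        )
    return placements


# FREAD(file number, long pointer to the buffer, length)
for name, where in zip(("file number", "buffer", "length"),
                       place_arguments([1, 2, 1])):
    print(name, "->", where)
```

Under these assumptions the file number goes in R26, the long pointer occupies R23 (high-order word) and R24, R25 is left unused, and the length has to be passed on the stack.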
Procedure Calling Convention: Stack Frame
013dc0c8 FREAD      6bc23fd9 STW   r2,-20(sr0,r30)
013dc0cc FREAD+$4   6fc30200 STWM  r3,256(sr0,r30)
013dc0d0 FREAD+$8   6bc43e09 STW   r4,-252(sr0,r30)
013dc0d4 FREAD+$c   6bc53e11 STW   r5,-248(sr0,r30)
013dc0d8 FREAD+$10  6bd73da1 STW   r23,-304(sr0,r30)
013dc0dc FREAD+$14  6bd83da9 STW   r24,-300(sr0,r30)
013dc0e0 FREAD+$18  08000240 OR    r0,r0,r0
FREAD+$98 d35a1ff0 EXTRS r26,31,16,r26
.
Procedure Calling Convention: Stack Frame The illustration shows the first things that FREAD does when it is called. These steps are roughly the same for all OS procedures:
1. The current value of R2 (RP) is saved at SP-#20; it will be picked up at the end of the procedure to return to the caller. The caller will have had to be sure that R2 does contain a pointer back to it!
2. If necessary a stack frame is built.
Procedure Calling Convention: SP & PSP
Once the STWM (or LDO) instruction is executed to build a new stack frame, all references to a procedure’s parameters become “PSP” (previous stack pointer) relative.
GR26 to GR23 may be spilled to PSP-$24 to PSP-$30 respectively. You cannot count on that occurring! There may be no need to save a register to memory.
Procedure Calling Convention: SP & PSP SP is a real register, R30 by convention. PSP is not. It is the value of SP with the size of the current frame subtracted. Let’s say you run a program and set a breakpoint at FREAD. At the point before the stack frame is built you could count on the argument registers R26..R23 being correct and that SP-negative addresses would give you any additional parameters that might be there. Once the stack frame is built those SP-negative addresses become PSP-negative addresses.
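The SP/PSP arithmetic can be checked against the FREAD prologue, where the frame is built with STWM r3,256(sr0,r30) and R23 and R24 are then stored at SP-#304 and SP-#300. The Python sketch below assumes the four argument spill slots sit 4 bytes apart between PSP-$24 and PSP-$30; the function names and the sample SP value are invented for the example.

```python
# Assumed spill slots below PSP for arg0..arg3 (R26..R23), 4 bytes apart.
SPILL_OFFSETS = {"R26": 0x24, "R25": 0x28, "R24": 0x2C, "R23": 0x30}


def previous_sp(sp, frame_size):
    """PSP is the current SP with the size of the current frame subtracted."""
    return sp - frame_size


def spill_slot(sp, frame_size, register):
    """Address where `register` would be spilled, if the procedure spills it."""
    return previous_sp(sp, frame_size) - SPILL_OFFSETS[register]


sp = 0x41845730       # hypothetical SP after the frame has been built
frame_size = 256      # from STWM r3,256(sr0,r30)
for reg in ("R26", "R25", "R24", "R23"):
    slot = spill_slot(sp, frame_size, reg)
    print("%s: PSP-$%x = SP-#%d = %08x"
          % (reg, SPILL_OFFSETS[reg], sp - slot, slot))
```

For R23 and R24 this gives SP-#304 and SP-#300, which matches the two STW instructions in the FREAD listing.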
Procedure Calling Convention: SP & PSP This admittedly very simple example shows how to look for R26 appearing as either the source register or the destination register to see whether it has been moved. In this example the only reference to R26 from the beginning of FCLOSE to the current offset is that one instruction. All that is doing is extracting the right 16 bits of the register because file number is defined as a 16 bit value.
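When the listing is long, the same search can be done mechanically on saved output. The Python sketch below scans disassembly lines for a register name in the operand column. It assumes the operands are the last whitespace-separated field on each line, as in the FREAD listing, and the function name is made up for the example.

```python
import re


def references(register, listing):
    """Return the listing lines whose operand field mentions `register`."""
    # Word boundaries keep r2 from matching r23 or r26.
    pattern = re.compile(r"\b" + re.escape(register) + r"\b")
    hits = []
    for line in listing:
        fields = line.split()
        if fields and pattern.search(fields[-1]):
            hits.append(line)
    return hits


LISTING = [
    "FREAD      6bc23fd9 STW   r2,-20(sr0,r30)",
    "FREAD+$4   6fc30200 STWM  r3,256(sr0,r30)",
    "FREAD+$10  6bd73da1 STW   r23,-304(sr0,r30)",
    "FREAD+$14  6bd83da9 STW   r24,-300(sr0,r30)",
    "FREAD+$98  d35a1ff0 EXTRS r26,31,16,r26",
]

for line in references("r26", LISTING):
    print(line)
```

If the only hit is an instruction that reads and rewrites the register in place, as with the EXTRS here, the argument is still in the register and does not need to be hunted down on the stack.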
Case Study: SA663
PC=a.0019fe78 system_abort
NM 0) SP=418562e0 RP=a.00a51bc8 sm_quarantine_gufd+$1fc
NM 1) SP=418562e0 RP=a.00ee5a5c tm_close_common.tm_unlink_plfd_and_gdpd+$184
NM 2) SP=418558e0 RP=a.00ee75cc tm_close_common+$1a98
NM 3) SP=41855860 RP=a.
Case Study: SA663 A system abort 663 occurs when a problem is encountered in a file system structure but the Subsystem Dump facility has not been enabled by running SDUTIL. Had it been enabled, the file system and storage management would have been able to quarantine the file, preventing it from being accessed until it could be checked and, if necessary, restored with a good copy. Since the failure is the result of a problem with a file, the first thing to do would be to find out what file that is.
Case Study: SA663
$18a ($70) nmdat > lev 5
$18b ($70) nmdat > dv psp-60,10
VIRT $866.41854500 $ d6ef0a94 41854418 05650003 4f4bbec1
VIRT $866.41854510 $ ca12a970 0300000a 84000000 013d21bc
VIRT $866.41854520 $ 06020000 00000000 00000003 00000000
VIRT $866.
Case Study: SA663 continued Yup, it did save the file number in the stack. Well, to be honest we would have to assume that the $d is the file number just by looking at the value in PSP-$24. If we wanted to be absolutely certain it is (and absolute certainty is handy a lot of the time) then we would need to examine the code that fclose_nm executed to see if it did spill the file number parameter to the stack.
Case Study: SA663
$193 ($70) nmdat > fs_file(,d)
File name : TESTFILE.PUB.AP
Native Mode file
Access options : APPEND,NOMR,LOCK,SHR,BUF,NOMULTI,WAIT,NOCOPY
Access method : $0
Last error number: $0
. . .
Case Study: SA663 continued Now that we have the file number we can use the FS_FILE macro to display information about this file. The most important thing is the file name because if this file is damaged and could not be quarantined there is a good chance someone else may try to access the file which could cause another system abort. Notice the record limit on the file. That’s pretty large.
Case Study: SA663
$19d ($70) nmdat > lev 2
$19e ($70) nmdat > dc pc
SYS $a.ee5a5c
00ee5a5c tm_close_common.
Case Study: SA663 continued We do not have the source code and cannot just go look at what may have caused the problem, but we can still try to find out some relevant information. The 2nd level procedure tm_unlink_plfd_and_gdpd of tm_close_common made the call to sm_quarantine_gufd. The question is, why did it want to quarantine the GUFD? We can dump out the code for that 2nd level routine from its beginning up to the location of the PC in that routine, as shown in the illustration above.
Case Study: SA663
. . .
tm_close_common.tm_unlin*+$118 4bda3eb1 LDW   -168(sr0,r30),r26
tm_close_common.tm_unlin*+$11c 287fefff ADDIL $fffff000,r3,1
tm_close_common.tm_unlin*+$120 343900c8 LDO   100(r1),r25
tm_close_common.tm_unlin*+$124 4bd83ea9 LDW   -172(sr0,r30),r24
tm_close_common.
Case Study: SA663 continued Illustrated above is a small portion of the code that would be displayed by the DCX macro call. From this we can see that the 2nd level procedure made a call to the procedure tm_unlink_gdpd. On return from that procedure a value at SP-#172 was loaded into R22. Then R22 was used to load a value into R1.
Case Study: SA663
$1a1 ($70) nmdat > dv sp-#172
VIRT $866.41855834 $ 418545a8
$1a2 ($70) nmdat > dv [sp-#172]
VIRT $866.418545a8 $ fc0e008f
$1a3 ($70) nmdat > wl errmsg(S16(fc0e),8f)
Type manager; unable to unlink the GDPD.
Case Study: SA663 continued The illustration shows how we would mimic the actions of the instructions to find out what this value was. The command
$1a2 ($70) nmdat > dv [sp-#172]
employs indirection ([ and ]) so that rather than displaying the value at SP-#172 we are instead saying: take the value at SP-#172 and show me what it points to. The value that it points to looks like it could be an HPE_STATUS.
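The ERRMSG call on the slide was given S16(fc0e) and 8f, that is, the upper 16 bits of the status as a signed number and the lower 16 bits as they are. The same split is easy to do by hand or, as sketched here, in Python. The function name and the labels "info" and "subsystem" for the two halves are assumptions used for illustration.

```python
def split_status(status):
    """Split a 32-bit status word into (signed upper half, lower half)."""
    info = (status >> 16) & 0xFFFF
    if info >= 0x8000:          # sign-extend the upper 16 bits, like S16()
        info -= 0x10000
    subsystem = status & 0xFFFF
    return info, subsystem


info, subsystem = split_status(0xFC0E008F)
print("info = %d, subsystem = %d ($%x)" % (info, subsystem, subsystem))
```

For $fc0e008f this gives -1010 and 143 ($8f), which are the two values handed to ERRMSG to get the message text.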
Case Study: SA663
$1a6 ($70) nmdat > fv fs_gufd(fs_plfd(,d)) 'gufd_t'
RECORD
. . .
FILE_VIR_ADDR : 2e4.0
GDPD_PTR : 0
. . .
QUARANTINE_REASON : ALL : fc0e008f
QUARANTINE_TIME : 3b167877d4453
EOF_OFFSET : 94dd300
. . .
Case Study: SA663 continued Finally we can format the GUFD for file $d and see that it agrees with what was found, the bad status was $fc0e008f. There are some other interesting things to be seen; the GDPD_PTR is zero. This should be a pointer to the last GDPD in a linked list. Since the call to tm_unlink_gdpd failed we could assume (and it would be correct!) that the failure was due to the fact that this value is null. Something else seems to have either cleared the value or unlinked the GDPD erroneously.
Hangs
• hangs are usually difficult to diagnose.
• determine the scope of the hang, what is affected
• gather as much information as possible BEFORE deciding to get a memory dump.
Hangs Hangs do tend to be more difficult to diagnose than aborts. Often what is called a “hang” is really a performance slow-down. It can often be limited to a particular application or area of the OS. If the sole function of a system is to run accounts payable and anyone trying to do so hangs, then it is technically correct that the “system” is hung. But telling that to a support engineer might mislead them badly! It is important to determine the scope of the problem.
Before a Memory Dump
• repeating the SHOWPROC command can tell if processes are using CPU time or not
• SHOWJOB will tell what is presently running
• SHOWQ / SHOWWG will show the present queue & workgroup settings
• are disks active or idle
• use debug to trace suspect processes
– macros such as pm_semaphore, rm_semaphore can help
3/19/2003 Basic System Problem Analysis
Before Memory Dump Gather as much information as possible! If you are able to log on, or if a session is already logged on as MANAGER.SYS, use it to collect that information. SHOWPROC is extremely useful in cases where the system is not completely hung up but people are complaining of problems. For example,
SHOWPROC PIN=1;TREE;SYSTEM
will display all processes on the system. Use this to locate processes that may be blocked.
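Repeating SHOWPROC is really a comparison of two samples: a process whose CPU time has not moved between them is a candidate for being blocked. The Python sketch below shows that comparison on made-up numbers. The dictionaries of PIN to CPU seconds are hypothetical and would have to be filled in by hand from two SHOWPROC listings; nothing here parses real SHOWPROC output.

```python
def stalled_pins(first_sample, second_sample):
    """Return the PINs present in both samples whose CPU time did not change."""
    return sorted(
        pin for pin, cpu_seconds in second_sample.items()
        if pin in first_sample and cpu_seconds == first_sample[pin]
    )


# Hypothetical CPU seconds per PIN, noted from two SHOWPROC runs a minute apart.
first = {35: 12.4, 60: 48.1, 86: 3.0, 97: 120.7}
second = {35: 12.4, 60: 48.1, 86: 3.0, 97: 131.2}

print(stalled_pins(first, second))
```

A process that shows no CPU use is not necessarily in trouble (it may simply be waiting for terminal input), so this only narrows down which processes deserve a closer look.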
Case Study: Hang Memory Dump
$150 ($0) nmdat > process_wait
===== DISPATCHER INFORMATION FOR A PROCESS =====
c PIN# State Wait Event Pri Class Blocked Reason
-----------------------------------------------
S $1 LONG_WAIT IPC $7918 AS Known Port ffffffed
S $2 LONG_WAIT IPC $38ff BS JUNK_WAIT
LONG_WAIT Control Block $33ff CS CNTL_BLOCK_WAIT
TERMI
Case Study: Hang Memory Dump If you have a memory dump of a hang it is not so important to begin looking at stack traces as it is finding “interesting processes”. These would be processes blocked in ways that would not be normal. Now, without having had the complete MPE/iX internals training and a few years of reading memory dumps, knowing what is “normal” is not quite that simple. For example, “JUNK_WAIT” doesn’t look all that normal but it is.
Case Study: Hang Memory Dump
$151 ($0) nmdat > pin 86
$152 ($86) nmdat > pm_semaphores
ADDRESS OF SEMAPHORE WAITED ON: $b.88ae19b0
$154 ($86) nmdat > rm_semaphore b.88ae19b0
List of pins waiting on semaphore at $b.
Case Study: Hang Memory Dump continued What we do once we find an interesting process is to switch to that pin and have a look at the trace. That isn’t shown in the illustration on the prior page:
$14a ($0) nmdat > pin 35
$150 ($35) nmdat > tr,d,i
PC=a.0017099c enable_int+$2c
NM* 0) SP=41853ef0 RP=a.00786004 notify_dispatcher.block_current_process+$338
NM 1) SP=41853ef0 RP=a.00787e44 notify_dispatcher+$268
NM 2) SP=41853e70 RP=a.001b6034 sem_block.wait_for_resource+$1bc
NM 3) SP=41853d70 RP=a.
Case Study: Hang Memory Dump continued
$152 ($86) nmdat > pm_semaphores
ADDRESS OF SEMAPHORE WAITED ON: $b.88ae19b0
$154 ($86) nmdat > rm_semaphore b.88ae19b0
List of pins waiting on semaphore at $b.88ae19b0
$60 $6e $76 $7e $86 $8e $92 $96 $9a $9e $a2 $a6 $aa $ae
Pin $35 has an exclusive lock on shareable semaphore at $b.
Case Study: Hang Memory Dump continued Now that you have the address of the semaphore you could, for example, use the function VAINFO to get the BASE_VA (or address) of that semaphore. It is very likely to be a part of some larger structure. VAINFO could also be used to tell you the OBJ_CLASS (object class) of the structure it is in. These would be useful data points.
$169 ($35) nmdat > wl vainfo(b.88ae18d0, 'BASE_VA')
$b.88ae0000
$16a ($35) nmdat > wl vainfo(b.
Case Study: Hang Memory Dump
$156 ($86) nmdat > pin 35
$157 ($35) nmdat > tr,d,i
PC=a.0017099c enable_int+$2c
NM* 0) SP=41853ef0 RP=a.00786004 notify_dispatcher.block_current_process+$338
NM 1) SP=41853ef0 RP=a.00787e44 notify_dispatcher+$268
NM 2) SP=41853e70 RP=a.001b6034 sem_block.
Case Study: Hang Memory Dump continued At this point we know that Pin 86 was blocked on a semaphore owned by Pin 35. We want to go look at what Pin 35 is doing and we find that this process has called DBLOCK and is also blocked. Note that it also has “SEM_BLOCK” in its stack trace. You only see this when a process blocks on a semaphore. We need to see what semaphore this process is waiting on.
Case Study: Hang Memory Dump
$158 ($35) nmdat > pm_semaphores
ADDRESS OF SEMAPHORE WAITED ON: $b.88ae18d0
$159 ($35) nmdat > rm_semaphore b.88ae18d0
List of pins waiting on semaphore at $b.
Case Study: Hang Memory Dump This is exactly what was done with Pin 86. Here we see that pin 60 owns the semaphore that pin 35 is blocked on.
Case Study: Hang Memory Dump
$15a ($35) nmdat > pin 60
$15b ($60) nmdat > tr,d,i
PC=a.0017099c enable_int+$2c
NM* 0) SP=41853ef0 RP=a.00786004 notify_dispatcher.block_current_process+$338
NM 1) SP=41853ef0 RP=a.00787e44 notify_dispatcher+$268
NM 2) SP=41853e70 RP=a.001b6034 sem_block.
Case Study: Hang Memory Dump continued We seem to have a pattern developing here… Pin 60 owns a semaphore but it also has both a DBLOCK call and a SEM_BLOCK call in its stack. So it is also waiting on a semaphore while holding one just like pin 35. We will use the same two macros to look at what pin 60 is waiting for.
Case Study: Hang Memory Dump
$15c ($60) nmdat > pm_semaphores
ADDRESS OF SEMAPHORE WAITED ON: $b.88ae19b0
$15d ($60) nmdat > rm_semaphore b.88ae19b0
List of pins waiting on semaphore at $b.88ae19b0
$60 $6e $76 $7e $86 $8e $92 $96 $9a $9e $a2 $a6 $aa $ae
Pin $35 has an exclusive lock on shareable semaphore at $b.
Case Study: Hang Memory Dump continued Pin 60 is waiting on the semaphore that pin 35 holds. Pin 35 is waiting on the semaphore that pin 60 holds. The classic deadly embrace. The macro RM_SEM_DEADLOCK would actually have been a much better choice here, as it would have detected this and displayed the two pins involved:
$161 ($0) nmdat > rm_sem_deadlock
**********************
* Deadlock detected.
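What we did by hand, hopping from a waiting pin to the owner of its semaphore and repeating, amounts to looking for a cycle. A minimal Python sketch of that walk is shown below, fed with the pins and semaphore addresses from this dump. It is only an illustration of the idea and makes no claim about how RM_SEM_DEADLOCK itself is written; the function name and the two dictionaries are made up.

```python
def find_deadlock(waiting_on, owner_of):
    """Follow waiter -> semaphore -> owner links and return a cycle of pins.

    waiting_on maps a pin to the semaphore it is blocked on.
    owner_of maps a semaphore to the pin holding it.
    Returns the list of pins in the first cycle found, or None.
    """
    for start in waiting_on:
        seen = []
        pin = start
        while pin in waiting_on and pin not in seen:
            seen.append(pin)
            pin = owner_of.get(waiting_on[pin])   # who holds what we wait for
        if pin in seen:
            return seen[seen.index(pin):]         # the pins that form the loop
    return None


# Taken from the pm_semaphores / rm_semaphore output in the case study.
waiting_on = {0x86: "b.88ae19b0", 0x35: "b.88ae18d0", 0x60: "b.88ae19b0"}
owner_of = {"b.88ae19b0": 0x35, "b.88ae18d0": 0x60}

cycle = find_deadlock(waiting_on, owner_of)
print(["$%x" % pin for pin in cycle])
```

Pin $86 is not part of the loop; it is simply stuck behind it, which is why starting the walk there still ends up at pins $35 and $60.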
Case Study: Hang Conclusion
• the hang is due to a database locking problem
• the memory dump probably was not necessary, DBUTIL “SHOW LOCKS” would probably have helped determine what the problem was
• the TELESUP utility “UNDEDLOCK” might even have been able to correct it
Case Study: Hang Conclusion This is pretty obviously an application problem. Two programs are locking datasets in a database in the opposite order. It is also very likely that some far less drastic measure could have been taken to diagnose this, short of taking the system down and dumping it. Unfortunately most system hangs are not as easy and obvious to diagnose as this one was. Any information that can be gathered beforehand will help.