Tandem Failure Data System (TFDS) Manual

Abstract

This manual describes the HP Tandem Failure Data System (TFDS), including operating procedures, interface commands, messages, and automation of most tasks associated with failure data collection and resource recovery in the event of software-related processor failure.

Product Version

TFDS G05

Supported Release Version Updates (RVUs)

This manual supports G05.00 and subsequent RVUs until otherwise indicated in a new edition.
Document History

Part Number   Product Version                                       Published
424045-001    TFDS D40 (with T6523AAW), TFDS G05 (with T6523AAY)    February 2000
427561-001    TFDS D41 (with T6523ABA), TFDS G05 (with T6523ABE)    July 2001
520628-001    TFDS D41 (with T6523ABA), TFDS G05 (with T6523ABE)    November 2001
520628-002    TFDS G05                                              May 2002
520628-003    TFDS G05                                              December 2002
Contents

What’s New in This Manual
    Manual Information
    New and Changed Information
About This Manual
    Your Comments Invited
    Notation Conventions
1. Introduction to TFDS
    Overview of TFDS
    What TFDS Can Do for You
    How to Find Out More About TFDS
3. Starting and Configuring TFDS
    DISALLOWED-VOLUMES
4. Using TFDSCOM Commands
    IGNORECPUS
    IGNOREOPERATORHALTS
    INCIDENT
    MAXCONDUMPS
    MAX-DB-ENTRIES
    PROCESSINGDELAY
    PURGE
    REMOTENOTIFY
    REPORT
    RETRY-DUMP
    RETRY-RELOAD
    SAVE
    STATUS
    TAPE
    TASKS
    TFDS STOP
A. Disabling Other Software
    Programmatic Network Administrator (PNA)
    Rule Management Services (RMS)
B. TFDSCOM Command Migration Table
F. Fast Memory Dump
    FMD Processing Steps
    CPU HALT Incident Analysis
    Fast Memory Dumping Commands
Glossary
Index

Figures
    Figure 2-1. TFDS Operational Model

Tables
    Table 3-1.
    Table 4-1.
    Table F-1.
What’s New in This Manual

New and Changed Information

This revision of the Tandem Failure Data System (TFDS) Manual includes the following new information regarding TFDS:

• The BURSTNOTIFY command parameter ranges have been corrected.
• The REPORT command option overrides have been corrected.
About This Manual TFDS, a component of the HP NonStop™ Kernel operating system, is a software isolation tool that automates most tasks associated with failure data collection and resource recovery in the event of software-related processor failure.
Your Comments Invited

After using this manual, please take a moment to send us your comments. You can do this by:

• Completing the online Contact NonStop Publications form if you have Internet access.
• Faxing or mailing the form, which is included as a separate file in Total Information Manager (TIM) collections and located at the back of printed manuals. Our fax number and mailing address are included on the form.
General Syntax Notation About This Manual computer type. Computer type letters within text indicate C and Open System Services (OSS) keywords and reserved words; enter these items exactly as shown. Items not enclosed in brackets are required. For example: myfile.c italic computer type. Italic computer type letters within text indicate C and Open System Services (OSS) variable items that you supply. Items not enclosed in brackets are required. For example: pathname [ ] Brackets.
Change Bar Notation About This Manual … Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times. For example: M address-1 [ , new-value ]... [ - ] {0|1|2|3|4|5|6|7|8|9}... An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. For example: "s-char..." Punctuation.
Change Bar Notation About This Manual The CRE has many new message types and some new message type codes for old message types. In the CRE, the message type SYSTEM includes all messages except LOGICAL-CLOSE and LOGICAL-OPEN.
1 Introduction to TFDS This section presents a brief overview of what TFDS is and what it can do for you. Section 2, Using TFDS, provides more detail on how TFDS works. Overview of TFDS TFDS is a key automation and problem-management component of the NonStop Kernel operating system. It automates most tasks associated with data collection and resource recovery in the event of software-related processor or subsystem failure. TFDS monitors processors in HP NonStop servers for software failure notifications.
• Software developers and your service provider use the data TFDS captures to speed the repair of the root cause of the failure.

How to Find Out More About TFDS

For an illustration and description of a typical TFDS operating process, see How TFDS Works on page 2-1. TFDS is highly configurable to meet your needs. To learn how to create a custom configuration file, see Section 3, Starting and Configuring TFDS.
2 Using TFDS

This section covers the following topics:

• TFDS Components on page 2-1
• How TFDS Works on page 2-1
• When to Use TFDS on page 2-3
• System Resource Usage on page 2-3
• Monitoring TFDS Status on page 2-4
• What to Do After a Software Failure on page 2-6

TFDS Components

Component      Description
TFDS monitor   A process pair that constantly watches system messages for notification of software-related processor halts or software failure. (TFDS takes no processor time for this monitoring.
How TFDS Works Using TFDS • • Instrumentation calls embedded in code to indicate a detected internal software failure. When a failure is detected, TFDRTL collects the failure specification and notifies the TFDS monitor. TFDSCOM requests for configuration changes or actions such as manual dumps or reloads. Figure 2-1.
4. TFDS automatically initiates a processor dump (if AUTODUMP is enabled). The dump can optionally be backed up to tape.
5. TFDS automatically reloads the processor (if AUTORELOAD is enabled).
6. After the failure data is captured, TFDS builds a software failure event and writes it to the Event Management Service (EMS) log (whether the failure event was unique or recurrent).

For required follow-up actions, see What to Do After a Software Failure on page 2-6.
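The dump, reload, and event steps above can be sketched as a small decision function. This is an illustrative model only, not TFDS code: the `Settings` class, the `handle_cpu_down` function, and the action strings are hypothetical names introduced here, and the sketch assumes the two flags are the only inputs that matter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Settings:
    """Hypothetical container for the two flags named in steps 4 and 5."""
    autodump: bool
    autoreload: bool


def handle_cpu_down(cpu: int, settings: Settings) -> List[str]:
    """Return the ordered actions taken for a software-related CPU halt."""
    actions = []
    if settings.autodump:
        # Step 4: the dump is taken only when AUTODUMP is enabled.
        actions.append(f"dump CPU {cpu}")
    if settings.autoreload:
        # Step 5: the reload happens only when AUTORELOAD is enabled.
        actions.append(f"reload CPU {cpu}")
    # Step 6: the event is written for unique and recurrent failures alike.
    actions.append("write software failure event to EMS log")
    return actions
```

For example, with AUTODUMP on and AUTORELOAD off, the sketch returns the dump action followed by the EMS event, which matches a processor that is dumped but left down for manual reload.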
Monitoring TFDS Status Using TFDS • If no DUMPVOLUME was specified or there was insufficient space on the volume, TFDS searched all allowed volumes for the volume with the most available space. The dump files and incident-related data were stored there. TFDS operation in FMD mode changes only slightly from this model. The ALTERNATE-VOLUMES command allows you to specify secondary volumes in addition to the primary volume specified in the DUMPVOLUME command.
STATUS Example

TFDS current environment: @01/06/11 08:41:31

CPU  CPU      CPU      CPU               Time of            Time of
No   State    Status   Dn     R1         Last Failure       Last Load
---  -------  -------  -----  -----      -----------------  -----------------
0    Up       Enabled  0      0          N/A                99/05/12 16:59:23
1    Up       Enabled  0      0          N/A                99/05/12 17:01:44
2    Up       Enabled  0      0          N/A                99/05/12 17:01:52
3    Dumping  N/A      1      0          99/05/14 07:25:56  N/A

Failures are recorded since TFDS started at 01/06/11 08:40:28
AUTORELOAD OFF
AUTOSTIFLE 03 Times in 24 Hours To
What to Do After a Software Failure Using TFDS What to Do After a Software Failure A software failure event in the EMS log notifies you of a software failure. This event contains information such as the product number and halt code related to the incident. Notification is also available through your system-management tools. For critical failures, a dial-out feature is available through your system-management tools to forward the event to your service provider.
3 Starting and Configuring TFDS

This section contains the procedures for starting and configuring TFDS on your NonStop system. TFDSCOM commands are the same for systems running either G-series or D-series RVUs.

This section covers these topics:

• Starting TFDS on a System Running a G-Series RVU
• Starting TFDS Manually
• Configuring TFDS Options
• TFDS Configuration File Commands

Caution.
Starting and Configuring TFDS Starting TFDS Manually SCF 2> start process $zzkrn.#tfds Note. You must be logged on with a valid super.xxxxx user name. The add process command specifies that SCF automatically starts TFDS during future startups. It also specifies the primary and backup processors in which TFDS will run. The $system.sys00.tfds option is the typical location where TFDS is installed. Your system number might be different. The start process command starts TFDS.
Starting and Configuring TFDS Starting TFDS Automatically With a Startup File n represents the primary processor in which TFDS will run. Although you can select any available processor as the primary processor for TFDS, HP recommends that you use either processor 0 or 1 (and designate the other as backup-CPU). If both of these processors fail, the system fails also. backup-CPU represents the processor in which TFDS creates a backup process.
Starting and Configuring TFDS Viewing and Changing TFDS Configuration Options Using TFDSCOM Commands. For the current equivalent for obsolete command names, see Appendix B, TFDSCOM Command Migration Table. To configure TFDS configuration options to be persistent—that is, to still be valid after a system load—enter the commands through TFDSCOM and execute the SAVE command.
Viewing and Changing TFDS Configuration Options Starting and Configuring TFDS This table lists each TFDSCOM command that affects TFDS configuration settings. Table 3-1.
Starting and Configuring TFDS • Changing the TFDS Default to Enable Automatic Dumping If you do not specify a DUMPVOLUME location, TFDS searches for the volume with the most space available each time. This action could result in scattering TFDS incident information across many volumes. For more information on the commands that control configuration settings and the consequences of the default settings, see Section 4, Using TFDSCOM Commands.
Starting and Configuring TFDS • • • DISALLOWED-VOLUMES You can specify up to ten alternate volumes. This command is available only in the configuration file. You cannot use this command in the TFDSCOM conversational interface. The search order for alternate volumes is the order that alternate volumes are added in the configuration files.
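The volume search described in this section (primary DUMPVOLUME first, then alternate volumes in the order they were added, otherwise the allowed volume with the most free space) can be sketched as follows. This is a hedged model, not the TFDS implementation: the function name, its parameters, and the free-space dictionary are hypothetical, and returning `None` when configured alternates are all too small is an assumption the manual does not confirm.

```python
from typing import Dict, Optional, Sequence


def choose_dump_volume(
    required_mb: int,
    free_mb: Dict[str, int],
    dump_volume: Optional[str] = None,
    alternates: Sequence[str] = (),
    disallowed: Sequence[str] = (),
) -> Optional[str]:
    """Pick a volume for a dump, or None if no volume qualifies."""

    def fits(volume: str) -> bool:
        # A volume qualifies if it is allowed and has enough free space.
        return volume not in disallowed and free_mb.get(volume, 0) >= required_mb

    # 1. Prefer the configured DUMPVOLUME.
    if dump_volume is not None and fits(dump_volume):
        return dump_volume

    # 2. Try alternate volumes in configuration-file order.
    if alternates:
        for volume in alternates:
            if fits(volume):
                return volume
        return None  # assumption: no further fallback once alternates are set

    # 3. No alternates: take the allowed volume with the most free space.
    candidates = [v for v in free_mb if fits(v)]
    if not candidates:
        return None
    return max(candidates, key=lambda v: free_mb[v])
```

Setting a DUMPVOLUME and alternates keeps incident data on known volumes; the free-space fallback is the path that can scatter incident files across the system.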
4 Using TFDSCOM Commands Use the commands in this section in the TFDSCOM conversational interface to: • • • • Display or modify configuration values Save a specific configuration Start or cancel activities Request help You must have both TFDS and the TFDSCOM process running to use these commands. All numeric parameters must be of integer type. If you enter a command incorrectly, an error message is issued, and the previous command value remains in effect.
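The rule that numeric parameters must be integers, and that a rejected command leaves the previous value in effect, can be illustrated with a short validation sketch. The ranges below are the ones quoted in the command descriptions in this section; the `RANGES` table, the `apply_setting` function, and the error strings are hypothetical and do not reproduce actual TFDSCOM messages.

```python
# Ranges quoted in this section for three single-value commands.
RANGES = {
    "PROCESSINGDELAY": (0, 86400),   # seconds
    "RETRY-RELOAD": (0, 2),          # reload attempts
    "FMDINITPRIORITY": (150, 190),   # process priority
}


def apply_setting(config: dict, command: str, raw_value: str) -> str:
    """Validate raw_value and store it; on error, leave config untouched."""
    low, high = RANGES[command]
    try:
        value = int(raw_value)  # rejects non-integer input such as "1.5"
    except ValueError:
        return f"error: {command} requires an integer"
    if not low <= value <= high:
        return f"error: {command} must be {low} through {high}"
    config[command] = value
    return "ok"
```

Because the dictionary is only written after both checks pass, an invalid entry has no side effect, mirroring the "previous command value remains in effect" behavior.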
Online Help for TFDSCOM Commands Using TFDSCOM Commands Online Help for TFDSCOM Commands To view online help for TFDSCOM commands, enter one of the following at the TFDSCOM prompt: Enter To get...
TFDSCOM Commands Using TFDSCOM Commands Table 4-1.
Using TFDSCOM Commands TFDSCOM Command Descriptions TFDSCOM Command Descriptions The TFDSCOM commands appear in alphabetic order. The descriptions include: • • • • • A summary of each command function The command syntax, including descriptions of the parameters and variables Usage guidelines An example of output if the command produces any Applicable examples Caution. Requesting certain actions while TFDS is processing an event can result in TFDS processing errors.
° TSYSDP2
° $ZLOG file (example: ZLOG0273)
° $0 log file (example: ZSER0644)
° CPUnn (only when the dump is triggered by a CPU DOWN message)
° PROCIMAG (only when the dump is triggered by instrumented program calls)

Note. Use the ACQUIREFILES command to specify only files other than those automatically backed up. Using ACQUIREFILES to specify any of the files listed here terminates the file collection process.
CPU num

instructs TFDS to process a halted CPU; num specifies a processor number. Valid processor numbers range from 0 through 15.

Guidelines

• Invoking the ANALYZE command causes an event (6000 for D-series, 6001 for G-series) to be issued and an entry to be added to the TFDS incident database.
• If TSM is configured for automatic dial-out, the event generated is dialed out to the Global Customer Support Center (GCSC).
value might lock out other tasks from starting. Lowering the parameter increases the time for the complete processor memory to be returned to system operation.

AUTOBACKUP

The AUTOBACKUP command specifies whether the dump and auxiliary files are automatically backed up to tape.

AUTOBACKUP { ON | OFF }

ON

indicates that the dump and auxiliary files are automatically backed up.

OFF

indicates that the dump and auxiliary files are not automatically backed up.
Using TFDSCOM Commands AUTODUMP AUTODUMP The AUTODUMP command specifies whether an enabled processor should be dumped if a CPU DOWN message occurs or an ANALYZE command is issued. AUTODUMP { ON | OFF } ON instructs TFDS to dump an enabled processor automatically if a CPU DOWN message occurs. The default AUTODUMP setting is ON. OFF instructs TFDS not to generate a dump, but the incident is still recorded. Guideline You can use the abbreviation ad in place of AUTODUMP.
Using TFDSCOM Commands AUTOSTIFLE AUTOSTIFLE Use the AUTOSTIFLE command to create a stifling window in which a repeatedly failing processor is no longer automatically reloaded. The window is used to limit the impact of a continuously failing processor. AUTOSTIFLE num-times [,] num-hours num-times is the number of reloads allowed within num-hours. The reload range is 0 through 10. Setting the number to zero disables the AUTOSTIFLE function. The default value is 3.
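The stifling window can be pictured as a sliding count of recent reloads. The following sketch is an interpretation, not TFDS code: `reload_allowed` and its arguments are hypothetical, the default of 3 reloads comes from the text above, and the 24-hour window is borrowed from the STATUS example ("03 Times in 24 Hours") rather than from a stated default.

```python
from datetime import datetime, timedelta
from typing import Sequence


def reload_allowed(
    reload_times: Sequence[datetime],
    now: datetime,
    num_times: int = 3,
    num_hours: int = 24,
) -> bool:
    """Return True if another automatic reload fits inside the window."""
    if num_times == 0:
        # Zero disables the AUTOSTIFLE function, so reloads are never stifled.
        return True
    window_start = now - timedelta(hours=num_hours)
    # Count only the reloads that fall inside the last num_hours.
    recent = [t for t in reload_times if t > window_start]
    return len(recent) < num_times
```

A processor that has already been reloaded three times in the last 24 hours would be stifled under these sample values, while an older reload that has aged out of the window no longer counts against it.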
Using TFDSCOM Commands BACKUP BACKUP The BACKUP command displays the status of backup requests and lets you back up specific subvolumes related to database incidents. BACKUP { STATUS } { Inc-rec-num [ Tape-Unit [ TIMEOUT= time ] ] } STATUS lets you query the status of backup requests. Inc-rec-num specifies an incident number retrieved from the REPORT command incidents list. Tape-Unit is the tape drive used for the backup. time is the number of minutes a backup request will delay.
This example shows the output of the BACKUP STATUS command:

04 - Active
05 - Active
06 - Active

The BACKUP STATUS command displays the status of the pending and active TFDS backup requests. The example shows that logical backup requests 04, 05, and 06 are being processed. It appears from the example that backup requests 01, 02, and 03 have already been attended to, but the example does not indicate whether those requests succeeded.
Using TFDSCOM Commands BURSTNOTIFY Guidelines • • • • You can use the abbreviation bi in place of BURSTINTERVAL. You must set BURSTSUPPRESSION to ON for BURSTINTERVAL to start. BURSTINTERVAL does not override the BURSTSUPPRESSION setting. For a suppressed incidents report, use the TASKS command. BURSTNOTIFY Use the BURSTNOTIFY command to create a rediscovery window that creates a summary event in the maintenance log ($ZLOG) and a dial-out to the GCSC (if TSM is configured for automatic dial-out).
Using TFDSCOM Commands BURSTSUPPRESSION BURSTSUPPRESSION The BURSTSUPPRESSION command directs the processing of duplicate incidents shortly after initial incidents. BURSTSUPPRESSION { ON | OFF} ON enables burst suppression. The default BURSTSUPPRESSION setting is ON. OFF disables burst suppression. Note. This feature suppresses only duplicate TFDS RTL incidents, not CPU DOWN incidents. Guidelines • • • You can use the abbreviation bs in place of BURSTSUPPRESSION.
Using TFDSCOM Commands CLOSE CLOSE The CLOSE command deletes an incident from the TFDS incident database as well as any related dump file or event log. CLOSE rec-num rec-num represents the record number of a database record to be deleted. To derive record numbers, use the REPORT command. Note. Do not use this command unless there are many duplicate incidents in the TFDS database.
and displays the STATUS of the processors with a two-letter code. Possible STATUS code values are:

ON = Online           DP = Dump pending
DS = Dump starting    NR = Not reloaded
DF = Dump failed      RL = Reloading
DD = Dump done        NC = Not configured
DO = Dumping          UK = Unknown processor status

• The CONFIG command output displays the file names assigned to the ACQUIREFILES parameter (if applicable) and indicates if these files are permanent or temporary.
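When CONFIG output is post-processed by a script, the two-letter STATUS codes can be expanded with a simple lookup. The code-to-meaning pairs below are the ones listed above; the `STATUS_CODES` name, the `describe_status` helper, and the fallback string for unlisted codes are hypothetical additions for illustration.

```python
# Two-letter processor STATUS codes as listed for the CONFIG command.
STATUS_CODES = {
    "ON": "Online",
    "DS": "Dump starting",
    "DF": "Dump failed",
    "DD": "Dump done",
    "DO": "Dumping",
    "DP": "Dump pending",
    "NR": "Not reloaded",
    "RL": "Reloading",
    "NC": "Not configured",
    "UK": "Unknown processor status",
}


def describe_status(code: str) -> str:
    """Expand a two-letter code; unlisted codes get a placeholder string."""
    return STATUS_CODES.get(code.strip().upper(), "Unrecognized code")
```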
Using TFDSCOM Commands CPUS CPUS The CPUS command makes the transition from TACL to TFDS easier. It is an alias for the STATUS command. CPUS For guidelines and examples, see STATUS on page 4-36. DB-SUBVOL The DB-SUBVOL command specifies the location in which TFDS creates or finds the incident database. DB-SUBVOL volume.[ subvolume ] volume is the name of a volume. The default volume is $system. subvolume is the name of a subvolume. The default subvolume is ztfds.
Using TFDSCOM Commands DETAIL Guidelines • • You can use the abbreviation d in place of DETAIL. For information about the format of the detail record, see REPORT DETAIL on page 4-32.
• Possible states for the stack values:
  ° Stack from DUMPME
  ° Stack from GARTH
  ° Stack from SAVEFILE
  ° Stack has no information

DISABLECPUS

The DISABLECPUS command disables the dumping or reloading of specific processors if a software-related CPU DOWN event occurs.

DISABLECPUS { n [, n, ... ] | ALL }

n

specifies a valid processor number. Processor numbers range from 0 through 15.

ALL

specifies all processors.

Caution.
Using TFDSCOM Commands DUMPOVERRIDE DUMPOVERRIDE The DUMPOVERRIDE command instructs TFDS to always collect a processor dump after a processor has halted due to a software failure, overriding the normal TFDS operation that suppresses processor dumps for duplicate incidents. DUMPOVERRIDE { ON | OFF } ON instructs TFDS to collect a processor dump after a processor has halted. OFF instructs TFDS not to collect a processor dump after a processor has halted. The default DUMPOVERRIDE setting is OFF.
Using TFDSCOM Commands ENABLECPUS specified. If alternate volumes are specified, the dump will be taken on an alternate volume. If no alternate volumes are specified, TFDS selects the disk drive with the largest amount of free space and generates the dump.) ALTERNATE VOLUMES DISALLOWED does not let TFDS select another drive if the first one does not have the specified amount of free space. In this case, no dump is taken, and the processor remains down.
ALL

specifies all processors.

Caution. If the first physical processor number configured is not processor 0, you must add a DISABLECPUS option for each processor number lower than the number of the first physical processor (down to 0) configured to avoid receiving an incorrect list of enabled processors with the TFDSCOM CONFIG command. An incorrect list of enabled processors does not have an adverse effect on the normal operation of TFDS software.
FMDINITCPU Using TFDSCOM Commands To edit the command, enter one of these FC subcommands: R | r replacement-string Replaces one or more characters on a one-for-one basis. I | i insertion-string Inserts one or more characters. D | d delete character Deletes one character. Repeat to erase more characters. The subcommand begins its operation at the character positioned directly above it. To stop editing the line, press Enter.
Using TFDSCOM Commands FMDINITPRIORITY FMDINITPRIORITY The FMDINITPRIORITY command sets the priority to the program that will take the initial processor dump of the halted processor. FMDINITPRIORITY priority priority is the process priority. The maximum value is 190. The minimum value is 150. The default value is 180. Guidelines • • You can use the abbreviation fip in place of FMDINITPRIORITY.
Using TFDSCOM Commands FMDSIZE starting. Lowering the parameter increases the time for the complete processor memory to be returned to system operation. Example This example sets the FMD post-reload PRDUMP priority to 150: FMDPOSTPRIORITY 150 FMDSIZE The FMDSIZE command sets the initial dump size in megabytes. FMDSIZE Initial-Partial-Dump-Size Initial-Partial-Dump-Size is the initial partial dump size in megabytes. The default is 2 gigabytes. The maximum value is 3 gigabytes.
HELP Using TFDSCOM Commands ON turns on FMD. This value is the default if your system has a processor size greater than 2 gigabytes. OFF turns off FMD. This value is the default if your system does not have a processor size greater than 2 gigabytes. Guidelines • • • • You can use the abbreviation fd in place of FMDUMP. As the name implies, the FMD feature speeds up the dumping of halted processors.
HISTORY Using TFDSCOM Commands HELP events displays information for all TFDS events.
IGNORECPUS Using TFDSCOM Commands Example Sample HISTORY output: 06/13 09:23 06/13 09:23 06/13 09:23 06/13 09:23 ********End Instrumented subsystem triggered TFDS instument (17) No Saveabend Requested (17) --Closing Incident-(17) --Processing Completed for incident-of data******** The incident number is in parentheses. This number enhances the readability of HISTORY output with multiple simultaneous processing events.
Examples

• This example sets the ignore flag for processors 2 and 3:
  IGNORECPUS 2 3
• This example causes all processors to be ignored:
  IGNORECPUS ALL
• This example cancels the ignore flag for processor 0:
  IGNORECPUS OFF 0

IGNOREOPERATORHALTS

Use the IGNOREOPERATORHALTS command to demonstrate the TFDS AUTODUMP and AUTORELOAD features.

IGNOREOPERATORHALTS { ON | OFF }

ON

directs TFDS to ignore operator halts.
Using TFDSCOM Commands MAXCONDUMPS rec-num is the record number in the incident database. Guidelines • • You can use the abbreviation inc in place of INCIDENT. The DUMPOVERRIDE flag is cleared after a duplicate incident is processed and the dump is acquired. MAXCONDUMPS Use the MAXCONDUMPS command to indicate the number of concurrent processor dumps that TFDS should be allowed to initiate. MAXCONDUMPS num-condumps num-condumps is the maximum number of concurrent dumps that TFDS is allowed to perform.
PROCESSINGDELAY

The PROCESSINGDELAY command instructs TFDS to wait a specific number of seconds before it starts the analysis and possible dump of a processor that is down due to software failure.

PROCESSINGDELAY time

time

is the delay in seconds. Seconds can range from 0 through 86400. The default value is 1.

Guidelines

• You can use the abbreviation pd in place of PROCESSINGDELAY.
REMOTENOTIFY Using TFDSCOM Commands REMOTENOTIFY The REMOTENOTIFY command is useful if you are managing multiple systems from one workstation. With this option enabled, any software-related processor failure that results in a dump creates a zero-length file on the remote system as a notification mechanism. REMOTENOTIFY { ON system.volume.subvolume { OFF } } system represents a system name. volume represents a volume name. subvolume represents a subvolume name. OFF turns off REMOTENOTIFY.
Using TFDSCOM Commands REPORT REPORT The REPORT command displays records from the incident database. REPORT [ rec-num ] [, [ DETAIL ], [ TOTAL-REC n-recs ] ] [ DATE "date" ] , [ TOTAL-REC n-recs ] [ STATUS ( inc-stat ) ] rec-num specifies the first record to display. If not specified, REPORT starts with the first incident in the incident database. DETAIL gives detailed record information.
Examples

• This example lists all records in sequential order in the database:
  REPORT
• This example starts at record 1 to list detail information records to EOF:
  REPORT DETAIL
• This example starts at record 1 and displays three detail information records:
  REPORT DETAIL TOTAL-REC 3
• This example starts at record 20 and lists detail information records to EOF:
  REPORT 20 DETAIL
• This example lists all records that were created through the date specified:
  REPORT DAT
RETRY-DUMP Using TFDSCOM Commands Halt Processor halt code. Product Product number; available only for TFDS instrumented programs (NA if not available). Symptom String Information collected during error processing, used to uniquely identify the location of a fault; composed of Company Identifier, Product Identifier, Program Name, Source Filename, DIP id, and optionally up to five descriptive text message insert fields (ASCII text) and up to five binary fields.
Using TFDSCOM Commands RETRY-RELOAD RETRY-RELOAD The RETRY-RELOAD command sets the number of attempts TFDS makes to reload a processor if the initial reload is unsuccessful. RETRY-RELOAD retry-number retry-number represents the maximum number of times TFDS attempts to reload a failed processor. The reload range is 0 through 2. The default value is 2. SAVE The SAVE command saves the current configuration to the TFDS configuration file.
STATUS Using TFDSCOM Commands STATUS The STATUS command returns a TFDS view of the processors. It shows the processor status and reload states.
TAPE Using TFDSCOM Commands cpust is the processor state. It can be: Disabled Ignored Stifling Enabled N/A Waiting fs is the number of successive failures of the processor since TFDS was started. Successive means one following the other within AUTOSTIFLE hours. ft is the total number of failures of the processor since TFDS was started. The time TFDS was started is noted at the bottom of the screen. dateNtime is a timestamp. If no timestamp applies, this field contains N/A.
Using TFDSCOM Commands TASKS Example This example specifies $TAPE1 as the tape drive: TAPE $TAPE1 TASKS The TASKS command displays current running tasks within TFDS. TASKS Guideline TASKS displays burst-related tasks only. Other tasks do not have output. TFDS STOP The TFDS STOP command lets you stop the TFDS process pair. TFDS STOP Guidelines • • • For G-series RVUs, this command is not normally used.
A Disabling Other Software This appendix contains instructions for disabling the processor dump and reload capabilities of: • • Programmatic Network Administrator (PNA) Rule Management Services (RMS) Programmatic Network Administrator (PNA) To disable the processor dump and reload capabilities within PNA: 1. Use the INFO EVENT and INFO RULE commands from the Rule Management Utility Program (RMUP) to determine any active rule and version number or numbers that respond to the CPU DOWN message. Note.
Rule Management Services (RMS) Disabling Other Software Rule Management Services (RMS) To disable the processor dump and reload capabilities within RMS: 1. If you are not already in NonStop NET/MASTER NARS, log on to it. After a successful logon, a primary menu panel lets you select Rule Maintenance within RMS. 2. Select Rule Maintenance by typing R.6 (at any => prompt) and press Enter. The RMS: Rule Maintenance Panel appears. 3.
B TFDSCOM Command Migration Table

TFDS commands were simplified for T6523AAW (D-series) and T6523AAX (G-series). Instead of having two sets of commands, TFDSCOM commands and TFDS configuration file commands, one set of TFDSCOM commands now serves both purposes.
C File-Naming Conventions Subvolume Names The files generated by TFDS activity are loaded under specific subvolumes. Each incident created in the database has a specific subvolume that uses these conventions: ZDMPnnnn nnnn represents the TFDS incident number. The maximum incident number is 9999. File Names TFDS generates a number of files in the system while running. The naming conventions for these files are described in this appendix.
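As a small illustration of the ZDMPnnnn subvolume convention described above, the sketch below builds a subvolume name from an incident number. It assumes nnnn is zero-padded to four digits, as the file examples elsewhere in this manual (such as ZLOG0273) suggest, and that 0 is an acceptable lower bound; the function name is hypothetical.

```python
def incident_subvolume(incident: int) -> str:
    """Return the ZDMPnnnn subvolume name for a TFDS incident number."""
    # The manual gives 9999 as the maximum incident number.
    if not 0 <= incident <= 9999:
        raise ValueError("incident number must be 0 through 9999")
    # Assumption: nnnn is zero-padded to four digits.
    return f"ZDMP{incident:04d}"
```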
Complementary Information File-Naming Conventions nnnn corresponds to the event file name. • ESLOG files ($ZLOG) These are the primary files that contain information related to the HP Tandem Maintenance and Diagnostic System (TMDS) logs on systems running D-series RVUs: ZLOGnnnn nnnn corresponds to the event file name.
D EMS Messages and Templates The Event Management Service (EMS) receives and logs information about important events that occur in TFDS and the TFDSCOM user interface. This section describes each standard TFDS event reported through EMS. The message descriptions provide: • • • • Message text Cause of the message Effect on the system Suggested recovery strategy TFDS event messages can be identified by the subsystem name DMP.
EMS Messages and Templates EMS Messages EMS Messages 1 TFDS *0001* $ZDMP : DUMP configured off; No DUMP for CPU : nn Cause. TFDS received a CPU DOWN message, and the AUTODUMP flag is set to OFF. Effect. A dump is not generated. The processor status for TFDS is CPU DOWN, NO DUMP. Recovery. Either force the dump through TFDSCOM (set the AUTODUMP flag to ON and use the ANALYZE command) or reload the processor manually. Note.
EMS Messages and Templates EMS Messages Recovery. Mount a tape (on the tape drive specified through the TFDSCOM TAPE command) and put the tape drive online, or issue a CANCELBACKUP command. 4 TFDS *0004* $ZDMP : BACKUP Load Next Tape on: xxxxxxxx Cause. The backup operation requires more tapes. Effect. The backup is waiting for the next tape. Recovery. Mount a tape and put the tape drive online.
EMS Messages and Templates EMS Messages RETRY-DUMP, RELOAD-ON-FAILURE, AUTORELOAD, and RETRY-RELOAD configuration settings. Recovery. Informational message only; no corrective action is needed. 8 TFDS *0008* $ZDMP : RELOAD; Starting a RELOAD for CPU : nn Cause. A reload process has been started. Effect. If followed by TFDS event message #10, Successful RELOAD, the processor is operational. If not, TFDS continues attempts to reload the processor according to your RETRY-RELOAD configuration setting.
EMS Messages and Templates EMS Messages command determines whether TFDS automatically reloads the processor after exhausting the specified number of retries, or the processor remains down (which allows you to find disk space and request the dump through TFDSCOM). Recovery. If the RELOAD-ON-FAILURE option described previously has not been specified, you can enable more disk space and request the dump through TFDSCOM, or reload the processor manually.
EMS Messages and Templates EMS Messages 15 TFDS *0015* $ZDMP : Product #nnnn, Halt Code %nnnnnn CPU #nn Recurrent Problem Logged, File(s) Located at xxxxxxxx Symptom String : xxxxxxxx Version Info : xxxxxxxx Resource Name : xxxxxxxx Source File Name : xxxxxxxx Company Name : xxxxxxxx Cause. A recurrent problem was discovered. Effect. Data collection is not performed on this problem, and the number of occurrences of this problem is tabulated. The affected processors are reloaded automatically. Recovery.
EMS Messages and Templates EMS Messages 18 TFDS *0018* $ZDMP: Too many concurrent CPU dumps. Reloading CPU: nn Cause. The number of processor failures is greater than the number of concurrent dumps allowed by the current TFDSCOM MAXCONDUMPS configuration setting. Effect. A reload process is started for the specified processor. It is not dumped. If the reload is successful, the processor is operational. Recovery. Informational message only; no corrective action is needed.
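The message texts in this appendix share a common shape: the literal `TFDS`, a four-digit event number between asterisks, and then free text that sometimes ends with a processor number. A script that scans a text copy of the log could pull those fields out as sketched below. The regular expressions and the `parse_event` helper are assumptions based only on the samples printed here; real EMS output may differ, and the sample lines in the usage are invented placeholders in the same shape.

```python
import re
from typing import Optional

# "TFDS *0018* ..." : capture the event number and the remaining text.
EVENT_RE = re.compile(r"TFDS \*(\d{4})\*\s*(.*)")
# Trailing "CPU: nn" or "CPU : nn" at the end of the message text.
CPU_RE = re.compile(r"CPU\s*:\s*(\d+)\s*$")


def parse_event(line: str) -> Optional[dict]:
    """Extract event number, text, and optional CPU number from a line."""
    match = EVENT_RE.search(line)
    if match is None:
        return None  # not a TFDS event line
    text = match.group(2).strip()
    cpu_match = CPU_RE.search(text)
    return {
        "number": int(match.group(1)),
        "text": text,
        "cpu": int(cpu_match.group(1)) if cpu_match else None,
    }
```

For instance, a line shaped like event 18 yields the event number 18 and the reloaded processor number, which is enough to count how often MAXCONDUMPS caused a dump to be skipped.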
21
TFDS *0021* Processing Anomoly: xxxxxxxx
Cause. A noncritical error occurred during TFDS processing. For example, if a RELOAD command is issued while TFDS is analyzing a halted processor, an error results.
Effect. Normally, TFDS terminates incident processing at the point the error is encountered.
Recovery. Determine the cause of the error and correct it.

22
TFDS *0022* Processing Status: xxxxxxxx
Cause. This message indicates routine TFDS operational status.
25
TFDS *0025* Fup Processing Status: xxxxxxxx
Cause. TFDS is in the process of duplicating files for an incident.
Effect. To determine the current status of this process, view these messages in the event log.
Recovery. Informational message only; no corrective action is needed.

26
TFDS *0026* Stop message from Tfdscom: TFDS stopping
Cause. A TFDS STOP command was received from TFDSCOM.
Effect. TFDS terminates.
Recovery.
Effect. N.A.
Recovery. N.A.

30
TFDS *0030* TFDS Config Status: xxxxxxxx
Cause. As TFDS starts, it reads the configuration file and communicates with the NonStop Kernel to determine information about the hardware environment.
Effect. To determine the current status of the TFDS configuration, view these messages in the EMS log.
Recovery. Informational message only; no corrective action is needed.

31
TFDS *0031* Garth Status: xxxxxxxx
Cause.
• Descriptions of the problem and accompanying symptoms
• Details from the message or messages generated
• Supporting documentation such as Event Management Service (EMS) logs, trace files, and a processor dump, if applicable

If your local operating procedures require contacting the Global Customer Support Center (GCSC), supply your system number and the numbers and product versions of all related products as well.
bit 1 = TFDS_CAPTURE_STACK_TRACE - unused - always get stack trace
bit 0 = TFDS_CAPTURE_SELECTIVE

Cause. For D-series RVUs, TFDS and the TFDS run-time library issue this event to notify you about an internal software error within the NonStop system. TFDS could generate this event as a result of a halted processor or any NonStop Kernel subsystem instrumented with the TFDS run-time library.
bit 4 = TFDS_PROCESS_STOP
bit 3 = TFDS_PROCESS_CONTINUE
bit 2 = TFDS_CAPTURE_ALL
bit 1 = TFDS_CAPTURE_STACK_TRACE - unused - always get stack trace
bit 0 = TFDS_CAPTURE_SELECTIVE

Cause. For G-series RVUs, TFDS and the TFDS run-time library issue this event to notify you about an internal software error within the NonStop system.
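As an aid to reading the G-series bit assignments listed above, the following Python sketch decodes an integer into the names of the bits that are set. It is illustrative only. The manual does not state which end of the word bit 0 refers to; this sketch assumes bit 0 is the least significant bit, so adjust the mask if your system numbers bits from the most significant end. The function name `decode_flags` is hypothetical.

```python
from typing import List

# Bit positions and names as listed for G-series RVUs.
G_SERIES_FLAGS = {
    4: "TFDS_PROCESS_STOP",
    3: "TFDS_PROCESS_CONTINUE",
    2: "TFDS_CAPTURE_ALL",
    1: "TFDS_CAPTURE_STACK_TRACE",  # unused: a stack trace is always taken
    0: "TFDS_CAPTURE_SELECTIVE",
}


def decode_flags(value: int) -> List[str]:
    """Return the names of all set bits, highest bit first.

    Assumption: bit 0 is the least significant bit of the value.
    """
    return [
        name
        for bit, name in sorted(G_SERIES_FLAGS.items(), reverse=True)
        if value & (1 << bit)
    ]
```

Under that assumption, `decode_flags(0b10100)` returns the names for bits 4 and 2.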
Templates for EMS Support

Use these TFDS templates if you are developing or customizing operator messages for TFDS events.
MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-INCIDENT-RECURRENT
"TFDS *0006* <1>: "
" Product #<2>, Halt Code %<3>"
" Recurrent Problem Logged, "
" Symptom String : <4>"
1: ZDMP-TKN-CRTPIDN, FILE
2: ZDMP-TKN-PRODUCT-NUM
3: ZDMP-TKN-HALT-CODE, 06
4: ZDMP-TKN-SYMPTOM-STRING

MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-RCVDUMP-STARTING
"TFDS *0007* <1>: DUMP; Starting"
" a ReceiveDump for CPU: <2>"
1: ZDMP-TKN-CRTPIDN, FILE
2: ZDMP-TKN-CPU-NUM, ZI2

MSG: ZEMS-TKN-EVENTNUMB
MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-TFDS-STARTING
"TFDS *0013* <1>: TFDS starting"
" in CPU: <2>"
1: ZDMP-TKN-CRTPIDN, FILE
2: ZDMP-TKN-CPU-NUM, ZI2

MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-HALT-INC-NEW
"TFDS *0014* <1>:"
" Product #<2>, Halt Code %<3> CPU #<4>"
" New Problem Logged, File(s) Located at <5>"
" Symptom String : <6>"
" Version Info : <7>"
" Resource Name : <8>"
" Source File Name : <9>"
" Company Name : <10>"
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: ZDM
MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-RCVDUMP-RETRYING
"TFDS *0016* <1>: RCVDUMP of CPU <2> Failed.
MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-PROCESSING-RELOAD
"TFDS *0024* Reloading Status: <1>."
1: ZDMP-TKN-JUST-TEXT, FILE

MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-PROCESSING-FUP
"TFDS *0025* FUP Processing Status: <1>."
1: ZDMP-TKN-JUST-TEXT, FILE

MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-TFDS-STOPPING
"TFDS *0026* <1>: TFDS Stopping"
1: ZDMP-TKN-SYMPTOM-STRING, FILE

MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-PROCESSING-PRDUMP
"TFDS *0027* PRDump Status: <1>.
MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-DATA-CAPTURE
"Software Data Capture *6000* : Company <1>"
" Product <2>,"
" Program <3>, File name: <4>, DIPID <5>,"
" Severity <6>, VPROC <7>,"
" Stack trace <8>"
1: ZDMP-TKN-COMPANY, FILE
2: ZDMP-TKN-PRODUCT, FILE
3: ZDMP-TKN-PROGRAM, FILE
4: ZDMP-TKN-FILENAME, FILE
5: ZDMP-TKN-DIPID, O6
6: ZDMP-TKN-PERCEIVED-SEVERITY, I1
7: ZDMP-TKN-VPROC, FILE
8: ZDMP-TKN-PROC-CALL-TRACE

MSG: ZEMS-TKN-EVENTNUMBER, ZDMP-EVT-DATA-
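When customizing operator messages, it can help to preview how the numbered placeholders in a template line up with token values. The following Python sketch is illustrative only: it joins the quoted text fragments of a template and replaces each `<n>` placeholder with a supplied value. It does not model the format specifiers that follow the token names (for example, FILE or ZI2), and the function name `render_template` and the sample values are hypothetical.

```python
import re
from typing import Dict, Sequence


def render_template(parts: Sequence[str], tokens: Dict[int, str]) -> str:
    """Join template text fragments and fill in <n> placeholders.

    parts:  the quoted strings of one MSG template, in order.
    tokens: placeholder number -> display value (sample data only).
    Placeholders with no supplied value are left unchanged.
    """
    text = "".join(parts)

    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        return str(tokens.get(index, match.group(0)))

    return re.sub(r"<(\d+)>", substitute, text)


# Preview of the TFDS *0013* template with made-up token values.
preview = render_template(
    ["TFDS *0013* <1>: TFDS starting", " in CPU: <2>"],
    {1: "$ZDMP", 2: "03"},
)
```

Leaving unknown placeholders untouched makes a missing token easy to spot in the preview.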
E Debugging TFDS Errors

This appendix is intended for developers or qualified support personnel who use failure data to debug TFDS errors. Methods for finding the cause of TFDS errors are suggested in sequential order, from simple to more complex. If the first method does not apply to your situation or help you identify the problem, proceed to the next method.

1. Look for TFDS event messages to get a general flow of the TFDS processing and an idea of where the failure occurred.
03 33:28:878 1 end trace init
-01 33:44:341 1 USR_SYSMESSAGE 2001:08:22:09:33:44:341:260: mes=-2
-01 33:44:341 1 ===============////////////============
-01 33:44:341 1 USR_SYSMESSAGE MSG_CPUDOWN=3
-01 33:44:341 1 ===============////////////============
-01 08/22 09:33 0 CPUDOWN=3
-01 33:44:341 2 Write_event_ems: mes# 22
-01 08/22 09:33 0 BDFGONR/0111001 AS10 ASH24 ARD100 MEM4000 RD1 MCD1 RR1 TNS G06 T6523G05_31OCT2001_TFDSABF
-01 33:44:365 2 Write_event_ems: mes# 22
13 33:44:398 2 BD
Online Help for TFDS Debugging Commands

To get online help for TFDS debugging commands, in the TFDSCOM interface, enter:

help trace

If TFDS does not find Garth, or if you want to find a different Garth, use the GARTH_FILE command.

GARTH_FILE

Use this command to specify the file TFDS is to use to run the GARTH process (for post-G05.00 RVUs).

GARTH_FILE filename

filename
represents a Guardian filename.
F Fast Memory Dump

Fast Memory Dump (FMD) speeds processor dumping and reloading after a software halt. FMD is intended primarily for NonStop S76000 and S86000 systems with processor memory sizes up to and including 16 gigabytes. Taking a CPU dump of a halted processor takes a significant amount of time. The major benefit of FMD is a reduction in the time that a processor remains unavailable for use after halting. Automating the process through TFDS simplifies its use.
TFDS is designed to process software halts. After that determination is made, TFDS starts a dialog with Garth to extract incident data from the halted processor. Using this data, TFDS determines if this incident is a duplicate of an earlier incident or a new incident. If TFDS deems this incident to be a root incident, the initial dump process is activated.

Initial Dump

When a processor halts, TFDS dumps only a subset of physical memory.
Note. Current values for parameters associated with these commands can be obtained by typing CONFIG (which displays the current configuration values).

Caution. Disabling AUTORELOAD or AUTODUMP forces TFDS into non-FMD mode. Non-FMD mode creates one large dump file (which is the same total size as the two dump files taken in FMD mode), but the processor is not reloaded until the complete dump is taken.
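The rule in the preceding caution can be expressed as a simple check, for example in a script that reviews configuration values before a maintenance window. The following Python sketch is illustrative only. The function name and its boolean parameters are hypothetical stand-ins for the AUTORELOAD and AUTODUMP settings, and the sketch reports only whether non-FMD mode is forced; other settings that might affect FMD are not modeled.

```python
def forced_non_fmd(autoreload: bool, autodump: bool) -> bool:
    """Return True if TFDS is forced into non-FMD mode.

    Per the caution above, disabling either AUTORELOAD or AUTODUMP
    forces non-FMD mode: one large dump file is taken, and the
    processor is not reloaded until the complete dump is taken.
    """
    return not (autoreload and autodump)
```

A result of False means only that these two settings do not force non-FMD mode.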
The following event log (generated via the HP WebViewPoint product) illustrates the processing step described previously for a halted processor using FMD processing.

0001 16:20 \MS13 TFDS *0022* Processing Status: CPUDOWN=2.
0002 16:20 \MS13 TFDS *0022* Processing Status: BDFONR/011001 AS10 ASH24 ARD100 MEM2000 RD1 MCD1 RR1 TNS G06 T6523G05_08MAY2002_.
0003 16:20 \MS13 TFDS *0022* Processing Status: Software Halt.
0028 16:21 \MS13 TFDS *0023* Dump Status: (1)RCVDUMP mes= 90% (63224 pages) done.
0029 16:21 \MS13 TFDS *0023* Dump Status: (1)RCVDUMP COMPLETED ok.
0030 16:21 \MS13 TFDS *0023* Dump Status: (1)RCVDUMP mes=Priming CPU 2.
0031 16:21 \MS13 TFDS *0009* $ZDMP: DUMP; Successful ReceiveDump for CPU: 02
0032 16:21 \MS13 TFDS *0022* Processing Status: Lauching RELOAD.
0033 16:21 \MS13 TFDS *0024* Reloading Status: (1)Create; cpu=1, priority=180.
0061 16:22 \MS13 TFDS *0009* $ZDMP: DUMP; Successful ReceiveDump for CPU: 02
0062 16:22 \MS13 TFDS *0022* Processing Status: Lauching DUMPUTIL for FMD.
0063 16:22 \MS13 TFDS *0028* Dumputil Status: (1) Startup message 'link $D2301E.ZDMP0001.CPU02A $D2301E.ZDMP0001.CPU02B'.
0064 16:22 \MS13 TFDS *0028* Dumputil Status: (1)DUMPUTIL mes=GUARDIAN DUMPUTIL - T9070G07 - (02MAY02).
0065 16:22 \MS13 TFDS *0028* Dumputil Status: (1) DUMPUTIL COMPLETED ok.
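If you save an event log like the one above to a text file, a short script can split each entry into its fields for searching or timing analysis. The following Python sketch is illustrative only. It assumes the layout seen in this sample (sequence number, time, node name, the word TFDS, the event number between asterisks, then the message text); the names `LINE`, `Entry`, and `parse_line` are hypothetical, and other viewers or log formats may differ.

```python
import re
from typing import NamedTuple, Optional

# Layout assumed from the sample log: "0029 16:21 \MS13 TFDS *0023* text"
LINE = re.compile(
    r"^(?P<seq>\d{4}) (?P<time>\d{2}:\d{2}) (?P<node>\\\w+) "
    r"TFDS \*(?P<event>\d{4})\* (?P<text>.*)$"
)


class Entry(NamedTuple):
    seq: int     # line sequence number in the display
    time: str    # HH:MM as shown
    node: str    # node name, including the leading backslash
    event: int   # TFDS event number
    text: str    # remainder of the message


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None if it does not match the layout."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return Entry(
        int(match["seq"]),
        match["time"],
        match["node"],
        int(match["event"]),
        match["text"],
    )
```

Filtering the parsed entries on `event == 9`, for example, lists the successful ReceiveDump messages.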
Glossary

burst. Multiple duplicate First Failure Data Capture (FFDC) events issued by an FFDC instrument program or subsystem.

CONFLIST. The system configuration data file.

core services. The portion of the operating system that consists of the low-level functions, including interprocess communication; I/O interface procedures; and memory, time, and process management.

CPU DOWN files. These files are products of the TFDS activity that occurs each time a CPU DOWN message is being processed.
HP Tandem Advanced Command Language (TACL). A utility of the NonStop Kernel. The TACL tool is used as the primary general-purpose interface for command and control for HP NonStop systems.

HP Tandem Failure Data System (TFDS). A component of the NonStop Kernel. This tool isolates software problems and provides automatic processor failure data collection, diagnosis, and recovery services.

HP Tandem Maintenance and Diagnostic System (TMDS).
TMDS. See HP Tandem Maintenance and Diagnostic System (TMDS).

TMF. See HP NonStop Transaction Management Facility (TMF).

TMFCOM. A TMF utility for communicating commands and information between TMF and a system manager or system operator using TMFCOM commands.