Troubleshooting guide
•  Did the firmware (MX or GM) load properly on all nodes in the cluster? Were 
there any error messages in the system log (dmesg or /var/log/messages) output 
on any of the nodes when you loaded the firmware? Sections V, VI, and VII 
address software installation and troubleshooting issues. Run-time diagnostic 
error messages are also explained in the Myrinet FAQ 
(http://www.myri.com/scs/FAQ/). 
•  Were there any error messages in the system log (dmesg or /var/log/messages) 
output on any of the nodes after loading the firmware? 
•  Were there software run-time error messages while running the application? A 
number of these run-time messages are explained in the Myrinet FAQ 
(http://www.myri.com/scs/FAQ/). 
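The log checks in the questions above can be partly automated. The following is a minimal Python sketch, not a Myricom tool: it filters system log text for lines that mention both a driver name and a problem keyword. The keyword patterns (DRIVER_HINTS, PROBLEM_HINTS) and the function names are assumptions chosen for illustration; adjust them to the messages your nodes actually emit, and consult the Myrinet FAQ for the meaning of each message.

```python
import re

# Assumption: these substrings identify Myrinet-related kernel messages.
# They are illustrative guesses, not a list published by Myricom.
DRIVER_HINTS = re.compile(r"\b(mx|gm|myri)", re.IGNORECASE)
PROBLEM_HINTS = re.compile(r"(error|fail|timeout|warn)", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return log lines that contain both a driver hint and a problem hint."""
    return [
        line
        for line in log_text.splitlines()
        if DRIVER_HINTS.search(line) and PROBLEM_HINTS.search(line)
    ]


def scan_log_file(path="/var/log/messages"):
    """Read one log file and return its suspicious lines."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle.read())
```

Running scan_log_file() on each node (or passing saved dmesg output to suspicious_lines) gives a short list of lines to review by hand. The filter is deliberately loose, so expect some unrelated matches.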
Further Details 
If there are host computer hardware or software problems, they will most likely 
appear as a failure during the Myrinet hardware or software installation phase 
(Section III and Section VIII Testing/Validation). They may also show up as an 
unexplained performance degradation or performance inconsistency on the nodes. 
Refer to the subsection entitled “3. Run mx_dmabench or gm_debug to test the PCI 
bandwidth” (page 30) in Section VIII Testing/Validation for further details. 
If there are any faulty Myrinet hardware components, these components are most easily 
isolated with the Fabric Management System (FMS) as described in Section VIII 
Testing/Validation. If you are unable to install FMS, you can use the troubleshooting 
procedures outlined in Appendix A and Appendix B. 
There are two sources of hardware counters available for Myrinet: 
•  host counters, reported by the MX test program mx_counters or the GM test 
program gm_counters; and 
•  switch counters and traps, reported by the web interface to the Myrinet switch(es). 
These hardware counters reveal important information about the health of the Myrinet 
hardware and the interactions of the hardware and the software. A detailed explanation 
of each of these hardware counters can be found in the Myrinet FAQ 
(http://www.myri.com/scs/FAQ/), and in the M3-CLOS-ENCL/M3-SPINE-ENCL switch 
tutorial (http://www.myri.com/scs/14U_switches/). If you are using the 
M3-CLOS-ENCL/M3-SPINE-ENCL switches, you can use the Log feature of the web interface 
(http://www.myri.com/scs/14U_switches/index-overview-web.html#log) to monitor 
switch traps in real time. If you are using the M3-E* switches, Mute 
(http://www.myri.com/scs/mute/) can be used to monitor the switch traps in real time. 
Note that Mute has been replaced by the Fabric Management System (FMS). 
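One common way to use such counters is to capture them twice, before and after a test run, and look only at the values that changed. The sketch below illustrates that idea in Python. It assumes the counter output has been saved as text with one "name: value" pair per line; that layout is an assumption made for this example, and the real output of mx_counters or gm_counters may be formatted differently. The function names are hypothetical.

```python
def parse_counters(text):
    """Parse 'name: value' lines into a dict; lines that do not fit are skipped.

    Assumption: one counter per line, with an integer after the last colon.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and name.strip() and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def changed_counters(before, after):
    """Return {name: difference} for counters whose value differs between snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }
```

Counters that stay at the same value drop out of the result, which leaves a short list to look up in the Myrinet FAQ.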
© 2007 Myricom, Inc. DRAFT 