Troubleshooting guide
If badcrc_cnt (reported by gm_counters) increased significantly on any of the hosts after
the test, you have identified a possible hardware trouble spot in your cluster. You must
now determine whether the bad CRCs are coming from the Myrinet NIC, the cable, or the
port on the Myrinet switch.
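
The before-and-after comparison described above can be sketched in Python. This is a minimal sketch, not part of the Myrinet tools: it assumes you have saved the counter output to text before and after the test, and that each counter appears on its own line as a name followed by an integer value. That line format and the function names parse_counters and increased_counters are assumptions made for illustration; adjust the parsing to match the output you actually see.

```python
def parse_counters(text):
    """Parse lines of the ASSUMED form '<name> <integer>' into a dict.

    Lines that do not end in an integer are skipped.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[-1].isdigit():
            name = " ".join(parts[:-1])
            counters[name] = int(parts[-1])
    return counters


def increased_counters(before, after, names=("badcrc_cnt",)):
    """Return {name: delta} for each watched counter that grew during the test."""
    deltas = {}
    for name in names:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas
```

A host whose result is non-empty after the test is a candidate for the isolation steps in the sections that follow.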
B.1. How do I determine if a cable has failed? 
In most cases, an increasing Bad CRC8 or badcrc__invalid (or badcrc_cnt) counter is
caused by a damaged cable. If you have spare cables, we suggest that you first replace the
suspect cable and then rerun the mx_pingpong "loopback test" or gm_allsize
"loopback test" described above to see whether the counter continues to increase. If it
does, the cable is not the cause of the hardware failure, and you must now determine
whether the failure is due to the Myrinet NIC or the port on the Myrinet switch to
which it is connected.
If the Bad CRC8 or badcrc__invalid (or badcrc_cnt) counter no longer increases after
you replace the cable, then the original cable is the damaged hardware component.
Contact help@myri.com to return the cable for repair or replacement. You will be
assigned a "Return Material Authorization" (RMA) number. The information required
for an RMA is outlined in the Myrinet FAQ (http://www.myri.com/scs/FAQ/).
B.2. How do I determine if a port on a switch line card has failed? 
To determine whether a port on a Myrinet switch has failed, use a known good cable to
connect the NIC to a different port on the switch line card, and then rerun the
mx_pingpong "loopback test" or gm_allsize "loopback test".
If the badcrc count no longer increases, then the original switch port is the cause of the
hardware failure.
Please note that if a cable is moved from one switch port to another switch port (or from
one NIC to another NIC), the topology of the network has changed. Each MX/GM process
reaches every other process through a relative address (something like “go to the first
switch, jump 3 ports, go to the next switch, jump -2 ports”). If the cabling of the
network has changed, the mapper must be re-run so that these relative addresses can be
updated.
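
To see why a moved cable invalidates relative addresses, consider the following Python sketch. It is a simplified model for illustration only, not the mapper's actual data structures: the links table, the switch names s1 and s2, the host names, and the walk function are all hypothetical. Each step of a route is a signed port offset applied to the port on which the packet entered the switch.

```python
def walk(links, switch, in_port, deltas):
    """Follow a relative route and return the host reached, or None.

    links maps (switch, output_port) to either a host name (str) or the
    (switch, input_port) at the far end of that cable.
    """
    for delta in deltas:
        out_port = in_port + delta          # relative jump from the input port
        nxt = links.get((switch, out_port))
        if nxt is None or isinstance(nxt, str):
            return nxt                      # dead end, or arrived at a host
        switch, in_port = nxt               # continue at the next switch
    return None                             # route ended on a switch


# Hypothetical cabling: s1 port 5 connects to s2 port 4; hostB is on s2 port 2.
links = {("s1", 5): ("s2", 4), ("s2", 2): "hostB"}

route = [3, -2]                             # "jump 3 ports, then jump -2 ports"
print(walk(links, "s1", 2, route))          # sender cabled to s1 port 2
print(walk(links, "s1", 6, route))          # same route after moving to port 6
```

With the sender on port 2 of s1, the route reaches hostB. After the cable is moved to port 6, the same offsets lead to port 9 of s1, where this model has no cable, so the stale route no longer reaches its destination until the routes are recomputed.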
If you are using MX or GM-2, the MX/GM-2 mapper detects this change in topology
automatically. If you are using GM-1, however, the GM-1 mapper must be re-run before
any communication over the Myrinet network can occur.
If the port on a switch line card is identified as the point of failure, contact 
help@myri.com to return this switch line card for repair/replacement. You will be 
assigned a "Return Material Authorization" (RMA) number. The information required for 
an RMA is outlined in the Myrinet FAQ (http://www.myri.com/scs/FAQ/). 
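
The isolation procedure in B.1 and B.2 can be summarized as a small decision function. This is a sketch under stated assumptions: the function name isolate_fault and its arguments are hypothetical, and the final "NIC" result is an inference by elimination (the cable and the switch port have both been ruled out), not a test described in this section.

```python
def isolate_fault(still_bad_after_cable_swap, still_bad_after_port_swap=None):
    """Suggest the suspect component from the loopback test results.

    still_bad_after_cable_swap: True if the bad CRC counter kept increasing
        after the suspect cable was replaced.
    still_bad_after_port_swap: True/False for the result after moving to a
        different switch port with a known good cable; None if not yet tried.
    """
    if not still_bad_after_cable_swap:
        return "cable"
    if still_bad_after_port_swap is None:
        return "switch port or NIC (move to another switch port and retest)"
    if not still_bad_after_port_swap:
        return "switch port"
    # Cable and switch port are both ruled out; by elimination, suspect the NIC.
    return "NIC"
```

For example, isolate_fault(True, False) returns "switch port", which corresponds to the case where the count stops increasing only after the NIC is moved to a different port.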
© 2007 Myricom, Inc. DRAFT 