Troubleshooting guide
mpicc to compile mx/unit_test/src/mpi/mpi_stress.c. The executable mpi_stress can 
then be run like any other MPI program using mpirun.ch_mx or mpirun.ch_gm. 
If the GM firmware is installed on the cluster, the GM-specific stress program, 
gm_stress.c, can also be used to stress the network. Full details of how to run gm_stress 
are given in the FAQ entry (http://www.myri.com/cgi-bin/fom?file=53). 
8. Run fm_show_alerts for diagnostic information on any damaged/failing hardware 
component. 
Are there any “un-ACKed alerts” listed in the output of fm_status? 
If yes, run fm_show_alerts to print a list of all active alerts, which signal possible 
hardware error conditions. 
Alerts are created when certain exceptional events occur and are reported to the fms. 
Alerts persist within the fms until they are cleared. Clearing usually requires both that 
the alert be acknowledged (ACKed) and that the condition which caused it has cleared. 
Once an alert has been acknowledged, it is marked as "ACKed". Once the condition that 
caused the alert has cleared, the alert is marked as a "relic". Most alerts are deleted only 
after they are both ACKed and relics. 
By default, fm_show_alerts prints only alerts which have not been ACKed and are not 
relics. Each alert has a unique index which can be passed to fm_ack_alert to 
acknowledge the alert. 
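
The alert lifecycle described above can be illustrated with a small model. This is only a sketch of the stated rules (an alert is hidden from the default listing once it is ACKed or a relic, and is deleted once it is both); it is not the fms implementation. The Alert class, its field names, and the helper functions default_view and ack are hypothetical, and the alert descriptions are placeholders rather than real FMS alert names.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical stand-in for one alert; not the real fms data structure."""
    index: int            # unique index, of the kind passed to fm_ack_alert
    description: str      # placeholder text, not a real FMS alert name
    acked: bool = False   # set once an operator acknowledges the alert
    relic: bool = False   # set once the condition that caused it has cleared

    def deletable(self) -> bool:
        # Per the text, most alerts are deleted only when both flags are set.
        return self.acked and self.relic


def default_view(alerts):
    """Model the default listing: alerts that are neither ACKed nor relics."""
    return [a for a in alerts if not a.acked and not a.relic]


def ack(alerts, index):
    """Model acknowledging one alert by index, then drop deletable alerts."""
    for a in alerts:
        if a.index == index:
            a.acked = True
    return [a for a in alerts if not a.deletable()]


alerts = [
    Alert(1, "example condition A"),
    Alert(2, "example condition B", relic=True),  # condition already cleared
    Alert(3, "example condition C"),
]
```

In this model, acknowledging alert 2 removes it immediately because its condition has already cleared, whereas acknowledging alert 1 only hides it from the default listing until its condition clears as well.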
Refer to http://www.myri.com/scs/fms/#alerts, as well as the file libfma/alert.def in the 
FMS distribution, for a detailed listing of all possible alerts. 
Example output of fm_show_alerts can also be found on the FMS webpage, 
http://www.myri.com/scs/fms/#examples. 
© 2007 Myricom, Inc. DRAFT 