HP Insight Cluster Management Utility v7.1 User Guide

.
.
Running /opt/cmu/bin/cmu_config_nvidia adds a list of predefined GPU metrics to
ActionAndAlertsFile.txt. To monitor these metrics using the GUI, select the desired metrics
from the Monitoring sensors list as described in Figure 33 (page 69).
NOTE: Not all metrics are supported by all NVIDIA GPUs, and some less commonly used metrics may be
commented out within ActionAndAlertsFile.txt. To add metrics to or remove metrics from the
Monitoring sensors list, uncomment or comment out the associated lines inside
ActionAndAlertsFile.txt as described in “Action and alert files” (page 77).
NOTE: HP Insight CMU dynamically determines if a client has working GPUs when monitoring
is initially started after installation on the client. This monitoring process allows for configurations
that have clients with GPUs and clients without GPUs. If the GPUs are not working when monitoring
is started (or GPUs are added at a later date), redeploy monitoring to the client (see “Installing the
HP Insight CMU monitoring client” (page 66)) and restart monitoring to ensure the GPUs are
recognized.
5.5.7.2 Monitoring AMD GPUs
If your client nodes contain AMD GPUs and are running version 8.83.5 or newer of the AMD GPU
driver, you can monitor your GPUs with HP Insight CMU.
If you have not done so already, install AMD GPU driver version 8.83.5 or newer on your client
nodes. This can be done in two ways:
1. Install the AMD GPU driver manually on one of the client nodes, back up the client image,
and clone the remaining clients with this new image.
2. Use the script /opt/cmu/contrib/cmu_install_amd to install the AMD GPU driver on
all running clients. For details, see the file
/opt/cmu/contrib/cmu_install_amd.README.
To enable GPU monitoring, the /opt/cmu/etc/ActionAndAlertsFile.txt file must be
updated with entries for HP Insight CMU GPU monitoring. To do this, run the script
/opt/cmu/bin/cmu_config_amd, which takes the number of GPUs on each client as an argument.
The following example updates ActionAndAlertsFile.txt to monitor clients that have 2
GPUs each. Monitoring must be restarted for the updates to take effect.
# cmu_config_amd 2
You are about to update the CMU ActionsAndAlerts file with metrics for monitoring AMD GPUs.
Continue? [y/n] y
Configuring GPU monitoring in CMU...
GPU monitoring configured successfully.
Copy of orignial /opt/cmu/etc/ActionAndAlertsFile.txt can found in
/opt/cmu/etc/ActionAndAlertsFile.txt_before_cmu_config_amd_config
Please restart CMU ('/etc/init.d/cmu restart') to enable these changes.
# /etc/init.d/cmu restart
.
.
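To review which entries the configuration script added, the backup copy named in the output above can be compared with the updated file. The following is a minimal sketch, not part of HP Insight CMU: the helper name list_added_lines is hypothetical, and it relies only on the standard diff and sed utilities.

```shell
#!/bin/sh
# Sketch only: list_added_lines is a hypothetical helper, not a CMU command.
# Prints the lines that are present in AFTER but not in BEFORE.
# Usage: list_added_lines BEFORE AFTER
list_added_lines() {
    # diff marks lines that exist only in the second file with "> "
    diff "$1" "$2" | sed -n 's/^> //p'
}

# Example invocation on the management node (paths taken from the output above):
#   list_added_lines \
#       /opt/cmu/etc/ActionAndAlertsFile.txt_before_cmu_config_amd_config \
#       /opt/cmu/etc/ActionAndAlertsFile.txt
```

The helper only reads the two files, so it can be run before or after restarting HP Insight CMU.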
Running /opt/cmu/bin/cmu_config_amd adds a list of predefined GPU metrics to
ActionAndAlertsFile.txt. To monitor these metrics using the GUI, select the desired metrics
from the Monitoring sensors list as described in Figure 33 (page 69).
NOTE: Not all metrics are supported by all AMD GPUs, and some metrics may be commented
out within ActionAndAlertsFile.txt. To add metrics to or remove metrics from the Monitoring
sensors list, uncomment or comment out the associated lines inside
ActionAndAlertsFile.txt as described in “Action and alert files” (page 77).
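Commenting and uncommenting metric lines can also be scripted. The sketch below makes several assumptions that this section does not confirm: that a comment line begins with "#", that the metric name is the first whitespace-separated field on its line, and that a metric called gpu0_temp exists (the name is hypothetical). The helper functions are illustrative and are not part of HP Insight CMU. Check the file format in “Action and alert files” (page 77) and work on a copy of the file first.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not CMU commands.
# Assumes '#' starts a comment and the metric name is the first field on a line.

# comment_metric FILE NAME: prefix the line whose first field is NAME with '#'.
comment_metric() {
    awk -v m="$2" '$1 == m { print "#" $0; next } { print }' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# uncomment_metric FILE NAME: strip the leading '#' from the commented-out
# line whose first field (after the '#') is NAME. Other comments are kept.
uncomment_metric() {
    awk -v m="$2" '{
        line = $0
        sub(/^#[ \t]*/, "", line)      # candidate text without the marker
        split(line, f, /[ \t]+/)       # f[1] is the metric name, if any
        if (substr($0, 1, 1) == "#" && f[1] == m) print line
        else print $0
    }' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example (on a copy of the file; gpu0_temp is a hypothetical metric name):
#   cp /opt/cmu/etc/ActionAndAlertsFile.txt /tmp/aaf.txt
#   comment_metric /tmp/aaf.txt gpu0_temp
#   uncomment_metric /tmp/aaf.txt gpu0_temp
```

As with the configuration script, monitoring must be restarted before changes to ActionAndAlertsFile.txt take effect.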
NOTE: HP Insight CMU dynamically determines if a client has working GPUs when monitoring
is initially started after installation on the client. This monitoring process allows for configurations
that have clients with GPUs and clients without GPUs. If the GPUs are not working when monitoring
is started (or GPUs are added at a later date), redeploy monitoring to the client (see “Installing the
HP Insight CMU monitoring client” (page 66)) and restart monitoring to ensure the GPUs are
recognized.
86 Monitoring a cluster with HP Insight CMU