HP Insight Cluster Management Utility v7.1 User Guide

Running /opt/cmu/bin/cmu_config_nvidia adds a list of predefined GPU metrics to

ActionAndAlertsFile.txt. To monitor these metrics using the GUI, select the desired metrics

from the Monitoring sensors list as described in Figure 33 (page 69).

NOTE: Not all metrics are supported by all NVIDIA GPUs, and some less frequently used metrics may be

commented out within ActionAndAlertsFile.txt. To introduce/remove metrics from the

Monitoring sensors list, you can uncomment/comment out the associated lines inside

ActionAndAlertsFile.txt as described in “Action and alert files” (page 77).
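The comment/uncomment step in the note above can be scripted. The following is a minimal sketch only, not part of HP Insight CMU. It assumes that a commented-out entry begins with '#' and that the metric name is the first field on its line, followed by whitespace. The helper names comment_metric and uncomment_metric are hypothetical. Check the actual file layout in "Action and alert files" before using anything like this on a live file.

```shell
#!/bin/sh
# Sketch: toggle one metric entry in an action-and-alerts style file.
# ASSUMPTIONS (not confirmed by this guide): commented-out lines start
# with '#', and the metric name is the first field, followed by whitespace.

# comment_metric FILE NAME
# Saves FILE as FILE.bak, then prefixes '#' to lines that start with NAME.
comment_metric() {
    file=$1
    name=$2
    cp -p "$file" "$file.bak" &&
        sed "s/^${name}[[:space:]]/#&/" "$file.bak" > "$file"
}

# uncomment_metric FILE NAME
# Saves FILE as FILE.bak, then strips a leading '#' (and any spaces after
# it) from lines whose first field is NAME.
uncomment_metric() {
    file=$1
    name=$2
    cp -p "$file" "$file.bak" &&
        sed "s/^#[[:space:]]*\(${name}[[:space:]]\)/\1/" "$file.bak" > "$file"
}
```

For example, uncomment_metric /opt/cmu/etc/ActionAndAlertsFile.txt some_metric would enable an entry named some_metric (a placeholder name) and leave the previous version of the file next to it with a .bak suffix.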

NOTE: HP Insight CMU dynamically determines if a client has working GPUs when monitoring

is initially started after installation on the client. This monitoring process allows for configurations

that have clients with GPUs and clients without GPUs. If the GPUs are not working when monitoring

is started (or GPUs are added at a later date), redeploy monitoring to the client (see “Installing the

HP Insight CMU monitoring client” (page 66)) and restart monitoring to ensure the GPUs are

recognized.

5.5.7.2 Monitoring AMD GPUs

If your client nodes contain AMD GPUs and are running version 8.83.5 or newer of the AMD GPU

driver, you can monitor your GPUs with HP Insight CMU.

If you haven’t done so already, install the AMD GPU driver version 8.83.5 or newer on your client

nodes. This can be done in either of two ways:

1. Install the AMD GPU driver manually on one of the client nodes, back up the client image,

and clone the remaining clients with this new image.

2. Use the script /opt/cmu/contrib/cmu_install_amd to install the AMD GPU driver on

all running clients. For details, see the file

/opt/cmu/contrib/cmu_install_amd.README.
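Whichever installation method is used, the 8.83.5 minimum is a dotted version number, so a plain string comparison can give the wrong answer (for example, 8.9 is older than 8.83.5). The sketch below shows one way to compare such versions. It is an illustration only: version_at_least is a hypothetical helper, sort -V assumes GNU coreutils, and how the installed driver version is obtained is not covered here, so the version is passed in as an argument.

```shell
#!/bin/sh
# Sketch: succeed if dotted version $1 is at least $2.
# ASSUMPTION: GNU sort with -V (version sort) is available on the node.
version_at_least() {
    # The smaller of the two versions sorts first; the check passes when
    # the required version ($2) is that smaller (or equal) one.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use, with the installed version supplied by the caller:
#   if version_at_least "$installed" 8.83.5; then
#       echo "driver is new enough for GPU monitoring"
#   fi
```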

To enable GPU monitoring, the /opt/cmu/etc/ActionAndAlertsFile.txt file must be

updated with entries for HP Insight CMU GPU monitoring. This is done by running the script

/opt/cmu/bin/cmu_config_amd. This script takes the number of GPUs on each client as an argument.

The following example updates ActionAndAlertsFile.txt to monitor clients that have 2

GPUs each. Monitoring must be restarted for the updates to take effect.

# cmu_config_amd 2

You are about to update the CMU ActionsAndAlerts file with metrics for monitoring AMD GPUs.

Continue? [y/n] y

Configuring GPU monitoring in CMU...

GPU monitoring configured successfully.

Copy of orignial /opt/cmu/etc/ActionAndAlertsFile.txt can found in

/opt/cmu/etc/ActionAndAlertsFile.txt_before_cmu_config_amd_config

Please restart CMU ('/etc/init.d/cmu restart') to enable these changes.

# /etc/init.d/cmu restart
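The output above states that a copy of the original file is kept. Comparing that copy with the updated file is one way to see which entries the script added. The following is a sketch under that assumption. The two paths are taken from the example output, list_added_lines is a hypothetical helper, and the comparison is skipped if either file is missing.

```shell
#!/bin/sh
# Sketch: show lines that appear in the updated file but not in the copy
# that cmu_config_amd saved before making its changes.

# list_added_lines OLD NEW
# Prints the lines diff reports as added in NEW, without the "> " prefix.
list_added_lines() {
    diff "$1" "$2" | sed -n 's/^> //p'
}

# Paths as printed in the example output above.
ORIG=/opt/cmu/etc/ActionAndAlertsFile.txt_before_cmu_config_amd_config
CUR=/opt/cmu/etc/ActionAndAlertsFile.txt

# Only compare when both files are present on this machine.
if [ -f "$ORIG" ] && [ -f "$CUR" ]; then
    list_added_lines "$ORIG" "$CUR"
fi
```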

Running /opt/cmu/bin/cmu_config_amd adds a list of predefined GPU metrics to

ActionAndAlertsFile.txt. To monitor these metrics using the GUI, select the desired metrics

from the Monitoring sensors list as described in Figure 33 (page 69).

NOTE: Not all metrics are supported by all AMD GPUs and some metrics may be commented

out within ActionAndAlertsFile.txt. To introduce/remove metrics from the Monitoring

sensors list, you can uncomment/comment out the associated lines inside