Users Guide

GPU Shutdown Temperature
Maximum Memory Operating Temperature
Maximum GPU Operating Temperature
Thermal Alert State
Power Brake State
Power Metrics:
Power Supply Status
Board Power Supply Status
Telemetry: All GPU telemetry reports data
NOTE: GPU properties are not listed for Embedded GPU cards, and their Status is marked as Unknown.
The GPU must be in the ready state before the command fetches the data. The GPUStatus field in Inventory shows whether the
GPU is available and whether the GPU device is responding. If the GPU is ready, GPUStatus shows OK; otherwise, it
shows Unavailable.
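As a minimal sketch of how a script might act on the GPUStatus field, the following Python function keeps only the GPUs that report OK. The record layout (a list of dictionaries) and the "Name" key are assumptions made for illustration; only the GPUStatus field and its OK and Unavailable values come from the text above.

```python
from typing import Dict, List


def ready_gpus(inventory: List[Dict[str, str]]) -> List[str]:
    """Return the names of GPUs whose GPUStatus is OK.

    `inventory` is a hypothetical list of parsed inventory records.
    Any status other than "OK" (for example "Unavailable" or "Unknown")
    is treated as not ready, so its properties should not be queried.
    """
    return [gpu["Name"] for gpu in inventory if gpu.get("GPUStatus") == "OK"]
```

A caller could query properties only for the names this function returns and retry the remaining devices later.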
The GPU offers multiple health parameters that can be retrieved through the SMBPB interface of the NVIDIA controllers. This
feature is limited to NVIDIA cards. The following health parameters are retrieved from the GPU device:
Power
Temperature
Thermal
NOTE: This feature is limited to NVIDIA cards. This information is not available for any other GPU that the server may
support. The GPU cards are polled over the PBI every 5 seconds.
The host system must have the NVIDIA driver installed and running for the Power consumption, GPU target temperature,
Min GPU slowdown temperature, GPU shutdown temperature, Max memory operating temperature, and Max GPU operating
temperature features to be available. These values are shown as N/A if the GPU driver is not installed.
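A script that consumes these values needs to handle the N/A case explicitly. The following Python sketch converts a reported value to a number and returns None for N/A. The input format (a number optionally followed by a unit, such as "75 C") is an assumption for illustration and may differ from what a given interface returns.

```python
from typing import Optional


def parse_reading(raw: str) -> Optional[float]:
    """Convert a reported GPU value to a float, or None if unavailable.

    Assumes the value is either "N/A", empty, or a number that may be
    followed by a unit (hypothetical format, for example "75 C").
    """
    text = raw.strip()
    if text == "" or text.upper() == "N/A":
        # Driver not installed or not loaded: no reading is available.
        return None
    # Keep only the leading numeric token and drop any trailing unit.
    return float(text.split()[0])
```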
In Linux, when the card is unused, the driver down-trains the card and unloads to save power. In such cases, the
Power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, Max memory
operating temperature, and Max GPU operating temperature features are not available.
Enable persistence mode for the device to prevent the driver from unloading. You can use the nvidia-smi tool to enable it with the command
nvidia-smi -pm 1.
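If the nvidia-smi command above needs to be run from a provisioning script, a small wrapper such as the following Python sketch could be used. The function names are hypothetical. The wrapper assumes nvidia-smi is on the PATH and that the script runs with sufficient privileges on the host; the optional -i argument selects a single GPU by index.

```python
import shutil
import subprocess
from typing import List, Optional


def persistence_command(enable: bool = True,
                        gpu_index: Optional[int] = None) -> List[str]:
    """Build the nvidia-smi argument list for setting persistence mode."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        # Restrict the change to one GPU; omit to apply to all GPUs.
        cmd += ["-i", str(gpu_index)]
    cmd += ["-pm", "1" if enable else "0"]
    return cmd


def set_persistence(enable: bool = True,
                    gpu_index: Optional[int] = None) -> bool:
    """Run the command; return True only if nvidia-smi exits with status 0."""
    if shutil.which("nvidia-smi") is None:
        # NVIDIA driver tools are not installed on this host.
        return False
    result = subprocess.run(persistence_command(enable, gpu_index),
                            capture_output=True, text=True)
    return result.returncode == 0
```

Calling persistence_command() with no arguments produces the same invocation as the command shown above.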
You can generate GPU reports using Telemetry. For more information about the Telemetry feature, see page 215.
NOTE: In RACADM, you may see dummy GPU entries with empty values. This can happen if the device is not ready to respond
when iDRAC queries it for information. Perform an iDRAC racreset operation to resolve this issue.
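Until the reset is performed, a script that post-processes GPU entries may want to skip the dummy ones. The following Python sketch drops entries whose values are all empty. It assumes the entries have already been parsed into dictionaries of property names and string values, which is a hypothetical layout and not an actual RACADM output format.

```python
from typing import Dict, List


def is_dummy_entry(entry: Dict[str, str]) -> bool:
    """Return True when every value in the entry is empty or whitespace."""
    return all(not value.strip() for value in entry.values())


def drop_dummy_entries(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the entries that carry at least one non-empty value."""
    return [entry for entry in entries if not is_dummy_entry(entry)]
```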
FPGA Monitoring
Field-Programmable Gate Array (FPGA) devices need real-time temperature sensor monitoring because they generate significant heat
when in use. Perform the following steps to get FPGA inventory information:
1. Power off the server.
2. Install the FPGA device on the riser card.
3. Power on the server.
4. Wait until POST is complete.
5. Log in to the iDRAC GUI.
6. Navigate to System > Overview > Accelerators. You can see both GPU and FPGA sections.
7. Expand the specific FPGA component to see the following sensor information:
   Power consumption
   Temperature details
NOTE: You must have iDRAC Login privilege to access FPGA information.
NOTE: Power consumption sensors are available only for supported FPGA cards and only with the Datacenter
license.