Nov 29, 2022

NVIDIA GPU Monitoring with Elasticsearch and Metricbeat

Elastic Cloud paired with Elastic Observability can monitor NVIDIA GPUs and report utilization, temperature, and other host metrics to a dashboard. Elastic Cloud already integrates with Google Cloud and Amazon Web Services (AWS), and more recently with Microsoft Azure. In this article, we’ll walk through updated instructions for deploying Metricbeat with NVIDIA’s GPU monitoring tools, because the earlier instructions from Elastic rely on an older, deprecated NVIDIA code repository. If you followed those instructions and the dcgm-exporter command fails with the error shown below (Error getting device information: API version mismatch), this tutorial will guide you through deploying a version-matched set of GPU monitoring tools and dcgm-exporter on a bare metal system.

> dcgm-exporter --address localhost:9090
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: API version mismatch
FATA[0000] Error getting device information: API version mismatch

Already Followed the Elasticsearch Instructions?

If you already deployed using the instructions from the Elastic blog post and hit the API version mismatch error, jump to Build and Deploy Monitoring Tools below to apply the specific fix; the other installation steps will already be done.

Prerequisites

The only prerequisites are permission to install and deploy the tools on an Ubuntu host with an NVIDIA GPU, and a working Elastic Cloud instance. Another kind of Elastic deployment also works, but you will need to configure the Beat to authenticate with the appropriate endpoint. Instructions for RHEL or similar systems can be put together by following the DCGM installation instructions from NVIDIA.

DCGM/Elastic Installation

Install DCGM

Stop any running nv-hostengine instance

sudo nv-hostengine -t

Delete the outdated NVIDIA GPG key

sudo apt-key del 7fa2af80

Determine the distribution name and download the cuda-keyring package from the NVIDIA repository

distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
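To illustrate what the first command produces, the sketch below wraps the same os-release logic in a small helper. The name distro_slug is hypothetical and not part of any NVIDIA tooling. On Ubuntu 20.04 it prints ubuntu2004, which is the directory name used in the repository URL.

```shell
# Hypothetical helper: build the NVIDIA repository directory name
# (ID followed by VERSION_ID with the dots removed) from an os-release file.
# The body runs in a subshell so ID and VERSION_ID do not leak into the caller.
distro_slug() (
    os_release="${1:-/etc/os-release}"
    . "$os_release"
    printf '%s%s\n' "$ID" "$VERSION_ID" | sed -e 's/\.//g'
)

# Usage: distro_slug            (reads /etc/os-release)
```

Checking the value before running wget makes it easier to spot a distribution for which NVIDIA does not publish a repository directory.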

Install the current GPG keyring, update the package index, and install DCGM

sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager

Install Metricbeat

cd /tmp 
wget https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.5.2-amd64.deb 
sudo dpkg -i metricbeat-8.5.2-amd64.deb

Once installed, follow the Metricbeat instructions to configure authentication to your Elastic instance by editing /etc/metricbeat/metricbeat.yml.

If using Elastic Cloud, you’ll edit:

cloud.id
cloud.auth
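Before moving on, it can help to confirm that both settings are actually uncommented. The following is a minimal sketch, assuming the settings appear as top-level cloud.id and cloud.auth keys; check_cloud_settings is a hypothetical helper that reads the configuration from standard input.

```shell
# Hypothetical helper: report whether cloud.id and cloud.auth are set
# (present, uncommented, and non-empty) in a Metricbeat config on stdin.
# Returns non-zero if either setting is missing.
check_cloud_settings() (
    config_text=$(cat)
    status=0
    for key in id auth; do
        # The key must start the line and be followed by a value that is not a comment.
        if printf '%s\n' "$config_text" |
            grep -Eq "^cloud\.${key}:[[:space:]]*[^[:space:]#]"; then
            echo "cloud.${key}: set"
        else
            echo "cloud.${key}: missing or commented out"
            status=1
        fi
    done
    exit "$status"
)

# Usage: sudo cat /etc/metricbeat/metricbeat.yml | check_cloud_settings
```

Metricbeat also ships its own checks: sudo metricbeat test config validates the file, and sudo metricbeat test output attempts a connection to the configured endpoint.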

Enable monitoring

sudo metricbeat modules enable prometheus
sudo metricbeat setup

The setup command configures the remote instance and can take a while to run. It should print a confirmation message when complete (“Loaded dashboards”).
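The Prometheus module scrapes whichever endpoint is listed in its module file, so it is worth confirming that it points at the address dcgm-exporter will listen on later (localhost:9090). This is a minimal sketch, assuming the enabled module file is /etc/metricbeat/modules.d/prometheus.yml, where the Debian package typically places it; prometheus_targets is a hypothetical helper name.

```shell
# Hypothetical helper: print the uncommented "hosts:" lines from a
# Metricbeat module file supplied on stdin, with indentation stripped.
prometheus_targets() {
    grep -E '^[[:space:]]*hosts:' | sed -e 's/^[[:space:]]*//'
}

# Usage: sudo cat /etc/metricbeat/modules.d/prometheus.yml | prometheus_targets
```

If the printed host differs from localhost:9090, either edit the module file or pass the matching address to dcgm-exporter.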

Install Golang

Golang is needed to build the monitoring tools:

cd /tmp 
wget https://golang.org/dl/go1.15.7.linux-amd64.tar.gz
sudo mv go1.15.7.linux-amd64.tar.gz /usr/local/ 
cd /usr/local/ 
sudo tar -zxf go1.15.7.linux-amd64.tar.gz 
sudo rm go1.15.7.linux-amd64.tar.gz

Build and Deploy Monitoring Tools

First, check your nv-hostengine version with:

nv-hostengine --version

Then find the matching monitoring tools version among the repository tags at https://github.com/NVIDIA/dcgm-exporter/tags and substitute that tag for <tag_name> in the command below.

cd /tmp 
git clone --branch <tag_name> https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter/ 
sudo env "PATH=$PATH:/usr/local/go/bin" make install
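Choosing <tag_name> can be partly scripted. The sketch below rests on two assumptions that you should verify on your system: that the output of nv-hostengine --version contains the DCGM version as an X.Y.Z number, and that the dcgm-exporter tags begin with the DCGM version followed by a hyphen. The helper names dcgm_version and matching_tags are hypothetical.

```shell
# Hypothetical helper: extract the first X.Y.Z version number from stdin.
# Assumes the DCGM version is the first such number in the output.
dcgm_version() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Hypothetical helper: keep only tags (stdin, one per line) that start with
# the given DCGM version followed by a hyphen. Dots are escaped so that
# 3.0.4 does not match 3.0.40 or 3x0y4.
matching_tags() {
    grep -E "^$(printf '%s' "$1" | sed 's/\./\\./g')-"
}

# Usage:
#   version=$(nv-hostengine --version | dcgm_version)
#   git ls-remote --tags --refs https://github.com/NVIDIA/dcgm-exporter.git \
#       | sed 's|.*refs/tags/||' | matching_tags "$version"
```

If no tag matches exactly, fall back to browsing the tags page and choosing the closest release for your DCGM version.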

Export Metrics and Send via Metricbeat

In one terminal or screen, start dcgm-exporter:

dcgm-exporter --address localhost:9090

And in a second terminal or screen, start Metricbeat:

sudo metricbeat -e

Metrics should now be populating in Elastic Observability!
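If nothing appears, a quick check is whether the exporter is serving GPU metrics at all. This is a minimal sketch, assuming dcgm-exporter exposes Prometheus text format at /metrics on the address given above and that DCGM_FI_DEV_GPU_UTIL is among the collected counters; gpu_util_samples is a hypothetical helper name.

```shell
# Hypothetical helper: print GPU utilization samples from Prometheus text
# exposition format on stdin. Comment lines (# HELP, # TYPE) and other
# metric families are dropped.
gpu_util_samples() {
    grep -E '^DCGM_FI_DEV_GPU_UTIL(\{| )'
}

# Usage: curl -s http://localhost:9090/metrics | gpu_util_samples
```

One line per GPU means the exporter side is working, so the Metricbeat configuration and output are the next place to look.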

References

Original Elastic blog post: https://www.elastic.co/blog/how-to-monitor-nvidia-gpu-metrics-with-elastic-observability
Old NVIDIA repository: https://github.com/NVIDIA/gpu-monitoring-tools
New NVIDIA DCGM Exporter repository: https://github.com/NVIDIA/dcgm-exporter
NVIDIA Data Center GPU Manager (DCGM)/nv-hostengine installation instructions: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html#installation