NVIDIA GPU Monitoring with Elasticsearch and Metricbeat
Elastic Cloud paired with Elastic Observability can monitor NVIDIA GPUs and report utilization, temperature, and other host metrics to a dashboard. Elastic Cloud already integrated with Google Cloud and Amazon Web Services (AWS), and it has more recently added integration with Microsoft Azure. In this article, we’ll walk through updated instructions for deploying Metricbeat with NVIDIA’s GPU Monitoring Tools, because the prior instructions from Elastic are based on an older, deprecated NVIDIA code repository. If you tried deploying from those instructions and hit the error shown below when running dcgm-exporter (Error getting device information: API version mismatch), this tutorial will guide you through deploying a version-matched set of GPU Monitoring Tools and dcgm-exporter on a bare-metal system.
> dcgm-exporter --address localhost:9090
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: API version mismatch
FATA[0000] Error getting device information: API version mismatch
Already Followed the Elasticsearch Instructions?
If you already deployed with the instructions from the Elastic blog post and got the API version mismatch error, jump to the Build and Deploy Monitoring Tools section below to apply the specific fix, as the other install steps will already be done.
Prerequisites
The only prerequisite is to have appropriate permissions to install and deploy the tools on an Ubuntu host system with an NVIDIA GPU, plus a working Elastic Cloud instance. Another Elastic deployment will also work, but it requires configuring the beat to authenticate with the appropriate endpoint. Instructions for RHEL or similar systems can be put together by following the DCGM installation instructions from NVIDIA.
DCGM/Elastic Installation
Install DCGM
- Stop any running nv-hostengine instance left over from a prior installation
sudo nv-hostengine -t
- Delete any prior GPG keys
sudo apt-key del 7fa2af80
- Determine the distribution name and download the matching CUDA keyring package from the repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
- Install the current GPG key, update the repository, and install DCGM
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager
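The distribution string in the wget URL above is just the ID and VERSION_ID fields from /etc/os-release joined together with the dots removed. The sketch below repeats that transformation in a small helper so the result can be inspected before downloading anything. The function name repo_distribution and the example values are our own, for illustration only.

```shell
#!/usr/bin/env bash
# Build the repository directory name the same way the command above does:
# concatenate ID and VERSION_ID, then strip the dots.
# (repo_distribution is a hypothetical helper name, not part of any tool.)
repo_distribution() {
    local id="$1" version_id="$2"
    printf '%s%s\n' "$id" "$version_id" | sed -e 's/\.//g'
}

# Example with illustrative values for an Ubuntu 20.04 host.
repo_distribution ubuntu 20.04    # prints: ubuntu2004

# On a real host, read the values from /etc/os-release (skipped if absent).
if [ -r /etc/os-release ]; then
    ( . /etc/os-release; repo_distribution "$ID" "$VERSION_ID" )
fi
```

If the printed value does not correspond to a directory under the NVIDIA repository URL, the wget step will fail, so it is worth checking this first.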
Install Metricbeat
cd /tmp
wget https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.5.2-amd64.deb
sudo dpkg -i metricbeat-8.5.2-amd64.deb
Once installed, follow the Metricbeat instructions to configure authentication to your Elastic instance by editing /etc/metricbeat/metricbeat.yml.
If using Elastic Cloud, you’ll edit:
cloud.id
cloud.auth
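A quick way to confirm that both settings were actually uncommented and filled in is to scan the file for them. The following is a minimal sketch, assuming the settings are written as top-level keys at the start of a line; has_cloud_settings is a hypothetical helper name, and it only checks that values are present, not that they are valid.

```shell
#!/usr/bin/env bash
# Read a Metricbeat config on stdin and succeed only if both cloud.id and
# cloud.auth appear uncommented with a non-empty value.
# (has_cloud_settings is a hypothetical helper, not a Metricbeat command.)
has_cloud_settings() {
    local config
    config="$(cat)"
    printf '%s\n' "$config" | grep -Eq '^cloud\.id:[[:space:]]*[^[:space:]#]' &&
    printf '%s\n' "$config" | grep -Eq '^cloud\.auth:[[:space:]]*[^[:space:]#]'
}

# Typical use on the host (the file is usually readable only by root):
#   sudo cat /etc/metricbeat/metricbeat.yml | has_cloud_settings && echo "cloud settings present"
#
# Metricbeat can also validate the file and the connection itself:
#   sudo metricbeat test config
#   sudo metricbeat test output
```

The two metricbeat test subcommands are the more thorough check, since they parse the configuration and try to reach the configured endpoint.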
Enable monitoring
sudo metricbeat modules enable prometheus
sudo metricbeat setup
The setup command configures the remote instance and can take a while to run. It should print a confirmation message when complete (“Loaded dashboards”).
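To double-check that the module was enabled, you can look in the modules.d directory: enabling a module renames its file from <name>.yml.disabled to <name>.yml. The sketch below assumes the default Debian package layout under /etc/metricbeat; module_enabled is a hypothetical helper name. In the stock prometheus module file the scrape target is usually localhost:9090 with the /metrics path, which would match the --address used for dcgm-exporter, but treat that as an assumption and check your own file.

```shell
#!/usr/bin/env bash
# Report whether a Metricbeat module is enabled by looking for its config
# file: enabled modules are stored as <name>.yml, disabled ones as
# <name>.yml.disabled. (module_enabled is a hypothetical helper.)
module_enabled() {
    local modules_dir="$1" module="$2"
    [ -f "${modules_dir}/${module}.yml" ]
}

if module_enabled /etc/metricbeat/modules.d prometheus; then
    # Show the scrape target; dcgm-exporter must listen on this address.
    grep -E 'hosts|metrics_path' /etc/metricbeat/modules.d/prometheus.yml || true
else
    echo "prometheus module not enabled (or Metricbeat not installed here)"
fi
```

Running sudo metricbeat modules list gives the same information directly from Metricbeat.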
Install Golang
Golang is needed to build the monitoring tools:
cd /tmp
wget https://golang.org/dl/go1.15.7.linux-amd64.tar.gz
sudo mv go1.15.7.linux-amd64.tar.gz /usr/local/
cd /usr/local/
sudo tar -zxf go1.15.7.linux-amd64.tar.gz
sudo rm go1.15.7.linux-amd64.tar.gz
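The tarball unpacks into /usr/local/go, so the go binary ends up at /usr/local/go/bin/go. The build command later in this article passes that directory to make explicitly, so changing PATH is optional, but adding it makes it easy to confirm the toolchain works. This is a small sketch; path_with_dir is a hypothetical helper name that avoids appending the directory twice.

```shell
#!/usr/bin/env bash
# Append a directory to a PATH-style string only if it is not already there.
# (path_with_dir is a hypothetical helper, used here for illustration.)
path_with_dir() {
    local path="$1" dir="$2"
    case ":${path}:" in
        *":${dir}:"*) printf '%s\n' "$path" ;;
        *)            printf '%s\n' "${path}:${dir}" ;;
    esac
}

PATH="$(path_with_dir "$PATH" /usr/local/go/bin)"
export PATH

# Confirm the toolchain is reachable from this shell.
if command -v go >/dev/null 2>&1; then
    go version
else
    echo "go not found; check that /usr/local/go/bin/go exists"
fi
```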
Build and Deploy Monitoring Tools
First, check your nv-hostengine version with:
nv-hostengine --version
Then find the monitoring tools version that matches it in the repository tags at https://github.com/NVIDIA/dcgm-exporter/tags, and substitute that tag for <tag_name> in the command below:
cd /tmp
git clone --branch <tag_name> https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter/
sudo env "PATH=$PATH:/usr/local/go/bin" make install
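Picking the tag by eye works, but the list is long. The sketch below filters a list of tag names down to those that match a given DCGM version. It assumes the tags follow a <DCGM version>-<exporter version> pattern, which is how the repository's tags appear to be named; verify this against the tags page before relying on it. matching_tags is a hypothetical helper name and the tag names in the example are illustrative.

```shell
#!/usr/bin/env bash
# Read tag names on stdin and print those whose DCGM part (the text before
# the first "-") equals the given DCGM version. The values are compared as
# strings so that, for example, 3.1 and 3.10 are not treated as equal.
# (matching_tags is a hypothetical helper; the tag format is an assumption.)
matching_tags() {
    local dcgm_version="$1"
    awk -F- -v v="$dcgm_version" '($1 "") == (v "")'
}

# Example with illustrative tag names:
printf '%s\n' 2.3.4-2.6.4 2.4.6-2.6.10 3.1.7-3.1.4 | matching_tags 2.4.6
# prints: 2.4.6-2.6.10

# Against the real repository (needs network access), with the version
# reported by nv-hostengine --version in place of <dcgm_version>:
#   git ls-remote --tags --refs https://github.com/NVIDIA/dcgm-exporter.git \
#     | sed 's#.*refs/tags/##' | matching_tags <dcgm_version>
```

If no tag matches exactly, the closest tag with the same major and minor DCGM version is a reasonable starting point, though that is a guess to test rather than a guarantee.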
Export Metrics and Send via Metricbeat
In one terminal or screen, start dcgm-exporter:
dcgm-exporter --address localhost:9090
And in a second terminal or screen, start Metricbeat:
sudo metricbeat -e
Metrics should now be populating in Elastic Observability!
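If nothing shows up, it helps to check the exporter endpoint directly before looking at Metricbeat. dcgm-exporter serves Prometheus text, and its GPU metrics use names that start with DCGM_FI_ (for example DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP). The sketch below counts such samples; count_dcgm_samples is a hypothetical helper name, and the /metrics path is assumed to be the exporter's default.

```shell
#!/usr/bin/env bash
# Count DCGM samples in Prometheus text read from stdin. Comment lines
# (# HELP, # TYPE) start with "#" and are therefore not counted.
# (count_dcgm_samples is a hypothetical helper, used for illustration.)
count_dcgm_samples() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c '^DCGM_FI_' || true
}

# On the GPU host, with dcgm-exporter running in the other terminal:
#   curl -s localhost:9090/metrics | count_dcgm_samples
```

A count of 0 points at the exporter (not running, wrong address, or no supported GPUs), while a non-zero count means the problem is more likely in the Metricbeat configuration.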
References
- Original Elastic blog post: https://www.elastic.co/blog/how-to-monitor-nvidia-gpu-metrics-with-elastic-observability
- Old NVIDIA repository: https://github.com/NVIDIA/gpu-monitoring-tools
- New NVIDIA DCGM Exporter repository: https://github.com/NVIDIA/dcgm-exporter
- NVIDIA Data Center GPU Manager (DCGM)/nv-hostengine installation instructions: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html#installation