How to monitor NVIDIA GPU metrics with Elastic Observability

Graphical processing units, or GPUs, aren’t just for PC gaming. Today, GPUs are used to train neural networks, simulate computational fluid dynamics, mine Bitcoin, and process workloads in data centers. And they are at the heart of most high-performance computing systems, making the monitoring of GPU performance in today's data centers just as important as monitoring CPU performance. With that in mind, let's take a look at how to use Elastic Observability together with NVIDIA’s GPU monitoring tools to observe and optimize GPU performance. DependenciesTo get NVIDIA GPU metrics up and running, we will need to build NVIDIA GPU monitoring tools from source code (Go). And, yes, we’ll need an NVIDIA GPU. AMD and other GPU types use different Linux drivers and monitoring tools, so we’ll have to cover them in a separate post. NVIDIA GPUs are available from many cloud providers like Google Cloud and Amazon Web Services (AWS). For this post, we are using an instance running on Genesis Cloud.

Let’s start by installing the NVIDIA Datacenter Manager per the installation section of NVIDIA’s DCGM Getting Started Guide for Ubuntu 18.04. Note: While following the guide, pay special attention replacing the parameter with our own. We can find our architecture using the uname command. uname -a

https://www.elastic.co/blog/how-to-monitor-nvidia-gpu-metrics-with-elastic-observability

Created 4y | Feb 23, 2021, 4:20:32 PM