We used to use Zenoss as our health and alerting monitor.
Because a research project needed quick insight into resource consumption on compute nodes, we first quickly installed Ganglia. It is no longer developed, but it is a great tool. You can quickly download CentOS 8 packages and grab CentOS 7 packages. For the latter you need to change the yum repo URLs to the following (and comment out the mirrorlist URLs, otherwise yum keeps using them instead of the baseurl):
baseurl=http://vault.centos.org/centos/$releasever/os/$basearch/
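If you have several nodes to fix, the repo edit can be scripted. What follows is only a minimal sketch, not how we did it: the function name `point_repo_at_vault` and the section name `base` are made up for illustration, and it assumes the mirrorlist lines are meant to be disabled so that yum falls back to the baseurl. It touches a single section only, because the vault URL above is the one for the `os` tree.

```python
import re

# The vault URL from the text above; it points at the "os" tree only.
VAULT_BASEURL = "http://vault.centos.org/centos/$releasever/os/$basearch/"


def point_repo_at_vault(repo_text: str, section: str = "base") -> str:
    """Return repo_text with one section switched from mirrorlist to the vault.

    Assumption: 'section' (default "base") is the repo section that serves
    the os tree. Other sections are left untouched, since they would need
    different vault paths. If the section has no baseurl line at all,
    none is added.
    """
    out = []
    current = None  # name of the [section] we are currently inside
    for line in repo_text.splitlines():
        stripped = line.strip()
        header = re.fullmatch(r"\[(.+)\]", stripped)
        if header:
            current = header.group(1)
            out.append(line)
            continue
        if current == section:
            if stripped.startswith("mirrorlist="):
                # Disable the mirrorlist so yum uses baseurl instead.
                out.append("#" + line)
                continue
            if re.match(r"#?\s*baseurl=", stripped):
                # Replace an active or commented-out baseurl with the vault URL.
                out.append("baseurl=" + VAULT_BASEURL)
                continue
        out.append(line)
    return "\n".join(out) + "\n"
```

You would read a repo file from `/etc/yum.repos.d/`, pass its text through the function, and review the result before writing it back.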
The only change I made, besides the obviously needed ones, was specifying that the gmond agent reports in every 60 seconds (send_metadata_interval = 60). I love abstract graphs like this; you can see that all is humming along in one view. You can also obtain GPU metrics using the templates found here
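To confirm that a node really has the 60 second gmond setting mentioned above, you could check its config text with something like the sketch below. The helper name `get_send_metadata_interval` is hypothetical, and the parsing is deliberately simple: it assumes the setting sits on its own line in gmond.conf and only strips `/* ... */` comments.

```python
import re


def get_send_metadata_interval(conf_text: str):
    """Return send_metadata_interval from gmond.conf text as an int.

    Returns None if the setting is not present. Assumption: the setting
    is written on its own line as 'send_metadata_interval = <seconds>'.
    """
    # Drop C-style comments such as /* secs */ before searching.
    no_comments = re.sub(r"/\*.*?\*/", "", conf_text, flags=re.S)
    match = re.search(
        r"^\s*send_metadata_interval\s*=\s*(\d+)", no_comments, flags=re.M
    )
    return int(match.group(1)) if match else None
```

Run it over the contents of each node's gmond.conf and flag any node where the result is not 60.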
Here is what it looks like (either select Grid > Wesleyan HPC > Server, or after selecting Wesleyan HPC scroll down the page to view all nodes and pick a metric).
But Ganglia does not provide alerting, so we added Zabbix.
We set up agent monitoring using Zabbix Agent (on both CentOS 7 and 8, CentOS or Rocky) and added the GPU templates from these links. The XML loads as a Template on the zabbix_server; the other files go on the compute nodes. Of course, you first install the Zabbix server, then the Zabbix agent on the compute nodes. It is all fairly well documented on the Zabbix web site.
And that looks like this
* http://hpcmon.wesleyan.edu/zabbix/
Log in as guest. Then you can go to “Global View” or any of the queue-based dashboards for CPU-only or CPU + GPU compute nodes. Pretty flexible. You can change the date/time interval of the dashboards in the top right.