\\ **[[cluster:0|Back]]**

===== HPC Monitoring =====

We used to use Zenoss as our health and alerting monitor ([[cluster:183|Zenoss]]). Because a research project needed quick insight into resource consumption on the compute nodes, we first quickly installed Ganglia. It is no longer actively developed, but it is still a great tool. Packages can be grabbed for both CentOS 8 and CentOS 7. For the latter you need to point the yum repos at the vault: comment out the ''mirrorlist'' lines and uncomment the ''baseurl'' lines, changing them to

  baseurl=http://vault.centos.org/centos/$releasever/os/$basearch/

(a full repo stanza is sketched at the bottom of this page).

The only change I made beyond the required ones was having the agent ''gmond'' report in every 60 seconds (''send_metadata_interval = 60''; see the gmond.conf sketch at the bottom of this page).

I love abstract graphs like this: in one view you know all is humming along. You can also obtain GPU metrics with the templates found here:

  * https://developer.nvidia.com/ganglia-monitoring-system

Here is what it looks like (either select Grid > Wesleyan HPC > Server, or after selecting Wesleyan HPC scroll down the page to view all nodes and pick a metric):

  * http://sharptail2.wesleyan.edu/ganglia/

{{:cluster:screenshot_2024-10-15_090857.png?400|}}{{:cluster:screenshot_2024-10-23_081233.png?400|}}

But Ganglia does not provide alerting, so we added **Zabbix**. We set up agent monitoring using the Zabbix agent (on both CentOS 7 and CentOS 8, whether CentOS or Rocky) and added the GPU templates from the links below. The XML loads as a template on the zabbix_server; the other files go on the compute nodes. Of course you first install the Zabbix server, then the Zabbix agent on the compute nodes (a minimal agent setup is sketched at the bottom of this page). All of this is fairly well documented on the Zabbix web site.

  * set up data collection with the Zabbix agent, then set up monitoring with the Zabbix agent
  * enable discovery on both with 192.168.102.1-254
  * https://github.com/plambe/zabbix-nvidia-smi-multi-gpu/blob/master/zbx_nvidia-smi-multi-gpu.xml
  * https://github.com/plambe/zabbix-nvidia-smi-multi-gpu

And that looks like this:

  * http://hpcmon.wesleyan.edu/zabbix/

Log in as guest. Then you can go to "Global View" or any of the queue-based dashboards for CPU-only or CPU + GPU compute nodes. Pretty flexible. You can change the date/time interval of the dashboards in the top right.
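==== Config sketches ====

For reference, a minimal sketch of a CentOS 7 repo stanza pointed at the vault; the GPG key path shown is the stock CentOS 7 one, adjust to your setup:

<code>
# /etc/yum.repos.d/CentOS-Base.repo (sketch, CentOS 7 after end-of-life)
[base]
name=CentOS-$releasever - Base
# the mirrorlist service is gone, so comment it out
#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os&infra=$infra
baseurl=http://vault.centos.org/centos/$releasever/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
</code>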
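The ''gmond'' reporting change lives in the ''globals'' section of gmond.conf; a sketch of just that setting (the stock file carries many more options):

<code>
/* /etc/ganglia/gmond.conf -- globals section only (sketch) */
globals {
  daemonize = yes
  /* resend metric metadata every 60 seconds so the collector stays current */
  send_metadata_interval = 60
}
</code>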
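On the Zabbix side, a minimal sketch of the compute node agent setup, assuming the Zabbix yum repo is already installed; the server IP and hostname below are placeholders:

<code>
# on each compute node
yum install -y zabbix-agent

# /etc/zabbix/zabbix_agentd.conf -- the key settings
#   Server=192.168.102.10        passive checks: who may poll this agent (placeholder IP)
#   ServerActive=192.168.102.10  active checks: where the agent sends data (placeholder IP)
#   Hostname=n33                 must match the host name registered in Zabbix (placeholder)

systemctl enable --now zabbix-agent
</code>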
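The plambe templates ship their own script and item keys; purely to illustrate the ''UserParameter'' mechanism the agent uses to expose ''nvidia-smi'' data, here is a hypothetical sketch (these keys are made up, not the repo's):

<code>
# /etc/zabbix/zabbix_agentd.d/nvidia.conf (hypothetical sketch)
# number of GPUs in the node (nvidia-smi prints one line per GPU, so keep the first)
UserParameter=gpu.count,nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1
# per-GPU utilization in percent, polled as e.g. gpu.utilization[0]
UserParameter=gpu.utilization[*],nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i $1
# per-GPU memory used in MiB, polled as e.g. gpu.memused[0]
UserParameter=gpu.memused[*],nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i $1
</code>

Restart the agent after dropping such a file in place (''systemctl restart zabbix-agent'') so the new keys become pollable.

\\ **[[cluster:0|Back]]**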