</code>
  
The only change I made beyond the obvious required ones was specifying that the agent ''gmond'' reports in every 60 seconds (''send_metadata_interval = 60''). I love abstract graphs like this: you know all is humming along in one view. You can also obtain GPU metrics with the templates found here:
  
  * https://developer.nvidia.com/ganglia-monitoring-system
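
For reference, the reporting interval lives in the ''globals'' section of ''gmond.conf''. A minimal sketch (the path and surrounding settings vary by distribution):

<code>
/* /etc/ganglia/gmond.conf -- only the setting discussed above */
globals {
  /* gmond re-announces its metric metadata every 60 seconds */
  send_metadata_interval = 60
}
</code>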
  
Here is what it looks like (either select Grid > Wesleyan HPC > Server, or after selecting Wesleyan HPC scroll down the page to view all nodes and pick a metric).

  * http://sharptail2.wesleyan.edu/ganglia/

{{:cluster:screenshot_2024-10-15_090857.png?400|}}{{:cluster:screenshot_2024-10-23_081233.png?400|}}

But Ganglia does not provide alerting, so we added **Zabbix**.

We set up agent monitoring using the Zabbix agent (both CentOS 7 and 8, CentOS or Rocky) and added the GPU templates from the links below. The XML file loads as a Template on the ''zabbix_server''; the other files go on the compute nodes. Of course you first install the Zabbix server, then the Zabbix agent on the compute nodes. All of this is fairly well documented on the Zabbix web site.

  * set up data collection with the Zabbix agent, set up monitoring with the Zabbix agent
  * enable discovery on both with 192.168.102.1-254
  * https://github.com/plambe/zabbix-nvidia-smi-multi-gpu/blob/master/zbx_nvidia-smi-multi-gpu.xml
  * https://github.com/plambe/zabbix-nvidia-smi-multi-gpu
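
On the compute-node side, the non-XML pieces from a repo like this boil down to ''UserParameter'' lines in the agent configuration that shell out to ''nvidia-smi''. An illustrative sketch (the key names here are examples, not necessarily the exact keys the template expects):

<code>
# /etc/zabbix/zabbix_agentd.d/nvidia.conf -- illustrative key names
UserParameter=gpu.count,nvidia-smi -L | wc -l
UserParameter=gpu.util[*],nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i $1
</code>

After restarting the agent you can spot-check a key from the Zabbix server with ''zabbix_get -s <node-ip> -k gpu.count''.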

And that looks like this:

  * http://hpcmon.wesleyan.edu/zabbix/

Log in as guest. Then you can go to "Global View" or any of the queue-based dashboards for CPU-only or CPU+GPU compute nodes. Pretty flexible. You can change the date/time interval of the dashboards in the top right.
  
  
cluster/227.1729020614.txt.gz · Last modified: 2024/10/15 19:30 by hmeij07