This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
cluster:227 [2024/10/15 19:14] hmeij07 created |
cluster:227 [2024/10/23 12:16] (current) hmeij07 |
||
---|---|---|---|
Line 4: | Line 4: | ||
===== HPC Monitoring ===== | ===== HPC Monitoring ===== | ||
- | We used to use ZenOSS | + | We used to use Zenoss |
+ | |||
+ | Because of a research project needing quick insight into resource consumption on compute nodes, we first quickly installed Ganglia. | ||
+ | |||
+ | < | ||
+ | | ||
+ | </ | ||
+ | |||
+ | The only change I made beyond the obviously needed ones was specifying that the agent '' | ||
+ | |||
+ | * https:// | ||
+ | |||
+ | Here is what it looks like (either select Grid > Wesleyan HPC > Server, or after selecting Wesleyan HPC scroll down the page to view all nodes and pick a metric). | ||
+ | |||
+ | * http:// | ||
+ | |||
+ | {{: | ||
+ | |||
+ | Ganglia does not provide alerting, so we added **Zabbix**. | ||
+ | |||
+ | We set up agent monitoring using Zabbix Agent (both CentOS 7 and 8, CentOS or Rocky) and added the GPU templates from these links. The XML loads as Template on the zabbix_server, | ||
+ | |||
+ | * set up data collection with Zabbix agent, set up monitoring with Zabbix agent | ||
+ | * enable discovery on both with 192.168.102.1-254 | ||
+ | * https:// | ||
+ | * https:// | ||
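
Before enabling discovery on a range such as 192.168.102.1-254, it can help to see which nodes already have a reachable agent. The following is a minimal sketch, not part of the original setup: the helper names `expand_range`, `agent_reachable`, and `scan` are hypothetical, and it assumes the agents listen on the Zabbix agent default port 10050.

```python
import ipaddress
import socket

# Assumption: agents listen on the Zabbix agent default port.
ZABBIX_AGENT_PORT = 10050


def expand_range(spec):
    """Expand a range like '192.168.102.1-254' into a list of IPv4 strings."""
    if "-" not in spec:
        raise ValueError("expected a range of the form a.b.c.X-Y")
    base, _, last = spec.rpartition("-")
    start = ipaddress.IPv4Address(base)
    prefix = base.rsplit(".", 1)[0]
    end = ipaddress.IPv4Address(f"{prefix}.{last}")
    if end < start:
        raise ValueError("range end is below range start")
    return [str(ipaddress.IPv4Address(n)) for n in range(int(start), int(end) + 1)]


def agent_reachable(host, port=ZABBIX_AGENT_PORT, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan(spec):
    """Return the hosts in the range that accept connections on the agent port."""
    return [host for host in expand_range(spec) if agent_reachable(host)]
```

Calling `scan("192.168.102.1-254")` from a host on the private network returns the addresses that answered. A node missing from that list would also be missed by discovery until its agent or firewall is fixed.
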
+ | |||
+ | And that looks like this: | ||
+ | |||
+ | * http:// | ||
+ | |||
+ | Log in as guest. Then you can go to " | ||