cluster:189 [DokuWiki]


Differences

This shows you the differences between two versions of the page.

cluster:189 [2020/02/27 13:10]
hmeij07 [Funding Policy]
cluster:189 [2020/02/27 13:12]
hmeij07 [Funding Policy]
Line 39:
 A gpu hour of usage is 3x the cpu hourly rate.\\
  
-We currently have about 1,450 physical cpu cores (all Xeon), 72 gpus (20x K20, 4x GTX2018Ti, 48x RTX2080S), 520 gb of gpu memory and 8,560 gb of cpu memory. Provided by about 120 compute nodes and login nodes. Scratch spaces are provided local to compute nodes (2-5 tb) or over the network via NFS (55 tb), consult [[cluster:142|Scratch Spaces]]. Home directories are under quota (10 tb) but these will disappear in the future with the TrueNAS/ZFS appliance (190 tb, 475 tb effective assuming a compression rate of 2.5x, consult [[cluster:186|Home Dir Server]] deploy in 2020). A HPCC guide can be found here [[cluster:126|Brief Guide to HPCC]] and the (endless!) software list is located here [[cluster:73|Software Page]].  We run CentOS 6.10 or 7.6 flavors of OS.
+We currently have about 1,450 physical cpu cores (all Xeon), 72 gpus (20x K20, 4x GTX2018Ti, 48x RTX2080S), 520 gb of gpu memory and 8,560 gb of cpu memory. Provided by about 120 compute nodes and login nodes. Scratch spaces are provided local to compute nodes (2-5 tb) or over the network via NFS (55 tb), consult [[cluster:142|Scratch Spaces]]. Home directories are under quota (10 tb) but these will disappear in the future with the TrueNAS/ZFS appliance (190 tb, 475 tb effective assuming a compression rate of 2.5x, consult [[cluster:186|Home Dir Server]] deploy in 2020). A HPCC guide can be found here [[cluster:126|Brief Guide to HPCC]] and the (endless!) software list is located here [[cluster:73|Software Page]].  We run CentOS 6.10 or 7.[6|7] flavors of OS.
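The effective capacity quoted for the TrueNAS/ZFS appliance follows from the raw capacity and the assumed compression rate. A minimal sketch of that arithmetic; the function name is hypothetical, and uniform compression across all data is an assumption:

```python
def effective_capacity_tb(raw_tb: float, compression_ratio: float) -> float:
    """Effective capacity in tb, assuming all data compresses at the same ratio."""
    return raw_tb * compression_ratio


# Figures from the text: 190 tb raw, 2.5x compression.
print(effective_capacity_tb(190, 2.5))  # 475.0
```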
  
  
Line 58:
 The second principle grants priority access to certain resource(s) for a limited time to a limited group. The same PI/users relationship will be used as is used in the CPU/GPU Usage Contribution scheme. Priority access specifically means: If during the priority period the priority members' jobs go into pending mode for more than 24 hours the hpcadmin will clear compute nodes of running jobs and force those pending jobs to run. This by now is an automated process via cron that checks every 2 hours. Steps involved are: find priority members' jobs pending for more than 24 hours, find a node with no priority members' jobs running in that queue, close target node, requeue all jobs on that node, force pending job(s) to run, wait 5 mins, reopen node.
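The steps listed above can be sketched as a planning function. This is an illustration only, not the hpcadmin's actual cron script: the `Job` record, `plan_preemption`, and the action names are all hypothetical, and no real scheduler commands are shown. It only decides which node to clear and in what order the actions would happen.

```python
from dataclasses import dataclass
from typing import Optional

PENDING_LIMIT_HOURS = 24  # threshold from the policy text
REOPEN_WAIT_MINUTES = 5   # wait before reopening the node, from the policy text


@dataclass
class Job:
    job_id: int
    user: str
    queue: str
    node: Optional[str]        # None while the job is pending
    pending_hours: float = 0.0


def plan_preemption(jobs, priority_users):
    """Return an ordered list of (action, target) steps, or [] if nothing to do."""
    # 1. Find priority members' jobs pending for more than 24 hours.
    stuck = [j for j in jobs
             if j.node is None
             and j.user in priority_users
             and j.pending_hours > PENDING_LIMIT_HOURS]
    if not stuck:
        return []
    queue = stuck[0].queue  # handle one queue per pass (an assumption)

    # Group the running jobs of that queue by node.
    by_node = {}
    for j in jobs:
        if j.node is not None and j.queue == queue:
            by_node.setdefault(j.node, []).append(j)

    # 2. Find a node with no priority members' jobs running in that queue.
    for node, running in sorted(by_node.items()):
        if all(j.user not in priority_users for j in running):
            steps = [("close_node", node)]
            steps += [("requeue_job", j.job_id) for j in running]
            steps += [("force_run", j.job_id) for j in stuck if j.queue == queue]
            steps += [("wait_minutes", REOPEN_WAIT_MINUTES), ("reopen_node", node)]
            return steps
    return []  # every busy node already runs a priority member's job
```

Any job on the chosen node may be requeued, which is why the page asks all users to checkpoint.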
  
-All users should be aware this may happen so please checkpoint your jobs with a checkpoint interval of 24 hours. Please consult  [[cluster:147|BLCR Checkpoint in OL3]] (serial jobs) and [[cluster:148|BLCR Checkpoint in OL3]] (parallel jobs).
+All users should be aware this may happen so please checkpoint your jobs with a checkpoint interval of 24 hours. Please consult  [[cluster:190|DMTCP]].
  
 ==== General ====
cluster/189.txt · Last modified: 2024/02/12 11:47 by hmeij07