===== tinymem =====

Note: the tinymem queue is now only n46-n59 and the hp12 queue is back as before.

 --- //hmeij 2015/12/04//

Since the hp12 nodes also have a small memory footprint, we can merge them into the ''tinymem'' queue.
  * #BSUB -R "..."
You need to request a consumable for each job slot, so if using say ''#BSUB -n 4'' the consumable must be requested for each of the 4 job slots (see the sketch below). Doing nothing, that is not using or requesting the consumable, ...
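A minimal sketch of what such a request can look like, assuming the consumable is memory reserved via ''rusage'' (the actual resource name and values depend on the scheduler configuration):

<code>
# hypothetical sketch: 4 job slots, each requesting the consumable
# (assumed here to be memory, X in MB per slot)
#BSUB -n 4
#BSUB -R "rusage[mem=X]"
</code>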
Today queues ''hp12'' and ''tinymem'' ...
 --- //hmeij//

===== Monitor =====

Monitor your jobs and assess if they are performing as you expect. To find out your memory footprint from one of the "tinymem" nodes:

  * ssh node_name top -u user_name -b -n 1

**Use top, remotely, in batch.** Look at the VIRT column for the memory usage the OS thinks you need; if that exceeds the node's capacity, move to another queue. For example (not picking on anybody!):

<code>

# this is an appropriate use of tinymem, barely 1 MB per a.out

[hmeij@greentail ~]$ ssh n55 top -u dkweiss -b -n 1
Warning: Permanently added 'n55' ...
top - 15:42:03 up 4 days, 1:19, 0 users, ...
Tasks: 829 total, ...
Cpu(s): ...
Mem: 32767008k total, ...
Swap: 0k total, ...

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11594 dkweiss   ...
11597 dkweiss   ...
11638 dkweiss   ...
11639 dkweiss   ...
...

# and here is a problem, at 10.8 GB per java process
# the new hardware can only handle 3 of these jobs (32 GB max) and
# the old hardware can only run 1 of these (12 GB max)
# so you need to either reserve memory in advance (see below) or
# (better) move to another queue like mw256 (and also reserve memory)

[hmeij@greentail ~]$ ssh n58 top -u cjustice -b -n 1
15:41:15 up 4 days, 1:19, 0 users, ...
Tasks: 697 total, ...
Cpu(s): ...
Mem: 32767004k total, ...
Swap: 0k total, ...

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12733 cjustice  ...
12722 cjustice  ...
12729 cjustice  ...
12732 cjustice  ...

</code>

Also count your processes: does the count match the ''-n'' value? (Count the main processes; don't worry about the startup shells, etc.)
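A quick way to count them remotely (a sketch; ''n55'', ''dkweiss'' and ''a.out'' are just the names from the example above):

<code>
# count the main compute processes on the node; the result should match #BSUB -n
# substitute your own node, user name, and process name
ssh n55 "ps -u dkweiss -o comm= | grep -c a.out"
</code>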


**lsload node_name**

The r15s/r1m/r15m load values should not exceed the JL/H value (the job slot limit per host, found via ''bqueues''); if they do, something is wrong.

<code>

[hmeij@greentail ~]$ lsload n58
HOST_NAME  status  r15s  r1m  r15m  ut  pg  ls  it  tmp  swp  mem
n58           -ok  ...

</code>
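To look up the JL/H value of a queue, use ''bqueues'' (a sketch; ''tinymem'' is just the example queue):

<code>
# the JL/H column is the job slot limit per host
bqueues tinymem

# the long format spells out all the queue limits
bqueues -l tinymem
</code>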


**Reserve your memory in advance of running the job**

<code>
#BSUB -R "rusage[mem=X]"
</code>

where X is in MB, meaning: run my job only if that much memory is available.
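Putting it together, a minimal submit script sketch (the queue, job name, and the 1024 MB value are placeholders to adjust for your job):

<code>
#!/bin/bash
#BSUB -q tinymem
#BSUB -n 1
#BSUB -J myjob
#BSUB -o myjob.%J.out
#BSUB -e myjob.%J.err
# only dispatch the job if 1024 MB of memory is available on the node
#BSUB -R "rusage[mem=1024]"

./a.out
</code>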

\\