===== tinymem =====

Note: the tinymem queue is now only n46-n59 and the hp12 queue is back as before.
 --- //hmeij 2015/12/04//

Since the hp12 nodes also have a small memory footprint we can merge this into the ''tinymem'' queue.
  * #BSUB -R "

You need to request a consumable for each job slot, so if using say ''#BSUB -n 4'' the '
Doing nothing, that is not using or requesting 

Today queues ''
 --- //hmeij//
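
A rough sketch of what a serial submit script for this queue could look like (the job name, output files, program, and memory value are made up for illustration; memory reservation is explained in the Monitor section below):

<code>
#!/bin/bash
# illustrative tinymem submit script -- all values are examples only
#BSUB -q tinymem
#BSUB -n 1
#BSUB -J myjob
#BSUB -o myjob.%J.out
#BSUB -e myjob.%J.err
# ask the scheduler to reserve memory (in MB) for the job,
# see the Monitor section below
#BSUB -R "rusage[mem=1024]"

./a.out
</code>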

===== Monitor =====

Monitor your jobs and assess if they are performing as you expect. To find out your memory footprint, run top remotely from one of the "head nodes":

  * ssh node_name top -u user_name -b -n 1

**Use top, remotely, in batch.** Look at the VIRT column for the memory usage the OS thinks you need; if that exceeds the node capacity, go to other queues. For example (not picking on anybody!):

<code>

# this is an appropriate use of tinymem, barely 1 MB per a.out

[hmeij@greentail ~]$ ssh n55 top -u dkweiss -b -n 1
Warning: Permanently added '
top - 15:42:03 up 4 days,  1:19,  0 users,
Tasks: 829 total,
Cpu(s):
Mem:  32767008k total,
Swap:        0k total,

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11594 dkweiss
11597 dkweiss
11638 dkweiss
11639 dkweiss
...

# and here is a problem, at 10.8 GB per java process
# the new hardware can only handle 3 of these jobs (32 GB max) and
# the old hardware can only run 1 of these (12 GB max)
# so you need to either reserve memory in advance (see below) or
# (better) move to another queue like mw256 (and also reserve memory)

[hmeij@greentail ~]$ ssh n58 top -u cjustice -b -n 1
top - 15:41:15 up 4 days,  1:19,  0 users,
Tasks: 697 total,
Cpu(s):
Mem:  32767004k total,
Swap:        0k total,

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12733 cjustice
12722 cjustice
12729 cjustice
12732 cjustice

</code>

Also count your processes: does the count match the ''-n'' value? (Count the main processes; don't worry about the startup shells, etc.)
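
For example, a quick way to count them remotely (the node, user, and program name below are just the ones from the top example above):

<code>
# count the a.out processes this user has running on the node
[hmeij@greentail ~]$ ssh n55 ps -u dkweiss -o comm= | grep -c a.out
</code>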

**lsload node_name**

The r??m load values should not exceed the JL/H value (jobs per host) found via ''bqueues''; if they do, there is something wrong.

<code>

[hmeij@greentail ~]$ lsload n58
HOST_NAME
n58      -ok

</code>
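
To look up the JL/H value for a queue, something like the following should work; ''tinymem'' is used here only as an example queue name, and JL/H is one of the columns in the default ''bqueues'' output:

<code>
# the JL/H column of the output is the per-host job slot limit for the queue
[hmeij@greentail ~]$ bqueues tinymem
</code>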

**Reserve your memory in advance of running the job**

<code>
#BSUB -R "rusage[mem=X]"
</code>

where X is in MB, meaning: only run my job if that much memory is available.

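For instance, for the ~10.8 GB java jobs in the top example above, following the advice to move to ''mw256'' and reserve memory, the relevant lines might look like this (the 11264 MB figure is just an illustration of "a bit more than 10.8 GB"):

<code>
# move to a large-memory queue and reserve ~11 GB so the job
# only starts when that much memory is available
#BSUB -q mw256
#BSUB -R "rusage[mem=11264]"
</code>
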
\\