Fourteen Supermicro 1U servers were purchased, each with dual 10-core processors. With hyperthreading turned on, that yields 40 logical cores per 1U of rack space, or a total of 560 new logical cores. However, we maximized on cores and minimized our spending on memory: each node has 32 GB of memory for its 40 logical cores, an average of 0.8 GB/core. Tiny!
Hence we created the queue tinymem out of this hardware. These nodes also have a tiny 16 GB DOM (disk on module, a non-spinning hard disk), so do not use /localscratch. The other scratch, /sanscratch, can be used; it is an NFS mount on these nodes, as is /home.
Below are instructions on how to monitor your jobs and make sure they fit the purpose of this queue. It was acquired to accommodate all the “swarming” serial jobs, thousands of them. But parallel jobs can also be run if they fit the small memory footprint.
Note: the tinymem queue is now only n46-n59 and the hp12 queue is back as before — Meij, Henk 2015/12/04 13:52
Since the hp12 nodes also have a small memory footprint, we can merge them into the tinymem queue as an experiment. If it does not work, we'll bring the queue back in its original configuration. With hyperthreading on, these nodes have 12 GB of memory for 16 logical cores, or 0.75 GB/core.
So the tinymem queue consists of two types of nodes; let's call them mwtmnodes for the new hardware (2015) and hptmnodes for the old hp12 nodes (2006). The new hardware will be faster (1.3x without hyperthreading, 1.35x with) and on top of that will be able to handle 2.5x more jobs per unit of time.
In light of that I have created node-specific resources. In addition, I have set preferences within the tinymem queue to first use the mwtmnodes and then the hptmnodes. So if you do nothing and just submit to the queue, that is what will happen. But you can control this if you wish: you can “consume” these node-specific resources. There are 40 consumables for the mwtmnodes and 16 consumables for the hptmnodes. The syntax is like this:
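A minimal sketch of what such a request could look like, assuming the consumable resources carry the names used above (mwtmnodes, hptmnodes); verify the actual resource names on your cluster before relying on them:

```shell
# Hypothetical sketch: request one mwtmnodes consumable per job slot.
# The resource names follow the text above; check the real names with
# "lsinfo" or "bhosts -s" before using them.
#BSUB -q tinymem
#BSUB -n 1
#BSUB -R "rusage[mwtmnodes=1]"
```

To target the older nodes instead, the same pattern would use `rusage[hptmnodes=1]`.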
You need to request a consumable for each job slot, so if using, say, #BSUB -n 4, the '1' becomes a '4'. Your job will go to PEND when the consumables are exhausted. When would you do this? For example, if you do not wish to run on the hptmnodes and are OK with waiting, or if the fabulous new hardware is clogged full of jobs and you wish to bypass those immediately. Or, for now, because the old hardware is running Red Hat 5 and the new CentOS 6.
Doing nothing, that is, not using or requesting consumables, is a perfectly fine strategy too.
Today the hp12 queue is closed while we wait for it to empty out. Then it disappears.
— Meij, Henk 2015/06/17 15:07
Monitor your jobs and assess whether they are performing as you expect. To find your memory footprint from one of the “tail” login nodes, try the following.
Use top, remotely, in batch mode. Look at the VIRT column for the memory usage the OS thinks you need; if that exceeds the node's capacity, go to other queues. For example (not picking on anybody!):
# this is an appropriate use of tinymem, barely 1 MB per a.out
[hmeij@greentail ~]$ ssh n55 top -u dkweiss -b -n 1
Warning: Permanently added 'n55,192.168.102.65' (RSA) to the list of known hosts.
top - 15:42:03 up 4 days, 1:19, 0 users, load average: 33.99, 33.97, 33.94
Tasks: 829 total, 35 running, 794 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.1%us, 0.0%sy, 0.0%ni, 96.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32767008k total, 1464224k used, 31302784k free, 189544k buffers
Swap: 0k total, 0k used, 0k free, 812704k cached

  PID USER      PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+   COMMAND
11594 dkweiss   20  0 14216 1156  812 R 100.0  0.0 143:54.57 a.out
11597 dkweiss   20  0 14216 1156  812 R 100.0  0.0 143:55.82 a.out
11638 dkweiss   20  0 14216 1160  812 R 100.0  0.0 143:45.35 a.out
11639 dkweiss   20  0 14216 1160  812 R 100.0  0.0 143:43.59 a.out
...

# and here is a problem, at 10.8 GB per java process:
# the new hardware can only handle 3 of these jobs (32 GB max) and
# the old hardware can only run 1 of them (12 GB max), so you need to
# either reserve memory in advance (see below) or (better) move to
# another queue like mw256 (and also reserve memory there)
[hmeij@greentail ~]$ ssh n58 top -u cjustice -b -n 1
5:41:15 up 4 days, 1:19, 0 users, load average: 1.22, 1.26, 1.61
Tasks: 697 total, 1 running, 696 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.3%us, 0.0%sy, 0.0%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32767004k total, 6222120k used, 26544884k free, 188512k buffers
Swap: 0k total, 0k used, 0k free, 3460540k cached

  PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM  TIME+   COMMAND
12733 cjustice  20  0 10.8g 2.1g 9964 S 98.2  6.6 30:32.73 java
12722 cjustice  20  0 17420 1560  960 S  0.0  0.0  0:00.00 res
12729 cjustice  20  0  103m 1232 1016 S  0.0  0.0  0:00.00 1434568248.4531
12732 cjustice  20  0  104m 1244 1044 S  0.0  0.0  0:00.00 1434568248.4531
Also count your processes: does the count match your -n value? (Count the main processes; don't worry about the startup shells, etc.)
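One quick way to do that count is to pipe the batch top output through grep. On the cluster you would run something like `ssh n55 top -u dkweiss -b -n 1 | grep -c 'a\.out$'` (node, user, and program name are just the examples from above); the sketch below applies the same pipeline to a saved snippet of that output:

```shell
# Count the main worker processes in a captured batch "top" snapshot.
# The snapshot lines are copied from the example output above.
snapshot='11594 dkweiss 20 0 14216 1156 812 R 100.0 0.0 143:54.57 a.out
11597 dkweiss 20 0 14216 1156 812 R 100.0 0.0 143:55.82 a.out
11638 dkweiss 20 0 14216 1160 812 R 100.0 0.0 143:45.35 a.out
11639 dkweiss 20 0 14216 1160 812 R 100.0 0.0 143:43.59 a.out'

# grep -c prints the number of matching lines; anchoring on the
# command name avoids counting shells or unrelated processes.
echo "$snapshot" | grep -c 'a\.out$'
```

If the count does not match the -n value you submitted with, something is off with your job.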
lsload node_name
The r??m load values should not exceed the JL/H value (job slot limit per host) found via bqueues; if they do, something is wrong.
[hmeij@greentail ~]$ lsload n58
HOST_NAME  status  r15s  r1m  r15m  ut  pg   ls  it     tmp    swp  mem
n58        -ok     1.1   1.1  1.4   2%  0.0  0   2e+08  7808M  0M   29G
Reserve your memory in advance of running the job
#BSUB -R "rusage[mem=X]"
where X is in MB, meaning: dispatch my job only when that much memory is available on the node.
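Putting the pieces together, a minimal tinymem submit script could look like the sketch below; the job name, output files, program name a.out, and the 1024 MB figure are all illustrative, not prescribed values:

```shell
#!/bin/bash
# Hypothetical serial job for the tinymem queue.
#BSUB -q tinymem
#BSUB -n 1
#BSUB -J myjob
#BSUB -o out.%J
#BSUB -e err.%J
# Reserve 1024 MB per job slot up front, so the scheduler only
# dispatches this job when that much memory is free on the node.
#BSUB -R "rusage[mem=1024]"

./a.out
```

Submit it with `bsub < myjob.sh` and then monitor it with top and lsload as shown above.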