2015 Summer Expansion

Fourteen Supermicro 1U servers were purchased, each with dual 10-core processors. With hyper-threading turned on, that yields 40 logical cores per 1U of rack space, or a total of 560 new logical cores. However, we maximized cores and minimized our spending on memory. Each node has 32 GB of memory for 40 logical cores, an average of 0.8 GB/core. Tiny!

Hence we created the queue tinymem out of this hardware. The nodes also have a tiny 16 GB DOM (device on motherboard, a non-spinning hard disk), so do not use /localscratch. The other scratch area, /sanscratch, can be used; it is NFS-mounted on these nodes, as is /home.

Below are instructions on how to monitor your jobs and make sure they fit the purpose of this queue. The hardware was acquired to accommodate all the “swarming” serial jobs, thousands of them. But parallel jobs can also be run if they fit the small memory footprint.

tinymem

Since the hp12 nodes also have a small memory footprint, we can merge them into the tinymem queue as an experiment. If it does not work, we'll bring hp12 back in its original configuration. With hyper-threading on, these nodes have 12 GB of memory for 16 logical cores, or 0.75 GB/core.
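The per-core memory figures quoted for both node types can be checked with a few lines of shell arithmetic. This is only a sketch using the numbers stated on this page; the function name gb_per_core and the variable names are made up for illustration.

```shell
#!/bin/bash
# Memory per logical core, in GB, to two decimals.
#   $1 = node memory in GB, $2 = logical cores per node
gb_per_core() {
    awk -v m="$1" -v c="$2" 'BEGIN { printf "%.2f\n", m / c }'
}

# New Supermicro nodes: 2 sockets x 10 cores x 2 hyper-threads
new_logical=$(( 2 * 10 * 2 ))
# Fourteen of those nodes were purchased
new_total=$(( 14 * new_logical ))

echo "new nodes:  $new_logical logical cores per node, $new_total in total"
echo "new nodes:  $(gb_per_core 32 "$new_logical") GB/core"
echo "hp12 nodes: $(gb_per_core 12 16) GB/core"
```

Both node types land well under 1 GB per logical core, which is why they share one small-memory queue.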

So the tinymem queue consists of two types of nodes: let's call them mwtmnodes for the new hardware (2015) and hptmnodes for the old hp12 nodes (2006). The new hardware will be faster (1.3x without hyper-threading, and 1.35x with hyper-threading) and on top of that will be able to handle 2.5x more jobs per unit of time.

In light of that I have created node-specific resources:

  • “tmfast” for the mwtmnodes (n46-n59)
  • “tmslow” for the hptmnodes (n1-n32)

In addition to this I have set preferences within the tinymem queue to use the mwtmnodes first, then the hptmnodes. So if you do nothing and just submit to the queue, that is what will happen. But you can control this if you wish: you can “consume” these node-specific resources. There are 40 consumables for the mwtmnodes and 16 consumables for the hptmnodes. The syntax is like this:

  • #BSUB -R "rusage[tmfast=1]"
  • #BSUB -R "rusage[tmslow=1]"

You need to request a consumable for each job slot, so if using say #BSUB -n 4 the '1' becomes a '4'. Your job will go into the PEND state when the consumables are exhausted. When would you do this? For example, if you do not wish to run on the hptmnodes and are OK with waiting, or if the fabulous new hardware is clogged full of jobs and you wish to bypass it immediately. Or, for now, because the old hardware is running Red Hat 5 and the new hardware CentOS 6.
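As a sketch of the rule that the consumable count follows the -n value, the helper below prints a small job script with matching numbers. The function name make_tinymem_job, the file name myjob.sh and the ./a.out payload are placeholders, not part of the cluster setup; only the #BSUB lines follow the syntax described above.

```shell
#!/bin/bash
# Print a job script that requests one node-specific consumable per job slot.
#   $1 = number of job slots (the -n value)
#   $2 = consumable name: tmfast (mwtmnodes) or tmslow (hptmnodes)
#   $3 = command to run (placeholder, defaults to ./a.out)
make_tinymem_job() {
    local slots="$1" resource="$2" cmd="${3:-./a.out}"
    case "$resource" in
        tmfast|tmslow) ;;
        *) echo "unknown resource: $resource" >&2; return 1 ;;
    esac
    cat <<EOF
#!/bin/bash
#BSUB -q tinymem
#BSUB -n $slots
#BSUB -R "rusage[$resource=$slots]"
$cmd
EOF
}

# Example (not run here): 4 slots, new hardware only
#   make_tinymem_job 4 tmfast ./a.out > myjob.sh
#   bsub < myjob.sh
```

Leaving out the rusage line entirely gives the default behavior, where the queue preferences decide which node type is used.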

Doing nothing, that is not using or requesting consumables, is a perfect strategy too.

Today queue hp12 is closed while we wait for it to empty out. Then it disappears. -- Meij, Henk 2015/06/17 15:07

Monitor

Monitor your jobs and assess whether they are performing as you expect. To find out your memory footprint, from one of the “tail” login nodes try

  • ssh node_name top -u user_name -b -n 1

This runs top remotely, in batch mode. Look at the VIRT column for the memory the OS thinks you need; if that exceeds the node's capacity, move to another queue. For example (not picking on anybody!):

# this is an appropriate use of tinymem: about 14 MB virtual, barely 1 MB resident, per a.out

[hmeij@greentail ~]$ ssh n55 top -u dkweiss -b -n 1
Warning: Permanently added 'n55,192.168.102.65' (RSA) to the list of known hosts.
top - 15:42:03 up 4 days,  1:19,  0 users,  load average: 33.99, 33.97, 33.94    
Tasks: 829 total,  35 running, 794 sleeping,   0 stopped,   0 zombie             
Cpu(s):  3.1%us,  0.0%sy,  0.0%ni, 96.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st   
Mem:  32767008k total,  1464224k used, 31302784k free,   189544k buffers         
Swap:        0k total,        0k used,        0k free,   812704k cached          

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
11594 dkweiss   20   0 14216 1156  812 R 100.0  0.0 143:54.57 a.out             
11597 dkweiss   20   0 14216 1156  812 R 100.0  0.0 143:55.82 a.out             
11638 dkweiss   20   0 14216 1160  812 R 100.0  0.0 143:45.35 a.out             
11639 dkweiss   20   0 14216 1160  812 R 100.0  0.0 143:43.59 a.out 
...

# and here is a problem, at 10.8 GB per java process
# the new hardware can only handle 2 of these jobs (32 GB max, and 3 x 10.8 GB = 32.4 GB) and
# the old hardware can only run 1 of these (12 GB max)
# so you need to either reserve memory in advance (see below) or
# (better) move to another queue like mw256 (and also reserve memory)

[hmeij@greentail ~]$ ssh n58 top -u cjustice -b -n 1                                                          
top - 15:41:15 up 4 days,  1:19,  0 users,  load average: 1.22, 1.26, 1.61
Tasks: 697 total,   1 running, 696 sleeping,   0 stopped,   0 zombie   
Cpu(s):  3.3%us,  0.0%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32767004k total,  6222120k used, 26544884k free,   188512k buffers  
Swap:        0k total,        0k used,        0k free,  3460540k cached 

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
12733 cjustice  20   0 10.8g 2.1g 9964 S 98.2  6.6  30:32.73 java               
12722 cjustice  20   0 17420 1560  960 S  0.0  0.0   0:00.00 res                
12729 cjustice  20   0  103m 1232 1016 S  0.0  0.0   0:00.00 1434568248.4531    
12732 cjustice  20   0  104m 1244 1044 S  0.0  0.0   0:00.00 1434568248.4531 
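Reading VIRT and RES by eye gets tedious with many processes. A minimal sketch, assuming the top layout shown above (VIRT in column 5, RES in column 6, plain numbers in KiB with m and g suffixes for larger values), sums both columns for one command name. The function name sum_top_mem is made up for illustration.

```shell
#!/bin/bash
# Sum the VIRT and RES columns of `top -b -n 1` output for one command name.
#   $1 = command name to match in the COMMAND column, e.g. java or a.out
# Reads top output on stdin; prints "<virt_MB> <res_MB>".
sum_top_mem() {
    awk -v cmd="$1" '
        # Convert a top memory field (KiB, or with m/g suffix) to KiB.
        function to_kib(v) {
            if (v ~ /g$/) return (v + 0) * 1024 * 1024
            if (v ~ /m$/) return (v + 0) * 1024
            return v + 0
        }
        # Process lines start with a numeric PID; COMMAND is the last field.
        $1 ~ /^[0-9]+$/ && $NF == cmd {
            virt += to_kib($5); res += to_kib($6)
        }
        END { printf "%.1f %.1f\n", virt / 1024, res / 1024 }
    '
}

# Example (not run here):
#   ssh n58 top -u cjustice -b -n 1 | sum_top_mem java
```

Compare the first number against the node memory (32 GB on the new nodes, 12 GB on the old ones) to judge whether the jobs belong in tinymem.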

Also count your processes: does the count match the -n value? (Count the main processes only; don't worry about the startup shells, etc.)
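The count can be scripted as well. The sketch below assumes the same top batch layout as in the examples above and matches on the COMMAND column; the function names count_procs and check_slots are hypothetical.

```shell
#!/bin/bash
# Count processes with a given command name in `top -b -n 1` output (stdin).
count_procs() {
    awk -v cmd="$1" '$1 ~ /^[0-9]+$/ && $NF == cmd { n++ } END { print n + 0 }'
}

# Compare the count with the expected -n value.
#   $1 = command name, $2 = expected number of job slots
check_slots() {
    local found
    found=$(count_procs "$1")
    if [ "$found" -eq "$2" ]; then
        echo "ok: $found processes"
    else
        echo "mismatch: $found processes, expected $2"
        return 1
    fi
}

# Example (not run here):
#   ssh n55 top -u dkweiss -b -n 1 | check_slots a.out 4
```

Startup shells and wrapper processes have other command names, so they are ignored by the match.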

lsload node_name

The load values (r15s, r1m, r15m) should not exceed the JL/H value (jobs per host) found via bqueues; if they do, something is wrong.

[hmeij@greentail ~]$ lsload n58
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
n58                -ok   1.1   1.1   1.4   2%   0.0   0 2e+08 7808M    0M   29G
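The comparison against JL/H can be sketched as a small filter over lsload output. It assumes the column layout shown above (r15s, r1m and r15m in columns 3 to 5) and takes the limit as an argument rather than parsing bqueues; the function name check_load and the limit of 40 in the example are assumptions for illustration.

```shell
#!/bin/bash
# Read `lsload <node>` output on stdin and flag hosts whose r15s, r1m or
# r15m load exceeds a given limit.
#   $1 = job limit per host (the JL/H value looked up via bqueues)
# Prints "<host> ok" or "<host> OVER"; exit status 1 if any host is over.
check_load() {
    awk -v limit="$1" '
        # Skip the header and any line without load columns.
        $1 != "HOST_NAME" && NF >= 5 {
            over = 0
            for (i = 3; i <= 5; i++) if ($i + 0 > limit + 0) over = 1
            printf "%s %s\n", $1, (over ? "OVER" : "ok")
            if (over) bad = 1
        }
        END { exit bad + 0 }
    '
}

# Example (not run here), assuming a JL/H of 40 for this queue:
#   lsload n58 | check_load 40
```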

Reserve your memory in advance of running the job

#BSUB -R "rusage[mem=X]"

where X is in MB, meaning: run my job only if that much memory is available
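Since top reports large values in GB while X is given in MB, a conversion helps. The sketch below rounds up to whole MB and prints the matching line; the function name mem_reserve_line is made up, and whether to reserve the VIRT or the RES figure is left to your judgment.

```shell
#!/bin/bash
# Convert a size in GB (e.g. 10.8 as shown by top) to whole MB, rounded up,
# and print the matching #BSUB memory reservation line.
#   $1 = memory per job in GB
mem_reserve_line() {
    awk -v gb="$1" 'BEGIN {
        mb = gb * 1024
        if (mb > int(mb)) mb = int(mb) + 1   # round up to a whole MB
        printf "#BSUB -R \"rusage[mem=%d]\"\n", mb
    }'
}

# Example: the 10.8 GB java process from the top output
mem_reserve_line 10.8
```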



cluster/140.1434720203.txt.gz · Last modified: 2015/06/19 09:23 by hmeij