cluster:140 [2015/06/17 14:46]
hmeij created
cluster:140 [2015/12/04 13:53] (current)
hmeij [tinymem]
===== tinymem =====

Note: the tinymem queue is now only n46-n59 and the hp12 queue is back as before
 --- //[[hmeij@wesleyan.edu|Meij, Henk]] 2015/12/04 13:52//
  
Since the hp12 nodes also have a small memory footprint, we can merge them into the ''tinymem'' queue as an experiment. If it does not work, we'll bring it back in the original configuration. With hyper-threading on, these nodes have 12 GB of memory for 16 logical cores, or 0.75 GB/core.
  * #BSUB -R "rusage[tmslow=1]"
  
You need to request a consumable for each job slot, so if using, say, ''#BSUB -n 4'', the '1' becomes a '4'. Your job will go PENDing when the consumables are exhausted. When would you do this? For example, if you do not wish to run on the hptmnodes and are OK with waiting, or if the fabulous new hardware is clogged full of jobs and you wish to bypass those immediately. Or, for now, because the old hardware is running RedHat 5 and the new CentOS 6.
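A sketch of what such a submission might look like for a four-slot job (the ''tinymem'' queue and ''a.out'' program follow the examples on this page; adjust the consumable count to match your own ''-n'' value):

<code>
#!/bin/bash
# request 4 job slots and one tmslow consumable per slot
#BSUB -q tinymem
#BSUB -n 4
#BSUB -R "rusage[tmslow=4]"
./a.out
</code>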

Doing nothing, that is, not using or requesting consumables, is a perfectly good strategy too.
 +
Today queue ''hp12'' is closed while we wait for it to empty out; then it disappears.
 --- //[[hmeij@wesleyan.edu|Meij, Henk]] 2015/06/17 15:07//
===== Monitor =====

Monitor your jobs and assess whether they are performing as you expect. To find out your memory footprint from one of the "tail" login nodes, try

  * ssh node_name top -u user_name -b -n 1

**Use top, remotely, in batch mode.** Look at the VIRT column for the memory usage the OS thinks you need; if that exceeds the node's capacity, move to another queue. For example (not picking on anybody!):

<code>
# this is an appropriate use of tinymem, barely 1 MB per a.out

[hmeij@greentail ~]$ ssh n55 top -u dkweiss -b -n 1
Warning: Permanently added 'n55,192.168.102.65' (RSA) to the list of known hosts.
top - 15:42:03 up 4 days,  1:19,  0 users,  load average: 33.99, 33.97, 33.94
Tasks: 829 total,  35 running, 794 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.1%us,  0.0%sy,  0.0%ni, 96.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32767008k total,  1464224k used, 31302784k free,   189544k buffers
Swap:        0k total,        0k used,        0k free,   812704k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11594 dkweiss   20   0 14216 1156  812 R 100.0  0.0 143:54.57 a.out
11597 dkweiss   20   0 14216 1156  812 R 100.0  0.0 143:55.82 a.out
11638 dkweiss   20   0 14216 1160  812 R 100.0  0.0 143:45.35 a.out
11639 dkweiss   20   0 14216 1160  812 R 100.0  0.0 143:43.59 a.out
...

# and here is a problem, at 10.8 GB per java process:
# the new hardware can only handle 3 of these jobs (32 GB max) and
# the old hardware can only run 1 of these (12 GB max),
# so you need to either reserve memory in advance (see below) or
# (better) move to another queue like mw256 (and also reserve memory)

[hmeij@greentail ~]$ ssh n58 top -u cjustice -b -n 1
top - 15:41:15 up 4 days,  1:19,  0 users,  load average: 1.22, 1.26, 1.61
Tasks: 697 total,   1 running, 696 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.3%us,  0.0%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32767004k total,  6222120k used, 26544884k free,   188512k buffers
Swap:        0k total,        0k used,        0k free,  3460540k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12733 cjustice  20   0 10.8g 2.1g 9964 S 98.2  6.6  30:32.73 java
12722 cjustice  20   0 17420 1560  960 S  0.0  0.0   0:00.00 res
12729 cjustice  20   0  103m 1232 1016 S  0.0  0.0   0:00.00 1434568248.4531
12732 cjustice  20   0  104m 1244 1044 S  0.0  0.0   0:00.00 1434568248.4531
</code>

Also count your processes: does the count match your -n value? (Count the main processes; don't worry about the startup shells, etc.)
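For a quick count, you can filter the remote ''top'' output for your program's name (the node ''n55'', user, and ''a.out'' here are just the placeholders from the example above):

<code>
ssh n55 top -u dkweiss -b -n 1 | grep -c a.out
</code>

The number printed should match your -n value.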

**lsload node_name**

The r15s/r1m/r15m load values should not exceed the JL/H value (job slot limit per host) found via ''bqueues''; if they do, something is wrong.

<code>
[hmeij@greentail ~]$ lsload n58
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
n58                -ok   1.1   1.1   1.4   2%   0.0   0 2e+08 7808M    0M   29G
</code>

**Reserve your memory in advance of running the job**

<code>
#BSUB -R "rusage[mem=X]"
</code>

where X is in MB, meaning: only run my job if that much memory is available.
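For instance, for the 10.8 GB java job shown above, one might reserve a bit more than its footprint on a big-memory queue (the 11000 MB figure is illustrative, not a recommendation):

<code>
#BSUB -q mw256
#BSUB -R "rusage[mem=11000]"
</code>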
  
  
cluster/140.1434566819.txt.gz · Last modified: 2015/06/17 14:46 by hmeij