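Lava/LSF works via a variety of daemon processes that communicate with each other. The Load Information Manager (LIM) on each host gathers the built-in load information (for example from /dev/kmem), and the batch daemons answer the user commands (bjobs, bqueues, bsub, etc.).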
So an eLIM is a custom-defined resource monitor: an "external LIM", hence eLIM. To make it work, the resource must be defined in the configuration files shown below. In my example I set up resource monitors for the available disk space (in MB) on the filesystems /sanscratch and /localscratch. Note that one is shared by all nodes and the other is not.
Once defined, users may use the output of these monitors in the resource request string of bsub. For example: run my job on a host in the heavy weight queue if more than 300 GB of scratch space is available in /sanscratch and 8 GB of memory can be allocated. Monitor "mem" is internal (provided by LSF); monitor "sanscratch" is external.
bsub -q 04-hwnodes -R "sanscratch>300000 & mem>8000" …
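Both monitors report values in MB, so the thresholds in the request string are MB as well. A minimal sketch of building the string from sizes in GB follows; the helper name mk_req and the job script myjob.sh are made up for illustration, and 1 GB is taken as 1000 MB to match the 300 GB to 300000 example.

```shell
# Hypothetical helper: build a resource request string from sizes in GB.
# Assumes 1 GB = 1000 MB, as in the 300 GB -> 300000 example.
mk_req() {
    scratch_gb=$1
    mem_gb=$2
    echo "sanscratch>$((scratch_gb * 1000)) & mem>$((mem_gb * 1000))"
}

req=$(mk_req 300 8)
echo "$req"    # prints: sanscratch>300000 & mem>8000

# On a submit host the string can then be passed along
# (myjob.sh is a placeholder for your own job script):
#   bsub -q 04-hwnodes -R "$req" ./myjob.sh
```

Keeping the string in one variable makes it easy to reuse the exact same request for several submissions.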
If you need a custom monitor, describe it in plain English and email the request to hpcadmin@wesleyan.edu.
lsf.cluster.lava
Begin ResourceMap
RESOURCENAME  LOCATION
...
# one shared instance for all hosts -hmeij
sanscratch    [all]
# one instance local for each host -hmeij
localscratch  [default]
End ResourceMap
lsf.shared
Begin Resource
...
# below are custom resources -hmeij
sanscratch    Numeric  30  N  (Available Disk Space in M)
localscratch  Numeric  30  N  (Available Disk Space in M)
End Resource
Now we write a simple Perl or bash program that reports the values we are interested in to standard output.
/share/apps/scripts/elim
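#!/usr/bin/perl
# elim to report available disk space -hmeij

# unbuffer STDOUT so each report is written out immediately
$| = 1;

while (1) {

  $tmp = `df -B M /sanscratch /localscratch`;
  @tmp = split(/\n/,$tmp);

  # /sanscratch: the indices assume df wraps this entry over two lines,
  # so line 2 holds size, used, available (field 2)
  $tmp[2] =~ s/\s+/ /g;
  $tmp[2] =~ s/^\s+//g;
  @f = split(/ /,$tmp[2]);
  chop($f[2]);   # strip the trailing M
  $sanscratch = $f[2];

  # /localscratch: a single line, so available is field 3
  $tmp[3] =~ s/\s+/ /g;
  $tmp[3] =~ s/^\s+//g;
  @f = split(/ /,$tmp[3]);
  chop($f[3]);   # strip the trailing M
  $localscratch = $f[3];

  # nr_of_args name1 value1 name2 value2 ...
  $string = "2 sanscratch $sanscratch localscratch $localscratch";

  # you need the \n to flush -hmeij
  print "$string \n";
  # or use
  #syswrite(OUT,$string,1);

  # reporting interval, as specified in lsf.shared
  sleep 30;

}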
Make this script available as /opt/lava/6.1/linux2.6-glibc2.3-ia32e/etc/elim on each node that needs to report these values.
Test to make sure it works …
[root@nfs-2-2 ~]# /opt/lava/6.1/linux2.6-glibc2.3-ia32e/etc/elim
2 sanscratch 979605 localscratch 232989
2 sanscratch 979605 localscratch 232989
2 sanscratch 979605 localscratch 232989
...
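The Perl elim above depends on the exact line layout of the df output. Since the program may also be written in bash, here is a rough sketch of a bash variant that asks df for one filesystem at a time with -P (one line per filesystem), which should make the parsing less sensitive to line wrapping. It assumes GNU df; the function names are made up for illustration, and this is not the script in use on the cluster.

```shell
#!/bin/bash
# Sketch of a bash elim; assumes GNU df (-P = one line per filesystem, -B M = sizes in MB).

# Read `df -P -B M <path>` output on stdin, print the available MB without the trailing M.
parse_avail_mb() {
    awk 'NR == 2 { sub(/M$/, "", $4); print $4 }'
}

# Available space in MB for the filesystem holding path $1.
avail_mb() {
    df -P -B M "$1" | parse_avail_mb
}

# One report line: number of resources, then name/value pairs.
report_line() {
    echo "2 sanscratch $1 localscratch $2"
}

# Report forever, at the interval of 30 seconds specified in lsf.shared.
elim_loop() {
    while true; do
        report_line "$(avail_mb /sanscratch)" "$(avail_mb /localscratch)"
        sleep 30
    done
}

# Start the loop only when the script is installed and run under the name "elim".
case "$0" in
    elim|*/elim) elim_loop ;;
esac
```

As with the Perl version, run it by hand on a node first and check that a report line appears every 30 seconds.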
Restart LIMs and MBD …
[root@swallowtail ~]# lsadmin reconfig
...
[root@swallowtail ~]# badmin mbdrestart
...
You can now query the monitors. Use lsload to view the collected information by host, either by brute force with the -l option (see below) or with the -R option. Once your -R option works with lsload, you can use it with bsub too.
It's unfortunate we can't get the display to list the full value of all monitors (instead of 7e+04 etc) but it's just a display issue.
For example:
[root@swallowtail conf]# lsload -R "sanscratch>300000 & mem>8000"
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
nfs-2-2             ok   0.0   0.0   0.1   0%   3.9   0  1382 7140M 4000M   16G
nfs-2-4             ok   1.0   2.4   7.7  36% 484.2   0 10080 7128M 3998M 9944M
To show the full value of the localscratch column, use the -I option (that's a capital 'i'):
[root@swallowtail ~]# lsload -w -Ilocalscratch
HOST_NAME       status  localscratch
swallowtail         ok             -
compute-1-6         ok       71248.0
compute-1-2         ok       71248.0
compute-1-9         ok       71248.0
compute-1-13        ok       71248.0
compute-1-14        ok       71248.0
compute-1-8         ok       71248.0
...
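Because this form prints the full values, its output is also easy to post-process. A small sketch, assuming the three-column layout shown above: it lists the hosts whose localscratch value is below a threshold in MB. The function name low_scratch, the 50000 MB threshold, and the 41248.0 value in the sample data are made up for illustration.

```shell
# Print hosts whose localscratch value (MB) is below the threshold given as $1.
# Reads `lsload -w -Ilocalscratch` output on stdin; hosts reporting "-" are skipped.
low_scratch() {
    awk -v min="$1" 'NR > 1 && $3 != "-" && $3 + 0 < min { print $1, $3 }'
}

# Demo on sample data (the 41248.0 value is invented for this example).
low_scratch 50000 <<'EOF'
HOST_NAME       status  localscratch
swallowtail         ok             -
compute-1-6         ok       71248.0
compute-1-2         ok       41248.0
EOF
# prints: compute-1-2 41248.0

# On the cluster:
#   lsload -w -Ilocalscratch | low_scratch 50000
```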
…brute force…
[root@swallowtail conf]# lsload -l
HOST_NAME     status  r15s   r1m  r15m    ut      pg     io  ls     it    tmp    swp    mem localscratch sanscratch
swallowtail       ok   0.0   0.0   0.0    0%    57.4    961   7      2  4424M  3996M   865M            -     979606
compute-1-6       ok   0.0   0.0   0.0    0%     3.1     58   0   4608  7136M  4000M  3770M        7e+04     979606
compute-1-12      ok   0.0   0.0   0.0    0%     3.5     64   0  17616  7136M  4000M  3772M        7e+04     979606
compute-1-15      ok   0.0   0.0   0.0    0%     3.8     69   0   4192  7136M  4000M  3770M        7e+04     979606
nfs-2-2           ok   0.0   0.0   0.2    0%     4.2     82   0   1380  7140M  4000M    16G        2e+05     979606
compute-1-2       ok   0.0   0.0   0.0    0%     4.4     82   0   4608  7136M  3782M  3604M        7e+04     979606
compute-1-14      ok   0.0   0.0   0.0    0%     4.4     77   0  17616  7136M  4000M  3768M        7e+04     979606
compute-1-11      ok   0.0   0.0   0.0    0%     4.7     89   0   4608  7136M  3866M  3684M        7e+04     979606
compute-1-10      ok   0.0   0.0   0.0    0%     7.0    133   0  17616  7136M  3968M  3824M        7e+04     979606
ionode-1          ok   0.0   0.6   0.2    0%   3e+03  5e+04   1   1757  6932M  4000M  1709M            -     979606
compute-1-9       ok   0.0   0.0   0.0    0%     3.8     70   0     81  7136M  4000M  3770M        7e+04     979606
compute-1-13      ok   0.0   0.0   0.0    0%     3.6     67   0  17616  7148M  4000M  3766M        7e+04     979606
compute-1-7       ok   0.0   0.2   0.9    3%   2e+03  3e+04   0   4608  7136M  4000M  3768M        7e+04     979606
compute-1-3       ok   0.0   0.0   0.0    0%     4.0     72   0   4608  7136M  4000M  3774M        7e+04     979606
compute-1-8       ok   0.0   0.2   0.0    0%     4.4     84   0   2940  6416M  4000M  3812M        7e+04     979606
compute-1-4       ok   0.0   0.0   0.0    0%     4.9     91   0  17616  7136M  4000M  3770M        7e+04     979606
compute-1-1       ok   0.3   0.0   0.0    0%     4.8     84   0   1731  7136M  3822M  3640M        7e+04     979606
compute-2-32      ok   1.0   1.0   1.0   13%     5.4     97   0   1447  7144M  4000M  3738M        7e+04     979606
nfs-2-1           ok   1.0   8.7   7.3  100%     6.7    127   1     30  7140M  3958M  3614M        2e+05     979606
nfs-2-3           ok   1.0   5.6   8.0   60%  1014.0  2e+04   0  11704  7136M  3958M  3548M        2e+05     979606
compute-1-16      ok   1.0   1.0   1.0   13%     5.8    108   0   4604  7136M  4000M  3734M        7e+04     979606
nfs-2-4           ok   1.2   8.5   8.6   94%   105.4   1974   0  10072  7128M  3998M  3544M        2e+05     979606
compute-1-23      ok   2.0   2.0   2.0   25%     5.5    103   0  17616  7140M  4000M  3658M        7e+04     979606
compute-1-27      ok   2.0   2.0   2.0   25%     5.4    104   0  17616  7140M  4000M  3550M        7e+04     979606
compute-2-30      ok   2.0   2.0   2.0   25%     5.5    103   0  17616  7148M  4000M  3644M        7e+04     979606
compute-1-26      ok   2.0   2.0   2.0   25%     5.4     99   0  17616  7140M  4000M  3644M        7e+04     979606
compute-1-17      ok   2.0   2.0   2.0   25%     5.9    106   0  17616  7140M  4000M  3636M        7e+04     979606
compute-2-29      ok   2.0   2.0   2.0   25%     6.2    114   0  17616  7148M  4000M  3634M        7e+04     979606
compute-1-25      ok   2.0   2.1   2.0   25%     5.3     99   0   4604  7140M  4000M  3642M        7e+04     979606
compute-1-19      ok   2.0   2.0   2.0   25%     4.8     88   0   4604  7140M  4000M  3636M        7e+04     979606
compute-2-31      ok   3.0   3.0   3.0   38%     5.5    109   0   4196  7144M  4000M  3488M        7e+04     979606
compute-1-20      ok   3.0   3.0   3.0   38%     4.8     93   0   4604  7140M  4000M  3562M        7e+04     979606
compute-1-21      ok   3.0   3.0   3.0   38%     5.8    110   0   4604  7140M  4000M  3638M        7e+04     979606
compute-1-22      ok   3.0   3.0   3.0   38%     6.0    110   0   4604  7140M  4000M  3636M        7e+04     979606
compute-1-18      ok   4.0   4.0   4.0   50%     5.6    100   0   4604  7140M  4000M  3552M        7e+04     979606
compute-1-24      ok   4.0   4.0   4.0   50%     5.8    105   0   4604  7140M  4000M  3546M        7e+04     979606
compute-2-28      ok   5.0   5.0   5.0   62%     4.3     83   0  17616  7140M  4000M  3460M        7e+04     979606
compute-1-5       ok   7.0   7.0   7.0   88%     4.4     79   0   4608  7136M  4000M  2838M        7e+04     979606