\\ **[[cluster:0|Home]]**

===== Cluster Steering Committee 05/09/2007 =====

Present: James Taft, Jolee West, Henk Meij, Francis Starr, David Beveridge, Eric Aaron, Tsampikos Kottos, George Petersson

==== ToDos ====

  * fix PE2950s (Dell issued)
  * dm-multipath failover (fibre channel): done! 05/14/07
  * filesystem fsck tests (affirm 1T LUN size)
  * rebuild a node: done that! hosed the ionode 05/08/07

==== Next Steps ====

  * create accounts (all members of cluster_advisory_group) by the end of the week of 05/18/07
  * open for test & development (no home dir backups, few snapshots); the approach seemed reasonable, especially since we can snapshot more frequently at first, while the filesystem is relatively unused.
  * install software ... prioritization led to: amber, delphi, imsl. Others suggested were an open-source fortran 90 compiler (does it exist? it appears so: [[http://www.g95.org|G95]]), xmgrace and ddd.

| group to prioritize => | portland compilers, Matlab, charm, amber, namd, gromacs, gaussian + linda, R, Stata |

  * adjust queues; the current queues are listed below in descending order of priority. It was decided to bring up a few general-purpose queues with few or no restrictions.

| name | description (lw=light weight, hw=heavy weight, i=infiniband) |
| => specialty queues ||
| priority | urgent jobs, limited by users allowed, 8 hrs of cpu time, max cores=8, any lw node |
| checkpoint | jobs will be checkpointed, max queued jobs=16, max jobs/user=2, cpu time=10 hrs, lw nodes. jobs are rerunnable |
| icheckpoint | same as above but ilw nodes |
| debug | hosts login1 & login2 (ethernet), max queued jobs=16, max jobs/user=8, cpu time=1 hr. scheduled with relatively high priority. lw nodes |
| idebug | same as above but ilw nodes only (hosts ilogin & ilogin2) |
| => production queues ||
| 16-lwnodes | for normal jobs, lw nodes, max cores=32 |
| 16-ilwnodes | for normal jobs, ilw nodes, max cores=128 |
| 04-hwnodes | for large memory jobs, hw nodes, max cores=32, fast /localscratch |
| => default queue ||
| idle | jobs run on any lightly loaded host, max cores=8, max jobs/user=1, cpu time=10 hrs |

==== Other Items ====

  * get a quote for an LSF upgrade for all nodes
  * initiate TSM tape backups while the filesystem is still rather small, by adding some tapes to empty slots.