\\
**[[cluster:0|Back]]**

===== Running Gaussian =====

To run Gaussian jobs on the cluster, read this page. It may help you identify some of the errors you may encounter while getting your jobs to run, and it may give you ideas for increasing your overall job throughput.

==== Access ====

You must be a member of the group ''gaussian'' in order to execute the Gaussian binaries. Membership also implies that you have read the license, which is located at ''/share/apps/gaussian/License.pdf''. Requests for access should be emailed to ''hpcadmin@wesleyan.edu''.

==== Env ====

To run your job, add the lines below to the script that submits your job. The scheduler copies your submission environment, so issue the ''newgrp gaussian'' command in your current shell before submitting a job. Putting that command inside the script has no effect (it invokes a subshell). Alternatively, you can add it to your ''~/.bash_profile'' startup script.

  * for bash shell

<code bash>
export g03root="/share/apps/gaussian/g03root"
. $g03root/g03/bsd/g03.profile
# set scratch dir inside your job script
export GAUSS_SCRDIR="$MYLOCALSCRATCH"
</code>

  * for csh shell

<code csh>
setenv g03root "/share/apps/gaussian/g03root"
source $g03root/g03/bsd/g03.login
# set scratch dir inside your job script
setenv GAUSS_SCRDIR "$MYLOCALSCRATCH"
</code>

==== Submit ====

Job submissions of Gaussian jobs follow the normal procedure. Below is an example, which we will also use for performance testing later on. This test is a compute bound problem, so your mileage may vary, especially if your job is IO bound. Our example is "test397", found in ''/share/apps/gaussian/g03root/g03/tests/com/''.
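Before trying the example, it is easy to confirm from the shell that your account is actually in the required group. A small sketch; ''is_member'' is our own helper, not a cluster command:

```shell
# Sketch: verify group membership before submitting.
# is_member is a hypothetical helper, not part of LSF or Gaussian.
is_member() {   # usage: is_member <group> "<space-separated group list>"
    echo "$2" | tr ' ' '\n' | grep -qx "$1"
}

if is_member gaussian "$(id -Gn)"; then
    echo "ok: member of the gaussian group"
else
    echo "not a member -- email hpcadmin@wesleyan.edu"
fi
```

Remember that after being added to the group you still need ''newgrp gaussian'' (or a fresh login) for the change to take effect in your current shell.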
  * input file: ''t397a.com''

<code>
%Mem=12GB
%NProc=8
#p rb3lyp/3-21g force test scf=novaracc

Gaussian Test Job 397: Valinomycin force

0,1
O,-1.3754834437,-2.5956821046,3.7664927822
O,-0.3728418073,-0.530460483,3.8840401686
O,2.3301890394,0.5231526187,1.7996834334
O,0.2842272248,2.5136416005,-0.2483875054
O,2.3870396194,3.3004808604,0.2860546915
O,3.927241841,1.9677029583,-2.7261655162
O,2.2191878407,-1.0673859692,-2.0338343532
etc,etc,etc
</code>

  * job file: ''t397a.run''

<code bash>
#!/bin/bash
#BSUB -q gaussian
#BSUB -m nfs-2-4
#BSUB -n 8
#BSUB -o t397a.out
#BSUB -e t397a.err
#BSUB -J t397a

input=t397a

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# cd to remote working dir
cd $MYLOCALSCRATCH

# environment
export GAUSS_SCRDIR="$MYLOCALSCRATCH"
export g03root="/share/apps/gaussian/g03root"
. $g03root/g03/bsd/g03.profile

cp ~/gaussian/test397/$input.com .
time g03 < $input.com > output
cp ./output ~/gaussian/test397/$input.$LSB_JOBID.out
</code>

You would submit this in the typical way: ''bsub < t397a.run''

You may have noticed that we force the job to run on a specific host in the gaussian queue with the ''#BSUB -m'' option. More about this later. You can find idle hosts either at the web address http://swallowtail.wesleyan.edu/clumon or via the command ''bjobs -m host_name -u all''.

==== Threads ====

Gaussian is a threaded application. Our submission above results in the typical output of ''bjobs'', and we see that we are occupying 8 job slots.

<code>
[hmeij@swallowtail test397]$ bsub < t397a.run
Job <14131> is submitted to queue <gaussian>.

[hmeij@swallowtail test397]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST    EXEC_HOST   JOB_NAME   SUBMIT_TIME
14131   hmeij   RUN   gaussian   swallowtail  8*nfs-2-4   t397a      Aug 31 14:33
</code>

However, when we log into the compute node (as root) and run ''top'', only one process reveals itself. The load hovers around 9.
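A quicker way to confirm the thread count, assuming a reasonably recent procps ''ps'' (the ''nlwp'' output field reports the number of kernel threads per PID; ''nthreads'' is our own wrapper name):

```shell
# Print the number of threads (light-weight processes) of a PID.
# nlwp is a standard procps output field; nthreads is our own name.
nthreads() { ps -o nlwp= -p "$1" | tr -d ' '; }

nthreads $$        # a plain shell reports a single thread
# nthreads 29725   # the Gaussian l502.exe PID would report its thread count
```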
In other parallel jobs, like Amber, multiple copies of the same binary run, tied together with MPI (Message Passing Interface). But in this case only one process is running.

<code>
top - 14:44:12 up 5 days, 5:45, 1 user, load average: 8.82, 7.37, 4.90
Tasks: 112 total, 4 running, 108 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.8% us, 0.2% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem:  16415160k total, 13783092k used,  2632068k free,   53088k buffers
Swap:  4096564k total,     1500k used,  4095064k free,  955908k cached

  PID USER  PR NI  VIRT RES  SHR S %CPU %MEM    TIME+ COMMAND
29725 hmeij 24  0 12.1g 12g 3140 R 99.9 76.7 37:05.97 l502.exe
...
</code>

To reveal the threads used by Gaussian, we can run the following command:

<code>
[root@nfs-2-4 ~]# ps -Hm 29725
  PID TTY MAJFLT   MINFLT  TRS      DRS    SIZE SWAP      RSS SHRD LIB DT COMMAND
29725 ?      129 25630522 5371 12687868 3173310    - 12597124    -   -  - /share/apps/gaussian/g03root/g03/l502.exe 1610612736 /localscratch/14131/Gau-29725.chk 0 /localscratch/14131/Gau-29725.int 0 /localscratch/14131/Gau-29725.rwf 0 /localscratch/14131/Gau-29725.d2e 0 /localscratch/14131/Gau-29725.scr 0 /localscratch/14131/Gau-29724.inp 0 junk.out 0
29725 ?        1   393745 5371 12687868 3173310    - 12597124    -   -  - /share/apps/gaussian/g03root/g03/l502.exe 1610612736 /localscratch/14131/Gau-29725.chk 0 /localscratch/14131/Gau-29725.int 0 /localscratch/14131/Gau-29725.rwf 0 /localscratch/14131/Gau-29725.d2e 0 /localscratch/14131/Gau-29725.scr 0 /localscratch/14131/Gau-29724.inp 0 junk.out 0
... 6 more threads will be listed ...
</code>

==== Nprocs vs Mem ====

In the submission above we requested that the scheduler reserve 8 cores on a single host:

<code>
#BSUB -m nfs-2-4
#BSUB -n 8
</code>

and we instructed Gaussian to launch 8 threads and allocate 12 GB of memory:

<code>
%Mem=12GB
%NProc=8
</code>

But what is our best option regarding job throughput? Good question.

==== Matrix ====

The table below is probably a bit excessive, but I strongly urge you to build a similar table.
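Generating the sweep of input headers and job files can be scripted; a minimal sketch (the file names, the input body, and the queue are placeholders, so adapt it to your own job):

```shell
# Sketch of a generator for the threads-vs-memory sweep.
# File names and contents are placeholders, not a real Gaussian input.
cd "$(mktemp -d)"   # work in a scratch directory

make_job() {
    local cores=$1 mem=$2 name=$3
    # Gaussian input header: memory and thread count
    cat > "$name.com" <<EOF
%Mem=${mem}GB
%NProc=${cores}
EOF
    # matching LSF job file: ask the scheduler for the same core count
    cat > "$name.run" <<EOF
#!/bin/bash
#BSUB -q gaussian
#BSUB -n ${cores}
#BSUB -o ${name}.out
#BSUB -e ${name}.err
#BSUB -J ${name}
EOF
}

make_job 8 12 t397a   # 8 cores, 12 GB
make_job 2 3  t397g   # 2 cores,  3 GB
```

Each pair is then submitted with ''bsub < name.run'' as before.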
Try several combinations of ''threads * mem'' values and observe which suits you best. In the table below, we start by requesting ''#BSUB -n 8'' and play with the memory footprint.

Remember that when you request 8 cores via the scheduler, your job can only be dispatched when a machine is entirely idle. However, the option may exist to run with fewer cores (threads) and possibly a smaller memory footprint. Your job may take longer to run, but the possibility increases of your jobs running on any host, or of multiple jobs running on a single host, that meets your requirements. So here is what I did:

^ Heavy Weight Node 16 Gb Memory ^^^^^^^^
^ BSUB ^ NProcs ^ GbMem ^ RealTime ^ Node ^ Load ^ Status ^ Job ^
| first block ||||||||
| 8 | 8 | 12 | 020m12s | n2-4 | 8+ | Full | t397a |
| 8 | 8 | 06 | 018m38s | n2-4 | 8+ | Full | t397aa |
| 8 | 8 | 03 | 018m01s | n2-4 | 4+ | Full | t397aa |
| second block ||||||||
| 4 | 4 | 06 | 037m01s | n2-4 | 5+ | Normal | t397f |
| 2 | 2 | 03 | 066m53s | n2-4 | 3 | Normal | t397g |
| 1 | 1 | 01 | 139m49s | n2-1 | 2 | Normal | t397g |
| third block ||||||||
| 8 | 12 | 12 | 027m25s | n2-2 | 12+ | Full | t397a |
| 4 | 8 | 06 | 047m53s | n2-3 | 12+ | Normal | t397c |
| 2 | 4 | 06 | 049m11s | n2-3 | 12+ | Normal | t397d |

^ Light Weight Node 4 Gb Memory ^^^^^^^^
^ BSUB ^ NProcs ^ GbMem ^ RealTime ^ Node ^ Load ^ Status ^ Job ^
| 8 | 8 | 03 | 017m34s | c1-3 | 4+ | Full | t397b |
| 4 | 4 | 03 | 063m22s | c1-1 | 5+ | Normal | t397c |
| 2 | 2 | 03 | 075m02s | c1-10 | 3 | Normal | t397d |
| 1 | 1 | 03 | 135m37s | c1-11 | 2 | Normal | t397e |

First Observation: In the table above we try to keep BSUB == NProcs, that is, we ask the scheduler for a certain number of job slots (read: processor cores) and instruct Gaussian to spawn an equal number of threads. The first block of results for the heavy weight nodes reveals that we need not ask for huge amounts of memory to get the job done.
In fact, 3 Gb will do, which implies we could also run these jobs on the light weight nodes. In all of these scenarios, though, we ask for ''BSUB=8'', meaning we do need exclusive access to the compute host.

Second Observation: In the second block, we reduce our request for job slots (cores) and simultaneously our request for memory. The idea is that __if__ we only needed 4 out of 8 cores and 6 out of 12 Gb of memory, we could then run 2 jobs concurrently, and more as we decrease our requirements. The results show that in this case we could run with half the cores and half the memory in about twice the run time. That may not seem like an improvement, but we have removed the requirement of exclusive access to a host, meaning we could run our jobs on any heavy weight node with 4 or more cores available.

Third Observation: In the third block of the heavy weight nodes, we attempt to "overload" the processors. We ask for more threads than there are cores (by a factor of 1.5). Gaussian now logs warning messages and the host starts to swap lightly. In some cases this may yield an improvement, but not in ours.

Fourth Observation: Since we can run with 3 Gb of memory, our jobs do not have to go to the gaussian queue. In fact, any host would do, so let's submit to the idle queue (I hardcoded some host names for reference). But again, 8 cores/threads gives the best per-job performance, while reducing the core requirement does incur a performance hit.

==== Throughput ====

So now let's calculate job throughput. If I needed to run 8 jobs, then ...
^ BSUB ^ NProcs ^ GbMem ^ Bench ^ Total Time ^ Assume ^
| 8 | 8 | 03 | 018m01s/job | 2 hrs 24 mins | one host: 8 x (1 job/cycle) |
| 4 | 4 | 06 | 037m01s/job | 2 hrs 28 mins | one host: 4 x (2 jobs/cycle) |
| 4 | 4 | 03 | 063m22s/job | 4 hrs 15 mins | one host: 4 x (2 jobs/cycle) |
| 2 | 2 | 03 | 066m53s/job | 2 hrs 14 mins | one host: 2 x (4 jobs/cycle) |

So if my math is right, the low-requirement option of asking for only 2 cores, 2 threads and 3 Gb of memory allows me to push the same number of jobs through in about the same amount of time (the bottom option). But it adds the flexibility of running those requests on any node in the cluster where 2 cores and 3 Gb of memory are available. Of course, we could still run only one of these at a time on a light weight node because of the memory requirements.

==== IO Example ====

Here is another example I ran. This time it is an IO bound piece of code.

<code>
%Mem=xxGB
%NProc=x
#P CCSD(T)/cc-pV6Z tran=(semidirect,iabc)

C2

0 1
C 0. 0. 0.6236170105
C 0. 0. -0.6236170105
</code>

The matrix I developed running some combinations of thread and memory requirements:

^ Heavy Weight Node 16 Gb Memory ^^^^^^^^
^ BSUB ^ NProcs ^ GbMem ^ RealTime ^ Node ^ Load ^ Status ^ Job ^
| first block ||||||||
| 8 | 8 | 12 | 060m18s | n2-4 | 8+ | Full | test2 |
| 8 | 1 | 12 | 124m15s | n2-4 | 1 | Full | test2 |
| 4 | 4 | 06 | 078m24s | n2-4 | 4 | Normal | test2a |
| 2 | 2 | 03 | 106m01s | n2-1 | 2 | Normal | testb |

So now let's calculate job throughput. If I needed to run 8 jobs, then ...

^ BSUB ^ NProcs ^ GbMem ^ Bench ^ Total Time ^ Assume ^
| 8 | 8 | 12 | 060m18s/job | 8 hrs 02 mins | one host: 8 x (1 job/cycle) |
| 4 | 4 | 06 | 078m24s/job | 5 hrs 13 mins | one host: 4 x (2 jobs/cycle) |
| 2 | 2 | 03 | 106m01s/job | 3 hrs 32 mins | one host: 2 x (4 jobs/cycle) |

Now we observe an advantage in job throughput from reducing the core (thread) request and the memory requirement.
And we gain flexibility in the number of hosts that could potentially run our jobs. Of course, IO bound jobs may compete with each other for bandwidth to the disks. So let's run the examples above and test our assumptions.

On one idle host we submit 2 jobs, each requesting 4 cores (= threads) and 6 GB of memory. On another idle host we submit 4 jobs, each requesting 2 cores (= threads) and 3 GB of memory. Both hosts are thus handling total requests of 8 cores (= threads) and 12 GB of memory. We use the ''bsub -m //hostname//'' option to route our jobs to the target hosts. The proof is in the pudding.

<code>
JOBID   USER    STAT  QUEUE      FROM_HOST    EXEC_HOST   JOB_NAME   SUBMIT_TIME
14389   hmeij   RUN   gaussian   swallowtail  4*nfs-2-4   test2a     Sep  6 14:14
14390   hmeij   RUN   gaussian   swallowtail  4*nfs-2-4   test2b     Sep  6 14:14
14392   hmeij   RUN   gaussian   swallowtail  2*nfs-2-1   test6.1    Sep  6 14:17
14393   hmeij   RUN   gaussian   swallowtail  2*nfs-2-1   test6.2    Sep  6 14:17
14394   hmeij   RUN   gaussian   swallowtail  2*nfs-2-1   test6.3    Sep  6 14:17
14396   hmeij   RUN   gaussian   swallowtail  2*nfs-2-1   test6.4    Sep  6 14:17
</code>

^ Heavy Weight Node 16 Gb Memory ^^^^^^^^
^ BSUB ^ NProcs ^ GbMem ^ RealTime ^ Node ^ Load ^ Status ^ Job ^
| concurrent runs depicted above, roughly a 10% penalty ||||||||
| 4 | 4 | 06 | 089m05s | n2-4 | 8 | Full | test2a |
| 4 | 4 | 06 | 086m38s | n2-4 | 8 | Full | test2b |
| concurrent runs depicted above, also roughly a 10% penalty ||||||||
| 2 | 2 | 03 | 112m49s | n2-1 | 9 | Full | test6.1 |
| 2 | 2 | 03 | 116m57s | n2-1 | 9 | Full | test6.2 |
| 2 | 2 | 03 | 121m52s | n2-1 | 9 | Full | test6.3 |
| 2 | 2 | 03 | 120m24s | n2-1 | 9 | Full | test6.4 |

So our assumption (of running 2 or 4 jobs per cycle on the same host) incurs a penalty of slightly more than 10% when jobs share a host. Still, that means that if we pushed our 8 jobs through as 4 concurrent 2-core (= 2-thread), 3 GB jobs on the same host, they would finish in just under 4 hours.
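The totals in the throughput tables above are plain cycle arithmetic: jobs per cycle is set by cores per job, and cycles repeat until all jobs are done. Spelled out as a small helper of our own:

```shell
# Total wall time for N jobs run in cycles of <jobs_per_cycle>,
# given a per-job benchmark time of <min> minutes <sec> seconds.
# throughput is our own helper name, not a cluster command.
throughput() {   # usage: throughput <min> <sec> <jobs_per_cycle> <njobs>
    local min=$1 sec=$2 per=$3 njobs=$4
    local cycles=$(( (njobs + per - 1) / per ))   # round up
    local total=$(( cycles * (min * 60 + sec) ))
    printf '%d hrs %02d mins\n' $(( total / 3600 )) $(( total % 3600 / 60 ))
}

throughput 18 1 1 8     # 8 x 018m01s, one 8-core job per cycle -> 2 hrs 24 mins
throughput 106 1 4 8    # 2 cycles of four 2-core jobs (IO example) -> 3 hrs 32 mins
```

Adding the roughly 10% sharing penalty measured above to the 4-jobs-per-cycle figure lands just under 4 hours.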
==== Notes ====

  - The heavy weight nodes have fast disks (15K RPM, providing /localscratch), which may be a requirement for your jobs.
  - If you wish to run jobs requesting, for example, only 2 cores with 3 GB of memory, submit your jobs with the "resource requirement" string: ''bsub -n 2 -R "mem>3000" ...''; consult the manual page for ''bsub'' for more information.
  - Unfortunately, our scheduler does not support ''bsub -x ...'', meaning "submit my job for exclusive use of the host". However, you can obtain the same effect with the resource requirement string.

\\
**[[cluster:0|Back]]**