
Running Gaussian

This page explains how to run Gaussian jobs on the cluster.
It may help you identify errors you encounter getting your
jobs to run, and it may give you ideas for increasing your
overall job throughput.

Access

You must be a member of the group gaussian in order to execute the Gaussian binaries. Membership also implies that you have read the license, which is located at /share/apps/gaussian/License.pdf.

Requests for access should be emailed to hpcadmin@wesleyan.edu.

Env

To run your job, add the lines below to the script that submits your job. The scheduler copies your submission environment … so issue the newgrp gaussian command in your current shell before submitting. Putting that command in the script itself has no effect (newgrp invokes a subshell). Alternatively, you can add it to your ~/.bash_profile startup script.

  • for bash shell
        export g03root="/share/apps/gaussian/g03root"
        . $g03root/g03/bsd/g03.profile

        # set scratch dir inside your job script
        export GAUSS_SCRDIR="$MYLOCALSCRATCH"
  • for csh shell
        setenv g03root "/share/apps/gaussian/g03root"
        source $g03root/g03/bsd/g03.login

        # set scratch dir inside your job script
        setenv GAUSS_SCRDIR "$MYLOCALSCRATCH"
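Before submitting, you can confirm that the gaussian group is active in your current shell. A minimal sketch; the helper function and the hard-coded group list are illustrative only — in practice you would test the live output of id -Gn:

```shell
# check whether a group name appears in a space-separated group list,
# as returned by "id -Gn"; run "newgrp gaussian" first if it is missing
has_group() {
    echo "$1" | tr ' ' '\n' | grep -qx "$2"
}

# in practice:  has_group "$(id -Gn)" gaussian
if has_group "users wheel gaussian" gaussian; then
    echo "gaussian group active"
fi
```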

Submit

Submission of Gaussian jobs follows the normal procedure. Below is an example, which we'll also use for performance testing later on. This test is a compute-bound problem, so your mileage may vary, especially if your job is IO-bound. Our example is “test397”, found in /share/apps/gaussian/g03root/g03/tests/com/.

  • input file: t397a.com
% Mem=12GB
% NProc=8
#p rb3lyp/3-21g force test scf=novaracc

Gaussian Test Job 397:
Valinomycin force

0,1
O,-1.3754834437,-2.5956821046,3.7664927822
O,-0.3728418073,-0.530460483,3.8840401686
O,2.3301890394,0.5231526187,1.7996834334
O,0.2842272248,2.5136416005,-0.2483875054
O,2.3870396194,3.3004808604,0.2860546915
O,3.927241841,1.9677029583,-2.7261655162
O,2.2191878407,-1.0673859692,-2.0338343532
etc,etc,etc
  • job file: t397a.run
#!/bin/bash

#BSUB -q gaussian
#BSUB -m nfs-2-4
#BSUB -n 8

#BSUB -o t397a.out
#BSUB -e t397a.err
#BSUB -J t397a
input=t397a

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# cd to remote working dir
cd $MYLOCALSCRATCH

# environment
export GAUSS_SCRDIR="$MYLOCALSCRATCH"
export g03root="/share/apps/gaussian/g03root"
. $g03root/g03/bsd/g03.profile

cp ~/gaussian/test397/$input.com .
time g03 < $input.com > output
cp ./output ~/gaussian/test397/$input.$LSB_JOBID.out

You would submit this in the typical way: bsub < t397a.run

You may have noticed that we force the job onto one specific host in the gaussian queue with the #BSUB -m option; more about this later. You can find idle hosts either at http://swallowtail.wesleyan.edu/clumon or via the command

bjobs -m host_name -u all

Threads

Gaussian is a threaded application, so our submission above results in the typical bjobs output showing that we occupy 8 job slots.

[hmeij@swallowtail test397]$ bsub <t397a.run
Job <14131> is submitted to queue <04-hwnodes>.

[hmeij@swallowtail test397]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
14131   hmeij   RUN   gaussian swallowtail   8*nfs-2-4   t397a      Aug 31 14:33

However, when we log into the compute node (as root) and run top, only one process reveals itself, and the load hovers around 9. In other parallel applications, such as Amber, multiple copies of the same binaries run, tied together with MPI (Message Passing Interface). In this case, only one process is running.

top - 14:44:12 up 5 days,  5:45,  1 user,  load average: 8.82, 7.37, 4.90
Tasks: 112 total,   4 running, 108 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.8% us,  0.2% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  16415160k total, 13783092k used,  2632068k free,    53088k buffers
Swap:  4096564k total,     1500k used,  4095064k free,   955908k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29725 hmeij     24   0 12.1g  12g 3140 R 99.9 76.7  37:05.97 l502.exe  
...

To reveal the threads used by Gaussian, we can run the following command:

[root@nfs-2-4 ~]# ps -Hm 29725 

  PID TTY      MAJFLT MINFLT   TRS   DRS  SIZE  SWAP  RSS  SHRD   LIB   DT COMMAND

29725 ?           129 25630522 5371 12687868 3173310 - 12597124 -   -    - 
/share/apps/gaussian/g03root/g03/l502.exe 1610612736 
/localscratch/14131/Gau-29725.chk 0 /localscratch/14131/Gau-29725.int 0
/localscratch/14131/Gau-29725.rwf 0 /localscratch/14131/Gau-29725.d2e 0
/localscratch/14131/Gau-29725.scr 0 /localscratch/14131/Gau-29724.inp 0 junk.out 0

29725 ?             1 393745  5371 12687868 3173310 - 12597124 -    -    -
/share/apps/gaussian/g03root/g03/l502.exe 1610612736 
/localscratch/14131/Gau-29725.chk 0 /localscratch/14131/Gau-29725.int 0
/localscratch/14131/Gau-29725.rwf 0 /localscratch/14131/Gau-29725.d2e 0
/localscratch/14131/Gau-29725.scr 0 /localscratch/14131/Gau-29724.inp 0 junk.out 0

... 6 more threads will be listed ...
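An alternative to parsing the ps -Hm output is to ask for the thread count directly via the nlwp field (number of light-weight processes). The PID below is the shell's own, purely for illustration; in practice you would substitute the Gaussian PID:

```shell
# report how many threads a process is running; substitute the Gaussian
# PID (e.g. the l502.exe process seen in top) for $$ in practice
pid=$$
nthreads=$(ps -o nlwp= -p "$pid" | tr -d ' ')
echo "process $pid is running $nthreads thread(s)"
```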

Nprocs vs Mem

In the submission above we requested the scheduler to reserve 8 cores on a single host:

#BSUB -m nfs-2-4
#BSUB -n 8

and we instructed Gaussian to launch 8 threads and allocate 12 GB of memory:

% Mem=12GB
% NProc=8

<hi #ffc0cb>But what is our best option regarding job throughput? Good question.</hi>

Matrix

The table below is probably a bit excessive, but I strongly urge you to build a similar one. Try several combinations of threads and memory values and observe which suits you best.

In the table below, we start by requesting #BSUB -n 8 and play with the memory footprint. Remember that when you request 8 cores via the scheduler, your job can only be dispatched when a machine is idle. However, the option may exist to run with fewer cores (threads) and possibly a smaller memory footprint. Your job may take longer to run, but the possibility increases of your jobs running on hosts, or of multiple jobs running on a single host, that meet your requirements.
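One way to build such a matrix is to generate the input headers programmatically, one file per combination. A sketch only: the file names and value sets below are made up for illustration, and the title, charge/multiplicity and geometry sections would still need to be appended to each file:

```shell
# generate a Gaussian input header per (threads, memory) combination
route='#p rb3lyp/3-21g force test scf=novaracc'
for nproc in 8 4 2; do
    for mem in 12 6 3; do
        inp="t397_n${nproc}_m${mem}gb.com"
        printf '%%Mem=%sGB\n%%NProc=%s\n%s\n' "$mem" "$nproc" "$route" > "$inp"
        # append the title, charge/multiplicity and geometry here, then
        # submit each input with a matching "#BSUB -n $nproc" job file
    done
done
echo "generated $(ls t397_n*_m*gb.com | wc -l) input files"
```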

So here is what I did:

Heavy Weight Node 16 Gb Memory
BSUB NProcs GbMem RealTime Node Load Status Job
first block
8 8 12 020m12s n2-4 8+ Full t397a
8 8 06 018m38s n2-4 8+ Full t397aa
8 8 03 018m01s n2-4 4+ Full t397aa
second block
4 4 06 037m01s n2-4 5+ Normal t397f
2 2 03 066m53s n2-4 3 Normal t397g
1 1 01 139m49s n2-1 2 Normal t397g
third block
8 12 12 027m25s n2-2 12+ Full t397a
4 8 06 047m53s n2-3 12+ Normal t397c
2 4 06 049m11s n2-3 12+ Normal t397d
Light Weight Node 4 Gb Memory
BSUB NProcs GbMem RealTime Node Load Status Job
8 8 03 017m34s c1-3 4+ Full t397b
4 4 03 063m22s c1-1 5+ Normal t397c
2 2 03 075m02s c1-10 3 Normal t397d
1 1 03 135m37s c1-11 2 Normal t397e

First Observation: In the table above we try to keep BSUB == NProcs … that is, we ask the scheduler for a certain number of job slots (read: processor cores) and instruct Gaussian to spawn an equal number of threads. The first block of results for the heavy weight nodes reveals that we need not ask for huge amounts of memory to get the job done. In fact, 3 Gb will do, which implies we could also run these jobs on light weight nodes. In all of these scenarios, though, we ask for BSUB=8, meaning we do need exclusive access to the compute host.

Second Observation: In the next block, we reduce our request for job slots (cores) and simultaneously our request for memory. The idea is that if we needed only 4 of the 8 cores and 6 of the 12 Gb of memory … we could then run 2 jobs concurrently, and more as we decrease our requirements. The results show that in this case we could run with half the cores and half the memory in about twice the run time. That may not seem like an improvement, but we have removed the requirement for exclusive access to a host, meaning we could run our jobs on any heavy weight node with 4 or more cores available.

Third Observation: In the third block of the heavy weight nodes, we attempt to “overload” the processors by asking for more threads than there are cores (by a factor of 1.5). Gaussian now logs warning messages and the host starts to lightly swap. In some cases this may show an improvement, but not in ours.

Fourth Observation: Since we can run with 3 Gb of memory, our jobs do not have to go to the gaussian queue. In fact any host would do, so let's submit to the idle queue (I hardcoded some host names for reference). But again, 8 cores/threads gives the best performance, while reducing the core requirement does incur a performance hit.

Throughput

So now let's calculate job throughput. If I needed to run 8 jobs, then …

BSUB NProcs GbMem Bench Total Time Assume
8 8 03 018m01s/job 2 hrs 24 mins one host: 8 x (1 job/cycle)
4 4 06 037m01s/job 2 hrs 28 mins one host: 4 x (2 jobs/cycle)
4 4 03 063m22s/job 4 hrs 15 mins one host: 4 x (2 jobs/cycle)
2 2 03 066m53s/job 2 hrs 14 mins one host: 2 x (4 jobs/cycle)

So if my math is right, going with the low-requirement option of asking for only 2 cores, 2 threads and 3 Gb of memory allows me to push an equal number of jobs through in about the same amount of time (the bottom option).
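The arithmetic behind these totals can be sketched as a small helper: the number of cycles is the job count divided by the jobs that fit per cycle (rounded up), and total time is per-job minutes times cycles. Values are rounded to whole minutes, so 67 approximates the 066m53s benchmark:

```shell
# total wall time for a batch of jobs on one host:
# cycles = ceil(total_jobs / jobs_per_cycle); total = minutes_per_job * cycles
throughput() {
    local per=$1 conc=$2 jobs=$3
    local cycles=$(( (jobs + conc - 1) / conc ))
    echo $(( per * cycles ))
}

echo "$(throughput 67 4 8) minutes"   # 2-core jobs: 4 per cycle, 2 cycles
```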

But now add the flexibility of running those requests on any node in the cluster where 2 cores and 3 Gb of memory are available. Of course, we could still run only one of these on a light weight node because of the memory requirement.

IO Example

Here is another example I ran. This time it is an IO-bound piece of code.

% Mem=xxGB
% NProc=x
#P CCSD(T)/cc-pV6Z tran=(semidirect,iabc)

C2

0 1
C 0. 0.  0.6236170105
C 0. 0. -0.6236170105 

The matrix I developed by running some combinations of thread and memory requirements:

Heavy Weight Node 16 Gb Memory
BSUB NProcs GbMem RealTime Node Load Status Job
first block
8 8 12 060m18s n2-4 8+ Full test2
8 1 12 124m15s n2-4 1 Full test2
4 4 06 078m24s n2-4 4 Normal test2a
2 2 03 106m01s n2-1 2 Normal testb

So now let's calculate job throughput. If I needed to run 8 jobs, then …

BSUB NProcs GbMem Bench Total Time Assume
8 8 12 060m18s/job 8 hrs 02 mins one host: 8 x (1 job/cycle)
4 4 06 078m24s/job 5 hrs 13 mins one host: 4 x (2 jobs/cycle)
2 2 03 106m01s/job 3 hrs 32 mins one host: 2 x (4 jobs/cycle)

Now we observe an advantage in job throughput from reducing the core (thread) and memory requirements, and we gain flexibility in the number of hosts that could potentially run our jobs.

Of course, IO-bound jobs may compete with each other for bandwidth to the disks, so let's run the examples above and test our assumptions. On one idle host we submit 2 jobs, each requesting 4 cores (=threads) and 6 GB memory. On another idle host we submit 4 jobs, each requesting 2 cores (=threads) and 3 GB memory. Each host is thus handling a total request of 8 cores (=threads) and 12 GB memory. We use the bsub -m hostname option to route our jobs to the target hosts.
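Those submissions can be scripted. This is a dry-run sketch that echoes the bsub commands instead of executing them (drop the echo wrapper to actually submit); the job file names mirror the test2 and test6 jobs used here:

```shell
# dry run: print the bsub commands routing 2 four-core jobs to one host
# and 4 two-core jobs to another, 8 cores / 12 GB of requests per host
submit() { echo bsub "$@"; }   # replace "echo bsub" with "bsub" to submit

for j in a b; do
    submit -m nfs-2-4 -n 4 "< test2${j}.run"
done
for i in 1 2 3 4; do
    submit -m nfs-2-1 -n 2 "< test6.${i}.run"
done
```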

The proof is in the pudding.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
14389   hmeij   RUN   gaussian   swallowtail 4*nfs-2-4   test2a     Sep  6 14:14
14390   hmeij   RUN   gaussian   swallowtail 4*nfs-2-4   test2b     Sep  6 14:14
14392   hmeij   RUN   gaussian   swallowtail 2*nfs-2-1   test6.1    Sep  6 14:17
14393   hmeij   RUN   gaussian   swallowtail 2*nfs-2-1   test6.2    Sep  6 14:17
14394   hmeij   RUN   gaussian   swallowtail 2*nfs-2-1   test6.3    Sep  6 14:17
14396   hmeij   RUN   gaussian   swallowtail 2*nfs-2-1   test6.4    Sep  6 14:17
Heavy Weight Node 16 Gb Memory
BSUB NProcs GbMem RealTime Node Load Status Job
concurrent runs depicted above, <hi #ffa500>10% penalty</hi>
4 4 06 089m05s n2-4 8 Full test2a
4 4 06 086m38s n2-4 8 Full test2b
concurrent runs depicted above, also a <hi #ffa500>10% penalty</hi>
2 2 03 112m49s n2-1 9 Full test6.1
2 2 03 116m57s n2-1 9 Full test6.2
2 2 03 121m52s n2-1 9 Full test6.3
2 2 03 120m24s n2-1 9 Full test6.4

So our assumption (of running 2 or 4 jobs/cycle on the same host) incurs a penalty of slightly more than 10%. Still, that means that if we submitted 4 jobs on the same host, each requesting 2 cores (=threads) and 3 GB memory, they would finish in just under 4 hours.

Notes

  1. The heavy weight nodes have fast (15K RPM) disks providing /localscratch, which may be a requirement for your jobs.
  2. If you wish to run jobs requesting, for example, only 2 cores with 3 Gb of memory, submit your jobs with the “resource requirement” string … bsub -n 2 -R “mem>3000” …; consult the manual page for bsub for more information.
  3. Unfortunately, our scheduler does not support bsub -X … (submit my job for exclusive use of the host). However, you can obtain the same effect with the resource requirement string.



cluster/47.txt · Last modified: 2007/09/06 16:43 (external edit)