\\
**[[cluster:

===== Running Gaussian =====

To run Gaussian jobs on the cluster, read this page.\\
It may help you identify some of the errors you may encounter\\
while getting your jobs to run. It may also give you ideas for\\
increasing your overall job throughput rate.
==== Access ====

You must be a member of the group ''

Requests for access should be emailed to ''
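A quick way to check this before submitting is to look at your group list. The helper below is a minimal sketch; ''gaussian'' is a hypothetical group name used for illustration, so substitute whatever group name the admins give you.

```shell
# in_group NAME LIST: succeed when NAME appears in a space-separated
# group LIST, e.g. the output of `id -Gn`.
# "gaussian" below is a hypothetical group name for illustration only.
in_group() {
    echo "$2" | grep -qw "$1"
}

# in practice you would test your own groups:
#   in_group gaussian "$(id -Gn)" && echo "access ok"
in_group gaussian "users gaussian wheel" && echo "member"      # prints "member"
in_group gaussian "users wheel"          || echo "not member"  # prints "not member"
```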
==== Env ====

In order to run your job, add these lines to the script that submits your job. The scheduler copies your submission environment ... so issue the ''

  * for bash shell

<code>
export g03root="/
. $g03root/

# set scratch dir inside your job script
export GAUSS_SCRDIR="
</code>

  * for csh shell

<code>
setenv g03root "/
source $g03root/

# set scratch dir inside your job script
setenv GAUSS_SCRDIR "
</code>
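Whichever shell you use, a quick guard on the scratch directory can save a failed run. A minimal bash sketch (the directory is whatever you set ''GAUSS_SCRDIR'' to above):

```shell
# Guard for the Gaussian scratch directory: verify it exists and is
# writable before the job starts, instead of failing mid-run.
check_scrdir() {
    [ -d "$1" ] && [ -w "$1" ]
}

# inside the job script, after GAUSS_SCRDIR is exported:
#   check_scrdir "$GAUSS_SCRDIR" || { echo "bad scratch dir" >&2; exit 1; }
check_scrdir /tmp && echo "scratch ok"
```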
==== Submit ====

Gaussian job submissions follow the normal procedure.

  * input file: ''

<code>
% Mem=12GB
% NProc=8
#p rb3lyp/

Gaussian Test Job 397:
Valinomycin force

0,1
O,
O,
O,
O,
O,
O,
O,
etc,etc,etc
</code>
  * job file: ''

<code>
#!/bin/bash

#BSUB -q gaussian
#BSUB -m nfs-2-4
#BSUB -n 8

#BSUB -o t397a.out
#BSUB -e t397a.err
#BSUB -J t397a
input=t397a

# unique job scratch dirs
MYSANSCRATCH=/
MYLOCALSCRATCH=/
export MYSANSCRATCH MYLOCALSCRATCH

# cd to remote working dir
cd $MYLOCALSCRATCH

# environment
export GAUSS_SCRDIR="
export g03root="/
. $g03root/

cp ~/
time g03 < $input.com > output
cp ./output ~/
</code>
You would submit this in the typical way: ''

You may have noticed that we force the job to run on one of the specific hosts in the gaussian queue with the ''#

''
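Because the output, error, and job-name directives all repeat the input basename, it can be convenient to generate the job file from a template. A sketch (''t397a'' and the ''gaussian'' queue are just the example values from above):

```shell
# Generate a job file whose #BSUB names all follow the input basename,
# so one template serves every input file. "t397a" and the "gaussian"
# queue are the example values used above.
input=t397a
cat > "$input.run" <<EOF
#!/bin/bash
#BSUB -q gaussian
#BSUB -n 8
#BSUB -o $input.out
#BSUB -e $input.err
#BSUB -J $input
EOF
grep -c '^#BSUB' "$input.run"   # prints 5
```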
==== Threads ====

Gaussian is a threaded application.

<code>
[hmeij@swallowtail test397]$ bsub <
Job <

[hmeij@swallowtail test397]$ bjobs
JOBID
14131
</code>
However, when we log into the compute node (as root) and run ''

<code>
top - 14:44:12 up 5 days, 5:45, 1 user, load average: 8.82, 7.37, 4.90
Tasks: 112 total,
Cpu(s): 99.8% us, 0.2% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 16415160k total, 13783092k used, 2632068k free, 53088k buffers
Swap: 4096564k total,

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
29725 hmeij
...
</code>
To reveal the threads used by Gaussian, we can run the following command:

<code>
[root@nfs-2-4 ~]# ps -Hm 29725

  PID TTY  MAJFLT  MINFLT

29725 ?       129 25630522 5371 12687868 3173310 - 12597124 -
/
/
/
/

29725 ?         1  393745
/
/
/
/

... 6 more threads will be listed ...
</code>
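An alternative to eyeballing the ''ps -Hm'' listing is to ask ps for the thread count directly. A sketch, assuming Linux procps (the PID is the example from above):

```shell
# Report how many threads a process is running. nlwp ("number of
# light-weight processes") is the per-process thread count on Linux.
count_threads() {
    ps -o nlwp= -p "$1" | tr -d ' '
}

# for the Gaussian run above you would use its PID:
#   count_threads 29725        # expect 8 for an 8-thread run
count_threads $$               # the current shell itself
```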
==== Nprocs vs Mem ====

In the submission above we requested the scheduler to reserve 8 cores on a single host:

<code>
#BSUB -m nfs-2-4
#BSUB -n 8
</code>

and we instructed Gaussian to launch 8 threads and allocate 12 GB of memory:

<code>
% Mem=12GB
% NProc=8
</code>

<hi #
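One way to keep the ''#BSUB -n'' request and the Gaussian ''%NProc'' directive from drifting apart is to derive the latter from the slot count LSF grants the job. ''LSB_DJOB_NUMPROC'' is set by LSF inside a running job; the fallback to 1 below is only for testing outside the scheduler.

```shell
# Emit a %NProc line that matches the "#BSUB -n" slot count.
# LSB_DJOB_NUMPROC is set by LSF inside a running job; outside LSF
# we fall back to a single thread.
nproc_line() {
    printf '%%NProc=%s\n' "${LSB_DJOB_NUMPROC:-1}"
}

# prepend it to the input deck inside the job script, e.g.:
#   { nproc_line; cat t397a.com; } > input.com
nproc_line
```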
==== Matrix ====

The table below is probably a bit excessive, but I strongly urge you to build a similar table.

In the table below, we start with requesting ''#

So here is what I did:

| |||||||||
^ Heavy Weight Node 16 Gb Memory ^^^^^^^^
^BSUB^NProcs^GbMem^RealTime^Node^Load^Status^Job^
| first block ||||||||
| 8 | 8 | 12 | 020m12s |
| 8 | 8 | 06 | 018m38s |
| 8 | 8 | 03 | 018m01s |
| second block ||||||||
| 4 | 4 | 06 | 037m01s |
| 2 | 2 | 03 | 066m53s |
| 1 | 1 | 01 | 139m49s |
| third block ||||||||
| 8 | 12 | 12 | 027m25s |
| 4 | 8 | 06 | 047m53s |
| 2 | 4 | 06 | 049m11s |
| |||||||||
^ Light Weight Node 4 Gb Memory ^^^^^^^^
^BSUB^NProcs^GbMem^RealTime^Node^Load^Status^Job^
| 8 | 8 | 03 | 017m34s |
| 4 | 4 | 03 | 063m22s |
| 2 | 2 | 03 | 075m02s |
| 1 | 1 | 03 | 135m37s |
| |||||||||

First Observation:

Second Observation:

Third Observation:

Fourth Observation:
==== Throughput ====

So now let's calculate job throughput. If I needed to run 8 jobs, then ...

^ BSUB ^ NProcs ^ GbMem ^ RealTime ^
| 8 | 8 | 03 | 018m01s/ |
| 4 | 4 | 06 | 037m01s/ |
| 4 | 4 | 03 | 063m22s/ |
| 2 | 2 | 03 | 066m53s/ |

So if my math is right, going with a low-requirement option of asking for only 2 cores, 2 threads, and 3 Gb of memory allows me to push an equal number of jobs through in about the same amount of time (the bottom option).

But now add the flexibility of running those requests on any node in the cluster where 2 cores and 3 Gb of memory are available.
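The arithmetic behind that comparison can be sketched as ceil(jobs / jobs-per-cycle) × minutes-per-job, rounding the single-job times from the table above:

```shell
# Total wall time to push N jobs through one 8-core node, when "per"
# jobs fit on the node per cycle and each job takes "mins" minutes.
# (Times rounded from the table above; node contention is ignored.)
total_minutes() {
    # ceil(N / per) cycles, "mins" minutes each
    echo $(( ($1 + $2 - 1) / $2 * $3 ))
}

total_minutes 8 1 18    # eight 8-core jobs, one per cycle  -> 144
total_minutes 8 2 37    # eight 4-core jobs, two per cycle  -> 148
total_minutes 8 4 67    # eight 2-core jobs, four per cycle -> 134
```

All three options land within roughly a quarter-hour of each other, which is the point of the section.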
==== IO Example ====

Here is another example I ran. This time it is an IO-bound piece of code.

<code>
% Mem=xxGB
% NProc=x
#P CCSD(T)/

C2

0 1
C 0. 0. 0.6236170105
C 0. 0. -0.6236170105
</code>
The matrix I developed running some combinations of thread and memory requirements:

| |||||||||
^ Heavy Weight Node 16 Gb Memory ^^^^^^^^
^BSUB^NProcs^GbMem^RealTime^Node^Load^Status^Job^
| first block ||||||||
| 8 | 8 | 12 | 060m18s |
| 8 | 1 | 12 | 124m15s |
| 4 | 4 | 06 | 078m24s |
| 2 | 2 | 03 | 106m01s |
| |||||||||

So now let's calculate job throughput. If I needed to run 8 jobs, then ...

^ BSUB ^ NProcs ^ GbMem ^ RealTime ^
| 8 | 8 | 12 | 060m18s/ |
| 4 | 4 | 06 | 078m24s/ |
| 2 | 2 | 03 | 106m01s/ |

Now we observe an advantage in our job throughput by reducing the core (thread) request and the memory requirement.
Of course, IO-bound jobs may compete with each other for bandwidth to the disks.

The proof is in the pudding.
<code>
JOBID
14389
14390
14392
14393
14394
14396
</code>

| |||||||||
^ Heavy Weight Node 16 Gb Memory ^^^^^^^^
^BSUB^NProcs^GbMem^RealTime^Node^Load^Status^Job^
| concurrent runs depicted above <hi # ||||||||
| 4 | 4 | 06 | 089m05s |
| 4 | 4 | 06 | 086m38s |
| concurrent runs depicted above, also a <hi # ||||||||
| 2 | 2 | 03 | 112m49s |
| 2 | 2 | 03 | 116m57s |
| 2 | 2 | 03 | 121m52s |
| 2 | 2 | 03 | 120m24s |
| |||||||||

So our assumption (of running 2 or 4 jobs per cycle on the same host) incurs a penalty of slightly more than 10% when running on the same host. Still, that means that if we submitted 4 jobs, each requesting 2 cores (=threads) and 3 GB of memory, on the same host, they would finish in just under 4 hours.
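That penalty estimate comes from comparing the lone 2-core run (106m01s) with the average of the four concurrent 2-core runs in the table; a quick check of the arithmetic:

```shell
# Average slowdown of the four concurrent 2-core runs relative to the
# lone 2-core run (106m01s), using the times from the table above.
awk 'BEGIN {
    lone = 106 + 1/60                        # 106m01s
    avg  = ((112 + 49/60) + (116 + 57/60) \
          + (121 + 52/60) + (120 + 24/60)) / 4
    printf "%.1f%%\n", (avg/lone - 1) * 100  # prints 11.3%
}'
```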
==== Notes ====

  - The heavy weight nodes have fast (15K RPM) disks providing /
  - If you wish to run jobs requesting, for example, only 2 cores with 3 GB of memory, submit your jobs with the "
  - Unfortunately,

\\
**[[cluster: