Gaussian Checkpointing

When you have one or more jobs running that rely on Gaussian's internal checkpoint mechanism, heavy read/write operations may result. That traffic should not hit the /home file system but the /sanscratch file system instead. That scratch space is also NFS mounted over the InfiniBand interconnects (via IPoIB). As a result, this file system's IO operations will also slow our file server down tremendously (even though /sanscratch is a 5 disk RAID 0 setup).
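A job script could check up front that Gaussian's scratch directory is not under /home. The following is a minimal sketch: scratch_outside_home is a hypothetical helper, not part of Gaussian or the scheduler, and it only compares the path string (it does not resolve symlinks).

```shell
#!/bin/bash
# Hypothetical helper: succeed only when the given directory path
# lies outside the /home tree (plain string comparison).
scratch_outside_home() {
    local dir="$1"
    case "$dir" in
        ""|/home|/home/*) return 1 ;;   # empty, or somewhere under /home
        *) return 0 ;;
    esac
}

# Assumed usage inside a job script, before Gaussian starts
if ! scratch_outside_home "${GAUSS_SCRDIR:-}"; then
    echo "warning: GAUSS_SCRDIR='${GAUSS_SCRDIR:-}' is unset or under /home" >&2
fi
```

The check only warns; whether a job should abort instead is a site policy decision.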

So we've been trying to figure out how to control, or throttle, the IO traffic. Usually the application itself provides an option for this, like rsync's --bwlimit option, but we've not found anything comparable for Gaussian so far. However, from the operating system's point of view we do have a tool available: ionice, which gets or sets a program's IO scheduling class and priority.

So for those who rely on the generation of large Gaussian checkpoint files, please add the following lines to the very top of your submission script:

ionice -c 2 -n 7 -p $$
ionice -p $$

This instructs the compute node to schedule the IO traffic with the “best effort” scheduling class (-c 2) and the lowest priority within that class (-n 7) for the script's process ID and any processes launched from the script. The second line prints the resulting class and priority so they show up in the job output. This seems (surprisingly, perhaps because of IPoIB?) to have a positive effect on the client's IO traffic that hits the NFS mounted file system. It has a tremendous positive impact when issued on the file server itself, so perhaps a monitor script is needed in the future.
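To see that processes launched from the script pick up the setting, a child shell can report its own class and priority. This is a minimal sketch; set_low_io_priority is a hypothetical wrapper name, and the choice to warn and continue when ionice is missing or fails is an assumption, not site policy.

```shell
#!/bin/bash
# Hypothetical wrapper: lower this shell's IO priority, but never abort the job.
set_low_io_priority() {
    if ! command -v ionice >/dev/null 2>&1; then
        echo "ionice not found; IO priority unchanged" >&2
        return 0
    fi
    # best-effort class (2), lowest priority in that class (7), applied to this shell
    ionice -c 2 -n 7 -p "$$" || echo "ionice failed; IO priority unchanged" >&2
    return 0
}

set_low_io_priority

# The single quotes make the child expand $$, so it reports its own
# class and priority rather than the parent's.
if command -v ionice >/dev/null 2>&1; then
    bash -c 'ionice -p $$' || true
fi
```

If the child reports the same class and priority as the parent, the setting was inherited.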

The example below uses the /sanscratch area. You must stage your data in the temporary work directory and save results back to your home directory before the job finishes, because the scheduler automatically deletes that work directory.

#!/bin/bash
# submit like so: bsub < run.forked

# if writing large checkpoint files, uncomment the next two lines
#ionice -c 2 -n 7 -p $$
#ionice -p $$

rm -rf err* out* output.*

#BSUB -q mw256fd
#BSUB -o out
#BSUB -e err
#BSUB -J test

# job slots: match inside gaussian.com
#BSUB -n 4
# force all onto one host (shared code and data stack)
#BSUB -R "span[hosts=1]"

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# cd to remote working dir
cd "$MYSANSCRATCH" || exit 1
pwd

# environment
export GAUSS_SCRDIR="$MYSANSCRATCH"

export g09root="/share/apps/gaussian/g09root"
. $g09root/g09/bsd/g09.profile

#export gdvroot="/share/apps/gaussian/gdvh11"
#. $gdvroot/gdv/bsd/gdv.profile

# copy input data to fast disk
cp ~/jobs/forked/gaussian.com .
touch gaussian.log

# run plain vanilla
g09 < gaussian.com > gaussian.log

# run dev
#gdv < gaussian.com > gaussian.log

# save results back to homedir
cp gaussian.log ~/jobs/forked/output.$LSB_JOBID
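Because the scheduler deletes the work directory when the job ends, it may help to register the copy-back step as an exit handler so it also runs when an earlier command fails. The following is a sketch under stated assumptions: SCRATCH_DIR and RESULT_DIR are hypothetical stand-ins for $MYSANSCRATCH and ~/jobs/forked (temporary directories here so the sketch runs anywhere), and the echo stands in for the Gaussian run.

```shell
#!/bin/bash
# Hypothetical stand-ins for $MYSANSCRATCH and ~/jobs/forked
SCRATCH_DIR=$(mktemp -d)
RESULT_DIR=$(mktemp -d)
JOB_ID=${LSB_JOBID:-demo}   # falls back to "demo" outside the scheduler

save_results() {
    # copy the log back only if the run produced one
    if [ -f "$SCRATCH_DIR/gaussian.log" ]; then
        cp "$SCRATCH_DIR/gaussian.log" "$RESULT_DIR/output.$JOB_ID"
    fi
}

# run save_results whenever the script exits
trap save_results EXIT

cd "$SCRATCH_DIR" || exit 1

# stand-in for: g09 < gaussian.com > gaussian.log
echo "stand-in for Gaussian output" > gaussian.log
```

An exit handler does not help if the job is killed outright, so long runs may still want to copy intermediate results back periodically.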



cluster/129.txt · Last modified: 2014/06/18 14:04 by hmeij