==== BLCR Checkpoint in OL3 ====

**Deprecated since we did the [[cluster:185|OS Update]]. \\
We will replace it with [[cluster:190|DMTCP]].** \\
 --- //[[hmeij@wesleyan.edu|Henk]] 2020/01/14 14:31//
  
  * This page concerns PARALLEL mpirun jobs only; there are some restrictions
  * Users Guide [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html]]
  
Checkpointing parallel jobs is a bit more complex than checkpointing a serial job. The MPI workers (the ''-n'' count) are fired off by worker 0 of ''mpirun'', and all workers may open files and perform socket-to-socket communications. A restart therefore needs to restore all file IDs, process IDs, etc., and a job may fail if a certain process ID is already in use. Restarted jobs also behave as if the old JOBPID were still running: they write results to the old STDERR and STDOUT files.

The ''blcr_wrapper_parallel'' below will manage all of this for you. As with the serial wrapper, only edit the top of the file and provide the information needed. But first, your software needs to be compiled with a special "older" version of OpenMPI; MPI checkpoint support has been removed in later versions of OpenMPI.

Here is the admin stuff.
  
<code>

# from eric at lbl, configure openmpi, I chose 1.6.5 (version needs to be < 1.7)
./configure \
            --enable-ft-thread \
            ... \
            --without-tm \
            --prefix=/share/apps/CENTOS6/openmpi/1.6.5.cr
make
make install

# next download cr_mpirun from LBL
https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-295.tar.gz

# configure and test cr_mpirun
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

./configure --with-blcr=/share/apps/blcr/0.8.5/test
make
make check

============================================================================
make[1]: Leaving directory `/home/apps/src/petaltail6/cr_mpirun-295'

# I copied cr_mpirun into /share/apps/CENTOS6/openmpi/1.6.5.cr/bin/
# cr_mpirun needs access to all these in $PATH
# mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# next compile your parallel software using mpicc/mpicxx from the 1.6.5 distro

</code>
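
It may be worth confirming that the cr-enabled build and the BLCR kernel modules are actually in place on the node before relying on the wrapper. The following quick checks are only a suggestion (paths as configured above); ''ompi_info'' should report a blcr checkpoint/restart component if the build picked up BLCR support.

<code>

# check the BLCR kernel modules are loaded on the node (the wrappers below test for this too)
/sbin/lsmod | grep ^blcr

# check the cr-enabled OpenMPI build advertises the blcr checkpoint/restart component
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
ompi_info | grep -i crs    # should mention the blcr component

# all of these must resolve in $PATH for the wrapper to work
which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

</code>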
  
Here is what a sample run using the OpenLava scheduler looks like.
  
<code>

# submit as usual after editing the top of the file, see comments in that wrapper file
[hmeij@cottontail lammps]$ bsub < blcr_wrapper_parallel
Job <681> is submitted to queue <test>.

# cr_mpirun job
[hmeij@cottontail lammps]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
                                             petaltail

# wrapper stores BLCR checkpoint file (chk.PID) in this location
# and it calls the openmpi snapshot tools and stores that in
# ompi_global_snapshot_SOME-PID.ckpt, also in same location
[hmeij@cottontail lammps]$ ll /sanscratch/checkpoints/681
total 30572
drwx------ 3 hmeij its       46 Mar 29 13:28 ompi_global_snapshot_9134.ckpt
-rw-r--r-- 1 hmeij its       16 Mar 29 13:23 pwd.9127

# the processes running
[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD
 9113 ?        00:00:00 1459272204.681.
 9127 ?        00:00:00 cr_mpirun
 9128 ?        00:00:00 blcr_watcher <--- the watcher, will terminate 9113 if 9127 is gone (ie done or crashed)
 9133 ?        00:00:00 cr_mpirun
 9134 ?        00:00:00 mpirun
 9370 ?        00:00:00 ps
18559 pts/2    00:00:00 bash

# how far did the job progress?
[hmeij@cottontail lammps]$ tail ~/.lsbatch/1459272204.681.out
     190    5007151.9   -4440573.2 1.1211925e+08 2.8818913e+08    22081.756
     270    6425840.1   -4436426.7     60970646 2.8818913e+08    28840.129
     280    6479251.9   -4436021.1     59044759 2.8818917e+08    29086.075

# simulate crash
[hmeij@cottontail lammps]$ ssh petaltail kill 9133

# edit the file and prep for a restart, submit again
[hmeij@cottontail lammps]$ bsub < blcr_wrapper_parallel
Job <684> is submitted to queue <test>

# so job 684 is restarting job 681, wrapper preps files
[hmeij@cottontail lammps]$ ll ../.lsbatch/
total 172
-rw------- 1 hmeij its   53 Mar 29 13:48 1459273700.684.out
-rwxr--r-- 1 hmeij its 4270 Mar 29 13:48 1459273700.684.shell
-rwxr--r-- 1 hmeij its   33 Mar 29 13:48 hostfile.681
-rw-r--r-- 1 hmeij its   40 Mar 29 13:48 hostfile.684
-rw-r--r-- 1 hmeij its   40 Mar 29 13:48 hostfile.tmp.684

[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD
10002 ?        00:00:00 1459273700.684
10005 ?        00:00:00 1459273700.684.
10039 ?        00:00:00 cr_restart    <------ started everything back up
10051 ?        00:00:00 cr_mpirun
10052 ?        00:00:00 mpirun
18559 pts/2    00:00:00 bash

# and now you can watch the output picking up from the last checkpoint file
[hmeij@cottontail lammps]$ tail -20 ../.lsbatch/1459272204.681.out
     210    5527023.5   -4440257.5     93377214 2.8818906e+08    24569.109
     390    5796854.1   -4451671.6     83661722 2.8818969e+08    25640.235
     400    5665179.3     -4453332     88410367 2.8818972e+08    24990.519

# let job finish

</code>
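
For the crash-and-restart shown above, the only change is in the user-editable section at the top of the wrapper: comment out the start block, uncomment the restart block, and point ''orgjobpid'' at the original job (681 in this example). A sketch, using the variables of the v1 wrapper listed further down:

<code>

# IF START OF JOB, UNCOMMENT
#mode=start
#queue=test
#cmd=" lmp_mpi -c off -var GPUIDX 0 -in au.inp -l auout "

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
mode=restart
queue=test
orgjobpid=681

</code>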
  

==== Parallel Wrapper v2 ====

A bit more verbose, with more error handling. Also, either the blcr_watcher or the cr_checkpoint loop code in the wrapper can now terminate the job.

<code>

#!/bin/bash -x
rm -f err out
# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed

# SCHEDULER set queue name in next TWO lines
queue=hp12
#BSUB -q hp12
#BSUB -n 6
#BSUB -J test
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node (not sure why)
#BSUB -R "span[hosts=1]"
#BSUB -m n5

# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
cpti=15m

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r
$HOME/python/kflaherty/data/HD163296.CO32.regridded.cen15.vis
$HOME/python/kflaherty/data/HD163296.CO32.regridded.cen15.vis.fits
$HOME/python/kflaherty/data/lowres_ALMA_weights_calc.sav
$HOME/python/kflaherty/co.dat
$HOME/python/kflaherty/disk_other.py
$HOME/python/kflaherty/disk.py
$HOME/python/kflaherty/mol_dat.py
$HOME/python/kflaherty/mpi_run_models.py
$HOME/python/kflaherty/sample_co32.sh
$HOME/python/kflaherty/single_model.py . "
post_cmd=" scp $MYSANSCRATCH/chain*.dat $HOME/tmp/"


# IF START OF JOB, UNCOMMENT
# it's either the start or the restart block
#mode=start
#cmd=" python mpi_run_models.py /sanscratch/$LSB_JOBID > /sanscratch/$LSB_JOBID/test2.out "

# IF RESTART OF JOB, UNCOMMENT, MUST BE RUN ON SAME NODE
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
mode=restart
orgjobpid=636341

# user environment
export PYTHONHOME=/share/apps/CENTOS6/blcr_soft/python/2.7.10
export PYTHONPATH=/home/apps/CENTOS6/blcr_soft/python/2.7.10/lib/python2.7/site-packages
export PATH=$PYTHONHOME/bin:$PATH
. /home/apps/miriad/MIRRC.sh
export PATH=$MIRBIN:$PATH
which python


############### NOTHING TO EDIT BELOW THIS LINE ##################



# checkpoints
checkpoints=/sanscratch/checkpoints

# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
        echo "Error: BLCR modules not loaded on `hostname`"
        kill $$
fi

# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
        #save_options="--save-private --save-shared"
        save_options="--save-none"
else
        save_options="--save-all"
fi

# environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH

which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
        mkdir -p $checkpoints/$LSB_JOBID
else
        echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
        kill $$
fi

# save process id and path and start application
if [ "$mode" == "start" ];  then
        # hostfile
        echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
        tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID
        c=`wc -l $HOME/.lsbatch/hostfile.$LSB_JOBID | awk '{print $1}'`
        for i in `seq 1 $c`; do echo '127.0.0.1' >> $HOME/.lsbatch/localhost.$LSB_JOBID; done
        $pre_cmd
        # why
        rm -f /tmp/tmp??????
        cr_mpirun -v -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
        -x LD_LIBRARY_PATH --hostfile $HOME/.lsbatch/localhost.$LSB_JOBID $cmd 2>>$checkpoints/$LSB_JOBID/cr_mpirun.err &
        pid=$!
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid
        orgjobpid=0

# otherwise restart the job
elif [ "$mode" == "restart" ]; then
        orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
        orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
        if [ "X$orgpwd" == "X" ]; then
                echo "Error: orgpwd problem, check error log"
                exit
        fi
        # cleanup old if present
        rm -rf /sanscratch/$orgjobpid /localscratch/$orgjobpid
        rm -f $HOME/.lsbatch/*.$orgjobpid
        # why
        rm -f /tmp/tmp??????
        # stage old
        scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
        scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
        ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
        scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/
        scp -r $checkpoints/$orgjobpid/$orgjobpid/* /localscratch/$LSB_JOBID
        # why
        scp $checkpoints/$orgjobpid/$orgjobpid/tmp?????? /tmp/
        ln -s /localscratch/$LSB_JOBID /localscratch/$orgjobpid
        c=`wc -l $HOME/.lsbatch/hostfile.$orgjobpid | awk '{print $1}'`
        for i in `seq 1 $c`; do echo '127.0.0.1' >> $HOME/.lsbatch/localhost.$orgjobpid; done
        cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH --cont \
        $MYSANSCRATCH/chk.$orgpid 2>>$checkpoints/$LSB_JOBID/cr_restart.err &
        pid=$!
        started=`ps -u $USER | awk '{print $1}' | grep $pid | wc -l`
        if [ $started -ne 1 ]; then
                echo "Error: cr_restart failed, check error log"
                kill $$
        fi
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
        echo "Error: startup mode not defined correctly"
        kill $$
fi

# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &
bw_pid=$!

# always run this block
while [ true ]; do
        # checkpoint time interval
        sleep $cpti
        # finished?
        no_pid=`ps -u $USER | grep $pid | awk '{print $1}'`
        if [ "${no_pid}x" == 'x' ]; then
                # save output
                scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
                $POST_CMD
                kill $bw_pid
                rm -f $HOME/.lsbatch/*${orgjobpid}*
                exit
        fi
        # checkpoint file outside of sanscratch
        scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
        scp -r /localscratch/$LSB_JOBID $checkpoints/$LSB_JOBID/
        chmod u+w $checkpoints/$LSB_JOBID/chk.* /sanscratch/$LSB_JOBID/chk.*
        # why
        scp /tmp/tmp?????? $checkpoints/$LSB_JOBID/$LSB_JOBID/
        cr_checkpoint -v --tree --cont $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid \
        2>>$checkpoints/$LSB_JOBID/cr_checkpoint.err
        scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
        scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/
        scp -r /localscratch/$LSB_JOBID $checkpoints/$LSB_JOBID/
        # why
        scp /tmp/tmp?????? $checkpoints/$LSB_JOBID/$LSB_JOBID/
        date >> $checkpoints/$LSB_JOBID/cr_checkpoint.err
done


</code>
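
Both wrappers hand the application PID to ''blcr_watcher'', which is not listed on this page. Its behavior, as described above, is to terminate the wrapper once the monitored process is gone (done or crashed). The following is only a minimal sketch of what such a watcher might look like; the polling interval and the output-saving step are assumptions, not the actual site script.

<code>

#!/bin/bash
# hypothetical sketch of blcr_watcher, called as:
#   blcr_watcher CMD_PID WRAPPER_PID JOBID ORGJOBPID
# illustration only, not the actual site script
cmd_pid=$1      # cr_mpirun or cr_restart process started by the wrapper
wrapper_pid=$2  # the wrapper shell itself ($$)
jobid=$3
orgjobpid=$4

while true; do
        sleep 60    # assumed polling interval
        if ! ps -p $cmd_pid > /dev/null 2>&1; then
                # application is gone (finished or crashed): save output,
                # run the user's post command, then stop the wrapper
                scp -rp /sanscratch/$jobid/* /sanscratch/checkpoints/$jobid/
                $POST_CMD
                kill $wrapper_pid
                exit
        fi
done

</code>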

==== Parallel Wrapper v1 ====

<code>

#!/bin/bash
rm -f err out
# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed

# SCHEDULER
#BSUB -q test
#BSUB -J test
#BSUB -n 4
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node in test queue (not sure why, others can restart anywhere)
#BSUB -R "span[hosts=1]"

# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
cpti=1d

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r  $HOME/lammps/au.inp $HOME/lammps/auu3
 $HOME/lammps/henz.dump  $HOME/lammps/data.Big11AuSAMInitial  . "
post_cmd=" scp auout $HOME/lammps/auout.$LSB_JOBID "

# IF START OF JOB, UNCOMMENT
# it's either the start or the restart block
mode=start
queue=test
cmd=" lmp_mpi -c off -var GPUIDX 0 -in au.inp -l auout "

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
#mode=restart
#queue=test
#orgjobpid=691

# user environment
export PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/lib:$LD_LIBRARY_PATH
#which lmp_mpi



############### NOTHING TO EDIT BELOW THIS LINE ##################



# checkpoints
checkpoints=/sanscratch/checkpoints

# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
        echo "Error: BLCR modules not loaded on `hostname`"
        kill $$
fi

# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
        #save_options="--save-private --save-shared"
        save_options="--save-none"
else
        save_options="--save-all"
fi

# environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH

#which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
        mkdir -p $checkpoints/$LSB_JOBID
else
        echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
        kill $$
fi

# save process id and path and start application
if [ "$mode" == "start" ];  then
        # hostfile
        echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
        tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID
        $pre_cmd
        cr_mpirun -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
                  --hostfile $HOME/.lsbatch/hostfile.$LSB_JOBID $cmd &
        pid=$!
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid
        orgjobpid=0

# otherwise restart the job
elif [ "$mode" == "restart" ]; then
        orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
        orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
        # cleanup old
        rm -rf /sanscratch/$orgjobpid $HOME/.lsbatch/*.$orgjobpid
        # stage old
        scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
        scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
        ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
        scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/
        cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH $MYSANSCRATCH/chk.$orgpid &
        pid=$!
        started=`ps -u $USER | awk '{print $1}' | grep $pid | wc -l`
        if [ $started -ne 1 ]; then
                echo "Error: cr_restart failed, check error log"
                kill $$
        fi
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
        echo "Error: startup mode not defined correctly"
        kill $$
fi

# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &

# always run this block
while [ true ]; do
        # checkpoint time interval
        sleep $cpti
        # checkpoint file outside of sanscratch
        scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
        cr_checkpoint --tree $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
        scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
        scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/
done

</code>
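
The checkpoint loops in both wrappers boil down to the two BLCR primitives used above, ''cr_checkpoint'' and ''cr_restart''. For reference, this is the bare cycle on a single process outside the scheduler; the ''sleep'' stand-in and the context file name are illustrative only.

<code>

# illustration only: the bare BLCR checkpoint/restart cycle that the wrappers automate
sleep 600 &
pid=$!

cr_checkpoint --save-all -f chk.$pid $pid   # write the context file
kill $pid                                   # simulate a crash
cr_restart chk.$pid &                       # resume from the context file

</code>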
  
\\
**[[cluster:0|Back]]**