Differences

This shows you the differences between two versions of the page.

--- cluster:147 [2016/03/17 19:19]
hmeij07 [BLCR Checkpoint in OL3]
+++ cluster:147 [2020/02/27 18:06] (current)
hmeij07
@@ Line 3: / Line 3: @@
 ==== BLCR Checkpoint in OL3 ====
+**Deprecated since we did OS upgrades [[cluster:185|OS Update]]\\
+We will install DMTCP as a replacement...[[cluster:190|DMTCP]]** \\
+ --- //[[hmeij@wesleyan.edu|Henk]] 2020/01/14 14:28//
+  * This page concerns SERIAL jobs only; SERIAL jobs can restart on any node
   * Installation and what it does [[cluster:124|BLCR]]
@@ Line 8: / Line 14: @@
   * Users Guide [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html]]
-When we move to Openlava 3.x all queues will support checkpointing, which means you can run your job in a "wrapper" and if the job or cluster crashes you can restart your job from last checkpoint file.
+All queues will support checkpointing, which means you can run your job in a "wrapper" and, if the job or cluster crashes, restart your job from the last checkpoint file.
-Checkpointing is an expensive operation so do not checkpoint under 6 hours. For example, if your job runs for a month checkpoint once a day, if your job runs for a week checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home grown code you can use BLCR. (Too bad it does not work out of box within Openlava).
+Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if it runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For other applications or home-grown code you can use BLCR.
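The rule of thumb above can be sketched as a small helper that suggests a value for the wrapper's checkpoint time interval from the expected run time. The function name `pick_cpti` and the exact thresholds (720 and 168 hours) are assumptions made for illustration; only the 6-hour floor and the month/week examples come from the text.

```bash
#!/bin/bash
# Hypothetical helper: suggest a checkpoint interval from expected run time.
# Argument: expected run time in hours (integer).
pick_cpti() {
    local hours=$1
    if [ "$hours" -ge 720 ]; then      # about a month or longer
        echo "1d"
    elif [ "$hours" -ge 168 ]; then    # about a week or longer
        echo "12h"
    else
        echo "6h"                      # never checkpoint more often than this
    fi
}

pick_cpti 720
pick_cpti 168
pick_cpti 48
```

The returned strings are in the form `sleep` accepts, so a result could be assigned directly to a variable such as `cpti`.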
-You need to test out checkpointing before you rely on it. I've notice that some local code, when opening files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we shuold open files differently, this needs investigating further.
+You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice them. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we should open files differently; this needs further investigation.
-BLCR, Berkely Lab Checkpoint and Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files Openlava generates and invokes the relocation feature while ignore old process ids. If an application is large (for example 10G), and static, it is advisable to not save the application inside the checkpoint file.
+BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files the scheduler generates and invokes the relocation feature while ignoring old process ids. If an application is large and static, it is advisable not to save the application inside the checkpoint file.
-At the bottom of this page is the v0.1 version of ''blcr_wrapper'' program which will hide the complexity for you. ''blcr_watcher'' is a program that is in your PATH already and will terminate the wrapper if the application done inside of a check point time interval. I will work with any group interested to customize your ''blcr_wrapper'' for your lab/group.
+At the bottom of this page is the current version of the ''blcr_wrapper'' program, which will hide the complexity for you. ''blcr_watcher'' is a program that is already in your PATH and will terminate the wrapper if the application finishes within a checkpoint time interval. I will work with any interested group to customize ''blcr_wrapper'' for your lab/group.
    * Here is an interactive simple sample run.
-   *
 <code>
@@ Line 43: / Line 49: @@
 Connection to petaltail lost.
-# that was not too clever, log back in, restart application in another directory
+# ooops, that was not too clever, log back in, restart application in another directory
 [hmeij@petaltail 187]$ cd ..
 [hmeij@petaltail sanscratch]$ mkdir 188
@@ Line 84: / Line 90: @@
 </code>
+==== Putting it all Together ====
+The ''blcr_wrapper'' will perform a "change directory" to $MYSANSCRATCH which is /sanscratch/JOBPID. So think in those terms. Copy the application (and input data) to '.' using $pre_cmd in the script. If $save_exec="n", then upon a restart the script will copy the application back.
+You edit the top part of ''blcr_wrapper'' to match your job's needs. Then uncomment either the START block or the RESTART block. When you restart a job (the cluster assigns a new JOBPID), the script needs the old JOBPID of the crashed job, which has the latest checkpoint file in /sanscratch/checkpoints/.
+Then submit to the scheduler as usual.
+<code>
+[hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper_serial
+</code>
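Before switching the wrapper to restart mode, it may help to confirm that the crashed job's checkpoint directory actually holds the `pwd.*` and `chk.*` files the restart block reads. The following is a minimal sketch; the function name `check_restart_files` is hypothetical and is not part of ''blcr_wrapper''.

```bash
#!/bin/bash
# Hypothetical pre-flight check before setting mode=restart.
# Argument: checkpoint directory of the crashed job,
#           e.g. /sanscratch/checkpoints/250
check_restart_files() {
    local dir=$1
    local f found_pwd=n found_chk=n
    # an unmatched glob stays literal, so test that each match exists
    for f in "$dir"/pwd.*; do
        if [ -e "$f" ]; then found_pwd=y; fi
    done
    for f in "$dir"/chk.*; do
        if [ -e "$f" ]; then found_chk=y; fi
    done
    if [ "$found_pwd" = "y" ] && [ "$found_chk" = "y" ]; then
        echo "ok"
        return 0
    fi
    echo "missing pwd.* or chk.* in $dir"
    return 1
}
```

For example, `check_restart_files /sanscratch/checkpoints/250` would be run on the command line before submitting a restart with `orgjobpid=250`.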
+==== Files v0.2 ====
+  * ''blcr_wrapper_serial'' at /home/hmeij/jobs/blcr/ for non MPI jobs
+<code>
+#!/bin/bash
+# work dir and cwd
+export MYSANSCRATCH=/sanscratch/$LSB_JOBID
+cd $MYSANSCRATCH
+# at job finish, all content in /sanscratch/JOBPID
+# will be copied to /sanscratch/checkpoints/JOBPID
+# content older than 3 months will be removed
+# SCHEDULER
+#BSUB -q test
+#BSUB -n 1
+#BSUB -J test
+#BSUB -o out
+#BSUB -e err
+# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
+cpti=10m
+# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
+# always stage the application (and data if needed)
+# if application is large and static save_exec="n"
+save_exec="n"
+pre_cmd=" scp $HOME/ynam/a.out . "
+post_cmd=" scp $MYSANSCRATCH/fid.txt $HOME/ynam "
+# IF START OF JOB, UNCOMMENT
+# it's either the start or the restart block
+mode=start
+queue=test
+cmd="./a.out"
+# IF RESTART OF JOB, UNCOMMENT
+# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
+#mode=restart
+#queue=test
+#orgjobpid=250
+# buglines: if your group/lab is mentioned set value to "y", else "n"
+# "y" for rblumel/ynam
+do_bug1a_cmd="y"
+do_bug1b_cmd="y"
+############### NOTHING TO EDIT BELOW THIS LINE ##################
+# checkpoints
+checkpoints=/sanscratch/checkpoints
+# bug commands
+bug1a_cmd="scp $MYSANSCRATCH/fid.txt $checkpoints/$LSB_JOBID/"
+bug1b_cmd="scp $checkpoints/$orgjobpid/fid.txt $MYSANSCRATCH"
+# kernel modules
+mods=`lsmod | grep ^blcr | wc -l`
+if [ $mods -ne 2 ]; then
+        echo "Error: BLCR modules not loaded on `hostname`"
+        kill $$
+fi
+# blcr setup
+restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
+if [ $save_exec == "n" ]; then
+        save_options="--save-private --save-shared"
+else
+        save_options="--save-all"
+fi
+export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
+export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH
+# setup checkpoints dir
+if [ ! -d $checkpoints/$LSB_JOBID ]; then
+        mkdir -p $checkpoints/$LSB_JOBID
+else
+        echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
+        kill $$
+fi
+# save process id and path and start application
+if [ "$mode" == "start" ];  then
+        $pre_cmd
+        cr_run $cmd &
+        pid=$!
+        pwd > $checkpoints/$LSB_JOBID/pwd.$pid
+        orgjobpid=0
+        if [ "X$do_bug1a_cmd" == "Xy" ]; then
+                $bug1a_cmd
+        fi
+# otherwise restart the job
+elif [ "$mode" == "restart" ]; then
+        orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
+        orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
+        #if [ "X$orgpid" == "X" -o "X$orgpwd" == "X" ]; then
+        #       echo "Error: problem with missing orgpid or orgpwd values"
+        #       kill $$
+        #fi
+        scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
+        if [ $save_exec == "n" ]; then
+                $pre_cmd
+        fi
+        if [ "X$do_bug1b_cmd" == "Xy" ]; then
+                $bug1b_cmd
+        fi
+        cr_restart $restore_options --relocate $orgpwd=$MYSANSCRATCH $checkpoints/$orgjobpid/chk.$orgpid &
+        pid=$!
+        pwd > $checkpoints/$LSB_JOBID/pwd.$pid
+# obviously
+else
+        echo "Error: startup mode not defined correctly"
+        kill $$
+fi
+# if $cmd disappears during $cpti, terminate wrapper
+export POST_CMD="$post_cmd"
+blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &
+# always run this block
+while [ true ]; do
+        # checkpoint time interval
+        sleep $cpti
+        # checkpoint file outside of sanscratch
+        cr_checkpoint $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
+        scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
+        if [ "X$do_bug1a_cmd" == "Xy" ]; then
+                $bug1a_cmd
+        fi
+done
+</code>
+  * ''blcr_watcher'' v0.1 at /share/apps/bin/
+<code>
+#!/bin/bash
+# watch a process during check point time interval
+# if it disappears, terminate the blcr_wrapper
+checkpoints=/sanscratch/checkpoints
+watch_pid=$1
+watch_wrapper=$2
+jobpid=$3
+orgjobpid=$4
+while [ $watch_pid -gt 0 ]; do
+        sleep 600
+        nopid=`ps -u $USER | grep $watch_pid | awk '{print $1}'`
+        if [ "${nopid}x" == 'x' ]; then
+                # save output
+                scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
+                if [ $orgjobpid -gt 0 ]; then
+                        rm -f $HOME/.lsbatch/*.$orgjobpid.err $HOME/.lsbatch/*.$orgjobpid.out
+                fi
+                $POST_CMD
+                kill $watch_wrapper
+                exit;
+        fi
+done
+</code>
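The watcher detects a vanished process by grepping `ps` output for the pid. Because `grep` matches substrings, a watched pid of 123 would also match an unrelated pid such as 1234. One possible alternative, shown here only as a sketch and not as the script's actual method, is to probe the pid with `kill -0`. The function name `pid_alive` is hypothetical.

```bash
#!/bin/bash
# Hypothetical alternative liveness test for the watched process.
# kill -0 sends no signal; it only reports whether the pid exists
# and can be signalled by the current user.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# small demonstration with a throwaway background process
sleep 30 &
pid=$!
if pid_alive "$pid"; then echo "running"; fi
kill "$pid"
wait "$pid" 2>/dev/null || true    # reap the process so the pid is released
if ! pid_alive "$pid"; then echo "gone"; fi
```

In the watcher loop, a test of the form `if ! pid_alive $watch_pid; then ... fi` could stand in for the `ps | grep` check, assuming the watched process belongs to the same user.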
+==== Matlab ====
+  * https://www.bu.edu/tech/support/research/software-and-programming/common-languages/matlab/matlab-batch/checkpointing/
