This shows you the differences between two versions of the page.
cluster:124 [2013/10/31 18:32] hmeij [BLCR] |
cluster:124 [2013/10/31 18:37] hmeij |
||
---|---|---|---|
Line 129: | Line 129: | ||
Now we can write a batch script for the scheduler. | Now we can write a batch script for the scheduler. | ||
+ | * The job will always end up in / | ||
+ | * The checkpoint file should be written to a safe place, like /home | ||
+ | * The time interval for checkpointing should be sufficiently large to not slow the job down | ||
+ | * for example set it to 12 hours or 24 hours even | ||
+ | * the short intervals in the script are just for testing | ||
+ | * Then there are 2 blocks of lines to (un)comment | ||
+ | * One to invoke '' | ||
+ | * One to invoke '' | ||
* | * | ||
Line 174: | Line 182: | ||
echo " | echo " | ||
while [ $process_id -gt 0 ]; do | while [ $process_id -gt 0 ]; do | ||
- | # checkpoint time interval, make it an hour or larger | + | # checkpoint time interval, make it very large (small for testing) |
sleep 120 | sleep 120 | ||
- | # save the checkpoint file outside of sanscratch | + | # save the checkpoint file outside of /sanscratch |
cr_checkpoint -f ~/ | cr_checkpoint -f ~/ | ||
# if the application has crashed, exit | # if the application has crashed, exit | ||
process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk ' | process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk ' | ||
if [ " | if [ " | ||
- | # save some stuff for checking | + | # save some stuff for checking |
cp -p pwd* *.shell *.out *.err context ~/blcr/ | cp -p pwd* *.shell *.out *.err context ~/blcr/ | ||
rm -f `cat ~/ | rm -f `cat ~/ |
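The periodic checkpoint loop described in the notes above can be sketched as a small reusable function. This is a minimal sketch, not the script from this page: the names `checkpoint_loop`, `CKPT_INTERVAL`, `CKPT_FILE` and `CHECKPOINT_CMD`, and the default file `$HOME/checkpoint.context`, are hypothetical. It also assumes `cr_checkpoint` accepts the process id as its final argument and uses `kill -0` instead of `ps | grep` to test whether the application is still running.

```shell
#!/bin/bash
# Sketch only: all variable and function names below are hypothetical.

# Checkpoint a running process every CKPT_INTERVAL seconds until it exits.
checkpoint_loop() {
    local pid=$1
    # 12 hours by default, so checkpointing does not slow the job down
    local interval=${CKPT_INTERVAL:-43200}
    # keep the checkpoint file in a safe place such as /home (assumed path)
    local ckpt_file=${CKPT_FILE:-$HOME/checkpoint.context}
    # command used to take the checkpoint (assumed to be BLCR's cr_checkpoint)
    local cmd=${CHECKPOINT_CMD:-cr_checkpoint}

    # kill -0 sends no signal; it only tests that the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # the application may have finished or crashed during the sleep
        kill -0 "$pid" 2>/dev/null || break
        "$cmd" -f "$ckpt_file" "$pid"
    done
}

# Example (commented out): checkpoint process 12345 every 24 hours
# CKPT_INTERVAL=86400 checkpoint_loop 12345
```

For a quick test run, setting `CKPT_INTERVAL=120` matches the `sleep 120` used in the script on this page; for production jobs the default of 12 hours or a larger value is the safer choice.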