





BLCR Checkpoint in OL3

  • Installation and what it does: BLCR

When we move to OpenLava 3.x, all queues will support checkpointing. This means you can run your job in a “wrapper”, and if the job or the cluster crashes you can restart your job from the last checkpoint file.

Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint.
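
The rule of thumb above can be written down as a small shell helper. This is only a sketch: the function name checkpoint_interval_hours is made up for illustration, the thresholds simply mirror the two examples in the text, and using 6 hours for jobs shorter than a week is my assumption, not a site policy.

```shell
# Hypothetical helper: pick a checkpoint interval (in hours) from the
# expected run time of the job (in hours). The thresholds mirror the
# examples in the text; 6 hours for shorter jobs is an assumption.
checkpoint_interval_hours() {
    runtime_hours=$1
    if [ "$runtime_hours" -ge 720 ]; then      # about a month or longer
        echo 24
    elif [ "$runtime_hours" -ge 168 ]; then    # a week or longer
        echo 12
    else
        echo 6                                 # never more often than every 6 hours
    fi
}

checkpoint_interval_hours 720   # month-long job
checkpoint_interval_hours 168   # week-long job
```

A wrapper could multiply the result by 3600 and sleep that many seconds between checkpoints.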

You need to test checkpointing before you rely on it. I've noticed that BLCR does not always notice files that some local code opens for output. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this.

BLCR (Berkeley Lab Checkpoint/Restart) remembers file paths and process ids. The code stages the STDOUT and STDERR files that OpenLava generates, invokes BLCR's relocation feature, and ignores the original process ids on restart.
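
To make the relocation and process id handling concrete, here is a minimal sketch that only prints the restart command instead of running it. The helper name build_restart_cmd and all paths are placeholders I made up. The --relocate OLD=NEW and --no-restore-pid options are as I understand them from the cr_restart manual page of recent BLCR versions, so check your installed version before relying on them.

```shell
# Hypothetical helper: print the cr_restart command line for a job whose
# files moved from old_dir to new_dir. All paths are placeholders.
build_restart_cmd() {
    old_dir=$1
    new_dir=$2
    ckpt_file=$3
    # --no-restore-pid: let the restarted process take a new process id
    # --relocate OLD=NEW: map file paths remembered under OLD to NEW
    printf 'cr_restart --no-restore-pid --relocate %s=%s %s\n' \
        "$old_dir" "$new_dir" "$ckpt_file"
}

build_restart_cmd /old/workdir /new/workdir /new/workdir/checkpoint.blcr
```

Printing the command first is a cheap way to check the old and new directories before an actual restart.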



cluster/147.1458236190.txt.gz · Last modified: 2016/03/17 13:36 by hmeij07