==== Upgrading ====
Figure out an upgrade process before going production.

  * **Do you actually want to upgrade OpenHPC?**
    * v2.6 deploys ww4.x (maybe not wanted, containers)
    * chroot images and rebuild images
    * OneAPI similar conflicts? (/opt/intel and /...)
    * slurm complications?
  * **Upgrade OpenHPC; OneAPI should be on a new head node**
    * test compiler compatibility
    * test slurm clients (see the version check sketch below)
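Before deciding, it helps to confirm which slurm versions are actually deployed on the clients (a quick sketch; ''pdsh'' and the node list n[33-37] are assumptions, the slurm commands themselves are standard):

<code>
# controller side
sinfo --version

# client side, sampled across a few compute nodes
pdsh -w n[33-37] 'slurmd -V' 2>/dev/null | sort | uniq -c
</code>

The package upgrade itself then looks something like: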
<code>
yum upgrade "..."
yum upgrade "..."

or

yum update --disablerepo=* --enablerepo=[oneAPI,...]
</code>
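Before running either form, a dry run previews what would change (a sketch; the repo ids ''ohpc'' and ''oneAPI'' are assumptions, check /etc/yum.repos.d/ for the real ids):

<code>
# list enabled repos to get the exact ids
yum repolist enabled

# preview an OS-only update while holding back ohpc/oneAPI packages
yum check-update --disablerepo=ohpc,oneAPI
</code>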
**Upgrade history**

  * OS only, 30 Jun 2022 (90+ days up) - no ohpc, oneapi (/opt)
  * OS only, 18 Aug 2023 (440+ days up) - no ohpc, oneapi (/opt)
==== example modules ====
Amber's cmake download step fails with a READLINE error ... package readline-devel needs to be installed to get past that, which pulls in ...
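A minimal sketch of the fix on a build node (assumes root and a yum-based OS, per the upgrade notes above):

<code>
# install the readline headers cmake is probing for;
# yum resolves whatever -devel dependencies come along
yum install -y readline-devel
</code>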

**Example script run.rocky for cpu or gpu runs** (queues amber128 [n78] and test [n100-n101] for gpus; mw128 and tinymem for cpus)
<code>
# CPU control
#SBATCH -n 8                 # tasks=S*C*T
###SBATCH -B 1:1:1           # S:C:T sockets:cores:threads
#SBATCH -B 2:4:1             # S:C:T sockets:cores:threads
###SBATCH --cpus-per-gpu=1
###SBATCH --mem-per-gpu=7168
#
# GPU control
###SBATCH --gres=gpu:...
###SBATCH --gres=gpu:...
#
# Node control
#SBATCH --partition=tinymem
#SBATCH --nodelist=n57

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$SLURM_JOB_ID
MYLOCALSCRATCH=/localscratch/$SLURM_JOB_ID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYLOCALSCRATCH
# stage the data
cp -r ~/... .
export CUDA_VISIBLE_DEVICES=0

# for amber20 on n[100-101] gpus
#mpirun -x LD_LIBRARY_PATH -machinefile ~/... \
#-np 1 \
#pmemd.cuda \
#-O -o mdout.$SLURM_JOB_ID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd

# for amber20 on n59/n77 cpus, select partition
mpirun -x LD_LIBRARY_PATH -machinefile ~/... \
-np 8 \
pmemd.MPI \
-O -o mdout.$SLURM_JOB_ID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd

scp mdout.$SLURM_JOB_ID ~/tmp/
</code>
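To use it, submit from the login node and watch the queue (standard slurm commands; ''out''/''err'' assume the ''--output''/''--error'' settings shown in run.centos below):

<code>
sbatch run.rocky        # prints the job id
squeue -u $USER         # PD = pending, R = running
tail -f out err         # follow job output once it starts
</code>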
**Example script run.centos for cpu or gpu runs** (queues mwgpu, exx96)

<code>
#!/bin/bash
# [found at XStream]
# Slurm will IGNORE all lines after the FIRST BLANK LINE,
# even the ones containing #SBATCH.
# Always put your SBATCH parameters at the top of your batch script.
# Took me days to find ... really silly behavior -Henk
#
# GENERAL
#SBATCH --job-name="..."
#SBATCH --output=out
#SBATCH --error=err
##SBATCH --mail-type=END
##SBATCH --mail-user=hmeij@wesleyan.edu
#
# NODE control
#SBATCH -N 1                 # default, nodes
#
# CPU control
#SBATCH -n 1                 # tasks=S*C*T
#SBATCH -B 1:1:1             # S:C:T sockets:cores:threads
###SBATCH -B 2:4:1           # S:C:T sockets:cores:threads
#
# GPU control
###SBATCH --gres=gpu:...
#SBATCH --gres=gpu:...
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
#
# Node control
#SBATCH --partition=exx96
#SBATCH --nodelist=n88


# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$SLURM_JOB_ID
MYLOCALSCRATCH=/localscratch/$SLURM_JOB_ID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYLOCALSCRATCH

# amber20/...
export PATH=/...
export LD_LIBRARY_PATH=/...
export CUDA_HOME=/...
export PATH=/...
export LD_LIBRARY_PATH=/...
export LD_LIBRARY_PATH="/..."
export PATH=/...
export LD_LIBRARY_PATH=/...
which nvcc mpirun python


source /...
# stage the data
cp -r ~/... .

###export CUDA_VISIBLE_DEVICES=`shuf -i 0-3 -n 1`
###export CUDA_VISIBLE_DEVICES=0


# for amber20 on n[33-37] gpus, select gpu model
mpirun -x LD_LIBRARY_PATH -machinefile ~/... \
-np 1 \
pmemd.cuda \
-O -o mdout.$SLURM_JOB_ID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd

# for amber20 on n59/n100 cpus, select partition
#mpirun -x LD_LIBRARY_PATH -machinefile ~/... \
#-np 8 \
#pmemd.MPI \
#-O -o mdout.$SLURM_JOB_ID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd

scp mdout.$SLURM_JOB_ID ~/tmp/
</code>
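The commented ''shuf'' line in the script above picks a random gpu id; an alternative sketch is to pick the least-utilized gpu instead (assumes ''nvidia-smi'' is on the node's PATH; not part of the original script):

<code>
# query "index, utilization" per gpu, sort by utilization, keep the idlest index
export CUDA_VISIBLE_DEVICES=$(nvidia-smi \
  --query-gpu=index,utilization.gpu --format=csv,noheader,nounits \
  | sort -t, -k2 -n | head -1 | cut -d, -f1)
</code>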

**The script amber.sh was converted to a module like so:**

<code>
...
</code>
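A quick check that the module reproduces what sourcing amber.sh did (standard module commands; the name ''amber/20'' is an assumption based on the ''amber/22'' module below):

<code>
module show amber/20    # display what the module sets
module load amber/20
echo $AMBERHOME
which pmemd.cuda pmemd.MPI
</code>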

==== Amber22 ====

Amber22 is somehow incompatible with CentOS/...

https://...
"..."
<code>
[hmeij@n79 src]$ echo $AMBERHOME
/...

[hmeij@n79 src]$ which mpirun mpicc
/...
/...
</code>
First establish a successful run with the **run.rocky** script for Amber20 (listed above), then change the module in your script (queues amber128 [n78] and test [n100-n101] for gpus; mw128 and tinymem for cpus).
<code>
module load amber/22

# if the module does not show up in the output of your console
module avail

# treat your module cache as out of date
module --ignore_cache avail
</code>
First establish a successful run with the **run.centos** script for Amber20 (listed above, for cpus or gpus on queues mwgpu and exx96).
Then edit the script and apply these edits. We had to use a specific compatible ''gcc'' version (6.5.0).
<code>
# comment out the 2 export lines pointing to openmpi
##export PATH=/...
##export LD_LIBRARY_PATH=/...

# additional gcc 6.5.0
export PATH=/...
export LD_LIBRARY_PATH=/...

# edit or add the correct source line; the which and ldd lines are just for debugging
###source /...
###source /...
source /...
which nvcc mpirun python
ldd `which pmemd.cuda_SPFP`
</code>
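When those debugging lines run, a quick filter shows whether anything failed to resolve (plain ldd/grep, nothing site-specific):

<code>
# any "not found" output means LD_LIBRARY_PATH is still missing an entry
ldd $(which pmemd.cuda_SPFP) | grep -i 'not found' \
  || echo "all libraries resolved"
</code>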
\\
**[[cluster:...]]**