cluster:218

  * August 2022 is designated **migration** period
  * Queues ''hp12'' and ''mwgpu'' (centos6) will be serviced by Openlava, not Slurm

==== Quick Start Slurm Guide ====

Jump to the **Rocky8/CentOS7 script templates** listed in the menu of this page, top right.

There is also detailed information on Amber20/Amber22 on this page, with script examples.

  * [[cluster:214|Tada]] new head node
  
==== Basic Commands ====
  
# sorta like bqueues
sinfo -l
  
# more node info
# sorta like bhosts -l
scontrol show node n78

# sorta like bstop/bresume
scontrol suspend job 1000001
scontrol resume job 1000001
  
# sorta like bhist -l
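
A minimal sketch of a typical session; the script name ''run.sh'' is hypothetical and the job ID is an example:

<code>
# submit a job script (sorta like bsub < run.sh)
sbatch run.sh

# list your jobs (sorta like bjobs)
squeue -u $USER

# job history/accounting (sorta like bhist)
sacct -j 1000001

# cancel a job (sorta like bkill)
scancel 1000001
</code>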
  
  * manual pages for conf files or commands, for example
    * ''man slurm.conf''
    * ''man sbatch''
    * etc ... (see above commands)
  
You must request **resources**, for example the number of cpu cores or which gpu model to use. **If you do not request resources, Slurm will assume you need all of the node's resources** and thus prevent other jobs from running on that node.

Details:

  * https://slurm.schedmd.com/cons_res_share.html
  
Some common examples are:
#SBATCH -n 8     # tasks=S*C*T
#SBATCH -B 2:4:1 # S:C:T=sockets/node:cores/socket:threads/core
#SBATCH --mem=250           # needed to override oversubscribe
#SBATCH --ntasks-per-node=1 # perhaps needed to override oversubscribe
#SBATCH --cpus-per-task=1   # needed to override oversubscribe
  
GPU control
#SBATCH --cpus-per-gpu=1                  # needed to override oversubscribe
#SBATCH --mem-per-gpu=7168                # needed to override oversubscribe
#SBATCH --gres=gpu:geforce_gtx_1080_ti: # n[78], amber128
#SBATCH --gres=gpu:geforce_rtx_2080_s:  # n[79-90], exx96
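
Putting these together, a minimal sketch of a CPU job script; the partition name ''test'' and the program/file names are assumptions:

<code>
#!/bin/bash
#SBATCH --job-name=cpu-example
#SBATCH --partition=test
#SBATCH -n 8                # 8 tasks
#SBATCH --cpus-per-task=1   # needed to override oversubscribe
#SBATCH --mem=250           # needed to override oversubscribe

# hypothetical workload
./my_program < my_input > my_output
</code>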
Slurm has a built-in MPI flavor. I suggest you do not rely on it: the documentation states that on major release upgrades the ''libslurm.so'' library is not backwards compatible, and all software using this library would need to be recompiled.
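
A quick way to check whether an existing binary depends on it; the binary path is hypothetical:

<code>
# list shared library dependencies and filter for libslurm
ldd /share/apps/my_program | grep libslurm
</code>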
  
There is a handy parallel job launcher called ''srun'' which may be of use. ''srun'' commands can be embedded in a job submission script, but ''srun'' can also be used interactively to test commands. The submitted job will have a single JOBID and launch multiple tasks.
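
For example, a quick interactive test; four tasks of a trivial command:

<code>
# run 4 copies of hostname under a single job allocation
srun -n 4 hostname
</code>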
  
<code>
[hmeij@cottontail2 ~]$ module avail
  
------------------- /opt/ohpc/pub/moduledeps/gnu9-openmpi4 -------------
   adios/1.13.1     netcdf-cxx/4.3.1        py3-scipy/1.5.1
   boost/1.76.0     netcdf-fortran/4.5.3    scalapack/2.1.0
   dimemas/5.4.2    netcdf/4.7.4            scalasca/2.5
   example2/1.0     omb/5.8                 scorep/6.0
   extrae/3.7.0     opencoarrays/2.9.2      sionlib/1.7.4
   fftw/3.3.8       petsc/3.16.1            slepc/3.16.0
   hypre/2.18.1     phdf5/1.10.8            superlu_dist/6.4.0
   imb/2019.6       pnetcdf/1.12.2          tau/2.29
   mfem/4.3         ptscotch/6.0.6          trilinos/13.2.0
   mumps/5.2.1      py3-mpi4py/3.0.3

------------------------- /opt/ohpc/pub/moduledeps/gnu9 ----------------
   R/4.1.2          mpich/3.4.2-ofi         plasma/2.8.0
   gsl/2.7          mpich/3.4.2-ucx  (D)    py3-numpy/1.19.5
   hdf5/1.10.8      mvapich2/2.3.6          scotch/6.0.6
   impi/2021.5.1    openblas/0.3.7          superlu/5.2.1
   likwid/5.0.1     openmpi4/4.1.1   (L)
   metis/5.1.0      pdtoolkit/3.25.1

--------------------------- /opt/ohpc/pub/modulefiles -------------------
   EasyBuild/4.5.0          hwloc/2.5.0      (L)    prun/2.2          (L)
   autotools         (L)    intel/2022.0.2          singularity/3.7.1
   charliecloud/0.15        libfabric/1.13.0 (L)    ucx/1.11.2        (L)
   cmake/3.21.3             ohpc             (L)    valgrind/3.18.1
   example1/1.0             os
   gnu9/9.4.0        (L)    papi/5.7.0

----------------------- /share/apps/CENTOS8/ohpc/modulefiles ------------
   amber/20    cuda/11.6    hello-mpi/1.0    hello/1.0    miniconda3/py39
</code>
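
To use one of these packages in a job script, load it by name, for example (a sketch):

<code>
module load amber/20
module list
</code>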
  
  
  
  * ''/zfshomes/hmeij/slurm/run.rocky'' for tinymem, mw128, amber128, test queues
  
<code>
#SBATCH -B 1:1:1 # S:C:T=sockets/node:cores/socket:threads/core
###SBATCH -B 2:4:1 # S:C:T=sockets/node:cores/socket:threads/core
#
# GPU control
#SBATCH --cpus-per-gpu=1
#SBATCH --mem-per-gpu=7168
###SBATCH --gres=gpu:geforce_gtx_1080_ti: # n78
#SBATCH --gres=gpu:quadro_rtx_5000: # n[100-101]
cd $MYLOCALSCRATCH
  
### AMBER20 works via slurm's imaged nodes, test and amber128 queues
#source /share/apps/CENTOS8/ohpc/software/amber/20/amber.sh
# OR #
module load amber/20
</code>
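
To try the template, submit it from the head node and watch the queue (a sketch):

<code>
sbatch run.rocky
squeue -u $USER
</code>
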
==== CentOS7 Slurm Template ====
  
In this job template I have it set up to run ''pmemd.MPI'', but it could also invoke ''pmemd.cuda'' with the proper parameter settings. On queues ''mwgpu'' and ''exx96'', amber[16,20] are local-disk CentOS7 software installations. Amber16 will not run on Rocky8 (tried it but forgot the error message ... we can expect problems like this, hence testing!).
  
Note also that we're running mwgpu's K20 cuda version 9.2 on the exx96 queue (default cuda version 10.2). Not proper, but it works, hence this script will run on both queues. Oh, now I remember: amber16 was compiled with cuda 9.2 drivers, which are supported in cuda 10.x but not in cuda 11.x. So amber16, if needed, would have to be compiled in the Rocky8 environment (and may work like the amber20 module).
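
The toggle amounts to choosing which engine line is active in the script; a sketch, with hypothetical input/output file names:

<code>
# CPU engine, 8 MPI tasks
mpirun -np 8 pmemd.MPI -O -i mdin -o mdout -p prmtop -c inpcrd

# OR the GPU engine on the allocated gpu
#pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
</code>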
  
  * ''/zfshomes/hmeij/slurm/run.centos'' for mwgpu, exx96 queues
  
<code>
#
# GPU control
###SBATCH --cpus-per-gpu=1
###SBATCH --mem-per-gpu=7168
###SBATCH --gres=gpu:tesla_k20m: # n[33-37]
###SBATCH --gres=gpu:geforce_rtx_2080_s: # n[79-90]
#
# Node control
#SBATCH --nodelist=n88
  

# may or may not be needed, centos7 login env
source $HOME/.bashrc
which ifort           # should be the parallel studio 2016 version
  
# unique job scratch dirs
  
  
###source /usr/local/amber16/amber.sh # works via slurm's mwgpu
source /usr/local/amber20/amber.sh # works via slurm's exx96
# stage the data
cp -r ~/sharptail/* .
</code>
July 2022 is for **testing...** lots to learn!
  
Kudos to Abhilash and Colin for working our way through all this.
  
\\
**[[cluster:0|Back]]**
  