A recycled head node name; seems appropriate.
The new hardware has been delivered and racked & stacked. First priority was looking around while /home was copied from greentail:/home. This cluster comprises one head node (sharptail) and 5 compute nodes (n33-n37). The head node has a 48 TB disk array and 128 GB of memory. The compute nodes each have 256 GB of memory and dual 8-core chips. With so much memory, hyperthreading has been turned on, doubling the number of cores the operating system recognizes (so 32 cores per node). Each node also contains 4 GPUs. The entire cluster provides 1.3+ TB of memory and 20+ teraflops of computational power. That's almost 7x what we currently have. All these resources will be made available via the Lava scheduler later on.
July and August 2013 I'll call the "Recess! Stay&Play" period. Due to vacation days and all that, final configuration will not be achieved until later this summer. So sharptail is open for ssh access during this time. You may run whatever you want directly on these nodes. There is no scheduler; you're on your own. You can also ssh to the nodes from sharptail. I've advised the unix admins not to help out with any problems (they are busy enough), and the configuration is still foreign to me. So help each other out if possible.
Sharptail is slated to become our file server for /home, taking over from greentail. The cutover will be the last step before going into production. Meanwhile, the first sync from greentail to sharptail is about to finish, but refreshes will happen. When a refresh happens, sharptail:/home is overwritten with the current contents of greentail:/home, so anything that exists only on sharptail is lost.
So it's important that if you want to keep stuff that lives on sharptail, you copy it to greentail before a refresh happens. I suggest you create a ~/sharptail directory and work inside of that on sharptail. You can then transfer files back to greentail with rsync or scp.
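A minimal sketch of such a transfer, run from sharptail (the ~/sharptail layout follows the suggestion above; the file name results.dat is just a placeholder):

# copy your sharptail work area back to your home directory on greentail
# run this before a refresh, or anything that exists only on sharptail will be lost
rsync -av ~/sharptail/ greentail:~/sharptail/

# or copy individual files with scp (results.dat is a placeholder name)
scp ~/sharptail/results.dat greentail:~/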
So, in short: in the future sharptail:/home will be our active file system, while greentail:/home becomes, say, greentail:/home_backup (inactive). They will be kept in sync and rsnapshot'ed on both disk arrays so we have a better backup/restore strategy.
Sharptail will also provide users (and the scheduler) with another 5 TB scratch file system. During this period it is available only on the sharptail nodes (n33-n37). In the future it will be provided to all nodes except the greentail nodes (n1-n32).
Without a scheduler, jobs may clash and too many jobs may end up running on the same node. To avoid that, find idle cores before you start jobs. This can be done for both CPU cores and GPUs.
To find idle CPU cores, ssh to a node, start 'top', then press the number one ('1') … the display then shows one line per core. In the example below all cores are idle, so you can start jobs.
top - 10:20:47 up 2 days, 55 min,  3 users,  load average: 0.04, 0.01, 0.00
Tasks: 766 total,   1 running, 765 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
...
To find idle GPUs, ssh to a node and type 'gpu-info'. The display shows which GPUs, if any, are idle (0 % utilization).
[hmeij@n33 bin]$ gpu-info
====================================================
Device  Model          Temperature   Utilization
====================================================
0       Tesla K20m     25 C           0 %
1       Tesla K20m     27 C           0 %
2       Tesla K20m     25 C           0 %
3       Tesla K20m     25 C           0 %
====================================================
In both cases you do not need to target any specific core; the operating system handles that part of the scheduling.
With hyperthreading on, the 5 nodes provide 160 cores. We need to reserve 20 cores for the GPUs (one per GPU, 4 per node), and let's reserve another 20 cores for the OS (4 per node). That still leaves 120 cores for regular jobs like you are used to on greentail. These 120 cores (24 per node) will show up later as a new queue on greentail/swallowtail, one that is fit for jobs that need a lot of memory: 256 GB per node minus 20 GB for the 4 GPUs minus 20 GB for the OS leaves about 216 GB per node, or roughly 9 GB per job core.
Since there is no scheduler, you need to set up your environment and execute your program yourself. Here is an example of a program that normally runs on the imw queue. If your program involves MPI, you need to be a bit up to speed on what the Lava wrapper normally does for you.
First create the machines file, set up your environment by sourcing the appropriate files, start your program, and monitor the parallel processes with 'top'.
[hmeij@sharptail cd]$ cat mpi_machines
n33
n33
n33
n33
n34
n34
n34
n34

[hmeij@sharptail cd]$ . /share/apps/intel/cce/10.0.025/bin/iccvars.sh
[hmeij@sharptail cd]$ . /share/apps/intel/fce/10.0.025/bin/ifortvars.sh

[hmeij@sharptail cd]$ time /home/apps/openmpi/1.2+intel-10/bin/mpirun \
  -x LD_LIBRARY_PATH -machinefile ./mpi_machines \
  /share/apps/amber/9+openmpi-1.2+intel-9/exe/pmemd \
  -O -i inp/mini.in -p 1g6r.cd.parm -c 1g6r.cd.randions.crd.1 -ref 1g6r.cd.randions.crd.1 &
[1] 3304

[hmeij@sharptail cd]$ ssh n33 top -b -n1 -u hmeij
top - 14:49:28 up 1 day,  5:24,  1 user,  load average: 0.89, 0.20, 0.06
Tasks: 769 total,   5 running, 764 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264635888k total,   2364236k used, 262271652k free,    44716k buffers
Swap:  31999992k total,         0k used,  31999992k free,   382224k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
24348 hmeij     20   0  334m  58m 6028 R 100.0  0.0   0:17.23 pmemd
24345 hmeij     20   0  307m  44m 8816 R 100.0  0.0   0:17.20 pmemd
24346 hmeij     20   0  310m  42m 8824 R  98.3  0.0   0:17.22 pmemd
24347 hmeij     20   0  318m  48m 8004 R  98.3  0.0   0:17.19 pmemd
24353 hmeij     20   0 15552 1636  832 R   1.9  0.0   0:00.03 top
24344 hmeij     20   0 86828 2324 1704 S   0.0  0.0   0:00.01 orted
24352 hmeij     20   0  107m 1864  860 S   0.0  0.0   0:00.00 sshd

[hmeij@sharptail cd]$ ssh n34 top -b -n1 -u hmeij
top - 14:49:37 up 1 day,  2:40,  0 users,  load average: 1.89, 0.47, 0.16
Tasks: 766 total,   5 running, 761 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264635888k total,   2310176k used, 262325712k free,    29788k buffers
Swap:  31999992k total,         0k used,  31999992k free,   359596k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
12198 hmeij     20   0  334m  61m 5328 R  99.8  0.0   0:25.88 pmemd
12200 hmeij     20   0  302m  33m 5368 R  99.8  0.0   0:25.88 pmemd
12201 hmeij     20   0  310m  40m 5352 R  99.8  0.0   0:25.88 pmemd
12199 hmeij     20   0  310m  39m 5372 R  97.8  0.0   0:25.87 pmemd
12205 hmeij     20   0 15552 1636  832 R   3.8  0.0   0:00.04 top
12197 hmeij     20   0 86828 2324 1704 S   0.0  0.0   0:00.01 orted
12204 hmeij     20   0  107m 1864  860 S   0.0  0.0   0:00.00 sshd
LAMMPS, Amber and NAMD have been compiled using Nvidia's toolkit. They are located in /cm/shared/apps.
Module files have been created for these apps and are automatically loaded upon login. For example:
[hmeij@sharptail ~]$ module list
Currently Loaded Modulefiles:
  1) cuda50/toolkit/5.0.35             3) namd/ibverbs-smp-cuda/2013-06-02   5) lammps/cuda/2013-01-27
  2) mvapich2/gcc/64/1.6               4) amber/gpu/13
The GPU testing done at the vendor sites (see the links below) may help give you an idea of how to run GPU-compiled code.
LAMMPS and Amber were compiled against mvapich2. They should be run with "mpirun_rsh -ssh -hostfile /path/to/hostfile -np N program program_options"; a sketch of the general pattern follows.
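A minimal sketch of that pattern, assuming the modules above have put mpirun_rsh and the GPU binaries on your PATH; the hostfile contents, the rank count, and 'your_gpu_binary' with its options are placeholders:

# hostfile: one node name per line; repeating a name places more MPI ranks on that node
cat > ~/sharptail/hostfile <<EOF
n33
n33
EOF

# launch with mvapich2's mpirun_rsh over ssh; -np sets the total number of MPI ranks
mpirun_rsh -ssh -hostfile ~/sharptail/hostfile -np 2 your_gpu_binary your_program_options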
Lammps GPU Testing … may give you some ideas.
Sharptail example. The hostfile has only one line, containing a single node name. This lets LAMMPS pick any idle GPU it finds, which is a potential clash problem; the link above shows how to target GPUs by ID.
[hmeij@sharptail sharptail]$ cat hostfile
n34

[hmeij@sharptail sharptail]$ mpirun_rsh -ssh -hostfile ~/sharptail/hostfile \
  -np 12 lmp_nVidia -sf gpu -c off -v g 2 -v x 32 -v y 32 -v z 64 -v t 100 < \
  ~/sharptail/in.lj.gpu
unloading gcc module
LAMMPS (31 May 2013)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (53.7471 53.7471 107.494)
  2 by 2 by 3 MPI processor grid
Created 262144 atoms

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
-  with 6 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision)
GPU 1: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Initializing GPUs 0-1 on core 1...Done.
Initializing GPUs 0-1 on core 2...Done.
Initializing GPUs 0-1 on core 3...Done.
Initializing GPUs 0-1 on core 4...Done.
Initializing GPUs 0-1 on core 5...Done.

Setting up run ...
Memory usage per processor = 5.83686 Mbytes
Step         Temp         E_pair       E_mol        TotEng       Press
       0         1.44    -6.7733676            0   -4.6133759   -5.0196742
     100   0.75875604    -5.7604958            0    -4.622366   0.19306017
Loop time of 0.431599 on 12 procs for 100 steps with 262144 atoms

Pair  time (%) = 0.255762 (59.2592)
Neigh time (%) = 4.80811e-06 (0.00111402)
Comm  time (%) = 0.122923 (28.481)
Outpt time (%) = 0.00109257 (0.253146)
Other time (%) = 0.051816 (12.0056)

Nlocal:    21845.3 ave 22013 max 21736 min
Histogram: 2 3 3 0 0 0 0 2 1 1
Nghost:    15524 ave 15734 max 15146 min
Histogram: 2 2 0 0 0 0 0 0 3 5
Neighs:    0 ave 0 max 0 min
Histogram: 12 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 5
Dangerous builds = 0

---------------------------------------------------------------------
      GPU Time Info (average):
---------------------------------------------------------------------
Neighbor (CPU):  0.0041 s.
GPU Overhead:    0.0429 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  31.11 MB.
CPU Driver_Time: 0.0405 s.
CPU Idle_Time:   0.2199 s.
---------------------------------------------------------------------
Amber GPU Testing … may give you some ideas.
Note: I ran out of time to get an example running, but it should follow the LAMMPS approach above pretty closely. The binary is /cm/shared/apps/amber/amber12/bin/pmemd.cuda.MPI; an untested sketch follows.
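This is only a sketch of what such a run might look like, following the mpirun_rsh pattern used for LAMMPS; the input, topology, coordinate and output file names (mdin, prmtop, inpcrd, mdout) are Amber's default placeholders, not files that ship with an example:

# hostfile with a single node name; all ranks, and the GPUs they grab, stay on n33
echo n33 > ~/sharptail/hostfile

# pmemd.cuda.MPI typically uses one GPU per MPI rank, so -np 2 would engage two of the node's GPUs
mpirun_rsh -ssh -hostfile ~/sharptail/hostfile -np 2 \
  /cm/shared/apps/amber/amber12/bin/pmemd.cuda.MPI \
  -O -i mdin -p prmtop -c inpcrd -o mdout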
NAMD was compiled with the built-in multi-node networking capabilities, including ibverbs support.
An example of running NAMD is below.
[microway@n33 namd-test]$ which charmrun
/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/charmrun

[microway@n33 namd-test]$ echo $NAMD_DIR
/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/

[microway@n33 ~]$ cat namd-machines
group main
  host n33
  host n34
  host n35
  host n36
  host n37

[microway@n33 namd-test]$ charmrun $NAMD_DIR/namd2 +p8 ++nodelist ~/namd-machines +idlepoll apoa1/apoa1.namd
Charmrun> IBVERBS version of charmrun
unloading gcc module
Charmrun> started all node programs in 1.333 seconds.
Converse/Charm++ Commit ID: v6.5.0-8-g61d76cf
Trace: traceroot: /cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02//namd2
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 5 unique compute nodes (32-way SMP).
Charm++> cpu topology info is gathered in 0.003 seconds.
Info: Running on 8 processors, 8 nodes, 5 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0095768 s
Pe 3 physical rank 0 binding to CUDA device 3 on n36: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Pe 4 physical rank 0 binding to CUDA device 0 on n37: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Pe 5 physical rank 1 binding to CUDA device 2 on n33: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Pe 0 physical rank 0 binding to CUDA device 0 on n33: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Info: 289.738 MB of memory in use based on /proc/self/stat
...etc

[microway@n33 ~]$ gpu-info
====================================================
Device  Model          Temperature   Utilization
====================================================
0       Tesla K20m     29 C          50 %
1       Tesla K20m     27 C           0 %
2       Tesla K20m     28 C          51 %
3       Tesla K20m     25 C           0 %
====================================================
Hint: look in /home/microway for sample runs of all the GPU-compiled software.