Sharptail Cluster

A recycled head node name; seems appropriate.

The new hardware has been delivered, racked and stacked. The first priority was looking around while /home was copied from greentail:/home. This cluster consists of one head node (sharptail) and 5 compute nodes (n33-n37). The head node has a 48 TB disk array and 128 GB of memory. The compute nodes each have 256 GB of memory and dual 8-core chips. With so much memory, hyperthreading has been turned on, doubling the number of cores the operating system recognizes (so 32 cores per node). Each node also contains 4 GPUs. The entire cluster provides 1.3+ TB of memory and 20+ teraflops of computational power. That's almost 7x what we currently have. All these resources will be made available via the Lava scheduler later on.

What is a GPU-HPC cluster

Recess Period

July and August 2013 I'll call the “Recess! Stay&Play” period. Due to vacation days and all that, the final configuration will not be in place until later this summer. So sharptail is open for ssh access during this time. You may run whatever you want directly on these nodes. There is no scheduler. You're on your own. You can also ssh to the nodes from sharptail. I've advised the unix admins not to help out with any problems (they are busy enough), and the configuration is still foreign to me. So help each other out if possible.

  • ssh
    • then ssh to one of the nodes, see samples below
    • set up your environment like in a submit script, then run your program
  • Reboots may happen. I'll try to warn folks when.
  • Shell access will disappear in final production mode! (use greentail or swallowtail)
  • /home is still being populated, should finish some time Thursday Jul 11th at night


Sharptail is slated to become our file server for /home, taking over from greentail. The cutover will be the last step before going into production. Meanwhile the first sync from greentail to sharptail is about to finish, but refreshes will happen. When a refresh happens:

  • Files that are created on greentail are pushed to sharptail
  • Files that disappeared on greentail also disappear on sharptail
  • Files that were created on sharptail (and do not exist on greentail) will disappear!

So if you want to keep anything you created on sharptail, you need to copy it to greentail before a refresh happens. I suggest you create a ~/sharptail directory and work inside of that on sharptail. You can transfer files like so:

  • cp -rp /home/username/sharptail /mnt/greentail_home/username/
  • scp -rp ~/sharptail greentail:~

So in short: in the future sharptail:/home will be our active file system, while greentail:/home will become, say, greentail:/home_backup (inactive). They will be kept in sync and rsnapshot'ed on both disk arrays, so we have a better backup/restore strategy.


Sharptail will provide the users (and scheduler) with another 5 TB scratch file system. During this period it is only provided to the sharptail nodes (n33-n37). In the future it will provide this file system to all nodes except greentail nodes (n1-n32).

  • Please offload as much IO as possible from /home by staging your jobs in /sanscratch
  • For an example, see SAS and read the submit2 section.
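
The staging advice above could look roughly like the sketch below. It is an illustration only: the function name run_staged, the SCRATCH_ROOT variable and the per-job directory naming are assumptions made for this example, not part of the cluster's configuration.

```shell
# Hypothetical helper: stage a job in scratch space instead of /home.
# SCRATCH_ROOT defaults to /sanscratch (from this page); the per-job
# directory naming is an assumption made for this sketch.
run_staged() {
    # usage: run_staged <input_dir> <output_dir> <command> [args...]
    local root="${SCRATCH_ROOT:-/sanscratch}"
    local input_dir="$1"
    local output_dir="$2"
    shift 2
    local job_dir rc
    job_dir="$(mktemp -d "$root/${USER:-job}.XXXXXX")" || return 1

    # Copy the inputs to scratch so the heavy IO happens there.
    cp -rp "$input_dir"/. "$job_dir"/ || return 1

    # Run the command inside the scratch directory.
    ( cd "$job_dir" && "$@" )
    rc=$?

    # Copy everything back once; only clean up scratch if that worked.
    if mkdir -p "$output_dir" && cp -rp "$job_dir"/. "$output_dir"/; then
        rm -rf "$job_dir"
    else
        echo "copy back failed, results left in $job_dir" >&2
        return 1
    fi
    return "$rc"
}
```

A call might look like: run_staged ~/sharptail/job1 ~/sharptail/job1_out ./my_program (the directory and program names are placeholders).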


Without a scheduler, jobs may clash because too many of them are running at once. To avoid that, you need to find idle cores before you start jobs. This can be done for both CPU cores and GPUs.

To find idle CPU cores, ssh to a node, start 'top', then press the number one ('1'); the display then shows one line per core. In the sample below the cores are all idle, so you can start jobs.

top - 10:20:47 up 2 days, 55 min,  3 users,  load average: 0.04, 0.01, 0.00
Tasks: 766 total,   1 running, 765 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
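
If you would rather count than eyeball, per-core lines in the format shown above can be tallied with a small filter. This is a sketch: the function name count_idle_cores and the 95% idle threshold are assumptions, and it only reads text on standard input (how you capture the per-core lines from 'top' is up to you).

```shell
# Hypothetical helper: count idle cores from per-core lines in the format
# shown above ("CpuN : ... 99.3%id, ..."), read from standard input.
# The default 95% idle threshold is an assumption, not a site rule.
count_idle_cores() {
    local threshold="${1:-95}"
    awk -v min="$threshold" '
        /^Cpu[0-9]+/ {
            # Pull the number in front of "%id" out of the line.
            if (match($0, /[0-9.]+%id/)) {
                idle = substr($0, RSTART, RLENGTH - 3) + 0
                if (idle >= min) n++
            }
        }
        END { print n + 0 }
    '
}
```

The summary line that starts with "Cpu(s):" is ignored on purpose, since it is an average over all cores.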

To find idle GPUs, ssh to a node, then type 'gpu-info'. The display shows the utilization of each GPU; a GPU at 0 % is idle.

[hmeij@n33 bin]$ gpu-info
Device  Model           Temperature     Utilization
0       Tesla K20m      25 C            0 %
1       Tesla K20m      27 C            0 %
2       Tesla K20m      25 C            0 %
3       Tesla K20m      25 C            0 %
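
The same idea works for the GPU listing. The sketch below filters text in the gpu-info format shown above and prints the device IDs that look idle. The function name idle_gpus is made up, and treating 0 % utilization as "idle" is an assumption.

```shell
# Hypothetical helper: print the device IDs of idle GPUs, one per line,
# from text in the gpu-info format shown above, read from standard input.
# Treating 0 % utilization as "idle" is an assumption of this sketch.
idle_gpus() {
    awk '
        # Data rows start with a numeric device ID; the header row does not.
        $1 ~ /^[0-9]+$/ && $(NF) == "%" && $(NF - 1) + 0 == 0 { print $1 }
    '
}
```

On a node this could be used as: gpu-info | idle_gpus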

In both cases you do not need to target any specific core; the operating system will handle that part of the scheduling.


With hyperthreading on, the 5 nodes provide 160 cores. We need to reserve 20 cores for the GPUs (one per GPU), and let's reserve another 20 cores for the OS (4 per node). That still leaves 120 cores for regular jobs like you are used to on greentail. These 120 cores (24 per node) will show up later as a new queue on greentail/swallowtail, one that is fit for jobs that need much memory. On average, 256 GB per node minus 20 GB for the 4 GPUs minus 20 GB for the OS leaves 216 GB for 24 cores, or 9 GB per core.
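
For anyone who wants to check the budget, here is the arithmetic worked through with the figures quoted above. The variable names are made up for this sketch; only the input numbers come from this page.

```shell
# Core and memory budget from the figures quoted above.
nodes=5
cores_per_node=32          # 2 chips x 8 cores, doubled by hyperthreading
gpus_per_node=4
os_cores_total=20          # reserved for the OS across all nodes
mem_per_node_gb=256
gpu_mem_reserve_gb=20      # per node, for the 4 GPUs
os_mem_reserve_gb=20       # per node

total_cores=$(( nodes * cores_per_node ))
gpu_cores=$(( nodes * gpus_per_node ))                 # one core per GPU
job_cores=$(( total_cores - gpu_cores - os_cores_total ))
job_cores_per_node=$(( job_cores / nodes ))
job_mem_per_node_gb=$(( mem_per_node_gb - gpu_mem_reserve_gb - os_mem_reserve_gb ))
mem_per_core_gb=$(( job_mem_per_node_gb / job_cores_per_node ))

echo "total cores:          $total_cores"
echo "cores for jobs:       $job_cores ($job_cores_per_node per node)"
echo "memory for jobs/node: $job_mem_per_node_gb GB"
echo "memory per job core:  $mem_per_core_gb GB"
```

With these inputs the split comes out at 24 job cores per node and 216 GB / 24 = 9 GB per job core.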

Since there is no scheduler, you need to set up your environment and execute your program yourself. Here is an example of a program that normally runs on the imw queue. If your program involves MPI, you need to be a bit up to speed on what the Lava wrapper actually does for you.

First create the machines file, set up your environment by sourcing the appropriate files, start your program, and then use 'top' to watch the parallel processes start.

[hmeij@sharptail cd]$ cat mpi_machines                                         

[hmeij@sharptail cd]$ . /share/apps/intel/cce/10.0.025/bin/
[hmeij@sharptail cd]$ . /share/apps/intel/fce/10.0.025/bin/

[hmeij@sharptail cd]$ time /home/apps/openmpi/1.2+intel-10/bin/mpirun \
-x LD_LIBRARY_PATH -machinefile ./mpi_machines \
/share/apps/amber/9+openmpi-1.2+intel-9/exe/pmemd \
-O -i inp/ -p -c -ref &
[1] 3304

[hmeij@sharptail cd]$ ssh n33 top -b -n1 -u hmeij
top - 14:49:28 up 1 day,  5:24,  1 user,  load average: 0.89, 0.20, 0.06
Tasks: 769 total,   5 running, 764 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264635888k total,  2364236k used, 262271652k free,    44716k buffers
Swap: 31999992k total,        0k used, 31999992k free,   382224k cached

24348 hmeij     20   0  334m  58m 6028 R 100.0  0.0   0:17.23 pmemd
24345 hmeij     20   0  307m  44m 8816 R 100.0  0.0   0:17.20 pmemd
24346 hmeij     20   0  310m  42m 8824 R 98.3  0.0   0:17.22 pmemd
24347 hmeij     20   0  318m  48m 8004 R 98.3  0.0   0:17.19 pmemd
24353 hmeij     20   0 15552 1636  832 R  1.9  0.0   0:00.03 top
24344 hmeij     20   0 86828 2324 1704 S  0.0  0.0   0:00.01 orted
24352 hmeij     20   0  107m 1864  860 S  0.0  0.0   0:00.00 sshd

[hmeij@sharptail cd]$ ssh n34 top -b -n1 -u hmeij
top - 14:49:37 up 1 day,  2:40,  0 users,  load average: 1.89, 0.47, 0.16
Tasks: 766 total,   5 running, 761 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264635888k total,  2310176k used, 262325712k free,    29788k buffers
Swap: 31999992k total,        0k used, 31999992k free,   359596k cached

12198 hmeij     20   0  334m  61m 5328 R 99.8  0.0   0:25.88 pmemd
12200 hmeij     20   0  302m  33m 5368 R 99.8  0.0   0:25.88 pmemd
12201 hmeij     20   0  310m  40m 5352 R 99.8  0.0   0:25.88 pmemd
12199 hmeij     20   0  310m  39m 5372 R 97.8  0.0   0:25.87 pmemd
12205 hmeij     20   0 15552 1636  832 R  3.8  0.0   0:00.04 top
12197 hmeij     20   0 86828 2324 1704 S  0.0  0.0   0:00.01 orted
12204 hmeij     20   0  107m 1864  860 S  0.0  0.0   0:00.00 sshd
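
A machines file for a run like the one above could be generated with a small helper. This is a sketch under assumptions: the function name make_machines is made up, and the layout (one host name per line, repeated once per MPI process) is a common convention that this page does not confirm for mpi_machines.

```shell
# Hypothetical helper: print a machines file with one line per MPI process,
# repeating each host name <procs_per_node> times.
# This layout is an assumption; the page does not show mpi_machines.
make_machines() {
    # usage: make_machines <procs_per_node> <host> [host...]
    local procs="$1"
    shift
    local host i
    for host in "$@"; do
        i=0
        while [ "$i" -lt "$procs" ]; do
            printf '%s\n' "$host"
            i=$(( i + 1 ))
        done
    done
}
```

For four processes on each of two nodes: make_machines 4 n33 n34 > mpi_machines. Check that those cores are idle first.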


LAMMPS, Amber and NAMD have been compiled using Nvidia's toolkit. They are located in /cm/shared/apps.

Module files have been created for these apps and are automatically loaded upon login. For example:

[hmeij@sharptail ~]$ module list
Currently Loaded Modulefiles:
  1) cuda50/toolkit/5.0.35              3) namd/ibverbs-smp-cuda/2013-06-02   5) lammps/cuda/2013-01-27
  2) mvapich2/gcc/64/1.6                4) amber/gpu/13

Testing of GPUs at vendor sites may help get the idea of how to run GPU compiled code.

LAMMPS and Amber were compiled against mvapich2. They should be run with “mpirun_rsh -ssh -hostfile /path/to/hostfile -np # other_program_options”, where # is the number of processes.

LAMMPS GPU Testing … may offer some ideas

Sharptail example. The hostfile has only one line in it, with one node name. This allows LAMMPS to pick any idle GPU it finds, which is a potential clash problem. The link above shows how to target GPUs by ID.

[hmeij@sharptail sharptail]$ cat hostfile                                                      
[hmeij@sharptail sharptail]$ mpirun_rsh -ssh -hostfile ~/sharptail/hostfile \
-np 12 lmp_nVidia -sf gpu -c off -v g 2 -v x 32 -v y 32 -v z 64 -v t 100 <  \
unloading gcc module                                                                                                       
LAMMPS (31 May 2013)                                                  
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796                                                     
Created orthogonal box = (0 0 0) to (53.7471 53.7471 107.494)                                
  2 by 2 by 3 MPI processor grid                                                                         
Created 262144 atoms                                                                                       

- Using GPGPU acceleration for lj/cut:                                    
-  with 6 proc(s) per device.                                             
GPU 0: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision)     
GPU 1: Tesla K20m, 2496 cores, 4.3/4.7 GB, 0.71 GHZ (Mixed Precision)

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Initializing GPUs 0-1 on core 1...Done.
Initializing GPUs 0-1 on core 2...Done.
Initializing GPUs 0-1 on core 3...Done.
Initializing GPUs 0-1 on core 4...Done.
Initializing GPUs 0-1 on core 5...Done.

Setting up run ...
Memory usage per processor = 5.83686 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733676            0   -4.6133759   -5.0196742
     100   0.75875604   -5.7604958            0    -4.622366   0.19306017
Loop time of 0.431599 on 12 procs for 100 steps with 262144 atoms

Pair  time (%) = 0.255762 (59.2592)
Neigh time (%) = 4.80811e-06 (0.00111402)
Comm  time (%) = 0.122923 (28.481)
Outpt time (%) = 0.00109257 (0.253146)
Other time (%) = 0.051816 (12.0056)

Nlocal:    21845.3 ave 22013 max 21736 min
Histogram: 2 3 3 0 0 0 0 2 1 1
Nghost:    15524 ave 15734 max 15146 min
Histogram: 2 2 0 0 0 0 0 0 3 5
Neighs:    0 ave 0 max 0 min
Histogram: 12 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 5
Dangerous builds = 0

      GPU Time Info (average):
Neighbor (CPU):  0.0041 s.
GPU Overhead:    0.0429 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  31.11 MB.
CPU Driver_Time: 0.0405 s.
CPU Idle_Time:   0.2199 s.

Amber GPU Testing … may offer some ideas

NAMD was compiled with the built-in multi-node networking capabilities, including ibverbs support.

An example of running NAMD is below.

[microway@n33 namd-test]$ which charmrun

[microway@n33 namd-test]$ echo $NAMD_DIR

[microway@n33 ~]$ cat namd-machines
group main
host n33
host n34
host n35
host n36
host n37

[microway@n33 namd-test]$ charmrun $NAMD_DIR/namd2 +p8 ++nodelist ~/namd-machines +idlepoll apoa1/apoa1.namd 
Charmrun> IBVERBS version of charmrun

unloading gcc module
Charmrun> started all node programs in 1.333 seconds.
Converse/Charm++ Commit ID: v6.5.0-8-g61d76cf        
Trace: traceroot: /cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02//namd2
Charm++> scheduler running in netpoll mode.                              
CharmLB> Load balancer assumes all CPUs are same.                        
Charm++> Running on 5 unique compute nodes (32-way SMP).                 
Charm++> cpu topology info is gathered in 0.003 seconds.                 
Info: Running on 8 processors, 8 nodes, 5 physical nodes.                      
Info: CPU topology information available.                                      
Info: Charm++/Converse parallel runtime startup completed at 0.0095768 s       
Pe 3 physical rank 0 binding to CUDA device 3 on n36: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Pe 4 physical rank 0 binding to CUDA device 0 on n37: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Pe 5 physical rank 1 binding to CUDA device 2 on n33: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Pe 0 physical rank 0 binding to CUDA device 0 on n33: 'Tesla K20m'  Mem: 4799MB  Rev: 3.5
Info: 289.738 MB of memory in use based on /proc/self/stat  

[microway@n33 ~]$ gpu-info
Device  Model           Temperature     Utilization
0       Tesla K20m      29 C            50 %
1       Tesla K20m      27 C            0 %
2       Tesla K20m      28 C            51 %
3       Tesla K20m      25 C            0 %
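
The nodelist file used above follows a simple pattern, so it can be generated for any subset of nodes. The helper name make_nodelist is made up for this sketch; the output format simply mirrors the namd-machines file shown above.

```shell
# Hypothetical helper: print a charmrun nodelist in the same format as
# the namd-machines file shown above ("group main" plus one host per line).
make_nodelist() {
    # usage: make_nodelist <host> [host...]
    local host
    printf 'group main\n'
    for host in "$@"; do
        printf 'host %s\n' "$host"
    done
}
```

For example: make_nodelist n33 n34 > ~/namd-machines would limit a run to two nodes.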

Hint: Look in /home/microway for sample runs of all GPU-compiled software.


