The purpose of this testing is to find out how fast the storage systems respond, whether directly attached to compute nodes or attached over the network via gigabit ethernet or infiniband (SDR via queue imw or QDR via queue hp12). On infiniband interconnects we use IPoIB (IP traffic over infiniband), which in theory might be 3-4 times faster than ethernet.
Nothing beats directly attached storage, of course (scenario: fastlocal.dell.out below): the disk arrays attached to the compute nodes in the ehwfd queue. Each node is presented with 230 GB of dedicated disk space provided by seven 10K disks using RAID 0 (all drives read and write simultaneously). The IOZone suite finished in an hour.
However, that queue may be a bottleneck (there are only 4 compute nodes in the ehwfd queue), or perhaps 230 GB is not enough (for 8 job slots). One alternative is to use MYSANSCRATCH in your submit scripts. MYSANSCRATCH refers to a directory the scheduler creates for you at /sanscratch/JOBPID, on a RAID 5 filesystem of 5 TB provided by 5 disks spinning at 7.2K RPM. The IOZone suite finished in 2 hrs 45 mins (scenario: san.hp.out).
For an example of using MYSANSCRATCH, look at the bottom of this page. You will have to stage your data in the directory provided and copy the results back to your home directory when finished. The scheduler will remove the directory.
Having tested the memory performance of our clusters using Linpack (View Results), what about file system access performance? There are many variables at play in this area, so a higher-level view is more appropriate than an overly detailed one.
To get comparable numbers, I chose the IOZone package, which seems to be commonly used for this type of activity. IOZone performs many different tests including read, re-read, write, re-write, read-and-write, random mix, backwards reads and a few others. The whole mix might then be an appropriate comparative standard. As details emerge, we could focus on the tests that best reflect our environment; probably random mix.
IOZone was compiled for x86 64-bit Linux and staged in a tarball. That tarball was copied to the disk housing the file system in question and unpacked, and the vanilla out-of-the-box “rule set” was invoked with 'time ./iozone -a -g 12G > output.out'. The results were then saved and graphed. The upper bound on the test file size was set to 12 GB because that is the memory footprint of the nodes across cluster greentail. I did not raise the file size limit above the memory footprint to avoid introducing another variable. You can read all about it: External Link
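The 12 GB cap follows from the memory size of the nodes. Here is a minimal sketch of deriving that '-g' value from a node's memory instead of hard-coding it. It assumes a Linux /proc/meminfo; the helper names mem_limit_gb and node_mem_kb are hypothetical and not part of IOZone.

```shell
#!/bin/bash
# Hypothetical helper: round a memory size given in kB to the nearest
# whole GB (1048576 kB). The kernel reports slightly less than the
# installed memory, hence rounding instead of truncating.
mem_limit_gb() {
    echo $(( ($1 + 524288) / 1048576 ))
}

# Read MemTotal (in kB) from /proc/meminfo; assumes a Linux node.
node_mem_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Example (not executed here): build the same command line as on this page.
# limit="$(mem_limit_gb "$(node_mem_kb)")G"
# time ./iozone -a -g "$limit" > output.out
```

Keeping the limit tied to the memory size means nodes with a different memory footprint would be tested up to their own bound rather than greentail's.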
Some of the tests IOZone performs put quite a load on the host (I observed a single invocation generating a load of 6), so I ran IOZone with the LSF/Lava scheduler flag '-x', meaning exclusive use, so that no other programs would interfere.
So let's start with cluster petaltail/swallowtail.
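A sketch of how such an exclusive run could be submitted, reusing the #BSUB directives from the Gaussian example on this page. The queue name, job name, the TARGET_DIR and IOZONE variables and the run_iozone helper are assumptions for illustration, not the script that was actually used.

```shell
#!/bin/bash
#BSUB -q ehw
#BSUB -J iozone
#BSUB -o out
#BSUB -e err
# exclusive use of the node so no other job interferes
#BSUB -x

# Hypothetical helper: run the IOZone auto suite inside directory $1,
# capped at maximum file size $2, and report elapsed wall-clock seconds.
# IOZONE may point at another binary; it defaults to ./iozone in $1.
run_iozone() {
    start=$(date +%s)
    ( cd "$1" && ${IOZONE:-./iozone} -a -g "$2" > output.out ) || return 1
    end=$(date +%s)
    echo "elapsed seconds: $(( end - start ))"
}

# Only start the benchmark when running under the scheduler.
if [ -n "${LSB_JOBID:-}" ]; then
    run_iozone "${TARGET_DIR:-/localscratch}" 12G
fi
```

The elapsed time is taken with date rather than the time keyword so that it lands in the job's regular output file.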
The compute nodes have a single 80 GB 7.2K RPM disk containing a /localscratch Linux file system. IOZone took 6+ hours to finish all the tests. So: local disk, one spindle, 4 year old hardware, no RAID. Used one of the ehw queue nodes. So how do the fast disks on queue ehwfd compare?
The compute nodes in the ehwfd queue have a disk array directly attached to them via iSCSI. Each host has dedicated access to 230 GB provided by seven 36 GB 15K RPM disks presented as /localscratch. So: local disks, 7 spindles, 4 year old hardware, RAID 0. All seven disks working together at high speeds. This is probably the best IOZone performance we'll attain.
Our NetApp filer (filer3) provides 5 TB of home directory space, which is the same volume as /sanscratch, served up via an NFS mount. We have now added a network component: IOZone performs its tests against a network mounted file system. The volume containing /sanscratch is composed of 24 1TB disks at 7.2K RPM. The aggregate holding this volume also holds other volumes. So: network NFS volume, 24 spindles, RAID 50 (I believe). No surprise, it is slow. It is about one third slower than the single local disk, which is another surprise.
Then let's look at cluster greentail.
Like those in the petaltail cluster, cluster greentail's compute nodes sport a single 160 GB disk spinning at 7.2K RPM. As above, /localscratch is a Linux file system. So: local disk, one spindle, new hardware, no RAID. Performance is double that of the petaltail nodes, which is probably related to disk caching.
The head node on cluster greentail has a directly attached smart disk array connected via iSCSI. A logical volume of 24 1TB disks, spinning at 7.2K RPM, holds a volume of 5 TB presented to compute nodes as an NFS mount, /sanscratch. To add another variable, the NFS mount goes over an infiniband switch; all previous examples used gigabit ethernet switches. This is referred to as IPoIB and operates at roughly 3x gigE, although that depends on a lot of things. So: network NFS volume over infiniband, 24 spindles, RAID 6. Surprisingly, it beats the single-spindle local disk example above by roughly 20%.
On cluster greentail, a separate logical volume presents /home. This volume is composed of 12 1TB disks at 7.2K RPM. The NFS mount across infiniband is the same as above. Note that the disks used for /home are different from those for /sanscratch. As expected, it falls slightly short of the sanscratch volume's performance, but not by much. However, as users exercise the /home volume, this gap may grow.
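To put the "roughly 3x gigE" figure in perspective, here is a back-of-the-envelope sketch of how long moving one 12 GB test file would take at each nominal rate. It assumes decimal gigabytes, ideal line rates and no protocol, NFS or disk overhead, so real transfers will be slower; the function name transfer_seconds is hypothetical.

```shell
# Hypothetical helper: seconds needed to move $1 gigabytes at $2 gigabit/s
# (8 bits per byte, integer arithmetic, no overhead modeled).
transfer_seconds() {
    echo $(( $1 * 8 / $2 ))
}

transfer_seconds 12 1   # gigabit ethernet
transfer_seconds 12 3   # IPoIB at the rough 3x figure
```

This prints 96 and 32, which is the best case the network alone allows; the measured IOZone run times also include the disks behind the NFS server.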
* home.dell.out real (4 hours and 8 mins)
This run was against greentail's /home, mounted via the gigabit ethernet Force 10 switch on the petaltail/swallowtail cluster (running on node c28). So there is just a one hour penalty versus running locally on greentail. Not bad at all; this will seriously speed up jobs on the Dell cluster.
IOZone generates lots of interesting graphs, whose interpretation still eludes me somewhat. But some graphs clearly show where anomalies exist: at sudden thresholds the performance starts to nosedive.
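The wall-clock times quoted on this page can be put side by side with a little arithmetic. This is a sketch only: the helper names are hypothetical, and the "about an hour" figure for fastlocal.dell.out is taken as exactly 60 minutes.

```shell
# Hypothetical helpers for comparing the run times quoted on this page.
to_minutes() {            # $1 = hours, $2 = minutes
    echo $(( $1 * 60 + $2 ))
}
ratio_pct() {             # $1 = run time, $2 = baseline; result in percent
    echo $(( $1 * 100 / $2 ))
}

base=$(to_minutes 1 0)    # fastlocal.dell.out, about an hour
san=$(to_minutes 2 45)    # san.hp.out
home=$(to_minutes 4 8)    # home.dell.out
echo "san.hp.out:    $(ratio_pct "$san" "$base")% of the fast local run"
echo "home.dell.out: $(ratio_pct "$home" "$base")% of the fast local run"
```

With these inputs the two lines report 275% and 413%, so the network-mounted scenarios took roughly 2.75 and 4.1 times as long as the directly attached array.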
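One way to locate such thresholds without reading the graphs is to scan the raw output for sudden drops. The sketch below assumes the usual layout of IOZone's auto mode output, where each data row starts with the file size in KB and the record length, followed by one throughput column per test (write is then column 3). The find_drops helper and the "less than half of the previous value" rule are assumptions for illustration.

```shell
# Hypothetical helper: for record length $2, report every file size at
# which the throughput in column $3 falls below half of the value
# measured at the previous file size. $1 is the IOZone output file.
find_drops() {
    awk -v reclen="$2" -v col="$3" '
        # data rows start with an integer file size; skip headers and text
        $1 ~ /^[0-9]+$/ && $2 == reclen && NF >= col {
            if (prev > 0 && $col < prev / 2)
                printf "%s KB: %s -> %s KB/s\n", $1, prev, $col
            prev = $col
        }' "$1"
}

# Example (not executed here): write throughput at 64 KB records.
# find_drops output.out 64 3
```

Each reported line names the file size at which the drop occurred, which can then be compared against cache and memory sizes on the node.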
Using MYSANSCRATCH with gaussian jobs (you can use any queue but hp12 will be the fastest):
#!/bin/bash
#BSUB -q hp12
#BSUB -o out
#BSUB -e err
#BSUB -J test
# job slots: change both lines, also inside gaussian.com
#BSUB -n 8
#BSUB -x

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH

# cd to remote working dir
cd $MYSANSCRATCH
pwd

# environment
export GAUSS_SCRDIR="$MYSANSCRATCH"
export g09root="/share/apps/gaussian/g09root"
. $g09root/g09/bsd/g09.profile

# stage input data
rm -rf ~/gaussian/err ~/gaussian/out*
cp ~/gaussian/gaussian.com .

# run
time g09 < gaussian.com > gaussian.log

# save results
cp gaussian.log ~/gaussian/output.$LSB_JOBID