General answers to questions posed by the CLACReps.
This wiki may have much more detailed information scattered about, and I'll point to some relevant pages. Click on the Back link above to go to the main page. Our cluster resides on our internal VLAN and is therefore only accessible via Active Directory guest accounts and VPN for non-Wesleyan users.
You can view our cluster activities …
What type of cluster is it? Our cluster is a Dell cluster that arrived completely racked. Dell engineers performed the final configuration on-site, installing Platform/OCS. This is a ROCKS-based cluster.
The cluster consists of 36 compute nodes. Each node contains dual quad-core Xeon processors (Xeon 5355 chips, 2x4MB cache, 2.66GHz, 1333MHz FSB) and is basically a Dell PowerEdge 1950. 32 servers have 4 GB of 667MHz memory (4x1GB, dual-ranked DIMMs) and 4 servers have 16 GB of 667MHz memory (8x2GB, dual-ranked DIMMs). That makes for a total of 36*8 = 288 cores.
There is one head node, a PowerEdge 2950 with 2 GB of memory, which also runs the scheduler Platform/Lava (soon to be upgraded to Platform/LSF 6.2). In addition we have one IO node which is identical to our head node. It is connected via dual 4Gbps fiber cards (in failover mode) to our NetApp storage device. 5 TB of file system space is made available; see below.
Compute nodes run Red Hat Enterprise Linux WS4 while the head node runs Red Hat Enterprise Linux AS4. Both Linux versions run in x86_64 mode with a 2.6.9 kernel.
We currently operate under a policy of “no limitations”, which we can do since we are not yet experiencing saturation of resources. Jobs seldom go into a pending state because of a lack of resources. However, it has been our experience that 4 GB of memory per node is not enough: since 8 jobs may be scheduled on such a node (one per core), each job slot effectively gets only about 512 MB, so memory is in high demand.
Some queues reflect the internal network of our cluster. There is one gigE Dell switch for the administrative software subnet (192.168.1.xxx), which uses each node's first NIC. A higher grade gigE switch (Cisco 7000) provides the gigE connectivity amongst all the nodes on what we name our “NFS” subnet (10.3.1.xxx); each node's second NIC is dedicated to NFS traffic to the IO node. A third, Infiniband switch connects 16 of the nodes together. Hence we have what we call the light weight node queue “16-lwnodes” (gigE-enabled nodes) and the Infiniband light weight node queue “16-ilwnodes” (gigE- and Infiniband-enabled nodes).
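As an illustration, a parallel job can be steered to the Infiniband-enabled nodes simply by naming that queue at submission time. A minimal sketch using bsub, assuming a hypothetical wrapper script run_mpi.sh that launches the MPI program (the actual MPI launch command depends on the MPI installation in use):

  bsub -q 16-ilwnodes -n 16 -o mpi.%J.out ./run_mpi.sh

Here -q selects the queue, -n requests 16 job slots, and -o writes the job output to a file tagged with the job ID.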
Another queue, “04-hwnodes”, is comprised of the 4 servers with the large memory footprint (16 GB each). Each of these nodes is also connected to two Dell MD1000 storage arrays, giving each node dedicated access to 7 15,000 RPM disks (striped, mirrored, RAID 0) for fast scratch space.
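Memory-hungry jobs belong in this queue. A hedged sketch of such a submission, where the 8000 MB reservation and the program name are illustrative only (LSF rusage values are given in MB by default):

  bsub -q 04-hwnodes -R "rusage[mem=8000]" -o bigmem.%J.out ./big_memory_job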
In addition to those queues we have a Matlab queue, which limits the number of jobs based on the licensed number of workers. This Matlab installation uses the Distributed Computing Engine.
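A serial Matlab job could be handed to this queue roughly as follows; the script name my_analysis is hypothetical, and jobs that use the Distributed Computing Engine would instead request workers from within Matlab itself:

  bsub -q matlab -o matlab.%J.out matlab -nodisplay -r my_analysis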
Four of the compute nodes allow our users ssh access. These nodes also comprise the “debug” queues. One queue, the “idle” queue, allocates jobs to any host not considered busy, regardless of the resources available.
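For short test runs, a job can be submitted interactively to the debug queue so that its output comes straight back to the terminal. A minimal sketch (the program name is a placeholder):

  bsub -q debug -I ./my_test_program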
[root@swallowtail web]# bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
debug            70  Open:Active       -    -    -    -     0     0     0     0
idebug           70  Open:Active       -    -    -    -     0     0     0     0
16-lwnodes       50  Open:Active       -    -    -    -    40     0    40     0
16-ilwnodes      50  Open:Active       -    -    -    -    13     0    13     0
04-hwnodes       50  Open:Active       -    -    -    -     0     0     0     0
matlab           50  Open:Active       8    8    -    8     1     0     1     0
molscat          50  Open:Active       -    -    -    2     0     0     0     0
gaussian         50  Open:Active       -    -    -    8     4     0     4     0
nat-test         50  Open:Active       -    -    -    -     0     0     0     0
idle             10  Open:Active       -    -    -    -    60     0    60     0
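The per-queue limits behind this summary can be inspected with bqueues -l, and running jobs can be listed with bjobs; for example:

  bqueues -l matlab
  bjobs -u all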
The list of software installed can be found here: User Guide and Manuals
This access point is considered “required reading” for new users. We currently have roughly 50 accounts but seldom see more than a dozen active login sessions.
We currently do not enforce any disk quotas.
Home directories are spread over a dozen or so LUNs, each 1 TB in size (“thin provisioned”). The total disk space available for home directories is 4 TB. This implies that a single user could use up to 1 TB, as long as the space is available.
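Since no quotas are enforced, users are expected to keep an eye on their own consumption with the standard tools, for example:

  du -sh $HOME
  df -h $HOME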
/sanscratch is a 1 TB LUN that is shared by all the nodes for large scratch space. There is also a local /localscratch file system on each node, provided by a single 80 GB hard disk, except on the heavy weight nodes, where it offers roughly 230 GB of fast disk space.
When a job is submitted for execution, the scheduler provides unique working directories in both /sanscratch and /localscratch, and also cleans them up afterwards. Users are encouraged to use these areas and not perform extensive reads/writes in their home directories.
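A typical batch script therefore stages its data into the scratch area and copies results back before the directory is cleaned up. A sketch, assuming the per-job directories are named after the LSF job ID (e.g. /sanscratch/$LSB_JOBID; check the local setup) and using hypothetical file and program names:

  #!/bin/bash
  #BSUB -q 16-lwnodes
  #BSUB -o run.%J.out
  # stage input into the job's scratch directory
  cd /sanscratch/$LSB_JOBID
  cp ~/project/input.dat .
  # run, then copy results back to the home directory
  ~/project/my_program input.dat > results.out
  cp results.out ~/project/

Submit it with “bsub < myscript.sh” so the embedded #BSUB options are picked up.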
Home directories are backed up twice daily using the NetApp snapshot capabilities. Tivoli runs incremental backups each night to our tape storage device.
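If the snapshot directory is exposed to users (on NetApp filers it typically appears as a hidden .snapshot directory inside the home directory), an accidentally deleted file can often be recovered without staff intervention; the snapshot name below is only a placeholder:

  cp ~/.snapshot/nightly.0/myfile ~/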