\\
**[[cluster:0|Back]]**

====== [CLACReps] High Performance Cluster @ Wesleyan ======

General answers to questions posed by the CLACReps.

This wiki has much more detailed information scattered about, and I'll point to some relevant pages.  Click on the **Back** link above to go to the main page.  Our cluster resides on our internal VLAN, hence it is only accessible to non-Wesleyan users via Active Directory guest accounts and VPN.

You can view our cluster activities here:

  * **[[http://clumon-external.wesleyan.edu|External Cluster Monitor Mirror]]** //updates every 10 minutes//

===== HPC specs? =====

What type of cluster is it?  Our cluster is a Dell cluster that arrived completely racked.  Dell engineers performed the final configuration on-site, installing Platform/OCS.  This is a ROCKS-based cluster.

The cluster comprises 36 compute nodes, each essentially a Dell PowerEdge 1950 with dual quad-core Xeon processors (Xeon 5355 chips, 2x4 MB cache, 2.66 GHz, 1333 MHz FSB).  32 servers have 4 GB of 667 MHz memory (4x1 GB, dual-ranked DIMMs) and 4 servers have 16 GB of 667 MHz memory (8x2 GB, dual-ranked DIMMs).  That makes for a total of 36*8 = 288 cores.

There is one head node (which also runs the scheduler, Platform/Lava, soon to be upgraded to Platform/LSF 6.2), a PowerEdge 2950 with 2 GB of memory.  In addition, we have one IO node which is identical to our head node.  It is connected via dual 4 Gbps fiber cards (in failover mode) to our NetApp storage device.  5 TB of file systems are made available; see below.

Compute nodes run Red Hat Enterprise Linux WS4 while the head node runs Red Hat Enterprise Linux AS4.  Both run in x86_64 mode with a 2.6.9 kernel.
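
A quick way to confirm this on any node is from a shell; a minimal sketch (the exact release and kernel strings will differ slightly between the WS4 compute nodes and the AS4 head node):

<code>
# show the Red Hat Enterprise Linux release string
cat /etc/redhat-release

# show the running kernel and architecture (expect a 2.6.9-* x86_64 kernel)
uname -r
uname -m
</code>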


===== Queue Policies? =====

We currently operate under the pragma of "no limitations", which we can do since we are not yet experiencing saturation of resources.  Jobs seldom go into a pending state because of a lack of resources.  However, it has been our experience that 4 GB of memory per node (for a total of 8 cores) is not enough.  Since 8 jobs may be scheduled on these nodes, memory is in high demand.
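
Because memory rather than cores is the scarce resource, a job that needs more than its per-core share can ask the scheduler for it up front.  A minimal sketch, assuming standard LSF/Lava resource-requirement syntax (memory in MB) and a hypothetical executable my_program:

<code>
# request ~2 GB of memory so the scheduler does not overload a 4 GB node
# (rusage[mem=...] is the standard LSF resource requirement, expressed in MB)
bsub -q 16-lwnodes -R "rusage[mem=2000]" -o output.%J ./my_program
</code>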

Some queues reflect the internal network of our cluster.  There is one gigE Dell switch for the administrative software subnet (192.168.1.xxx), which uses each node's first NIC.  A higher-grade gigE switch (Cisco 7000) provides the gigE connectivity amongst all the nodes on what we name our "NFS" subnet (10.3.1.xxx); each node's second NIC is dedicated to NFS traffic to the IO node.  A third switch (Infiniband) connects 16 of the nodes together.  Hence we have what we call the lightweight node queues: "16-lwnodes" (gigE-enabled nodes) and "16-ilwnodes" (gigE- and Infiniband-enabled nodes).

Another queue, "04-hwnodes", comprises the 4 servers with the large memory footprint (16 GB each).  Each of these nodes is also connected to two Dell MD1000 storage arrays, giving it dedicated access to 7 15,000 RPM disks (striped, mirrored, RAID 0) for fast scratch space.

Besides those queues we have a Matlab queue, which limits the number of jobs to the licensed number of workers.  This Matlab installation uses the Distributed Computing Engine.

Four of the compute nodes allow our users ssh access.  These nodes also comprise the "debug" queues.  One queue, the "idle" queue, allocates jobs to any host not considered busy, regardless of the resources available.

<code>
[root@swallowtail web]# bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
debug            70  Open:Active       -    -    -    -     0     0     0     0
idebug           70  Open:Active       -    -    -    -     0     0     0     0
16-lwnodes       50  Open:Active       -    -    -    -    40     0    40     0
16-ilwnodes      50  Open:Active       -    -    -    -    13     0    13     0
04-hwnodes       50  Open:Active       -    -    -    -     0     0     0     0
matlab           50  Open:Active       8    8    -    8     1     0     1     0
molscat          50  Open:Active       -    -    -    2     0     0     0     0
gaussian         50  Open:Active       -    -    -    8     4     0     4     0
nat-test         50  Open:Active       -    -    -    -     0     0     0     0
idle             10  Open:Active       -    -    -    -    60     0    60     0
</code>
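
For reference, jobs are directed at these queues with bsub; a minimal sketch with hypothetical executables (the queue names are the ones listed above):

<code>
# serial job on the gigE lightweight nodes
bsub -q 16-lwnodes -o serial.%J.out ./my_serial_job

# parallel job requesting 8 slots on the Infiniband-enabled nodes
bsub -q 16-ilwnodes -n 8 -o parallel.%J.out ./my_parallel_job
</code>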

===== Software? =====

The list of installed software can be found here:  **[[cluster:28|User Guide and Manuals]]**

This page is considered "required reading" for new users.  We currently have roughly 50 accounts but seldom see more than a dozen active login sessions.


===== File systems, Quotas? =====

We currently do not enforce any disk quotas.

Home directories are spread over a dozen or so LUNs, each 1 TB in size ("thin provisioned").  The total disk space available for home directories is 4 TB.  This implies that a user could use up to 1 TB, as long as it is available.

/sanscratch is a 1 TB LUN that is shared by all the nodes for large scratch space.  There is also a local /localscratch file system on each node, provided by a single 80 GB hard disk; on the heavy-weight nodes this is roughly 230 GB of fast disk space.
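
To see how much of that scratch space is actually free, a quick check from any node (a sketch; the mount points are the ones named above):

<code>
# shared SAN scratch (1 TB LUN, available on all nodes)
df -h /sanscratch

# node-local scratch (~80 GB, ~230 GB on the heavy-weight nodes)
df -h /localscratch
</code>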

When a job is submitted for execution, the scheduler provides unique working directories in both /sanscratch and /localscratch, and also cleans them up afterwards.  Users are encouraged to use these areas and not perform extensive reads/writes in their home directories.
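
A minimal job-script sketch of that pattern, assuming the per-job directory is named after the LSF job ID ($LSB_JOBID); the queue, program, and file names are hypothetical:

<code>
#!/bin/bash
#BSUB -q 16-lwnodes
#BSUB -o myjob.%J.out

# work in the per-job scratch directory rather than the home directory
cd /sanscratch/$LSB_JOBID

# stage input in, run, copy results back to the home directory
cp ~/project/input.dat .
./my_program input.dat > results.out
cp results.out ~/project/
</code>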

Home directories are backed up twice daily using the NetApp snapshot capabilities.  Tivoli runs incremental backups each night to our tape storage device.


===== ...? =====

\\
**[[cluster:0|Back]]**