
Blue Sky Studios

Hardware

We have 4 racks, of which 3 are powered up, all on utility power including the head/login node. The racks are surprisingly cool compared to our Dell cluster. Some digging revealed that the AMD Opteron chips cycle down to 1 GHz when idle instead of running at 2.4 GHz all the time (you can observe this in /proc/cpuinfo).
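
For example, you can watch the effect of this frequency scaling on a node directly:

# reported clock speed per core; drops to roughly 1000 MHz when the node is idle
grep "cpu MHz" /proc/cpuinfo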

If you want to use the ProCurve switches you need to power up the top two shelves within each rack or use an alternate source of power. Access can be established via a serial connection and a program like HyperTerminal; use the default settings with COM1, a 9600 baud rate, and hardware flow control set to None. Once the connection is made, press Enter and type 'menu'. Give the unit an IP on the provision or data subnet and access the web GUI via a browser (this assumes your laptop is also on the same subnet).

We wanted to separate the data traffic (NFS) from the software management and MPI traffic, so we leverage both Ethernet ports on each blade. To do that we changed the cabling. In our setup the top ProCurve switch is always the provision switch (192.168.1.y/255.255.255.0) and the bottom switch is the data switch (10.10.100.y/255.255.0.0). Port 48 of each switch cascades horizontally into the corresponding switch in the next rack, so that all 3 ProCurve switches across the racks form one network, provision or data respectively.

It's important to note that the management software (Project Kusu, see below) assumes eth0 is the provision network (192.168) and eth1 is on your domain (like wesleyan.edu or pace.edu). Implemented that way, each node can be reached from the outside, which is not what we wanted in our case.

We bought 52 three-foot CAT 6 Ethernet cables for each rack. The original purple cables, connecting blade to switch in the top two shelves within a rack, go to the bottom Ethernet port on each blade (eth0). In the bottom two shelves, the purple cables go to the top Ethernet port (eth1). The remaining blade Ethernet ports were then connected with the three-foot cables, so each blade ends up connected to both the top and the bottom switch. The math does not work out smoothly: 4 shelves with 13 blades is 52 eth0 and 52 eth1 connections, but each switch has 48 ports (minus the uplink port), so a few blades in each rack are left unconnected.

Our storage is provided by one of our NetApp filers (a 5 TB volume) via NFS. The filer is known as filer3a or filer13a and sits on our internal private network with IPs in the 10.10.0.y/255.255.0.0 range. Two Ethernet cables, link aggregated, connect our Dell cluster data switch to this private network (hence we have failover and possibly a 2 Gbit pipe). For simplicity's sake, we connected the first ProCurve switch into the Dell data switch rather than running more cables to the private network switches for the BSS cluster. This means each blade mounts the filer file system (home directories) directly off the NetApp filer over the private network.
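
As a sketch, the relevant mount on a blade looks roughly like the following /etc/fstab entry; the export path /vol/home is an assumption, substitute whatever volume the filer actually exports:

# hypothetical NFS mount of the filer home directories over the private network
filer3a:/vol/home   /home   nfs   rw,hard,intr   0 0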

Our head node has a similar setup (provision and data ports). This means the BSS cluster sits entirely on the private network and is not reachable from our domain wesleyan.edu. Users must first log in to the head/login nodes of the Dell cluster and then reach the BSS head/login node via ssh keys (no passwords). This has worked out, but with only one cluster and only two ports on the head node, there needs to be a connection to the outside world (for example, eth1 could become the connection to the external world, in which case the storage must be mounted over this connection as well; eth0 must stay on the 192.168 provision network).

Software

For our operating system we chose CentOS 5.3 and burned the ISO images to CD-ROM. For our management software we chose Project Kusu, which can be found at http://www.hpccommunity.org/. Project Kusu is the open source counterpart of Platform.com's OCS (now known as PCM) software stack, a ROCKS-based but enhanced commercial version (which we run on the Dell cluster). For our scheduler we chose Lava, also found at that site, which is the open source counterpart of Platform.com's LSF scheduler. Monitoring tools are available there as well, so in addition to the Kusu Installer and Lava kits we also burned the ISO images for Ganglia, NTop and Cacti to CD-ROM.

Once you have all these burned to CD-ROM, you are ready to step through 12 installation screens which are fairly straightforward. The screens are described at http://www.hpccommunity.org/section/kusu-45/ along with Installation and Overview guides. Boot a selected blade (this will become the installer node, also referred to as the head or login node) off the Kusu Installer CD-ROM (in the BIOS, specify booting off the USB device first). Provide the network, root account, local hard disk, etc. information when prompted. Towards the last step Kusu will ask for the kits you want installed; feed it the CentOS, Lava, Ganglia, NTop and Cacti kits. After this step Kusu will finish the installation and reboot. One customization we inserted in this process is a new partition, /localscratch, of about 50 GB.

After the reboot, Kusu will have created a /depot directory with the CentOS repository inside it. It can be manipulated with repoman (for example, take a snapshot before you change anything). Configuration information is loaded into PostgreSQL databases. A DHCP server is started, listening on the provision network. In /opt you'll find GNU compilations of many MPI flavors, including OpenMPI, and a working installation of Lava can be queried (bhosts, bqueues, etc). Ganglia, NTop and Cacti will also be running and monitoring your installer node.
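
A few quick sanity checks on the installer node after that first reboot (the dhcpd init script name below is the stock CentOS one):

ls /depot                    # CentOS repository created by Kusu
/etc/init.d/dhcpd status     # DHCP server listening on the provision network
ls /opt                      # MPI flavors (OpenMPI etc.) installed by the kits
bhosts; bqueues              # Lava is up and answering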

The next step is optional, but I did it because I wanted my node IPs to be in a certain range and to increment downwards, for example starting at 192.168.1.254/10.10.100.254 with a step of -1. We also shortened the host naming convention to something simple like bss000, iterating by a step of +1. With the netedit and ngedit commands, copy the configurations for the compute nodes and then customize those settings. These templates can then be associated with the blades when imaging. You may also want to scan the selected rpm packages and add software from the operating system depot, such as vim and emacs (annoyingly, they are not selected by default).

Once you have the templates in place, start the command addhost on the installer node (it will take over the console) and select the appropriate node group template when it starts. Power on your first blade, enter the BIOS while booting, make sure it tries to boot off the network cards first (rather than disk/CD-ROM), and boot. The blade will broadcast its first MAC address (eth0's) across the provision network; addhost will register it, assign the first 192.168 IP, and send the blade a kickstart file, and the blade will then format its local hard disk, install the kits, and reboot. As soon as you quit addhost, after having added some blades, it will reconfigure Lava, Ganglia and Cacti and your new compute nodes will be visible.

You also have the option of configuring diskless compute nodes within Kusu, or you can mix and match. You can add more than one operating system during the initial configuration, or add them later.

Some final configuration steps. We added all compute nodes with a 12 GB memory footprint into a queue named bss12, and similarly we have a bss24 queue for the 24 GB compute nodes, for a total of roughly 250 job slots. Kusu also creates /home on the installer node and exports it to all the compute nodes; we simply mount our filer home directories on top of that via /etc/rc.local.
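
The rc.local mount itself is a one-liner; the export path below is an assumption, substitute the volume your filer exports:

# appended to /etc/rc.local: mount the filer home directories on top of /home
mount -t nfs filer3a:/vol/home /home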

You can change configurations on the compute nodes in two ways. The command pdsh executes the same command in parallel across all the hosts listed in /etc/hosts.pdsh (created by Kusu). Or you can use the command cfmsync, the content file manager: cfm looks in /etc/cfm/<compute-node-group-name> for files, or links to files, and updates the remote copies on the nodes if they are not up to date. Note that cfm rewrites a lot of files during reboot based on the information in the databases, like /etc/fstab, which sometimes becomes annoying.
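
For example, to run a command everywhere and to push the cfm-managed files out:

export WCOLL=/etc/hosts.pdsh    # host list created by Kusu
pdsh uptime                     # same command, in parallel, on every node
cfmsync -f                      # push files under /etc/cfm/<node group name>/ to the nodes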

Of note is that we compile and install requested software in /home/apps/ so that it is immediately available cluster-wide. For parallel programs we compile with OpenMPI so these programs can run over both Infiniband and Ethernet switches.

There are two scratch areas: /localscratch, a local file system on each blade which will support NFS locking if needed, and /sanscratch, a directory mounted via NFS from the NetApp filer. The Lava scheduler, using the pre_exec and post_exec queue directives, creates a per-job directory in both areas named after the job ID ($LSB_JOBID). Users can observe their job's progress in the /sanscratch area but not in the /localscratch areas. Both directories are removed when the job finishes.

Accounts are created locally on the cluster. When the user logs in for the first time, Kusu automatically creates the ssh keys necessary to submit programs via the scheduler to the compute nodes without relying on passwords.

There are very few policies on our clusters. Use disk space as needed and archive data elsewhere. Run as many jobs as needed but leave resources for others. Infiniband switches are primarily for MPI compiled programs (does not apply to BSS cluster).

Step 1

Download, MD5-checksum, and burn the following ISOs to disc.

I recommend checksumming the files; we had trouble getting these files to download cleanly.
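
A minimal check, with a placeholder ISO file name; compare the output against the checksum published on the download page:

md5sum kusu-installer.iso    # placeholder name; repeat for each downloaded ISO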

Step 2

  • Select an installer node, insert the Kusu Installer disc into the CD/DVD drive, and connect the device via the USB ports.
  • The installer node, and 2-3 compute nodes, must have the purple cable connecting eth0 (bottom port) to the rack's ProCurve switch (the top one). If you wish, you can cable the top port (eth1) into the bottom switch for testing, but this is not necessary.
  • Boot the installer node, hit F2 to enter the BIOS, go to the Boot menu tab and make sure both CDROM and Removable Device are listed before any other options like hard disk and network cards, then hit F10 to save the changes and exit/reboot.
  • Next you should see the Project Kusu splash page with the orange lego turtle; when prompted type 'centos'.
  • Navigation around these screens is Tab/Enter and arrow keys.
  • Next come the informational screens, in order
  • 1 - language: English
  • 2 - keyboard: us
  • 3 - network: configure each interface; edit and configure two private networks (for Pace we'll reset eth1 on the installer node later on for public access). This is so that the cluster is not accessible from the outside and so we can separate provision from private (NFS data/MPI) traffic. Edit:
    • eth0: 192.168.101.254/255.255.0.0
    • name: kusu101prov, type: provision
    • eth1: 10.10.101.254/255.255.0.0
    • name: kusupriv, type: other
  • 4 - gateway & dns: gateway 192.168.101.0 (not used, but a required field), dns server 192.168.101.254 (the installer node)
  • 5 - host: FQDN kusu101, PCD kusu101 (basically we will not provide internet accessible names)
  • 6 - time: America/New_York with NTP server 192.168.101.254 (could be configured later to poll an outside server)
  • 7 - root password: password (keep simple for now, change later)
  • 8 - disk partitions: select 'Use Default'
    • edit /home downsize to 1024 (Pace may want to leave this much larger and create a 1 GB /localscratch in this test setup)
    • add a logical volume
      • mount point /localscratch
      • label LOCALSCRATCH
      • size: leave blank, see below
      • type ext3, on hda, check “fill remaining space on disk” (!only one partition can have this setting!)
  • 9 - confirm: accept (at this point the disk gets reformatted)
  • 10 - kits: select Add, insert kit cd, wait, cycle through disks by kit, then No More Kits, then Finish (node reboots).

Upon reboot (enter BIOS and reset boot to hard disk first) check some command output: hostname, route, ifconfig, bhosts, bqueues

Step 3

  • first create network interfaces for nodes, different from installer network interfaces
  • type 'netedit' and browse installer eth0/1 screens, next
  • 'New' and define:
    • network: 192.168.0.0
    • subnet: 255.255.0.0
    • gateway: 192.168.101.0
    • device: eth0
    • starting IP: 192.168.101.250
    • suffix: -eth0
    • increment: -1 (that's a minus 1)
    • options:
    • description: nodeprov of type provision
  • 'Save' and now via 'New' do the same for eth1 (change 192.168 to 10.10, eth0 to eth1, with description nodepriv of type public)
  • 'Quit'
  • next we are going to create our nodegroup template for the compute nodes, type 'ngedit'
  • use 'Copy' and copy template compute-centos5.3-5-x86_64, then 'Edit' that copy
    • general: change name Copy 1 to _BSS with format node#NN (we don't care about rack and like short names)
    • repository: there is only one, select it
    • boot time:
    • components: (check that non-server/non-master components only are checked, select if not)
    • networks: here select only the interfaces you create: nodeprov eth0 and nodepriv eth1
    • optional: do select vim* and emacs* packages (annoying)
    • partition: resize /data to 1024 and add partition /localscratch, ext3, size of 50000
    • cfmsync: update, no
    • 'Exit'
  • now we're ready to add compute nodes. type 'addhost'
    • if you receive an error about MySQLdb not found in 10-cacti.py, it is one of two situations we have encountered
      • mysql was not installed; add the user and initialize the database
        • grep mysql /etc/passwd, if none, 'adduser mysql'
        • /etc/init.d/mysqld status, if none 'yum install mysql-server', 'chkconfig mysqld on', '/etc/init.d/mysqld start'
        • 'mysql -u root' should work
      • and/or python is missing a driver
        • yum install MySQL-python
  • when addhost starts, select the *_BSS nodegroup created, and eth0 interface
  • make sure blades have purple cable in bottom interface, turn blade on
  • if you know the blade will boot off the network, let it go; otherwise hit F2, enter the BIOS, and set the Boot menu to network first
  • once the blade broadcasts its eth0 MAC address and receives the kickstart file, move on to the next blade
  • do 2-3 blades this way
  • once the first blade is rebooted, enter BIOS, set boot menu to hard disk
  • there's a rhythm to this …
  • once the last blade has fully booted off the hard disk, quit addhost on the installer node
  • addhost will now push new files to all the members of the cluster using cfmsync

Issue 'export WCOLL=/etc/hosts.pdsh' for pdsh and then 'pdsh uptime'; the nodes should respond. 'bhosts' will probably list them as unavailable, but it means the scheduler is aware of the nodes. 'adduser foo', 'passwd foo', 'cfmsync -f', 'pdsh grep foo /etc/shadow' will show you how cfmsync pushes the information out; this is done via /etc/cfm/<nodegroup_name>/ and any files it finds there or that are linked in.
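
Spelled out as commands, that verification sequence is:

export WCOLL=/etc/hosts.pdsh
pdsh uptime                    # every node should respond
adduser foo; passwd foo        # create a test account on the installer node
cfmsync -f                     # push the account files out to the nodes
pdsh grep foo /etc/shadow      # the account now exists on every node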

Step 4

Ugly step. If you look at /etc/hosts you'll see what we mean. All blade host names should be unique, so we're going to fix some files.

  • first, the installer's 'hostname' command comes back with 'kusu101'
  • copy /etc/hosts to /etc/hosts-good. edit.
  • put installer lines together, put 192.168 and 10.10 lines together for an easier read
  • for 10.10 remove all short host names like 'kusu101' or 'node00' etc
  • for 192.168 add 'kusu101' or 'node00' etc short host names as the first word after IP on each line
  • leave all the other host names intact (*.kusu101, *-eth0, etc)
  • copy hosts-good across hosts file
  • next do the same for hosts.pdsh but only use short host names, one name per node
  • next do the same for /etc/lava/conf/hosts, use only 192.168 IPs and only one (short) host name
  • next edit /etc/rc.d/rc.local and add lines
    • cp /etc/hosts-good /etc/hosts
    • cp /etc/hosts.pdsh-good /etc/hosts.pdsh
    • cp /etc/lava/conf/hosts-good /etc/lava/conf/hosts
  • in /etc/cfm/compute-centos5.3-5-x86_64_BSS
    • link in all the *-good files at appropriate locations
    • make the rc.d directory at appropriate level and link in rc.local
  • run 'cfmsync -f'
  • on installer node run '/etc/init.d/lava stop', then start, and do this on nodes via pdsh
  • 'pdsh uptime' should now list the hosts with short name
  • 'bhosts' should in a little while now show the hosts as available
  • 'lsload' should do the same

Now reboot the entire cluster and verify that the changes are permanent. Sidebar: for Pace, you can now assign eth1 on the installer node a pace.edu IP, and have the necessary changes made to the ProCurve switch, so your users can log into the installer/head node. You still only have 50 GB or so of home directory space, but users can play around.
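
A sketch of giving eth1 a public address on the installer node, the CentOS 5 way; the addresses below are placeholders for whatever pace.edu assigns:

# /etc/sysconfig/network-scripts/ifcfg-eth1 (placeholder addresses)
DEVICE=eth1
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
GATEWAY=192.0.2.1
ONBOOT=yes

Then 'service network restart' (or 'ifup eth1') brings the interface up.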

Actually, we had a better idea: create another node group template from your _BSS template, remove eth1, use a naming convention of login#N, and set the starting IP to something like 192.168.101.10 … call this node group _BSS_login or so. Start addhost and add a new host to this node group. Next, manually add eth1 with an IP in pace.edu and wire it up via the switch to the outside world. Then add this host to LSF_MASTER_LIST. Now users can log into this node and submit jobs while staying out of your way on the installer node.

Step 5

Fun step.

  • make a backup copy of /etc/lava/conf/lsbatch/lava/configdir/lsb.queues
  • edit file, delete everything but queue 'normal' definition
  • (if you rename queue normal you also need to edit lsb.params and define default queue)
  • remove most queue definitions and set the following
    • QJOB_LIMIT = 4 (assuming you have 2 nodes in the cluster, 6 if you have 3; in other words, #nodes * #cores)
    • UJOB_LIMIT = 1000 (users like to write scripts that submit jobs; this protects against runaway scripts)
    • INTERACTIVE = no (only batch is allowed)
    • EXCLUSIVE = Y (allow the bsub -x flag)
    • PRE_EXEC = /home/apps/lava/pre_exec (these two will create/remove the scratch dirs)
    • POST_EXEC = /home/apps/lava/post_exec
  • make the directory /home/apps (for compiled software)
  • make the directories /home/lava and /home/sanscratch
  • be sure /localscratch and /home/sanscratch have permissions like /tmp (mode 1777) on all blades
  • create the pre/post exec files (post does an rm -rf against the created directories; a post_exec sketch follows after this list)
  • for example, the pre_exec:
#!/bin/bash
# pre_exec: create the per-job scratch directories before the job starts
if [ "X$LSB_JOBID" != "X" ]; then
    mkdir -p /home/sanscratch/$LSB_JOBID /localscratch/$LSB_JOBID
    sleep 5; exit 0
else
    echo "LSB_JOBID NOT SET!"
    exit 111
fi
  • 'badmin reconfig'
  • 'bqueues' should now show new configuration
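
The matching post_exec is not shown above; a minimal sketch, mirroring the pre_exec and removing the per-job scratch directories:

#!/bin/bash
# /home/apps/lava/post_exec (sketch): clean up the per-job scratch directories
if [ "X$LSB_JOBID" != "X" ]; then
    rm -rf /home/sanscratch/$LSB_JOBID /localscratch/$LSB_JOBID
    exit 0
else
    echo "LSB_JOBID NOT SET!"
    exit 111
fi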

Now we're ready to submit a serial job. As a non-privileged user create two files:

  • run
#!/bin/bash

#BSUB -q normal
#BSUB -J test
#BSUB -n 1
#BSUB -e err
#BSUB -o out

# clean up results from previous runs
rm -f out err job3.out

export MYSANSCRATCH=/home/sanscratch/$LSB_JOBID
export MYLOCALSCRATCH=/localscratch/$LSB_JOBID

# run the job in local scratch (make job.sh executable first: chmod +x ~/job.sh)
cd $MYLOCALSCRATCH
pwd
cp ~/job.sh .
time ./job.sh > job.out

# copy the results to the shared scratch area
cd $MYSANSCRATCH
pwd
cp $MYLOCALSCRATCH/job.out job2.out

# and finally back to the home directory
cd
pwd
cp $MYSANSCRATCH/job2.out job3.out
  • job.sh
#!/bin/bash

sleep 10
echo Done sleeping.

for i in `seq 1 100`
do
      date
done
  • 'bsub < run' (submits)
  • 'bjobs' (check dispatch)

Step 6

More fun. Parallel jobs can be submitted over Ethernet interconnects but, of course, will not achieve the performance of Infiniband interconnects. OpenMPI is a nice MPI flavor because software compiled with it automatically detects whether the host has an HCA card and loads the appropriate libraries. So in order to compile, or run, some OpenMPI examples we need the following:

  • yum install libibverbs; pdsh yum install libibverbs -q -y
  • yum install gcc-c++

On our Dell cluster we have static pre-compiled flavors of MPI and OFED. A tarball of 200 MB can be found here http://lsfdocs.wesleyan.edu/mpis.tar.gz

  • download tarball, stage in /home/apps/src
  • cd /opt; tar zxvf /home/apps/src/mpis.tar.gz; pdsh "cd /opt; tar zxvf /home/apps/src/mpis.tar.gz"
  • examples in /opt/openmpi/gnu/examples have been compiled like so:
    • export PATH=/opt/openmpi/gnu/bin:$PATH
    • export LD_LIBRARY_PATH=/opt/openmpi/gnu/lib:$LD_LIBRARY_PATH
    • cd /opt/openmpi/gnu/examples; make
    • ./ring_c; ./hello_c (to test; they'll complain about no HCA card)

OK, so now we need to write a script to submit a parallel job. A parallel job is launched with the command 'mpirun'; however, that command needs to know which hosts the scheduler has allocated to the job. That is handled by a wrapper script located in /usr/bin/openmpi-mpirun.

  • irun
#!/bin/bash

#BSUB -e err
#BSUB -o out
#BSUB -n 4
#BSUB -q normal
#BSUB -J ptest

# clean up results from previous runs
rm -f err out

export PATH=/opt/openmpi/gnu/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/gnu/lib:$LD_LIBRARY_PATH

echo "make sure we have the right mpirun"
which mpirun

# the wrapper passes the scheduler's host allocation to mpirun
/usr/bin/openmpi-mpirun /opt/openmpi/gnu/examples/hello_c

/usr/bin/openmpi-mpirun /opt/openmpi/gnu/examples/ring_c
  • 'bsub < irun' (submits)
  • 'bjobs' (check status)

Step 7

Tools. As you add nodes, they are picked up by the Ganglia and Cacti monitoring. These are useful to look at.

But first we must fix Firefox. You can download a tarball here: http://lsfdocs.wesleyan.edu/firefox.tar.gz. Stage it in /usr/local/src, untar it, then link the firefox executable into /usr/local/bin.
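
Roughly, and assuming the tarball unpacks into a directory named firefox containing the firefox binary (the layout is an assumption):

cd /usr/local/src
wget http://lsfdocs.wesleyan.edu/firefox.tar.gz
tar zxvf firefox.tar.gz
ln -s /usr/local/src/firefox/firefox /usr/local/bin/firefox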

