This shows you the differences between two versions of the page.
cluster:88 [2010/08/10 18:25] hmeij |
cluster:88 [2010/08/17 19:56] (current) hmeij |
||
---|---|---|---|
Line 79: | Line 79: | ||
* name: kusu101prov, | * name: kusu101prov, | ||
* eth1: 10.10.101.254/ | * eth1: 10.10.101.254/ | ||
- | * name: kusupriv, type: other | + | * name: kusu101priv, type: other |
* 4 - gateway & dns: gateway 192.168.101.0 (is not used but required field), dns server 192.168.101.254 (installer node) | * 4 - gateway & dns: gateway 192.168.101.0 (is not used but required field), dns server 192.168.101.254 (installer node) | ||
* 5 - host: FQDN kusu101, PCD kusu101 (basically we will not provide internet accessible names) | * 5 - host: FQDN kusu101, PCD kusu101 (basically we will not provide internet accessible names) | ||
Line 149: | Line 149: | ||
Ugly step. If you look at /etc/hosts you'll see what we mean. All blade host names should be unique, so we're going to fix some files. | Ugly step. If you look at /etc/hosts you'll see what we mean. All blade host names should be unique, so we're going to fix some files. | ||
- | * first installer hostname command comes back with ' | + | * first installer |
- | * copy /etc/hosts to / | + | * copy /etc/hosts to / |
- | * put installer lines together, put 192.168 and 10.10 lines together | + | * put installer lines together, put 192.168 and 10.10 lines together |
- | * for 10.10 remove all short host names either | + | * for 10.10 remove all short host names like ' |
- | * for 192.168 add ' | + | * for 192.168 add ' |
* leave all the other host names intact (*.kusu101, *-eth0, etc) | * leave all the other host names intact (*.kusu101, *-eth0, etc) | ||
* copy hosts-good across hosts file | * copy hosts-good across hosts file | ||
- | * next do the same for hosts.pdsh | + | * next do the same for hosts.pdsh |
* next do the same for / | * next do the same for / | ||
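Since the goal of this step is unique blade host names, a quick check of the edited copy can help before it is copied across the real hosts file. The following is only a sketch: the function name `find_duplicate_names` is made up for illustration, and it assumes the usual hosts-file layout (an IP address followed by one or more names per line).

```shell
#!/bin/bash
# Sketch (hypothetical helper, not part of the Kusu tooling):
# print every host name that appears more than once in a hosts-format file.
find_duplicate_names() {
    local hosts_file="$1"
    awk '
        NF == 0 || $1 ~ /^#/ { next }      # skip blank lines and comment lines
        {
            # field 1 is the IP address, the remaining fields are names
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break       # stop at a trailing comment
                count[$i]++
            }
        }
        END {
            for (name in count)
                if (count[name] > 1) print name
        }
    ' "$hosts_file" | sort
}
```

Run it against your edited copy of the hosts file; no output means no name is listed twice.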
Line 167: | Line 167: | ||
* in / | * in / | ||
- | * link in all the *-good files | + | * link in all the *-good files at appropriate locations |
* make the rc.d directory at appropriate level and link in rc.local | * make the rc.d directory at appropriate level and link in rc.local | ||
* run ' | * run ' | ||
- | * on installer node run / | + | * on installer node run '/ |
* 'pdsh uptime' | * 'pdsh uptime' | ||
* ' | * ' | ||
- | * ' | + | * ' |
- | * copy | + | Now reboot the entire cluster and verify that the changes are permanent. Sidebar: for Pace, you can now assign eth1 on the installer node a pace.edu IP and have the necessary changes made to the ProCurve switch, so your users can log into the installer/ |
+ | Actually, a better idea: create another node group template from your _BSS template, remove eth1, use the naming convention login#N, and set the starting IP to something like 192.168.101.10. Call this node group _BSS_login or so. Then start addhost and add the new host to this node group. | ||
+ | |||
+ | ===== Step 5 ===== | ||
+ | |||
+ | Fun step. | ||
+ | |||
+ | * make a backup copy of / | ||
+ | * edit file, delete everything but queue ' | ||
+ | * (if you rename queue normal you also need to edit lsb.params and define default queue) | ||
+ | * remove most queue definitions and set the following | ||
+ | * QJOB_LIMIT = 4 (assuming you have 2 nodes in the cluster, 6 if you have 3; in other words #nodes * #cores per node) | ||
+ | * UJOB_LIMIT = 1000 (users like to write scripts and submit jobs, this protects from runaway scripts) | ||
+ | * INTERACTIVE = no (only batch is allowed) | ||
+ | * EXCLUSIVE = Y (allow the bsub -x flag) | ||
+ | * PRE_EXEC = / | ||
+ | * POST_EXEC = / | ||
+ | * make the directory /home/apps (for compiled software) | ||
+ | * make the directory /home/lava and / | ||
+ | * be sure / | ||
+ | * create the pre/post exec files (post does an rm -rf against the created directories) | ||
+ | * for example: | ||
+ | < | ||
+ | #!/bin/bash | ||
+ | if [ " | ||
+ | mkdir -p / | ||
+ | sleep 5; exit 0 | ||
+ | else | ||
+ | echo " | ||
+ | exit 111 | ||
+ | fi | ||
+ | </ | ||
+ | |||
+ | * ' | ||
+ | * ' | ||
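The pre/post exec pair described above (create per-job directories before the job, `rm -rf` them afterwards) can be sketched as two small functions. This is an illustration only, not the author's scripts: the function names are hypothetical, the scratch roots are passed in as arguments rather than taken from the page, and it assumes the scheduler exposes the job id to the script (LSF-style schedulers set `LSB_JOBID` in the job environment).

```shell
#!/bin/bash
# Sketch of per-job scratch handling (hypothetical helpers).
# Usage: make_job_scratch <job id> <scratch root> [<scratch root> ...]

make_job_scratch() {
    local job_id="$1"; shift
    local root
    # Without a job id there is nothing sensible to create.
    [ -n "$job_id" ] || { echo "no job id given" >&2; return 111; }
    for root in "$@"; do
        mkdir -p "$root/$job_id" || return 111
    done
}

remove_job_scratch() {
    local job_id="$1"; shift
    local root
    # Refuse an empty id so rm -rf can never hit a scratch root itself.
    [ -n "$job_id" ] || { echo "no job id given" >&2; return 111; }
    for root in "$@"; do
        rm -rf "${root:?}/$job_id"
    done
}
```

A pre-exec script would then call something like `make_job_scratch "$LSB_JOBID" <local scratch root> <shared scratch root>` and the post-exec script the matching `remove_job_scratch`. The return code 111 mirrors the failure exit code used in the example above.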
+ | |||
+ | Now we're ready to submit a serial job. As a non-privileged user create two files: | ||
+ | |||
+ | * run | ||
+ | |||
+ | < | ||
+ | #!/bin/bash | ||
+ | |||
+ | rm -f out err job3.out | ||
+ | |||
+ | #BSUB -q normal | ||
+ | #BSUB -J test | ||
+ | #BSUB -n 1 | ||
+ | #BSUB -e err | ||
+ | #BSUB -o out | ||
+ | |||
+ | export MYSANSCRATCH=/ | ||
+ | export MYLOCALSCRATCH=/ | ||
+ | |||
+ | cd $MYLOCALSCRATCH | ||
+ | pwd | ||
+ | cp ~/job.sh . | ||
+ | time ./job.sh > job.out | ||
+ | |||
+ | cd $MYSANSCRATCH | ||
+ | pwd | ||
+ | cp $MYLOCALSCRATCH/ | ||
+ | |||
+ | cd | ||
+ | pwd | ||
+ | cp $MYSANSCRATCH/ | ||
+ | </ | ||
+ | |||
+ | * job.sh | ||
+ | * | ||
+ | < | ||
+ | #!/bin/bash | ||
+ | |||
+ | sleep 10 | ||
+ | echo Done sleeping. | ||
+ | |||
+ | for i in `seq 1 100` | ||
+ | do | ||
+ | date | ||
+ | done | ||
+ | |||
+ | </ | ||
+ | |||
+ | * 'bsub < run' (submits) | ||
+ | * ' | ||
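The pattern in the `run` script (stage the program into scratch, run it there, copy the output back) can be tried outside the scheduler first. A minimal sketch, assuming nothing about the real scratch paths: `run_in_scratch` is a hypothetical helper and all three directories are supplied by the caller.

```shell
#!/bin/bash
# Sketch (hypothetical helper): stage a script into a scratch directory,
# run it there, and copy its output file to a result directory.
run_in_scratch() {
    local script="$1" scratch="$2" result_dir="$3"
    local name
    name=$(basename "$script")
    cp "$script" "$scratch/" || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$scratch" && bash "./$name" > job.out ) || return 1
    cp "$scratch/job.out" "$result_dir/"
}
```

For example, `run_in_scratch ~/job.sh <scratch directory> ~` leaves a `job.out` in your home directory, which is what the submitted job should also produce.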
+ | |||
+ | |||
+ | ===== Step 6 ===== | ||
+ | |||
+ | More fun. Parallel jobs can be submitted over Ethernet interconnects but will, of course, not achieve the performance of InfiniBand interconnects. | ||
+ | |||
+ | * yum install libibverbs | ||
+ | * pdsh yum install libibverbs -q -y | ||
+ | * yum install gcc-c++ | ||
+ | |||
+ | On our Dell cluster we have static pre-compiled flavors of MPI and OFED. A tarball of 200 MB can be found here [[http:// |
+ | |||
+ | * download tarball, stage in / | ||
+ | * cd /opt; tar zxvf / | ||
+ | * pdsh "cd /opt; tar zxvf / | ||
+ | * examples in / | ||
+ | * export PATH=/ | ||
+ | * export LD_LIBRARY_PATH=/ | ||
+ | * cd / | ||
+ | * ./ring; ./hello (run the compiled examples to test, it'll complain about no HCA card) | ||
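When exporting PATH and LD_LIBRARY_PATH by hand it is easy to end up with duplicate entries, or with a stray trailing colon when LD_LIBRARY_PATH was empty to begin with. The helper below is a sketch of one way to avoid both; `prepend_path` is a hypothetical name and the directory you pass in is whatever your MPI install actually uses.

```shell
#!/bin/bash
# Sketch (hypothetical helper): prepend a directory to a colon-separated
# path list unless it is already present. Prints the resulting list.
prepend_path() {
    local current="$1" dir="$2"
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;                   # already listed
        *)          printf '%s\n' "${dir}${current:+:$current}" ;; # no colon if list was empty
    esac
}
```

Usage would look like `export LD_LIBRARY_PATH=$(prepend_path "$LD_LIBRARY_PATH" <mpi lib directory>)`.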
+ | |||
+ | Ok, so now we need to write a script to submit a parallel job. A parallel job is submitted with command ' |
+ | |||
+ | * irun | ||
+ | |||
+ | < | ||
+ | #!/bin/bash | ||
+ | |||
+ | rm -f err out | ||
+ | |||
+ | #BSUB -e err | ||
+ | #BSUB -o out | ||
+ | #BSUB -n 4 | ||
+ | #BSUB -q normal | ||
+ | #BSUB -J ptest | ||
+ | |||
+ | export PATH=/ | ||
+ | export LD_LIBRARY_PATH=/ | ||
+ | |||
+ | echo "make sure we have the right mpirun" | ||
+ | which mpirun | ||
+ | |||
+ | / | ||
+ | |||
+ | / | ||
+ | |||
+ | </ | ||
+ | |||
+ | * 'bsub < irun' (submits) | ||
+ | * ' | ||
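For a sense of what has to happen between the scheduler and mpirun: the scheduler decides which hosts the job gets, and the MPI launcher has to be told. The sketch below is not the mechanism used in `irun` above. It assumes an LSF-style `LSB_HOSTS` variable (a space-separated host list with one entry per slot) and an Open MPI style hostfile (`host slots=N`); `hosts_to_hostfile` is a hypothetical name.

```shell
#!/bin/bash
# Sketch (hypothetical helper): turn a per-slot host list such as
# "n1 n1 n2 n2" into Open MPI style hostfile lines, in first-seen order.
hosts_to_hostfile() {
    printf '%s\n' "$1" | awk '
        {
            for (i = 1; i <= NF; i++) {
                if (!($i in slots)) order[++n] = $i   # remember first appearance
                slots[$i]++                           # one slot per occurrence
            }
        }
        END {
            for (j = 1; j <= n; j++)
                print order[j] " slots=" slots[order[j]]
        }
    '
}
```

Inside a job script this could be used as `hosts_to_hostfile "$LSB_HOSTS" > hostfile` followed by `mpirun --hostfile hostfile ...`, if your MPI flavor accepts that format.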
+ | |||
+ | ===== Step 7 ===== | ||
+ | |||
+ | Tools. As you add nodes, they are added to the monitoring tools Ganglia and Cacti. | ||
+ | |||
+ | But first we must fix firefox. | ||
+ | |||
+ | * ' | ||
+ | * ' | ||
+ | * http:// | ||
+ | * http:// | ||
+ | * http:// | ||
+ | |||
+ | * http:// | ||
+ | * http:// | ||
\\ | \\ | ||
**[[cluster: | **[[cluster: |