cluster:88
Differences
This shows you the differences between two versions of the page.
| cluster:88 [2010/08/10 21:02] – hmeij | cluster:88 [2010/08/17 19:56] (current) – hmeij | ||
|---|---|---|---|
| Line 79: | Line 79: | ||
| * name: kusu101prov, | * name: kusu101prov, | ||
| * eth1: 10.10.101.254/ | * eth1: 10.10.101.254/ | ||
| - | * name: kusupriv, type: other | + | * name: kusu101priv, type: other |
| * 4 - gateway & dns: gateway 192.168.101.0 (is not used but required field), dns server 192.168.101.254 (installer node) | * 4 - gateway & dns: gateway 192.168.101.0 (is not used but required field), dns server 192.168.101.254 (installer node) | ||
| * 5 - host: FQDN kusu101, PCD kusu101 (basically we will not provide internet accessible names) | * 5 - host: FQDN kusu101, PCD kusu101 (basically we will not provide internet accessible names) | ||
| Line 177: | Line 177: | ||
| Now reboot the entire cluster and confirm that the changes are permanent. Sidebar: for Pace, you can now assign eth1 on the installer node a pace.edu IP, and have the necessary changes made to the ProCurve switch, so your users can log into the installer/ | Now reboot the entire cluster and confirm that the changes are permanent. Sidebar: for Pace, you can now assign eth1 on the installer node a pace.edu IP, and have the necessary changes made to the ProCurve switch, so your users can log into the installer/ | ||
| + | A better idea: create another node group template from your _BSS template, remove eth1, use the naming convention login#N, and set the starting IP to something like 192.168.101.10. Call this node group _BSS_login or similar. Then start addhost and add the new host to this node group. | ||
| ===== Step 5 ===== | ===== Step 5 ===== | ||
| Line 184: | Line 185: | ||
| * make a backup copy of / | * make a backup copy of / | ||
| * edit file, delete everything but queue ' | * edit file, delete everything but queue ' | ||
| - | * (if you rename queue normal you also need to edit lsb.params) | + | * (if you rename queue normal you also need to edit lsb.params |
| * remove most queue definitions and set the following | * remove most queue definitions and set the following | ||
| - | * QJOBLIMIT | + | * QJOB_LIMIT |
| - | * UJOBLIMIT | + | * UJOB_LIMIT |
| * INTERACTIVE = no (only batch is allowed) | * INTERACTIVE = no (only batch is allowed) | ||
| * EXCLUSIVE = Y (allow the bsub -x flag) | * EXCLUSIVE = Y (allow the bsub -x flag) | ||
| * PRE_EXEC = / | * PRE_EXEC = / | ||
| * POST_EXEC = / | * POST_EXEC = / | ||
| - | * make the directories /home/apps (for compile | + | * make the directories /home/apps (for compiled |
| - | * make the directory / | + | * make the directory / |
| - | * be sure / | + | * be sure / |
| * create the pre/post exec files (post does an rm -rf against the created directories) | * create the pre/post exec files (post does an rm -rf against the created directories) | ||
| + | * for example: | ||
| < | < | ||
| #!/bin/bash | #!/bin/bash | ||
| Line 208: | Line 210: | ||
| * ' | * ' | ||
| - | * ' | + | * ' |
| + | |||
| + | Now we're ready to submit serial jobs. As a non-privileged user, create two files: | ||
| - | Now we're ready to submit jobs. As non-priviledged user create two files: | ||
| * run | * run | ||
| + | |||
| < | < | ||
| #!/bin/bash | #!/bin/bash | ||
| - | rm -f out err job*.out | + | |
| + | rm -f out err job3.out | ||
| #BSUB -q normal | #BSUB -q normal | ||
| #BSUB -J test | #BSUB -J test | ||
| Line 239: | Line 245: | ||
| * job.sh | * job.sh | ||
| + | * | ||
| < | < | ||
| #!/bin/bash | #!/bin/bash | ||
| Line 254: | Line 261: | ||
| * 'bsub < run' (submits) | * 'bsub < run' (submits) | ||
| * ' | * ' | ||
| + | |||
| + | |||
| + | ===== Step 6 ===== | ||
| + | |||
| + | More fun. Parallel jobs can be submitted over Ethernet interconnects but will, of course, not achieve the performance of InfiniBand interconnects. | ||
| + | |||
| + | * yum install libibverbs | ||
| + | * pdsh yum install libibverbs -q -y | ||
| + | * yum install gcc-c++ | ||
| + | |||
| + | On our Dell cluster we have static pre-compiled flavors of MPI and OFED. A 200 MB tarball can be found here [[http:// | ||
| + | |||
| + | * download tarball, stage in / | ||
| + | * cd /opt; tar zxvf / | ||
| + | * pdsh "cd /opt; tar zxvf / | ||
| + | * examples in / | ||
| + | * export PATH=/ | ||
| + | * export LD_LIBRARY_PATH=/ | ||
| + | * cd / | ||
| + | * ./ring.c; ./hello.c (to test, it'll complain about no HCA card) | ||
| + | |||
| + | OK, so now we need to write a script to submit a parallel job. A parallel job is submitted with the command ' | ||
| + | |||
| + | * irun | ||
| + | |||
| + | < | ||
| + | #!/bin/bash | ||
| + | |||
| + | rm -f err out | ||
| + | |||
| + | #BSUB -e err | ||
| + | #BSUB -o out | ||
| + | #BSUB -n 4 | ||
| + | #BSUB -q normal | ||
| + | #BSUB -J ptest | ||
| + | |||
| + | export PATH=/ | ||
| + | export LD_LIBRARY_PATH=/ | ||
| + | |||
| + | echo "make sure we have the right mpirun" | ||
| + | which mpirun | ||
| + | |||
| + | / | ||
| + | |||
| + | / | ||
| + | |||
| + | </ | ||
| + | |||
| + | * 'bsub < irun' (submits) | ||
| + | * ' | ||
| + | |||
| + | ===== Step 7 ===== | ||
| + | |||
| + | Tools. As you add nodes, they are added to the monitoring tools Ganglia and Cacti. | ||
| + | |||
| + | But first we must fix Firefox. | ||
| + | |||
| + | * ' | ||
| + | * ' | ||
| + | * http:// | ||
| + | * http:// | ||
| + | * http:// | ||
| + | |||
| + | * http:// | ||
| + | * http:// | ||
| \\ | \\ | ||
| **[[cluster: | **[[cluster: | ||
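
Step 5 describes pre/post exec files where the pre script creates per-job directories and the post script does an rm -rf against them. The paths are cut off in this diff view, so the following is only a minimal sketch of that pattern under stated assumptions: `SCRATCH_ROOT`, `job_scratch_dir`, `pre_exec` and `post_exec` are hypothetical names chosen for illustration, not the files or paths used on the cluster.

```shell
#!/bin/bash
# Sketch of per-job scratch handling for queue-level PRE_EXEC / POST_EXEC hooks.
# SCRATCH_ROOT is a hypothetical location; the real paths are site-specific.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp/scratch-demo}"

# Build the per-job directory name from a job ID.
job_scratch_dir() {
    local jobid="$1"
    # Refuse empty or non-numeric IDs so rm -rf can never hit SCRATCH_ROOT itself.
    case "$jobid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s/%s\n' "$SCRATCH_ROOT" "$jobid"
}

# What a pre-exec script would do: create the directory for this job.
pre_exec() {
    local dir
    dir="$(job_scratch_dir "$1")" || return 1
    mkdir -p "$dir" && chmod 700 "$dir"
}

# What a post-exec script would do: remove the directory again.
post_exec() {
    local dir
    dir="$(job_scratch_dir "$1")" || return 1
    rm -rf -- "$dir"
}
```

In a real hook the job ID would most likely come from the `LSB_JOBID` environment variable that LSF sets for the job, for example `pre_exec "$LSB_JOBID"`; treat that wiring as an assumption to verify on your own installation. The numeric check exists because the post script runs `rm -rf`, and an empty ID would otherwise expand to the scratch root itself.
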
cluster/88.1281474132.txt.gz · Last modified: by hmeij
