cluster:52 [2007/11/20 10:18] (current)
\\
**[[cluster:0|Back]]**

====== Upgrading to LSF ======

Why? Here is a summary of the features I wish to take advantage of: **[[cluster:46|Link]]**

We're running Platform/OCS, which includes the Lava scheduler. It's essentially LSF with some functionality removed, but it is free and works very well. Our Dell cluster came pre-configured with Lava; now it's time to leverage the resources of our cluster in more detail.

===== First Stumble =====

What version should we upgrade to? I thought that would be easy: the latest stable version, which is LSF v7.0.1. But that has complications. Our Platform/OCS cluster is a "rocks"-based installation. For more reading: **[[cluster:16|Link]]**

Our OCS version is 4.1.1 and the only "roll" available for **[[http://platform.com/Products/Platform.LSF.Family/Platform.LSF.HPC/|LSF/HPC]]** is the 6.2 version. That's fine, because all the functionality I'm after exists in this version.

In order to install a v7 "roll", we'd have to upgrade to OCS 4.4, and that means a huge disruption: all compute nodes would have to be re-imaged and re-deployed. Hence my preference to upgrade via the roll method to v6.2, as advised by [[http://platform.com|Platform]] support.

Another option is to perform a manual install of v7 from source. However, this would involve significant work, since the scheduler software is installed locally on each compute node and the head node in the OCS environment. I'd like to keep it that way to reduce NFS traffic, rather than install the scheduler software on a shared filesystem.

===== Next Step =====

  * process a License Change request via [[http://platform.com|Platform]]
  * obtain a new license file
  * download the LSF/HPC v6.2 roll at [[http://my.platform.com|MyPlatform]]
  * plan the upgrade steps
===== Lots of Next Steps =====

#0 Shut off the NAT box.

Reset the root password, shut off the box.

#1a Close the head node to all ssh traffic (firewall to trusted user VLAN access only).

#1b Inactivate all queues, back up the scratch dirs, stop all jobs.

  * Make friends! Tar up the contents of /localscratch & /sanscratch **before** stopping jobs ...

  * Take a snapshot of running jobs (like the clumon jobs page, or with ''bjobs -u all'') so users can view what was running ... that info is here: **[[http://swallowtail.wesleyan.edu/clumon/jobs-killed.php|External Link]]**

  * Stop all jobs: ''bkill 0''

  * Disable any cronjobs.

  * Reset sknauert's home dir.
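
The tar-before-kill step could be scripted roughly as below. This is only a sketch: throwaway ''mktemp'' directories stand in for /localscratch (or /sanscratch) and for the archive destination, and the job dir names are made up.

<code>
# Sketch of the pre-kill scratch backup: tar each job's working dir.
# mktemp dirs stand in for /localscratch and the archive destination.
SRC=$(mktemp -d); DEST=$(mktemp -d)
mkdir -p "$SRC/job_1234" "$SRC/job_5678"
echo "results" > "$SRC/job_1234/out.dat"
for d in "$SRC"/*/; do
    job=$(basename "$d")
    tar czf "$DEST/$job.tar.gz" -C "$SRC" "$job"
done
</code>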

#1c Back up all files needed to rebuild the io-node.

The io-node is currently a compute node (but not a member of any queue, and admin_closed). It has fiber channel (2 cards) to the NetApp storage device. A re-image of this node implies rebuilding the NFS environment and reconfiguring dm-multipath -- so the entire cluster needs to go idle for that. Let's rebuild it so we can use it, if needed, as a compute node or backup/failover head node. Rebuilding is documented on the [[https://itsdoku.wesleyan.edu]].

#1d Back up all files needed to rebuild the compute nodes.

This includes two varieties of nodes: a light weight node and a heavy weight node sample. Some of this should be customized with extend-compute.xml (minor changes for now ...). Rebuilding is documented on the [[https://itsdoku.wesleyan.edu]].

#1e Stop the lava system across the cluster.

''/etc/init.d/lava stop''

  * also on the ionode!!
  * also on the head node!!
  * also on the head node run ''/etc/init.d/lavagui stop''

#1f Back up all files in /opt/lava.

=> Copy the /opt/lava/work/lava/logdir/lsb.acct* files to the archive.

LSF/HPC will install in /opt/lsfhpc, but make sure you have a remote backup copy of /opt/lava ... rsync to /mnt/cluster/opt_lava_pre_lsf.
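
A sketch of that backup copy; on the cluster the one-liner would be ''rsync -a /opt/lava/ /mnt/cluster/opt_lava_pre_lsf/''. The demo below uses temp dirs standing in for the real paths so nothing outside of /tmp is touched.

<code>
# Demo of the rsync backup; temp dirs stand in for /opt/lava and
# the mounted backup area /mnt/cluster.
SRC=$(mktemp -d); DEST=$(mktemp -d)
mkdir -p "$SRC/work/lava/logdir"
echo "accounting" > "$SRC/work/lava/logdir/lsb.acct.1"
rsync -a "$SRC/" "$DEST/opt_lava_pre_lsf/"
</code>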

-> Disable the Tivoli agents and start a manual incremental backup.

#1g Unmount all io-node exported file systems; leave the nodes running.

We'll force a reboot followed by a re-image later, in staggered fashion, after we are done with the LSF install. But first unmount the NFS file systems with cluster-fork on all compute nodes & the head node.

#1h Good time to clean all orphaned jobs' working dirs in /localscratch & /sanscratch!
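
One way to sketch that cleanup: remove any scratch dir whose name (a job ID) is not in the list of running jobs. Mock dirs and a hard-coded job list are used below; on the cluster the list would come from something like ''bjobs -u all''.

<code>
# Demo: delete scratch dirs for jobs that are no longer running.
# A mktemp dir and the RUNNING variable stand in for /localscratch
# and the live bjobs output.
SCRATCH=$(mktemp -d)
mkdir -p "$SCRATCH/1234" "$SCRATCH/9999"
RUNNING="1234"
for d in "$SCRATCH"/*/; do
    id=$(basename "$d")
    echo "$RUNNING" | grep -qx "$id" || rm -rf "$d"
done
</code>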

-> fix: set this LUN to space reservation enabled (1 TB)

#1i Unmount all multipathed LUN filesystems on the io-node (/dev/mapper).

** => <hi #ff0000> AFTER THAT, DISCONNECT THE FIBER CABLES </hi> **

Node re-imaging involves formatting and partitioning. We do not want to risk losing any data because of [[http://en.wikipedia.org/wiki/SNAFU|snafus]].

#2. Remove the lava roll.

''chkconfig lava off''\\
''chkconfig lavagui off''\\
''rollops --remove lava''

#3. Add the LSF/HPC roll.

''rollops --add lsfhpc --iso lsfhpc-4.1.1-0.x86_64.disk1.iso''

#4. Prep the environment and license info.

Edit /share/apps/scripts/cluster_bashrc (and cluster_cshrc)\\
Change this section to point to the appropriate lsf location:

<code>
# source the job scheduler environment; sh, ksh or bash
if [ -f /opt/lava/conf/profile.lsf ]; then
        . /opt/lava/conf/profile.lsf
fi
</code>
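
After the roll install, the edited section would presumably read as follows (the path is assumed from the /opt/lsfhpc install location mentioned earlier):

<code>
# source the job scheduler environment; sh, ksh or bash
# /opt/lsfhpc is the assumed LSF/HPC install prefix
if [ -f /opt/lsfhpc/conf/profile.lsf ]; then
        . /opt/lsfhpc/conf/profile.lsf
fi
</code>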

Source that new environment; ''which lmgrd'' should return the lsfhpc version.\\
Next copy the license info to /opt/lsfhpc/conf/license.dat.

#5a. Start the license daemon ... port 1700 is currently free.

''su - flexlm -c "lmgrd -c /opt/lsfhpc/conf/license.dat -l /tmp/lsf_license.log"''

#5b. Add this startup command to /etc/rc.local with the __full path to lmgrd__.

#5c. Check the license daemons: ''lmstat''

#6. Assign compute nodes to additional resources.

''insert into app_globals (service,component,values) values \\
('Info','DefaultLSFHostResource','mvapich lammpi mpichp4');''

This will add the Infiniband MPI implementation.

<hi #ffff00>#7a.</hi> Re-image the io node.

=> Before you do this, redefine the io node as a compute appliance in the cluster database and turn ''/etc/init.d/rocks-grub start'' on.

''/boot/kickstart/cluster-kickstart'' or reseat the power cords.

Once done, mount all NFS file systems on the head node.

=> Redefine the io node as a "nas appliance" in the cluster database and:

''/etc/init.d/rocks-grub stop''\\
''chkconfig rocks-grub off''

<hi #ffff00>#7b.</hi> Re-image all the compute nodes.

''/boot/kickstart/cluster-kickstart'' or use ipmitools and cycle the power while the nodes are running. This mimics an unclean shutdown, forcing a re-image.

<hi #ffff00> Add the memory modules at this time? </hi>

#8. Start and test the LSF/HPC cluster.

Steps #7a & #7b should add the nodes to the LSF cluster. Check the hosts file, because we have 2 NICs per node. Make sure the nodes are registered as //hostname.local//, meaning on the 192.168.1 subnet.
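
A quick way to flag misregistered entries, shown here against a mock hosts file (on the head node you would scan /etc/hosts itself; the node names are examples):

<code>
# Flag any .local hostname that is not on the 192.168.1 subnet.
# A mock hosts file stands in for /etc/hosts.
HOSTS=$(mktemp)
cat > "$HOSTS" <<'EOF'
192.168.1.11  compute-1-1.local
10.3.1.12     compute-1-2.local
EOF
BAD=$(awk '$2 ~ /\.local$/ && $1 !~ /^192\.168\.1\./ {print $2}' "$HOSTS")
</code>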

On the head node:\\
''chkconfig lsf on''\\
''/etc/init.d/lsf restart''\\
''lsid''\\
''lsinfo''\\
''lshosts''\\
''lsload''

After this is done, and all nodes are back up, walk through the lava configuration files and add any missing information to the LSF equivalent files.

On the head node:\\
''lsadmin reconfig''\\
''badmin mbdrestart''\\
''badmin reconfig''\\
''bqueues''\\
''bhosts''

#9. Configure master failover.

Skip this step.

#10. "Go To 1"

Walk through the items in #1 and enable/reset functionality like cronjobs, Tivoli, queues, ssh access ...

Kick off Tivoli for an automated backup.

Test some job submissions ...

Document the new MPI job submission procedure ...

Add our eLIM after a while ...

#11. Relocate some home directories.

  * "pusers" LUNs fstarr & bstewart ... morph into "rusers?" LUNs
  * relocate lvargarslara, skong & ztan to another LUN

#12. NAT box.

Reconfigure compute-1-1 for Scott, maybe.


----

So how long does this take?

  * one morning to install LSF/HPC + rebuild the ionode
  * one afternoon to rebuild all other nodes (and deal with unexpected hardware problems)
  * one morning to open every node and remove/add memory sticks

 --- //[[hmeij@wesleyan.edu|Meij, Henk]] 2007/11/20 09:42//

===== Adding Memory =====

The departments of **CHEM** and **PHYS** will each contribute $2,400 towards the purchase of additional memory. **ITS** will contribute $2,880. Thank you all! Since a 1 GB DIMM costs $60 and a 2 GB DIMM costs $120, we'll buy only the latter. A 4 GB DIMM costs $440, which is a substantial jump.
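
The arithmetic behind those figures:

<code>
# Contributions and DIMM math from the paragraph above.
BUDGET=$((2400 + 2400 + 2880))    # CHEM + PHYS + ITS = 7680
DIMMS=$((BUDGET / 120))           # $120 per 2 GB DIMM -> 64 DIMMs
ADDED=$((DIMMS * 2))              # 128 GB of new memory
TOTAL=$((4*16 + 32*4 + ADDED))    # 4 heavy @ 16 GB + 32 light @ 4 GB + new = 320 GB
</code>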

The $7,680 is enough to purchase 64 DIMMs, adding 128 GB of memory to the cluster. Cluster wide, the nodes will then hold 320 GB of memory. The question is: in what configuration? Here is a constraint ... [[http://support.dell.com/support/edocs/systems/pe1950/en/hom/html/install.htm#wp1229826|Link]]

|"Memory modules must be installed in pairs of matched memory size, speed, and technology, and the total number of memory modules in the configuration must total two, four, or eight. For best system performance, all four, or eight memory modules should be identical in size, speed, and technology. ... System performance can be affected if your memory configuration does not conform to the preceding installation guidelines."|

The 4 heavy weight nodes, with local dedicated fast disks, will not be changed. They currently contain an 8x2 DIMM configuration, thus 16 GB of memory each. All DIMM slots are filled. All 32 light weight nodes currently hold a 4x1 DIMM configuration, thus 4 GB of memory each.

So the first suggestion is to remove the 1 GB DIMMs from the 16 gigE enabled nodes (queue ''16-lwnodes'') and add them to the Infiniband enabled nodes (queue ''16-ilwnodes''). That would make each Infiniband enabled node hold 8 GB of memory (8x1 configuration) and would fill all their slots. A parallel job could then access lots of memory across these nodes.

That leaves 16 empty nodes and 64 2 GB DIMMs to play with. What to do?\\
Here are some options.

^ Scenario A ^^^^ uniform, matches infiniband nodes ^
|  64<sup><hi #ffff00>**1**</hi></sup>  |  16<sup><hi #ffff00>**2**</hi></sup>  |  4x2<sup><hi #ffff00>**3**</hi></sup>  |  128<sup><hi #ffff00>**4**</hi></sup>  | " sixteen 8 GB medium weight nodes " |
^ Scenario B ^^^^ add equal medium and heavy nodes ^
|  16  |  08  |  2x2  |  64  | " eight 4 GB light weight nodes " |
|  16  |  04  |  4x2  |  32  | " four 8 GB medium weight nodes " |
|  32  |  04  |  8x2  |  32  | " four 16 GB heavy weight nodes " |
^ Scenario C ^^^^ emphasis on medium nodes ^
|  08  |  04  |  2x2  |  32  | " four 4 GB light weight nodes " |
|  40  |  10  |  4x2  |  80  | " ten 8 GB medium weight nodes " |
|  16  |  02  |  8x2  |  16  | " two 16 GB heavy weight nodes " |
^ Scenario D ^^^^ ... ^
<sup><hi #ffff00>**1**</hi></sup> Number of DIMMs. This must total 64 within each scenario.\\
<sup><hi #ffff00>**2**</hi></sup> Number of nodes. This must total 16 within each scenario.\\
<sup><hi #ffff00>**3**</hi></sup> Memory pairs, one of these combinations: **2x2**, **4x2**, or **8x2**.\\
<sup><hi #ffff00>**4**</hi></sup> Number of cores. This must total 128 within each scenario.
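
As a sanity check, Scenario B's rows can be summed against constraints 1, 2 and 4:

<code>
# Scenario B column totals: DIMMs, nodes and cores must come to 64, 16, 128.
DIMMS=$((16 + 16 + 32))
NODES=$((8 + 4 + 4))
CORES=$((64 + 32 + 32))
</code>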

  * Personally, I was initially leaning towards **A**.
  * But now, viewing this table, I like the distribution of cores across light, medium and heavy weight nodes in **B**.
  * **C** really depends on whether we need 8 GB nodes. Not sure why we would do this vs **A**.

Actually, the perfect argument for **B** was offered by Francis:
|If machines have 8 GB of RAM, 1 job locks up the node. So two jobs lock up 2 nodes, rendering a total of 14 cores unused and unavailable. Suppose instead we have 16 GB machines. Two jobs would lock up just one machine, leaving only 6 cores unused and unavailable. This would seem to make better use of resources.|

===== Renaming Queues =====

In **Scenario A** above, nothing really changes but the concept of a "light weight" node: it would now have an 8 GB memory footprint versus the old value of 4 GB. So no queues need to be renamed. One exception: the ''04-hwnodes'' and ''gaussian'' queues comprise the same hosts, the heavy weight nodes. Perhaps we should remove the ''gaussian'' queue; Gaussian jobs can run on any node, light or heavy weight, Infiniband or gigE enabled.

In **Scenarios B & C**, things change. Now we have light, medium and heavy weight nodes. One naming convention we could adopt:

| queue_name | = | number of nodes | + | which switch | + | GB mem per node | + | total cores | + | additional info | ; |

Then our queues could be named like so:

| **16i08g128c** | 16 nodes, infiniband enabled, each 8 GB mem (medium), comprising 128 cores total | |
| **08e04g064c** | 08 nodes, gigE enabled, each 4 GB mem (light), comprising 64 cores total | |
| **04e08g032c** | 04 nodes, gigE enabled, each 8 GB mem (medium), comprising 32 cores total | |
| **04e16g032c** | 04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total | |
| **04e16g032cfd** | 04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total | fast local disk access |
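
If we adopted this convention, a small helper could generate the names. The function below is hypothetical, and it assumes 8 cores per node, as in the tables above.

<code>
# qname NODES SWITCH GB [SUFFIX] -- hypothetical queue-name builder
# following the convention above; assumes 8 cores per node.
qname() { printf '%02d%s%02dg%03dc%s\n' "$1" "$2" "$3" "$(( $1 * 8 ))" "${4:-}"; }
N1=$(qname 16 i 8)
N2=$(qname 4 e 16 fd)
</code>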

Or is this too cumbersome? Maybe.\\
Perhaps just an abbreviation:

| **imw** | 16 nodes, infiniband enabled, each 8 GB mem (medium), comprising 128 cores total | |
| **elw** | 08 nodes, gigE enabled, each 4 GB mem (light), comprising 64 cores total | |
| **emw** | 04 nodes, gigE enabled, each 8 GB mem (medium), comprising 32 cores total | |
| **ehw** | 04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total | |
| **ehwfd** | 04 nodes, gigE enabled, each 16 GB mem (heavy), comprising 32 cores total | fast local disk access |

|  NEW QUEUES all priority = 50  ||
| **imw** | compute-1-1 ... compute-1-16 |
| **elw** | compute-1-17 ... compute-1-24 |
| **emw** | compute-1-25 ... compute-1-27 compute-2-28 |
| **ehw** | compute-2-29 ... compute-2-32 |
| **ehwfd** | nfs-2-1 ... nfs-2-4 |
| **matlab** | imw + emw |

Delete queues: idle, [i]debug, molscat, gaussian, nat-test

\\
**[[cluster:0|Back]]**