**[[cluster:0|Back]]**
  
==== OpenHPC page 2 ====
  
Additional tools for the OpenHPC environment. First add these two lines to SMS and all compute nodes. Patch CHROOT as well.
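The two lines themselves are listed elsewhere on this page; the sketch below only illustrates the pattern of making the same change on the SMS and inside the chroot image, assuming the standard OpenHPC recipe image path and a made-up hosts entry as the example.

<code>
# sketch only -- not the actual two lines; the point is: whatever gets
# added on the SMS also gets added inside the CHROOT image
export CHROOT=/opt/ohpc/admin/images/centos7.3   # assumed image path, adjust

# hypothetical example line
echo "192.168.1.249  ohpc0-slurm" >> /etc/hosts
echo "192.168.1.249  ohpc0-slurm" >> $CHROOT/etc/hosts
</code>
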
  * Skip ''Genders'' installation
  * Skip ''ConMan'' installation, ipmi serial consoles
  * Skip ''rsyslog'' forwarding of compute node logs to SMS
  * Redefine ''ControlMachine'' in /etc/slurm/slurm.conf
    * use eth0, not the public address on eth1
    * and in CHROOT/etc/slurm/slurm.conf
    * import the file back into the database (see the sketch below)
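Roughly what the ''ControlMachine'' edit and the database import look like, assuming the Warewulf-managed slurm.conf from the standard OpenHPC recipe (image path and hostname as used on this page):

<code>
export CHROOT=/opt/ohpc/admin/images/centos7.3   # assumed image path, adjust

# point ControlMachine at the private eth0 name, on the SMS and in the CHROOT
perl -pi -e 's/^ControlMachine=.*/ControlMachine=ohpc0-slurm/' /etc/slurm/slurm.conf
perl -pi -e 's/^ControlMachine=.*/ControlMachine=ohpc0-slurm/' $CHROOT/etc/slurm/slurm.conf

# push the new contents back into the Warewulf database
# (initial import, if not done during the recipe: wwsh file import /etc/slurm/slurm.conf)
wwsh file resync slurm.conf
</code>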
  
Ran into a slurm config problem here on the compute nodes. When issuing ''systemctl status slurm'' the output revealed a failed start: it reported it could not find /var/run/slurmctld.pid ... that's the wrong pid file, compute nodes should only start slurmd. Finally got this fixed:
  
<code>

# ON COMPUTE NODES, that is in CHROOT

# Moved the SysV init script /etc/init.d/slurm out of the way
mv /etc/init.d/slurm /root/

# Made the following link
[root@n31 ~]# ls -l /etc/systemd/system/multi-user.target.wants/slurmd.service
lrwxrwxrwx 1 root root 38 Mar 30 14:05 /etc/systemd/system/multi-user.target.wants/slurmd.service -> /usr/lib/systemd/system/slurmd.service

# now it starts properly
Mar 31 12:41:05 n31.localdomain systemd[1]: Starting Slurm node daemon...
Mar 31 12:41:05 n31.localdomain systemd[1]: PID file /var/run/slurmd.pid not readable (yet?) after start.
Mar 31 12:41:05 n31.localdomain systemd[1]: Started Slurm node daemon.

</code>

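For reference, on a node that is already booted (rather than editing the chroot) the same fix should be doable with plain systemctl; a sketch:

<code>
# drop the legacy SysV "slurm" service if present, enable the node daemon;
# "enable" creates the multi-user.target.wants symlink shown above
systemctl disable slurm
systemctl enable slurmd
systemctl restart slurmd
systemctl status slurmd
</code>
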
  * Recreate the vnfs, reimage the whole kaboodle (sketch below)

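A sketch of that with the Warewulf tools, assuming the standard OpenHPC image path and the two nodes defined in slurm.conf below:

<code>
# rebuild the VNFS image from the chroot
export CHROOT=/opt/ohpc/admin/images/centos7.3   # assumed image path, adjust
wwvnfs -y --chroot $CHROOT

# reboot the compute nodes so they PXE boot the new image
pdsh -w n29,n31 reboot
</code>
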
Link to my previous eval of Slurm and job throughput testing: [[cluster:134|Slurm]]. Next, submit some test jobs, which are also explained on that page.

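A minimal test submission against the ''test'' partition defined below (script name and contents are just an example):

<code>
cat > hello.sub << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=test
#SBATCH --output=hello-%j.out
hostname
sleep 60
EOF

sbatch hello.sub
squeue
</code>
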
Here are my current settings in slurm.conf for OpenHPC; the lines of particular interest are marked with ''<---''.

<code>
ClusterName=linux
ControlMachine=ohpc0-slurm
ControlAddr=192.168.1.249
SlurmUser=slurm
SlurmctldPort=6815-6817        <---
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/etc/slurm/state    <---
SlurmdSpoolDir=/etc/slurm/spool       <---
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
FirstJobId=101               <---
MaxJobCount=999999           <---
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/builtin <---
SchedulerPort=7321          <---
SelectType=select/linear
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
ReturnToService=1
NodeName=ohpc0-slurm NodeAddr=192.168.1.249
NodeName=n29 NodeAddr=192.168.102.38
NodeName=n31 NodeAddr=192.168.102.40
PartitionName=test Nodes=n29,n31 Default=YES MaxTime=INFINITE STATE=UP
</code>

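A few standard Slurm commands to verify the daemons picked these settings up:

<code>
# what slurmctld is actually running with
scontrol show config | egrep 'ControlMachine|StateSaveLocation|SchedulerType'

# partition and node state as the controller sees it
sinfo
scontrol show node n31

# after further slurm.conf edits, have the daemons re-read the file
scontrol reconfigure
</code>
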
Define CPUs, Cores, ThreadsPerCore, and so on later; run with Slurm's self-discovered values for now.

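When it is time to pin those down, ''slurmd -C'' on a compute node prints the discovered hardware in slurm.conf syntax; the output line below is illustrative, not from these nodes:

<code>
# run on a compute node, paste the resulting line into the NodeName entries
slurmd -C
# NodeName=n31 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=16000
</code>
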
[[cluster:154|OpenHPC page 1]] - page 2 - [[cluster:156|OpenHPC page 3]] - [[cluster:160|OpenHPC page 4]]
  
\\
**[[cluster:0|Back]]**