**[[cluster:
==== OpenHPC ====
Additional tools for the OpenHPC environment. First add these two lines to SMS and all compute nodes. Patch CHROOT as well.
<code>
yum -y groupinstall ohpc-ganglia
yum -y --installroot=/
# import passwd, shadow and group files for new user account ganglia
mv /
cp /
# use provision IP
perl -pi -e "
cp /
echo "
</code>
* http://
* Not installing ClusterShell,
* add compute hostnames to /
* ''
| + | < | ||
| + | |||
| + | [root@ohpc0-test ~]# pdsh uptime | ||
| + | n31: 10:44:25 up 19: | ||
| + | n29: 10:44:25 up 19: | ||
| + | |||
| + | </ | ||
| + | |||
* Skip ''
* Skip ''
* Skip ''
* Skip ''
* Redefine ''
* use eth0, not public address eth1
* and CHROOT/
* import file back into database (see the sketch below)
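A sketch of pushing the edited files back into the Warewulf datastore; the file names here are assumptions (the OpenHPC recipe imports files such as ''/etc/hosts'' and ''slurm.conf'' this way):

<code>
# re-import the edited file into the Warewulf database (path is an assumption)
wwsh file import /etc/hosts

# push updated copies of all imported files out to the provisioned nodes
wwsh file sync
</code>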
| + | |||
Ran into a Slurm config problem here on the compute nodes. When issuing ''
| + | < | ||
| + | |||
| + | # ON COMPUTE NODES, that is in CHROOT | ||
| + | |||
| + | # Removed file / | ||
| + | mv / | ||
| + | |||
| + | # Made the following link | ||
| + | [root@n31 ~]# ls -l / | ||
| + | lrwxrwxrwx 1 root root 38 Mar 30 14:05 / | ||
| + | |||
| + | # now it starts properly | ||
| + | Mar 31 12:41:05 n31.localdomain systemd[1]: Starting Slurm node daemon... | ||
| + | Mar 31 12:41:05 n31.localdomain systemd[1]: PID file / | ||
| + | Mar 31 12:41:05 n31.localdomain systemd[1]: Started Slurm node daemon. | ||
| + | |||
| + | </ | ||
| + | |||
| + | * Recreate vnfs, Reimage the whole kaboodle | ||
| + | |||
Link to my previous eval of Slurm and job throughput testing: [[cluster:
Here are my current slurm.conf settings in OpenHPC.
| + | < | ||
| + | ClusterName=linux | ||
| + | ControlMachine=ohpc0-slurm | ||
| + | ControlAddr=192.168.1.249 | ||
| + | SlurmUser=slurm | ||
| + | SlurmctldPort=6815-6817 | ||
| + | SlurmdPort=6818 | ||
| + | AuthType=auth/ | ||
| + | StateSaveLocation=/ | ||
| + | SlurmdSpoolDir=/ | ||
| + | SwitchType=switch/ | ||
| + | MpiDefault=none | ||
| + | SlurmctldPidFile=/ | ||
| + | SlurmdPidFile=/ | ||
| + | ProctrackType=proctrack/ | ||
| + | FirstJobId=101 | ||
| + | MaxJobCount=999999 | ||
| + | SlurmctldTimeout=300 | ||
| + | SlurmdTimeout=300 | ||
| + | InactiveLimit=0 | ||
| + | MinJobAge=300 | ||
| + | KillWait=30 | ||
| + | Waittime=0 | ||
| + | SchedulerType=sched/ | ||
| + | SchedulerPort=7321 | ||
| + | SelectType=select/ | ||
| + | FastSchedule=1 | ||
| + | SlurmctldDebug=3 | ||
| + | SlurmdDebug=3 | ||
| + | JobCompType=jobcomp/ | ||
| + | PropagateResourceLimitsExcept=MEMLOCK | ||
| + | SlurmdLogFile=/ | ||
| + | SlurmctldLogFile=/ | ||
| + | Epilog=/ | ||
| + | ReturnToService=1 | ||
| + | NodeName=ohpc0-slurm NodeAddr=192.168.1.249 | ||
| + | NodeName=n29 NodeAddr=192.168.102.38 | ||
| + | NodeName=n31 NodeAddr=192.168.102.40 | ||
| + | PartitionName=test Nodes=n29, | ||
| + | |||
| + | </ | ||
| + | |||
Define CPUs, Cores, ThreadsPerCore,
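A sketch of how those node attributes can be discovered and written into the NodeName lines; ''slurmd -C'' prints the local hardware in slurm.conf syntax (the numbers below are made up for illustration):

<code>
# on a compute node: print this node's hardware in slurm.conf syntax
[root@n29 ~]# slurmd -C
NodeName=n29 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=15885

# then carry the relevant fields into slurm.conf, for example
NodeName=n29 NodeAddr=192.168.102.38 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
</code>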
| + | |||
| + | [[cluster: | ||
| \\ | \\ | ||
| **[[cluster: | **[[cluster: | ||