OpenHPC

Additional tools for the OpenHPC environment. First add the two memlock lines below to the SMS and all compute nodes; patch the CHROOT image as well.

  • /etc/security/limits.conf
# added for RLIMIT_MEMLOCK warnings with libibverbs -hmeij
*                soft    memlock         unlimited
*                hard    memlock         unlimited
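
To confirm the new limits actually took effect (a quick sketch; note the limits only apply to new login sessions, and pdsh with the node list is set up further down this page):

 # check the max locked memory limit locally and on the compute nodes
 ulimit -l
 pdsh -w n29,n31 'ulimit -l'
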
  • Skipping SSH restrictions for users
    • Set up passwordless logins: ssh-keygen -t rsa
    • cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    • Collect all server fingerprints and make a global known_hosts file in ~/.ssh/ (sketch below)
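A minimal sketch of that key and known_hosts setup (hostnames are the ones used elsewhere on this page; assumes ssh-keyscan is available):

 ssh-keygen -t rsa
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 chmod 600 ~/.ssh/authorized_keys
 # gather host keys into one known_hosts file, then copy it to all nodes
 ssh-keyscan -t rsa ohpc0-test n29 n31 >> ~/.ssh/known_hosts
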
  • Skipping Lustre installation
  • Nagios monitoring
 yum -y groupinstall ohpc-nagios
 yum -y --installroot=/data/ohpc/images/centos7.2 install nagios-plugins-all-ohpc nrpe-ohpc
 chroot /data/ohpc/images/centos7.2 systemctl enable nrpe
 perl -pi -e "s/^allowed_hosts=/# allowed_hosts=/"  /data/ohpc/images/centos7.2/etc/nagios/nrpe.cfg
 echo "nrpe 5666/tcp # NRPE" >> /data/ohpc/images/centos7.2/etc/services
 echo "nrpe : 192.168.1.249 : ALLOW" >> /data/ohpc/images/centos7.2/etc/hosts.allow
 echo "nrpe : ALL : DENY" >> /data/ohpc/images/centos7.2/etc/hosts.allow
 chroot /data/ohpc/images/centos7.2 /usr/sbin/useradd -c "NRPE user for the NRPE service" \
       -d /var/run/nrpe -r -g nrpe -s /sbin/nologin nrpe
 mv /etc/nagios/conf.d/services.cfg.example /etc/nagios/conf.d/services.cfg
 mv /etc/nagios/conf.d/hosts.cfg.example /etc/nagios/conf.d/hosts.cfg
 perl -pi -e "s/HOSTNAME1/n29/ || s/HOST1_IP/192.168.102.38/" /etc/nagios/conf.d/hosts.cfg
 perl -pi -e "s/HOSTNAME2/n31/ || s/HOST2_IP/192.168.102.40/" /etc/nagios/conf.d/hosts.cfg
 perl -pi -e "s/ \/bin\/mail/\/usr\/bin\/mailx/g" /etc/nagios/objects/commands.cfg
 perl -pi -e "s/nagios\@localhost/root\@ohpc0-test/" /etc/nagios/objects/contacts.cfg
 chkconfig nagios on
 systemctl start nagios
 chmod u+s `which ping`
 echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
 echo "root:           hmeij@wes..." >> /etc/aliases
 newaliases
 systemctl restart postfix

# recreate vnfs and reimage nodes, see page1
 wwvnfs -y --chroot /data/ohpc/images/centos7.2
 /root/deploy.sh
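
After the reimage a few sanity checks are worth running (a sketch; the check_nrpe path assumes the stock plugin location, which may differ here):

 systemctl status nagios
 pdsh -w n29,n31 systemctl status nrpe
 # query the NRPE agent on a compute node from the SMS
 /usr/lib64/nagios/plugins/check_nrpe -H n29
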
  • Page1 OpenHPC
  • Reset nagiosadmin password htpasswd -c /etc/nagios/passwd nagiosadmin
  • Open port 80 in iptables but restrict severely (plain text passwords)
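Something along these lines, for example; the 192.168.1.0/24 source is only a placeholder for whichever admin network should reach the web UI:

 # allow the Nagios web UI only from the admin/provision network, drop everything else
 iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport 80 -j ACCEPT
 iptables -A INPUT -p tcp --dport 80 -j DROP
 # persist the rules with whatever mechanism is in use (iptables-services, etc.)
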
  • On to Ganglia
 yum -y groupinstall ohpc-ganglia
 yum -y --installroot=/data/ohpc/images/centos7.2 install ganglia-gmond-ohpc
# import passwd, shadow and group files for new user account ganglia
 mv /etc/ganglia/gmond.conf /etc/ganglia/gmond.conf-orig
 cp /opt/ohpc/pub/examples/ganglia/gmond.conf /etc/ganglia/
# use provision IP
 perl -pi -e "s/<sms>/192.168.1.249/" /etc/ganglia/gmond.conf
 cp /etc/ganglia/gmond.conf /data/ohpc/images/centos7.2/etc/ganglia/
 echo "gridname MySite" >> /etc/ganglia/gmetad.conf
 systemctl enable gmond
 systemctl enable gmetad
 systemctl start gmond
 systemctl start gmetad
 systemctl restart httpd

 chroot /data/ohpc/images/centos7.2 systemctl enable gmond

# recreate vnfs and reimage nodes, see page1
 wwvnfs -y --chroot /data/ohpc/images/centos7.2
 /root/deploy.sh
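
To confirm the Ganglia daemons are reporting (a sketch; assumes the default tcp_accept_channel on port 8649 and that nc is installed):

 systemctl status gmond gmetad
 pdsh -w n29,n31 systemctl status gmond
 # gmond answers on TCP 8649 with an XML dump; every node should show up as a <HOST> entry
 nc 192.168.1.249 8649 | grep '<HOST '
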
  • Not installing ClusterShell, pdsh is already installed
    • Add compute hostnames to /etc/hosts.pdsh
    • echo "export WCOLL=/etc/hosts.pdsh" >> /root/.bashrc
[root@ohpc0-test ~]# pdsh uptime
n31:  10:44:25 up 19:14,  1 user,  load average: 0.00, 0.01, 0.05
n29:  10:44:25 up 19:19,  0 users,  load average: 0.00, 0.01, 0.05
  • Skip mrsh installation
  • Skip Genders installation
  • Skip ConMan installation, ipmi serial consoles
  • Skip rsyslog forwarding of compute node logs to the SMS
  • Redefine ControlMachine in /etc/slurm/slurm.conf
    • use eth0, not the public address on eth1
    • and in CHROOT/etc/slurm/slurm.conf
    • import the file back into the database (see sketch below)
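The re-import might look something like this, assuming slurm.conf is provisioned through the Warewulf file store as in the OpenHPC recipe:

 # refresh the copy held in the Warewulf datastore, then restart the daemons
 wwsh file sync slurm.conf
 systemctl restart slurmctld
 pdsh -w n29,n31 systemctl restart slurmd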

Ran into a slurm config problem here on the compute nodes. Issuing systemctl status slurm revealed a failed start: it reported that it could not find /var/run/slurmctld.pid … that's the wrong pid file, since compute nodes should only start slurmd. So finally got this fixed:

# Removed file /etc/init.d/slurm
mv /etc/init.d/slurm /root/ 

# Made the following link
[root@n31 ~]# ls -l /etc/systemd/system/multi-user.target.wants/slurmd.service 
lrwxrwxrwx 1 root root 38 Mar 30 14:05 /etc/systemd/system/multi-user.target.wants/slurmd.service -> /usr/lib/systemd/system/slurmd.service  

# now it starts properly
Mar 31 12:41:05 n31.localdomain systemd[1]: Starting Slurm node daemon...
Mar 31 12:41:05 n31.localdomain systemd[1]: PID file /var/run/slurmd.pid not readable (yet?) after start.
Mar 31 12:41:05 n31.localdomain systemd[1]: Started Slurm node daemon.
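
To keep that fix across reimages the same change presumably has to go into the compute image too (a sketch, following the enable-in-chroot pattern used above for nrpe and gmond):

 # apply the fix inside the chroot before rebuilding the vnfs
 [ -e /data/ohpc/images/centos7.2/etc/init.d/slurm ] && \
       mv /data/ohpc/images/centos7.2/etc/init.d/slurm /root/slurm.initd.chroot
 chroot /data/ohpc/images/centos7.2 systemctl enable slurmd
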
  • Recreate vnfs, Reimage the whole kaboodle

Link to my previous eval of Slurm and job throughput testing: Slurm. Next, submit some test jobs, also explained on that page (a quick sketch follows the config below).

Here are my current settings in slurm.conf under OpenHPC.

ClusterName=linux
ControlMachine=ohpc0-slurm
ControlAddr=192.168.1.249
SlurmUser=slurm
SlurmctldPort=6815-6817               <---
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/etc/slurm/state    <---
SlurmdSpoolDir=/etc/slurm/spool       <---
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
FirstJobId=101               <---
MaxJobCount=999999           <---
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/builtin <---
SchedulerPort=7321          <---
SelectType=select/linear
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
ReturnToService=1
NodeName=ohpc0-slurm NodeAddr=192.168.1.249 CPUs=2 RealMemory=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=n29 NodeAddr=192.168.102.38 CPUs=2 RealMemory=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=n31 NodeAddr=192.168.102.40 CPUs=2 RealMemory=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=test Nodes=n29,n31 Default=YES MaxTime=INFINITE State=UP
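
With that in place, a trivial test job against the test partition might look like this (just a sketch):

 # interactive check across both compute nodes
 srun -p test -N 2 hostname
 # batch equivalent, then watch the queue
 sbatch -p test -N 2 --wrap="srun hostname"
 squeue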

