OpenHPC page 2

Additional tools for the OpenHPC environment. First add these two lines to SMS and all compute nodes. Patch CHROOT as well.

/etc/security/limits.conf

# added for RLIMIT_MEMLOCK warnings with libibverbs -hmeij
*                soft    memlock         unlimited
*                hard    memlock         unlimited
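
Since the same two lines have to land in several copies of limits.conf (SMS, nodes, CHROOT), a small helper that appends them only when missing may save some retyping. This is a minimal sketch; the function name add_memlock_limits is made up for illustration.

```shell
#!/bin/sh
# Append the two memlock lines to a limits.conf file unless they are already there.
# Usage: add_memlock_limits FILE   (add_memlock_limits is a hypothetical helper name)
add_memlock_limits() {
    file=$1
    for kind in soft hard; do
        # Look for an existing "* soft|hard memlock unlimited" entry with any spacing.
        if ! grep -Eq "^\*[[:space:]]+${kind}[[:space:]]+memlock[[:space:]]+unlimited" "$file" 2>/dev/null; then
            printf '*                %s    memlock         unlimited\n' "$kind" >> "$file"
        fi
    done
}
```

Run it once against /etc/security/limits.conf and once against /data/ohpc/images/centos7.2/etc/security/limits.conf. After a fresh login, ulimit -l shows the limit that is actually in effect.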

Skipping SSH restrictions for users
- Set up passwordless logins: ssh-keygen -t rsa
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Collect all server fingerprints and make a global known_hosts file in ~/.ssh/
Skipping Lustre installation
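
The SSH steps above can be sketched as two helpers. Both function names are invented here, and collect_host_keys assumes ssh-keyscan can reach every host over the network.

```shell
#!/bin/sh
# Append a public key to an authorized_keys file once, with the permissions sshd expects.
# Usage: authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE   (hypothetical helper name)
authorize_key() {
    pub=$1
    auth=$2
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # -x whole line, -F fixed string: skip the append if the key is already listed.
    grep -qxF "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
    chmod 700 "$(dirname "$auth")"
    chmod 600 "$auth"
}

# Gather RSA host keys into one known_hosts file (needs network access to the hosts).
# Usage: collect_host_keys KNOWN_HOSTS_FILE HOST...   (hypothetical helper name)
collect_host_keys() {
    out=$1
    shift
    ssh-keyscan -t rsa "$@" > "$out" 2>/dev/null
}
```

For example, after ssh-keygen -t rsa: authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys, then collect_host_keys ~/.ssh/known_hosts ohpc0-test n29 n31.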

Nagios monitoring

 yum -y groupinstall ohpc-nagios
 yum -y --installroot=/data/ohpc/images/centos7.2 install nagios-plugins-all-ohpc nrpe-ohpc
 chroot /data/ohpc/images/centos7.2 systemctl enable nrpe
 perl -pi -e "s/^allowed_hosts=/# allowed_hosts=/"  /data/ohpc/images/centos7.2/etc/nagios/nrpe.cfg
 echo "nrpe 5666/tcp # NRPE" >> /data/ohpc/images/centos7.2/etc/services
 echo "nrpe : 192.168.1.249 : ALLOW" >> /data/ohpc/images/centos7.2/etc/hosts.allow
 echo "nrpe : ALL : DENY" >> /data/ohpc/images/centos7.2/etc/hosts.allow
 chroot /data/ohpc/images/centos7.2 /usr/sbin/useradd -c "NRPE user for the NRPE service" \
       -d /var/run/nrpe -r -g nrpe -s /sbin/nologin nrpe
 mv /etc/nagios/conf.d/services.cfg.example /etc/nagios/conf.d/services.cfg
 mv /etc/nagios/conf.d/hosts.cfg.example /etc/nagios/conf.d/hosts.cfg
 perl -pi -e "s/HOSTNAME1/n29/ || s/HOST1_IP/192.168.102.38/" /etc/nagios/conf.d/hosts.cfg
 perl -pi -e "s/HOSTNAME2/n31/ || s/HOST2_IP/192.168.102.40/" /etc/nagios/conf.d/hosts.cfg
 perl -pi -e "s/ \/bin\/mail/ \/usr\/bin\/mailx/g" /etc/nagios/objects/commands.cfg
 perl -pi -e "s/nagios\@localhost/root\@ohpc0-test/" /etc/nagios/objects/contacts.cfg
 chkconfig nagios on
 systemctl start nagios
 chmod u+s `which ping`
 echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
 echo "root:           hmeij@wes..." >> /etc/aliases
 newaliases
 systemctl restart postfix

# recreate vnfs and reimage nodes, see page1
 wwvnfs -y --chroot /data/ohpc/images/centos7.2
 /root/deploy.sh
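
Before rebuilding the vnfs it can be worth confirming that the NRPE edits really landed in the image. The following is a sketch only; check_nrpe_image is a made-up name, and it merely greps for the lines written by the commands above.

```shell
#!/bin/sh
# Report whether the NRPE edits are present under an image root.
# Usage: check_nrpe_image CHROOT   (hypothetical helper; returns 0 when every check passes)
check_nrpe_image() {
    root=$1
    rc=0
    for f in etc/nagios/nrpe.cfg etc/services etc/hosts.allow; do
        [ -f "$root/$f" ] || { echo "missing $root/$f"; return 1; }
    done
    # The perl one-liner should have commented out allowed_hosts.
    if grep -q '^allowed_hosts=' "$root/etc/nagios/nrpe.cfg"; then
        echo "allowed_hosts is still active in nrpe.cfg"; rc=1
    fi
    grep -q '^nrpe 5666/tcp' "$root/etc/services" || { echo "nrpe port missing from services"; rc=1; }
    grep -q '^nrpe : .* : ALLOW' "$root/etc/hosts.allow" || { echo "no ALLOW rule for nrpe"; rc=1; }
    grep -q '^nrpe : ALL : DENY' "$root/etc/hosts.allow" || { echo "no DENY rule for nrpe"; rc=1; }
    return $rc
}
```

Usage: check_nrpe_image /data/ohpc/images/centos7.2 && echo "image looks ready"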

Page1 OpenHPC
Reset the nagiosadmin password: htpasswd -c /etc/nagios/passwd nagiosadmin
Open port 80 in iptables but restrict access severely (passwords travel in plain text)
http://localhost/nagios
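
One way to restrict port 80 is to admit a single source range and drop everything else. The sketch below only prints the iptables commands so they can be reviewed first. The helper name http_rules and the example range 192.168.1.0/24 are assumptions, not settings taken from this cluster.

```shell
#!/bin/sh
# Print iptables commands that admit HTTP from one source range and drop the rest.
# Nothing is applied here; review the output, then pipe it to "sh" as root.
# Usage: http_rules SOURCE_CIDR   (hypothetical helper name)
http_rules() {
    src=$1
    # Insert at the top of INPUT so a later catch-all REJECT rule cannot shadow them.
    echo "iptables -I INPUT 1 -p tcp --dport 80 -s $src -j ACCEPT"
    echo "iptables -I INPUT 2 -p tcp --dport 80 -j DROP"
}
```

Usage: http_rules 192.168.1.0/24 to inspect, then http_rules 192.168.1.0/24 | sh to apply.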

On to Ganglia

 yum -y groupinstall ohpc-ganglia
 yum -y --installroot=/data/ohpc/images/centos7.2 install ganglia-gmond-ohpc
# import passwd, shadow and group files for new user account ganglia
 mv /etc/ganglia/gmond.conf /etc/ganglia/gmond.conf-orig
 cp /opt/ohpc/pub/examples/ganglia/gmond.conf /etc/ganglia/
# use provision IP
 perl -pi -e "s/<sms>/192.168.1.249/" /etc/ganglia/gmond.conf
 cp /etc/ganglia/gmond.conf /data/ohpc/images/centos7.2/etc/ganglia/
 echo "gridname MySite" >> /etc/ganglia/gmetad.conf
 systemctl enable gmond
 systemctl enable gmetad
 systemctl start gmond
 systemctl start gmetad
 systemctl restart httpd

 chroot /data/ohpc/images/centos7.2 systemctl enable gmond

# recreate vnfs and reimage nodes, see page1
 wwvnfs -y --chroot /data/ohpc/images/centos7.2
 /root/deploy.sh

http://localhost/ganglia
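
Besides the web page, a quick command-line check is to count the hosts in the XML that gmond serves. This sketch assumes gmond accepts TCP connections on port 8649 (its usual tcp_accept_channel port) and that nc is installed; count_ganglia_hosts is a made-up name.

```shell
#!/bin/sh
# Count distinct hosts named in Ganglia XML read from standard input.
# Usage: some-xml-source | count_ganglia_hosts   (hypothetical helper name)
count_ganglia_hosts() {
    # Each reporting node appears as a <HOST NAME="..."> element.
    grep -o '<HOST NAME="[^"]*"' | sort -u | wc -l | tr -d ' '
}
```

Usage: nc 192.168.1.249 8649 | count_ganglia_hosts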

Not installing ClusterShell, pdsh is already installed
- add compute hostnames to /etc/hosts.pdsh
- echo export WCOLL=/etc/hosts.pdsh >> /root/.bashrc

[root@ohpc0-test ~]# pdsh uptime
n31:  10:44:25 up 19:14,  1 user,  load average: 0.00, 0.01, 0.05
n29:  10:44:25 up 19:19,  0 users,  load average: 0.00, 0.01, 0.05
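
The two pdsh setup steps can be scripted so that they are safe to rerun. A minimal sketch; both function names are invented for illustration.

```shell
#!/bin/sh
# Write one hostname per line to a pdsh hosts file.
# Usage: write_pdsh_hosts FILE HOST...   (hypothetical helper name)
write_pdsh_hosts() {
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

# Add "export WCOLL=..." to a shell startup file unless it is already present.
# Usage: set_wcoll RCFILE HOSTS_FILE   (hypothetical helper name)
set_wcoll() {
    rcfile=$1
    line="export WCOLL=$2"
    grep -qxF "$line" "$rcfile" 2>/dev/null || echo "$line" >> "$rcfile"
}
```

Usage: write_pdsh_hosts /etc/hosts.pdsh n29 n31, then set_wcoll /root/.bashrc /etc/hosts.pdsh and log in again so WCOLL is set.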

Skip mrsh installation
Skip Genders installation
Skip ConMan installation, ipmi serial consoles
Skip rsyslog forwarding of compute node logs to SMS
Redefine ControlMachine in /etc/slurm/slurm.conf
- use eth0, not public address eth1
- and CHROOT/etc/slurm/slurm.conf
- import file back into database
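
Changing ControlMachine in both copies of slurm.conf could look like the sketch below. The helper name set_control_machine is made up, and the hostname passed in is assumed to resolve to the eth0 address.

```shell
#!/bin/sh
# Rewrite the ControlMachine line of a slurm.conf copy.
# Usage: set_control_machine FILE HOSTNAME   (hypothetical helper name)
set_control_machine() {
    file=$1
    name=$2
    tmp=$(mktemp) || return 1
    # Replace only the ControlMachine= line; all other lines pass through unchanged.
    # Writing back with cat keeps the owner and mode of the original file.
    sed "s/^ControlMachine=.*/ControlMachine=$name/" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage: set_control_machine /etc/slurm/slurm.conf ohpc0-slurm, and the same for /data/ohpc/images/centos7.2/etc/slurm/slurm.conf. The Warewulf re-import step is not covered by this sketch.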

Ran into a Slurm config problem here on the compute nodes. systemctl status slurm showed a failed start and reported that it could not find /var/run/slurmctld.pid. That is the controller's PID file, which is the wrong one: compute nodes should only start slurmd. This is how I finally got it fixed.

# ON COMPUTE NODES, that is in CHROOT

# Removed file /etc/init.d/slurm
mv /etc/init.d/slurm /root/ 

# Made the following link
[root@n31 ~]# ls -l /etc/systemd/system/multi-user.target.wants/slurmd.service 
lrwxrwxrwx 1 root root 38 Mar 30 14:05 /etc/systemd/system/multi-user.target.wants/slurmd.service -> /usr/lib/systemd/system/slurmd.service  

# now it starts properly
Mar 31 12:41:05 n31.localdomain systemd[1]: Starting Slurm node daemon...
Mar 31 12:41:05 n31.localdomain systemd[1]: PID file /var/run/slurmd.pid not readable (yet?) after start.
Mar 31 12:41:05 n31.localdomain systemd[1]: Started Slurm node daemon.
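
To make the same fix in the image rather than on a running node, the two steps can be applied under the CHROOT path. This is a sketch with a made-up function name; it recreates the link shown in the listing above. Running systemctl enable slurmd inside the chroot would presumably produce the same link.

```shell
#!/bin/sh
# Park the SysV init script and create the slurmd unit link under an image root.
# Usage: enable_slurmd_in_image CHROOT   (hypothetical helper name)
enable_slurmd_in_image() {
    root=$1
    wants="$root/etc/systemd/system/multi-user.target.wants"
    # Move the init script aside; it is kept under /root inside the image.
    if [ -e "$root/etc/init.d/slurm" ]; then
        mkdir -p "$root/root"
        mv "$root/etc/init.d/slurm" "$root/root/"
    fi
    mkdir -p "$wants"
    # The link target is an absolute path as seen from inside the image.
    ln -sf /usr/lib/systemd/system/slurmd.service "$wants/slurmd.service"
}
```

Usage: enable_slurmd_in_image /data/ohpc/images/centos7.2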

Recreate vnfs, Reimage the whole kaboodle

Link to my previous eval of Slurm and job throughput testing: Slurm. Next submit some test jobs, explained on that page too.
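
For a first smoke test, a job that just prints the hostname of each allocated node is usually enough. The sketch below writes such a batch script. The job name, output file pattern and the helper name write_test_job are made up; the partition name test and the two-node count match the slurm.conf on this page.

```shell
#!/bin/sh
# Write a small Slurm batch script that prints the hostname of each allocated node.
# Usage: write_test_job FILE   (hypothetical helper name)
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=test
#SBATCH --nodes=2
#SBATCH --output=hello-%j.out
srun hostname
EOF
    chmod +x "$1"
}
```

Usage: write_test_job hello.sh, then sbatch hello.sh and watch the queue with squeue.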

Here are my current settings in slurm.conf on OpenHPC.

ClusterName=linux
ControlMachine=ohpc0-slurm
ControlAddr=192.168.1.249
SlurmUser=slurm
SlurmctldPort=6815-6817        <---
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/etc/slurm/state    <---
SlurmdSpoolDir=/etc/slurm/spool      <---
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
FirstJobId=101               <---
MaxJobCount=999999           <---
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/builtin <---
SchedulerPort=7321          <---
SelectType=select/linear
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
ReturnToService=1
NodeName=ohpc0-slurm NodeAddr=192.168.1.249 
NodeName=n29 NodeAddr=192.168.102.38
NodeName=n31 NodeAddr=192.168.102.40 
PartitionName=test Nodes=n29,n31 Default=YES MaxTime=INFINITE STATE=UP

Define CPUs, Cores, ThreadsPerCore, etc. later; run with Slurm's self-discovered values for now.
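
When the time comes to define those values, slurmd -C prints the hardware it detects on a node as a NodeName= line. The sketch below turns that output into a slurm.conf entry by adding the NodeAddr. The function name node_line is invented, and the exact fields slurmd -C prints depend on the Slurm version.

```shell
#!/bin/sh
# Read "slurmd -C" style output on standard input and print the NodeName= line
# with NodeAddr inserted after the node name; other lines (such as UpTime) are dropped.
# Usage: slurmd -C | node_line ADDRESS   (hypothetical helper name)
node_line() {
    addr=$1
    sed -n "s/^\(NodeName=[^ ]*\) /\1 NodeAddr=$addr /p"
}
```

Usage on n29: slurmd -C | node_line 192.168.102.38, then paste the result over the matching NodeName line in slurm.conf.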