OpenHPC
Additional tools for the OpenHPC environment. First add these two lines to SMS and all compute nodes. Patch CHROOT as well.
- /etc/security/limits.conf
# added for RLIMIT_MEMLOCK warnings with libibverbs -hmeij
* soft memlock unlimited
* hard memlock unlimited
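A quick sanity check after reimaging (a sketch; memlock limits are applied at login, so use a fresh shell):
# verify the memlock limit on SMS and on a compute node
ulimit -l   # should print: unlimited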
- Skipping SSH restrictions for users
- Set up passwordless logins
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Collect all server fingerprints and make a global known_hosts file in ~/.ssh/, as sketched below
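One way to collect those fingerprints is ssh-keyscan; a minimal sketch, assuming the hostnames used elsewhere on this page:
# append host keys for the SMS and compute nodes to the shared known_hosts
ssh-keyscan ohpc0-test n29 n31 >> ~/.ssh/known_hosts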
- Skipping Lustre installation
- Nagios monitoring
yum -y groupinstall ohpc-nagios
yum -y --installroot=/data/ohpc/images/centos7.2 install nagios-plugins-all-ohpc nrpe-ohpc
chroot /data/ohpc/images/centos7.2 systemctl enable nrpe
perl -pi -e "s/^allowed_hosts=/# allowed_hosts=/" /data/ohpc/images/centos7.2/etc/nagios/nrpe.cfg
echo "nrpe 5666/tcp # NRPE" >> /data/ohpc/images/centos7.2/etc/services
echo "nrpe : 192.168.1.249 : ALLOW" >> /data/ohpc/images/centos7.2/etc/hosts.allow
echo "nrpe : ALL : DENY" >> /data/ohpc/images/centos7.2/etc/hosts.allow
chroot /data/ohpc/images/centos7.2 /usr/sbin/useradd -c "NRPE user for the NRPE service" \
-d /var/run/nrpe -r -g nrpe -s /sbin/nologin nrpe
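After the nodes are reimaged, NRPE can be spot-checked from the SMS; a sketch, the plugin path is an assumption and may differ on your install:
# should return the NRPE version string from the compute node
/usr/lib64/nagios/plugins/check_nrpe -H 192.168.102.38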
mv /etc/nagios/conf.d/services.cfg.example /etc/nagios/conf.d/services.cfg
mv /etc/nagios/conf.d/hosts.cfg.example /etc/nagios/conf.d/hosts.cfg
perl -pi -e "s/HOSTNAME1/n29/ || s/HOST1_IP/192.168.102.38/" /etc/nagios/conf.d/hosts.cfg
perl -pi -e "s/HOSTNAME2/n31/ || s/HOST2_IP/192.168.102.40/" /etc/nagios/conf.d/hosts.cfg
perl -pi -e "s/ \/bin\/mail/\/usr\/bin\/mailx/g" /etc/nagios/objects/commands.cfg
perl -pi -e "s/nagios\@localhost/root\@ohpc0-test/" /etc/nagios/objects/contacts.cfg
chkconfig nagios on
systemctl start nagios
chmod u+s `which ping`
echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
echo "root: hmeij@wes..." >> /etc/aliases
newaliases
systemctl restart postfix
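To confirm mail actually leaves through the relayhost, a quick sketch:
# test message; the alias above routes root's mail off the SMS
echo "postfix relay test" | mailx -s "relay test from SMS" root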
# recreate vnfs and reimage nodes, see page1
wwvnfs -y --chroot /data/ohpc/images/centos7.2
/root/deploy.sh
- Page1 OpenHPC
- Reset nagiosadmin password
htpasswd -c /etc/nagios/passwd nagiosadmin
- Open port 80 in iptables but restrict severely (plain text passwords), as sketched below
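A minimal sketch of such a restriction; the management subnet below is an assumption, adjust to your network:
# allow http only from the management subnet, drop everything else
iptables -A INPUT -p tcp --dport 80 -s 192.168.102.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j DROP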
- On to Ganglia
yum -y groupinstall ohpc-ganglia
yum -y --installroot=/data/ohpc/images/centos7.2 install ganglia-gmond-ohpc
# import passwd, shadow and group files for new user account ganglia
mv /etc/ganglia/gmond.conf /etc/ganglia/gmond.conf-orig
cp /opt/ohpc/pub/examples/ganglia/gmond.conf /etc/ganglia/
# use provision IP
perl -pi -e "s/<sms>/192.168.1.249/" /etc/ganglia/gmond.conf
cp /etc/ganglia/gmond.conf /data/ohpc/images/centos7.2/etc/ganglia/
echo "gridname MySite" >> /etc/ganglia/gmetad.conf
systemctl enable gmond
systemctl enable gmetad
systemctl start gmond
systemctl start gmetad
systemctl restart httpd
chroot /data/ohpc/images/centos7.2 systemctl enable gmond
# recreate vnfs and reimage nodes, see page1
wwvnfs -y --chroot /data/ohpc/images/centos7.2
/root/deploy.sh
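To verify gmond is answering, poll its XML port (8649 by default); a sketch, assuming nc is installed:
# gmond dumps its cluster state as XML, then closes the connection
nc localhost 8649 | head -20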
- Not installing ClusterShell, pdsh is already installed
- Add compute hostnames to /etc/hosts.pdsh
echo "export WCOLL=/etc/hosts.pdsh" >> /root/.bashrc
[root@ohpc0-test ~]# pdsh uptime
n31: 10:44:25 up 19:14, 1 user, load average: 0.00, 0.01, 0.05
n29: 10:44:25 up 19:19, 0 users, load average: 0.00, 0.01, 0.05
- Skip mrsh installation
- Skip Genders installation
- Skip ConMan installation, ipmi serial consoles
- Skip rsyslog forwarding of compute node logs to SMS
- Redefine ControlMachine in /etc/slurm/slurm.conf - use eth0, not the public address on eth1
- and in CHROOT /etc/slurm/slurm.conf
- Import the file back into the database, as sketched below
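With Warewulf managing the file, the re-import is roughly the following sketch; the name slurm.conf assumes that is how the file was registered in the datastore:
# push the edited slurm.conf back into the Warewulf datastore
wwsh file resync slurm.conf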
Ran into a Slurm config problem here on compute nodes. When issuing systemctl status slurm, the output revealed a failed start: it reported it could not find /var/run/slurmctl.pid … that's the wrong pid file, compute nodes should only start slurmd. Finally got this fixed:
# ON COMPUTE NODES, that is in CHROOT
# Removed file /etc/init.d/slurm
mv /etc/init.d/slurm /root/
# Made the following link
[root@n31 ~]# ls -l /etc/systemd/system/multi-user.target.wants/slurmd.service
lrwxrwxrwx 1 root root 38 Mar 30 14:05 /etc/systemd/system/multi-user.target.wants/slurmd.service -> /usr/lib/systemd/system/slurmd.service
# now it starts properly
Mar 31 12:41:05 n31.localdomain systemd[1]: Starting Slurm node daemon...
Mar 31 12:41:05 n31.localdomain systemd[1]: PID file /var/run/slurmd.pid not readable (yet?) after start.
Mar 31 12:41:05 n31.localdomain systemd[1]: Started Slurm node daemon.
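The same symlink can be created the supported way inside the chroot; a sketch equivalent to the manual link above:
# enable slurmd in CHROOT; creates the multi-user.target.wants symlink
chroot /data/ohpc/images/centos7.2 systemctl enable slurmd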
- Recreate vnfs, Reimage the whole kaboodle
Link to my previous eval of Slurm and job throughput testing: Slurm. Next submit some test jobs, explained on that page too.
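A minimal throwaway job against the test partition defined below; a sketch:
# one-liner job, then watch the queue
sbatch -p test --wrap="sleep 30; hostname"
squeue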
Here are my current settings in slurm.conf on OpenHPC.
ClusterName=linux
ControlMachine=ohpc0-slurm
ControlAddr=192.168.1.249
SlurmUser=slurm
SlurmctldPort=6815-6817 <---
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/etc/slurm/state <---
SlurmdSpoolDir=/etc/slurm/spool <---
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
FirstJobId=101 <---
MaxJobCount=999999 <---
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/builtin <---
SchedulerPort=7321 <---
SelectType=select/linear
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
ReturnToService=1
NodeName=ohpc0-slurm NodeAddr=192.168.1.249
NodeName=n29 NodeAddr=192.168.102.38
NodeName=n31 NodeAddr=192.168.102.40
PartitionName=test Nodes=n29,n31 Default=YES MaxTime=INFINITE STATE=UP
Define CPUs, Cores, ThreadsPerCore, et al. later; run with Slurm's self-discovered values for now.
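When that time comes, the NodeName lines grow hardware attributes; a hypothetical sketch with made-up values, check the real ones with slurmd -C on each node:
# slurmd -C prints the discovered hardware line for copy/paste into slurm.conf
NodeName=n29 NodeAddr=192.168.102.38 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=16000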
Back
