cluster:213 [2022/07/25 13:34] hmeij07 [Amber22]

**[[cluster:
==== New Head Node ====
We're embarking on a transition to a new head/login node named ''
Two new compute nodes (n100, n101) will be set up in a test queue. They each have four RTX5000 GPUs, which have the same architecture as our other GPUs, so all compiled software should work. These GPUs have a 16 GB memory footprint (twice as large as our other GPUs).
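To see what the driver reports for these cards, something like the query below can be run on n100 or n101. The ''nvidia-smi'' options are standard; the output depends on the installed driver, so none is shown here.

```shell
# Report GPU model and total memory for each device on the node.
nvidia-smi --query-gpu=name,memory.total --format=csv
```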
OpenHPC will be deployed next and I'll make some notes. We will move to the Slurm scheduler. ([[cluster:208|Slurm Test Env]] for users and [[cluster:
Some pictures below.

==== Config Recipe ====
Steps, "ala n37" ... the RTX nodes are similar to the K20 nodes, so we can put the local software in place. See [[cluster:
scp 10.10.102.253:/
/

# Put the warewulf cluster key in authorized_keys
# Put eth0 fingerprints in cottontail/
# add to relevant known_hosts_servername file
# configure private subnets and ping file server
# make internet connection for yum

# iptables
dnf install -y iptables-services
vi /
# add 'local allow' ports --dport 0:65535
systemctl start iptables # and enable
iptables -L
systemctl stop firewalld
systemctl disable firewalld
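# (sketch) a 'local allow' rule of the kind referenced above; the file
# path and the private subnet below are assumptions, not from this page:
#   # /etc/sysconfig/iptables
#   -A INPUT -s 192.168.0.0/16 -p tcp -m tcp --dport 0:65535 -j ACCEPT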
# eth3 for ctt2 or eth1 for n100-101
dnf install bind-utils
dig google.com
iptables -L # check!
# Rocky8
# https://
dnf config-manager --set-enabled powertools
dnf install gnuplot
dnf install alpine # pico
yum groupinstall "
# other configs
echo "
# on head node /
allow 192.168.0.0/
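# (sketch) the head node chrony config might then contain lines like the
# following; the upstream pool and the subnet mask are assumptions:
#   pool 2.pool.ntp.org iburst
#   allow 192.168.0.0/16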
# compute nodes /
#pool 2.pool.ntp.org iburst
server 192.168.102.250
server 192.168.102.251
# check
chronyc sources

# on head node install from epel repo
yum install slurm-openlava
# error on conflicting libs, too bad!
yum install cmake -y
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y

# easybuild
yum install libibverbs libibverbs-devel

# amber20 cmake readline error fix needs
yum install ncurses-c++-libs-6.1-9.20180224.el8.x86_64.rpm \
ncurses-devel-6.1-9.20180224.el8.x86_64.rpm \
readline-devel-7.0-10.el8.x86_64.rpm

# amber20
yum -y install tcsh make \
gcc gcc-gfortran gcc-c++ \
perl perl-ExtUtils-MakeMaker util-linux wget \
bzip2 bzip2-devel zlib-devel tar
# CENTOS7 pick the kernel vendor used for now
# compute nodes old level 3
systemctl set-default multi-user.target
# compute nodes only
# openjdk version "
rpm -qa | grep ^java # check
yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel \
java-1.8.0-openjdk-headless javapackages-filesystem
# python v 3.9
yum install python39 python39-devel
ln -s /
# fftw 3.3.5-11.el8
yum install fftw fftw-devel
# dmtcp
yum install dmtcp dmtcp-devel

# check status of service munge
yum clean all
# eth3 onboot=no, private networks only
systemctl disable iptables
reboot

# now make it an ohpc compute node
yum repolist
yum install ohpc-base-compute

scp cottontail2:/
yum install ohpc-slurm-client
systemctl enable munge
systemctl start munge
scp cottontail2:/
echo SLURMD_OPTIONS="
yum install --allowerasing lmod-ohpc
grep '/
mkdir /
chown slurm:munge /
mkdir /
chown slurm:munge /
scp cottontail2:/
scp cottontail2:/
scp cottontail2:/

# /
/

#test
/

# start via rc.local
chmod +x /
#timing issue with munge
sleep 15
/
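# (sketch) an alternative to the fixed sleep: wait until munge answers;
# "munge -n" generates and validates a test credential. Assumes the munge
# client tools from above are installed:
#   until munge -n >/dev/null 2>&1; do sleep 2; done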

# slurmd ???
libhwloc.so.15 => /

# add to zenoss edit /
rocommunity public
dontLogTCPWrappersConnects yes
</code>
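The Slurm client steps in the recipe above, condensed into one place, could look roughly like this. It is only a sketch: the truncated paths in the recipe are spelled out here with assumed OpenHPC defaults (''/etc/munge/munge.key'', ''/etc/slurm/slurm.conf'', ''/var/spool/slurm'', ''/var/log/slurm''), which are not taken from this page, and the recipe itself starts slurmd from rc.local rather than systemd.

```shell
#!/bin/bash
# Sketch of the OpenHPC compute-node Slurm client setup; run as root.
# cottontail2 is the head node; all paths are assumed defaults.
set -e
yum -y install ohpc-base-compute ohpc-slurm-client lmod-ohpc

# shared munge key from the head node, then start munge
scp cottontail2:/etc/munge/munge.key /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
systemctl enable --now munge

# cluster config from the head node, spool/log dirs for slurmd
scp cottontail2:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
mkdir -p /var/spool/slurm /var/log/slurm
chown slurm:munge /var/spool/slurm /var/log/slurm
systemctl enable --now slurmd
```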
==== Pics ====

My data center robot thingie and node n100's gpus\\
\\
{{:
\\
{{:
\\
==== Amber20 ====

OpenHPC

<code>

# First install **all the necessary packages** (yum install...)

  989  tar xvfj ../
  993  cd amber20_src/
  994  cd build/
  996  vi run_cmake

# Assume this is Linux:

# serial, do on head node, with miniconda true, compile, install
cmake $AMBER_PREFIX/
    -DCMAKE_INSTALL_PREFIX=/
    -DCOMPILER=GNU \
    -DMPI=FALSE -DCUDA=FALSE -DINSTALL_TESTS=TRUE \
    -DDOWNLOAD_MINICONDA=TRUE -DMINICONDA_USE_PY3=TRUE \
    2>&1 | tee cmake.log

# Env

[hmeij@n100 ~]$ module load cuda/11.6

[hmeij@n100 ~]$ echo $CUDA_HOME
/

[hmeij@n100 ~]$ which nvcc mpicc gcc
/
/
/

# [FIXED] cmake error on conda install, set to FALSE
# OS native python, install on n[100-101]
-- Python version 3.9 -- OK
-- Found PythonLibs: /
-- Checking for Python package numpy -- not found
-- Checking for Python package scipy -- not found
-- Checking for Python package matplotlib -- not found
-- Checking for Python package setuptools -- found
[END FIXED]

# mpi & cuda FALSE builds serial
./run_cmake
make install
# lots and lots of warnings

# then
source /

# on n100 now, parallel, set miniconda flags to FALSE
-DMPI=TRUE
./run_cmake
make install

# on n100 just change cuda flag
-DCUDA=TRUE
./run_cmake
make install

#tests
cd $AMBERHOME
make test.serial
export DO_PARALLEL="
make test.parallel
export CUDA_VISIBLE_DEVICES=0
make test.cuda.serial
make test.cuda.parallel

</code>
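Once built, a job could be submitted to the test queue with a script along these lines. This is only a sketch: the partition name, the ''amber.sh'' path, and the input file names are assumptions, not taken from this page; the ''pmemd.cuda'' flags are the standard Amber ones.

```shell
#!/bin/bash
#SBATCH --job-name=amber20-md
#SBATCH --partition=test        # assumed queue name for n100/n101
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1

# environment as set up above; amber.sh location is an assumption
module load cuda/11.6
source /share/apps/amber20/amber.sh

# -O overwrites outputs; input/topology/coordinate names are placeholders
pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
```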

==== Amber22 ====

OpenHPC

<code>

# First install **all the necessary packages** (yum install...)

  989  tar xvfj ../
  993  cd amber20_src/
  994  cd build/
  996  vi run_cmake

# Assume this is Linux:

# serial, do on head node, with miniconda true, compile, install
cmake $AMBER_PREFIX/
    -DCMAKE_INSTALL_PREFIX=/
    -DCOMPILER=GNU \
    -DMPI=FALSE -DCUDA=FALSE -DINSTALL_TESTS=TRUE \
    -DDOWNLOAD_MINICONDA=TRUE \
    2>&1 | tee cmake.log
./run_cmake
make install

# Note !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# The OpenMPI and MPICH system installations provided by CentOS
# (i.e., through yum install)
# are known to be somehow incompatible with Amber22.
# OUCH !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# GO TO node n100

# install latest openmpi version
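# (sketch) e.g. a source build; the version and install prefix below are
# placeholders, not taken from this page:
#   wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.bz2
#   tar xvfj openmpi-4.1.4.tar.bz2 && cd openmpi-4.1.4
#   ./configure --prefix=/usr/local/openmpi-4.1.4 CC=gcc CXX=g++ FC=gfortran
#   make -j 8 all install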
# Env

[hmeij@n100 ~]$ module load cuda/11.6

[hmeij@n100 ~]$ echo $CUDA_HOME
/

[hmeij@n100 ~]$ which nvcc mpicc gcc
/
/
/

# [FIXED] cmake error on conda install, set to FALSE
# OS native python, install on n[100-101]
-- Python version 3.9 -- OK
-- Found PythonLibs: /
-- Checking for Python package numpy -- not found
-- Checking for Python package scipy -- not found
-- Checking for Python package matplotlib -- not found
-- Checking for Python package setuptools -- found
[END FIXED]

# mpi & cuda FALSE builds serial
./run_cmake
make install
# lots and lots of warnings

# then
source /

# on n100 now, parallel, set miniconda flags to FALSE
-DMPI=TRUE
./run_cmake
make install

# on n100 just change cuda flag
-DCUDA=TRUE
./run_cmake
make install

#tests
cd $AMBERHOME
make test.serial
export DO_PARALLEL="
make test.parallel
export CUDA_VISIBLE_DEVICES=0
make test.cuda.serial
make test.cuda.parallel

</code>
**[[cluster: