--- cluster:225 [2024/05/06 13:38] – hmeij07
+++ cluster:225 [2024/05/21 14:06] (current) – hmeij07
@@ Line 63: / Line 63: @@
 <code>
-# n78 first ... (no problem, tests success)
+# n78 first ... can reimage cuda-11.6 from n101 (no problem, tests success)
 # make sure /usr/src/kernels/$(uname -r) exists else
 # scp into place from n100 (centos8, possibly caused by warewulf...)
+# however old nvidia packages still in OS (driver 510 toolkit 11.6)...
+#  rpm -qa | grep ^nvidia | wc -l results in 16 packages...
+# what happens on dnf check-update ???
+# n[100-101] skipping for now
+# this is a package install and there is no nvidia-uninstall (runfile)
+# (an upgrade would require internet: 'dnf check-update; dnf update')
+# switching between rpm install and runfile is NOT recommended
+# and 'dnf erase nvidia*' may leave a hung system behind
 # n79 next (no problem)
@@ Line 75: / Line 84: @@
 # disabled docker on n79 for now 04/15/2024 9:06AM
 # also rotated the memory dimms some time later, seems to have fixed issue
+# started docker back up on n79 05/06/2024 9:56AM (has been up 17 days by now)
 # n89 next (no problem)
 # but upon reboot I encountered that error for the FIRST time on this node
-# need to research it is not related to cuda install
+# need to research, it seems somewhat related to cuda install
+# n80 (same error upon reboot after driver install)
+# n81 (same error upon reboot after driver install)
+# n90 (same error upon reboot after toolkit install, not driver. weird)
+# n88 (failed toolkit install, ran /usr/bin/nvidia-uninstall, reboot
+#      re-installed driver, reboot, re-installed toolkit, reboot,
+#      no error occurs! )
+# n87 (ran nvidia-uninstall first, driver then toolkit, error shows up)
+# n86 & n85 same as n87
+# n84 (no error shows up)
+# n82 (as n87 but error shows up after driver before toolkit install)
+# n83 (error shows up after toolkit install, not driver install reboot)
 sh ./NVIDIA-Linux-x86_64-550.67.run
@@ Line 131: / Line 151: @@
 REBOOT and check date before launching slurm
+mv /var/spool/slurmd/cred_state /var/spool/slurmd/cred_state.bak
 ===========
@@ Line 230: / Line 251: @@
 ** CentOS 7 on n89 **
+The steps above can also be done for the default cuda installation on exx96, where the soft link ''/usr/local/bin/cuda'' would have pointed to ''/usr/local/bin/cuda-10.2''. Do not follow the soft link; use the path with the toolkit version in it when setting your cuda environment.
 The next test is to see whether older software is compatible with newer drivers. We test that by running a gpu program against the new 550 driver and cuda toolkit 9.2 to see if it works (~hmeij/slurm/run.centos7.2).
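
The advice above about using the versioned toolkit path rather than the soft link can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name ''use_cuda'' is hypothetical and not part of the cluster setup, and the toolkit directory is passed in by the caller because the exact location differs per node.

```shell
# Hypothetical helper (not part of the cluster setup): point the environment
# at one specific, versioned toolkit directory and refuse a soft link.
use_cuda() {
    # strip a trailing slash so the soft link test below is reliable
    dir="${1%/}"
    if [ -L "$dir" ]; then
        echo "use_cuda: $dir is a soft link, use the versioned path" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "use_cuda: $dir is not a directory" >&2
        return 1
    fi
    CUDA_HOME="$dir"
    PATH="$CUDA_HOME/bin:$PATH"
    # append any existing LD_LIBRARY_PATH only when it is set and non-empty
    LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export CUDA_HOME PATH LD_LIBRARY_PATH
}

# Example call (the path is a placeholder, adjust per node):
#   use_cuda /path/to/cuda-10.2 && nvcc --version
```

Passing the directory explicitly keeps the job script tied to one toolkit version, so a later change to the soft link target does not silently change what a job compiles or runs against.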