
Differences

This shows you the differences between two versions of the page.

--- cluster:225 [2024/04/07 14:30]
hmeij07
+++ cluster:225 [2024/05/13 14:30] (current)
hmeij07
@@ Line 63: / Line 63: @@
 <code>
-# n78 first ...
+# n78 first ... (no problem, tests success)
-# n89 second
 # make sure /usr/src/kernels/$(uname -r) exists else
 # scp into place from n100 (centos8, possibly caused by warewulf...)
+# n79 next (no problem)
+# but during testing found this error which I had seen BEFORE cuda install
+reason:         BUG: unable to handle kernel NULL pointer dereference at 0000000000000108
+# follow up, nvidia driver "taints" the kernel by loading a proprietary driver
+# this is mostly a warning but may interfere, as docker likely does
+https://unix.stackexchange.com/questions/118116/what-is-a-tainted-linux-kernel
+# disabled docker on n79 for now 04/15/2024 9:06AM
+# also rotated the memory dimms some time later, seems to have fixed issue
+# started docker back up on n79 05/06/2024 9:56AM (has been up 17 days by now)
+# n89 next (no problem)
+# but upon reboot I encountered that error for the FIRST time on this node
+# need to research, it seems somehow related to cuda install
+# n80 (same error upon reboot after driver install)
+# n81 (same error upon reboot after driver install)
+# n90 (same error upon reboot after toolkit install, not driver. weird)
+# n88 (failed toolkit install, ran /usr/bin/nvidia-uninstall, reboot,
+#      re-installed driver, reboot, re-installed toolkit, reboot,
+#      no error occurs!)
 sh ./NVIDIA-Linux-x86_64-550.67.run
@@ Line 73: / Line 92: @@
 dkms build? yes
 rebuild initramfs? yes
+xconfig? no
 error nvidia module can not be loaded
 reboot fixed that
@@ Line 115: / Line 135: @@
 sh cuda_12.4.0_550.54.14_linux.run
+REBOOT and check date before launching slurm
+mv /var/spool/slurmd/cred_state /var/spool/slurmd/cred_state.bak
 ===========
@@ Line 214: / Line 237: @@
 ** CentOS 7 on n89 **
+The steps above can also be done for the default cuda installation on exx96, where the soft link ''/usr/local/bin/cuda'' would have pointed to ''/usr/local/bin/cuda-10.2''. Do not follow the soft link; use the path with the toolkit version in it when setting your cuda environment.
 Next test is to see if older software runs compatible with newer drivers. We test that by running a gpu program against new 550 driver and cuda toolkit 9.2 and see if it works (~hmeij/slurm/run.centos7.2).
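To make the soft-link advice above concrete, here is a minimal sketch of setting a cuda environment from the versioned path. The helper name ''use_cuda'' is hypothetical, the ''bin'' and ''lib64'' subdirectories assume the standard toolkit layout, and the path is the one quoted above; adjust it if the toolkit lives elsewhere on your node.

```shell
# Hypothetical helper (not part of the cluster setup): point the environment
# at one explicit toolkit version instead of following the cuda soft link.
use_cuda() {
    cuda_home="$1"                      # the versioned toolkit path
    export CUDA_HOME="$cuda_home"
    export PATH="$cuda_home/bin:$PATH"  # assumes the usual bin/ subdirectory
    # assumes the usual lib64/ subdirectory; avoids a trailing ":" when unset
    export LD_LIBRARY_PATH="$cuda_home/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

use_cuda /usr/local/bin/cuda-10.2
echo "$CUDA_HOME"
```

Because the version is spelled out, a later change to where the soft link points does not silently change which toolkit a job picks up.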
@@ Line 245: / Line 270: @@
 </code>
-So it appears that all our cuda versions can use the new 550 driver that comes with cuda-12.4 toolkit. Two other cuda versions have not been tested but should function as well (cuda-11.2 in mwgpu and cuda-10.2 in exx96). But in these queues cuda-9.2 is present on local hard disk and software was compiled against that toolkit so which queue to use did not matter. (compilations against 10.2 did not run in 9.2, as expected). I was able to test and run lammps in 11.2 consult the file ''~hmeij/slurm/centos.2''
+So it appears that all our cuda versions can use the new 550 driver that comes with cuda-12.4 toolkit. Two other cuda versions have not been tested but should function as well (cuda-11.2 in mwgpu and cuda-10.2 in exx96). But in these queues cuda-9.2 is present on local hard disk and software was compiled against that toolkit, so which queue to use did not matter (compilations against 10.2 did not run in 9.2, as expected). I was able to test and run lammps in 10.2; consult the file ''~hmeij/slurm/centos.2''
 These compatibility results are way, way better than expected. Yea.
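The kernel "taint" mentioned in the n79 notes can be checked directly on a node. The sketch below reads ''/proc/sys/kernel/tainted'', which is a bitmask where bit 0 (value 1) means a proprietary module has been loaded. The function name ''has_proprietary_taint'' is hypothetical, and only that one bit is decoded here.

```shell
# Hypothetical helper: succeed if bit 0 (value 1, flag "P": proprietary
# module loaded) is set in a kernel taint value.
has_proprietary_taint() {
    [ $(( $1 & 1 )) -ne 0 ]
}

# Read the live value where available; fall back to 0 (untainted) elsewhere.
taint=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo 0)
taint=${taint:-0}

if has_proprietary_taint "$taint"; then
    echo "taint=$taint: proprietary module loaded"
else
    echo "taint=$taint: no proprietary module flag"
fi
```

A nonzero value on a node with the nvidia driver loaded is expected, so by itself it does not explain the NULL pointer dereference.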
