This shows you the differences between two versions of the page.
cluster:225 [2024/04/08 08:13] hmeij07 |
cluster:225 [2024/05/13 14:26] hmeij07 |
||
---|---|---|---|
Line 63: | Line 63: | ||
< | < | ||
- | # n78 first ... | + | # n78 first ... (no problem, tests succeeded) |
- | # n89 second | + | |
# make sure / | # make sure / | ||
# scp into place from n100 (centos8, possibly caused by warewulf...) | # scp into place from n100 (centos8, possibly caused by warewulf...) | ||
+ | |||
+ | # n79 next (no problem) | ||
+ | # but during testing found this error which I had seen BEFORE cuda install | ||
+ | reason: | ||
+ | # follow up, nvidia driver " | ||
+ | # this is mostly a warning but may interfere, as docker likely does
+ | https:// | ||
+ | # disabled docker on n79 for now 04/15/2024 9:06AM | ||
+ | # also rotated the memory dimms some time later, seems to have fixed issue | ||
+ | # started docker back up on n79 05/06/2024 9:56AM (has been up 17 days by now) | ||
+ | |||
+ | # n89 next (no problem) | ||
+ | # but upon reboot I encountered that error for the FIRST time on this node | ||
+ | # need to research this; it is somewhat related to the cuda install
+ | # n80 (same error upon reboot after driver install) | ||
+ | # n81 (same error upon reboot after driver install) | ||
+ | # n90 (same error upon reboot after toolkit install, not driver. weird) | ||
+ | # n88 (failed toolkit install, ran / | ||
+ | # re-installed driver, reboot, re-installed toolkit, reboot,
+ | # error occurs same as n90) | ||
sh ./ | sh ./ | ||
Line 116: | Line 135: | ||
sh cuda_12.4.0_550.54.14_linux.run | sh cuda_12.4.0_550.54.14_linux.run | ||
+ | |||
+ | REBOOT and check date before launching slurm | ||
+ | mv / | ||
=========== | =========== | ||
Line 215: | Line 237: | ||
** CentOS 7 on n89 ** | ** CentOS 7 on n89 ** | ||
+ | |||
+ | The steps above can also be done for the default cuda installation on exx96 where the soft link ''/ | ||
The next test is to see whether older software is compatible with newer drivers. We test that by running a gpu program against the new 550 driver and cuda toolkit 9.2 to see if it works (~hmeij/ | The next test is to see whether older software is compatible with newer drivers. We test that by running a gpu program against the new 550 driver and cuda toolkit 9.2 to see if it works (~hmeij/ |
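Before running the gpu program, a quick pre-check can confirm that the installed driver meets the minimum an older toolkit expects. The sketch below is only an illustration, not part of the procedure recorded on this page: ''version_ge'' and ''REQUIRED_DRIVER'' are hypothetical names, the comparison assumes GNU ''sort -V'' is available on the node, and the default minimum of 396.26 is the figure NVIDIA's release notes are believed to list for CUDA 9.2 on Linux, so confirm it there before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 >= version $2.
# Assumes GNU coreutils "sort -V" (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum driver for the older toolkit (CUDA 9.2 here).
# Check NVIDIA's release notes for the exact value; override via REQUIRED_DRIVER.
required_driver="${REQUIRED_DRIVER:-396.26}"

if command -v nvidia-smi >/dev/null 2>&1; then
    # Ask the driver for its version; take the first GPU's answer.
    installed_driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$installed_driver" "$required_driver"; then
        echo "driver $installed_driver satisfies minimum $required_driver"
    else
        echo "driver $installed_driver is older than minimum $required_driver"
    fi
else
    echo "nvidia-smi not found; skipping driver check"
fi
```

Passing this check only shows the driver is new enough on paper; actually running the gpu program against toolkit 9.2 remains the real test.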