==== OpenHPC ====

Additional tools for the OpenHPC environment. First add these two lines to SMS and all compute nodes. Patch CHROOT as well.
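
Whatever the two lines are, the pattern is the same: apply the change on the SMS itself and again under the compute image so it ends up on the nodes. A rough sketch only, with a purely illustrative file path and content:

<code>
# illustrative only -- substitute the real file and the real two lines
CHROOT=/opt/ohpc/admin/images/centos7.3   # adjust to the local image path
FILE=/etc/some.conf                       # whichever file the two lines go in

# on the SMS itself
echo "line one" >> $FILE
echo "line two" >> $FILE

# and in the compute image, so nodes pick it up after reimaging
echo "line one" >> $CHROOT$FILE
echo "line two" >> $CHROOT$FILE
</code>
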
  * http://

  * Not installing ClusterShell, using ''pdsh'' instead
  * add compute hostnames to /
  * ''

<code>

[root@ohpc0-test ~]# pdsh uptime
n31: 10:44:25 up 19:
n29: 10:44:25 up 19:

</code>
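
''pdsh'' reads the ''WCOLL'' environment variable for a file of target hostnames, so a minimal setup looks something like this (the file location is just an example):

<code>
# put the compute hostnames, one per line, in a file
cat > /root/pdsh.hosts <<EOF
n29
n31
EOF

# point pdsh at it; every command then fans out to all listed nodes
export WCOLL=/root/pdsh.hosts
pdsh uptime
pdsh date
</code>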

  * Skip ''
  * Skip ''
  * Skip ''
  * Skip ''
  * Redefine ''
    * use eth0, not public address eth1
    * and CHROOT/
    * import file back into database
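
For the re-import step, Warewulf's ''wwsh'' does the work; something along these lines (the file name is only an example, use whichever file was redefined):

<code>
# re-import the edited file so the datastore copy matches the new contents
wwsh file import /etc/slurm/slurm.conf

# a file already in the datastore can also be refreshed in place
wwsh file resync slurm.conf

# verify what the datastore now holds
wwsh file list
</code>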

Ran into a slurm config problem here on compute nodes when starting ''slurmd'' via systemd: the PID file location did not line up, fixed as follows.

<code>

# ON COMPUTE NODES, that is in CHROOT

# Removed file /
mv /

# Made the following link
[root@n31 ~]# ls -l /
lrwxrwxrwx 1 root root 38 Mar 30 14:05 /

# now it starts properly
Mar 31 12:41:05 n31.localdomain systemd[1]: Starting Slurm node daemon...
Mar 31 12:41:05 n31.localdomain systemd[1]: PID file /
Mar 31 12:41:05 n31.localdomain systemd[1]: Started Slurm node daemon.

</code>
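
The underlying mismatch is between the PID file ''slurmd'' writes (set in ''slurm.conf'') and the one the systemd unit watches for. A quick way to compare the two on a node (the unit file path may differ per install):

<code>
# where slurmd will write its pid, according to slurm.conf
grep -i SlurmdPidFile /etc/slurm/slurm.conf

# where systemd expects to find it, according to the unit file
grep -i PIDFile /usr/lib/systemd/system/slurmd.service

# after making the two agree (or symlinking, as above), restart and check
systemctl restart slurmd
systemctl status slurmd
</code>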

  * Recreate vnfs, reimage the whole kaboodle
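
Rebuilding the image is the usual Warewulf step, roughly (the chroot path is an example):

<code>
# rebuild the VNFS from the chroot so it contains the changes above
wwvnfs --chroot /opt/ohpc/admin/images/centos7.3

# confirm the image was regenerated, then reboot/reimage the nodes
wwsh vnfs list
</code>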

Link to my previous eval of Slurm and job throughput testing: [[cluster:

Here are my current settings in ''slurm.conf'' on OpenHPC.

<code>
ClusterName=linux
ControlMachine=ohpc0-slurm
ControlAddr=192.168.1.249
SlurmUser=slurm
SlurmctldPort=6815-6817
SlurmdPort=6818
AuthType=auth/
StateSaveLocation=/
SlurmdSpoolDir=/
SwitchType=switch/
MpiDefault=none
SlurmctldPidFile=/
SlurmdPidFile=/
ProctrackType=proctrack/
FirstJobId=101
MaxJobCount=999999
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/
SchedulerPort=7321
SelectType=select/
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/
SlurmctldLogFile=/
Epilog=/
ReturnToService=1
NodeName=ohpc0-slurm NodeAddr=192.168.1.249
NodeName=n29 NodeAddr=192.168.102.38
NodeName=n31 NodeAddr=192.168.102.40
PartitionName=test Nodes=n29,

</code>

Define CPUs, Cores, ThreadsPerCore, and related hardware attributes per node in the ''NodeName'' lines.
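
For example, a fully spelled-out node definition would look something like this (the hardware numbers are made up, not those of n29/n31):

<code>
NodeName=n29 NodeAddr=192.168.102.38 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 CPUs=8 RealMemory=12000 State=UNKNOWN
</code>

After changing node definitions, ''scontrol reconfigure'' (or restarting ''slurmctld'' and ''slurmd'') makes the daemons pick up the new values.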

[[cluster: