====== GPU Allocation Problems ======
Our Openlava software stack predates GPUs, so they need to be integrated into the scheduler as resources. This, along with other issues, has raised some scheduler allocation problems, detailed on this page.
A problem arose when we bought a new GPU node.
GPU resources can be managed using static or dynamic monitors.

A static monitor is simply defined in the configuration files. For example, if you have 4 GPUs per node you can create a resource called **gpu4** with a default value of 4, which decrements by 1 for each GPU allocated until all GPUs are used and the value is 0. You still need to inform the application the job launches which GPU to use (see below). One downside of a static GPU resource is that the number of GPUs per node or queue may vary cluster wide, so you may need multiple such resources. The other problem is that we allow users to SSH to compute nodes and debug at the command line, hence users could be using a GPU while working on problems; the scheduler would not be aware of such usage.
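For illustration, a static numeric resource in Openlava's LSF-style configuration files could look like the sketch below. The resource name **gpu4** comes from the example above; the host names and file layout are made up, not our actual configuration.

<code>
# lsf.shared -- declare the resource; no update interval and
# non-increasing means it is a static numeric resource
Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
gpu4          Numeric  ()        N           (free GPUs per node)
End Resource

# lsf.cluster.<name> -- map a value of 4 onto each (hypothetical)
# GPU host
Begin ResourceMap
RESOURCENAME  LOCATION
gpu4          (4@[n33] 4@[n34])
End ResourceMap
</code>

A job would then reserve one GPU with something like ''bsub -R "rusage[gpu4=1]" ...'', which decrements the counter for the lifetime of the job.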
+ | |||
+ | A dynamic monitor relies on an " | ||
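A minimal sketch of such an elim, assuming it judges a GPU idle when ''nvidia-smi'' reports 0% utilization; the resource name **gpufree** and the reporting interval are made up. An elim simply writes "<number of resources> <name> <value>" lines to stdout for the lim daemon to pick up.

<code bash>
#!/bin/bash
# elim.gpu -- hypothetical external load information manager:
# count GPUs whose utilization is currently 0% and report them
while true; do
    total=$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l)
    busy=$(nvidia-smi --query-gpu=utilization.gpu \
           --format=csv,noheader,nounits | awk '$1 > 0' | wc -l)
    # protocol: "<number of resources> <name> <value>"
    echo "1 gpufree $((total - busy))"
    sleep 10
done
</code>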
+ | |||
+ | Sidebar: Running both a static and a dynamic resource is prone to cause clashes as the dynamic monitor is " | ||
+ | |||
+ | Another problem arose due to workflow issues. Imagine a job that runs for two days. After the job is dispatched and started the application starts using the allocated GPU. The application ramps the utilization of the GPU up to 99% for 4 hours. Then the GPU side of things is done and results written to disk. For the next 4 hours the job continues launching applications on the allocated CPU core(s) and processes the results. Then back to GPU, back to CPU, back to etc...until the job is done. What this means is that when the job uses the CPU the GPU utilization rate falls to 0%. Now the " | ||
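The queries below show why, assuming the stock ''nvidia-smi'' tool; during a CPU phase the first one prints 0 for the allocated GPU, and once the GPU application has exited between phases even the process-based check shows the device as free.

<code bash>
# utilization per GPU: reads 0 during the job's CPU phases
nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader

# compute processes attached per GPU: helps only while the GPU
# application is still running
nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader
</code>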
+ | |||
+ | ===== Solutions Problems ===== | ||
+ | |||
+ | Somehow, using a static or dynamic resource, the physical GPU device IDs allocated need to be " | ||
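A minimal sketch of that idea, assuming a wrapper script that claims a device with a lock directory on the node; the lock location is hypothetical, not an existing Openlava feature.

<code bash>
#!/bin/bash
# pick-gpu.sh -- hypothetical wrapper: claim a free GPU for the
# lifetime of the job, pin the application to it, release on exit
LOCKDIR=/var/tmp/gpu-locks        # assumed shared location on the node
mkdir -p "$LOCKDIR"
for id in 0 1 2 3; do             # 4 GPUs per node, as in the gpu4 example
    # mkdir is atomic, so two jobs cannot claim the same device
    if mkdir "$LOCKDIR/gpu$id" 2>/dev/null; then
        export CUDA_VISIBLE_DEVICES=$id
        trap 'rmdir "$LOCKDIR/gpu$id"' EXIT   # release when the job ends
        "$@"                      # run the real application on that GPU
        exit $?
    fi
done
echo "no free GPU on $(hostname)" >&2
exit 1
</code>

Submitted as ''bsub -R "rusage[gpu4=1]" ./pick-gpu.sh ./my_gpu_app'', the lock keeps the device tied to the job even while its utilization sits at 0%.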
+ | |||
+ | Another solution would be to break the cycling CPU/GPU job into multiple jobs. The first CPU cycle can not start until first GPU cycle is done etc. That can be done with the // | ||
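A sketch of such a chain, using the standard ''bsub -w'' dependency conditions; the job names and scripts are of course made up.

<code bash>
# each phase only starts after the previous named job has finished
bsub -J gpu1 -R "rusage[gpu4=1]" ./gpu_phase1.sh
bsub -J cpu1 -w "done(gpu1)" ./cpu_phase1.sh
bsub -J gpu2 -w "done(cpu1)" -R "rusage[gpu4=1]" ./gpu_phase2.sh
</code>

This way a GPU is only reserved while a GPU phase actually runs, and is free for other jobs during the CPU phases.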
+ | |||
+ | More thoughts? | ||