Back

GPU Allocation Problems

Our Openlava software stack predates GPU computing, so GPUs have to be integrated into the scheduler as generic resources. This, along with other issues, has raised the scheduler allocation problems detailed on this page.

The first problem arose when we bought node n78 with its GTX1080Ti GPUs. That node runs CentOS 7 with CUDA 9.2 software. When two GPU jobs are running, the scheduler assumes there are no more idle GPUs, even though the node contains four GPUs and the resource monitor reports two of them as free. Somewhere in the software stack there is a disconnect. My inclination is that this has to do with the new “peer-to-peer” GPU technology; I've never found a setting to disable this protocol. As a workaround, simple scripts execute hourly, look for pending GPU jobs while there are idle GPUs, and force those jobs to dispatch (a sketch of such a watchdog follows below). This behavior also showed up when nodes n33-n37 were converted to CUDA 9.2 and was not present under CUDA 5.0, go figure.
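The following is a minimal sketch of such an hourly watchdog, not the actual site script. It assumes a per-host numeric GPU resource (called gpu here), the Openlava commands bjobs, lsload and brun being available in the cron environment, and the host list shown; the output parsing is approximate and would need adjusting to the local setup.

```python
#!/usr/bin/env python3
"""Hourly watchdog sketch: if GPU jobs sit pending while a GPU node still
reports idle GPUs, force the scheduler to dispatch them.

Hypothetical names: the per-host resource 'gpu', the host list GPU_HOSTS,
and the exact output parsing all need to match the local Openlava setup.
"""
import re
import subprocess

GPU_HOSTS = ["n78"]   # nodes whose idle GPUs the scheduler loses track of

def pending_gpu_jobs():
    """Job IDs of pending jobs whose long listing mentions the gpu resource."""
    out = subprocess.run(["bjobs", "-u", "all", "-p", "-l"],
                         capture_output=True, text=True).stdout
    jobs = []
    # each record in 'bjobs -l' output starts with "Job <jobid>"
    for record in re.split(r"(?=Job <\d+>)", out):
        m = re.match(r"Job <(\d+)>", record)
        if m and "gpu" in record:
            jobs.append(m.group(1))
    return jobs

def idle_gpus(host):
    """Number of free GPUs the load monitor reports for a host."""
    out = subprocess.run(["lsload", "-l", host],
                         capture_output=True, text=True).stdout
    rows = [line.split() for line in out.splitlines() if line.strip()]
    if len(rows) < 2 or "gpu" not in rows[0]:
        return 0
    col = rows[0].index("gpu")            # column of the custom gpu resource
    try:
        return int(float(rows[1][col]))
    except (IndexError, ValueError):
        return 0

if __name__ == "__main__":
    for host in GPU_HOSTS:
        free = idle_gpus(host)
        for jobid in pending_gpu_jobs():
            if free <= 0:
                break
            # brun forces dispatch of a pending job onto a specific host
            subprocess.run(["brun", "-m", host, jobid])
            free -= 1
```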

GPU resources can be managed using static or dynamic monitors. The static monitors are simply defined in the configuration files. For example, if you have 4 GPUs per node you can create a resource called gpu4 with a default value of 4, which decrements by 1 for each GPU allocated until all GPUs are in use and the value reaches 0. You still need to inform the application the job launches which GPU to use (see below).
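A static counter like gpu4 only reserves GPU slots at the scheduler level; the job itself still has to pick a physical device. With CUDA applications the usual mechanism is to set CUDA_VISIBLE_DEVICES before launching the program. The wrapper below is a minimal sketch, not the site's actual script; it assumes nvidia-smi is installed on the node and that the real application command line is passed as arguments to the wrapper.

```python
#!/usr/bin/env python3
"""Job-wrapper sketch: pick the least-used GPU on the node and pin the
application to it via CUDA_VISIBLE_DEVICES.

Usage (hypothetical): ./gpu_wrapper.py ./my_cuda_app --input data.dat
"""
import os
import subprocess
import sys

def least_used_gpu():
    """Return the index of the GPU with the lowest reported utilization."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    gpus = [tuple(int(v) for v in line.split(","))
            for line in out.strip().splitlines()]
    return min(gpus, key=lambda g: g[1])[0]

if __name__ == "__main__":
    gpu = least_used_gpu()
    # restrict the application to the chosen device; CUDA renumbers it as device 0
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    os.execvp(sys.argv[1], sys.argv[1:])
```

On the submission side, the static counter is consumed by requesting the resource at submit time, typically something along the lines of `bsub -R "rusage[gpu4=1]"` in LSF-derived schedulers such as Openlava; the exact resource name and syntax are site-specific.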


Back
