Our OpenLava software stack predates GPU computing, so GPUs need to be integrated into the scheduler as custom resources. This, along with other issues, has raised some scheduler allocation problems, detailed on this page.
A problem arose when we bought node n78 with the GTX 1080 Ti GPUs. That node runs CentOS 7 with CUDA 9.2. Once two GPU jobs are running, the scheduler assumes there are no more idle GPUs, even though the node contains four GPUs and the resource monitor reports two of them free (see below). Somewhere in the software stack there is a disconnect. My hunch is that it has to do with the “peer-to-peer” GPU technology; I've never found a CUDA setting to disable this protocol. As a workaround, simple shell scripts execute hourly, look for pending GPU jobs while there are idle GPUs, and force a job submission. Not pretty, but it works. This scheduler behavior also showed up when nodes n33-n37 were converted to CUDA 9.2 (it was not present under CUDA 5.0, go figure).
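The hourly workaround looks roughly like the sketch below. This is a hedged illustration, not the production script: the load index name ngpus_idle, the queue name gpu, and the use of brun to force-start a pending job are assumptions and need to be adapted to the local configuration.

    #!/bin/bash
    # Hourly cron sketch: if GPU jobs are pending while some host still reports
    # an idle GPU, force-start the oldest pending job on such a host.
    # "ngpus_idle" and the "gpu" queue are assumed names.

    # Hosts reporting at least one idle GPU via the elim-maintained index.
    idle_hosts=$(lsload -R "ngpus_idle>0" 2>/dev/null | awk 'NR>1 {print $1}')
    [ -z "$idle_hosts" ] && exit 0

    # Oldest pending job in the GPU queue (first data row of bjobs output).
    jobid=$(bjobs -u all -q gpu -p 2>/dev/null | awk 'NR==2 {print $1}')
    [ -z "$jobid" ] && exit 0

    # Force the job onto the first host that still has an idle GPU.
    brun -m "$(echo "$idle_hosts" | head -n 1)" "$jobid"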
GPU resources can be managed using static or dynamic resource monitors.
A static monitor is simply defined in the configuration files. For example, if you have 4 GPUs per node you can create a resource called gpu4 with a default value of 4 which decrements by 1 for each GPU allocated, until all GPUs are in use and the value is 0. You still need to inform the application the job launches which GPU to use (see below). One downside of a static GPU resource is that the number of GPUs per node or queue may vary cluster-wide, so each hardware configuration needs its own resource. The other problem is that we allow users to SSH to compute nodes and debug at the command line, hence a user could be occupying a GPU outside of any scheduled job; the scheduler would not be aware of such usage.
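For illustration, a job could reserve one unit of such a static resource at submit time, roughly as below. This is a hedged sketch: gpu4 is the example resource name from above, and the exact rusage syntax should be checked against the local OpenLava configuration.

    # Reserve one unit of the static gpu4 resource for the life of the job;
    # the scheduler decrements gpu4 on the chosen host from 4 toward 0.
    bsub -R "rusage[gpu4=1]" ./run_gpu_job.sh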
A dynamic monitor relies on an “elim” (external load information manager) to report the number of idle GPUs on each node every 15 seconds or so. The “elim” reports a GPU as available if its utilization rate is at or below 1%. That information, along with the resources the scheduler monitors automatically (runtime load indices, memory availability, etc.), can be queried with the lsload -l utility. The advantage of a dynamic monitor is that it does not matter how many GPUs there are per node cluster-wide; one monitor handles all configurations.
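A minimal elim along these lines could look like the sketch below. It assumes nvidia-smi is installed, an external load index named ngpus_idle, and the LSF-style elim output convention of "<number of indices> <name> <value>" written to stdout each interval; the real elim may differ.

    #!/bin/bash
    # Minimal elim sketch: every 15 seconds, count the GPUs whose utilization
    # is at or below 1% and report the count as the load index "ngpus_idle".
    while true; do
        idle=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits \
               | awk '$1 <= 1 {n++} END {print n+0}')
        # Assumed output convention: "<number_of_indices> <name1> <value1> ...".
        echo "1 ngpus_idle $idle"
        sleep 15
    done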
Sidebar: running both a static and a dynamic resource is prone to clashes, since the dynamic monitor is “aware” of static-resource GPU allocations (it sees their utilization) but not vice versa.
Another problem arose due to workflow issues. Imagine a job that runs for two days. After the job is dispatched and started, the application begins using the allocated GPU and ramps its utilization up to 99% for 4 hours. Then the GPU side of things is done and the results are written to disk. For the next 4 hours the job keeps launching applications on the allocated CPU core(s) to process those results. Then back to GPU, back to CPU, back and forth until the job is done. What this means is that while the job is in a CPU phase, the GPU utilization rate falls to 0%. The “elim” then reports that GPU as available, a new pending GPU job is dispatched, and it is allocated that same GPU. So when the first job attempts to cycle back to its allocated GPU it crashes (the GPUs on node n78 run with exclusive compute mode and persistence mode enabled).
Somehow, whether using a static or a dynamic resource, the physical GPU device IDs that have been allocated need to be “remembered”. One way might be for the job wrappers to remove the allocated GPU device IDs from a config file when CUDA_VISIBLE_DEVICES is set. The “elim” could then report as idle only the non-allocated devices by consulting this file, and the scheduler's post_exec script could add the device ID back into the config file when the job finishes.
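A rough sketch of that bookkeeping is below. Everything here is an assumption made for illustration (the free-list file /var/run/free_gpus, the flock-based locking, the wrapper/post_exec split); it is not an existing implementation.

    #!/bin/bash
    # --- job wrapper (sketch) ---
    # Claim the first free physical GPU ID from a shared free-list file
    # (one device ID per line, e.g. 0..3) and export it to the application.
    FREELIST=/var/run/free_gpus
    exec 9>"$FREELIST.lock"
    flock 9
    gpu=$(head -n 1 "$FREELIST")
    if [ -z "$gpu" ]; then
        flock -u 9
        echo "no free GPU recorded in $FREELIST" >&2
        exit 1
    fi
    # Drop the claimed ID so the elim stops reporting it as idle.
    sed -i '1d' "$FREELIST"
    flock -u 9

    export CUDA_VISIBLE_DEVICES="$gpu"
    "$@"    # launch the real application with the claimed device

    # --- post_exec (sketch) ---
    # The scheduler's post_exec would return the ID under the same lock, e.g.:
    #   flock /var/run/free_gpus.lock -c "echo <device_id> >> /var/run/free_gpus"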
Another solution would be to break the cycling CPU/GPU job into multiple jobs: the first CPU phase cannot start until the first GPU phase is done, and so on. That can be done with the done(job_ID|"job_name") or ended(job_ID|"job_name") dependency conditions of bsub.
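For example (a hedged sketch; the job names and scripts are made up):

    # First GPU phase.
    bsub -J gpu_phase1 ./gpu_phase1.sh
    # CPU phase that starts only once the GPU phase finished successfully.
    bsub -J cpu_phase1 -w 'done(gpu_phase1)' ./cpu_phase1.sh
    # Next GPU phase waits for the CPU phase to end, regardless of exit status.
    bsub -J gpu_phase2 -w 'ended(cpu_phase1)' ./gpu_phase2.sh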
More thoughts?