


Back

GPU Allocation Problems

Our Openlava software stack predates GPUs, so GPUs need to be integrated into the scheduler as resources. This, along with other issues, has raised some scheduler allocation problems, which are detailed on this page.

A problem arose when we bought node n78 with the GTX1080Ti GPUs. That node runs CentOS 7 with CUDA 9.2 software. When two GPU jobs are running, the scheduler assumes there are no more idle GPUs, even though the node contains 4 GPUs and the resource monitor reports two of them as free. Somewhere in the software stack there is a disconnect. My inclination is that this has to do with the new “peer-to-peer” GPU technology; I've never found a setting to disable this protocol. As a workaround, simple scripts execute hourly, look for pending GPU jobs while there are idle GPUs, and force such a job to start. This behavior also showed up when nodes n33-n37 were converted to CUDA 9.2 and was not present under CUDA 5.0, go figure.
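Something along these lines captures what the hourly workaround does. This is a minimal sketch, not the production script; the queue name, the resource name “gpu”, the host list, and the use of brun are assumptions.

  #!/bin/bash
  # Hourly watchdog sketch: if GPU jobs sit pending while a GPU node still
  # reports idle GPUs, force the oldest pending job onto that node.
  QUEUE=mwgpu      # illustrative GPU queue name
  for host in n78 n33 n34 n35 n36 n37; do
      # read the "gpu" column (idle GPU count reported by the elim) from lsload
      idle=$(lsload -l "$host" | \
             awk 'NR==1{for(i=1;i<=NF;i++) if($i=="gpu") c=i} NR==2{print $c}')
      # first pending job in the GPU queue, if any
      jobid=$(bjobs -u all -p -q "$QUEUE" 2>/dev/null | awk 'NR==2{print $1}')
      if [ -n "$jobid" ] && [ "${idle:-0}" -ge 1 ]; then
          brun -m "$host" "$jobid"    # force dispatch (scheduler admin only)
          break
      fi
  done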

GPU resources can be managed using static or dynamic resource monitors.

A static monitor is simply defined in the configuration files. For example, if you have 4 GPUs per node you can create a resource called gpu4 with a default value of 4, which decrements by 1 for each GPU allocated until all GPUs are used and the value is 0. You still need to inform the application the job launches which GPU to use (see below). One downside of static GPU resources is that the number of GPUs per node or queue may vary. The other problem is that we allow users to SSH to compute nodes and debug at the command line, hence users could be working on problems using a GPU directly. The scheduler would not be aware of such usage.
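For illustration, a static resource of this kind would be declared roughly as follows in the LSF-style configuration files that Openlava inherits; the resource name, host names, and exact column values are only examples.

  # lsf.shared -- declare the static numeric resource
  Begin Resource
  RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
  gpu4          Numeric  ()        N           (static count of GPUs per node)
  End Resource

  # lsf.cluster.<name> -- map 4 units of the resource onto each GPU node
  Begin ResourceMap
  RESOURCENAME  LOCATION
  gpu4          (4@[n78 n33 n34 n35 n36 n37])
  End ResourceMap

A job would then reserve one unit for its lifetime with something like bsub -R "rusage[gpu4=1]" and the scheduler decrements the counter until it reaches 0.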

A dynamic monitor relies on an “elim” (external load information manager) to report the number of idle GPUs per node every 15 seconds or so. The “elim” reports a GPU as available if its utilization rate is at or below 1%. That information, along with the resources the scheduler automatically monitors (runtime load indices, memory availability, etc.), can be queried with the lsload -l command. The advantage of a dynamic monitor is that it does not matter how many GPUs there are per node cluster wide; one resource handles all configurations.
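A minimal elim along those lines might look like this; it assumes the dynamic resource was named “gpu” in lsf.shared and that nvidia-smi is available on the node (both illustrative).

  #!/bin/bash
  # elim sketch: every 15 seconds count the GPUs on this host whose
  # utilization is at or below 1% and report them as idle.
  while true; do
      idle=$(nvidia-smi --query-gpu=utilization.gpu \
             --format=csv,noheader,nounits | awk '$1 <= 1 {n++} END {print n+0}')
      # elim protocol: <number of resources> <name1> <value1> ... on one line
      echo "1 gpu $idle"
      sleep 15
  done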

Sidebar: Running both a static and a dynamic resource is prone to causing clashes, since the dynamic monitor is “aware” of static GPU allocations (it sees their utilization) but not vice versa.

Another problem arose due to workflow issues in certain labs. Imagine a job that runs for two days. After the job is dispatched and started, the application starts using the allocated GPU. The application ramps the utilization of the GPU to 99% for 4 hours. Then the GPU side of things is done and results are written to disk. For the next 4 hours the job launches applications on the allocated CPU and processes the results. Then back to GPU, back to CPU, and so on until the job is done. What this means is that when the job uses the CPU, the GPU utilization falls to 0%. Now the “elim” reports that GPU as available and the scheduler launches a pending GPU job on it. So when the first job attempts to cycle back to the GPU it crashes (the GPUs run with exclusive mode on and persistence enabled).
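For reference, the crash follows from how the GPU nodes are set up, roughly like this (illustrative invocation; the exact device targeting may differ):

  nvidia-smi -pm 1                   # persistence mode on
  nvidia-smi -c EXCLUSIVE_PROCESS    # one process per GPU; a second process
                                     # attaching to the same device will fail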

Solutions to Problems

Somehow, whether using a static or a dynamic resource, the physical GPU device ID allocated needs to be “remembered”. One way might be for the wrapper scripts to remove the allocated GPU device ID from a config file when CUDA_VISIBLE_DEVICES is set. The “elim” can then report only non-allocated devices as idle. The scheduler's post_exec script could add that device ID back into the config file when the job finishes.
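A rough sketch of that bookkeeping, assuming a hypothetical per-host file /usr/local/etc/gpus.avail that lists the idle device IDs one per line (initially 0 through 3 on a 4-GPU node):

  #!/bin/bash
  # Wrapper side (sketch): claim the first idle device and hide the rest.
  AVAIL=/usr/local/etc/gpus.avail          # hypothetical bookkeeping file
  gpu=$(head -n 1 "$AVAIL")
  grep -v "^${gpu}$" "$AVAIL" > "${AVAIL}.tmp" && mv "${AVAIL}.tmp" "$AVAIL"
  export CUDA_VISIBLE_DEVICES=$gpu
  # ... launch the actual application here ...

  # post_exec side (sketch): the scheduler's post_exec script returns the ID,
  #   echo "$CUDA_VISIBLE_DEVICES" >> /usr/local/etc/gpus.avail
  # and the elim only reports devices still listed in the file as idle.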

Another solution would be to break the cycling CPU/GPU job into multiple jobs. The first CPU cycle cannot start until the first GPU cycle is done, and so on. That can be done with the done(job_ID|“jobname”) or ended(job_ID|“jobname”) dependency conditions of bsub, as in the example below.
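For example, with hypothetical job names and scripts, chained via -w dependency conditions:

  bsub -J gpu_step1 ./run_gpu_part1.sh
  bsub -J cpu_step1 -w 'done("gpu_step1")' ./run_cpu_part1.sh
  bsub -J gpu_step2 -w 'done("cpu_step1")' ./run_gpu_part2.sh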


Back
