Another problem arose due to workflow issues in certain labs. Imagine a job that runs for two days. After the job is dispatched and started, the application starts using the allocated GPU. The application ramps the utilization of the GPU to 99% for 4 hours. Then the GPU side of things is done and the results are written to disk. For the next 4 hours the job launches applications on the allocated CPU and processes the results. Then back to GPU, back to CPU, and so on until the job is done. What this means is that when the job uses the CPU, the GPU utilization falls to 0%. Now the "elim" reports that GPU as available and launches a pending GPU job. So when the first job attempts to cycle back to the GPU it crashes (the GPUs run with exclusive compute mode on and persistence enabled).
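The race can be sketched as follows — a minimal Python simulation (hypothetical, not the actual LSF elim code) of a utilization-only availability check colliding with a job that alternates GPU and CPU phases:

```python
# Hypothetical sketch of the race described above: a monitor that infers
# GPU availability from utilization alone will hand a busy GPU to a second
# job during the first job's CPU phase.

def elim_reports_free(gpu_utilization_percent):
    """Utilization-only check: 0% looks 'free' even if a job still owns the GPU."""
    return gpu_utilization_percent == 0

# Phases of the long-running job described above: GPU burst, then CPU work, repeat.
job_phases = ["gpu", "cpu", "gpu", "cpu"]

allocated_to_second_job = False
for phase in job_phases:
    utilization = 99 if phase == "gpu" else 0
    if elim_reports_free(utilization):
        # The scheduler dispatches a pending GPU job onto this "free" GPU...
        allocated_to_second_job = True
    elif allocated_to_second_job:
        # ...and the first job crashes when it cycles back
        # (compute mode exclusive, persistence enabled).
        print("first job crashes: GPU already claimed")
        break
```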
  
===== Solutions to Problems =====

Somehow, whether using a static or a dynamic resource, the physical GPU device ID allocated to a job needs to be "remembered".
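One way to express that idea — a minimal sketch (the names and structure are assumptions, not the site's actual implementation) of a tracker that remembers which physical device IDs each job holds and only frees them when the job finishes, regardless of momentary utilization:

```python
# Hypothetical sketch: track allocated physical GPU device IDs per job so a
# GPU idling at 0% utilization is still considered "taken" until its job ends.

class GpuTracker:
    def __init__(self, device_ids):
        self.free = set(device_ids)   # physical device IDs not held by any job
        self.held = {}                # job_id -> set of device IDs

    def allocate(self, job_id, count=1):
        if len(self.free) < count:
            return None               # job stays pending; no GPU is double-booked
        ids = {self.free.pop() for _ in range(count)}
        self.held[job_id] = ids
        return ids

    def release(self, job_id):
        # Only called when the job completes -- utilization never matters.
        self.free |= self.held.pop(job_id, set())

tracker = GpuTracker(device_ids=[0, 1])
first = tracker.allocate("job_A")
second = tracker.allocate("job_B")
third = tracker.allocate("job_C")     # None: both GPUs remembered as held
tracker.release("job_A")
fourth = tracker.allocate("job_C")    # now succeeds
```

With this approach, a job's GPU phase dropping to 0% utilization cannot cause a double allocation, because availability is derived from the remembered device IDs rather than from the load reported at sampling time.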
  
 \\ \\
**[[cluster:0|Back]]**
  
cluster/179.txt ยท Last modified: 2019/06/28 13:43 by hmeij07