cluster:179

Differences

This shows you the differences between two versions of the page.

cluster:179 [2019/06/28 09:40]
hmeij07 [GPU Allocation Problems]
cluster:179 [2019/06/28 09:43]
hmeij07 [GPU Allocation Problems]
Line 16:
 Sidebar: Running both a static and a dynamic resource is prone to cause clashes as the dynamic monitor is "aware" of static resource GPU allocations but not vice versa.
  
-Another problem arose due to workflow issues. Imagine a job that runs for two days. After the job is dispatched and started the application starts using the allocated GPU. The application ramps the utilization of the GPU up to 99% for 4 hours. Then the GPU side of things is done and results written to disk. For the next 4 hours the job launches applications on the allocated CPU and processes the results. Then back to GPU, back to CPU, back to etc...until the job is done. What this means is that when the job uses the CPU the GPU utilization rate falls to 0%. Now the "elim" reports that GPU as available and launches a pending GPU job and allocates it. So when the first job attempts to cycle back to the allocated GPU it crashes (GPUs are in mode exclusivity on and persistence enabled).
+Another problem arose due to workflow issues. Imagine a job that runs for two days. After the job is dispatched and started the application starts using the allocated GPU. The application ramps the utilization of the GPU up to 99% for 4 hours. Then the GPU side of things is done and results written to disk. For the next 4 hours the job continues launching applications on the allocated CPU core(s) and processes the results. Then back to GPU, back to CPU, back to etc...until the job is done. What this means is that when the job uses the CPU the GPU utilization rate falls to 0%. Now the "elim" reports that GPU as available and launches a new pending GPU job and allocates that same GPU. So when the first job attempts to cycle back to the allocated GPU it crashes (GPUs are in mode exclusivity on and persistence enabled on node ''n78'').
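The clash described above can be illustrated with a small sketch. It is not the site's actual "elim" script; the `Gpu` record, the `owner_job` field, and both function names are hypothetical and exist only to contrast a utilization-based availability check with one that also consults an allocation record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gpu:
    index: int
    utilization: int                 # percent, as a monitor might sample it
    owner_job: Optional[str] = None  # hypothetical allocation record


def free_by_utilization(gpus: List[Gpu]) -> List[int]:
    """Naive check: any GPU sampled at 0% utilization counts as available."""
    return [g.index for g in gpus if g.utilization == 0]


def free_by_allocation(gpus: List[Gpu]) -> List[int]:
    """Allocation-aware check: a GPU is available only if no job owns it."""
    return [g.index for g in gpus if g.owner_job is None]


# GPU 0 belongs to a job that is currently in its CPU phase (0% utilization).
gpus = [
    Gpu(0, 0, owner_job="job_A"),
    Gpu(1, 99, owner_job="job_B"),
    Gpu(2, 0),
]

naive = free_by_utilization(gpus)
safe = free_by_allocation(gpus)
print(naive)  # [0, 2]
print(safe)   # [2]
```

The naive check offers GPU 0 to a pending job even though job_A still holds it, which is the situation that crashes the first job when it cycles back to the GPU.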
  
 ===== Solutions Problems =====
cluster/179.txt · Last modified: 2019/06/28 09:43 by hmeij07