CPU vs GPU

The question was raised: what does our usage look like between CPU and GPU devices? I have no idea what the appropriate metrics would be, but let's start by comparing the hardware deployed. We'll also need to make some assumptions:

  • Data covers the period June 1 to June 25, 2018 (job information data ages out)
    • Maybe build a monthly script if this turns out to be usable info
  • That period covers 600 hours of time
  • Assume 99% utilization of a CPU core or GPU device
  • Available Hours is measured per physical CPU core but per GPU device (exclusivity and persistence modes on)
  • There is no good/bad metric
  • We have never collated such data before
  • GPU jobs are detected based on GPU resource reservations (gpu= flag)

| Metric        | CPU     | Ratio | GPU    | Notes (June 2018)                                   |
| Device Count  | 72      | 3:1   | 24     | cpu all intel, gpu all nvidia                       |
| Core Count    | 1,192   | 1:54  | 64,300 | physical only                                       |
| Memory        | 7,408   | 51:1  | 144    | GB                                                  |
| Teraflops     | 38      | 1.5:1 | 25     | double precision, floating point, theoretical       |
| Job Count     | 2,834   | 3:1   | 1,045  | processed jobs regardless of exit status            |
| Avail Hours   | 715,200 | 50:1  | 14,400 | total for cpu cores, total for gpus                 |
| Job Hours     | 221,136 | 77:1  | 2,872  | cumulative hours of consumed usage                  |
| Job Hours %   | 31      | 6:1   | 5      | as a percentage of available                        |
| Avail Hours2  | 561,600 | 39:1  | 14,400 | total cpu cores minus hp12's 256 cores, total gpus  |
| Job Hours2 %  | 39      | 8:1   | 5      | more realistic…hp12 rarely used                     |

The logs showing GPU %util confirm the extremely low GPU usage. When the four GPU %util values of a node are concatenated into a string, the string '0000' has occurred 10 million times out of 16 million observations since 01Jan2017 (GPUs are polled every 10 minutes). The surprisingly strong GPU job count is due to the Amber group launching lots of small GPU jobs.
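
The '0000' check can be sketched in a few lines of Python. This is a minimal illustration only: the function name `idle_fraction` and the input format (one tuple of four integer %util values per poll) are assumptions, not the actual log format or script used on the cluster. For reference, 10 million out of 16 million observations is about 62% fully idle polls.

```python
from collections import Counter

def idle_fraction(polls):
    """Return the fraction of polls where all GPUs on a node report 0 %util.

    polls: iterable of sequences of integer %util values, one sequence
    per poll (hypothetical format; the real logs may look different).
    """
    # Concatenate the per-GPU values of each poll into one string, e.g. '0000'.
    keys = Counter("".join(str(util) for util in poll) for poll in polls)
    total = sum(keys.values())
    return keys["0000"] / total if total else 0.0

# Made-up sample: three of four polls show a fully idle node.
sample = [(0, 0, 0, 0), (0, 0, 0, 0), (97, 0, 0, 0), (0, 0, 0, 0)]
print(f"{idle_fraction(sample):.0%} of polls fully idle")
```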

So were these 25 days in June 2018 an oddity?

Total Monthly CPU+GPU Hours

| Jul17   | Aug17   | Sep17   | Oct17   | Nov17   | Dec17  | Jan18   | Feb18   | Mar18   | Apr18   | May18   |
| 313,303 | 273,051 | 128,390 | 111,224 | 280,101 | 51,727 | 306,453 | 222,585 | 437,959 | 262,227 | 294,724 |

March is Honors' Theses time, so let's look at Jul17 (no GTX GPUs) so that we can compare it to Jul18 in August. The 31 days in July are 744 hours.

| Metric        | CPU     | Ratio | GPU    | Notes (July 2017)                                           |
| Device Count  | 72      | 4:1   | 20     | cpu all intel, gpu all nvidia                               |
| Core Count    | 1,192   | 1:42  | 50,000 | physical only                                               |
| Memory        | 7,408   | 74:1  | 100    | GB                                                          |
| Teraflops     | 38      | 1.7:1 | 23     | double precision, floating point, theoretical               |
| Job Count     | 12,798  | 18:1  | 722    | processed jobs regardless of exit status                    |
| Avail Hours   | 886,848 | 60:1  | 14,880 | total cpu cores, total gpus                                 |
| Job Hours     | 260,997 | 69:1  | 3,805  | cumulative hours of consumed usage                          |
| Job Hours %   | 30      | 1:1   | 26     | as a percentage of available                                |
| Avail Hours2  | 696,384 | 47:1  | 14,880 | total for cpu cores minus hp12's 256 cores, total for gpus  |
| Job Hours2 %  | 37      | 1.5:1 | 26     | more realistic…hp12 rarely used                             |

  • There is some noise in this data: the start and end of a job could not be matched for ~15% of records
  • The assumption that hp12 was barely used in July 2017 might not be correct
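
The Avail Hours and Job Hours % rows for July 2017 follow from simple arithmetic, sketched below. The helper names `available_hours` and `percent_used` are hypothetical; the inputs are the core, GPU, and job-hour figures from the table, and the sketch uses the full 744 hours per unit without applying the 99% utilization assumption.

```python
HOURS_JUL17 = 31 * 24      # 744 hours in July
CPU_CORES = 1192           # physical cores
HP12_CORES = 256           # hp12's cores, rarely used
GPUS = 20                  # GPU devices in July 2017

def available_hours(units, hours):
    """Hours on offer: one unit (CPU core or GPU device) per wall-clock hour."""
    return units * hours

def percent_used(job_hours, avail_hours):
    """Consumed job hours as a percentage of available hours."""
    return 100.0 * job_hours / avail_hours

cpu_avail = available_hours(CPU_CORES, HOURS_JUL17)                # 886,848
cpu_avail2 = available_hours(CPU_CORES - HP12_CORES, HOURS_JUL17)  # 696,384
gpu_avail = available_hours(GPUS, HOURS_JUL17)                     # 14,880

print(f"CPU  {percent_used(260_997, cpu_avail):.1f}% of {cpu_avail:,} hours")
print(f"CPU2 {percent_used(260_997, cpu_avail2):.1f}% of {cpu_avail2:,} hours")
print(f"GPU  {percent_used(3_805, gpu_avail):.1f}% of {gpu_avail:,} hours")
```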

Based on Jul17, we process about 60-70 times more CPU Job Hours than GPU Job Hours, which seems consistent with Jun18. For the metric of Job Hours consumed as a percentage of Available Hours, the picture is probably more like Jul17: 30-40% of CPU cycles and about 25% of GPU cycles are consumed.

If we take the total hours consumed from the Usage Report (the 313,303 hours for Jul17), we consumed about 45% of available hours (without hp12 in the mix).
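
A monthly version of this calculation could look like the sketch below. It is only an illustration of the idea: `hours_in_month` and `percent_of_available` are hypothetical helpers, and the 936 cores (1,192 minus hp12's 256) and the 313,303 hours are taken from the figures above.

```python
import calendar

def hours_in_month(year, month):
    """Wall-clock hours in a calendar month (days in month times 24)."""
    days = calendar.monthrange(year, month)[1]
    return days * 24

def percent_of_available(consumed_hours, cores, year, month):
    """Consumed hours as a percentage of cores * hours in the month."""
    return 100.0 * consumed_hours / (cores * hours_in_month(year, month))

# Jul17 total from the Usage Report, against 1,192 - 256 = 936 cores.
jul17 = percent_of_available(313_303, 1192 - 256, 2017, 7)
print(f"Jul17: {jul17:.0f}% of available hours")
```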

We shall wait for Jul18 metrics.

July 2018

| Metric        | CPU     | Ratio  | GPU    | Notes (July 2018)                              |
| Device Count  | 74      | 3:1    | 24     | cpu all intel, gpu all nvidia                  |
| Core Count    | 1,208   | 1:53   | 64,336 | physical only                                  |
| Memory        | 7,516   | 52:1   | 144    | GB                                             |
| Teraflops     | 38      | 1.5:1  | 25     | double precision, floating point, theoretical  |
| Job Count     | 12,798  | 18:1   | 722    | processed jobs regardless of exit status       |
| Avail Hours   | 898,752 | 50:1   | 17,856 | total cpu cores, total gpus                    |
| Job Hours     | 322,207 | 1732:1 | 186    | cumulative hours of consumed usage             |
| Job Hours %   | 36      | 36:1   | 1      | as a percentage of available                   |

The utilization string '0000' means that all GPUs on a single node were idle at a given poll (polls run every 10 minutes). Based on that string for July 2018, the GTX GPUs were completely idle 65% of the time and the K20s were completely idle 61% of the time.

cluster/167.txt · Last modified: 2018/08/01 14:11 by hmeij07