  
===== Structure and History of HPCC =====

As promised at the CLAC HPC Mindshare event at Swarthmore College, Jan 2020, here are the Funding and Priority Policies with some context around them. Questions/comments welcome.
  
==== History ====
  
In 2006, 4 Wesleyan faculty members approached ITS with a proposal to centrally manage a high performance computing center (HPCC), seeding the effort with an NSF grant (about $190K, two racks full of Dell PE1950, a total of 256 physical cpu cores on Infiniband). ITS offered 0.5 FTE for a dedicated "hpcadmin". An Advisory Group was formed by these faculty plus the hpcadmin (5 members, not necessarily our current "power users"). Another NSF grant award was added in 2010 (about $105K). An alumni donation followed in 2016 (about $10K). In 2018 the first instance of "faculty startup monies" was contributed to the HPCC (about $97.4K, see "Priority Policy" below). In 2019, a TrueNAS/ZFS appliance was purchased (about $40K, [[cluster:186|Home Dir Server]]), followed in 2020 by a GPU expansion project (about $96K, [[cluster:181|2019 GPU Models]]). The latter two were self-funded expenditures, see "Funding Policy" below. To view the NSF grants visit [[cluster:169|Acknowledgement]].
  
The Advisory Group meets with the user base yearly during the reading week of the Spring semester (early May), before everybody scatters for the summer. At this meeting the hpcadmin reviews the past year and previews the coming year, and the user base contributes feedback on progress and problems.
  
==== Structure ====
  
The Wesleyan HPCC is part of the **Scientific Computing and Informatics Center** ([[https://www.wesleyan.edu/scic/| SCIC ]]). The SCIC project leader is appointed by the Director of the **Quantitative Analysis Center** ([[https://www.wesleyan.edu/qac/| QAC ]]). The Director of the QAC reports directly to the Associate Provost. The hpcadmin reports directly to the ITS Deputy Director and indirectly to the QAC Director.
  
The QAC has an [[https://www.wesleyan.edu/qac/apprenticeship/index.html|Apprenticeship]] Program in which students are trained in Linux and several programming languages of their choice, along with other options (like SQL or GIS). From this pool of students, the hope is that some become QAC and SCIC help desk staff and tutors.
  
==== Funding Policy ====
  
After an 8 year run of the HPCC, and a drying up of grant opportunities at NSF, it was decided to explore self-funding so the HPCC effort could continue without external dependence on funds. A report was compiled of the HPCC's progress, covering topics such as Publications, Citations, Honors Theses, Growth in Jobs Submitted, Pattern of Pending Jobs, and General Inventory. The report summary can be viewed at this page: [[cluster:130| Provost Report ]]. This report was discussed at a meeting between the Provost, the Associate Provost, the Director of Finances and the HPCC Advisory Group.
  
Several months later a pattern emerged. The Provost would annually contribute $25K **//if//** the HPCC user base raised $15K annually in contributions. These funds would "roll over". That would amount to $160K in 4 years, enough for a hardware refresh or new hardware acquisition. Finances also contributed $10K annually for maintenance such as failed disks, network switches, etc., but these funds do not "roll over". Use it or lose it. All fund cycles restart July 1st.
  
In order for the HPCC user base to raise $15K annually, CPU and GPU hourly usage monitoring was deployed (using scripts parsing the ''lsb.acct'' file). A dictionary is maintained listing PIs with their associated members (student majors, lab students, grads, PhD candidates, collaborators, etc). Each PI then contributes quarterly to the user fund based on a scheme designed to yield $15K annually.
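
Below is a minimal sketch of the kind of accounting script described above, written against a simplified view of LSF's ''lsb.acct'' JOB_FINISH records. The field positions, the file path, and the ''pi_members'' dictionary are illustrative assumptions only; consult the lsb.acct documentation for your LSF version for the real record layout.

<code python>
# Sketch: tally cpu hours per PI group from LSF's lsb.acct accounting file.
# Field positions below are ASSUMED for illustration -- verify them against
# your LSF version's lsb.acct JOB_FINISH record layout before use.
import csv
from collections import defaultdict

# Hypothetical PI dictionary: PI -> member usernames (maintained by hand).
pi_members = {
    "pi_smith": ["jdoe", "asmith"],
    "pi_jones": ["bjones"],
}
user_to_pi = {u: pi for pi, members in pi_members.items() for u in members}

cpu_hours = defaultdict(float)

with open("/opt/lsf/work/cluster/logdir/lsb.acct") as fh:   # path is an example
    for row in csv.reader(fh, delimiter=" ", quotechar='"'):
        if not row or row[0] != "JOB_FINISH":
            continue
        finish = int(row[2])        # event (finish) time -- assumed position
        slots  = int(row[6])        # number of slots     -- assumed position
        start  = int(row[10])       # start time          -- assumed position
        user   = row[11]            # user name           -- assumed position
        if start > 0:               # job actually ran
            cpu_hours[user_to_pi.get(user, "other")] += (finish - start) / 3600.0 * slots

for pi, hours in sorted(cpu_hours.items()):
    print(f"{pi:12s} {hours:12.1f} cpu hours")
</code>

GPU hours would be tallied the same way from whatever records the site keeps of GPU allocations, then weighted as described in the contribution scheme below.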
  
Here is the queue usage for 2019, [[cluster:188|2019 Queue Usage]], and below is the 2019 contribution scheme.
  
Contribution Scheme for 01 July 2019 onwards\\
A gpu hour of usage is 3x the cpu hourly rate.\\
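
As a purely illustrative example of how that weighting folds into a quarterly charge (the dollar rate below is hypothetical; the real rates are defined by the contribution scheme itself):

<code python>
# Hypothetical illustration of the "gpu hour = 3x cpu hour" weighting.
# The $0.01/hour rate is made up; the real rates come from the scheme above.
CPU_RATE = 0.01          # dollars per cpu hour (hypothetical)
GPU_MULTIPLIER = 3       # a gpu hour counts as 3 cpu hours

def quarterly_charge(cpu_hours, gpu_hours, rate=CPU_RATE):
    """Charge for one quarter of usage under the weighting above."""
    return (cpu_hours + GPU_MULTIPLIER * gpu_hours) * rate

# e.g. 50,000 cpu hours plus 2,000 gpu hours in a quarter:
print(quarterly_charge(50_000, 2_000))   # -> 560.0
</code>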
  
We currently have about 1,450 physical cpu cores (all Xeon), 72 gpus (20x K20, 4x GTX1080Ti, 48x RTX2080S), 520 gb of gpu memory and 8,560 gb of cpu memory, provided by about 120 compute and login nodes. Scratch spaces are provided local to compute nodes (2-5 tb) or over the network via NFS (55 tb), consult [[cluster:142|Scratch Spaces]]. Home directories are under quota (10 tb) but these quotas will disappear in the future with the TrueNAS/ZFS appliance (190 tb, 475 tb effective assuming a compression rate of 2.5x, consult [[cluster:186|Home Dir Server]], deployed in 2020). An HPCC guide can be found here [[cluster:126|Brief Guide to HPCC]] and the (endless!) software list is located here [[cluster:73|Software Page]]. We run CentOS 6.10 or 7.[6|7] flavors of the OS.
  
  
==== Priority Policy ====
  
This policy was put in place about 3 years ago (2017) to deal with the issues surrounding new infusions of monies from, for example, new faculty "startup monies", new grant monies (NSF, NIH, DoD, others), or donations made to the HPCC for a specific purpose (as in GTX gpus for Amber). All users have the same priority. All queues have the same priority (except the "test" queue, which has the highest priority). Scheduler policy is FIFO. There is no "wall time" on any queue.
  
There are a few principles in this Priority Access Policy:
  
  - Contributions, of any kind and from any source, immediately become a community wide resource.
  - Priority access is granted for 3 years starting at the date of deployment (user access).
  - Only applies to newly purchased resources, which should be under warranty during the priority period.

**The main objective is to build an HPCC community resource for all users with no (permanent) special treatment of any subgroup.**
  
The first principle implies that all users have access to the new resource(s) immediately when deployed. Root privilege is for the hpcadmin only; sudo privilege may be used if/when necessary to achieve some purpose. The hpcadmin will maintain the new resource(s), while configuration of the new resource(s) will be done by consent of all parties involved. Final approval by the Advisory Group initiates deployment activities.

The second principle grants priority access to certain resource(s) for a limited time to a limited group. The same PI/users relationship will be used as in the CPU/GPU Usage Contribution scheme. Priority access specifically means: if, during the priority period, a priority member's jobs are pending for more than 24 hours, the hpcadmin will clear compute nodes of running jobs and force those pending jobs to run. This is by now an automated process via cron that checks every 2 hours. The steps involved are: find priority members' jobs pending for more than 24 hours, find a node in that queue with no priority members' jobs running on it, close the target node, requeue all jobs on that node, force the pending job(s) to run, wait 5 minutes, reopen the node.
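
The following is a rough sketch of what such a cron-driven check could look like, using standard LSF commands (''bjobs'', ''badmin'', ''brequeue'', ''brun'') from Python. The user list, queue name, node choice, and the pending-time test are simplified placeholders, not the actual production script.

<code python>
# Sketch of the 2-hourly priority check described above (simplified).
# Assumes LSF command line tools are on PATH; names below are placeholders.
import subprocess
import time

PRIORITY_USERS = ["jdoe", "asmith"]   # the priority PI's members (placeholder)
QUEUE = "exx96"                       # queue holding the priority nodes (placeholder)

def lsf(*args):
    """Run an LSF command and return its stdout (errors ignored in this sketch)."""
    return subprocess.run(args, capture_output=True, text=True).stdout

def long_pending_jobs():
    """Job ids of priority members' jobs pending too long (pending-time test stubbed)."""
    jobs = []
    for user in PRIORITY_USERS:
        out = lsf("bjobs", "-p", "-q", QUEUE, "-u", user)
        # A real script would parse submit times and keep only jobs pending > 24 hours.
        jobs += [line.split()[0] for line in out.splitlines()[1:] if line.strip()]
    return jobs

def bump(jobid, node):
    """Clear one node of running jobs and force the priority job onto it."""
    lsf("badmin", "hclose", node)              # close node to new dispatch
    out = lsf("bjobs", "-r", "-m", node)       # jobs currently running on that node
    for line in out.splitlines()[1:]:
        if line.strip():
            lsf("brequeue", line.split()[0])   # requeue each running job
    lsf("brun", "-m", node, jobid)             # force the pending priority job to run
    time.sleep(300)                            # wait 5 minutes
    lsf("badmin", "hopen", node)               # reopen the node

if __name__ == "__main__":
    for jobid in long_pending_jobs():
        bump(jobid, node="n37")                # real script picks a suitable node
</code>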

All users should be aware this may happen, so please checkpoint your jobs with a checkpoint interval of 24 hours. Please consult [[cluster:190|DMTCP]].
  
==== General ====
  
There are 557 lines in ''/etc/passwd'' at this writing. Assume 25 are system accounts, 25 are collaboration accounts (hpc01-hpc25, can VPN, for non-Wesleyan faculty/PIs, AD accounts), and 100 are temporary/recyclable class accounts (hpc100-hpc200, cannot VPN, local accounts), which then leaves a lifetime user base of roughly 400 user accounts. Of those, which come and go, 2 to 2 dozen users may be logged in at any time.
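
A small illustration of that arithmetic, bucketing ''/etc/passwd'' entries by the naming conventions mentioned above (the UID cutoff used for system accounts and the exact patterns are assumptions for the sketch):

<code python>
# Rough bucketing of /etc/passwd entries using the conventions above.
# The UID < 1000 test for system accounts is an assumption, and the
# hpcNN / hpcNNN patterns only approximate the hpc01-hpc25 / hpc100-hpc200 ranges.
import re
from collections import Counter

counts = Counter()
with open("/etc/passwd") as fh:
    for line in fh:
        name, _, uid, *_ = line.split(":")
        if int(uid) < 1000:
            counts["system"] += 1
        elif re.fullmatch(r"hpc\d{2}", name):    # hpc01-hpc25 collaboration accounts
            counts["collaboration"] += 1
        elif re.fullmatch(r"hpc\d{3}", name):    # hpc100-hpc200 class accounts
            counts["class"] += 1
        else:
            counts["user"] += 1                  # the ~400 lifetime user accounts

print(counts)
</code>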
  
Rstore is a platform for storing static research data. The hope is to move static data off the HPCC and mount it read-only back onto the HPCC login nodes. 440 tb, fully replicated, is provided for this purpose (Supermicro storage boxes using Rsync as the replication engine), for HPCC users and other Wesleyan groups.
  
 +The Data Center has recently been renovated so the HPCC has no more cooling problems (It used to be in the event of a cooling tower failure, within 3 hours the HPCC would push temps above 85F). No more. We have sufficient rack space (5) and power for expansion. For details on that "live renovation" process visit [[cluster:178|Data Center Renovation]]. It turned out the HPCC was consuming 1/3rd of all electric and cooling capacities. Go HPCC.
  
  