==== History ====
  
In 2006, 4 Wesleyan faculty members approached ITS with a proposal to centrally manage a high performance computing center (HPCC), seeding the effort with an NSF grant (about $190K, two racks full of Dell PE1950, a total of 256 physical cpu cores on Infiniband). ITS offered 0.5 FTE for a dedicated "hpcadmin". An Advisory Group was formed by these faculty plus the hpcadmin (5 members, not necessarily our current "power users"). Another NSF grant award was added in 2010 (about $105K). An alumni donation followed in 2016 (about $10K). In 2018 the first instance of "faculty startup monies" was contributed to the HPCC (about $92K, see "Priority Policy" below). In 2019, a TrueNAS/ZFS appliance was purchased (about $40K, [[cluster:186|Home Dir Server]]), followed in 2020 by a GPU expansion project (about $96K, [[cluster:181|2019 GPU Models]]). The latter two were self-funded expenditures; see "Funding Policy" below. To view the NSF grants, visit [[cluster:169|Acknowledgement]].
  
The Advisory Group meets with the user base yearly during the reading week of the Spring semester (early May), before everybody scatters for the summer. At this meeting, the hpcadmin reviews the past year and previews the coming year, and the user base contributes feedback on progress and problems.

==== Priority Policy ====
  
This policy was put in place about 3 years ago to deal with the issues surrounding infusions of new monies, for example new faculty "startup monies", new grant monies (NSF, NIH, DoD, others), or donations made to the HPCC for a specific purpose (as in GTX GPUs for Amber).
  
There are a few principles in this Priority Access Policy:
  
  - Contributions, of any kind and from any source, immediately become a community wide resource.
  - Priority access is granted for 3 years starting at the date of deployment (user access).
  - Priority access only applies to newly purchased resources, which should be under warranty during the priority period.

**The main objective is to build an HPCC community resource for all users with no (permanent) special treatment of any subgroup.**
  
The first principle implies that all users have access to the new resource(s) immediately when deployed. Root privilege is for the hpcadmin only; sudo privilege may be used if/when necessary to achieve some purpose. The hpcadmin will maintain the new resource(s), while configuration of the new resource(s) will be done by consent of all parties involved. Final approval by the Advisory Group initiates deployment activities.
  
The second principle grants priority access to certain resource(s) for a limited time to a limited group. The same PI/users relationship will be used as is used in the CPU/GPU Usage Contribution scheme. Priority access specifically means: if, during the priority period, the priority members' jobs go into pending mode for more than 24 hours, the hpcadmin will clear compute nodes of running jobs and force those pending jobs to run. This is by now an automated process via cron that checks every 2 hours.
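
Purely as an illustration of the policy above, below is a minimal sketch of what such a cron-driven check might look like, assuming an OpenLava/LSF-style scheduler (''bjobs''/''bkill''/''brun''). The user names, the pending-time lookup, and the choice of running jobs to clear are hypothetical placeholders, not the actual script running on the cluster.

<code python>
#!/usr/bin/env python3
"""Sketch of the 2-hourly priority check described above.

Assumes an OpenLava/LSF-style scheduler. PRIORITY_USERS, pending_hours()
and select_victims() stand in for site-specific pieces.
"""

import subprocess

PRIORITY_USERS = {"pi_user1", "pi_user2"}   # hypothetical priority group members
PENDING_LIMIT_HOURS = 24                    # policy: act after 24 hours pending


def pending_hours(jobid):
    """How long a job has been pending, in hours.

    Placeholder: a real script would parse the submit time from
    'bjobs -l <jobid>' or from site accounting records.
    """
    return 0.0


def select_victims(jobid):
    """Running job IDs to clear so the pending job <jobid> can start.

    Placeholder: which hosts, queues, and slots to free is site policy.
    """
    return []


def main():
    # 'bjobs -u all -p' lists pending jobs for all users; parsing here is
    # deliberately loose (pending-reason lines fall through the user check).
    out = subprocess.run(["bjobs", "-u", "all", "-p"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[1:]:       # skip the header line
        fields = line.split()
        if len(fields) < 2 or fields[1] not in PRIORITY_USERS:
            continue
        jobid = fields[0]
        if pending_hours(jobid) <= PENDING_LIMIT_HOURS:
            continue
        for victim in select_victims(jobid):
            subprocess.run(["bkill", victim])   # checkpointed jobs can resume later
        # Force the priority job to run; exact brun options (e.g. -m <host>)
        # vary by scheduler version, so treat this call as schematic.
        subprocess.run(["brun", jobid])


if __name__ == "__main__":
    main()
</code>

A crontab entry such as ''0 */2 * * *'' would run the check every 2 hours, matching the interval mentioned above.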
  
All users should be aware this may happen, so please checkpoint your jobs with a checkpoint interval of 24 hours. Please consult [[cluster:147|BLCR Checkpoint in OL3]] (serial jobs) and [[cluster:148|BLCR Checkpoint in OL3]] (parallel jobs).