DokuWiki - cluster

0

Anonymous (anonymous@undisclosed.example.com) — 2026-02-07T20:55:56+00:00

High Performance Compute Cluster At Wesleyan University, Middletown, CT * HPCC Funding Acknowledgements * the Brief User Guide & Introduction, continuously updated ... * the Structure, History, Funding & Priority Policies * the endless Software list ... (CentOS 5&6&7) * the OpenHPC Software list ... (Rocky 8) this will be a chronological archive of our progress

1

Anonymous (anonymous@undisclosed.example.com) — 2006-12-13T19:22:42+00:00

Home Some details about the proposed cluster * The cluster will be initially funded by an NSF grant. * The cluster is expected to generate large savings for Wesleyan over its lifetime. The University Cluster will reduce or eliminate the need for new individual faculty clusters.

2

Anonymous (anonymous@undisclosed.example.com) — 2007-05-15T17:12:27+00:00

Home Cluster Steering Committee (12/1/2006) * Eric Aaron * David Beveridge * Tsampikos Kottos * George Petersson * Francis Starr Home

3

Anonymous (anonymous@undisclosed.example.com) — 2017-02-14T19:07:18+00:00

Home Usage Survey (circa early Nov 2006) Brief synopsis of emerging themes * some commercial software will have to be bought outside of the grant: Matlab, Linda and Portland compilers * most current code is “coarse grain” parallel (meaning split a big problem into tiny pieces) rather than

4

Anonymous (anonymous@undisclosed.example.com) — 2006-12-07T19:10:25+00:00

Home Cluster Configuration In general, the cluster configuration should remain as “standard” as possible; this is not to be a bleeding edge venture. Some general information gleaned from our anticipated user base: * Software: mostly home grown code, C/C++ and Fortran programs (probably a Portland suite of compilers), Matlab (and MatlabMPI), AMBER, NAMD, CHARMM, and Gaussian.

5

Anonymous (anonymous@undisclosed.example.com) — 2006-12-07T19:08:21+00:00

Home Description of Quotes (Nov-Dec 2006) A final round of quoting is underway now (about 8 months later) to settle on a vendor and configuration. Because of the time that's passed, all the chipsets have changed, so it's essentially a new round, rather than some type of clarification. Dell has offered a quote but it came in too high, and they agreed to offer a revision. Sun's quote is due today (12/1/06) and may come very close to the target price (roughly $190,000). We have received a quote…

6

Anonymous (anonymous@undisclosed.example.com) — 2006-12-07T19:11:13+00:00

Home Academic UNIX Support Specialist Job Description JOB TITLE: Academic UNIX Support Specialist ---------- DEPARTMENT: Information Technology Services ---------- GRADE: ? ---------- RANGE: $? ---------- TYPICAL DUTIES: Reporting to the Assistant Director of Technology Support Services with management support from the Manager of Academic Computing for Natural Sciences and Mathematics, the Academic UNIX Support Specialist will provide systems support to the user community.

7

Anonymous (anonymous@undisclosed.example.com) — 2019-12-13T13:43:22+00:00

Home TITLE OF PROPOSED PROJECT “Acquisition of Cluster Computing Facilities for Research and Education at Wesleyan University” NATIONAL SCIENCE FOUNDATION PROPOSAL AUTHORS * Francis W Starr * David L Beveridge * Kathryn V Johnston PROJECT SUMMARY

8

Anonymous (anonymous@undisclosed.example.com) — 2006-12-19T19:59:53+00:00

Home Description of Quotes (Dec 06) vendor #1, Quote 1. #1: 40 node HPCC, $188,700. Nodes: a total of 40 nodes, 80 cpus, 160 cores, 48 port switch. 36 Xeon, 2*(Dual Core 5148LV 2.3 GHz), 2 GB@667MHz, dual 80 GB 7.2K RPM SATA, light, 144 cores. 04 Xeon, 2*(Dual Core 5160, 3.0 GHz), 16 GB@667MHz

9

Anonymous (anonymous@undisclosed.example.com) — 2006-12-07T19:12:35+00:00

Home Integer & Floating Benchmarks pulled from the SPEC2000 Results Page ... Integer and Floating Point processing benchmarks. * “speed” is relatively measured with the “results” column * “throughput” is relatively measured with the

10

Anonymous (anonymous@undisclosed.example.com) — 2006-12-07T21:39:41+00:00

Home Target config lightweight nodes (40 3.0 ghz or 60 2.0 ghz) (160 or 240 cores) * 2gb RAM * single HD * two dual-core processors (high efficiency for opteron) 4 heavyweight nodes (32 cores) * 2 with 16gb RAM 2 with 32gb RAM * single HD

11

Anonymous (anonymous@undisclosed.example.com) — 2007-02-20T14:50:56+00:00

Home What is ROCKS or Platform/ROCKS? (no idea what the acronym, if any, stands for) ROCKS vs. Platform Rocks: ROCKS is an open-source software stack that enables the consistent delivery of scale-out application clusters. Platform Open Cluster Stack (OCS) is a pre-integrated, vendor certified, software stack that enables the consistent delivery of scale-out application clusters using ROCKS

12

Anonymous (anonymous@undisclosed.example.com) — 2007-10-18T15:18:04+00:00

Home HPC: Will it fly? The attendance at the UUG meetings so far has been very disappointing. It's apparent that the V charisma was a horizontal penetration of the market place so to speak with little vertical leverage. The sun has set beyond the horizon, alumnification has set in. Now what? We need departmental admins to show up and others in charge of maintaining unix machines on campus. We have a solid group of ITS folks showing up so this could be a real resource for those attending.…

13

Anonymous (anonymous@undisclosed.example.com) — 2006-12-19T21:03:14+00:00

Home Down to 2 Vendors (Dec 06) vendor #1, Quote 1. #1: 64 node HPCC, $196,700. Nodes: a total of 64 nodes, 128 cpus, 272 cores. 60 Xeon, 2*(Dual Core 5130 2.0 GHz), 2 GB@667MHz, dual 80 GB 7.2K RPM SATA, light, 240 cores. 04 Xeon, 2*(Quad Core 5355, 2.6 GHz), 16 GB@667MHz

14

Anonymous (anonymous@undisclosed.example.com) — 2006-12-23T18:57:51+00:00

Home Final Configuration & Quote (yea!) At Last ... in time for Christmas Lights? * via Platform/Rocks technical support: PBS/LSF currently can manage jobs at the core unit for dual core processors. It is expected that in Q1 of 2007, PBS/LSF will support scheduling jobs at the core unit for quad core processors.

15

Anonymous (anonymous@undisclosed.example.com) — 2006-12-20T14:19:04+00:00

Home Scali/Manage * Like Platform/ROCKS (see link), Scali/Manage is a software suite of tools to manage clusters. It appears very, very versatile. Lots of stuff you can do but what attracted my interest in my brief perusals were: * heterogeneous clusters (as in, manage the other clusters on campus

16

Anonymous (anonymous@undisclosed.example.com) — 2007-01-02T21:23:17+00:00

Home Useful Documentation Platform Rocks & Dell * Platform Rocks: A Cluster Software Package for Dell HPC Platforms: Administrators can employ cluster solution packages such as Platform Rocks to help deploy, maintain, and manage high-performance computing (HPC) clusters * Workload Management and Job Scheduling on Platform Rocks Clusters: Platform Lava, a free and fully functional entry-level workload manager for Platform Rocks, is becoming popular in the high-performance computing community

17

Anonymous (anonymous@undisclosed.example.com) — 2007-01-10T14:43:22+00:00

Home Webcast Demo of Platform/OCS by Platform Computing William DeSalvo, from Platform Computing, did a webcast presentation about Platform/OCS ... the administrative software layer of our cluster design. Several documents were obtained detailing administrative aspects of the Platform/OCS software stack (see below).

18

Anonymous (anonymous@undisclosed.example.com) — 2007-04-10T13:51:46+00:00

Home Thoughts on Cluster Network / Future Growth As i'm working my way through some of the Platform/OCS documentation provided, some thoughts came up that i want to keep track of. This page is not intended to detail what the cluster's final configuration will look like, but could act as a guideline. So first, the big picture. Drawn in black is the cluster as ordered, drawn in green the connections if additional switches were bought, which leads to the red drawings, the additional nodes that c…

19

Anonymous (anonymous@undisclosed.example.com) — 2007-04-10T13:54:16+00:00

Home HPCC 36 Node Design Conference with Dell Questions/Issues After the conference, with the erratic behavior of our freight elevator enlightening everybody, i think we have but a few questions to work on: Question/Issues and Answer: Shall we use the 2nd disk in the compute nodes as /localscratch?

20

Anonymous (anonymous@undisclosed.example.com) — 2007-01-31T14:31:23+00:00

Home Design Issues These topics flowed out of our Design Conference with Dell, read about that on this page. 2nd disks As Detailed In Cluster Quote: Hard Drive: 80GB, SATA, 3.5-inch 7.2K RPM Hard Drive. Additional Storage Products: 80GB, SATA, 3.5-inch 7.2K RPM Hard Drive * Shall we use the 2nd disk in the compute nodes as /localscratch?

21

Anonymous (anonymous@undisclosed.example.com) — 2007-06-28T19:08:03+00:00

Home SAN File Systems The idea of managing very large file systems has certain implications. For example * a single point of failure (if part of the file system goes corrupt, does the entire file system go offline?) * fsck may take an excessive amount of time (one reference i found was 1 hour/TB for a clean file system; other references we've seen cite days for 1 TB, in reference to a mail spool)

22

Anonymous (anonymous@undisclosed.example.com) — 2007-02-02T16:35:32+00:00

Home Cluster Arrival Day ... 02/01/2007 Today: 5th floor. freight elevator out. project $ave or something. where am i? 5 floors of stairs, that was hard. where are these characters? it's empty? oh,oh. it is! somebody stole our cluster? 1,500 lbs, 288 cores. this is going to hurt?

23

Anonymous (anonymous@undisclosed.example.com) — 2017-10-03T13:34:32+00:00

Home Towards Deployment * I think i will close this page. The one major outstanding issue standing in the way of declaring ourselves “in production” mode is a serious backup policy. I'm snapshotting via the NetApp filers but need Tivoli backups of home directories. So let's close this page. Once you see the

24

Anonymous (anonymous@undisclosed.example.com) — 2007-02-26T12:57:17+00:00

Home Daylight Savings Time '07 IF YOU NEED ASSISTANCE PERFORMING THE SUGGESTED ACTIONS FOR LINUX, SOLARIS AND JAVA --- PLEASE EMAIL ACSUNIX@WESLEYAN.EDU --- THE STEPS OUTLINED BELOW ARE ... AT YOUR OWN RISK ... Linux How to do this will vary from distro to distro, and should be handled by your update mechanism (yum, up2date, aptitude, etc) and distro for you rather easily, but basically you need to update your zoneinfo with the new info (typically /usr/share/zoneinfo) and then make sure that /…

25

Anonymous (anonymous@undisclosed.example.com) — 2007-04-03T14:35:44+00:00

Back Zebra Swallowtail from Enchanted Learning[zebra swallowtail] Monday Dell engineer Amol Choukekar arrives to do the final configuration of the cluster. First we set up two consoles; one for walking by the compute nodes and one permanently connected to head node. Following that Amol embarks on undoing my handiwork with the ethernet cables connected to the Dell switch. I had run the cables down the other side of the rack but this is also where the power cables are concentrated. To avoid …

26

Anonymous (anonymous@undisclosed.example.com) — 2007-04-19T19:28:49+00:00

Back The production copy of OpenMPI is in /share/apps/openmpi-1.2. --- Henk Meij 2007/04/19 15:27 HPLinpack Runs The purpose here is to rerun the HPLinpack benchmarks Amol ran while configuring the cluster. Before [Idle!] During [Heat!] Ooops [Burn!] FAQ External Link Problem Sizes N calculation, for example:
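The page's own N calculation is cut off in this listing, so what follows is only a rough sketch of a commonly quoted rule of thumb, not necessarily the calculation used here: pick the largest N such that an N x N matrix of 8-byte doubles fits in some fraction of total memory, rounded down to a multiple of the HPL block size. The node count, memory per node, the 80% fraction and the block size of 128 are all assumptions chosen for illustration.

```python
import math


def hpl_problem_size(nodes, mem_per_node_gib, fraction=0.8, block_size=128):
    """Rule-of-thumb HPL problem size N.

    Returns the largest multiple of block_size such that an N x N matrix
    of 8-byte doubles fits in `fraction` of the total memory.
    All default values are illustrative assumptions.
    """
    total_bytes = nodes * mem_per_node_gib * 1024**3
    # Number of double-precision elements that fit in the usable memory.
    max_elements = int(total_bytes * fraction) // 8
    n = math.isqrt(max_elements)
    # Round down to a multiple of the block size (NB in HPL.dat).
    return (n // block_size) * block_size


if __name__ == "__main__":
    # Hypothetical example: 16 nodes with 4 GiB each.
    print(hpl_problem_size(nodes=16, mem_per_node_gib=4))
```

Lowering `fraction` leaves more headroom for the operating system and MPI buffers; swapping during an HPL run usually hurts far more than a slightly smaller N.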

27

Anonymous (anonymous@undisclosed.example.com) — 2007-12-20T21:48:35+00:00

Back

Production!

We officially went into production when the backup policy was put in place. Read about that here: Link

Early - Bird Butterfly Access Period

Until we officially deploy into production with all software installed and a formally stated backup policy for home directories, access is provided to all with the following caveat:

28

Anonymous (anonymous@undisclosed.example.com) — 2014-02-21T16:23:30+00:00

Home Pretty old stuff but may be useful, up to date info is here Brief Guide to HPCC User Guide & Manuals * Account and Access * Login and Debug * Filesystems * the Queue Update page 03/01/2013 * old Queues * Job Submissions for serial jobs * Job Submissions for parallel jobs using Infiniband. * Software installed: Petaltail OCS 5.1 (since June 2009)

29

Anonymous (anonymous@undisclosed.example.com) — 2009-09-08T19:15:21+00:00

Back The information displayed here will undoubtedly change very quickly. So your mileage and output may be different. => Platform/OCS's very good [Running Jobs with Platform Lava] (read it). => In all the examples below, man command will provide you with detailed information, like for example

30

Anonymous (anonymous@undisclosed.example.com) — 2007-08-31T14:24:32+00:00

Back => Platform/OCS's very good [Running Jobs with Platform Lava] (read it). => In all the examples below, man command will provide you with detailed information, like for example man bsub. Jobs Non-Infiniband! For Infiniband submissions go to Internal Link This write up will only focus on how to submit jobs using scripts, meaning in batch mode. There is an interactive mode but in general if you create a script then you have a record of how you submitted your job.

31

Anonymous (anonymous@undisclosed.example.com) — 2007-04-19T19:45:47+00:00

Back OpenMPI ENV Tests To test your environment execute the following two binaries and compare the output. It should all be set up for you already. If not, contact the HPCadmin. #1 [hmeij@swallowtail ~]$ /share/apps/bin/hello.run Running on ilogin1 and ilogin2 with -np=16 Hello, world, I am 0 of 16 Hello, world, I am 11 of 16 Hello, world, I am 1 of 16 Hello, world, I am 2 of 16 Hello, world, I am 3 of 16 Hello, world, I am 4 of 16 Hello, world, I am 5 of 16 Hello, world, I am 6 of 16 He…

32

Anonymous (anonymous@undisclosed.example.com) — 2007-05-16T15:27:27+00:00

Back => Lava, the scheduler, is not natively capable of parallel job submissions. So a wrapper script is necessary. It will obtain the hosts from the LSB_HOSTS variable and build the “machines” file. Follow the TEST link below for detailed information.
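A minimal sketch of what such a wrapper might do, written in Python for illustration (the actual wrapper on the cluster is not shown in this listing and may well be a shell script). It assumes LSB_HOSTS holds a space-separated list with one host name per allocated job slot; the function names and the output file name `machines` are made up for this example.

```python
import os
from collections import Counter


def hosts_from_lsb(lsb_hosts):
    """Split the LSB_HOSTS value into a list, one entry per job slot."""
    return lsb_hosts.split()


def slots_per_host(hosts):
    """Return (host, slot_count) pairs in first-seen order."""
    counts = Counter(hosts)
    return [(host, counts[host]) for host in dict.fromkeys(hosts)]


def write_machines_file(path, hosts):
    """Write one host name per line, repeated once per slot."""
    with open(path, "w") as handle:
        for host in hosts:
            handle.write(host + "\n")


if __name__ == "__main__":
    allocated = hosts_from_lsb(os.environ.get("LSB_HOSTS", ""))
    if allocated:
        # Only write the file when run inside a scheduled job.
        write_machines_file("machines", allocated)
    for host, slots in slots_per_host(allocated):
        print(host, slots)
```

The resulting file can then be handed to the MPI launcher in whatever form that MPI flavor expects; the one-host-per-line layout is the simplest common denominator.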

33

Anonymous (anonymous@undisclosed.example.com) — 2010-07-08T23:14:14+00:00

Back Login June 2009 ... the cluster has been upgraded using a new front end node named petaltail.wesleyan.edu. The old host swallowtail.wesleyan.edu has been added and you can log in and submit jobs on either host. If you change your password, please do this on host petaltail.

34

Anonymous (anonymous@undisclosed.example.com) — 2007-05-16T19:58:31+00:00

Back General After you have logged in and read the User Guides & Manuals, you should be able to get some work done. If you have large compilations to perform please use one of the login nodes. You will also speed up your compilations if you use the localscratch area. To stage data and programs you are welcome to do so in your home directory.

35

Anonymous (anonymous@undisclosed.example.com) — 2009-09-08T19:11:33+00:00

Back For the recent set of tools go directly to petaltail.wesleyan.edu ---------- The rest of this page applies to OCS 4.1.1 (the old swallowtail), now defunct. ---------- => Note that some of these tools prompt for a password without encryption (http vs https).

36

Anonymous (anonymous@undisclosed.example.com) — 2008-03-17T18:44:29+00:00

Home ok, so this story begins with ... i thought i had met my inability to comprehend new technology when i was shown that disks can run multiple raid levels simultaneously. but this multipathing eclipses that. just weird, therefore worth describing.

37

Anonymous (anonymous@undisclosed.example.com) — 2007-05-15T17:13:01+00:00

Home Cluster Steering Committee 05/09/2007 Present: James Taft, Jolee West, Henk Meij, Francis Starr, David Beveridge, Eric Aaron, Tsampikos Kottos, George Petersson ToDos * fix PE2950s (dell issued) * dm-multipath failover (fiber channel) done! 05/14/07

38

Anonymous (anonymous@undisclosed.example.com) — 2009-07-13T15:32:59+00:00

Back THIS IS THE LIST OF SOFTWARE FOR THE HOST SWALLOWTAIL UNDER OCS 4.4.1, RHEL 4, GlibC 2.3.5-2.19 The listings below will be updated as software is installed. In no particular order ... PyPat * program: python egg, version 1.0)

39

Anonymous (anonymous@undisclosed.example.com) — 2017-09-29T13:19:54+00:00

Back Matlab Update Summer 2017 we converted our Wesleyan Matlab license to a campus wide Total Academic Headcount license. This implies no more license restrictions, so you can run as many Matlab jobs as you wish using the matlab2017b binary. At this time I see no need to license the Distributed Computation Engine in R2017b.

40

Anonymous (anonymous@undisclosed.example.com) — 2007-07-26T15:49:46+00:00

Back LSF RTM Platform/LSF RTM Demo (Real Time Monitoring) This is a slick application that monitors multiple or individual clusters. Built on top of Cacti. The demo below is somewhat heavy but if you're interested have a looksie. It “might” run on top of Platform/Lava, our current scheduler. Some features it includes are:

41

Anonymous (anonymous@undisclosed.example.com) — 2007-10-17T15:11:54+00:00

Back A parallel code example pulled from the BCCD project to probe around the notion of what is parallel computing? => This is page 1 of 3, navigation provided at bottom of page GalaxSee: N-Body Physics Default Behavior The problem is described Here. The Shodor web version of Galaxsee

42

Anonymous (anonymous@undisclosed.example.com) — 2007-08-09T20:48:02+00:00

Back ⇒ This is page 2 of 3, navigation provided at bottom of page Switch & MPI Flavors As you can see in the GalaxSee example, parallel code can provide significant speedup in job processing times, until some saturation point is reached where performance takes a hit because of the excessive time spent on passing messages.

43

Anonymous (anonymous@undisclosed.example.com) — 2007-08-10T15:10:50+00:00

Back This section focuses on some debugging tools which are pretty nifty in understanding message passing. In order to use them, another flavor of MPI is introduced. Sorry. Good news is, OpenMPI is trying to replace them all. => This is page 3 of 3, navigation provided at bottom of page

44

Anonymous (anonymous@undisclosed.example.com) — 2007-09-13T14:37:05+00:00

Back Since i went to a workshop on Basic LSF 6.2 Configuration and Administration held in Boston by Platform Computing... => consider me dangerous :-P Our cluster is driven by Platform/Lava as the scheduler, which in essence is LSF 6.1 ... so i've staged all the documentation at the link below. There is a ton of it, all very good. Covering all aspects of LSF, how to use it and administer it.

45

Anonymous (anonymous@undisclosed.example.com) — 2007-08-27T13:58:42+00:00

Back Ok, so we have a data center power outage for some electrical maintenance work Sunday 8/26 2am-9am. How to shut down the cluster? Here are the steps i took. Cluster Power Down * #1 * Turn all queues to inactive 24 hours before shut down.

46

Anonymous (anonymous@undisclosed.example.com) — 2007-09-19T15:52:56+00:00

Back Here is a listing of the reasons i know of, so far, why we should upgrade to LSF for HPC. The documentation for LSF v6.1 is here (although we would go to v7 right away). We are getting LSF! Thanks ITS. --- Henk Meij 2007/09/19 11:52 LSF for HPC

47

Anonymous (anonymous@undisclosed.example.com) — 2007-09-06T20:43:11+00:00

Back Running Gaussian To run Gaussian jobs on the cluster, read this page. It may help in identifying some errors you may encounter getting your jobs to run. It may also give you ideas to increase your overall job throughput rate. Access You must be a member of the group

48

Anonymous (anonymous@undisclosed.example.com) — 2008-09-25T18:29:50+00:00

Back SNAPSHOTS ARE NOT ENABLED AS OF 06/30/2008 --- Meij, Henk 2008/06/30 09:11 Backup Policy The backup policy of the cluster is described below. There are 2 different mechanisms. NetApp snapshots are taken and provide a convenient way to restore 'point-in-time'. Snapshots store the changes at the block level. Tivoli incremental backups store files when metadata of those files has changed. It serves as a backup for file restorations and deleted files.

49

Anonymous (anonymous@undisclosed.example.com) — 2013-08-08T19:07:49+00:00

Back Lava/LSF works via a variety of daemon processes that communicate with each other. * Load Information Manager (LIM, master & slaves) * gathers built-in resource load information directly from /dev/kmem * forwards information to master LIM

50

Anonymous (anonymous@undisclosed.example.com) — 2011-05-09T17:02:29+00:00

Back Job Arrays Just have to document this. Very handy. You can find detailed information at this Link A job array makes it easy to manage a sequence of jobs with “shorthand” syntax. It could be managing 20 jobs or 2,000 jobs, managed by a single command. Once submitted the array job itself can be managed or the individual jobs that make up the job array.
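To illustrate the “shorthand” idea, here is a small hedged sketch that only builds the submission command and shows how an array element might pick its own input. It relies on the LSF conventions `bsub -J "name[first-last]"`, the `%J`/`%I` placeholders in output file names, and the LSB_JOBINDEX environment variable; the job name, script name and input file pattern are hypothetical, and nothing is actually submitted.

```python
import os
import shlex


def array_bsub_command(name, first, last, script, out_pattern="out.%J.%I"):
    """Build the argument list for submitting a job array.

    %J expands to the job ID and %I to the array index in LSF output names.
    """
    spec = f"{name}[{first}-{last}]"
    return ["bsub", "-J", spec, "-o", out_pattern, script]


def input_for_this_element(pattern="input.{index}.dat", env=None):
    """Pick the input file for the current array element.

    The scheduler sets LSB_JOBINDEX for each element; the file naming
    pattern is an assumption made for this example.
    """
    env = os.environ if env is None else env
    index = int(env["LSB_JOBINDEX"])
    return pattern.format(index=index)


if __name__ == "__main__":
    # Print the command instead of running it.
    print(shlex.join(array_bsub_command("sim", 1, 20, "./run.sh")))
```

One command then stands in for twenty separate submissions, and each element finds its own work from its index.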

51

Anonymous (anonymous@undisclosed.example.com) — 2007-09-28T18:57:05+00:00

Back This is for experimental purposes only. Proof of concept type of a thing. --- Henk Meij 2007/09/28 11:38 The Story Of NAT The cluster is served file systems from our NetApp Fabric Attached Storage Device. These file systems are NFS mounted on each compute node via the IO node. The NFS traffic is isolated to one of our private networks on the cluster, the 10.3.1.xxx subnet, running across a Cisco 7000 gigabit ethernet switch.

52

Anonymous (anonymous@undisclosed.example.com) — 2007-11-20T15:18:38+00:00

Back Upgrading to LSF Why? Here is my summation of some items i wish to take advantage of: Link We're running Platform/OCS which includes the Lava scheduler. It's sorta like LSF but with functionality removed. However, it is free and very good. Our Dell cluster came pre-configured with Lava but it's time to leverage the resources of our cluster in more detail.

53

Anonymous (anonymous@undisclosed.example.com) — 2017-02-09T15:14:29+00:00

Back Acknowledgement If you publish a paper where the cluster was used for calculation, please include the following acknowledgement: “We thank Wesleyan University for computer time supported by the NSF under grant number CNS-0619508 and CNS-0959856.

54

Anonymous (anonymous@undisclosed.example.com) — 2007-10-20T00:23:15+00:00

Back Job Slots I was asked in the UUG meeting yesterday how one determines how many job slots are still available. Turns out to be a tricky question. In CluMon you might observe one host with only one JOBPID running yet it is declared 'Full' by the scheduler. This would be a parallel job claiming all job slots with the
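As a first approximation only, and not necessarily the approach this page goes on to describe, free slots can be estimated from `bhosts` output as MAX minus NJOBS on hosts whose status is `ok`. The sketch below parses a made-up sample in the usual `bhosts` column layout; the host names and numbers are invented for illustration, and hosts that are closed or unavailable are simply skipped.

```python
# Invented sample in the standard bhosts column layout.
SAMPLE_BHOSTS = """\
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
compute-1-1        ok              -      8      3      3      0      0      0
compute-1-2        closed          -      8      8      8      0      0      0
compute-1-3        unavail         -      8      0      0      0      0      0
"""


def free_slots(bhosts_text):
    """Return {host: free job slots} for hosts that accept new jobs."""
    free = {}
    for line in bhosts_text.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 5:
            continue
        host, status, max_slots, njobs = fields[0], fields[1], fields[3], fields[4]
        # Skip closed/unavailable hosts and hosts without a slot limit ("-").
        if status != "ok" or max_slots == "-":
            continue
        free[host] = max(int(max_slots) - int(njobs), 0)
    return free


if __name__ == "__main__":
    slots = free_slots(SAMPLE_BHOSTS)
    print(slots, "total:", sum(slots.values()))
```

A parallel job that claims every slot on a host shows up as a full NJOBS count even though only one process ID is visible, which is exactly why counting processes is misleading.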

55

Anonymous (anonymous@undisclosed.example.com) — 2007-11-02T18:09:46+00:00

Back Overloading Job Slots Typically, in a particular configuration file, you define how many “cores” a node has. This is then equated to “job slots”. In a default scenario, the number of cores and job slots are equal. The assumption behind this is that each job contains a task that will consume all resources available to that core.

56

Anonymous (anonymous@undisclosed.example.com) — 2007-11-06T15:08:27+00:00

Back [CLACReps] High Performance Cluster @ Wesleyan General answers to questions posed by the CLACReps. This wiki may have much more detailed information scattered about and i'll point to some relevant pages. Click on the Back link above to go to the main page. Our cluster resides on our internal VLAN, hence is only accessible via Active Directory guest accounts and VPN for non-wesleyan users.

57

Anonymous (anonymous@undisclosed.example.com) — 2007-11-01T14:23:45+00:00

Home Cluster Steering Committee 10/29/2007 Present: Jolee West, Henk Meij, Francis Starr, George Petersson, Ganesan Ravishanker ToDos * continue to look for the person to fill the Coordinator, Scientific Computing and Informatics Center position

58

Anonymous (anonymous@undisclosed.example.com) — 2008-09-19T15:25:05+00:00

Back Platform LSF 6.2 Documentation This is a local copy of the information available at External Link and available to connections from wesleyan.edu only. The documentation is quite good. Here are some useful links into the local site mentioned above:

59

Anonymous (anonymous@undisclosed.example.com) — 2008-01-09T19:04:15+00:00

Back Complete Documentation It's all at this link COMPLETE DOCUMENTATION FOR LSF/HPC 6.2 and very good. New Features in LSF 6.2 This page will be expanded to show examples of LSF/HPC advanced features. The more information you can provide to the scheduler regarding run times, resources needed and when, the more efficient the scheduling will be. The examples below are just made up scenarios. Try to get familiar with them or ask for hands-on working sessions.

60

Anonymous (anonymous@undisclosed.example.com) — 2008-06-19T14:20:58+00:00

Back The basic configuration of the cluster is detailed below. This information was requested for inclusion in proposals and the like. I'm not regularly updating this information so email me if you need this page to be updated. --- Meij, Henk 2007/12/03 13:39

61

Anonymous (anonymous@undisclosed.example.com) — 2007-12-26T15:04:10+00:00

Home Cluster Steering Committee 12/18/2007 Present: Jolee West, Henk Meij, Ganesan Ravishanker, Albert Fry, Francis Starr, David Beveridge. ToDos * continue to look for the person to fill the Coordinator, Scientific Computing and Informatics Center

62

Anonymous (anonymous@undisclosed.example.com) — 2007-12-21T19:08:01+00:00

Back Automated Submissions Quanli walked into the office with a request: how can one automate the submission of tons of jobs? In his case Gaussian jobs. “Job Arrays” i answered confidently, but that turned out to be a bit of a problem. Still working on that.

63

Anonymous (anonymous@undisclosed.example.com) — 2008-02-25T16:43:55+00:00

Home Green? Christmas/New Years 07/08 Green computing it is not. Perhaps i should shut down idle hosts ;-) [save electric?] Fun! An honorable mention goes to ... [root@swallowtail ~]# bjobs -u qgu JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME

64

Anonymous (anonymous@undisclosed.example.com) — 2008-01-10T21:42:55+00:00

Home LSF & MPI The new installation of LSF supports an integrated environment for submitting parallel jobs. What this means is that the scheduler can keep track of the resource consumption of a job spawning many parallel tasks. Lava was unable to do so.

65

Anonymous (anonymous@undisclosed.example.com) — 2008-06-03T17:27:22+00:00

Back Milestone! June 3rd 2008, shortly before 1 pm, the job with JOBID 100,000 was completed. JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 100000 ztan DONE imw swallowtail compute-1-23 ising-spring-2008-swt/simpleRuns/data018/t0.900000/mu-2.630000

66

Anonymous (anonymous@undisclosed.example.com) — 2008-06-28T19:59:31+00:00

Back The catastrophic crash of June 08. The actual cause of the crash is the filling of the 4TB home directory file system. This happened on Sun Jun 22 16:23:54 EDT [filer3: wafl.vol.full:notice]: file system on volume cluster_home is full My notes on the recovery are below.

67

Anonymous (anonymous@undisclosed.example.com) — 2008-06-29T19:51:20+00:00

Back The catastrophic crash of June 08 A huge thank you to all for being patient, understanding, and supportive during the week of downtime! Here are some notes taken while restoring the cluster: LINK Configuration Changes Previously, our home directories and sanscratch file system areas were 4 TB and 1 TB volumes, respectively, on host

68

Anonymous (anonymous@undisclosed.example.com) — 2008-08-19T17:34:11+00:00

Back RTM This is a collection of interesting graphs generated by the Real Time Monitoring tool Platform is developing. Data covers our evaluation period. What is RTM ? RTM is used to monitor and graph LSF resources (including networks, disks, applications, etc.) in a cluster, or multiple clusters. In graph or table formats, RTM displays resource-related information such as the number of jobs submitted, the details of individual jobs (like load average, cpu usage, job owner), or the hosts o…

69

Anonymous (anonymous@undisclosed.example.com) — 2008-10-22T14:54:21+00:00

Back NAMD Most of your questions can be answered on the NAMD web site, or subscribe to their community supported list namd-l. The rest of this page is simple instructions to get you going. Jobs The NAMD binary was compiled against the Topspin libraries, hence can only run on the

70

Anonymous (anonymous@undisclosed.example.com) — 2009-04-16T13:11:41+00:00

Back Milestone Job number 100,000 finished. June 03, 2008. JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 100000 ztan DONE imw swallowtail compute-1-23 ising-spring-2008-swt/simpleRuns/data018/t0.900000/mu-2.630000 Jun 3 12:48

71

Anonymous (anonymous@undisclosed.example.com) — 2008-12-23T16:50:37+00:00

Back How To To make a movie (mpeg4) from png files, type the following line in the console from the folder containing the png files. mencoder "mf://*.png" -mf fps=10 -o output.avi -ovc lavc -lavcopts vcodec=mpeg4
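If the movie step needs to be scripted, the same command can be assembled from Python. This is only a sketch: it assumes mencoder is installed and on the PATH, it uses mencoder's `mf://` multi-file prefix for the wildcard (an assumption about the intended spelling, since wiki markup tends to swallow a double slash), and the helper name is made up. The command is printed rather than executed.

```python
import shlex


def mencoder_command(pattern="*.png", fps=10, output="output.avi"):
    """Build the mencoder argument list for a PNG sequence.

    The wildcard is passed through unexpanded; mencoder's mf:// input
    handles it itself, so no shell is needed.
    """
    return [
        "mencoder", f"mf://{pattern}",
        "-mf", f"fps={fps}",
        "-o", output,
        "-ovc", "lavc",
        "-lavcopts", "vcodec=mpeg4",
    ]


if __name__ == "__main__":
    # To actually run it: subprocess.run(mencoder_command(), check=True)
    print(shlex.join(mencoder_command()))
```

Passing the arguments as a list avoids shell quoting problems when the output name or pattern contains spaces.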

72

Anonymous (anonymous@undisclosed.example.com) — 2009-06-17T15:53:31+00:00

Back Links * Home Page page with links to all relevant documentation. * Software page for petaltail. * Ganglia monitors node availability, system load, network usage, and other resource information. * Cacti is a complete network graphing solution (need to add LSF data). Login as guest/guest to view graphs.

73

Anonymous (anonymous@undisclosed.example.com) — 2025-08-29T18:41:39+00:00

Back Software OpenHPC Software for Rocky 8 Below is all centos 6 and/or 7, which may or may not run in 8. IMPORTANT NOTE: Since moving to the TrueNAS/ZFS appliance all references to /home/apps should be replaced with /share/apps which points to /zfshomes/apps ---

74

Anonymous (anonymous@undisclosed.example.com) — 2009-03-20T19:16:49+00:00

Home Usage Here is the academic compute cluster usage starting 01/01/2008. [:cluster:bqueues_mar09.jpg] Home

75

Anonymous (anonymous@undisclosed.example.com) — 2009-04-09T17:25:32+00:00

Home OCS5.1/RHEL5.1/Lava Some run times here testing the new hosts. Of course there is still contention inside the switches with swallowtail. C-00 is on Infiniband, C-01 is on Gig Ethernet switch. Please note that if N<=8 all MPI message passing remains local to the host and does not travel through the switches. So these are not benchmarks.

76

Anonymous (anonymous@undisclosed.example.com) — 2009-04-16T13:27:23+00:00

Back Milestone Job number 200,000 finished. April 15th, 2009. "JOB_FINISH" "6.2" 1239715256 200000 14846 33554450 1 1239708461 0 0 1239713000 "scoppage" "elw" "" "" "" "swallowtail" "ct2/run/v2j11Fit4" "" "2-11-2.out" "" "1239708461.200000" 0 1 "compute-1-23" 64 240.0 "" "./run 5000 2 11 2 6051223 4359.980846 4359.980846 3.054267 3.054267 0.731855 0.692813" 2250.508870 0.274958 0 0 -1 0 0 1424 0 0 0 0 -1 0 0 0 173 62771 -1 "" "default" 0 1 "" "" 0 4616 121116 "" "" "" "" 0 "" 0 "" -1 "/…
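The line above is a raw JOB_FINISH record from the scheduler's accounting log. As a hedged sketch, the leading fields can be pulled apart as below. The field names follow my reading of the LSF lsb.acct layout and should be treated as assumptions; only the untruncated prefix of the record is used, and the times are Unix epoch seconds.

```python
import shlex

# Leading portion of the record shown above (the rest is truncated).
RECORD_PREFIX = (
    '"JOB_FINISH" "6.2" 1239715256 200000 14846 33554450 1 '
    '1239708461 0 0 1239713000 "scoppage" "elw"'
)

# Assumed names for the first 13 fields of a JOB_FINISH record.
FIELDS = [
    "event", "version", "event_time", "job_id", "user_id", "options",
    "num_processors", "submit_time", "begin_time", "term_time",
    "start_time", "user_name", "queue",
]

INT_FIELDS = set(FIELDS) - {"event", "version", "user_name", "queue"}


def parse_job_finish_prefix(record):
    """Parse the leading fields of a JOB_FINISH record into a dict."""
    tokens = shlex.split(record)  # handles the quoted string fields
    job = dict(zip(FIELDS, tokens))
    for key in INT_FIELDS:
        job[key] = int(job[key])
    # Derived values: time spent pending and time spent running.
    job["wait_seconds"] = job["start_time"] - job["submit_time"]
    job["run_seconds"] = job["event_time"] - job["start_time"]
    return job


if __name__ == "__main__":
    info = parse_job_finish_prefix(RECORD_PREFIX)
    print(info["job_id"], info["user_name"], info["queue"],
          info["wait_seconds"], info["run_seconds"])
```

Under these assumptions the record describes a single-processor job in the elw queue; the later fields (hosts, command line, resource usage) are left alone because the excerpt cuts them off.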

77

Anonymous (anonymous@undisclosed.example.com) — 2009-09-09T13:21:27+00:00

Back Expansion Donation of hardware by Blue Sky Studios 4 racks with blade servers, 52 servers per rack, will be dropped off tomorrow. Not all will be turned on as 5% of the servers are kaput. So the first step is to build a fully working rack. Memory footprint of the servers is either 12 or 24

78

Anonymous (anonymous@undisclosed.example.com) — 2009-07-08T20:54:30+00:00

Back Clusters That Produce: 25 Open HPC Applications Informative article on open source applications for * Bioinformatics * Molecular Dynamics * Electronic Structure/Quantum Chemistry * Environment/Weather * Computational Fluid Dynamics * Finite Element Analysis

79

Anonymous (anonymous@undisclosed.example.com) — 2013-04-18T19:40:47+00:00

Back Deprecated. We only have one rack running (on demand) offering access to 1.1 TB of memory. The bss24 queue on head node greentail represents the Blue Sky Studio job slots available. --- Meij, Henk 2013/04/18 15:39 [sharptail saltmarsh sparrow] Update: 21 Sept 09 Cluster sharptail has undergone some changes. Courtesy of ITS, 2 more blade enclosures have permanently been added. Another 3 blade enclosures have temporarily been added (destined for another ITS project so they may disappea…

80

Anonymous (anonymous@undisclosed.example.com) — 2009-08-28T14:56:52+00:00

Back Cluster: sharptail Doing some test runs for rough comparisons. Gaussian Running one small example using g03 ... * swallowtail/petaltail (N=2, imw) 65 mins * sharptail (N=2, bss12) 121 mins * sharptail (N=8!, bss12, so across nodes) --

81

Anonymous (anonymous@undisclosed.example.com) — 2010-12-09T15:46:49+00:00

Back Gaussian never fixed the connectivity with Linda so it cannot be run across multiple nodes. --- Meij, Henk 2010/12/09 10:45 Gaussian & Linda (I wrote this up for a user so am sharing it here until we get clarification from Gaussian.com) Hi Anthony, I observed your job below on sharptail. This must be running with the standard g09 executable. Gaussian is a program that forks itself on the same host for as many threads as you define, in your case 16. You’ll notice below that the schedule…

82

Anonymous (anonymous@undisclosed.example.com) — 2011-02-03T20:08:54+00:00

Back There is a newer version of this page at this page Brief Description For inclusion in proposals. This page is not maintained. Academic High Performance Computing at Wesleyan Wesleyan University's HPC environment comprises two clusters: the “swallowtail” Dell hardware cluster and the “sharptail” Angstrom hardware cluster. A brief description of each follows.

83

Anonymous (anonymous@undisclosed.example.com) — 2010-09-14T14:40:26+00:00

Back Overview With the second NSF proposal in a “recommended for funding” state, I'm preparing this page so we can address some looming issues and make decisions on our potential new acquisition. In general, these are the main topics: * Data Center/ITS Items

84

Anonymous (anonymous@undisclosed.example.com) — 2010-05-03T14:42:23+00:00

Back Cluster Support So, the Dell cluster (petal/swallow-tails) ran out of support on January 25th of this year, as I found out when I called on 03/01/2010 with a hardware problem. So the question is what to do next? The Dell hardware is now 3 years old, but still in good condition. The failures during those 3 years have included: 4 replaced disks, 2 system boards and power fans, and perhaps 8-10 memory sticks. Let's assume that stays the same for now.

85

Anonymous (anonymous@undisclosed.example.com) — 2010-06-23T14:27:59+00:00

Back Recent Queue Usage 01 April thru 09 June, 2010 Cluster Petaltail/Swallowtail The Sharptail Cluster Historic Queue Usage So we know we have a need for home directory disk space. But if we got more nodes, which would they be? So here is a look at that. The 2-3 week period from Christmas to end of first week of Jan, represented by no jobs running/pending, was when the electrical work was done on our building. The tremendous spikes are from a single user w…

86

Anonymous (anonymous@undisclosed.example.com) — 2010-05-13T18:29:42+00:00

Back Cloud Or Not? There is a lot of buzz about cloud computing. Recently, this has spilled over into the HPC world. There are private and public clouds, and private clouds can be hosted at an external organization or internal to your own organization. I do not have the gist of it down yet, but this page explores the option: can or should we consider spending our $298K NSF award on an HPC cloud, or stay the course and buy new hardware?

87

Anonymous (anonymous@undisclosed.example.com) — 2010-05-19T20:55:17+00:00

Back Cluster Software * A good overview, swallowtail is platform OCS/LSF and sharptail is Kusu/Lava * HPCprojects * And I guess this is the spinoff of Kusu/Lava - UniCluster (scheduler?) * UniCluster ... works with schedulers ??? * Then there is HP's version

88

Anonymous (anonymous@undisclosed.example.com) — 2010-08-17T19:56:04+00:00

Back Blue Sky Studios Hardware We have 4 racks of which 3 are powered up. All on utility power including head/login node. Racks are surprisingly cool compared to our Dell cluster. Some digging revealed that the AMD Opteron chip cycles down to 1 GHz if not used instead of running at 2.4 GHz all the time (you can observe this in /proc/cpuinfo).
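To watch the clock scaling mentioned above, the per-core speeds can be pulled out of /proc/cpuinfo. This is only a sketch: the SAMPLE text is made up for illustration (not captured from these nodes), the helper names are hypothetical, and it assumes the x86 Linux "cpu MHz" field is present.

```python
import re

def cpu_mhz(cpuinfo_text):
    """Return the per-core clock speeds (MHz) found in /proc/cpuinfo-style text."""
    pattern = r"^cpu MHz\s*:\s*([\d.]+)"
    return [float(m) for m in re.findall(pattern, cpuinfo_text, flags=re.MULTILINE)]

def throttled_cores(cpuinfo_text, full_speed_mhz=2400.0):
    """Count cores currently clocked below the assumed full speed."""
    return sum(1 for mhz in cpu_mhz(cpuinfo_text) if mhz < full_speed_mhz)

# Hypothetical sample: one idle core at 1 GHz, one busy core at 2.4 GHz.
SAMPLE = """processor\t: 0
model name\t: AMD Opteron (sample)
cpu MHz\t\t: 1000.000
processor\t: 1
model name\t: AMD Opteron (sample)
cpu MHz\t\t: 2400.000
"""

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE  # fall back to the made-up sample off Linux
    print(cpu_mhz(text), "throttled:", throttled_cores(text))
```

Running it on an idle node and again under load should show the reported MHz values move, if the kernel exposes that field.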

89

Anonymous (anonymous@undisclosed.example.com) — 2010-11-22T19:05:13+00:00

Back HP HPC Notes for the cluster design conference with HP. “do later” means we tackle after the HP on site visit. S & H * Shipping Address: 5th floor data center * No 13'6" truck, 12'6" is ok or box truck * Delivery on standard raised dock, no way to lift rack out of truck if not docked

90

Anonymous (anonymous@undisclosed.example.com) — 2010-09-28T20:32:54+00:00

Back Recent Queue Usage 14 Sept thru 28 Sept, 2010 Cluster Petaltail/Swallowtail The Sharptail Cluster Back

91

Anonymous (anonymous@undisclosed.example.com) — 2011-01-07T20:49:33+00:00

Back Linpack Grabbed the Linpack source and compiled against /opt/openmpi/1.4.2 ... using the Make.Linux_PII_CBLAS makefile. Had to grab the atlas libraries from another host. We changed $HOME and pointed to libmpi.so ($MPdir and $MPlib) and repointed $LAdir. Then it compiled fine.

92

Anonymous (anonymous@undisclosed.example.com) — 2011-03-30T15:59:36+00:00

Home a “bottom's up” page of tasks performed while inching towards deployment. Closing this page. Update (03/28/2011) greentail's home directories are now served up on petaltail/swallowtail cluster ... this means your home directory is the same across the Dell and HP clusters (minus one host, will work on that tomorrow as well as cluster sharptail).

93

Anonymous (anonymous@undisclosed.example.com) — 2011-01-11T20:55:58+00:00

Back greentail Greentail Time to introduce our new high performance cluster greentail, a Hewlett Packard HPC solution. If you want to read more about the details of the hardware, you can find it at External Link. The name refers to the Smooth Green Snake, which, no surprise, has a green tail.

94

Anonymous (anonymous@undisclosed.example.com) — 2011-01-25T16:39:32+00:00

Back * JAC and Factor_IX are two sample programs included with Amber (one memory intensive, one IO intensive, forgot what is what). * “1g6r” is a program from Surjit Dixit that should scale really well in his opinion. Swallowtail Amber What

95

Anonymous (anonymous@undisclosed.example.com) — 2013-07-24T15:00:31+00:00

Back * the Queue Update page 03/01/2013 Newest Configuration The Academic High Performance Compute Cluster is comprised of two login nodes (greentail and swallowtail, both Dell PowerEdge 2050s). Old login node petaltail (Dell PowerEdge 2950) can be used for testing code (does not matter if it crashes, its primary duty is backup to physical tape library).

96

Anonymous (anonymous@undisclosed.example.com) — 2015-03-13T17:52:53+00:00

Home Note You can also run Matlab jobs via scripts, like other software. Create a file with the Matlab commands and then create a shell script to submit the job. It would look like

#!/bin/bash
# submit via 'bsub < run.serial'
#BSUB -q matlab
#BSUB -J test
#BSUB -o test.stdout
#BSUB -e test.stderr
matlab -nodisplay < my_code.m > /dev/null

97

Anonymous (anonymous@undisclosed.example.com) — 2012-02-16T19:09:41+00:00

Home Summary The purpose of this testing is to find out how fast the storage systems respond when either directly attached to compute nodes or attached via Ethernet (gigabit Ethernet) or Infiniband (SDR via queue imw or QDR via queue hp12). When using Infiniband interconnects we use IPoIB (IP traffic over Infiniband interconnects, which theoretically might be 3-4 times faster than Ethernet).

98

Anonymous (anonymous@undisclosed.example.com) — 2011-08-19T15:36:07+00:00

Back Recent Queue Usage Cluster Greentail (hp12 queue) Cluster Greentail (all queues) 01 Jan 2011 thru 17 Jan, 2011 Cluster Greentail Cluster Petaltail/Swallowtail The Sharptail Cluster

99

Anonymous (anonymous@undisclosed.example.com) — 2011-03-21T15:36:26+00:00

Back Milestone Bummer, I'm unable to grab the 1,000,000 or 999,999 jobpids. Wonder what happened to them. JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 999997 jwampler DONE emw c26c25 ./run 50000 0 18 55 20.179700 203.690000 4.000000 0.325000 10.605000 0.750000 0.500000

100

Anonymous (anonymous@undisclosed.example.com) — 2011-03-28T14:23:22+00:00

Back Quotas On 03/25/2011 quotas were enabled on cluster greentail. Group quota numbers are weird (as if duplicates are being counted, links followed, or something is looping; not sure), so we have to rely on individual quotas. Just as well. Here is what was put in place; it will mature/change over time. But first, it is important to understand:

101

Anonymous (anonymous@undisclosed.example.com) — 2012-01-18T14:49:02+00:00

Back 2011 Queue Usage These data reflect the combined queues on the HP greentail cluster. 01 Jan 2011 thru 31 Dec, 2011 Cluster Greentail 01 Jan 2011 thru 31 Dec, 2011 Cluster Greentail (Subset, less than 1,000 pending jobs)

102

Anonymous (anonymous@undisclosed.example.com) — 2020-08-24T11:19:28+00:00

Back Note #1 CentOS 8.1 with the standard firewalld. If this is of interest to you this was how I managed to get it to work: EXTIFACE=MASTER_NODE_EXT_INTERFACE_DEVICE (e.g. eno1) INTIFACE=MASTER_NODE_INTERNAL_INTERFACE_DEVICE (e.g. eno2) INTIPADDR=MASTER_IP_OF_INTERNAL_IFAC PREFIX=PREFIX_OF_INTERNAL_NETWORK firewall-cmd --change-interface=${EXTIFACE} --zone=public firewall-cmd --change-interface=${INTIFACE} --zone=trusted --permanent firewall-cmd --permanent --direct --passthrough ipv4 -t nat …

103

Anonymous (anonymous@undisclosed.example.com) — 2011-12-22T19:34:29+00:00

Back Some general information for SAS users. SAS SAS, the statistical analysis software (and much more), frequently used in the social sciences, is available on the High Performance Academic Computing Cluster. It is not a parallel version of SAS, but we do offer an unlimited Linux license for Teaching and Research.

104

Anonymous (anonymous@undisclosed.example.com) — 2012-01-11T19:20:58+00:00

Back Milestone Starting anew with greentail Lava scheduler JOBPID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 500000 lvargaslara EXIT hp12 greentail n23 n2l16f4L20c26 Dec 28 11:04 2011

105

Anonymous (anonymous@undisclosed.example.com) — 2012-11-29T19:02:56+00:00

Back 2012 Queue Usage These data reflect the combined queues on the HP greentail cluster. 2012 Cluster Greentail 2012 Cluster Greentail (Subset, less than 1,000 pending jobs)

106

Anonymous (anonymous@undisclosed.example.com) — 2016-11-21T20:22:47+00:00

Back Stata Some general information for Stata users. Sample * save in a script say called run and submit to scheduler bsub < run #!/bin/bash rm -rf err out # relevant commands # find available hosts: bhosts # find queues: bqueues # submit job: bsub < run # show jobs submitted: bjobs #BSUB -q stata #BSUB -J test #BSUB -o out #BSUB -e err # use n cores (job slots) ... a license limitation of 6 #BSUB -n 6 # force using all on one node (hosts=1) #BSUB -R "span[hosts=1]" # can stata-mp…

107

Anonymous (anonymous@undisclosed.example.com) — 2013-09-11T13:18:54+00:00

Back GPU History What is a GPU cluster? * CPU = Central Processing Unit * the chip on the motherboard with L1/L2 caches and often comprised of cores (like dual quad or 8) * each core typically processes one computing job * kernel also has ability to swap to disk (not a desired long term state)

108

Anonymous (anonymous@undisclosed.example.com) — 2014-02-21T15:25:36+00:00

Back This outdated page has been replaced by Brief Guide to HPCC Our Queues An update on our queues ... --- Meij, Henk 2013/09/10 14:43
Queue | Nr Of Nodes | Total GB Mem Per Node | Total Cores In Queue | Switch | Hosts | Notes
matlab | na | na | na | either | any host in hp12,elw,emw,imw | max jobs 'per user' or 'per host' is 8

109

Anonymous (anonymous@undisclosed.example.com) — 2013-10-16T19:13:53+00:00

Back Lammps GPU Testing (EC) * 32 cores E2660 * 4 K20 GPU * workstation * MPICH2 flavor Same tests (12 cpu cores) using lj/cut, eam, lj/expand, and morse: AU.reduced CPU only 6 mins 1 sec; 1 GPU 1 min 1 sec (a 5-6 times speed up); 2 GPUs 1 min 0 secs (never saw 2nd GPU used, problem set too small?)
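The "5-6 times" figure follows directly from the wall times quoted above. A small worked check, using only those three timings (the function and variable names are just for illustration):

```python
def to_seconds(minutes, seconds):
    """Convert a 'M mins S secs' wall time to seconds."""
    return 60 * minutes + seconds

# Wall times from the AU.reduced runs quoted above.
cpu_only = to_seconds(6, 1)   # 361 s
one_gpu = to_seconds(1, 1)    # 61 s
two_gpus = to_seconds(1, 0)   # 60 s

speedup_one = cpu_only / one_gpu
speedup_two = cpu_only / two_gpus

print(f"1 GPU: {speedup_one:.2f}x, 2 GPUs: {speedup_two:.2f}x")
```

The one second gained with two GPUs is consistent with the note that the second GPU was never seen in use.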

110

Anonymous (anonymous@undisclosed.example.com) — 2013-05-24T13:39:17+00:00

Back Notes * HP cluster off support 11/30/2013 * We need greentail/disk array support maybe 2 more years? * Karen added to budget, Dave to approve ($2200/year) * We need another disk array * For robust D2D backup * Pressed HP Procurve ethernet backup switch into production

111

Anonymous (anonymous@undisclosed.example.com) — 2013-02-04T19:28:42+00:00

Back Amber GPU Testing (EC) We are interested in benchmarking the serial, MPI, cuda and cuda.MPI versions of pmemd. Results * Verified the MPI threads and GPU invocations * Verified the output data * pmemd.cuda.MPI errors * Script used is listed at end of this page

112

Anonymous (anonymous@undisclosed.example.com) — 2013-10-08T19:04:03+00:00

Back All campus utilities are within the physical plant chart of accounts. Departments are not charged. Overview Cluster Blue Sky Dell HP GPU CPU Comment 12/2006 11/2010 04/2013 04/2013 Age (yrs) 11 5.5 1.5 0 0 Nodes (Nr) 45

113

Anonymous (anonymous@undisclosed.example.com) — 2014-02-03T17:08:09+00:00

Back 2013 Queue Usage These data reflect the combined queues on the HP greentail cluster. 2013 Cluster Greentail 2013 Cluster Greentail (Subset, less than 1,000 pending jobs) 2013 Cluster Greentail (mwgpu queue jobs)

114

Anonymous (anonymous@undisclosed.example.com) — 2013-09-10T18:59:58+00:00

Back Build Hadoop (test) Cluster Use Hadoop (test) Cluster These are my notes on building a test Hadoop cluster on virtual machines in VMware. They consist of a blend of instructions posted by others with my commentary added. Please review these sites so this page makes sense to you.

115

Anonymous (anonymous@undisclosed.example.com) — 2013-09-10T19:04:03+00:00

Back Use Hadoop Cluster Build Hadoop Cluster Word count, vanilla Ross writes .... I did the classic map-reduce example: a word count of a flat text file, in this case James Joyce's Ulysses. * First, I downloaded the data. * Second, I copied the data from my home folder to the Hadoop Distributed File System (HDFS).

116

Anonymous (anonymous@undisclosed.example.com) — 2014-02-04T18:57:13+00:00

Back Since deployment of sharptail the information below is out of date. /home is now the same across the entire HPCC and served out by sharptail. --- Meij, Henk 2014/02/04 13:56 Sharptail Cluster A recycled head node name, seems appropriate. The new hardware has been delivered and rack&stacked. First priority was looking around while /home was copied from greentail:/home. This cluster is comprised of one head node (sharptail) and 5 compute nodes (n33-n37). The head node has a 48 TB disk…

117

Anonymous (anonymous@undisclosed.example.com) — 2013-07-23T14:46:29+00:00

Back Milestone Starting anew with Lava scheduler (hp + dell + bss hardware, microway hardware in recess stay&play mode). JOBPID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME 999998 blycette DONE bss24 greentail

118

Anonymous (anonymous@undisclosed.example.com) — 2013-08-22T18:58:07+00:00

Back PGI Accelerator Some quick notes for our trial environment. The PGI compilers come with their own MPI flavor, Java JRE and Cuda (4.2 and 4.5). These compilers can compile straight C/C++ and Fortran code. But they can also compile that code for parallel invocation using MPI and GPU-enable the code stack.

119

Anonymous (anonymous@undisclosed.example.com) — 2021-06-17T19:32:47+00:00

Back Submitting GPU Jobs Please leave plenty of time between multiple GPU job submissions. Like minutes. Jobs need to be submitted to the scheduler via cottontail to queues mwgpu, amber128, exx96. This page is old, the gpu resource gpu4 should be used, a more recent page can be found

120

Anonymous (anonymous@undisclosed.example.com) — 2014-02-21T15:24:39+00:00

Back This outdated page has been replaced by Brief Guide to HPCC --- Meij, Henk 2014/02/21 10:23 Updated --- Meij, Henk 2013/09/10 14:42 * the Queue Update New Configuration The Academic High Performance Compute Cluster is comprised of two login nodes (greentail and swallowtail, both Dell PowerEdge 2050s). Old login node petaltail (Dell PowerEdge 2950) can be used for testing code (does not matter if it crashes, its primary duty is backup to physical tape library).

121

Anonymous (anonymous@undisclosed.example.com) — 2013-09-16T15:09:00+00:00

Back Hadoop Summary Our production Hadoop Cluster is based on Cloudera's CDH3U6 repository. Here are some details: * namenode (that is login node): whitetail.wesleyan.edu * whitetail also runs the Hadoop Scheduler and Health Monitor * Health Status * Job Tracker *

122

Anonymous (anonymous@undisclosed.example.com) — 2013-10-01T18:30:55+00:00

Back Workshop and Q&A Fall 2013 * Clusters! We have ... * New Configuration * Tails! We have ... * greentail (HP) primary login node and scheduler (do not run long running programs on it) * current file server, to be backup file server * swallowtail (Dell) secondary login node and scheduler (do not run long running programs on it)

123

Anonymous (anonymous@undisclosed.example.com) — 2013-10-23T18:52:02+00:00

Back Replace Dell Racks A Novella: Replace the Dell Racks with new hardware Subtitle: A win-win solution proposed by Physical Plant and ITS Once upon a time, back in 2013, two Dell racks full of compute nodes, sat noisily chewing away energy on the 5th floor of Science Tower. They drew in nicely cooled air from the floor spewing it out the back of the racks at 105-110 degrees (F). They were giving the three Liebert cooling towers a run for their BTUs. So much so that if one failed the De…

124

Anonymous (anonymous@undisclosed.example.com) — 2016-03-11T20:14:57+00:00

Back Queue tinymem supports BLCR --- Henk 2016/03/03 13:57 Adjust your PATH and LD_LIBRARY_PATH accordingly. BLCR So we need a day of down time to switch file server functionality from greentail to sharptail. It would be nice if nobody lost any computational progress. To do that, we need to learn to checkpoint at the application level. If a node crashes or power is lost, those applications can then restart the job from the last checkpoint.
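To make the idea of checkpoint and restart concrete, here is a minimal sketch of the pattern in plain Python. It is not BLCR (which checkpoints the whole process from outside); it only illustrates a program saving its own state periodically and resuming from the last save. The file name, the JSON state layout, and the `stop_at` argument used to simulate a crash are all assumptions made for this example.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then rename, so a crash never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic rename over the previous checkpoint

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}

def run(path, n_steps, every=100, stop_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps; `stop_at` simulates a crash."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated crash: work since the last checkpoint is lost
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run()` a second time with the same checkpoint path picks up at the last saved step instead of step 0, which is the behavior we want from jobs across the down time.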

125

Anonymous (anonymous@undisclosed.example.com) — 2014-02-26T20:32:29+00:00

Back Done! --- Meij, Henk 2014/02/21 09:54 Dell Racks Power Off Soon (Feb/2014), we'll have to power down the Dell Racks and grab one L6-30 circuit supplying power to those racks and use it to power up the new Microway servers. That leaves some spare L6-30 circuits (the Dell racks use 4 each), so we could contemplate grabbing two and powering up two more shelves of the Blue Sky Studio hardware. That would double the Hadoop cluster and the

126

Anonymous (anonymous@undisclosed.example.com) — 2025-11-22T16:08:15+00:00

Back Brief Guide to HPCC This page will be maintained and provide information to get users started using the compute cluster. It is a merger of the old “brief description” page and the “queue description” page. In General HPCC maintains and regularly updates an extensive software stack, including provisioning tools, resource management, file transfer clients, development tools, a variety of scientific libraries, a variety of compilers (e.g. gcc/g++, OneAPI) and communication libraries (e.g.,…

127

Anonymous (anonymous@undisclosed.example.com) — 2014-04-07T13:48:33+00:00

Back Virtual HPCC services Thoughts on how to create virtual compute nodes in the HPCC stack. Specifically, trying to solve the need for tiny, but many, compute nodes for the nano physic applications. Like virtual compute nodes with a single core CPU with 128

128

Anonymous (anonymous@undisclosed.example.com) — 2014-05-27T19:31:31+00:00

Back Milestone Switching to Openlava ... starting count again Here is a history line * jul 2007 40+ accounts deployed * jun 2008 100,000 job marker 70 * may 2009 200,000 job marker 76 * mar 2011 1,000,000 job marker 99 (dell) * nov 2012 500,000 job marker

129

Anonymous (anonymous@undisclosed.example.com) — 2014-06-18T14:04:32+00:00

Back Gaussian Checkpointing When you have one or more jobs running that rely on Gaussian's internal checkpoint mechanism, heavy read/write operations may result. That traffic should definitely not hit the /home file system but the /sanscratch file system. That scratch space is also NFS mounted over the Infiniband interconnects (via IPoIB). The result is that this file system's IO operations will also slow our file server down tremendously (even though /sanscratch is a 5 disk Raid 0 setup).

130

Anonymous (anonymous@undisclosed.example.com) — 2017-10-03T14:07:42+00:00

Back Jobs Pending Historic First some interesting progress graphs from our report to the provost. Report Total Jobs Submitted Just because I keep track =), 2 millionth milestone reached in July 2013. A picture of our total number of job slots availability and cumulative total of jobs processed.

131

Anonymous (anonymous@undisclosed.example.com) — 2015-04-14T14:16:10+00:00

Back For other years view: 2013 Queue Usage, 2012 Queue Usage, 2011 Queue Usage ... 2014 Queue Usage These data reflect the combined queues on the HP greentail cluster. 2014 HPC Cluster 2014 HPC Cluster (Subset, less than 1,000 pending jobs) 2014

132

Anonymous (anonymous@undisclosed.example.com) — 2014-08-11T14:08:49+00:00

Back LXC Linux Containers Ok, virtualization again. Trying this approach on a Dell PowerEdge 2950. * * * Starting with the latter. When you get the SElinux policy, create the *.te file then [root@petaltail ~]# vi lxc.te [root@petaltail ~]# semodule -l | grep lxc [root@petaltail ~]# checkmodule -M -m -o lxc.mod lxc.te checkmodul…

133

Anonymous (anonymous@undisclosed.example.com) — 2015-03-18T18:26:21+00:00

Back High Core Count - Low Memory Footprint I polled some folks with the problem described below to find a solution. Then ... We're on the cusp of a new era! Other solutions than the one described below * Amax 4U/288 cores * Microway 2U/144 cores

134

Anonymous (anonymous@undisclosed.example.com) — 2014-08-22T13:05:19+00:00

Back Slurm The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. The architecture is described here . * Installation

135

Anonymous (anonymous@undisclosed.example.com) — 2019-04-11T13:10:04+00:00

Back RSTORE Update The rstore0/2 access points will go into read only mode early 2019. These access points will be replaced by a similar but new platform rstore4/6. Each share owner will be contacted and content will be copied if needed (we have two copies of everything on the old platform so hopefully most of it can remain there). The new platform is

136

Anonymous (anonymous@undisclosed.example.com) — 2020-07-28T17:21:25+00:00

Back /home is defunct but remains for compatibility. It has been moved from sharptail to whitetail. New home directories are at /zfshomes. Although quotas are in place (starting at 1T for new accounts) users typically get what they need. Static content should eventually be migrated to our Rstore platform.

137

Anonymous (anonymous@undisclosed.example.com) — 2019-06-12T17:00:26+00:00

Back Submitting R2017+ Jobs Wesleyan has obtained a campus wide site license for Matlab since version 2017. Hence there is no need to check out a license and the queue matlab has been removed. You can use version 2017 and onward on all queues in an unlimited number of jobs. Your submit script should do something simple like

138

Anonymous (anonymous@undisclosed.example.com) — 2016-06-21T18:11:12+00:00

Back Expansion We will be bringing online 280 more physical cores (560 hyper threads) with Haswell-EP E2650v3 chips 2.3 GHz with a turbo boost speed of 3.0 GHz. That's an 85% increase in job slots. Yea. Final Round * ExxactCorp: Quantum IXR110-512N E5-2600 v3 family

139

Anonymous (anonymous@undisclosed.example.com) — 2018-08-16T12:58:12+00:00

Back Warewulf Stateless * Warewulf is a scalable systems management suite originally developed to manage large high-performance Linux clusters. My Project Kusu replacement since IBM bought up Platform LSF and ditched the hpccommunnity.org web site, grrrh. (old info at

140

Anonymous (anonymous@undisclosed.example.com) — 2015-12-04T18:53:20+00:00

Back 2015 Summer Expansion Fourteen Supermicro 1U servers were purchased each with dual 10 core processors. With hyper threading turned on that yields us 40 logical cores per 1U rack space or a total of 560 new logical cores. However, we maximized on cores and minimized our spending on memory. Each node has 32

141

Anonymous (anonymous@undisclosed.example.com) — 2016-01-13T18:05:21+00:00

Back For other years view: 2014 Queue Usage, 2013 Queue Usage, 2012 Queue Usage, 2011 Queue Usage ... 2015 Queue Usage 2015 Totals HPC Cluster 2015 before summer expansion HPC Cluster

142

Anonymous (anonymous@undisclosed.example.com) — 2020-02-27T13:59:03+00:00

Back Scratch Spaces We have different locations for scratch space. Some local to the nodes, some mounted across the network. Here is the current setup as of August 2019. * /localscratch * Local to each node, different sizes roughly around 50-80

143

Anonymous (anonymous@undisclosed.example.com) — 2015-12-10T20:36:08+00:00

Back Warewulf Statefull * Warewulf is a scalable systems management suite originally developed to manage large high-performance Linux clusters. So now that we can script stateless provisioning, we might also want to use stateful provisioning. That is, PXE boot the node once to format the drive and install kernel +

144

Anonymous (anonymous@undisclosed.example.com) — 2018-07-26T18:52:53+00:00

Back Warewulf Golden Image Also read these pages and this page will make more sense: Warewulf Stateless, Warewulf Statefull. For some time now I have been looking for a provisioning tool. I've tried along the way ... * Project Kusu, now defunct, but a great, simple template driven system. No fancy gui.

145

Anonymous (anonymous@undisclosed.example.com) — 2017-04-05T15:22:48+00:00

Back IPoIB Redoing our RHEL5.5 HP Proliant blade servers with CentOS 6.7 using Warewulf Golden Image provisioning. Not quite there yet, but I'll document here how Infiniband was installed. These compute nodes are connected to a Voltaire interconnect, and are aging quite a bit.

146

Anonymous (anonymous@undisclosed.example.com) — 2017-08-29T13:36:13+00:00

Back Openlava 3.1.2 Build process, switching to git approach. Prerequisites (for rpm.sh) * yum install git * yum install rpm-build * yum install rpmdevtools * yum install tcl tcl-devel * yum install ncurses ncurses-devel * yum install automake libtool

147

Anonymous (anonymous@undisclosed.example.com) — 2020-02-27T18:06:48+00:00

Back BLCR Checkpoint in OL3 Deprecated since we did OS upgrades OS Update We will install DMTCP as a replacement...DMTCP --- Henk 2020/01/14 14:28 * This page concerns SERIAL jobs only; SERIAL jobs can restart on any node * Installation and what it does BLCR *

148

Anonymous (anonymous@undisclosed.example.com) — 2020-01-24T18:36:49+00:00

Back BLCR Checkpoint in OL3 Deprecated since we did OS Update We will replace it with DMTCP --- Henk 2020/01/14 14:31 * This page concerns PARALLEL mpirun jobs only; there are some restrictions * all MPI threads need to be confined to one node * restarted jobs must use the same node (not sure why)

149

Anonymous (anonymous@undisclosed.example.com) — 2016-12-06T20:13:17+00:00

Back The “information dive” into enterprise storage was an educational one. This write up is more for my note taking so I can keep track of and recall things. The Storage Problem In a commodity HPC setup deploying plain NFS, bottlenecks can develop. Then the compute nodes hang and a cold reboot of the entire HPCC is needed. NFS clients on a compute node may contact NFS daemons on our file server sharptail and ask for say a file. The NFS daemon assigned the task then locates the content via …

150

Anonymous (anonymous@undisclosed.example.com) — 2016-11-29T19:36:29+00:00

Back Rsync Daemon/Rsnapshot The Problem Trying to offload heavy read/write traffic from our file server. I also did a deep information dive to assess if we could afford enterprise level storage. That answer basically means a $42K layout at the low end and up to $70K for the high end. I've detailed the result here

151

Anonymous (anonymous@undisclosed.example.com) — 2016-12-06T20:14:03+00:00

Back beeGFS A document for me to recall and make notes of what I read in the manual pages and what needs testing. Basically during the Summer of 2016 I investigated if the HPCC could afford enterprise level storage. I wanted 99.999% uptime, snapshots, high availability and other goodies such as parallel NFS. Netapp came the closest but, eh, still at $42K lots of other options show up. That story is detailed at

152

Anonymous (anonymous@undisclosed.example.com) — 2017-01-27T15:36:15+00:00

Back For other years view: 2015 Queue Usage, 2014 Queue Usage, 2013 Queue Usage, 2012 Queue Usage, 2011 Queue Usage ... 2016 Queue Usage 2016 Totals HPC Cluster [root@cottontail ~]# grep ^2016 /share/apps/logs/bjobs_done.log | tail date,lifetime_total,daily_total 20161222,3014237,17 20161223,3014389,152 20161224,3014460,71 20161225,3014512,52 20161226,3014570,58 20161227,3014595,25 20161228,3014627,32 20161229,3014652,25 20161230,3014677,25 20161231,3014682,5
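The log columns above are self-checking: each day's daily_total should equal the change in lifetime_total since the previous day. A small sketch that verifies this on the ten rows shown, assuming the file is plain CSV with the three columns named in the excerpt (the function names are just for illustration):

```python
import csv
import io

# The ten rows printed above, with the column header as given.
LOG_TAIL = """date,lifetime_total,daily_total
20161222,3014237,17
20161223,3014389,152
20161224,3014460,71
20161225,3014512,52
20161226,3014570,58
20161227,3014595,25
20161228,3014627,32
20161229,3014652,25
20161230,3014677,25
20161231,3014682,5
"""

def read_rows(text):
    """Parse CSV text into (date, lifetime_total, daily_total) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["date"], int(r["lifetime_total"]), int(r["daily_total"])) for r in reader]

def inconsistent_days(rows):
    """Return dates whose daily_total differs from the change in lifetime_total."""
    bad = []
    for (_, prev_total, _), (date, total, daily) in zip(rows, rows[1:]):
        if total - prev_total != daily:
            bad.append(date)
    return bad

rows = read_rows(LOG_TAIL)
print("inconsistent days:", inconsistent_days(rows))
print("jobs over the last nine days:", rows[-1][1] - rows[0][1])
```

The same check could be pointed at the full log to spot days where the counter was reset or a cron run was missed.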

153

Anonymous (anonymous@undisclosed.example.com) — 2017-12-06T15:34:52+00:00

Back Due Jan 30, 2018, totally refocused on network, killing the ideas on this page --- Henk 2017/12/06 08:50 NSF CC* * Create a $1 Million+ CC* proposal to meet the research, staff and cyberinfrastructure * Needs/Wants of small, primarily undergraduate, northeast Higher Ed institutions

154

Anonymous (anonymous@undisclosed.example.com) — 2018-08-17T12:48:59+00:00

Back OpenHPC page 1 * install vanilla CentOS 7.2 on master * find Install_guide-CentOS7.2-SLURM-1.2.1-x86_64.pdf recipe guide on * turn selinux off * next switch to iptables [root@ohpc0-test ~]# systemctl disable NetworkManager [root@ohpc0-test ~]# systemctl disable firewalld [root@ohpc0-test ~]# yum install iptables-services -y [root@ohpc0-test ~]# systemctl enable iptables [root@ohpc…

155

Anonymous (anonymous@undisclosed.example.com) — 2017-04-05T12:35:03+00:00

Back OpenHPC page 2 Additional tools for the OpenHPC environment. First add these two lines to SMS and all compute nodes. Patch CHROOT as well. * /etc/security/limits.conf # added for RLIMIT_MEMLOCK warnings with libibverbs -hmeij * soft memlock unlimited * hard memlock unlimited

156

Anonymous (anonymous@undisclosed.example.com) — 2017-04-05T14:42:53+00:00

Back OpenHPC page 3 Tools yum -y groupinstall ohpc-autotools yum -y install valgrind-ohpc yum -y install EasyBuild-ohpc yum -y install spack-ohpc yum -y install R_base-ohpc * “Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs

157

Anonymous (anonymous@undisclosed.example.com) — 2017-04-06T19:31:59+00:00

Back Centralize SSH Key Management Let's assume we have 3 colleges (CollegeA, CollegeB, CollegeC) and we write a grant proposal and each institution will do something unique science-wise. Grant gets funded and specialized hardware or software gets deployed at each college (for maybe brain scan analyses, deep learning, and engineering).

158

Anonymous (anonymous@undisclosed.example.com) — 2017-03-29T13:18:08+00:00

Back Openlava Elim Pulling some information together for Openlava users documenting the ability to write your own monitor resources which the scheduler will then manage. This is handy so jobs go in PENDing mode while waiting for custom resources to become available instead of crashing immediately upon submission.
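As a rough sketch of what such a monitor can look like: an elim is a program that periodically writes one line to standard output of the form "count name1 value1 name2 value2 ...". That line format is my understanding of the LSF-style elim convention that Openlava inherits; the resource name `scratchfree`, the `/tmp` path, and the helper names below are hypothetical, and the resource would also have to be declared in the scheduler configuration before it is honored.

```python
import shutil
import time

def scratch_free_gb(path="/tmp"):
    """Free space on `path` in whole GB (the path is an assumption for this sketch)."""
    return shutil.disk_usage(path).free // 2**30

def elim_line(resources):
    """Format resources as '<count> <name> <value> ...', one report per line."""
    parts = [str(len(resources))]
    for name, value in resources.items():
        parts += [name, str(value)]
    return " ".join(parts)

def report(cycles=1, interval=0):
    """Emit `cycles` reports, sleeping `interval` seconds between them."""
    for i in range(cycles):
        print(elim_line({"scratchfree": scratch_free_gb()}), flush=True)
        if i + 1 < cycles:
            time.sleep(interval)

report()  # a single report; a real elim would keep looping for as long as it runs
```

A job could then ask for the resource at submit time, and the scheduler would hold it in PEND until a host reports enough of it.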

159

Anonymous (anonymous@undisclosed.example.com) — 2017-03-29T19:39:34+00:00

Back HPC Survey 2017 “High-Performance Computing,” or HPC, is the application of “compute nodes” to computational problems that are either too large for standard computers or would take too long individually. HPC typically consists of a system manager server (SMS, also known as login node or master node, or all combined) and compute nodes. HPC designs may differ but frequently offer high speed networks, large home directories, scratch space, archive space and a job scheduler. A provision applic…

160

Anonymous (anonymous@undisclosed.example.com) — 2017-05-31T15:07:49+00:00

Back OpenHPC page 4 ib0 Using Infiniband for MPI traffic involves somewhat more configuration. So from the ground up (v1.3 documentation) we start with installing the packages needed on CHROOT. Be sure to follow the recipe and install these on SMS too.

161

Anonymous (anonymous@undisclosed.example.com) — 2020-07-16T17:16:23+00:00

Back lammps-11Aug17 lammps-11Aug17 (n78) and now lammps-22Aug18 (n33-n37) and now lammps-5Jun19 (microway) Update: * n78/gtx1080 lammps 11aug17 (centos7, mpich3/mpic++, cuda 8/sm_61, /usr/local) * GTX 1080 Ti * n37/k20 lammps 22aug18 (centos7, openmpi 1.8.4/mpic++, cuda 9.2/sm_35, /usr/local)

162

Anonymous (anonymous@undisclosed.example.com) — 2017-09-06T18:36:03+00:00

Back * Jobs can be submitted from any node. * cottontail is the primary scheduler login node. * You can log in to any tail node directly (via ssh). * All nodes are CentOS 6.x with some exceptions noted. Test Queue * Wall time (CPULIMIT) has been removed (was 8 hrs/job)

163

Anonymous (anonymous@undisclosed.example.com) — 2019-05-28T17:13:19+00:00

Back Cacti Monitor Cacti server has died, need to build new zenoss/cacti server summer 2019 --- Henk 2019/05/28 13:11 ZenOSS went on the blink last month, very old install. So I tried Zabbix but not happy with that. Went back to old friend Cacti, very easy install.

164

Anonymous (anonymous@undisclosed.example.com) — 2018-09-21T11:59:30+00:00

Back GTX 1080 Ti
GPU: GTX 1080 Ti
Transistor Count: 12 billion
Nvidia Cores: 3,584
FP32 (single precision) Teraflops: 11.4
FP64 (double precision) Teraflops: 0.355
Memory Capacity: 11 GB
Power: 250 Watt
Price: $700
Maximum GPU Temperature: 91 (in C)
We have Enterprise Level Tesla K20 GPU compute nodes (graphics processing units). Four per node for a total of 20 K20s capable of roughly 23 total Teraflops (floating point, double precision). $100K in 2013.
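The figures quoted above can be put side by side with a little arithmetic. This is a rough sketch only: it assumes the $100K covers the whole K20 purchase (the text does not say whether that is GPUs alone or complete servers), and the variable names are mine.

```python
# Figures quoted in the text above.
K20_COUNT = 20               # "a total of 20 K20s"
K20_TOTAL_FP64_TFLOPS = 23   # "roughly 23 total Teraflops", double precision
K20_TOTAL_COST = 100_000     # "$100K in 2013" (assumed to be the whole purchase)

GTX_FP64_TFLOPS = 0.355      # per GTX 1080 Ti card
GTX_FP32_TFLOPS = 11.4       # per GTX 1080 Ti card
GTX_PRICE = 700              # per GTX 1080 Ti card, in dollars

# Double precision throughput of a single K20
k20_fp64_each = K20_TOTAL_FP64_TFLOPS / K20_COUNT

# Dollars per teraflop for each option
k20_cost_per_fp64_tflop = K20_TOTAL_COST / K20_TOTAL_FP64_TFLOPS
gtx_cost_per_fp64_tflop = GTX_PRICE / GTX_FP64_TFLOPS
gtx_cost_per_fp32_tflop = GTX_PRICE / GTX_FP32_TFLOPS

# GTX 1080 Ti cards needed to match the K20 pool in double precision
gtx_cards_to_match_fp64 = K20_TOTAL_FP64_TFLOPS / GTX_FP64_TFLOPS

print(f"K20 FP64 per card: {k20_fp64_each:.2f} TFLOPS")
print(f"K20 $/FP64 TFLOP:  {k20_cost_per_fp64_tflop:,.0f}")
print(f"GTX $/FP64 TFLOP:  {gtx_cost_per_fp64_tflop:,.0f}")
print(f"GTX $/FP32 TFLOP:  {gtx_cost_per_fp32_tflop:,.0f}")
print(f"GTX cards to match K20 FP64: {gtx_cards_to_match_fp64:.0f}")
```

The comparison only covers peak numbers and list price. It says nothing about memory, ECC, warranty or how a given application actually performs.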

165

Anonymous (anonymous@undisclosed.example.com) — 2018-01-16T15:16:11+00:00

Back For other years view: 2016 Queue Usage, 2015 Queue Usage, 2014 Queue Usage, 2013 Queue Usage, 2012 Queue Usage, 2011 Queue Usage ... 2017 Queue Usage 2017 Totals HPC Cluster [root@cottontail ~]# grep ^2017 /share/apps/logs/bjobs_done.log | tail date,lifetime_total,daily_total 20171222,3141784,43 20171223,3141806,22 20171224,3141816,10 20171225,3141822,6 20171226,3141852,30 20171227,3141893,41 20171228,3141910,17 20171229,3141940,30 20171230,3141966,26 20171231,31…
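The log rows shown above can be sanity checked with a few lines of Python. A minimal sketch, assuming the file really is comma separated in the order date, lifetime_total, daily_total (as the header row in the listing suggests); the function names are made up, and the truncated last row is left out.

```python
import csv
import io

# Rows copied from the listing above; the truncated final row is omitted.
SAMPLE = """\
date,lifetime_total,daily_total
20171222,3141784,43
20171223,3141806,22
20171224,3141816,10
20171225,3141822,6
20171226,3141852,30
20171227,3141893,41
20171228,3141910,17
20171229,3141940,30
20171230,3141966,26
"""


def load_rows(text):
    """Parse the CSV text into (date, lifetime_total, daily_total) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (r["date"], int(r["lifetime_total"]), int(r["daily_total"]))
        for r in reader
    ]


def inconsistent_days(rows):
    """Return dates whose daily_total differs from the change in lifetime_total."""
    bad = []
    for (_, prev_total, _), (date, total, daily) in zip(rows, rows[1:]):
        if total - prev_total != daily:
            bad.append(date)
    return bad


rows = load_rows(SAMPLE)
print("jobs finished in window:", sum(daily for _, _, daily in rows))
print("inconsistent days:", inconsistent_days(rows))
```

To run this against the real file, replace `SAMPLE` with the output of the `grep ^2017` command and prepend the header line if the file does not carry one.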

166

Anonymous (anonymous@undisclosed.example.com) — 2018-06-27T11:51:00+00:00

Back HPC Users Meeting * Brief history * 2006 swallowtail (Dell PE1955, Infiniband, imw, emw) * 2010 greentail (HP gen6 blade servers, hp12) * 2013 sharptail (Microway storage, K20s, Infiniband, mw256/mwgpu) * 2014 mw256fd (Dell 2006 replacement with Supermicro nodes)

167

Anonymous (anonymous@undisclosed.example.com) — 2018-08-01T14:11:24+00:00

Back CPU vs GPU The question was raised: what does our usage look like between CPU and GPU devices? I have no idea what the appropriate metrics would be, but let's start by comparing the hardware deployed. We'll also need to make some assumptions

168

Anonymous (anonymous@undisclosed.example.com) — 2018-09-23T13:53:54+00:00

Back 2018 GPU Expansion Important notes ... about GeForce GTX1080Ti ☎ From Nvidia web site: Warranted Product is intended for consumer end user purposes only, and is not intended for datacenter use and/or GPU cluster commercial deployments (“Enterprise Use”). Any use of Warranted Product for Enterprise Use shall void this warranty.

169

Anonymous (anonymous@undisclosed.example.com) — 2020-11-05T19:31:24+00:00

170

Anonymous (anonymous@undisclosed.example.com) — 2019-03-18T18:11:56+00:00

Back --- Henk 2019/03/18 13:58 Note to self * Host ohpc0-test + n29 + n31 did form a tiny openhpc/slurm/ww test cluster, redone -slurm * Host sharptail2 acts as a Centos7 Warewulf server (host petaltail is the Centos6 warewulf server) OpenHPC 1.3.1 Consult these pages for my earlier testing of OpenHPC. I simply copy&paste my way through these pages while consulting the recipe PDF for CentOS7.5 plus Warewulf. Any changes are logged on this page.

171

Anonymous (anonymous@undisclosed.example.com) — 2018-08-20T13:55:08+00:00

Back Warewulf Golden Image Build an OpenHPC provisioning server using the Warewulf/Slurm recipe CentOS 7.5 x86_64. Described at local page OpenHPC 1.3.1 and web site . Make sure stateless works. We have a standalone Warewulf 3.6.99 provisioning server on CentOS 6.10 with golden images so we can fall back if necessary.

172

Anonymous (anonymous@undisclosed.example.com) — 2020-07-15T17:52:51+00:00

Back K20 Redo In 2013 we bought five servers each with 4 K20 GPUs inside. Since then they have been used but not maintained. Since we have newer GPUs (consult page GTX 1080 Ti) usage has dropped off somewhat. So I'm taking the opportunity to redo them using latest Nvidia, CentOS and application software. After all, it provides 23 teraflops GPU compute capacity (dpfp).

173

Anonymous (anonymous@undisclosed.example.com) — 2019-03-25T12:04:15+00:00

Back K20 Redo Usage One node, n37, was redone with the latest Nvidia CUDA drivers during summer 2018. Please test it out before we decide to redo all of them. It is running CentOS 7.5 and I'm interested to see if programs compiled under 6.x or 5.x break.

174

Anonymous (anonymous@undisclosed.example.com) — 2018-08-28T12:38:22+00:00

Back SQL on GPU MapD built the first ever open source SQL engine to harness GPU computing for analytics. Designed for maximum performance, the MapD SQL engine dynamically compiles SQL to run across multiple GPUs and CPUs. Massively parallel database servers.

175

Anonymous (anonymous@undisclosed.example.com) — 2018-11-29T18:00:30+00:00

Back As of --- Henk 2018/10/08 08:56 The P100 with 12 GB is end of life, replaced by the P100 16 GB or V100 and The GTX 1080Ti will be replaced by the GTX 2080 (no specs yet and not certified for Amber18, yet) As of --- Henk 2018/11/29 12:55 New GROMACS performance benchmarks featuring 2x and 4x NVIDIA RTX 2080 GPUs are now available (GTX too). The RTX 2080 graphics card utilizes the new NVIDIA Turing GPU architecture and provides up to 6x the performance of the previous generation. (E…

176

Anonymous (anonymous@undisclosed.example.com) — 2019-03-06T19:29:11+00:00

Back HPC Power As part of the reevaluation and overhaul of our data center cooling capacity, we need to get a handle on non-emergency power consumption in the data center. This will be done by a third party consultant by clamping power cables in the penthouse of Exley. So I bought myself a metered PDU and have been busy plugging entire racks into it one at a time. I then measure

177

Anonymous (anonymous@undisclosed.example.com) — 2019-03-08T14:48:40+00:00

Back For other years view: 2017, 2016, 2015, 2014, 2013, 2012, 2011 ... 2018 Queue Usage 2018 Totals HPC Cluster Back

178

Anonymous (anonymous@undisclosed.example.com) — 2023-09-04T20:46:36+00:00

Back New design Exley Science Center – 265 Church St. – Data Center HVAC Replacement Replace computer room air conditioning units, replace rooftop condensing units, install new economizer and pump package, install new drop ceiling and lighting, install hot and cold aisle containment system. New system will replace a ~40-year-old system that is critical for the campus IT infrastructure.

179

Anonymous (anonymous@undisclosed.example.com) — 2019-06-28T13:43:22+00:00

Back GPU Allocation Problems GPUs predate our Openlava software stack and need to be integrated into the scheduler as resources. This, along with other issues, has raised some scheduler allocation problems detailed on this page. A problem arose when we bought node

180

Anonymous (anonymous@undisclosed.example.com) — 2019-07-31T18:56:08+00:00

Back OpenStructure Open-Source Computational Structural Biology Framework. “This project aims to provide an open-source, modular, flexible, molecular modelling and visualization environment. It is targeted at interested method developers in the field of structural bioinformatics.

181

Anonymous (anonymous@undisclosed.example.com) — 2019-08-13T12:15:33+00:00

Back 2019 GPU Models We do not do AI (yet). Our GPU usage pattern is mostly one job per GPU for exclusive access. So no NVLink requirements, PCI connections are sufficient. The application list is Amber, Gromacs, Lammps and some python biosequencing packages. Our current per GPU memory footprint is 8

182

Anonymous (anonymous@undisclosed.example.com) — 2019-12-13T13:33:09+00:00

Back P100 vs RTX 6000 & T4 The specifications of these GPU models are detailed at this page 2019 GPU Models This page will mimic the work done on this page in 2018 P100 vs GTX & K20 Credits: This work was made possible, in part, through HPC time donated by Microway, Inc. We gratefully acknowledge Microway for providing access to their GPU-accelerated compute cluster.

183

Anonymous (anonymous@undisclosed.example.com) — 2024-10-15T19:17:13+00:00

Back We have moved away from Zenoss, it was getting too old and throwing false alerts. It relies on SNMP and we wanted to go agent based. For speed of installation we first installed Ganglia (not developed anymore but an awesome package based tool). Then we added Zabbix for completeness. Details at

184

Anonymous (anonymous@undisclosed.example.com) — 2020-01-03T13:22:57+00:00

Back Turing/Volta/Pascal * AWS deploys T4 * Look at this, the smallest Elastic Compute Cloud instance is g4dn.xlarge yielding access to 4 vCPUs, 16 GiB memory and 1x T4 GPU. The largest is g4dn.16xlarge yielding access to 64 vCPUs, 256 GiB memory and 1x T4 GPU. Now the smallest is priced at $0.526/hr, and running that card 2…

185

Anonymous (anonymous@undisclosed.example.com) — 2020-02-27T18:05:22+00:00

Back OS Update Keeping track of operating system updates and quirks. n1 CentOS release 6.10 (Final) Linux n1 2.6.32-754.18.2.el6.x86_64 #1 SMP Wed Aug 14 16:26:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux n2 kaput n3 CentOS release 6.10 (Final) Linux n3 2.6.32-754.18.2.el6.x86_64 #1 SMP Wed Aug 14 16:26:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux n4 CentOS release 6.10 (Final) Linux n4 2.6.32-754.22.1.el6.x86_64 #1 SMP Tue Sep 17 16:24:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux n5 CentOS relea…

186

Anonymous (anonymous@undisclosed.example.com) — 2019-12-13T13:36:30+00:00

Back Solution: TrueNAS, ZFS, 190T usable, RaidZ2 6 spares, read cache, 800G write cache, self healing, snapshots, compression on, deduplication off, encryption off, dual controllers (high availability), 64G ram, 6x 1Gbe RJ45, SAS drives (not SATA), three year warranty, ssh access

187

Anonymous (anonymous@undisclosed.example.com) — 2020-08-17T12:01:03+00:00

Back Slurm links: * * * * Other useful links. * * * scheduler wrapper, inside con…

188

Anonymous (anonymous@undisclosed.example.com) — 2019-12-16T14:56:19+00:00

Back For other years view: 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011 ... 2019 Queue Usage 2019 Totals HPC Cluster Back

189

Anonymous (anonymous@undisclosed.example.com) — 2024-02-12T16:47:42+00:00

Back Structure and History of HPCC As promised at the CLAC HPC Mindshare event at Swarthmore College Jan 2020. Here is the Funding and Priority Policies with some context around it. Questions/Comments welcome. History In 2006, 4 Wesleyan faculty members approached ITS with a proposal to centrally manage a high performance computing center (HPCC) seeding the effort with an NSF grant (about $190K, two racks full of Dell PE1950, a total of 256 physical cpu cores on Infiniband). ITS offered 0.5 …

190

Anonymous (anonymous@undisclosed.example.com) — 2020-09-28T11:38:31+00:00

Back DMTCP * * DMTCP (Distributed MultiThreaded Checkpointing) DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk a distributed computation. It works under Linux, with no modifications to the Linux kernel nor to the application binaries. It can be used by unprivileged users (no root privilege needed). One can later restart from a checkpoint, or even migrate the processes by moving the checkpoint files to another host prior …

191

Anonymous (anonymous@undisclosed.example.com) — 2020-01-24T21:36:02+00:00

Back NewsBytes for Jan 2020 2019 Queue Usage 2019 dedicated monitoring and alerting server Zenoss 2020 upcoming changes and updates Tuesday's (1/21) power outage removed BLCR's kernel modules from the compute nodes kernels. If you need to do checkpointing the new tool is Distributed MultiThreaded Checkpointing (DMTCP). Details on how to use DMTCP can be found here

192

Anonymous (anonymous@undisclosed.example.com) — 2022-03-08T18:29:20+00:00

Back EXX96 A page for me on how these 12 nodes were built up after they arrived. To make them “ala n37” which was the test node in redoing our K20 nodes, see K20 Redo and K20 Redo Usage. Page best followed bottom to top if interested in the whole process. The Usage section below is for HPCC users wanting to use queue

193

Anonymous (anonymous@undisclosed.example.com) — 2024-09-17T16:51:57+00:00

Back Docker Containers Usage Page built up from bottom to top. We're not making a traditional “MPI” docker integration with our scheduler. We'll see what usage patterns emerge and go from there. I can help with workflow. If more containers are desired please let me know which ones to

194

Anonymous (anonymous@undisclosed.example.com) — 2025-08-05T13:00:24+00:00

Back TrueNAS/ZFS x20ha Notes. Mainly for me but might be useful/of interest to users. Message: Our current file server is sharptail.wesleyan.edu which serves out home directories (/home, 10T). A new file server hpcstore.wesleyan.edu will be deployed taking over this function (/zfshomes, 190T). This notice is to inform you your home directory has been cut over.

195

Anonymous (anonymous@undisclosed.example.com) — 2020-07-25T18:09:09+00:00

Back Comments Here are some useful comments from lists/vendors etc * LAMMPS uses a hybrid OpenMP/MPI model. If you don't set the number of OpenMP threads (ompthreads or OMP_NUM_THREADS) explicitly, it will likely take the number of CPU cores (ncpus) as its default value and you will end up with having too many OpenMP threads and MPI processes on a physical core. You can see this by logging in to the compute node and do
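The oversubscription described in this comment comes down to ranks times threads exceeding the physical cores. A small sketch of how to pick a safe thread count before launching; the function names are hypothetical and the core and rank counts in the usage line are only examples, not our node specs.

```python
import os


def threads_per_rank(physical_cores, mpi_ranks):
    """Largest OpenMP thread count per MPI rank that does not oversubscribe cores."""
    if physical_cores < 1 or mpi_ranks < 1:
        raise ValueError("core and rank counts must be positive")
    if mpi_ranks > physical_cores:
        raise ValueError("more MPI ranks than physical cores")
    return physical_cores // mpi_ranks


def launch_env(physical_cores, mpi_ranks, base_env=None):
    """Copy an environment and set OMP_NUM_THREADS explicitly for the job."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads_per_rank(physical_cores, mpi_ranks))
    return env


if __name__ == "__main__":
    # Example only: 24 physical cores shared by 8 MPI ranks
    env = launch_env(24, 8)
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
```

The returned dictionary can be handed to `subprocess.run(..., env=env)` when starting the MPI launcher. In a job script the equivalent is simply exporting OMP_NUM_THREADS before the launch line.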

196

Anonymous (anonymous@undisclosed.example.com) — 2020-11-10T14:43:57+00:00

Back Netdata We use Zenoss for monitoring and alerting across the whole HPC. The page can be found here: Zenoss. At the PEARC20 conference I became aware of Netdata, which seems a good tool for our “tails” (login and storage servers, for example). Lots of detailed information. bash <(curl -Ss https://my-netdata.io/kickstart.sh)

197

Anonymous (anonymous@undisclosed.example.com) — 2020-08-27T14:39:03+00:00

Back XFS quotas In XFS you first enable quotas on the mountpoint (you add the options to /etc/fstab and remount) # user and group quotas example: /dev/mapper/VolGroup00-lvhome /home xfs defaults,usrquota,grpquota 0 1 # user and project quotas example: /dev/sdb1 /mindstore xfs defaults,uquota,pquota 1 2
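Before remounting, it can help to double check which quota types an fstab entry actually enables. A minimal sketch that reads the two example lines above; the function name `quota_types` is made up, and the option table lists the XFS quota mount options as I know them from the xfs man page.

```python
# XFS quota mount options mapped to the quota type they enable
QUOTA_OPTIONS = {
    "quota": "user", "uquota": "user", "usrquota": "user",
    "gquota": "group", "grpquota": "group",
    "pquota": "project", "prjquota": "project",
}


def quota_types(fstab_line):
    """Return the set of quota types enabled by one XFS fstab entry."""
    fields = fstab_line.split()
    # Skip comments, malformed lines and non-XFS filesystems
    if len(fields) < 4 or fields[0].startswith("#") or fields[2] != "xfs":
        return set()
    options = fields[3].split(",")
    return {QUOTA_OPTIONS[opt] for opt in options if opt in QUOTA_OPTIONS}


EXAMPLES = [
    "/dev/mapper/VolGroup00-lvhome /home xfs defaults,usrquota,grpquota 0 1",
    "/dev/sdb1 /mindstore xfs defaults,uquota,pquota 1 2",
]

for line in EXAMPLES:
    mountpoint = line.split()[1]
    print(mountpoint, sorted(quota_types(line)))
```

This only inspects the fstab text. Whether quotas are really active after the remount still has to be confirmed on the mounted filesystem itself.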

198

Anonymous (anonymous@undisclosed.example.com) — 2020-12-03T13:09:59+00:00

Back GPU checkpoint/restart Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with DMTCP for serial and parallel jobs (including multi-host). But the CPU/GPU environment adds much complexity. A good

199

Anonymous (anonymous@undisclosed.example.com) — 2020-12-11T19:51:47+00:00

Back ERN 2020 Powerpoint presentation by Karen Warren for Eastern Regional Network describing HPC history and current bottleneck issues. Back

200

Anonymous (anonymous@undisclosed.example.com) — 2021-02-18T18:33:08+00:00

Back Update --- Henk 2021/02/12 14:27 ---------- For CUDA_ARCH (or nvcc -arch) versions check this Matching CUDA arch and CUDA gencode for various NVIDIA architectures web page. “When you compile CUDA code, you should always compile only one ‘-arch‘ flag that matches your most used GPU cards. This will enable faster runtime, because code generation will occur during compilation.

201

Anonymous (anonymous@undisclosed.example.com) — 2020-12-26T15:33:17+00:00

Back DMZ with DTN While attending the 2020 Eastern Regional Network conference (view Karen's slides presented at ERN) an idea surfaced around 10G network. If we deploy Cottontail2 and migrate onto 10G network speeds what if we tried for a Science DMZ with a Data Transfer Node (click on Architecture, left side, scroll down for simple setup) with a

202

Anonymous (anonymous@undisclosed.example.com) — 2021-10-20T19:01:55+00:00

Back For other years view: 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011 ... 2020 Queue Usage 2020 Totals HPC Cluster date,total,pending,running, 07/13/20_07:00,total,82192,770, 07/13/20_07:30,total,166156,796, ... 07/15/20_17:30,total,145188,1549, 07/16/20_07:30,total,137694,1197, 07/16/20_08:00,total,137124,1161, 07/16/20_08:30,total,136542,1209,
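The half-hourly samples above are easy to summarize. A sketch that finds the peak pending and running counts, assuming each row is timestamp, the literal label total, pending jobs, running jobs, followed by a trailing comma; only the rows visible above are used and the elided ones are not guessed at.

```python
from datetime import datetime

# Rows copied from the listing above (the "..." gap is left out).
SAMPLE_2020 = """\
07/13/20_07:00,total,82192,770,
07/13/20_07:30,total,166156,796,
07/15/20_17:30,total,145188,1549,
07/16/20_07:30,total,137694,1197,
07/16/20_08:00,total,137124,1161,
07/16/20_08:30,total,136542,1209,
"""


def parse_samples(text):
    """Turn log lines into (timestamp, pending, running) tuples."""
    samples = []
    for line in text.splitlines():
        fields = line.rstrip(",").split(",")
        # Skip the header row and anything that is not a 'total' sample
        if len(fields) != 4 or fields[0] == "date" or fields[1] != "total":
            continue
        stamp = datetime.strptime(fields[0], "%m/%d/%y_%H:%M")
        samples.append((stamp, int(fields[2]), int(fields[3])))
    return samples


samples = parse_samples(SAMPLE_2020)
peak_pending = max(samples, key=lambda s: s[1])
peak_running = max(samples, key=lambda s: s[2])
print("peak pending:", peak_pending[1], "at", peak_pending[0])
print("peak running:", peak_running[2], "at", peak_running[0])
```

The same tuples can feed a plot of running versus pending jobs over the year.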

203

Anonymous (anonymous@undisclosed.example.com) — 2021-03-10T19:46:28+00:00

Back ICC vs ICX Following an inquiry on the XSEDE list about the differences between ICC and ICX I was informed that “Parallel Studio Cluster Edition” has become OneAPI, free-to-use. So I followed up on that and found a CentOS 7 server to do a local install for testing. We currently run icc/ifort 2016 version and it is time for a new set of compilers. Now there is also lots of discussions about

204

Anonymous (anonymous@undisclosed.example.com) — 2023-09-15T19:12:47+00:00

Back Lammps: MAKE or CMAKE Using make and compiling libquip.a into the lammps binary generates an error like error 1 in 'geryon/nvd_kernel.h' in line 364 when package gpu tries to set up the runtime env for a lammps gpu job. This problem disappears when compiling with

205

Anonymous (anonymous@undisclosed.example.com) — 2021-05-24T11:59:51+00:00

Back XFS panic We run a lot of XFS storage arrays with hardware raid controllers (Areca, MegaRAID). Rsync is used to pull content from active server to standby server in a continuous loop. Usually something like this happens; a disk fails, a hot spare deploys, array rebuilds parity, failed disk gets replaced, new hot spare is created. All is well.

206

Anonymous (anonymous@undisclosed.example.com) — 2021-07-08T14:18:19+00:00

Back Python to C When installing via miniconda3 the web site states “Due to technical limitations, the conda package does not support GPUs at the moment. If you want to use a GPU, you have to build galario by hand.” A compilation by hand yields two standalone libraries and presumably GPU functionality. There is an example of an invocation using

207

Anonymous (anonymous@undisclosed.example.com) — 2023-10-27T18:47:59+00:00

Back Make sure munge/unmunge work between 1.3/2.4, that date is in sync (else you get error #16) Slurm Test Env Getting a head start on our new login node plus two cpu+gpu compute node project. Hardware has been purchased but there is long delivery time. Meanwhile it makes sense to setup a standalone Slurm scheduler and do some testing and have as a backup. Slurm will be running on

208

Anonymous (anonymous@undisclosed.example.com) — 2022-11-02T17:28:32+00:00

Back Slurm Test Env There is a techie page at this location Slurm Techie Page for those of you who are interested in the setup. This page is intended for users to get started with the Slurm scheduler. greentail52 will be the slurm scheduler test “controller” with several cpu+gpu compute nodes configured. Any jobs submitted should be simple, quick running jobs, like a

209

Anonymous (anonymous@undisclosed.example.com) — 2022-04-10T19:49:19+00:00

Back EasyBuild EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way. EasyBuild 4.4.2 supports 2469 different software packages (incl. toolchains, bundles):

210

Anonymous (anonymous@undisclosed.example.com) — 2022-01-03T15:14:39+00:00

Back For other years view: 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011 ... 2021 Queue Usage 2021 Totals HPC Cluster /* import total into work.one */ /* import total into work.one */ data two; set one; if var2 eq 'total' and var3<50000; run; title1 j=c 'Running (green line) versus Pending (red line)Jobs'; title2 j=c 'Queue: All (max is 2056)' and pending < 50,000; title3 j=c 'Covers time period 01 JAN 2021 to 31 DEC 2021'; axis1 label=("…

211

Anonymous (anonymous@undisclosed.example.com) — 2022-03-01T16:50:38+00:00

Back DMTCP CRAC This is a new DMTCP plugin to checkpoint-restart CUDA applications with a novel split-process architecture. * * CRAC consists of the plugin on top of DMTCP. This software runs in the original directory. Compilation needs gcc

212

Anonymous (anonymous@undisclosed.example.com) — 2022-03-02T18:40:49+00:00

Back HPC SDK * * v 22.2 * /share/apps/CENTOS7/nvidia/hpc_sdk/22.2 The NVIDIA Software Development Kit (SDK) Manager is an all-in-one tool that bundles developer software and provides an end-to-end development environment setup solution for NVIDIA SDK. Think

213

Anonymous (anonymous@undisclosed.example.com) — 2026-02-19T20:54:26+00:00

Back New Head Node We're embarking on a transition to a new head/login node named cottontail2. This server will be running the Rocky 8 operating system. Early design ideas can be found at Cottontail2, all pre-pandemic. We are staying with a 1G ethernet network as we could not find 10G switches. Maybe in the near term we can upgrade.

214

Anonymous (anonymous@undisclosed.example.com) — 2023-08-18T16:19:53+00:00

Back Tada Introducing our new login node cottontail2. It is a server designed to run the Slurm scheduler and will sport the OpenHPC v2.4 software stack (External Link). We are deploying the Slurm/Warewulf recipe. You can find details at External Link: Rocky 8.5 with Architecture = (x86_64).

215

Anonymous (anonymous@undisclosed.example.com) — 2025-12-10T15:07:38+00:00

Back OpenHPC Software This list of software is compiled for Rocky 8 using the OpenHPC v2.4 gnu9-openmpi4 toolchain (in your default environment). For gpu applications CUDA 11.6 is the default. The module cuda/11.6 will automatically load for those applications.

216

Anonymous (anonymous@undisclosed.example.com) — 2022-06-07T20:07:38+00:00

Back Warewulf, ohpc 2.4 There are other pages to view but this is my latest ... * Warewulf Golden Image stateless First we create templates network.ww and ifcfg-eth0.ww This node n59 is bare metal with just a 16G usb stick attached to system board (DOM) to hold operating system. Legacy boot.

217

Anonymous (anonymous@undisclosed.example.com) — 2022-06-24T13:53:16+00:00

Back Slurm entangles So, vaguely I remember when redoing our K20 gpu nodes I had troubles with that ASUS hardware and Warewulf 3.6. Now I have deployed a production cluster using OpenHPC 2.4, Rocky 8.5 and Warewulf 3.9 version. Same deal. Do not know what is going on but just documenting.

218

Anonymous (anonymous@undisclosed.example.com) — 2025-06-19T15:43:44+00:00

Back Getting Started with Slurm Guide * The following resources are now (7/1/2022) managed by Slurm on our new head/login node cottontail2.wesleyan.edu * You must ssh directly to this server (like you do connecting to greentail52) via VPN * ssh username@cottontail2.wesleyan.edu

219

Anonymous (anonymous@undisclosed.example.com) — 2022-11-16T14:01:57+00:00

Back Slurm Consumable Resources It is important to define all the resources your job needs to run. This allows Slurm to run multiple jobs per node if resources are available (given our configuration). Monitoring the allocated resources to your job will provide feedback if your

220

Anonymous (anonymous@undisclosed.example.com) — 2024-07-01T14:50:38+00:00

Back NFSoRDMA Previously used IPoIB, consult this page External Link With newer hardware (storage and compute nodes) and an EDR Infiniband switch (expensive!) we will try NFSoRDMA. Remote Direct Memory Access supposedly gets better performance than IPoIB. Clients (compute nodes) fetch data directly from storage server's memory, so the remote storage

221

Anonymous (anonymous@undisclosed.example.com) — 2023-03-14T13:59:18+00:00

Back Infiniband Monitoring The NVIDIA Firmware Tools (MFT) is a toolset to generate a standard or customized NVIDIA firmware image and to query for firmware information. It is required for ibswinfo, which can monitor unmanaged Infiniband switches. Our new Infiniband switch is a

222

Anonymous (anonymous@undisclosed.example.com) — 2023-03-06T14:08:13+00:00

Back mdadm recreate array Something went wrong and a compute node is complaining that an array is corrupt. It was 4x 1T 7.2K rpm disks arrayed together with mdadm to provide /localscratch2tb for heavy IO Gaussian jobs. This is the process ... # first wipe the disk and partitions for sd[a-d] [root@n74 ~]# wipefs --all --force /dev/sda1; wipefs --all --force /dev/sda /dev/sda1: 4 bytes were erased at offset 0x00001000 (linux_raid_member): fc 4e 2b a9 /dev/sda: 8 bytes were erased at offset 0x00000…

223

Anonymous (anonymous@undisclosed.example.com) — 2023-09-18T20:56:49+00:00

Back cuda toolkit Upgrading Cuda to the latest drivers and toolkit that support our oldest gpu model, the K20m gpus found in nodes n33-n37 (queue mwgpu). Consult the page on the previous K20m upgrade, K20 Redo. For legacy hardware find the latest legacy driver here

224

Anonymous (anonymous@undisclosed.example.com) — 2024-01-12T14:36:16+00:00

Back Recipe for n38-n45 conversion of openlava/centos6 to slurm/centos7. First install “server with GUI” via USB installation media. Enter BIOS (delete key). Set Date&Time and boot order (Removable, USB, Cd&DVD, Hdd). Reclaim disk space. Kdump disabled.

225

Anonymous (anonymous@undisclosed.example.com) — 2024-05-21T14:06:42+00:00

Back Cuda Upgrading Cuda to the latest drivers and toolkit that support our GeForce RTX 2080 SUPER (and Ti) gpu models (queues exx96 and amber128). Before we embark on doing all nodes, we need to test backward compatibility and assess how troublesome the upgrade might be.

226

Anonymous (anonymous@undisclosed.example.com) — 2026-05-07T16:14:42+00:00

Back TrueNAS/ZFS m40ha Notes on the deployment and production changes on our 500T IXsystem m40ha storage appliance. Fixed the date on controllers by pointing ntpd to 129.133.1.1 ES 60 middle amber light blinking which is ok, green health check on right

227

Anonymous (anonymous@undisclosed.example.com) — 2024-10-23T12:16:25+00:00

Back HPC Monitoring We used to use Zenoss as our health and alerting monitor (Zenoss). Because of a research project needing quick insight into resource consumption on compute nodes we first quickly installed Ganglia. Not developed anymore but a great tool. You can quickly download centos 8 packages and grab centos 7 packages. For the latter you need to change the yum repo URLs to (and uncomment the mirrorlist URLs)

228

Anonymous (anonymous@undisclosed.example.com) — 2025-11-17T14:55:23+00:00

Back IB BIOS settings * applies to nodes n102-n107 * NFSoRDMA, see NFSoRDMA * from support at Microway * then fix NFS mount on n103 Here are the BIOS setting we set for those system prior to shipping; Start by entering the BIOS and taking the "Optimized Defaults" (F3) Then going down through the menus on the "Advanced" tab in the BIOS... Boot Feature; Quiet Boot = disabled Wait for "F1" if Error = disabled CPU Configuration: …

229

Anonymous (anonymous@undisclosed.example.com) — 2025-02-20T18:54:30+00:00

Back Recipe for RTX4070ti nodes # image using usb stick rocky 8.10 # enter bios set date, note MAC address vi /etc/selinux/config vi /etc/ssh/sshd_config # SKIP NO WAREWULF vi /etc/default/grub # add net.ifnames=0 to CMD LINE grub2-mkconfig -o /boot/grub2/grub.cfg reboot # add 10 to nodename for ip cd /etc/sysconfig/network-scripts/ vi ifcfg-eno1 mv ifcfg-eno1 ifcfg-eth0 vi ifcfg-eno2 mv ifcfg-eno2 ifcfg-eth1 systemctl restart NetworkManager ifconfig # SKIP NO WAREWULF # IPTABLES yum ins…

230

Anonymous (anonymous@undisclosed.example.com) — 2025-03-24T19:37:45+00:00

Back cuda-checkpoint Newly developed cuda tool to keep track of. Sounds good initially but there are some items to check/test out. Also need to track the DMTCP CRAC tool that almost worked. * CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility. Works with cuda driver 550 and higher (although I do not see it in exx96's cuda-12.4 installation). But it is present in test's (11.6) and mwgpu256's …

231

Anonymous (anonymous@undisclosed.example.com) — 2025-11-07T15:31:59+00:00

Back GPU checking Some excellent articles from * * * …

232

Anonymous (anonymous@undisclosed.example.com) — 2026-01-29T21:36:07+00:00

Back Lampy This is a fantastic forehead bang-on-desk exercise. So many applications wrapped together it is an enormous puzzle. I was greatly helped by following a recipe of student Max of the Starr Lab. Hopefully Max can report some performance results later as that was the driver to do this.

tmp

Anonymous (anonymous@undisclosed.example.com) — 2007-04-03T20:30:35+00:00

a page to feed vendors log files