==== Cloud Or Not? ====
| + | |||
| + | There is a lot of buzz about cloud computing. | ||
| + | |||
| + | ==== About Us ==== | ||
| + | |||
| + | First we need to assess our usage of our clusters and if the cloud can support that. We also need to stress that we are a small liberal arts college, primarily undergraduate, | ||
| + | |||
| + | As I like to tell vendors whom quote me impressive statistics like "1,500 HPC users across 17 buildings mostly doing the same thing, all tied together with their HPC solution" | ||
| + | |||
| + | ==== Usage #1 ==== | ||
| + | |||
| + | * Some users run hundreds and hundreds of serial jobs, which require a month or so to run, but only need 100 MB of memory or so and do everything in memory with their own checkpointing. | ||
| + | * Some users run hundreds and hundreds of serial jobs, which require modest amount of memory, run overnight, and whose output becomes input for a single matlab job that runs for weeks. | ||
| + | * Some users run tens and tens of jobs in serial fashion with modest IO requirements. | ||
| + | * Some users run LAMMPS parallel over the ethernet switches. | ||
| + | * Some users run Amber parallel jobs (n=32-48) which run for weeks to a month using the Infiniband interconnect. | ||
| + | * Some users run Gaussian with large IO activity and need fast (local) disk space. | ||
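
To make the serial workload above concrete, here is a minimal sketch of submitting such jobs in bulk, assuming an LSF-style scheduler with ''bsub''; the queue name, memory limit units, and program/input names are illustrative assumptions, not our actual configuration.

<code python>
#!/usr/bin/env python
# Hypothetical sketch: bulk-submit small serial jobs to an LSF-style
# scheduler via 'bsub'. Queue name, memory units, and the executable
# are assumptions for illustration only.
import subprocess

QUEUE = "hp12"  # assumed queue name
MEM = 100       # the jobs above need ~100 MB (units depend on scheduler config)

for i in range(1, 201):  # 200 jobs, for example
    subprocess.check_call([
        "bsub",
        "-q", QUEUE,                  # target queue
        "-J", "serial%d" % i,         # job name
        "-M", str(MEM),               # per-job memory limit
        "-o", "out.%d" % i,           # stdout file per job
        "./mymodel", "input.%d" % i,  # hypothetical program and input
    ])
</code>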
| + | |||
| + | |||
| + | ==== Usage #2 ==== | ||
| + | |||
| + | If you follow this [[cluster: | ||
| + | |||
| + | ==== Clusters ==== | ||
| + | |||
| + | The current problems we encounter are: | ||
| + | |||
| + | * Home directory disk space requirement, | ||
| + | * Fast scratch space, we have none (10 TB Lustre filesystem, carve out of SataBeast) | ||
| + | * Establish a data archive for users rather than have multiple copies (10 TB, carve out of SataBeast) | ||
| + | * Only 16 out of 36 nodes on Infiniband | ||
| + | * Need more nodes with small memory footprint, or more medium (12gb) nodes so small jobs can be spread wide. | ||
| + | * Need more moderate memory footprint nodes (actually we need to get gaussian/ | ||
| + | * We need a database server | ||
| + | * Perhaps we need a better filesystem, but for now NFS is ok | ||
| + | * Heating/ | ||
| + | |||
| + | Our expectations are that if we buy new hardware we expect to obtain somewhere between 300-512 job slots with say $250K, with 3 year support build in and then we do-it-ourselves during next 3 years, and at the end of 6 years consider the hardware "used up". | ||
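
As a back-of-the-envelope check on those numbers (simple arithmetic from the figures above, nothing more):

<code python>
# Rough cost per job slot over the hardware's 6-year life, using only
# the figures quoted above: $250K for 300-512 slots, used for 6 years.
price = 250000  # purchase price in dollars
years = 6       # 3 years vendor support + 3 years do-it-ourselves

for slots in (300, 512):
    per_slot = price / float(slots)
    print("%d slots: $%.0f per slot, about $%.0f per slot-year"
          % (slots, per_slot, per_slot / years))
</code>

That works out to roughly $81-$139 per slot-year, which is the figure any cloud quote would have to beat.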
| + | |||
| + | ==== Cloud ==== | ||
| + | |||
| + | My understanding of a private cloud at another, remote facility and the [dis]advantages of it are: | ||
| + | |||
| + | * Is it affordable? We do not need it to scale up for instance | ||
| + | * Cooling/ | ||
| + | * New/ | ||
| + | * New/ | ||
| + | * The ability to design our private cloud based on our needs | ||
| + | * The ability to change our design based on project needs | ||
| + | |||
| + | ==== Qs ==== | ||
| + | |||
| + | * Web based front end - how to batch submit 100's of jobs? | ||
| + | * Input/ | ||
| + | * Software - how is this provided, ie Amber(MPI)/ | ||
| + | * Debugging/ | ||
| + | * VMs - apart from specifying OS type, do we specify nrs of CPUs, memory, local disk space, etc? Does one job have exclusive use of requested VMs? | ||
| + | * Scratch - is local high-speed scratch available in our cloud, allocated by job? | ||
| + | * Support - what is, and what is not supported? | ||
| + | * Pricing - pay as you go or a predefined based on the set of resources in your cloud? | ||
| + | * Accounts - who manages accounts? | ||
| + | * Security. | ||
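
On the first question, here is a purely hypothetical sketch of what batch submission through a web front end might look like if the provider exposed a REST API; the endpoint URL, token, and JSON fields are all assumptions, and no specific vendor API is implied.

<code python>
# Hypothetical sketch only: submit 100 jobs to an assumed REST endpoint.
# Nothing here corresponds to a real provider's API.
import json
import urllib.request

API = "https://cloud.example.edu/api/jobs"  # assumed endpoint
TOKEN = "..."                               # credential (who manages these?)

for i in range(1, 101):
    job = {
        "image": "centos-hpc",                # assumed OS/VM image name
        "cpus": 1,
        "memory_mb": 100,
        "command": "./mymodel input.%d" % i,  # hypothetical program/input
    }
    req = urllib.request.Request(
        API,
        data=json.dumps(job).encode(),
        headers={"Authorization": "Bearer " + TOKEN,
                 "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
</code>

Even a sketch like this raises the input/output question: every input file still has to get into the cloud somehow.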
| \\ | \\ | ||
| **[[cluster: | **[[cluster: | ||