cluster:86 [2010/04/29 18:01] hmeij created
cluster:86 [2010/05/13 18:29] (current) hmeij
**[[cluster:
==== Cloud Or Not? ====

There is a lot of buzz about cloud computing.

==== About Us ====

First we need to assess how we use our clusters and whether the cloud can support that usage. We also need to stress that we are a small liberal arts college, primarily undergraduate,

As I like to tell vendors who quote me impressive statistics like "1,500 HPC users across 17 buildings mostly doing the same thing, all tied together with their HPC solution"
+ | |||
+ | ==== Usage #1 ==== | ||
+ | |||
+ | * Some users run hundreds and hundreds of serial jobs, which require a month or so to run, but only need 100 MB of memory or so and do everything in memory with their own checkpointing. | ||
+ | * Some users run hundreds and hundreds of serial jobs, which require modest amount of memory, run overnight, and whose output becomes input for a single matlab job that runs for weeks. | ||
+ | * Some users run tens and tens of jobs in serial fashion with modest IO requirements. | ||
+ | * Some users run LAMMPS parallel over the ethernet switches. | ||
+ | * Some users run Amber parallel jobs (n=32-48) which run for weeks to a month using the Infiniband interconnect. | ||
+ | * Some users run Gaussian with large IO activity and need fast (local) disk space. | ||
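To make the first two usage patterns concrete, here is a minimal sketch of how hundreds of serial jobs are typically fired off: a wrapper loop submitting one job per parameter. The ''bsub'' command and ''serial'' queue are assumptions (an LSF-style scheduler); ''./model'' and ''--seed'' are hypothetical names, and the ''echo'' makes this a dry run that only prints the submit commands.

```shell
# Dry run: print one scheduler submit command per serial job.
# "bsub" and "-q serial" are LSF-style assumptions; ./model is a
# hypothetical stand-in for a user's serial binary.
for i in 1 2 3; do
  echo "bsub -q serial -o out.$i -e err.$i ./model --seed $i"
done
```

In practice the loop would run over hundreds of seeds or input files and drop the ''echo''.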
+ | |||
+ | |||

==== Usage #2 ====

If you follow this [[cluster:

==== Clusters ====

The current problems we encounter are:

  * Home directory disk space requirement,
  * Fast scratch space: we have none (10 TB Lustre filesystem, carved out of the SataBeast)
  * Establish a data archive for users rather than keeping multiple copies (10 TB, carved out of the SataBeast)
  * Only 16 out of 36 nodes are on Infiniband
  * Need more nodes with a small memory footprint, or more medium (12 GB) nodes, so small jobs can be spread wide
  * Need more moderate memory footprint nodes (actually we need to get gaussian/
  * We need a database server
  * Perhaps we need a better filesystem, but for now NFS is OK
  * Heating/
+ | |||
+ | Our expectations are that if we buy new hardware we expect to obtain somewhere between 300-512 job slots with say $250K, with 3 year support build in and then we do-it-ourselves during next 3 years, and at the end of 6 years consider the hardware "used up". | ||
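For concreteness, the back-of-the-envelope cost per slot implied by those numbers (the budget, the slot range, and the 6-year life are all from the paragraph above):

```shell
# Cost per job slot at the low (300) and high (512) ends of the
# estimate, amortized over the planned 6-year hardware life.
awk 'BEGIN {
  budget = 250000; years = 6
  n = split("300 512", slots, " ")
  for (i = 1; i <= n; i++)
    printf "%d slots: $%.0f per slot, $%.0f per slot-year\n", slots[i], budget/slots[i], budget/slots[i]/years
}'
```

That works out to roughly $833 per slot at 300 slots and $488 per slot at 512, or about $81-$139 per slot-year -- the figure to compare against any per-core cloud pricing.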
+ | |||
+ | ==== Cloud ==== | ||
+ | |||
+ | My understanding of a private cloud at another, remote facility and the [dis]advantages of it are: | ||
+ | |||
+ | * Is it affordable? We do not need it to scale up for instance | ||
+ | * Cooling/ | ||
+ | * New/ | ||
+ | * New/ | ||
+ | * The ability to design our private cloud based on our needs | ||
+ | * The ability to change our design based on project needs | ||
+ | |||
+ | ==== Qs ==== | ||
+ | |||
+ | * Web based front end - how to batch submit 100's of jobs? | ||
+ | * Input/ | ||
+ | * Software - how is this provided, ie Amber(MPI)/ | ||
+ | * Debugging/ | ||
+ | * VMs - apart from specifying OS type, do we specify nrs of CPUs, memory, local disk space, etc? Does one job have exclusive use of requested VMs? | ||
+ | * Scratch - is local high-speed scratch available in our cloud, allocated by job? | ||
+ | * Support - what is, and what is not supported? | ||
+ | * Pricing - pay as you go or a predefined based on the set of resources in your cloud? | ||
+ | * Accounts - who manages accounts? | ||
+ | * Security. | ||
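One way to sharpen the VMs question: each parallel job's shape is already expressed today as scheduler options, and a cloud front end would need an equivalent vocabulary. A dry-run sketch (LSF-style flags are an assumption; the memory and tile values are illustrative, and ''pmemd.MPI'' stands in for an Amber parallel binary):

```shell
# Dry run: the per-job resource shape a cloud VM spec would have to
# capture (CPU count, per-slot memory, cores per node). Flags are
# LSF-style assumptions; the values are illustrative only.
echo "bsub -q parallel -n 32 -R 'rusage[mem=2048] span[ptile=8]' -o amber.out ./pmemd.MPI"
```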
\\
**[[cluster: