This shows you the differences between two versions of the page.
cluster:86 [2010/04/29 18:14] hmeij |
cluster:86 [2010/05/13 18:29] (current) hmeij |
||
---|---|---|---|
Line 4: | Line 4: | ||
==== Cloud Or Not? ==== | ==== Cloud Or Not? ==== | ||
- | There is a lot of buzz about cloud computing. | + | There is a lot of buzz about cloud computing. |
==== About Us ==== | ==== About Us ==== | ||
- | First we need to assess our usage of our clusters and if the cloud can support that. We also need to stress that we are small liberal arts college, primarily undergraduate, | + | First we need to assess our usage of our clusters and if the cloud can support that. We also need to stress that we are a small liberal arts college, primarily undergraduate, |
+ | |||
+ | As I like to tell vendors who quote me impressive statistics like "1,500 HPC users across 17 buildings mostly doing the same thing, all tied together with their HPC solution" | ||
+ | |||
+ | ==== Usage #1 ==== | ||
* Some users run hundreds and hundreds of serial jobs, which require a month or so to run, but only need 100 MB of memory or so and do everything in memory with their own checkpointing. | * Some users run hundreds and hundreds of serial jobs, which require a month or so to run, but only need 100 MB of memory or so and do everything in memory with their own checkpointing. | ||
Line 14: | Line 18: | ||
* Some users run tens and tens of jobs in serial fashion with modest IO requirements. | * Some users run tens and tens of jobs in serial fashion with modest IO requirements. | ||
* Some users run LAMMPS parallel over the ethernet switches. | * Some users run LAMMPS parallel over the ethernet switches. | ||
- | * Some users run Amber parallel jobs which run for weeks to month using the Infiniband interconnect. | + | * Some users run Amber parallel jobs (n=32-48) |
- | * Some users run Gaussian with large IO activity and need local fast disk space. | + | * Some users run Gaussian with large IO activity and need fast (local) |
+ | |||
+ | |||
+ | ==== Usage #2 ==== | ||
+ | |||
+ | If you follow this [[cluster: | ||
+ | |||
+ | ==== Clusters ==== | ||
+ | |||
+ | The current problems we encounter are: | ||
+ | |||
+ | * Home directory disk space requirement, | ||
+ | * Fast scratch space, we have none (10 TB Lustre filesystem, carve out of SataBeast) | ||
+ | * Establish a data archive for users rather than have multiple copies (10 TB, carve out of SataBeast) | ||
+ | * Only 16 out of 36 nodes on Infiniband | ||
+ | * Need more nodes with a small memory footprint, or more medium (12 GB) nodes, so small jobs can be spread wide. | ||
+ | * Need more moderate memory footprint nodes (actually we need to get gaussian/ | ||
+ | * We need a database server | ||
+ | * Perhaps we need a better filesystem, but for now NFS is ok | ||
+ | * Heating/ | ||
+ | |||
+ | We expect that if we buy new hardware for, say, $250K, we would obtain somewhere between 300 and 512 job slots with 3-year support built in. We would then support the hardware ourselves for the next 3 years, and at the end of 6 years consider it "used up". | ||
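As a rough check on those figures, the purchase price can be spread over job slots and years of service. This is a minimal sketch, assuming the $250K is spread evenly over the 6-year lifetime and that every slot is available the whole time; it ignores power, cooling, and staff time, and the function name is made up for illustration.

```python
# Figures quoted in the text; spreading the cost evenly over 6 years is an assumption.
BUDGET_USD = 250_000
LIFETIME_YEARS = 6
SLOT_RANGE = (300, 512)  # low and high estimates of job slots


def cost_per_slot_year(budget, slots, years):
    """Purchase price divided evenly over job slots and years of service."""
    return budget / (slots * years)


for slots in SLOT_RANGE:
    per_year = cost_per_slot_year(BUDGET_USD, slots, LIFETIME_YEARS)
    per_hour = per_year / (365 * 24)  # hardware cost only, slot always available
    print(f"{slots} slots: ${per_year:.2f} per slot-year, ${per_hour:.4f} per slot-hour")
```

Under these assumptions the hardware works out to roughly $81 to $139 per job slot per year, which gives a baseline to hold against any cloud pricing quote.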
+ | |||
+ | ==== Cloud ==== | ||
+ | |||
+ | My understanding of the [dis]advantages of a private cloud at another, remote facility: | ||
+ | |||
+ | * Is it affordable? We do not need it to scale up, for instance | ||
+ | * Cooling/ | ||
+ | * New/ | ||
+ | * New/ | ||
+ | * The ability to design our private cloud based on our needs | ||
+ | * The ability to change our design based on project needs | ||
+ | ==== Qs ==== | ||
+ | * Web-based front end - how to batch submit hundreds of jobs? | ||
+ | * Input/ | ||
+ | * Software - how is this provided, ie Amber(MPI)/ | ||
+ | * Debugging/ | ||
+ | * VMs - apart from specifying OS type, do we specify the number of CPUs, memory, local disk space, etc.? Does one job have exclusive use of the requested VMs? | ||
+ | * Scratch - is local high-speed scratch available in our cloud, allocated by job? | ||
+ | * Support - what is, and what is not supported? | ||
+ | * Pricing - pay as you go, or a predefined price based on the set of resources in your cloud? | ||
+ | * Accounts - who manages accounts? | ||
+ | * Security. | ||
\\ | \\ | ||
**[[cluster: | **[[cluster: |