cluster:86

This is an old revision of the document!



Back

Cloud Or Not?

There is a lot of buzz about cloud computing, and recently it has spilled over into the HPC world. There are public clouds and private clouds, and a private cloud can be hosted at an external organization or internally at your own. I do not have the gist of it down yet, but this page explores the question: can or should we spend our $298K NSF award on HPC cloud computing, or stay the course?

About Us

First we need to assess how we use our clusters and whether the cloud can support that usage. We also need to stress that we are a small liberal arts college, primarily undergraduate, with fewer than 3,000 students and perhaps 450 active faculty. We are not going to scale up to 10,000 nodes, so scaling is not a reason for us to move to cloud computing. We also have a relatively small cluster user base that represents many different interest groups, as opposed to one large community using a small set of tools.

Users

  • Some users run hundreds and hundreds of serial jobs, which require a month or so to run, but only need 100 MB of memory or so and do everything in memory with their own checkpointing.
  • Some users run hundreds and hundreds of serial jobs, which require a modest amount of memory, run overnight, and whose output becomes input for a single MATLAB job that runs for weeks.
  • Some users run tens and tens of jobs in serial fashion with modest IO requirements.
  • Some users run LAMMPS parallel over the ethernet switches.
  • Some users run Amber parallel jobs, which run for weeks to months using the Infiniband interconnect.
  • Some users run Gaussian with large IO activity and need local fast disk space.

Usage

If you follow this Queue Usage link you can observe how we use the compute nodes by queue. Notice that we do have jobs pending while resources are available. So the question becomes: how do we leverage those resources better?

Clusters

The current problems we encounter are:

  • Home directory disk space: we need more than 5 TB (out of the $298K grant, reserve $50K for a 48 TB (raw) Nexsan SataBeast?)
  • Fast scratch space, we have none (10 TB Lustre filesystem, carve out of SataBeast)
  • Establish a data archive for users rather than have multiple copies (10 TB, carve out of SataBeast)
  • Only 16 out of 36 nodes on Infiniband
  • Need more nodes with small memory footprint
  • Need more nodes with a moderate memory footprint (actually we need to get Gaussian/Linda fixed on sharptail)
  • We need a database server node
  • Perhaps we need a better filesystem, but for now NFS is ok
  • Heating/Cooling and Power will again tax the data center
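
The storage items in the list above can be sanity-checked with a little arithmetic. The sketch below uses the figures from this page (48 TB raw, more than 5 TB for home directories, 10 TB for Lustre scratch, 10 TB for the archive). The usable fraction after RAID and filesystem overhead is not stated anywhere on this page, so the 0.8 used here is purely a hypothetical placeholder, and the function and key names are made up for illustration.

```python
# Rough storage budget for the proposed 48 TB (raw) array.
# Requested sizes come from the list above; home is a lower bound ("more than 5 TB").
RAW_TB = 48
REQUESTS_TB = {"home": 5, "scratch_lustre": 10, "archive": 10}


def remaining_tb(raw_tb, usable_fraction, requests):
    """Return the TB left over after all requests are carved out.

    usable_fraction is an ASSUMPTION (RAID/filesystem overhead); this page
    does not give a value for it.
    """
    usable = raw_tb * usable_fraction
    return usable - sum(requests.values())


if __name__ == "__main__":
    # 0.8 is a hypothetical overhead factor, not a vendor figure.
    left = remaining_tb(RAW_TB, 0.8, REQUESTS_TB)
    print(f"requested: {sum(REQUESTS_TB.values())} TB, left over: {left:.1f} TB")
```

Whatever is left over is the headroom available for home directories to grow beyond the 5 TB minimum.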

If we buy new hardware, we expect to obtain somewhere between 300 and 512 job slots with 3 years of support built in. After that we support it ourselves for the next 3 years, and at the end of 6 years we consider the hardware “used up”.
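
To compare this against cloud pricing, it helps to turn the hardware expectation into a cost per job slot per hour. The sketch below is a rough estimate only. It assumes, hypothetically, that everything left after the $50K storage reserve ($248K) goes to compute nodes, and it ignores power, cooling, and staff time. The utilization parameter and the function name are illustrative and do not come from this page.

```python
# Hardware-only cost per job-slot-hour, using figures from this page.
AWARD = 298_000          # NSF award in dollars
STORAGE_RESERVE = 50_000  # proposed storage reserve in dollars
LIFETIME_YEARS = 6       # hardware considered "used up" after 6 years
HOURS_PER_YEAR = 24 * 365


def cost_per_slot_hour(slots, budget=AWARD - STORAGE_RESERVE,
                       years=LIFETIME_YEARS, utilization=1.0):
    """Dollars per job-slot-hour actually used.

    ASSUMPTION: the whole budget buys compute; utilization is the fraction
    of slot-hours that run jobs (1.0 means always busy).
    """
    slot_hours = slots * years * HOURS_PER_YEAR * utilization
    return budget / slot_hours


if __name__ == "__main__":
    for slots in (300, 512):
        for utilization in (1.0, 0.5):
            rate = cost_per_slot_hour(slots, utilization=utilization)
            print(f"{slots} slots at {utilization:.0%} busy: ${rate:.4f} per slot-hour")
```

A cloud quote expressed per core-hour can then be set next to these numbers, keeping in mind that the local figure leaves out data center costs.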

Cloud

My understanding of a private cloud at another, remote facility and the [dis]advantages of it are:

  • Is it affordable? We do not need it to scale up, for instance.
  • Cooling/Heating and Power are not our problem anymore
  • New/Different administration tasks
  • The ability to design our private cloud based on our needs
  • The ability to change our design based on project needs

Qs

  • Web based front end - how to batch submit 100's of jobs?
  • Input/Output files - how to prestage input files, retrieve output files?
  • Software - how is this provided, i.e. Amber(MPI)/Infiniband, Intel ifort/icc, GNU whatever, Gaussian/MATLAB/Stata, LAMMPS(MPI)/gigE etc.? In our “own” cloud, is this done by us?
  • Debugging/Rendering - are any interactive activities shifted to our local environment as opposed to being “on the head node”?
  • VMs - apart from specifying the OS type, do we specify the number of CPUs, memory etc.? Does one job have exclusive use of the requested VMs?
  • Scratch - is local high-speed scratch available in our cloud, allocated by job?
  • Support - what is, and what is not supported?
  • Pricing - how is this organized: pay as you go, or a predefined set of resources (use them or lose them)?
  • Accounts - can accounts be tied to the organization's single sign-on?



cluster/86.1272566476.txt.gz · Last modified: 2010/04/29 14:41 by hmeij