
Slurm

The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. The architecture is described at https://computing.llnl.gov/linux/slurm/quickstart.html.

  • Installation
    • begins with installing Munge
      • fairly straightforward: build the RPMs from the tarball
      • installed on the head node and all compute nodes
      • copied the munge.key from the head node to all compute nodes
    • slurm installed from source with
      • --prefix=/opt/slurm-14 --sysconfdir=/opt/slurm-14/etc (Munge and Slurm build steps are sketched after this list)
      • launched the configurator web page and generated a simple configuration
        • created the openssl key and cert (see the Slurm web pages)
        • logging to files, not MySQL, for now
        • changed some settings in slurm.conf, particularly
          • FirstJobId, MaxJobId
          • MaxJobCount=100000
          • MaxTasksPerNode=65533
          • SrunProlog/SrunEpilog (create and remove per-job work directories in /scratch/$SLURM_JOB_ID; see the sketch below)
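
For reference, a minimal sketch of the Munge and Slurm build steps described above, assuming an RPM-based distribution; the tarball versions and the compute node name (n1) are assumptions, only the --prefix/--sysconfdir options come from my notes.

# build the Munge RPMs straight from the tarball (version is an assumption) and install
rpmbuild -ta munge-0.5.11.tar.bz2
rpm -ivh ~/rpmbuild/RPMS/x86_64/munge-*.rpm
# copy the key generated on the head node to each compute node, then start munge
scp /etc/munge/munge.key n1:/etc/munge/munge.key
service munge start

# build Slurm from source into /opt/slurm-14
tar xjf slurm-14.03.6.tar.bz2 && cd slurm-14.03.6
./configure --prefix=/opt/slurm-14 --sysconfdir=/opt/slurm-14/etc
make && make install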

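The relevant slurm.conf lines and the prolog/epilog pair might look roughly like this; the FirstJobId/MaxJobId values and the script locations are assumptions, only the parameter names and the /scratch path come from the notes above.

# excerpt from /opt/slurm-14/etc/slurm.conf
FirstJobId=1000
MaxJobId=999999
MaxJobCount=100000
MaxTasksPerNode=65533
SrunProlog=/opt/slurm-14/etc/srunprolog.sh
SrunEpilog=/opt/slurm-14/etc/srunepilog.sh

#!/bin/bash
# srunprolog.sh: runs before the job step, creates the per-job scratch work directory
mkdir -p /scratch/$SLURM_JOB_ID

#!/bin/bash
# srunepilog.sh: runs after the job step finishes, removes the per-job scratch work directory
rm -rf /scratch/$SLURM_JOB_ID
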
Then I created a simple file to test Slurm

#!/bin/bash

#SBATCH --time=1:30:10
#SBATCH --job-name="NUMBER"
#SBATCH --output="tmp/outNUMBER"
#SBATCH --begin=10:35:00

echo "$SLURMD_NODENAME JOB_PID=$SLURM_JOB_ID"
echo DONE
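
To run the test, the file is submitted with sbatch and watched with squeue; test.sh is just a placeholder name for the file above.

# submit the test job and check the queue (test.sh is a placeholder name)
sbatch test.sh
squeue -u $USER
# once the job runs, its output lands in the file named by --output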

Slurm is installed on a PE2950 with dual quad-core CPUs and 16 GB of RAM. That node is part of my high-priority queue and allocated to OpenLava (v2.2).

My compute nodes are created in a virtual KVM environment, also on a PE2950 (2.6 GHz, 16 GB RAM, dual quad-core) with hyper-threading and virtualization enabled in the BIOS. Notes on how to build that KVM environment are here: LXC Linux Containers

