Slurm is a resource manager and job scheduler. Users can submit jobs (i.e. scripts containing execution instructions) to Slurm so that it can schedule their execution and allocate the appropriate resources (CPU, RAM, etc.) according to the user's requests and the limits imposed by the system administrators. Using Slurm on a computational cluster has several advantages; for an overview of them please read these pages.
Slurm is free software distributed under the GNU General Public License.
A parallel job consists of tasks that run simultaneously. Parallelization can be achieved in different ways. Please read the relevant wiki page here to know more.
Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with the possibility of setting up a fail-over twin). If accounting is enabled, a third daemon called slurmdbd manages the communication between the accounting database and the management node.
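A quick sanity check that the central controller is responding can be done from any node with the client tools installed by asking scontrol to ping it; the exact wording of the reply depends on the Slurm version.
$ scontrol ping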
Common Slurm user commands include sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview, and they are available on each compute node. Manual pages for each of these commands are accessible in the usual manner, e.g. man sbatch (see below).
A node in Slurm is a compute resource, usually defined by its consumable resources, e.g. memory, CPUs, etc.
A partition (or queue) is a set of nodes with usually common characteristics and/or limits.
Partitions group nodes into logical (even overlapping if necessary) sets.
Jobs are allocations of consumable resources assigned to a user under specified conditions.
A job step is a set of tasks launched within a job, typically with srun; a job can contain multiple steps, which may also run in parallel (see the sketch below).
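As an illustrative sketch (the task counts and run times are arbitrary, and whether the background steps actually overlap depends on the scheduler configuration), a batch script can launch several job steps with srun; steps started in the background can run concurrently within the same allocation:
#!/bin/bash
#SBATCH -n 4
# step 0: uses the whole allocation
srun -n 4 hostname
# steps 1 and 2: two tasks each, launched in the background so they can run concurrently
srun -n 2 sleep 10 &
srun -n 2 sleep 10 &
wait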
Determine what partitions exist on the system, what nodes they include, and the general system state.
$ sinfo
A * near a partition name indicates the default partition. See man sinfo.
Display all active jobs of a given user
$ squeue -u <username>
See man squeue.
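The columns shown by squeue can be tailored with the -o option; a sketch using standard format specifiers (job id, partition, name, state, elapsed time, node count and nodelist/reason):
$ squeue -u <username> -o "%.10i %.9P %.20j %.2t %.10M %.6D %R"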
Report more detailed information about partitions, nodes, jobs, job steps, and configuration
$ scontrol show partition notebook
$ scontrol show node maris004
novamaris [1087] $ scontrol show jobs 1052
See man scontrol.
Create three tasks running on different nodes
novamaris [1088] $ srun -N3 -l /bin/hostname
2: maris007
0: maris005
1: maris006
Create three tasks running on the same node
novamaris [1090] $ srun -n3 -l /bin/hostname
2: maris005
1: maris005
0: maris005
Create three tasks running on different nodes, specifying which nodes must at least be included in the allocation
srun -N3 -w "maris00[5-6]" -l /bin/hostname
1: maris006
0: maris005
2: maris007
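The -N and -n options can also be combined; the following sketch would spread four tasks over two nodes (which nodes appear in the output depends on the allocation):
$ srun -N2 -n4 -l /bin/hostname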
Allocate resources and spawn job steps within that allocation
novamaris [1094] $ salloc -n2
salloc: Granted job allocation 1061
novamaris [997] $ srun /bin/hostname
maris005
maris005
novamaris [998] $ exit
exit
salloc: Relinquishing job allocation 1061
novamaris [1095] $
Create a job script and submit it to slurm for execution
Suppose batch.sh has the following contents
#!/bin/env bash
#SBATCH -n 2
#SBATCH -w maris00[5-6]
srun hostname
then submit it using sbatch batch.sh.
See man sbatch.
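A slightly fuller script might look like the sketch below; the job name, time limit, memory request and output file name are placeholders chosen only for illustration:
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --ntasks=2
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1000
#SBATCH --output=slurm-%j.out
srun hostname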
Display info on configured qos
$ sacctmgr show qos format=Name,MaxCpusPerUser,MaxJobsPerUser,Flags
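A job can then request one of the configured qos with the --qos option, either on the sbatch command line or in the script; a sketch assuming a qos named normal exists on the system:
#SBATCH --qos=normal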
Display information pertaining to the CPU, task, node, Resident Set Size (RSS) and Virtual Memory (VM) usage of a running job
$ sstat -o JobID,MaxRSS,AveRSS,MaxPages,AvePages,AveCPU,MaxDiskRead 8749.batch
       JobID     MaxRSS     AveRSS MaxPages   AvePages     AveCPU  MaxDiskRead
------------ ---------- ---------- -------- ---------- ---------- ------------
  8749.batch    196448K    196448K        0          0  01:00.000        7.03M
Note that in the example above the job is identified as 8749.batch, where the suffix 'batch' is appended to the id displayed by the squeue command. This suffix is needed whenever the running program is not parallel, i.e. not launched with srun.
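For a program that is started with srun, the corresponding job step can be queried directly; steps are numbered from 0, so the first srun step of job 8749 could be inspected as in this sketch:
$ sstat -o JobID,MaxRSS,AveCPU -j 8749.0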
Display the shares associated with a particular user
$ sshare -U -u <username>
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
               xxxxx     yyyyyy          1    0.024390    37733389      0.076901   0.112428
Display priority information of a pending job with id xxx
sprio -l -j xxx
To find what priority a running job was given, type
squeue -o %Q -j <jobid>
The sacct command displays accounting data for all jobs and job steps stored in the Slurm job accounting log or the Slurm database. For instance
sacct -o JobID,JobName,User,AllocNodes,AllocTRES,AveCPUFreq,AveRSS,Start,End -j 13180,13183
       JobID    JobName      User AllocNodes  AllocTRES AveCPUFreq     AveRSS               Start                 End
------------ ---------- --------- ---------- ---------- ---------- ---------- ------------------- -------------------
13180             test2   xxxxxxx          1 cpu=8,mem+                       2017-04-10T13:34:33 2017-04-10T14:08:24
13180.batch       batch                    1 cpu=8,mem+    116.13M    354140K 2017-04-10T13:34:33 2017-04-10T14:08:24
13183             test3   xxxxxxx          1 cpu=8,mem+                       2017-04-10T13:54:52 2017-04-10T14:26:34
13183.batch       batch                    1 cpu=8,mem+         2G     10652K 2017-04-10T13:54:52 2017-04-10T14:26:34
13183.0         xpyxmci                    1 cpu=8,mem+      1.96G     30892K 2017-04-10T13:54:53 2017-04-10T14:26:34
Use --noconvert if you want sacct to display consistent units across jobs.
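For example (a sketch with a placeholder job id):
$ sacct --noconvert -o JobID,MaxRSS,MaxVMSize -j <jobid>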
To minimize the time your job spends in the queue, you can specify multiple partitions so that it starts on the first one with free resources. Use --partition=notebook,playground,computation for instance, as shown below.
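Inside a job script the same request reads:
#SBATCH --partition=notebook,playground,computation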
To have a rough estimate of when your queued job will start, type squeue --start
To translate a job script written for a scheduler other than Slurm into Slurm's own syntax, consider using http://www.schedmd.com/slurmdocs/rosetta.pdf
Should you want to monitor the usage of the cluster nodes in a top-like fashion, type
sinfo -i 5 -S"-O" -o "%.9n %.6t %.10e/%m %.10O %.15C"
To monitor the resources consumed by your running job, type
watch -n1 sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,AveCpuFreq <jobid>[.batch]
To transmit a file to all nodes allocated to the currently active Slurm job, use sbcast. For instance
> cat my.job
#!/bin/env bash
sbcast my.prog /tmp/my.prog
srun /tmp/my.prog
> sbatch --nodes=8 my.job
srun: jobid 145 submitted
To request specific nodes for a job, use for instance #SBATCH --nodelist=maris0xx
Slurm defines a number of environment variables within each job. Type printenv | grep -i slurm to display them.
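These variables can also be used inside a job script; a minimal sketch relying on the standard SLURM_JOB_ID, SLURM_NTASKS and SLURM_JOB_NODELIST variables set by sbatch:
#!/bin/bash
#SBATCH -n 2
echo "Job $SLURM_JOB_ID runs $SLURM_NTASKS tasks on $SLURM_JOB_NODELIST"
srun hostname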