Slurm Commands
Original Source: https://ecs.rutgers.edu/slurm_commands.html
SLURM commands
The following table shows SLURM commands on the SOE cluster.
| Command | Description |
|---|---|
| sbatch | Submit batch scripts to the cluster |
| scancel | Signal jobs or job steps that are under the control of Slurm. |
| sinfo | View information about SLURM nodes and partitions. |
| squeue | View information about jobs located in the SLURM scheduling queue |
| smap | Graphically view information about SLURM jobs, partitions, and set configurations parameters |
| sqlog | View information about running and finished jobs |
| sacct | View resource accounting information for finished and running jobs |
| sstat | View resource accounting information for running jobs |
For more information, run man on the commands above. See some examples below.
1. Info about the partitions and nodes
List all the partitions available to you and the nodes therein:
sinfo
Nodes in state idle can accept new jobs.
Show a partition configuratuin, for example, SOE_main
scontrol show partition
Show current info about a specific node:
scontrol show node=<nodename>
You can also specify a group of nodes in the command above. For example, if your MPI job is running across soenode05,06,35,36, you can execute the command below to get the info on the nodes you are interested in:
scontrol show node=soenode[05-06,35-36]
An informative parameter in the output to look at would be CPULoad. It allows you to see how your application utilizes the CPUs on the running nodes.
2. Submit scripts
The header in a submit script specifies job name, partition (queue), time limit, memory allocation, number of nodes, number of cores, and files to collect standard output and error at run time, for example
#!/bin/bash
#SBATCH --job-name=OMP_run # job name, "OMP_run"
#SBATCH --partition=SOE_main # partition (queue)
#SBATCH -t 0-2:00 # time limit: (D-HH:MM)
#SBATCH --mem=32000 # memory per node in MB
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=16 # number of cores
#SBATCH --output=slurm.out # file to collect standard output
#SBATCH --error=slurm.err # file to collect standard errors
You can submit your job to the cluster with sbatch command:
sbatch myscript.sh
3. Query job information
List all currently submitted jobs in running and pending states for a user:
squeue -u <username>
Command squeue can be run with format options to expose specific information, for example, when pending job #706 is scheduled to start running:
squeue -j 706 --format="%S"
START_TIME 2015-04-30T09:54:32
More info can be shown by placing additional format options, for example:
squeue -j 706 --format="%i %P %j %u %T %l %C %S"
JOBID PARTITION NAME USER STATE TIMELIMIT CPUS START_TIME 706 SOE_main Par_job_3 mike PENDING 3-00:00:00 64 2015-04-30T09:54:32
To see when all the jobs, pending in the queue, are scheduled to start:
squeue --start
List all running and completed jobs for a user
sqlog -u <username>
or
sqlog -j <JobID>
The following appreviations are used for the job states:
| 약자 | 상태 | 설명 |
|---|---|---|
| CA | CANCELLED | Job was cancelled |
| CD | COMPLETED | Job completed normally |
| CG | COMPLETING | Job is in the process of completing |
| F | FAILED | Job termined abnormally |
| NF | NODE_FAIL | Job terminated due to node failure |
| PD | PENDING | Job is pending allocation |
| R | RUNNING | Job currently has an allocation |
| S | SUSPENDED | Job is suspended |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
You can specify the fields you would like to see in the output of sqlog:
sqlog --format=list
The command below, for example, provides Job ID, user name, exit state, start date-time, and end date-time for job #2831:
sqlog -j 2831 --format=jid,user,state,start,end
List status info for a currently running job:
sstat -j <jobid>
A formatted output can be used to gain only a specific info, for example, the maximum resident RAM usage on a node:
sstat --format="JobID,MaxRSS" -j <jobid>
To get statistics on completed jobs by jobID:
sacct --format="JobID,JobName,MaxRSS,Elapsed" -j <jobid>
To view the same information for all jobs of a user:
sacct --format="JobID,JobName,MaxRSS,Elapsed" -u <username>
To print a list of fields that can be specified with the –format option:
sacct --helpformat
For example, to get Job ID, Job name, Exit state, start date-time, and end date-time for job #2831:
sacct -j 2831 --format="JobID,JobName,State,Start,End"
Another useful command to gain information about a running job is scontrol:
scontrol show job=<jobid>
4. Cancel a job
To cancel one job:
scancel <jobid>
To cancel one job and delete the TMP directory created by the submit script on a node:
sdel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel one or more jobs by name:
scancel --name <myJobName>