Slurm tips

Frequently used commands

  • scontrol show jobs [job-id]
  • scontrol show partition [queue]
  • scontrol show nodes
  • squeue
  • sacct -j [job-id]
  • scontrol hold job-id
  • scontrol release job-id
  • scancel job-id
  • sbatch
  • srun --pty bash
  • sinfo
  • sview
  • seff
  • sstat

Job Submission example 1

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=output.txt
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=40G
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1 
#SBATCH --time=0  # 0 requests no time limit

echo "on Hostname = $(hostname)"
echo "on GPU      = $CUDA_VISIBLE_DEVICES"
echo "@ $(date)"

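# conda activate needs conda's shell hook in a non-interactive batch shell
# (assumes conda is on PATH; adjust for your installation)
eval "$(conda shell.bash hook)"
conda activate py38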

Job Submission example 2

#!/bin/bash
#SBATCH -J slurm_example
#SBATCH -o example.output
# SLURM starts the job in the submission directory by default, so no cd is needed
# Mail all events
#SBATCH --mail-type=ALL
# set your email address
#SBATCH --mail-user [email protected]
# Request 8 hours run time
#SBATCH -t 8:0:0
#SBATCH --mem=4000  #MB
# Request 2 tasks per node (one core each by default)
#SBATCH --ntasks-per-node=2

echo "start job SLURM $(date)"
sleep 120  # run your script here
echo "Finished $(date)"

Slurm status


Slurm commonly reports three node states: allocated, mixed, and idle. But which resource do these states refer to: CPU, GPU, or RAM? What state is shown if the CPUs are exhausted but GPUs and RAM are still available?


In my experience, the state of a node in the gpu partition depends on GPU usage. Say a GPU node has 4 GPUs. If all 4 are in use, the state is allocated; if 2 are in use, the state is mixed; if none are in use, the state is idle. Note that the sinfo documentation describes the mixed state in terms of CPUs (some allocated, some idle), so this behavior may depend on how the cluster is configured.

RAM cannot be completely exhausted, because of the configured limit MaxRAMPercent = 98.0% (reported as the default on this cluster).
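
The GPU-based rule of thumb described above can be written down as a small function. This is only a sketch of that observation, not Slurm's internal logic; the helper name gpu_node_state and the 4-GPU node are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper mirroring the observation above:
# map the number of GPUs in use on a node to the state that would be reported.
gpu_node_state() {
    local used=$1 total=$2
    if [ "$used" -eq 0 ]; then
        echo "idle"
    elif [ "$used" -ge "$total" ]; then
        echo "allocated"
    else
        echo "mixed"
    fi
}

gpu_node_state 0 4   # prints: idle
gpu_node_state 2 4   # prints: mixed
gpu_node_state 4 4   # prints: allocated
```

To compare this with what a cluster actually reports, something like sinfo -N -p gpu -o "%N %T %G %C" lists each node with its state, generic resources, and CPU counts (allocated/idle/other/total).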


In cpus-per-task, what does "cpus" mean: threads or processors?


CPU (CR_CPU): CPU as a consumable resource. No notion of sockets, cores, or threads. On a multi-core system CPUs will be cores. On a multi-core/hyperthread system CPUs will be threads. On a single-core systems CPUs are CPUs. ;-) (from the Slurm website)
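
As a rough illustration of the quote, the number of CPUs on a node can be estimated from the Sockets, CoresPerSocket, and ThreadsPerCore fields that scontrol show node prints. The helper name logical_cpus and the 2-socket, 16-core node are assumptions for the example; whether threads are actually counted as CPUs depends on the cluster's configuration.

```shell
#!/bin/bash
# Hypothetical helper: number of CPUs on a node when each hardware thread
# counts as a CPU (sockets x cores per socket x threads per core).
logical_cpus() {
    local sockets=$1 cores_per_socket=$2 threads_per_core=$3
    echo $(( sockets * cores_per_socket * threads_per_core ))
}

logical_cpus 2 16 1   # no hyperthreading: CPUs are cores, prints 32
logical_cpus 2 16 2   # hyperthreading: CPUs are threads, prints 64
```

On the second node, --cpus-per-task=6 would then correspond to 6 hardware threads rather than 6 physical cores.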

Slurm sbatch exclude nodes or node list
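
sbatch has two options for this: --exclude (-x) keeps a job off the listed nodes, and --nodelist (-w) requests the listed nodes specifically. A minimal sketch follows; the node names node01, node02, node03 and the script name job.sh are hypothetical, so substitute names reported by sinfo -N on your cluster.

```shell
#!/bin/bash
# Hypothetical node names; replace with real ones from `sinfo -N`.
EXCLUDE_NODES="node01,node02"
INCLUDE_NODES="node03"

# --exclude (-x): do not run the job on these nodes.
exclude_cmd="sbatch --exclude=${EXCLUDE_NODES} job.sh"
# --nodelist (-w): the allocation must include these nodes.
nodelist_cmd="sbatch --nodelist=${INCLUDE_NODES} job.sh"

# Print the commands so they can be reviewed before running them.
echo "$exclude_cmd"
echo "$nodelist_cmd"
```

The same options can also be placed in the job script itself, for example as a line reading #SBATCH --exclude=node01,node02.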

