Using the Cluster – Advanced Usage

This document demonstrates some more advanced usage: how to write a “parameter-sweep” bash script that submits many jobs, how to submit your jobs at various Qualities-of-Service, and how to construct a series of jobs that depend on each other's completion to progress.




Researchers typically do not come to Research Computing to submit a single job, wait for it to complete, look at the results, and call it a day. More typically, new users come because they’ve reached a point where they have developed a model they think is feasible and they want to simulate it under different conditions (often under many different conditions!).

For instance, say that you had a model that (you think) predicts the weather and you have weather data over a 10 week period. You can write a script that submits 10 cluster jobs, each of which reads in the data from the previous week and predicts the weather for the next.

You may instead have a model that simulates a black hole. You could write a script that submits 1000 cluster jobs, each of which performs the simulation with a different quantity of initial mass for the simulation.

We’ll proceed here with a more abstract example that loops over two parameters, alpha and beta.

Change into the directory of example 2 of the example slurm files from Using the Cluster – Introduction:

[abc1234@tropos example-2-basic-looping []]$ cd ~/slurm-examples-2011-12-09/example-2-basic-looping/
[abc1234@tropos example-2-basic-looping []]$ ls -alh
total 128K
drwxr-x--- 2 abc1234 abc1234 2.0K Dec  9 16:36 .
drwxrwx--- 5 abc1234 abc1234 2.0K Dec  9 16:36 ..
-rwxr-x--- 1 abc1234 abc1234 1.1K Dec  9 16:36
-rwxr-x--- 1 abc1234 abc1234 1.3K Dec  9 16:36

Take a look first at the payload script. It’s a SLURM submission file that contains metadata about a job, just like in Using the Cluster – Introduction:

[abc1234@tropos example-2-basic-looping []]$ cat
#!/bin/bash -l
# NOTE the -l flag!

# Where to send mail...
#SBATCH --mail-user

# notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --mail-type=ALL

# Request 5 minutes run time MAX, anything over will be KILLED
#SBATCH -t 0:5:0

# Put the job in the "debug" partition and request one core
# You probably want to change "debug" to "work" once you have a sense of how
# this is working.
#SBATCH -p debug -n 1

# Job memory requirements in MB
#SBATCH --mem=300

echo "I am a job..."
echo "My value of alpha is $alpha"
echo "My value of beta is $beta"
echo "And now I'm going to simulate doing work based on those parameters..."

sleep 20

echo "All done with my work.  Exiting."

Nothing much new here, except for the appearance of $alpha and $beta. Those are bash environment variables that represent... something. It suffices to say that they can hold values, and that they can be used to run your program and workload with particular parameters instead of the benign sleep 20 statement. That part is up to you!
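To make the payload do real work, you would replace the sleep with your own program, passing the exported parameters along. A minimal sketch (my_simulation is a hypothetical placeholder for your executable, not something this tutorial provides):

```shell
#!/bin/bash
# Under sbatch, $alpha and $beta arrive via the exported environment set
# up by the submit script; default to 1 so the snippet runs standalone too.
alpha="${alpha:-1}"
beta="${beta:-1}"

echo "Running with alpha=$alpha and beta=$beta"
# Replace the next (commented) line with your real workload, e.g.:
#   ./my_simulation --alpha "$alpha" --beta "$beta"
```
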

Let’s take a look at the other (more interesting) file:

[abc1234@tropos example-2-basic-looping []]$ cat
#!/bin/bash

# Just a constant variable used throughout the script to name our jobs
#   in a meaningful way.
jobname_base="test"

# Another constant variable used to name the slurm submission file that
#   this script is going to submit to slurm.

# The upper limits of our two parameters (5 x 5 = 25 jobs)
param_limit_alpha=5
param_limit_beta=5

# Make an output directory if it doesn't already exist.
mkdir -p output

# Loop and submit all the jobs
echo " * Getting ready to submit a number of jobs:"
for alpha in $(seq 1 $param_limit_alpha); do
    for beta in $(seq 1 $param_limit_beta); do
        # Give our job a meaningful name
        jobname="$jobname_base-$alpha-$beta"
        echo "Submitting job $jobname"

        # Set up where we want the output from each job to go
        outfile="output/output-alpha.$alpha-beta.$beta.txt"

        # "exporting" variables in bash makes them available to your slurm
        # workload.
        export alpha;
        export beta;

        # Actually submit the job.
        sbatch --qos=free -J $jobname -o $outfile $jobfile
    done
done

echo " * Done submitting all those jobs (whew!)"

This file is a standard bash script (not a SLURM submission file itself). If you were to run it, it would set up some variables, enter a nested loop over two variables (our $alpha and $beta), and submit a job for each combination.
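The nested-loop pattern on its own can be sketched like this (smaller limits here, just to show what combinations it produces):

```shell
#!/bin/bash
# Each (alpha, beta) combination would become one submitted job.
for alpha in $(seq 1 2); do
    for beta in $(seq 1 2); do
        echo "would submit job test-$alpha-$beta"
    done
done
# Prints four lines: test-1-1, test-1-2, test-2-1, test-2-2
```
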

Let’s look more closely at the sbatch ... line. You’ve seen the --qos=... option before; that’s how you specify the priority level at which your job should be scheduled. Everyone is a member of the 'free' quality of service group. Other QOS groups can be accessed by purchasing cores (email if interested). The -J some_name and -o some_file options give your job a name and an output file location, respectively. You've seen them before, but in a different context.

In Using the Cluster – Introduction, these options were found inside the SLURM submission/payload script itself as #SBATCH meta commands. Now they appear as command-line options for the sbatch command itself. Every metacommand can be used interchangeably as a command-line option to sbatch, and there are a lot of them. For the purposes of these tutorials, we’ve tried to prune things down and show you only what’s necessary to get off the ground, but to see the whole list of options/metacommands, issue the command $ man sbatch to get the user manual for SLURM submission.
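For instance, the five-minute time limit from the payload script above could be given either way (myjob.sh here is a hypothetical file name standing in for your own submission script):

```shell
# Inside the submission file, as a metacommand:
#SBATCH -t 0:5:0

# ...or equivalently on the command line at submission time:
sbatch -t 0:5:0 myjob.sh
```

If an option appears in both places, the command-line value wins, so this is also a handy way to override a single setting without editing the file.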

Let’s see what this script does. Give it a run:

[abc1234@tropos example-2-basic-looping []]$ ./

 * Getting ready to submit a number of jobs:

Submitting job test-1-1
Submitted batch job 732
Submitting job test-1-2
Submitted batch job 733
... (snip) ...
Submitting job test-5-4
Submitted batch job 755
Submitting job test-5-5
Submitted batch job 756

 * Done submitting all those jobs (whew!)

Now check the slurm queue to see what kind of effect this had, if any. Run:

[abc1234@tropos example-2-basic-looping []]$ squeue
    736     debug test-1-5   abc1234  PD       0:00      1 (Resources)
    737     debug test-2-1   abc1234  PD       0:00      1 (Priority)
    738     debug test-2-2   abc1234  PD       0:00      1 (Priority)
    739     debug test-2-3   abc1234  PD       0:00      1 (Priority)
    740     debug test-2-4   abc1234  PD       0:00      1 (Priority)
    741     debug test-2-5   abc1234  PD       0:00      1 (Priority)
    742     debug test-3-1   abc1234  PD       0:00      1 (Priority)
    743     debug test-3-2   abc1234  PD       0:00      1 (Priority)
    744     debug test-3-3   abc1234  PD       0:00      1 (Priority)
    745     debug test-3-4   abc1234  PD       0:00      1 (Priority)
    746     debug test-3-5   abc1234  PD       0:00      1 (Priority)
    747     debug test-4-1   abc1234  PD       0:00      1 (Priority)
    748     debug test-4-2   abc1234  PD       0:00      1 (Priority)
    749     debug test-4-3   abc1234  PD       0:00      1 (Priority)
    750     debug test-4-4   abc1234  PD       0:00      1 (Priority)
    751     debug test-4-5   abc1234  PD       0:00      1 (Priority)
    752     debug test-5-1   abc1234  PD       0:00      1 (Priority)
    753     debug test-5-2   abc1234  PD       0:00      1 (Priority)
    754     debug test-5-3   abc1234  PD       0:00      1 (Priority)
    755     debug test-5-4   abc1234  PD       0:00      1 (Priority)
    756     debug test-5-5   abc1234  PD       0:00      1 (Priority)
    734     debug test-1-3   abc1234   R       0:01      1 escher
    735     debug test-1-4   abc1234   R       0:01      1 escher
    732     debug test-1-1   abc1234   R       0:05      1 bach
    733     debug test-1-2   abc1234   R       0:05      1 bach
    731      work ONPNTN-I   luvmet   R   21:04:16      1 einstein
    730      work ONPNTN-I   luvmet   R   21:04:35      1 einstein

Great. We can see that 25 jobs were submitted, all to the debug partition. Four of them are running right now and the other 21 are in the pending state.

We set the jobs to each direct their own output to their own file in an output/ directory. Let’s take a look:

[abc1234@tropos example-2-basic-looping []]$ ls -alh output/
total 192K
drwxrwx--- 2 abc1234 abc1234 2.0K Dec 14 10:39 .
drwxr-x--- 3 abc1234 abc1234 2.0K Dec 14 10:38 ..
-rw-rw---- 1 abc1234 abc1234  349 Dec 14 10:39 output-alpha.1-beta.1.txt
-rw-rw---- 1 abc1234 abc1234  219 Dec 14 10:39 output-alpha.1-beta.2.txt
-rw-rw---- 1 abc1234 abc1234  221 Dec 14 10:39 output-alpha.1-beta.3.txt
-rw-rw---- 1 abc1234 abc1234  221 Dec 14 10:39 output-alpha.1-beta.4.txt

And there are the first four output files, slowly accumulating output from their execution on the remote nodes bach and escher.

In conclusion, this template includes:

  1. A slurm payload script that describes constant metadata about each job and how to run it.
  2. An outer-loop script that actually submits each parameterized job.

You can use the template and modify it to do the computations you’re really setting out to do (instead of just printing out the parameters and sleeping for 20 seconds).


Qualities of Service

If you’ve been following this tutorial from the beginning, you have so far only submitted jobs under the free quality of service (QOS). There are many others.

Preemption. n. Definition: when one job is suspended so that another job can take its place, the first job is said to have been preempted; it is resumed after the second completes.

Non-free QOSs have access to a lower overall number of cores, but they have the advantages of being non-preemptable and of being able to preempt other jobs. Say, hypothetically, that the tropos cluster boasted a total core-count of 100. You could submit 100 single-core jobs under the free QOS and, if no other users were running at the moment, all 100 of your jobs would run. Say next that another user (who has paid good money) submits 50 single-core jobs under the "big-guy-nopreempt" QOS. "big-guy" has paid for the guaranteed right to 25 cores at any given time. Through a mechanism called suspension, 25 of your jobs will be sent the "STOP" signal to free up 25 cores. As big-guy's initial 25 jobs finish up, the other 25 are submitted. As those finish up, your suspended jobs are automatically resumed.

Say that your jobs are very sensitive; you don’t want them to be preempted! At the outset of the above situation, you could have submitted them under a non-free QOS to the "premium" partition. Instead of all 100 of your jobs starting immediately, you would only start up however many CPU cores your QOS allows for, but those jobs are non-preemptable. No matter how many other paying users submit jobs, your jobs are safe and will run to completion when you submit to the premium partition.

Dependent Jobs

Sometimes you are running a large number of jobs, some of which need to begin and complete in a specific order. The problem here is that one job might get stalled out on a slow operation (like I/O) and not complete before a subsequent and dependent job is scheduled to start.

To solve this, SLURM provides a --dependency flag, which specifies that a particular job should not start until another specified job has completed. Here is an example, modified from the script in Using the Cluster – Introduction, that demonstrates its usage:

#!/bin/bash
#   Submit 15 jobs, each of which will depend on its predecessor.
# Authors:  Justin Talbot (jdt0127)
#           Ralph Bean (rjbpop)

# Just a constant variable used throughout the script to name our jobs
#   in a meaningful way.  (The name here is just an example.)
jobname="dependent-test"

# Another constant variable used to name the slurm submission file that
#   this script is going to submit to slurm.

# Make an output directory if it doesn't already exist, and name the file
#   where the first job's output should go.
mkdir -p output
outfile="output/$jobname-0.txt"

# Stores the command to execute in the variable $command for easy reading.
#   More information on this below inside the 'for' loop.
command="sbatch --qos=free -o $outfile -e $outfile -J $jobname-0 $jobfile"

# Use the command above to submit our first job.  It doesn't depend on any
#   other jobs.  Its output to the terminal will be something like:
#                 Submitted batch job 4023
#   We are going to use that output and grab the number 4023 (the fourth word
#   in the output, hence the $4) and stuff that number in the variable latest_id
latest_id=$($command | awk ' { print $4 }')

# Now submit the remaining jobs so that they depend on their predecessors.
# Loop from 1 to 14 putting the value in the variable $i
for i in `seq 1 14` ; do
    # Do a little output to see what's up as this executes.
    echo "Submitting my ${i}th job that depends on SLURM job $latest_id"

    # Name a file dynamically where we want all of our 'messages' to go.
    outfile="output/$jobname-$i.txt"

    # Submit the next job and..
    #   1) Do not start the job until the last finished SUCCESSFULLY
    #   2) Name it based on the sequence of jobs
    #   3) Use $outfile for standard output and error
    #   4) Use the job described in filename $jobfile
    # The backslashes '\' here only allow us to break the command into
    #   multiple lines to make it easier to read and understand as humans.
    # We capture the new job's id so the next iteration can depend on it
    #   (job ids are not guaranteed to be consecutive, so we must not guess).
    latest_id=$(sbatch \
            --qos=free \
            --dependency=afterok:$latest_id \
            -J $jobname-$i \
            --output=$outfile \
            --error=$outfile \
            $jobfile | awk ' { print $4 }')
done
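The job-id capture trick can be tested without SLURM at all. Here fake_sbatch is a hypothetical stand-in that mimics sbatch's one-line response:

```shell
#!/bin/bash
# Simulate sbatch's response and extract the job id (the fourth word).
fake_sbatch() { echo "Submitted batch job 4023"; }
latest_id=$(fake_sbatch | awk ' { print $4 }')
echo "$latest_id"    # prints 4023
```
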

Claiming the nVidia GPUs

There is an nVidia card installed on a compute node of tropos. If you’d like to claim the GPUs for your job, you need to specify --gres=gpu:1 as an argument to your sbatch command. The GPU card is only available on the work partition:

$ sbatch -p work --qos=free --gres=gpu:1