This document demonstrates some more advanced usage: how to write a “parameter-sweep” bash script that submits many jobs, how to submit your jobs at various Qualities-of-Service, and how to construct a series of jobs that depend on each other's completion to progress.


You might also want to check out these screencasts of the workshops we ran introducing how to use tropos:



Researchers typically do not come to Research Computing to submit a single job, wait for it to complete, look at the results, and call it a day. More typically, new users come because they’ve reached a point where they have developed a model they think is feasible and they want to simulate it under different conditions (often under many different conditions!).

For instance, say that you had a model that (you think) predicts the weather and you have weather data over a 10 week period. You can write a script that submits 10 cluster jobs, each of which reads in the data from the previous week and predicts the weather for the next.

You may instead have a model that simulates a black hole. You could write a script that submits 1000 cluster jobs, each of which performs the simulation with a different initial mass.

We’ll proceed here with a more abstract example that loops over two parameters, alpha and beta.

Change into the directory of example 2 of the example slurm files from Using the Cluster – Introduction:

[abc1234@tropos example-2-basic-looping []]$ cd ~/slurm-examples-2011-12-09/example-2-basic-looping/
[abc1234@tropos example-2-basic-looping []]$ ls -alh
total 128K
drwxr-x--- 2 abc1234 abc1234 2.0K Dec  9 16:36 .
drwxrwx--- 5 abc1234 abc1234 2.0K Dec  9 16:36 ..
-rwxr-x--- 1 abc1234 abc1234 1.1K Dec  9 16:36
-rwxr-x--- 1 abc1234 abc1234 1.3K Dec  9 16:36

Take a look first at the SLURM submission file. It contains metadata about a job, just like in Using the Cluster – Introduction:

[abc1234@tropos example-2-basic-looping []]$ cat
#!/bin/bash -l
# NOTE the -l flag!

# Where to send mail...
#SBATCH --mail-user

# notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --mail-type=ALL

# Request 5 minutes run time MAX, anything over will be KILLED
#SBATCH -t 0:5:0

# Put the job in the "debug" partition and request one core
# You probably want to change "debug" to "work" once you have a sense of how
# this is working.
#SBATCH -p debug -n 1

# Job memory requirements in MB
#SBATCH --mem=300

echo "I am a job..."
echo "My value of alpha is $alpha"
echo "My value of beta is $beta"
echo "And now I'm going to simulate doing work based on those parameters..."

sleep 20

echo "All done with my work.  Exiting."

Nothing much new here, except for the appearance of $alpha and $beta. Those are bash environment variables that represent... something. Suffice it to say that they can hold values, and that you can use those values to run your program and workload with particular parameters in place of the benign sleep 20 statement. That part is up to you!
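
As a rough illustration of how a payload might hand those values to a real program, here is a minimal sketch. The program name my_simulation and its --alpha and --beta flags are hypothetical placeholders, not part of the example files, and the function only prints the command it would run so that you can see the substitution.

```shell
#!/bin/bash
# Sketch only: my_simulation and its flags are hypothetical placeholders.
run_workload() {
    # Stop with a clear message if the submitting script did not export a
    # parameter, rather than running with an empty value.
    : "${alpha:?alpha is not set}"
    : "${beta:?beta is not set}"

    # Print the command that would replace "sleep 20"; remove the echo to
    # actually run your own program.
    echo my_simulation --alpha "$alpha" --beta "$beta"
}
```

Calling run_workload in a job where alpha=2 and beta=3 were exported would print my_simulation --alpha 2 --beta 3.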

Let’s take a look at the other (more interesting) file:

[abc1234@tropos example-2-basic-looping []]$ cat

# Just a constant variable used throughout the script to name our jobs
#   in a meaningful way.

# Another constant variable used to name the slurm submission file that
#   this script is going to submit to slurm.


# Make an output directory if it doesn't already exist.
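mkdir -p output

# Upper limits for the two loop parameters (the run shown on this page
#   covers 1 through 5 for each).
param_limit_alpha=5
param_limit_beta=5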

# Loop and submit all the jobs
echo " * Getting ready to submit a number of jobs:"
for alpha in $(seq 1 $param_limit_alpha); do
    for beta in $(seq 1 $param_limit_beta); do
        # Give our job a meaningful name
        jobname="test-$alpha-$beta"
        echo "Submitting job $jobname"

        # Setup where we want the output from each job to go
        outfile="output/output-alpha.$alpha-beta.$beta.txt"

        # "exporting" variables in bash make them available to your slurm
        # workload.
        export alpha;
        export beta;

        # Actually submit the job.
        sbatch --qos=rc-normal -J "$jobname" -o "$outfile" "$jobfile"
    done
done

echo " * Done submitting all those jobs (whew!)"

This file is a standard bash script (not a SLURM submission file itself). If you were to run it, it would set up some variables, enter a nested loop over two variables (our $alpha and $beta), and submit a job for each combination.
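
For reference, here is a compact sketch of the same nested-loop pattern wrapped in a function. It is not the original file. The job name prefix "test" and the output file pattern are taken from the run shown later on this page. The file name payload.slurm used in the note below is a hypothetical stand-in for the submission file, and the SUBMIT variable is an addition that lets you do a dry run without touching the queue.

```shell
#!/bin/bash
# Sketch of a two-parameter sweep. Assumptions: the submission file name is
# supplied by the caller, and SUBMIT is an addition that allows a dry run.
submit_sweep() {
    local limit_alpha=$1 limit_beta=$2 jobfile=$3
    # Command used to submit; defaults to sbatch. It is left unquoted below
    # on purpose so that SUBMIT="echo sbatch" splits into two words.
    local submit=${SUBMIT:-sbatch}
    local alpha beta jobname outfile

    # Make an output directory if it doesn't already exist.
    mkdir -p output

    for alpha in $(seq 1 "$limit_alpha"); do
        for beta in $(seq 1 "$limit_beta"); do
            jobname="test-$alpha-$beta"
            outfile="output/output-alpha.$alpha-beta.$beta.txt"

            # Export the parameters so the payload can read $alpha and $beta.
            export alpha beta

            $submit --qos=rc-normal -J "$jobname" -o "$outfile" "$jobfile"
        done
    done
}
```

Running SUBMIT="echo sbatch" submit_sweep 2 2 payload.slurm prints the four sbatch commands of a 2 x 2 sweep instead of submitting them, which is a cheap way to check the job names and output paths before a real run.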

Let’s look more closely at the sbatch ... line. You’ve seen the --qos=... option before; that’s how you specify the priority level at which your job should be scheduled. The -J some_name and -o some_file options give your job a name and an output file location, respectively. You’ve seen them before, but in a different context.

In Using the Cluster – Introduction, these options were found inside the SLURM submission/payload script itself as #SBATCH metacommands. Now they appear as command-line options for the sbatch command itself. Every metacommand can be used interchangeably as a command-line option to sbatch, and there are a lot of them. For the purposes of these tutorials, we’ve tried to prune down and show you only what’s necessary to get off the ground, but to see the whole list of options/metacommands, issue the command $ man sbatch to get the user manual for SLURM submission.
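
To make the correspondence concrete, the sketch below prints the options that a submission file sets through its #SBATCH lines, in the form you would type them after sbatch. The function name sbatch_options_in is made up for this illustration; it is not a SLURM tool.

```shell
#!/bin/bash
# Sketch only: sbatch_options_in is a made-up helper, not part of SLURM.
sbatch_options_in() {
    # Keep only lines that start with "#SBATCH" and strip that prefix,
    # leaving the option text exactly as it could be passed to sbatch.
    sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$1"
}
```

For the payload shown earlier, the output would include lines such as --mail-type=ALL, -t 0:5:0 and -p debug -n 1. When the same option is given in both places, the sbatch manual says the command-line value takes precedence.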

Let’s see what this script does. Give it a run:

[abc1234@tropos example-2-basic-looping []]$ ./

 * Getting ready to submit a number of jobs:

Submitting job test-1-1
Submitted batch job 732
Submitting job test-1-2
Submitted batch job 733
... (snip) ...
Submitting job test-5-4
Submitted batch job 755
Submitting job test-5-5
Submitted batch job 756

 * Done submitting all those jobs (whew!)

Now check the slurm queue to see what kind of effect this had, if any. Run:

[abc1234@tropos example-2-basic-looping []]$ squeue
    736     debug test-1-5   abc1234  PD       0:00      1 (Resources)
    737     debug test-2-1   abc1234  PD       0:00      1 (Priority)
    738     debug test-2-2   abc1234  PD       0:00      1 (Priority)
    739     debug test-2-3   abc1234  PD       0:00      1 (Priority)
    740     debug test-2-4   abc1234  PD       0:00      1 (Priority)
    741     debug test-2-5   abc1234  PD       0:00      1 (Priority)
    742     debug test-3-1   abc1234  PD       0:00      1 (Priority)
    743     debug test-3-2   abc1234  PD       0:00      1 (Priority)
    744     debug test-3-3   abc1234  PD       0:00      1 (Priority)
    745     debug test-3-4   abc1234  PD       0:00      1 (Priority)
    746     debug test-3-5   abc1234  PD       0:00      1 (Priority)
    747     debug test-4-1   abc1234  PD       0:00      1 (Priority)
    748     debug test-4-2   abc1234  PD       0:00      1 (Priority)
    749     debug test-4-3   abc1234  PD       0:00      1 (Priority)
    750     debug test-4-4   abc1234  PD       0:00      1 (Priority)
    751     debug test-4-5   abc1234  PD       0:00      1 (Priority)
    752     debug test-5-1   abc1234  PD       0:00      1 (Priority)
    753     debug test-5-2   abc1234  PD       0:00      1 (Priority)
    754     debug test-5-3   abc1234  PD       0:00      1 (Priority)
    755     debug test-5-4   abc1234  PD       0:00      1 (Priority)
    756     debug test-5-5   abc1234  PD       0:00      1 (Priority)
    734     debug test-1-3   abc1234   R       0:01      1 escher
    735     debug test-1-4   abc1234   R       0:01      1 escher
    732     debug test-1-1   abc1234   R       0:05      1 bach
    733     debug test-1-2   abc1234   R       0:05      1 bach
    731      work ONPNTN-I   luvmet   R   21:04:16      1 einstein
    730      work ONPNTN-I   luvmet   R   21:04:35      1 einstein

Great. We can see that 25 jobs were submitted, all to the debug partition. Four of them are running right now and the other 21 are in the pending state.
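
If you would rather not count rows by eye, a small helper can tally the states for you. This is only a sketch: count_states is a made-up name, and it assumes its input has one state code per line, which squeue can produce with -h (no header) and -o %t (compact state code).

```shell
#!/bin/bash
# Sketch only: count_states is a made-up helper name.
count_states() {
    # Read one job state code per line (for example PD or R) on standard
    # input and print each distinct state followed by how often it appeared.
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on the cluster (not run here):
#   squeue -h -u "$USER" -o %t | count_states
```

For the abc1234 jobs in the queue listing above, this should report PD 21 and R 4.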

We set the jobs to each direct their own output to their own file in an output/ directory. Let’s take a look:

[abc1234@tropos example-2-basic-looping []]$ ls -alh output/
total 192K
drwxrwx--- 2 abc1234 abc1234 2.0K Dec 14 10:39 .
drwxr-x--- 3 abc1234 abc1234 2.0K Dec 14 10:38 ..
-rw-rw---- 1 abc1234 abc1234  349 Dec 14 10:39 output-alpha.1-beta.1.txt
-rw-rw---- 1 abc1234 abc1234  219 Dec 14 10:39 output-alpha.1-beta.2.txt
-rw-rw---- 1 abc1234 abc1234  221 Dec 14 10:39 output-alpha.1-beta.3.txt
-rw-rw---- 1 abc1234 abc1234  221 Dec 14 10:39 output-alpha.1-beta.4.txt

And there are the first four output files, slowly accumulating output from their execution on the remote nodes bach and escher.
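
Since every job ends by printing "All done with my work.", you can check which jobs have run to completion by searching the output files for that message. The helper below is a sketch under that assumption; finished_outputs is a made-up name, and the file pattern matches the output file names in the listing above.

```shell
#!/bin/bash
# Sketch only: finished_outputs is a made-up helper name.
finished_outputs() {
    # List the per-job output files in directory $1 that already contain the
    # payload's closing message, meaning the job ran to the end.
    grep -l "All done with my work" "$1"/output-alpha.*-beta.*.txt 2>/dev/null
}
```

Running finished_outputs output/ from the example directory prints one file name per finished job and nothing for jobs that are still running or pending.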

In conclusion, this template includes:

  1. A slurm payload script that describes constant metadata about each job and how to run it.
  2. An outer-loop script that actually submits each parameterized job.

You can use the template and modify it to do the computations you’re really setting out to do (instead of just printing out the parameters and sleeping for 20 seconds).