
Submitting your first job

Submitting a job to the cluster requires a script that describes your workload and some metadata about it. Lucky for you, we’ve written a handy example-creator called slurm-make-examples.sh that will copy some examples into your home directory.

Note

This script is currently broken, but the unmodified files live in /shared/slurm/examples. Be sure to edit these examples and replace every instance of USER with your username!

Run:

[abc1234@tropos ~ []]$ slurm-make-examples.sh
** Placing examples in /home/abc1234/slurm-examples-2011-12-09
...
** Done copying to /home/abc1234/slurm-examples-2011-12-09
** Replacing all instances of 'USER' with 'abc1234'.
** Done replacing in /home/abc1234/slurm-examples-2011-12-09
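
If slurm-make-examples.sh isn’t working for you (see the note above), the same thing can be done by hand. A rough sketch, assuming the examples keep the directory layout shown in the listings below and with abc1234 standing in for your own username:

cp -r /shared/slurm/examples ~/slurm-examples        # copy the unmodified examples from the shared area
sed -i 's/USER/abc1234/g' ~/slurm-examples/*/*.sh    # substitute your username for USER in every job script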

Now take a look at your home directory and change into the newly created examples directory:

[abc1234@tropos ~ []]$ ls -alh
drwx------ 106 abc1234 abc1234  36K Dec  9 16:36 .
drwxr-xr-x   5 root    root      0  Dec  9 16:28 ..
drwxrwx---   5 abc1234 abc1234 2.0K Dec  9 16:36 slurm-examples-2011-12-09

[abc1234@tropos ~ []]$ cd slurm-examples-2011-12-09/

[abc1234@tropos slurm-examples-2011-12-09 []]$ ls -alh
total 192K
drwxrwx---   5 abc1234 abc1234 2.0K Dec  9 16:36 .
drwx------ 106 abc1234 abc1234  36K Dec  9 16:36 ..
drwxr-x---   2 abc1234 abc1234 2.0K Dec  9 16:36 example-1-simple-jobs
drwxr-x---   2 abc1234 abc1234 2.0K Dec  9 16:36 example-2-basic-looping
drwxr-x---   2 abc1234 abc1234 2.0K Dec  9 16:36 example-3-job-dependency-and-dynamic-node-claiming

There are three examples there. Change directory into the first one and list its contents:

[abc1234@tropos slurm-examples-2011-12-09 []]$ cd example-1-simple-jobs/

[abc1234@tropos example-1-simple-jobs []]$ ls -alh
total 160K
drwxr-x--- 2 abc1234 abc1234 2.0K Dec  9 16:36 .
drwxrwx--- 5 abc1234 abc1234 2.0K Dec  9 16:36 ..
-rwxrwx--- 1 abc1234 abc1234 1.3K Dec  9 16:36 slurm-mpi.sh
-rwxrwx--- 1 abc1234 abc1234 1.2K Dec  9 16:36 slurm-single-core.sh
-rwxrwx--- 1 abc1234 abc1234 1.3K Dec  9 16:36 slurm-smp.sh

The file we’re going to be working with first is slurm-single-core.sh. It is a SLURM job file that describes...

  1. Metadata about the job we’re going to submit
  2. The payload of the job: the actual work we want to get done.

Let’s take a look at it by running less slurm-single-core.sh:

[abc1234@tropos example-1-simple-jobs []]$ less slurm-single-core.sh
#!/bin/bash -l
# NOTE the -l flag!
#

# This is an example job file for a single core CPU bound program
# Note that all of the following statements below that begin
# with #SBATCH are actually commands to the SLURM scheduler.
# Please copy this file to your home directory and modify it
# to suit your needs.
#
# If you need any help, please email rc-help@rit.edu
#

# Name of the job - You'll probably want to customize this.
#SBATCH -J test

# Standard out and Standard Error output files
#SBATCH -o test.output
#SBATCH -e test.output

# Where to send job status emails
#SBATCH --mail-user abc1234@rit.edu

# notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --mail-type=ALL

# Request 5 minutes run time MAX, anything over will be KILLED
#SBATCH -t 0:5:0

# Put the job in the "debug" partition and request one core
# "debug" is a limited partition.  You'll likely want to change
# it to "work" once you understand how this all works.
#SBATCH -p debug -n 1

# Job memory requirements in MB
#SBATCH --mem=300

#
# Your job script goes below this line.
#
echo "(${HOSTNAME}) sleeping for 1 minute to simulate work (ish)"
sleep 60
echo "(${HOSTNAME}) Ahhh, alarm clock!"

You’ll see from the first line, #!/bin/bash -l, that this is a bash script (the -l flag asks for a login shell, which makes sure your environment is set up properly on the compute node). As you might already know, any line in a bash script that begins with a # is a comment and is disregarded when the script runs.

However, in this context, any line that begins with #SBATCH is actually a meta-command to the SLURM scheduler that informs it how to prioritize, schedule, and place your job.
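
Each #SBATCH line is simply an sbatch command-line option written into the script, and options given on the command line take precedence over the ones in the file. As a sketch (not part of the example files), the directives above could equally be supplied at submission time:

sbatch -J test -o test.output -e test.output \
       --mail-user=abc1234@rit.edu --mail-type=ALL \
       -t 0:5:0 -p debug -n 1 --mem=300 slurm-single-core.sh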

The last three lines are the ‘payload’ of the job. In this case it just prints out a statement, goes to sleep for 60 seconds (pretending to work) and then wakes up and prints one last statement. Very important scientific work, don’t you agree?
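
When you adapt this file for real work, those three payload lines are what you would swap out for your own commands. A minimal sketch, where my_program and input.dat are hypothetical placeholders rather than files shipped with the examples:

echo "(${HOSTNAME}) starting my_program"
./my_program input.dat > results.txt    # run your real workload here
echo "(${HOSTNAME}) my_program finished"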


Let’s give this script a run. We’ll submit it to the SLURM scheduler using the sbatch command, but we need one more piece of information before we do.

Research Computing divvies out resources to users by way of Qualities of Service (QOSes). If you don’t know which QOS your account belongs to, run the show-my-qos command. If things are still unclear you can email rc-help@rit.edu to ask, but you are most likely in the rc or free QOS. For each grouping of users, we define two different priority levels under which you can submit jobs.

Find your QOSes with:

[abc1234@tropos example-1-simple-jobs []]$ show-my-qos
QOSes: rc-normal, rc-nopreempt
       4 cores
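
show-my-qos is a local Research Computing convenience script. If it isn’t available to you, the standard SLURM accounting tool should report the same association information (a sketch; the output format will differ):

sacctmgr show assoc user=abc1234 format=user,qos    # list the QOSes tied to your account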

Submit your job with the following command:

[abc1234@tropos example-1-simple-jobs []]$ sbatch --qos=rc-normal slurm-single-core.sh
Submitted batch job 727

You can now check to see that your job is really running in the debug partition by running squeue:

[abc1234@tropos example-1-simple-jobs []]$ squeue
JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON)
  727     debug     test   abc1234   R       0:27      1 einstein
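
On a busy cluster the full squeue listing can get long. Two standard SLURM commands, not shown in the transcript above, that narrow things down:

squeue -u abc1234       # list only your own jobs
scontrol show job 727   # full details for a single job, including the node(s) it was assigned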

Now that our script is running, we should be able to see its output. Check for it with ls -alh:

[abc1234@tropos example-1-simple-jobs []]$ ls -alh
total 192K
drwxr-x--- 2 abc1234 abc1234 2.0K Dec  9 17:08 .
drwxrwx--- 5 abc1234 abc1234 2.0K Dec  9 16:36 ..
-rwxrwx--- 1 abc1234 abc1234 1.3K Dec  9 16:36 slurm-mpi.sh
-rwxrwx--- 1 abc1234 abc1234 1.2K Dec  9 16:36 slurm-single-core.sh
-rwxrwx--- 1 abc1234 abc1234 1.3K Dec  9 16:36 slurm-smp.sh
-rw-rw---- 1 abc1234 abc1234   86 Dec  9 17:09 test.output

And check its contents with the cat command:

[abc1234@tropos example-1-simple-jobs []]$ cat test.output
(einstein) sleeping for 1 minute to simulate work (ish)
(einstein) Ahhh, alarm clock!

Neat! This is the output that would normally be printed to the screen, written instead to the output file we specified in our SLURM job script, slurm-single-core.sh. Our code was executed on the remote compute node called einstein, and its output made its way back to our home directory over NFS.
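
Two more standard commands that are handy at this point (they aren’t part of the walkthrough above): you can watch the output file grow while a job is still running, or cancel a job you no longer need:

tail -f test.output    # follow the output as the job writes it; Ctrl-C to stop watching
scancel 727            # cancel the job with this job ID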


If you’ve been able to follow the above steps and successfully submit and monitor a job, you might want to check out Using the Cluster – Advanced Usage.