Cluster Resource Scheduler

Our cluster currently uses a combination of schedulers that work in tandem to manage the utilization of resources in the cluster.

TORQUE

The Terascale Open-source Resource and QUEue Manager (TORQUE) is a distributed resource manager providing control over batch jobs and distributed compute nodes. TORQUE can integrate with the non-commercial Maui Cluster Scheduler or the commercial Moab Workload Manager to improve overall utilization, scheduling and administration on a cluster.

The TORQUE community has extended the original PBS to improve scalability, fault tolerance, and functionality. Contributors include NCSA, OSC, USC, the US DOE, Sandia, PNNL, UB, TeraGrid, and other HPC organizations. TORQUE is described by its developers as open-source software, using the OpenPBS version 2.3 license, but is considered non-free software under the Debian Free Software Guidelines due to licensing issues.

Maui Cluster Scheduler

Maui was most heavily developed during the mid-90s. Development slowed into the 2000s, although an active community around the usage of Maui still exists. Its development was made possible by the support of Cluster Resources, Inc. (now Adaptive Computing) and the contributions of many individuals and sites including the U.S. Department of Energy, PNNL, the Center for High Performance Computing at the University of Utah (CHPC), Ohio Supercomputer Center (OSC), University of Southern California (USC), SDSC, MHPCC, BYU, NCSA, and many others. It may be downloaded, modified and redistributed.

Maui Cluster Scheduler is currently maintained and supported by Adaptive Computing, Inc., although most new development has come to a standstill. A next-generation, non-open-source scheduler is part of the Moab Cluster Suite and borrows many of the concepts found in Maui. Maui's developers state that the license satisfies some definitions of open-source software, but the software is not available for commercial usage.

Adaptive Computing's Maui project is not associated with the Maui Scheduler Molokini Edition, which was developed as a project on the SourceForge site independent of the original Maui scheduler, under the GNU Lesser General Public License.  The Molokini Edition's most recent release was in 2005.

  • Using the Torque job scheduler
  • Preparing a Submission Script
  • Submitting a job to the Queue
  • Monitoring Job Execution
  • Deleting Jobs with the command qdel
  • Environment Variables
  • Array Jobs
  • Using scratch disk space
  • Jobs with conditional execution
  • Interactive Jobs

Using the Torque job scheduler

To run an application, the user launches a job on nic.mst.edu.  A job contains both the details of the processing to carry out (name and version of the application, input and output, etc.) and directives for the computer resources needed (number of cpus, amount of memory).

Jobs are run as batch jobs, i.e. in an unattended manner.  Typically, a user logs in to nic.mst.edu using an ssh client, sends a job to the execution queue and often logs out.

Jobs are managed by a job scheduler, a piece of software which is in charge of

  • allocating the computer resources requested for the job,
  • running the job and
  • reporting back to the user the outcome of the execution.

Running a job involves, at a minimum, preparing a submission script and submitting it to the execution queue.

The NIC cluster uses a job scheduler called Torque Resource Manager, an advanced open-source product based on the original PBS project.  Torque integrates with the Maui Cluster Scheduler in order to improve overall utilization, scheduling and administration on the cluster.

This guide describes basic job submission and monitoring for Torque: preparing a submission script, submitting a job to the queue, monitoring job execution and deleting jobs with qdel.

In addition, some more advanced topics are covered: environment variables, array jobs, using scratch disk space, jobs with conditional execution and interactive jobs.

Preparing a Submission Script

A submission script is a shell script that

  • describes the processing to carry out (e.g. the application, its input and output, etc.) and
  • requests computer resources (number of cpus, amount of memory) to use for processing.

Suppose we want to run a molecular dynamics MPI application called foo with the following requirements

  • the run uses 32 processes,
  • the job will not run for more than 100 hours,
  • the job is given the name "protein123" and
  • the user should be emailed when the job starts and stops or aborts.

Assuming the number of cores available on each cluster node is 16, a total of 2 nodes is required to map one MPI process to each physical core.  Supposing no input needs to be specified, the following submission script runs the application in a single job

 
#!/bin/bash 

# set the execution queue name 
#PBS -q default@nic-cluster.mst.edu 

# set the number of nodes and processes per node 
#PBS -l nodes=2:ppn=16 

# set max wallclock time 
#PBS -l walltime=100:00:00
 
# set name of job 
#PBS -N protein123 

# mail alert at start, end and abortion of execution 
#PBS -m bea 

# send mail to this address 
#PBS -M joeminer@mst.edu 

# use submission environment 
#PBS -V 

# start job from the directory it was submitted 
cd $PBS_O_WORKDIR 

# Output job information to STDERR (optional NIC specific script) 
/share/apps/job_data_prerun.py 

# run through the mpirun launcher 
mpirun -n 32 ./foo 

The script starts with #!/bin/bash (also called a shebang), which makes the submission script also a Linux bash script.

The script continues with a series of lines starting with #.  For bash scripts these are all comments and are ignored.  For Torque, the lines starting with #PBS are directives requesting job scheduling resources.  (NB: it's very important that you put all the directives at the top of a script, before any other commands; any #PBS directive coming after a bash script command is ignored!)

The final part of a script is normal Linux bash scripting and describes the set of operations to follow as part of the job.  In this case, this involves running the MPI-based application foo through the MPI utility mpirun.

The resource request #PBS -l nodes=n:ppn=m is the most important of the directives in a submission script.  The first part (nodes=n) is imperative and determines how many compute nodes the scheduler allocates to the job.  The second part (ppn=m) is used by the scheduler to prepare the environment for an MPI parallel run with m processes per compute node (e.g. writing a hostfile for the job, pointed to by $PBS_NODEFILE).  However, it is up to the user and the submission script to use the environment generated from ppn adequately.

Examples of Torque submission scripts are given for some of the more popular applications in the software section of this site.

PBS job submission directives

Directives are job specific requirements given to the job scheduler.

The most important directives are those that request resources.  The most common are the wallclock time limit (the maximum time the job is allowed to run) and the number of processors required to run the job.  For example, to run an MPI job with 32 processes for up to 100 hours on a cluster with 16 cores per compute node, the PBS directives are


#PBS -l walltime=100:00:00
#PBS -l nodes=2:ppn=16

A job submitted with these requests runs for 100 hours at most; after this limit expires, the job is terminated regardless of whether the processing finished or not.  Normally, the wallclock time should be conservative, allowing the job to finish normally (and terminate) before the limit is reached.

Also, the job is allocated two compute nodes (nodes=2) and each node is scheduled to run 16 MPI processes (ppn=16).  (ppn is an abbreviation of Processes Per Node.)  It is the task of the user to instruct mpirun to use this allocation appropriately, i.e. to start 32 processes which are mapped to the 32 cores available for the job.
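If you would rather not hard-code the process count, it can be derived from the hostfile Torque writes for the job.  The following is a minimal sketch, assuming an mpirun that accepts the -n option (foo is the example application used above):

# each allocated processor slot appears as one line in the hostfile
NPROCS=$(wc -l < $PBS_NODEFILE)

# launch one MPI process per allocated slot
mpirun -n $NPROCS ./foo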

Submitting a job to the Queue

Submitting jobs with the command qsub

Supposing you already have a submission script ready (call it submit.sh), the job is submitted to the execution queue with the command qsub submit.sh.  The queueing system prints a number (the job id) almost immediately and returns control to the Linux prompt.  At this point the job is already in the submission queue.
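For example, assuming the script above is saved as submit.sh, a submission from the command line might look like the following sketch; capturing the printed job id is optional but convenient for later commands such as qdel:

# submit the job; qsub prints the job id and returns immediately
JOBID=$(qsub submit.sh)

# keep a record of the id for later monitoring or deletion
echo "Submitted job $JOBID"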

Once you have submitted the job it will sit in a pending queue for some time (how long depends on the demands of your job and the demand on the service).  You can monitor the progress of the job using the command qstat or showq.

Once the job is complete you will see files with names like "jobname.e1234" and "jobname.o1234", either in your home directory or in the directory you submitted the job from (depending on how your job submission script is written).  The ".o" file contains the "standard output", which is essentially what the application you ran would normally have printed onto the screen.  The ".e" file contains any error messages issued by the application; on a correct execution without errors, this file can be empty.  The "jobname" part of the file name will match the job name you specified with the #PBS -N jobname directive.
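If you would prefer a single output file, or want to choose where it is written, qsub's -j and -o options can also be given as directives in the submission script.  A minimal sketch (the file name below is only an example):

# merge standard error into standard output
#PBS -j oe

# write the combined output to a file of your choosing
#PBS -o protein123.log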

Read all the options for qsub on the Linux manual using the command man qsub.

Monitoring Job Execution

Monitoring jobs with the command qstat

qstat is the main command for monitoring the state of systems, groups of jobs or individual jobs.  The simple qstat command gives a list of jobs which looks something like this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
462828.nic-p1              O3_161618Avec    dawesr          4899970: R chem_cpu       
462838.nic-p1              O3_161617Avec    dawesr          9384185: R chem_cpu       
462839.nic-p1              O3_161618vec     dawesr          5302993: R chem_cpu
467710.nic-p1              722              sh7t8           103:00:4 E zhou_cpu       
467711.nic-p1              722              sh7t8           102:51:4 E zhou_cpu
467963.nic-p1              casino           adpkd6                 0 Q multi_cpu

The first column gives the job ID, the second the name of the job (specified by the user in the submission script) and the third the owner of the job.  The fourth column gives the elapsed time for each particular job.  The fifth column is the status of the job (R=running, Q=waiting, E=exiting, H=held, S=suspended).  The last column is the queue for the job (a job scheduler can manage different queues serving different purposes).

Some other useful qstat features include:

  • -u for showing the status of all the jobs of a particular user, e.g. qstat -u bob for user bob
  • -n for showing the nodes allocated by the scheduler for a running job
  • -i for showing the status of a particular job, e.g. qstat -i 1121 for job with the id 1121.

Read all the options for qstat on the Linux manual using the command man qstat.

Monitoring jobs with the command showq

showq is the main command for monitoring the state of jobs in the Maui scheduler.  It is useful for determining whether a job is just waiting in the queue or whether it has been blocked for some reason.  The simple showq command gives a list of jobs which looks something like this:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

468357               jasmz6    Running    48  1:05:27:27  Thu Nov  6 09:20:18
468358               jasmz6    Running    48  1:05:27:58  Thu Nov  6 09:20:49
468359               jasmz6    Running    48  1:05:27:58  Thu Nov  6 09:20:49
468350                wjm84    Running    90  1:13:49:48  Wed Nov  5 23:42:39
467599                ml89c    Running     8  1:14:30:16  Fri Oct 31 13:23:07
467600                ml89c    Running     8  1:14:33:22  Fri Oct 31 13:26:13
76 Active Jobs    1551 of 2715 Processors Active (57.13%)
                        65 of  128 Nodes Active      (50.78%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

468008                ml89c       Idle     8  7:12:00:00  Wed Nov  5 02:32:47
468009                ml89c       Idle     8  7:12:00:00  Wed Nov  5 02:33:26
467169               ajby96       Idle   480  4:04:00:00  Mon Oct 27 12:27:57

3 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

467963               adpkd6       Idle   480 21:00:00:00  Tue Nov  4 16:13:03

Total Jobs: 80   Active Jobs: 76   Idle Jobs: 3   Blocked Jobs: 1

 

Some other useful showq features include:

  • -u for showing the status of all the jobs of a particular user, e.g. showq -u bob for user bob
  • -i for showing the idle queue
  • -r for showing the running queue
  • -b for showing the blocked queue

Deleting Jobs with the command qdel

Use the qdel command to delete a job, e.g. qdel 1121 to delete the job with id 1121.  A user can delete their own jobs at any time, whether the job is pending (waiting in the queue) or running.  A user cannot delete the jobs of another user.  Normally, there is a (small) delay between the execution of the qdel command and the time when the job is dequeued and killed.  Occasionally a job may not delete properly, in which case the ITRSS team can delete it for you; if you have a job that will not delete, contact the helpdesk at help.mst.edu.
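If you need to remove many jobs at once, the qselect utility (installed alongside the other Torque client commands on most systems) can be combined with qdel.  A sketch that deletes all of your own jobs (use with care):

# select the ids of all jobs belonging to the current user and delete them
qdel $(qselect -u $USER)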

Environment Variables

At the time a job is launched into execution, Torque defines multiple environment variables, which can be used from within the submission script to define the correct workflow of the job.  The most useful of these environment variables are the following:

  • PBS_O_WORKDIR, which points to the directory where the qsub command was issued,
  • PBS_NODEFILE, which points to a file that lists the hosts (compute nodes) on which the job is run,
  • PBS_JOBID, which is the unique job identifier assigned to the job by the batch system,
  • PBS_JOBNAME, which is the job name supplied by the user,
  • PBS_QUEUE, which is the name of the queue from which the job is executed and
  • TMPDIR, which points to a directory on the scratch (local and fast) disk space that is unique to the job.

PBS_O_WORKDIR is typically used at the beginning of a script to go to the directory where the qsub command was issued, which is frequently also the directory containing the input data for the job, etc.  The typical use is

cd $PBS_O_WORKDIR

inside a submission script.

PBS_NODEFILE is typically used to define the environment for the parallel run, for mpirun in particular.  Normally, this usage is hidden from users inside a script (e.g. enable_arcus_mpi.sh), which defines the environment for the user.
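Depending on how the MPI library on the cluster was built, the hostfile may need to be passed to mpirun explicitly.  A minimal sketch, assuming an mpirun that accepts the -machinefile option (foo is the example application used earlier):

# tell mpirun which hosts Torque allocated to the job
mpirun -n 32 -machinefile $PBS_NODEFILE ./foo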

PBS_JOBID is useful to tag job specific files and directories, typically output files or run directories.  For instance, the submission script line

myApp > $PBS_JOBID.out

runs the application myApp and redirects the standard output to a file whose name is given by the job id.  (NB: the job id is a number assigned by Torque and differs from the character string name given to the job in the submission script by the user.)

TMPDIR is the name of a scratch disk directory unique to the job.  The scratch disk space typically has faster access than the disk space where the user home and data areas reside and benefits applications that have a sustained and large amount of I/O.  Such a job normally involves copying the input files to the scratch space, running the application on scratch and copying the results to the submission directory.  This usage is discussed in the section on using scratch disk space below.

Array Jobs

Arrays are a feature of Torque which allows users to submit a series of jobs using a single submission command and a single submission script.  A typical use of this is the need to batch process a large number of very similar jobs, which have similar input and output, e.g. a parameter sweep study.

A job array is a single job with a list of sub-jobs.  To submit an array job, use the -t flag to describe a range of sub-job indices.  For example

qsub -t 1-100 script.sh

submits a job array whose sub-jobs are indexed from 1 to 100.  Also,

qsub -t 100-200 script.sh

submits a job array whose sub-jobs are indexed from 100 to 200.  Furthermore,

qsub -t 100,200,300 script.sh

submits a job array whose sub-job indices are 100, 200 and 300.

The typical submission script for a job array uses the index of each sub-job to define the task specific to that sub-job, e.g. the name of the input file or of the output directory.  The sub-job index is given by the PBS variable PBS_ARRAYID.  To illustrate its use, suppose the application myApp processes some files named input_*.dat (taken as input), with * ranging from 1 to 100.  This processing is described in a single submission script called submit.sh, which contains the following line

myApp < input_$PBS_ARRAYID.dat > output_$PBS_ARRAYID.dat

A job array is submitted using this script with the command qsub -t 1-100 submit.sh.  When a sub-job is executed, the file names in the line above are expanded using the sub-job index, with the result that each sub-job processes a unique input file and writes the result to a unique output file.
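Putting this together, a complete submission script for such an array might look like the following sketch; the resource requests and walltime are only examples and should be adjusted to suit myApp:

#!/bin/bash

# each sub-job runs serially on a single core (example values)
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00

# set name of job
#PBS -N sweep

# start each sub-job from the directory the array was submitted from
cd $PBS_O_WORKDIR

# the sub-job index selects the input and output files
myApp < input_$PBS_ARRAYID.dat > output_$PBS_ARRAYID.dat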

Once submitted, all the array sub-jobs in the queue can be monitored by passing the extra -t flag to qstat.

Using scratch disk space

Currently we have two options for higher speed scratch space that is not tied directly to the user's home folder.

You can use:

TMPDIR=/state/partition1/tmp

which will use local disk storage on the node as the tmp space.  This works very well for jobs on a single node, or for jobs in which each node works on independent scratch files.  Alternatively, you can use:

TMPDIR=/scratch/$USER

which will use your private scratch space on the high-speed Lustre file system.  This is much faster than the home directory and offers scratch space that is shared across all the nodes in the cluster, which makes it a good choice for jobs that require temp space shared across the nodes.
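As an illustration, the relevant part of a submission script using the node-local scratch area might follow the pattern below.  This is only a sketch: it assumes the application foo reads input.dat from the current directory and writes output.dat there.

# point TMPDIR at the node-local scratch area and make a job-specific directory
TMPDIR=/state/partition1/tmp
SCRATCH=$TMPDIR/$PBS_JOBID
mkdir -p $SCRATCH

# copy the input to scratch and run the application from there
cp $PBS_O_WORKDIR/input.dat $SCRATCH/
cd $SCRATCH
$PBS_O_WORKDIR/foo

# copy the results back and clean up
cp $SCRATCH/output.dat $PBS_O_WORKDIR/
rm -rf $SCRATCH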

 

Jobs with conditional execution

It is possible to start a job on the condition that another one completes beforehand; this may be necessary, for instance, if the input to one job is generated by another job.  Job dependency is defined using the -W flag.

To illustrate with an example, suppose you need to start a job using the script second_job.sh after another job has finished.  Assume the first job is started using the script first_job.sh and that the command to start the first job

qsub first_job.sh

returns the job ID 7777. Then, the command to start the second job is

qsub -W depend=after:7777 second_job.sh

This job dependency can be further automated (possibly to be included in a bash script) using environment variables:

JOB_ID_1=`qsub first_job.sh`
JOB_ID_2=`qsub -W depend=after:$JOB_ID_1 second_job.sh`

Furthermore, the conditional execution above can be changed so that the second job starts only on the condition that the execution of the first was successful.  This is achieved by replacing after (which only requires the first job to have started) with afterok, e.g.

JOB_ID_2=`qsub -W depend=afterok:$JOB_ID_1 second_job.sh`

Conditional submission (as well as conditional submission after successful execution) is also possible with job arrays.  This is useful, for example, to submit a "synchronization" job (script sync_job.sh) after the successful execution of an entire array of jobs (defined by array_job.sh).  The conditional execution uses afterokarray instead of afterok:

JOB_ARRAY_ID=`qsub -t 2-6 array_job.sh`
JOB_SYNC_ID=`qsub -W depend=afterokarray:$JOB_ARRAY_ID sync_job.sh`

Interactive Jobs

You will normally use an interactive job along with X-forwarding to view your results or to run a job that requires user interaction.  Thus it is "interactive".

The place to start is with learning to use PuTTY/SSH with X forwarding.

Once you have your session started with the cluster you will want to request your node(s) with the scheduler.

$ qsub -I -X -l nodes=1:ppn=1 -l walltime=00:05:00 -q @nic-cluster.mst.edu

The line above will request a single node with one processor for a period of 5 minutes.  Feel free to change this line to any supported number of nodes and processors, as documented in the job file descriptions.  You can use any of the job file flags on the qsub command line.

You will receive a message:

qsub: waiting for job 514934.nic-p1.local to start
qsub: job 514934.nic-p1.local ready

[userid@somenode-#-# ~]$

This indicates that you can run your interactive commands.  It may take a while for resources to become available for interactive jobs, so be patient.
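Once the prompt appears you are on a compute node and can run commands interactively.  For example, assuming an X client such as xclock is installed on the node, a quick way to confirm that X forwarding is working is:

# a clock window should open on your local display if X forwarding is set up correctly
xclock &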