Using the Sun Grid Engine
From Research Computing
SGE commands users should know
Cluster users should be familiar with the following SGE commands: qsub, qstat, qhost, and qdel. These are briefly described in the SGE basic usage online documentation (http://gridengine.sunsource.net/project/gridengine/howto/basic_usage.html), and also below. Complete documentation of the SGE commands, including many more not mentioned in this basic list, is available online here (http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman/manuals.html?content-type=text/html).
Batch Job Submission
The syntax for batch job submission in SGE is close to that of PBS (http://www.openpbs.org/), the Portable Batch System. Batch jobs are submitted in the form of executable scripts to the cluster using the qsub command, which has the syntax
qsub [zero or more options] scriptfile
Sample executable scripts are available in /usr/local/sge/examples.
You can experiment with the qsub command by copying the sample script /usr/local/sge/examples/jobs/simple.sh to your home directory and running the command
qsub simple.sh
You can use the qstat command to view the status of the job as it waits for a queue, is dispatched to a queue (if a suitable one is found), and runs. The job will produce two output files, simple.sh.oJobID and simple.sh.eJobID, where JobID is the numeric ID assigned to the job. The first file contains any standard output produced by the job, and the second file contains any standard error output.
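The naming rule for the two output files can be sketched in shell. This is only an illustration: the helper names job_id_from_message and output_files are hypothetical, and the sketch assumes that qsub confirms a submission with a message of the form Your job 10582 ("simple.sh") has been submitted, which may differ between SGE versions (array jobs, for instance, are reported differently).

```shell
#!/bin/sh
# Extract the numeric job ID from a qsub confirmation message.
# Assumption: the message looks like
#   Your job 10582 ("simple.sh") has been submitted
job_id_from_message() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

# Print the two default output file names for a script name and a job ID.
output_files() {
    script=$1
    job_id=$2
    printf '%s.o%s\n%s.e%s\n' "$script" "$job_id" "$script" "$job_id"
}

# Submit only when qsub and the sample script are actually present.
if command -v qsub >/dev/null 2>&1 && [ -f simple.sh ]; then
    message=$(qsub simple.sh)
    job_id=$(job_id_from_message "$message")
    output_files simple.sh "$job_id"
fi
```

On a machine without qsub the script defines the two helpers and does nothing else, so it is safe to try out before submitting real work.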
In general, the qsub command will produce two such output files for each job. Any command-line option to qsub can also be given inside the batch job script, on a line that begins with "#$ -argument", where -argument is a qsub command-line option. For example, the default name of the standard output file can be overridden with the script line
#$ -o some_other_name
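As a minimal sketch of embedded options, the following writes a small job script and submits it if qsub is available. The script name hello_job.sh, the job name, and the output file names are hypothetical choices made for illustration; -S, -N, -cwd, -o and -e are standard qsub options, but check the qsub man page on your cluster before relying on them.

```shell
#!/bin/sh
# Write a hypothetical job script (hello_job.sh) whose "#$" lines carry qsub options.
# The quoted 'EOF' keeps $(hostname) unexpanded until the job actually runs.
cat > hello_job.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -N hello_job
#$ -cwd
#$ -o hello_job.out
#$ -e hello_job.err
echo "Running on $(hostname)"
date
EOF
chmod +x hello_job.sh

# Submit only if qsub is installed on this machine.
if command -v qsub >/dev/null 2>&1; then
    qsub hello_job.sh
else
    echo "qsub not found; hello_job.sh was written but not submitted"
fi
```

Here -S names the shell that interprets the job, -N sets the job name, -cwd runs the job in the directory it was submitted from, and -o and -e rename the standard output and standard error files.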
Monitoring jobs
The command qstat -f JobID provides a full listing of the job that has the given job ID (or of all jobs if no job ID is given). For each queue, qstat produces a line consisting of:
· the queue name
· the queue type; types or combinations of types can be
  * B(atch)
  * P(arallel)
· the number of used and available job slots
· the load average on the queue host
· the architecture of the queue host
· the state of the queue; queue states or combinations of states can be
  * a(larm)
  * A(larm)
  * d(isable)
  * D(isable)
  * E(rror)
  * s(uspended)
The command
qstat -j [job_list]
prints the reason why jobs have not been scheduled, either for all pending jobs or for the jobs listed in job_list. See the man pages for qstat for additional information.
Example
To see the currently running jobs of every user, run qstat with no arguments. Here is an example:
flengyel@m254 ~ $ qstat
job-ID  prior    name        user      state  submit/start at      queue                 slots  ja-task-ID
----------------------------------------------------------------------------------------------------------
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m12.gc.cuny.edu  1      2
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m13.gc.cuny.edu  1      3
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m14.gc.cuny.edu  1      4
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m15.gc.cuny.edu  1      5
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m16.gc.cuny.edu  1      6
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m17.gc.cuny.edu  1      7
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m19.gc.cuny.edu  1      8
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m20.gc.cuny.edu  1      9
10688   0.25005  p2Bz_2CycB  ychen2    r      10/10/2006 16:59:17  p4.q@m34.gc.cuny.edu  1
10676   0.28619  a.sh        agreer    r      10/10/2006 10:43:17  p4.q@m38.gc.cuny.edu  1
10669   0.29325  p2py_pym1_  ychen2    r      10/10/2006 09:29:47  p4.q@m40.gc.cuny.edu  1
10661   0.34769  CM02        drmootoo  r      10/10/2006 00:03:17  p4.q@m42.gc.cuny.edu  1
10683   0.28073  p2Bz_2COC3  ychen2    r      10/10/2006 11:40:02  p4.q@m43.gc.cuny.edu  1
10642   0.37437  P60_C2_2gl  ychen2    r      10/09/2006 19:25:47  p4.q@m46.gc.cuny.edu  1
10686   0.27781  p2py_pym1_  ychen2    r      10/10/2006 12:10:32  p4.q@m49.gc.cuny.edu  1
10591   0.65675  a.sh        agreer    r      10/07/2006 18:27:17  p4.q@m50.gc.cuny.edu  1
10682   0.28119  p2Bz_2CONH  ychen2    r      10/10/2006 11:35:17  p4.q@m53.gc.cuny.edu  1
10684   0.28038  p4Bz_2COC3  ychen2    r      10/10/2006 11:43:47  p4.q@m54.gc.cuny.edu  1
10664   0.30914  a.sh        agreer    r      10/10/2006 06:44:32  p4.q@m55.gc.cuny.edu  1
10672   0.29093  p2py_pym1_  ychen2    r      10/10/2006 09:54:02  p4.q@m59.gc.cuny.edu  1
10678   0.28547  a.sh        agreer    r      10/10/2006 10:50:47  p4.q@m60.gc.cuny.edu  1
10599   0.57677  job53       rdisch    r      10/08/2006 08:19:32  p4.q@m62.gc.cuny.edu  1
10638   0.39463  B4Bz_2NHC3  ychen2    r      10/09/2006 15:54:47  p4.q@m63.gc.cuny.edu  1
10687   0.25980  a.sh        agreer    r      10/10/2006 15:18:02  p4.q@m64.gc.cuny.edu  1
10634   0.39564  B90_C2_2GL  ychen2    r      10/09/2006 15:44:17  p4.q@m65.gc.cuny.edu  1
10605   0.52486  job54       rdisch    r      10/08/2006 17:19:47  p4.q@m67.gc.cuny.edu  1
10685   0.27843  a.sh        agreer    r      10/10/2006 12:04:02  p4.q@m68.gc.cuny.edu  1
10636   0.39513  B90_2gly2a  ychen2    r      10/09/2006 15:49:47  p4.q@m69.gc.cuny.edu  1
flengyel@m254 ~ $
The user flengyel is running an array job. The command qstat -u clusteruser will show all the jobs of the user clusteruser. For example,
flengyel@m254 ~ $ qstat -u flengyel
job-ID  prior    name        user      state  submit/start at      queue                 slots  ja-task-ID
----------------------------------------------------------------------------------------------------------
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m12.gc.cuny.edu  1      2
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m13.gc.cuny.edu  1      3
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m14.gc.cuny.edu  1      4
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m15.gc.cuny.edu  1      5
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m16.gc.cuny.edu  1      6
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m17.gc.cuny.edu  1      7
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m19.gc.cuny.edu  1      8
10582   0.75000  fah.job     flengyel  r      10/07/2006 02:17:02  p3.q@m20.gc.cuny.edu  1      9
flengyel@m254 ~ $
This was an array job, which originally had sixteen subtasks; it's down to eight.
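To check how many tasks of an array job are still listed, the qstat output can be piped through a small filter. The sketch below is an assumption-laden illustration: the function name count_job_rows is hypothetical, and it relies on the column layout shown above (two header lines, job ID in the first column).

```shell
#!/bin/sh
# Count the rows of qstat output (read from standard input) that belong to one job ID.
# Assumes the first two lines are the column header and the dashed separator.
count_job_rows() {
    awk -v id="$1" 'NR > 2 && $1 == id { n++ } END { print n + 0 }'
}

# Typical use on the cluster (only runs where qstat exists):
if command -v qstat >/dev/null 2>&1; then
    qstat -u "$(id -un)" | count_job_rows 10582
fi
```

Applied to the qstat -u flengyel listing above, the filter would count the eight remaining tasks of job 10582.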
Host status
The qhost command, with no arguments, shows a list of execution hosts and their configuration parameters:
[flengyel@monad flengyel]$ qhost
HOSTNAME  ARCH    NPROC  LOAD  MEMTOT   MEMUSE  SWAPTO  SWAPUS
---------------------------------------------------------------
global    -       -      -     -        -       -       -
g2        glinux  1      0.00  501.8M   141.3M  1.0G    0.0
g4        glinux  1      0.00  501.8M   139.9M  1.0G    0.0
g5        glinux  1      -     501.8M   -       1.0G    -
grid      glinux  1      0.00  501.8M   89.3M   1.0G    1.6M
m1        glinux  2      1.00  1006.1M  93.2M   1.0G    4.3M
m10       glinux  1      0.00  1006.7M  84.3M   1.0G    3.0M
m11       glinux  1      0.00  1006.7M  90.8M   1.0G    0.0
m12       glinux  1      0.00  1006.7M  88.8M   1.0G    0.0
m13       glinux  1      0.00  1006.7M  91.0M   1.0G    0.0
m14       glinux  1      0.00  1006.7M  90.7M   1.0G    0.0
m15       -       -      -     -        -       -       -
m16       glinux  1      0.00  1006.7M  90.2M   1.0G    0.0
m17       glinux  1      0.00  1006.7M  90.6M   1.0G    0.0
m18       glinux  1      0.00  1006.7M  91.8M   1.0G    0.0
m19       glinux  2      0.00  1006.1M  94.7M   1.0G    0.0
m02       glinux  2      0.00  1006.4M  29.6M   1.0G    2.8M
m20       glinux  1      0.00  1006.7M  90.7M   1.0G    0.0
m21       glinux  1      0.00  1006.7M  90.6M   1.0G    0.0
m22       glinux  1      0.00  1006.7M  89.8M   1.0G    0.0
m23       glinux  1      -     1006.7M  -       1.0G    -
m24       glinux  1      0.00  1006.7M  89.3M   1.0G    0.0
m25       glinux  1      0.00  1006.7M  93.6M   1.0G    0.0
m26       glinux  1      0.00  1006.7M  90.4M   1.0G    0.0
m27       glinux  1      0.00  1006.7M  89.9M   1.0G    0.0
m28       glinux  1      0.00  1006.7M  89.5M   1.0G    0.0
m29       glinux  1      0.00  1006.7M  89.2M   1.0G    0.0
m03       glinux  1      0.00  1006.7M  89.8M   1.0G    0.0
m30       glinux  1      2.11  1006.7M  102.1M  1.0G    0.0
m04       glinux  1      2.11  1006.7M  100.7M  1.0G    0.0
m05       glinux  1      0.00  1006.7M  89.5M   1.0G    0.0
m06       glinux  1      0.00  1006.7M  91.0M   1.0G    0.0
m07       glinux  2      -     1006.1M  -       1.0G    -
m08       glinux  1      0.00  1006.7M  89.9M   1.0G    0.0
m09       glinux  1      0.00  1006.7M  95.0M   1.0G    0.0
n4        glinux  1      0.00  501.8M   121.6M  1.0G    0.0
n5        glinux  1      0.00  501.8M   122.8M  1.0G    0.0
n6        glinux  1      0.00  501.8M   120.5M  1.0G    0.0
n7        glinux  1      0.00  501.8M   123.3M  1.0G    0.0
n8        glinux  1      0.00  501.8M   123.7M  1.0G    0.0
neptune   glinux  1      0.01  123.2M   27.6M   266.6M  21.0M
[flengyel@monad flengyel]$
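A listing of this size is easier to scan with a filter. The following sketch prints the hosts whose load exceeds a threshold; the function name busy_hosts and the threshold value are hypothetical, and the filter assumes the column order shown above (HOSTNAME first, LOAD fourth). Hosts that report "-" instead of a load value are skipped.

```shell
#!/bin/sh
# Print the names of hosts whose LOAD column exceeds a threshold.
# Reads qhost output on standard input; header, separator, and rows with a
# non-numeric load ("-") are ignored because their fourth field is not a number.
busy_hosts() {
    awk -v t="$1" '$4 ~ /^[0-9.]+$/ && $4 + 0 > t + 0 { print $1 }'
}

# Typical use on the cluster (only runs where qhost exists):
if command -v qhost >/dev/null 2>&1; then
    qhost | busy_hosts 0.5
fi
```

With a threshold of 0.5, the listing above would yield m1, m30, and m04.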
The qhost command with the -j argument displays information about running and pending jobs:
[flengyel@monad flengyel]$ qhost -j
HOSTNAME  ARCH    NPROC  LOAD  MEMTOT   MEMUSE  SWAPTO  SWAPUS
---------------------------------------------------------------
global    -       -      -     -        -       -       -
g2        glinux  1      0.00  501.8M   141.3M  1.0G    0.0
g4        glinux  1      0.00  501.8M   139.9M  1.0G    0.0
g5        glinux  1      -     501.8M   -       1.0G    -
grid      glinux  1      0.00  501.8M   89.3M   1.0G    1.6M
m1        glinux  2      1.00  1006.1M  93.2M   1.0G    4.3M
    job-ID  prior  name      user     state  submit/start at      queue    master  ja-task-ID
    ------------------------------------------------------------------------------------------
    461     0      ga054.sh  jsvitak  r      08/26/2003 13:58:58  idle1.q  MASTER
m10       glinux  1      0.00  1006.7M  84.3M   1.0G    3.0M
m11       glinux  1      0.00  1006.7M  90.8M   1.0G    0.0
m12       glinux  1      0.00  1006.7M  88.8M   1.0G    0.0
m13       glinux  1      0.00  1006.7M  91.0M   1.0G    0.0
m14       glinux  1      0.00  1006.7M  90.7M   1.0G    0.0
m15       -       -      -     -        -       -       -
m16       glinux  1      0.00  1006.7M  90.2M   1.0G    0.0
m17       glinux  1      0.00  1006.7M  90.6M   1.0G    0.0
m18       glinux  1      0.00  1006.7M  91.8M   1.0G    0.0
m19       glinux  2      0.00  1006.1M  94.7M   1.0G    0.0
m02       glinux  2      0.03  1006.4M  29.5M   1.0G    2.8M
m20       glinux  1      0.00  1006.7M  90.7M   1.0G    0.0
m21       glinux  1      0.00  1006.7M  90.6M   1.0G    0.0
m22       glinux  1      0.00  1006.7M  89.8M   1.0G    0.0
m23       glinux  1      -     1006.7M  -       1.0G    -
m24       glinux  1      0.00  1006.7M  89.3M   1.0G    0.0
m25       glinux  1      0.00  1006.7M  93.6M   1.0G    0.0
m26       glinux  1      0.00  1006.7M  90.4M   1.0G    0.0
m27       glinux  1      0.00  1006.7M  89.9M   1.0G    0.0
m28       glinux  1      0.00  1006.7M  89.5M   1.0G    0.0
m29       glinux  1      0.00  1006.7M  89.2M   1.0G    0.0
m03       glinux  1      0.00  1006.7M  89.8M   1.0G    0.0
m30       glinux  1      2.11  1006.7M  102.1M  1.0G    0.0
    465     0      ga058.sh  jsvitak  r      08/26/2003 14:07:53  m30.q    MASTER
    467     0      ga055.sh  jsvitak  r      08/26/2003 14:07:53  m30.q    MASTER
m04       glinux  1      2.07  1006.7M  100.7M  1.0G    0.0
    464     0      ga057.sh  jsvitak  r      08/26/2003 14:00:14  idle4.q  MASTER
    463     0      ga056.sh  jsvitak  r      08/26/2003 13:59:59  m4.q     MASTER
m05       glinux  1      0.00  1006.7M  89.5M   1.0G    0.0
m06       glinux  1      0.00  1006.7M  91.0M   1.0G    0.0
m07       glinux  2      -     1006.1M  -       1.0G    -
m08       glinux  1      0.00  1006.7M  89.9M   1.0G    0.0
m09       glinux  1      0.00  1006.7M  95.0M   1.0G    0.0
n4        glinux  1      0.00  501.8M   121.6M  1.0G    0.0
n5        glinux  1      0.00  501.8M   122.8M  1.0G    0.0
n6        glinux  1      0.00  501.8M   120.5M  1.0G    0.0
n7        glinux  1      0.00  501.8M   123.3M  1.0G    0.0
n8        glinux  1      0.00  501.8M   123.7M  1.0G    0.0
neptune   glinux  1      0.00  123.2M   27.6M   266.6M  21.0M
[flengyel@monad flengyel]$
Deleting jobs
Jobs can be cancelled using the qdel command. The most common form is
qdel JobID,
which kills the job that matches the provided Job ID.
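When several jobs need to be cancelled, a small wrapper can validate the job IDs first. This is a sketch only: the function name cancel_jobs and the DRY_RUN variable are hypothetical conveniences, not part of SGE. With DRY_RUN=1, or when qdel is not installed, the wrapper prints the command it would run instead of running it.

```shell
#!/bin/sh
# Cancel each job ID given as an argument, skipping anything that is not a number.
# DRY_RUN=1 (a hypothetical switch for this sketch) prints the commands instead.
cancel_jobs() {
    for id in "$@"; do
        case $id in
            ''|*[!0-9]*)
                echo "skipping invalid job ID: $id" >&2
                continue
                ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ] || ! command -v qdel >/dev/null 2>&1; then
            echo "would run: qdel $id"
        else
            qdel "$id"
        fi
    done
}
```

For example, DRY_RUN=1 followed by cancel_jobs 10582 10599 shows the two qdel commands without cancelling anything.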
An X Window System interface to the Sun Grid Engine, called qmon, is also available. Further documentation is available here (http://gridengine.sunsource.net/?JServSessionIdservlets=erch02qtl2).
