Submitting parallel Gaussian 03/Linda jobs to the Sun Grid Engine

From Research Computing

Running parallel Gaussian rev. D jobs under SGE

The Graduate Center cluster supports Gaussian D1 with TCP Linda on all dual-processor P3, single-processor P4 and dual core Intel E6600 systems. Gaussian has simplified parallel jobs in revision D of g03l, the Linda-enabled version of Gaussian 03. Parallel Gaussian jobs now require some changes to SGE scripts, as documented below.

Gaussian input file

This article assumes familiarity with Linda-enabled parallel Gaussian; see Using Gaussian 03 with Linda (http://www.gaussian.com/g_tech/linda_use.htm) at the Gaussian (http://www.gaussian.com) web site for further details. Gaussian input files for parallel Gaussian on the E6600 dual core systems or the dual-processor Pentium 3 systems should have two additional statements, which must occur before the route section of the input file, as follows. The value of the first statement is recommended; the value of the second statement will vary depending on the number of worker tasks needed.

%nproc=2
%NProcLinda=4

The first line, %nproc=2, instructs g03l to use both cores or both processors on a dual-core or dual-processor system. ALL Gaussian .com files intended for dual-core or dual-processor systems should contain this line. It is permissible to use this line even with the single-processor systems; this will generate a warning that may be ignored.

The second line, %NProcLinda=4, causes the job to be executed on the master node and three worker nodes. For this reason, the number of processors specified for the gauss parallel execution environment in the SGE batch job submission script (see below) must equal the value of NProcLinda. The Gaussian input file for this example follows; it gives the directive in its abbreviated form, %nprocl=4.

%nprocl=4
%mem=400MB
%chk=t_cp
#p b3lyp/cc-pvtz counterpoise=3 opt=z-matrix optcyc=999 
 
t
 
0 1
O 1
O 1 oo 2
O 2 oo 1 60.0 3
H 1 oh 3 hoo 2 0.0 0 1
H 2 oh 1 hoo 3 0.0 0 2
H 3 oh 2 hoo 1 0.0 0 3
H 1 ho 2 oho 3 d1 0 1
H 2 ho 3 oho 1 d2 0 2
H 3 ho 1 oho 2 d2 0 3
 
  oo          2.7873
  oh          0.9771
  hoo        78.2005
  ho          0.9609
  oho       113.8022
  d1        112.335
  d2       -113.7495
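
Because the number of slots requested from SGE must equal the value of %NProcLinda, it can help to compare the two before submitting. The following is a minimal sketch and not part of the cluster's tooling: the function names linda_workers, pe_slots and check_slots are hypothetical, and it assumes that the input file uses either %NProcLinda or the short form %nprocl, and that the job script (described in the next section) requests the gauss parallel execution environment on a single #$ -pe line.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the value of %NProcLinda (or its short form %nprocl) from a .com file.
linda_workers() {
    grep -i -E '^%nprocl(inda)?=' "$1" | head -n 1 | cut -d= -f2 | tr -d '[:space:]'
}

# Print the slot count from the "#$ -pe gauss N" line of an SGE job script.
pe_slots() {
    awk '$1 == "#$" && $2 == "-pe" && $3 == "gauss" { print $4; exit }' "$1"
}

# Succeed only if both values are present and equal.
check_slots() {
    local com="$1" script="$2" n s
    n=$(linda_workers "$com")
    s=$(pe_slots "$script")
    if [ -n "$n" ] && [ "$n" = "$s" ]; then
        echo "OK: NProcLinda=$n matches -pe gauss $s"
    else
        echo "MISMATCH: NProcLinda='$n' but -pe gauss '$s'" >&2
        return 1
    fi
}

# Example: check_slots t_cp.com t_cp.sh
```

The check only reads the two files; it does not contact SGE, so it can be run on any machine where the files are visible.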

SGE job script for parallel Gaussian 03 revision D

The following is an SGE job script for submitting the previous parallel Gaussian 03 (g03l) job to the cluster. This script uses the SGE parallel execution environment called gauss, based on the sample mpi parallel execution environment that comes with the SGE under /usr/local/sge/mpi. Comments in the script precede SGE specific command options.

NOTE: in the script below, the character sequences '#' and '#$' appear in the first column. Lines beginning with # are ordinary comments. Lines beginning with #$ are read by the Sun Grid Engine (SGE) job submission commands, such as qsub: the shell ignores them when the script runs, but SGE interprets the text following #$, prior to execution, as if it were command line arguments to the submission command. For example, instead of entering "qsub -N benchmark jobname" on the command line, you can include the line #$ -N benchmark in the script file jobname and use the command qsub jobname.

NOTE: It is imperative that #$ control sequences start at the beginning of the line, since they are not ordinary comments that can start anywhere on the line.
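
One way to catch misplaced directives is to search a job script for #$ sequences that are preceded by whitespace. This is a sketch only; the function name indented_directives is hypothetical, and it relies on nothing beyond grep.

```shell
# Print (with line numbers) any "#$" directive that does not start in column one.
# Exit status is 0 if such lines were found and 1 if the script is clean.
indented_directives() {
    grep -n '^[[:space:]][[:space:]]*#\$' "$1"
}

# Example: indented_directives t_cp.sh && echo "move the lines listed above to column one"
```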


#!/bin/bash
#$ -S /bin/bash
# Use the current working directory (recommended)
#$ -cwd
# Use the queue for the 64-bit dual-core E6600 systems
#$ -q x86_64.q
# name of job goes after -N
#$ -N benchmark
# Request the gauss parallel execution environment, followed by
# the value of %NProcLinda in the gaussian input .com file.
#$ -pe gauss 4
 
# source the Gaussian environment
export g03root=/usr/local/gaussian
. ${g03root}/g03/bsd/g03.profile
# source SGE variables
export SGE_ROOT=/usr/local/sge
. /usr/local/sge/default/common/settings.sh
# .. ..
export PATH=$TMPDIR:$PATH
export GAUSS_SCRDIR=/tmp
export NODES=\"`cat $TMPDIR/machines`\"
export GAUSS_LFLAGS="-v -nodelist ${NODES}"
 
g03l t_cp.com


In this example, the Bash shell is used (first two lines). The line

 #$ -cwd

instructs SGE to write the standard output file <JOBNAME>.o<JOB_ID> and the standard error file <JOBNAME>.e<JOB_ID> to the current directory. This is recommended. A best practice is to keep .com files for separate molecules (or a group of related molecules) in different subdirectories of your account. The #$ -cwd syntax will ensure that your SGE output and error files end up in the directory where the job was submitted.

The line

 #$ -q x86_64.q 

requests the machines from the dual core E6600 64-bit machine queue. To request machines from the Pentium IV queue, use the following line instead.

 #$ -q p4.q

These lines could be omitted if Pentium III machines are desired; alternatively, dual-processor Pentium III machines can be requested with the following syntax.

 #$ -q p3.q

The -N option sets the name of the SGE job. Any reasonable name can be chosen, such as the name of the Gaussian input file.

The crucial lines for parallelized Gaussian specify the SGE parallel execution environment, which is used to request from SGE the number of processors required by the job. These lines begin with

 #$ -pe gauss 4

and include the lines

export NODES=\"`cat $TMPDIR/machines`\"
export GAUSS_LFLAGS="-v -nodelist ${NODES}"

which pass the list of processors allocated by SGE (if successful) to the g03l program (Gaussian03 with Linda) through the Gaussian/Linda environment variable $GAUSS_LFLAGS.
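
To see what these two lines produce, the same construction can be tried on a stand-in machines file outside of SGE. The sketch below is illustrative: the function name linda_flags is hypothetical, and the file passed to it is assumed to list one host name per line, as the machines file does. The escaped quotes become literal quote characters around the host list, and the host names remain separated by line breaks, exactly as they appear in the file.

```shell
# Build the value of GAUSS_LFLAGS from a machines file (one host name per line).
# This mirrors the two export lines of the job script: the escaped quotes
# end up as literal quote characters surrounding the host list.
linda_flags() {
    local nodes
    nodes=\"$(cat "$1")\"
    printf '%s\n' "-v -nodelist ${nodes}"
}

# Example (inside a running job): linda_flags "$TMPDIR/machines"
```

Printing the value this way is a convenient check when a job fails to start its workers, since it shows exactly which hosts were handed to g03l.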

NOTE: if we get ambitious, we'll modify the gauss parallel execution environment so that it sets $GAUSS_LFLAGS directly, instead of writing the file $TMPDIR/machines. This inelegant step is a quick-and-dirty adaptation of the SGE sample mpi parallel execution environment, which includes the prologue and epilogue scripts startmpi.sh and stopmpi.sh. Among other functions, these scripts create and clean up the machines file, which lists the machines allocated for the job.

Submitting the job to SGE

Batch jobs are submitted to the cluster using the qsub command, as follows:

flengyel@m248 ~/benchmark $ qsub t_cp.sh
Your job 17211 ("benchmark") has been submitted

Note that the job id is 17211, in this case. Your job id is used to create standard output and error output files for the job; these files will be discussed below. The job id is also useful to refer to in case the job encounters errors.
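
If the job id is needed in a script, it can be pulled out of the confirmation message. This is a sketch that assumes the message has exactly the form shown above ("Your job N ... has been submitted"); the function name job_id_from_qsub is hypothetical.

```shell
# Extract the numeric job id from qsub's confirmation line, for example
#   Your job 17211 ("benchmark") has been submitted
# Reads the message on standard input and prints the id.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Example: JOB_ID=$(qsub t_cp.sh | job_id_from_qsub)
```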

Monitoring the job

Immediately after submission, if the SGE qstat command is issued, one can see that the job waits for resources; if these are available, the job transfers into the allocated queues. Here we see the job running. Only the master node is shown; the number of nodes requested is indicated.

m248 benchmark # qstat
 job-ID prior   name       user         state submit/start at     queue                 slots 
------------------------------------------------------------------------------------------------
  17211 0.60500 benchmark  flengyel     r     09/22/2007 01:46:48 p3.q@m20.gc.cuny.edu  4


The following is a brief explanation of the output. Without arguments, qstat will print the status of all jobs of all users. The output shows the following, in order.

The job ID number
Priority of job
Name of job
ID of user who submitted job
State of the job: The job state can be
  r(unning)
  t(ransferring)
Submit or start time and date of the job
If running - the queue in which the job is running
The function of the running job (MASTER or SLAVE)
The job array task ID (if an array job)

Output files produced by SGE

In this example, the master node is m20; the other nodes allocated to the job are listed in the output file benchmark.po17211, as follows:

m248 benchmark # more benchmark.po17211
-catch_rsh /usr/local/sge/spool/m20/active_jobs/17211.1/pe_hostfile
m20
m22
m23
m24

In general, submitting a job with name $JOB_NAME will produce the following output files, where $JOB_ID is the id of the submitted job. Such files are associated with the SGE parallel execution environment called gauss.

$JOB_NAME.o$JOB_ID         standard output of the job
$JOB_NAME.e$JOB_ID         standard error output of the job
$JOB_NAME.po$JOB_ID        parallel execution standard output (hostfile list)
$JOB_NAME.pe$JOB_ID        parallel execution standard error output (hostfile list)
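
The naming pattern above can be written as a small helper that prints the four file names for a given job name and job id. This is only an illustration of the pattern; the function name sge_output_files is hypothetical.

```shell
# Print the names of the four SGE output files for a job name and job id,
# in the order: standard output, standard error, PE output, PE error.
sge_output_files() {
    local name="$1" id="$2" kind
    for kind in o e po pe; do
        printf '%s\n' "${name}.${kind}${id}"
    done
}

# Example: sge_output_files benchmark 17211
```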

For example, the following files are produced by SGE for our job, which has $JOB_ID 17211.

m248 benchmark # ls -la benchmark.*17211
-rw-r--r--  1 flengyel domusers 2167 Sep 22 01:52 benchmark.e17211
-rw-r--r--  1 flengyel domusers   65 Sep 22 01:46 benchmark.o17211
-rw-r--r--  1 flengyel domusers    0 Sep 22 01:46 benchmark.pe17211
-rw-r--r--  1 flengyel domusers   84 Sep 22 01:46 benchmark.po17211

The standard error file contains errors or informational messages. In this case, we specified the -v option to GAUSS_LFLAGS above, which gives us debugging information. Ordinarily the -v flag may be omitted.

m248 benchmark # more benchmark.e17211
setenv GAUSS_EXEDIR /usr/local/gaussian/g03/linda-exe:/usr/local/gaussian/g03/bs
d:/usr/local/gaussian/g03/private:/usr/local/gaussian/g03
g03 t_cp.com
ntsnet: using executable file /usr/local/gaussian/g03/linda-exe/l302.exel
ntsnet: trying to schedule 3 workers
ntsnet: scheduled a total of 3 workers
ntsnet: starting master process on m20.gc.cuny.edu
ntsnet: starting 1 worker on m22.gc.cuny.edu
ntsnet: starting 1 worker on m23.gc.cuny.edu
ntsnet: starting 1 worker on m24.gc.cuny.edu
ntsnet: using executable file /usr/local/gaussian/g03/linda-exe/l502.exel
ntsnet: trying to schedule 3 workers
ntsnet: scheduled a total of 3 workers
ntsnet: starting master process on m20.gc.cuny.edu
ntsnet: starting 1 worker on m22.gc.cuny.edu
ntsnet: starting 1 worker on m23.gc.cuny.edu
ntsnet: starting 1 worker on m24.gc.cuny.edu
ntsnet: using executable file /usr/local/gaussian/g03/linda-exe/l701.exel
ntsnet: trying to schedule 3 workers
ntsnet: scheduled a total of 3 workers
ntsnet: starting master process on m20.gc.cuny.edu
ntsnet: starting 1 worker on m22.gc.cuny.edu
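
When the -v flag is in use, the worker hosts can be summarized from the standard error file. The sketch below assumes the ntsnet message format shown above ("ntsnet: starting N worker on HOST"); the function name linda_worker_hosts is hypothetical.

```shell
# List the distinct hosts on which ntsnet reported starting a worker.
# The host name is taken from the last field of each matching line.
linda_worker_hosts() {
    awk '/^ntsnet: starting [0-9]+ worker/ { print $NF }' "$1" | sort -u
}

# Example: linda_worker_hosts benchmark.e17211
```

Comparing this list with the hosts in the .po file is a quick way to confirm that the workers were started on the machines that SGE allocated.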

Output files produced by Gaussian

There are also the associated log files and checkpoint files produced by Gaussian:

 m248 benchmark # ls -la t_cp.*
 -rw-r--r--  1 flengyel domusers 917504 Sep 22 01:58 t_cp.chk
 -rw-r--r--  1 flengyel domusers  82800 Sep 22 01:59 t_cp.log

The Gaussian log file: while the job runs

Here is the end of the t_cp.log log file for this job; one can also use a computational chemistry program such as molden to determine whether the job is converging (to be documented later). The convergence of a Gaussian job is often difficult to judge from an unaided reading of the log file.

m248 benchmark # tail t_cp.log
NBasis=   174 RedAO= T  NBF=   174
NBsUse=   174 1.00D-06 NBFU=   174
Precomputing XC quadrature grid using
IXCGrd= 2 IRadAn=           0 IRanWt=          -1 IRanGd=           0.
NRdTot=     567 NPtTot=       71832 NUsed=       74403 NTot=       74419
NSgBfM=   195   195   195   195.
Leave Link  302 at Sat Sep 22 02:00:01 2007, MaxMem=  104857600 cpu:       2.5
(Enter /usr/local/gaussian/g03/l303.exe)
DipDrv:  MaxL=1.
Leave Link  303 at Sat Sep 22 02:00:04 2007, MaxMem=  104857600 cpu:       0.2
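
While the job is running, the "Leave Link" lines give a rough indication of progress. The following sketch prints the number of the last link that Gaussian reported leaving; it assumes the "Leave Link NNN at ..." format shown above, and the function name last_link is hypothetical.

```shell
# Print the number of the most recent link that Gaussian reported leaving.
# Prints nothing if the log contains no "Leave Link" line yet.
last_link() {
    awk '/Leave Link/ { n = $3 } END { if (n != "") print n }' "$1"
}

# Example: last_link t_cp.log
```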

Execution on the MASTER host

Running a top command on the MASTER execution host m20, we see one l502.exel link running.

top - 02:01:42 up 117 days, 11:05,  1 user,  load average: 1.57, 1.42, 0.93
Tasks:  75 total,   2 running,  73 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.5% us,  0.5% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   1034676k total,  1031424k used,     3252k free,    62724k buffers
Swap:  1052216k total,      144k used,  1052072k free,    97428k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27969 flengyel  25   0  856m 810m 2804 R 99.3 80.2   1:10.31 l502.exel
 2645 sgeadmin  16   0  7772 2692 2116 S  1.0  0.3 130:18.68 sge_execd
28018 flengyel  16   0  5564 1448 1120 R  0.3  0.1   0:00.08 top
    1 root      16   0  1692  516  440 S  0.0  0.0   0:00.53 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:01.23 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:01.66 migration/1
    5 root      34  19     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/1
    6 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 events/0
    7 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 events/1

The Gaussian log file: when the job ends

We show a few lines from the beginning and the end of the log file t_cp.log, produced by Gaussian 03 for the input file t_cp.com above. The file begins as follows.

flengyel@m248 ~/benchmark $ view t_cp.log 
 
Entering Gaussian System, Link 0=g03
Input=t_cp.com
Output=t_cp.log
Initial command:
/usr/local/gaussian/g03/l1.exe /tmp/Gau-27708.inp -scrdir=/tmp/
Entering Link 1 = /usr/local/gaussian/g03/l1.exe PID=     27709.
 
Copyright (c) 1988,1990,1992,1993,1995,1998,2003,2004, Gaussian, Inc.
                 All Rights Reserved.

Over sixty-two thousand lines later, the log file ends with a reassuring final line, preceded by a random quotation.

flengyel@m248 ~/benchmark $ tail t_cp.log
85506,-0.0027655\PG=C01 [X(H6O3)]\\@
   
  
A MUSTACHE HAIR ACROSS A CHIP IS LIKE A REDWOOD
TREE FALLING THROUGH A HOUSING PROJECT.
- AN IBM MANAGER IN EAST FISHKILL, NEW YORK
  AS QUOTED IN "FROM SANDS TO CIRCUITS", IBM INNOVATION JANUARY 1985.
Job cpu time:  0 days  1 hours 35 minutes 48.2 seconds.
File lengths (MBytes):  RWF=     39 Int=      0 D2E=      0 Chk=     13 Scr=      1
Normal termination of Gaussian 03 at Sat Sep 22 13:19:48 2007.
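
The final line can also be tested from a script, for example before post-processing the results. This is a minimal sketch that assumes the "Normal termination of Gaussian" message appears within the last few lines of the log, as it does above; the function name gaussian_ok is hypothetical.

```shell
# Succeed if the last lines of a Gaussian log report normal termination.
gaussian_ok() {
    tail -n 5 "$1" | grep -q 'Normal termination of Gaussian'
}

# Example: gaussian_ok t_cp.log && echo "t_cp finished normally"
```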

Problems can be due to errors in the input file

NOTE: A problem with a Gaussian job doesn't necessarily indicate that the job scheduling system (SGE) or the operating system (Red Hat Linux) has had a failure. Unexpected results, such as abnormal link termination, are often caused by errors in the Gaussian .com input file.

A suggestion: use the line

#$ -q p3.q

to test Gaussian/Linda jobs on the P3 systems before running them on the 64-bit machines.