Submitting MPICH jobs to the Sun Grid Engine

This page describes how to compile a simple MPICH test program and submit it to the Sun Grid Engine (SGE).

The program source is called pingpong.f; its full path is

/home/m254/apps/mpich/mpich-1.2.7/examples/test/pt2pt/pingpong.f

mpich version 1.2.7 has been installed under

~apps/mpich/mpich-1.2.7

which expands to

/home/m254/apps/mpich/mpich-1.2.7

Add the bin subdirectory of this installation to your path; in the bash shell, this can be done as follows.

export PATH=~apps/mpich/mpich-1.2.7/bin:$PATH
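If the export line is placed in a startup file that may be read more than once, a small guard keeps the directory from being added repeatedly. The following is a minimal sketch; the helper name prepend_path is hypothetical (it is not part of mpich or SGE), and the directory is the expanded install path given above.

```shell
#!/bin/bash
# prepend_path DIR: put DIR at the front of PATH unless it is already listed.
# (prepend_path is an illustrative helper, not an mpich or SGE command.)
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path /home/m254/apps/mpich/mpich-1.2.7/bin

# Report which mpif77 the shell would now run, if any.
command -v mpif77 || echo "mpif77 not found on PATH"
```

On the cluster, the last command should report a path under mpich-1.2.7/bin if the installation is where this page says it is.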

The program pingpong.f can be compiled with the script mpif77, which hides some of the details of compiling and linking MPI programs. Because PATH now lists the mpich-1.2.7 bin directory first, mpif77 compiles and links the program against that version of mpich.

mpif77 -o pong pingpong.f

The following sample script, pongtest.job, can be used to submit the job to the SGE with qsub:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N pong
#$ -pe mpich 2
#$ -v MPICH_HOME=/home/m254/apps/mpich/mpich-1.2.7,SGE_QMASTER_PORT
 
$MPICH_HOME/bin/mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./pong

Note added May 24, 2008: the line

#$ -S /bin/bash

is needed so that the user's identity can be determined.


The line

#$ -pe mpich 2

in the script requests the mpich parallel execution environment with 2 slots. The parallel execution environment calls /usr/local/sge/mpi/startmpi.sh before the pongtest.job script runs, and calls /usr/local/sge/mpi/stopmpi.sh after pongtest.job exits. The environment variable $NSLOTS is defined for the job; the preceding line sets it to 2, which is the number of SLAVE processes (slots) desired. This count does not include the MASTER entry that qstat lists for the job. The startmpi.sh script also creates the file $TMPDIR/machines on the master node of the job; SGE communicates the list of nodes allocated to mpirun through this file.
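To make the role of $TMPDIR/machines more concrete, the following sketch mimics the kind of conversion a start script can perform. It assumes a host list in the format of SGE's $PE_HOSTFILE (host name, slot count, queue, and processor range on each line) and writes each host name once per slot. It is an illustration under that assumption, not the contents of /usr/local/sge/mpi/startmpi.sh; the function name hostfile_to_machines and the queue name in the sample data are hypothetical.

```shell
#!/bin/bash
# hostfile_to_machines FILE: print each host name once per slot.
# FILE is assumed to hold lines of the form: host nslots queue processor-range
hostfile_to_machines() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Sample data only: two hosts with one slot each (queue name is made up).
sample=$(mktemp)
printf '%s\n' 'm01 1 example.q UNDEFINED' 'm07 1 example.q UNDEFINED' > "$sample"
hostfile_to_machines "$sample"
rm -f "$sample"
```

With one slot on each of two hosts, the resulting list has two lines, which matches the -np $NSLOTS argument passed to mpirun in the job script.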

If you create this script, remember to make it executable:

 chmod +x pongtest.job

Here is a sample job submission to the SGE on monad:

[flengyel@m254 flengyel]$ qsub pongtest.job
your job 32173 ("pong") has been submitted
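When a submission is scripted, the job number can be picked out of the qsub message so that it can be passed to later commands. The sketch below assumes the message has the form shown above; the function name job_id_from_qsub is hypothetical.

```shell
#!/bin/bash
# job_id_from_qsub: read a qsub message on standard input and print the
# job number, assuming the form: your job 32173 ("pong") has been submitted
job_id_from_qsub() {
    awk 'tolower($1) == "your" && $2 == "job" { print $3; exit }'
}

# Demonstration on the message shown on this page.
jobid=$(echo 'your job 32173 ("pong") has been submitted' | job_id_from_qsub)
echo "job id: $jobid"
echo "output file: pong.o$jobid"

# In a real session one might write:
#   jobid=$(qsub pongtest.job | job_id_from_qsub)
```

The output file name is built from the job name given with -N and the job number, as in the pong.o32173 file examined later on this page.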

A qstat shows the job transferring:

[flengyel@m254 flengyel]$ qstat -u flengyel
job-ID  prior name       user         state submit/start at     queue       master  ja-task-ID
---------------------------------------------------------------------------------------------
 32173     0 pong       flengyel     t     10/05/2005 00:11:40 default01. MASTER
           0 pong       flengyel     t     10/05/2005 00:11:40 default01. SLAVE
 32173     0 pong       flengyel     t     10/05/2005 00:11:40 default07. SLAVE


This shows that the slot count requested with -pe, and hence $NSLOTS, corresponds to the number of SLAVE entries; the MASTER entry is not included.
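The count can be checked mechanically by tallying the last column of the qstat listing. The following is a minimal sketch that assumes the column layout shown above, with MASTER or SLAVE as the last field; the function name count_role is hypothetical, and the sample lines are copied from the listing on this page.

```shell
#!/bin/bash
# count_role ROLE: read qstat output on standard input and count the
# lines whose last field equals ROLE (MASTER or SLAVE).
count_role() {
    awk -v role="$1" '$NF == role { n++ } END { print n + 0 }'
}

listing=' 32173     0 pong       flengyel     t     10/05/2005 00:11:40 default01. MASTER
           0 pong       flengyel     t     10/05/2005 00:11:40 default01. SLAVE
 32173     0 pong       flengyel     t     10/05/2005 00:11:40 default07. SLAVE'

printf '%s\n' "$listing" | count_role SLAVE    # prints 2
printf '%s\n' "$listing" | count_role MASTER   # prints 1
```

In a live session the listing would come from qstat -u followed by the user name, piped into count_role.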

Another qstat shows the job running:

[flengyel@m254 flengyel]$ qstat -u flengyel
job-ID  prior name       user         state submit/start at     queue       master  ja-task-ID
---------------------------------------------------------------------------------------------
 32173     0 pong       flengyel     r     10/05/2005 00:11:40 default01. MASTER
           0 pong       flengyel     r     10/05/2005 00:11:40 default01. SLAVE
 32173     0 pong       flengyel     r     10/05/2005 00:11:40 default07. SLAVE

The output file shows the result:

[flengyel@m254 flengyel]$ cat pong.o32173
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/usr/local/sge/bin/glinux/qrsh -inherit -nostdin m07 /home/m254/flengyel/./pong
m01.gc.cuny.edu 51579 \-p4amslave \-p4yourname m07 \-p4rmrank 1
PE = 1, iter = 1, sample =   1, num =   125
PE = 1, iter = 1, sample =   2, num =   250
PE = 1, iter = 1, sample =   3, num =   375
PE = 1, iter = 1, sample =   4, num =   500
PE = 1, iter = 1, sample =   5, num =   625
PE = 1, iter = 1, sample =   6, num =   750
PE = 1, iter = 1, sample =   7, num =   875
PE = 1, iter = 1, sample =   8, num =  1000
PE = 1, iter = 1, sample =   9, num =  1125
PE = 1, iter = 1, sample =  10, num =  1250
PE = 1, iter = 1, sample =  11, num =  1375
PE = 1, iter = 1, sample =  12, num =  1500
PE = 1, iter = 1, sample =  13, num =  1625
PE = 1, iter = 1, sample =  14, num =  1750
PE = 1, iter = 1, sample =  15, num =  1875
PE = 1, iter = 1, sample =  16, num =  2000
PE = 1, iter = 1, sample =  17, num =  2125
PE = 1, iter = 1, sample =  18, num =  2250
PE = 1, iter = 1, sample =  19, num =  2375
PE = 1, iter = 1, sample =  20, num =  2500
PE = 1, iter = 1, sample =  21, num =  2625
PE = 1, iter = 1, sample =  22, num =  2750
PE = 1, iter = 1, sample =  23, num =  2875
PE = 1, iter = 1, sample =  24, num =  3000
PE = 1, iter = 1, sample =  25, num =  3125
PE = 1, iter = 1, sample =  26, num =  3250
PE = 1, iter = 1, sample =  27, num =  3375
PE = 1, iter = 1, sample =  28, num =  3500
PE = 1, iter = 1, sample =  29, num =  3625
PE = 1, iter = 1, sample =  30, num =  3750
PE = 1, iter = 1, sample =  31, num =  3875
PE = 1, iter = 1, sample =  32, num =  4000
PE = 1, iter = 1, sample =  33, num =  4125
PE = 1, iter = 1, sample =  34, num =  4250
PE = 1, iter = 1, sample =  35, num =  4375
PE = 1, iter = 1, sample =  36, num =  4500
PE = 1, iter = 1, sample =  37, num =  4625
PE = 1, iter = 1, sample =  38, num =  4750
PE = 1, iter = 1, sample =  39, num =  4875
PE = 1, iter = 1, sample =  40, num =  5000
PE = 1, iter = 2, sample =   1, num =   125
PE = 1, iter = 2, sample =   2, num =   250
PE = 1, iter = 2, sample =   3, num =   375
PE = 1, iter = 2, sample =   4, num =   500
PE = 1, iter = 2, sample =   5, num =   625
PE = 1, iter = 2, sample =   6, num =   750
PE = 1, iter = 2, sample =   7, num =   875
PE = 1, iter = 2, sample =   8, num =  1000
PE = 1, iter = 2, sample =   9, num =  1125
PE = 1, iter = 2, sample =  10, num =  1250
PE = 1, iter = 2, sample =  11, num =  1375
PE = 1, iter = 2, sample =  12, num =  1500
PE = 1, iter = 2, sample =  13, num =  1625
PE = 1, iter = 2, sample =  14, num =  1750
PE = 1, iter = 2, sample =  15, num =  1875
PE = 1, iter = 2, sample =  16, num =  2000
PE = 1, iter = 2, sample =  17, num =  2125
PE = 1, iter = 2, sample =  18, num =  2250
PE = 1, iter = 2, sample =  19, num =  2375
PE = 1, iter = 2, sample =  20, num =  2500
PE = 1, iter = 2, sample =  21, num =  2625
PE = 1, iter = 2, sample =  22, num =  2750
PE = 1, iter = 2, sample =  23, num =  2875
PE = 1, iter = 2, sample =  24, num =  3000
PE = 1, iter = 2, sample =  25, num =  3125
PE = 1, iter = 2, sample =  26, num =  3250
PE = 1, iter = 2, sample =  27, num =  3375
PE = 1, iter = 2, sample =  28, num =  3500
PE = 1, iter = 2, sample =  29, num =  3625
PE = 1, iter = 2, sample =  30, num =  3750
PE = 1, iter = 2, sample =  31, num =  3875
PE = 1, iter = 2, sample =  32, num =  4000
PE = 1, iter = 2, sample =  33, num =  4125
PE = 1, iter = 2, sample =  34, num =  4250
PE = 1, iter = 2, sample =  35, num =  4375
PE = 1, iter = 2, sample =  36, num =  4500
PE = 1, iter = 2, sample =  37, num =  4625
PE = 1, iter = 2, sample =  38, num =  4750
PE = 1, iter = 2, sample =  39, num =  4875
PE = 1, iter = 2, sample =  40, num =  5000
 MPI pong test
 samples =  40
 initsamplesize =  125
 samplesizeinc =  125
 msgspersample =  100
 ibufcount =  5000
clock resolution = .20000E-05
 
PE = 0, iter = 1, sample =   1, num =   125
PE = 0, iter = 1, sample =   2, num =   250
PE = 0, iter = 1, sample =   3, num =   375
PE = 0, iter = 1, sample =   4, num =   500
PE = 0, iter = 1, sample =   5, num =   625
PE = 0, iter = 1, sample =   6, num =   750
PE = 0, iter = 1, sample =   7, num =   875
PE = 0, iter = 1, sample =   8, num =  1000
PE = 0, iter = 1, sample =   9, num =  1125
PE = 0, iter = 1, sample =  10, num =  1250
PE = 0, iter = 1, sample =  11, num =  1375
PE = 0, iter = 1, sample =  12, num =  1500
PE = 0, iter = 1, sample =  13, num =  1625
PE = 0, iter = 1, sample =  14, num =  1750
PE = 0, iter = 1, sample =  15, num =  1875
PE = 0, iter = 1, sample =  16, num =  2000
PE = 0, iter = 1, sample =  17, num =  2125
PE = 0, iter = 1, sample =  18, num =  2250
PE = 0, iter = 1, sample =  19, num =  2375
PE = 0, iter = 1, sample =  20, num =  2500
PE = 0, iter = 1, sample =  21, num =  2625
PE = 0, iter = 1, sample =  22, num =  2750
PE = 0, iter = 1, sample =  23, num =  2875
PE = 0, iter = 1, sample =  24, num =  3000
PE = 0, iter = 1, sample =  25, num =  3125
PE = 0, iter = 1, sample =  26, num =  3250
PE = 0, iter = 1, sample =  27, num =  3375
PE = 0, iter = 1, sample =  28, num =  3500
PE = 0, iter = 1, sample =  29, num =  3625
PE = 0, iter = 1, sample =  30, num =  3750
PE = 0, iter = 1, sample =  31, num =  3875
PE = 0, iter = 1, sample =  32, num =  4000
PE = 0, iter = 1, sample =  33, num =  4125
PE = 0, iter = 1, sample =  34, num =  4250
PE = 0, iter = 1, sample =  35, num =  4375
PE = 0, iter = 1, sample =  36, num =  4500
PE = 0, iter = 1, sample =  37, num =  4625
PE = 0, iter = 1, sample =  38, num =  4750
PE = 0, iter = 1, sample =  39, num =  4875
PE = 0, iter = 1, sample =  40, num =  5000
 
 iter =  1
 
 least squares fit:  time = a + b * (msg length)
    a = latency =   879.56 microseconds
    b = inverse bandwidth =  0.08785 secs/Mbyte
    1/b = bandwidth =    11.38 Mbytes/sec
 
    message         observed          fitted
 length(bytes)     time(usec)       time(usec)
 
     1000.           833.68           967.41
     2000.           924.87          1055.26
     3000.          1207.22          1143.12
     4000.          1025.00          1230.97
     5000.          1402.87          1318.82
     6000.          1494.41          1406.68
     7000.          1462.21          1494.53
     8000.          1719.21          1582.38
     9000.          1992.04          1670.24
    10000.          1999.90          1758.09
    11000.          2045.58          1845.94
    12000.          2108.91          1933.80
    13000.          1394.00          2021.65
    14000.          2105.69          2109.50
    15000.          1921.51          2197.35
    16000.          1717.27          2285.21
    17000.          2671.90          2373.06
    18000.          2687.51          2460.91
    19000.          2729.57          2548.77
    20000.          2740.12          2636.62
    21000.          2734.20          2724.47
    22000.          2536.79          2812.33
    23000.          2588.43          2900.18
    24000.          2856.03          2988.03
    25000.          3428.03          3075.89
    26000.          3418.59          3163.74
    27000.          3461.11          3251.59
    28000.          3510.78          3339.45
    29000.          3472.08          3427.30
    30000.          3484.46          3515.15
    31000.          3343.21          3603.00
    32000.          3162.13          3690.86
    33000.          3578.37          3778.71
    34000.          4182.80          3866.56
    35000.          4209.18          3954.42
    36000.          4211.38          4042.27
    37000.          4251.07          4130.12
    38000.          4261.55          4217.98
    39000.          4250.22          4305.83
    40000.          4097.96          4393.68
PE = 0, iter = 2, sample =   1, num =   125
PE = 0, iter = 2, sample =   2, num =   250
PE = 0, iter = 2, sample =   3, num =   375
PE = 0, iter = 2, sample =   4, num =   500
PE = 0, iter = 2, sample =   5, num =   625
PE = 0, iter = 2, sample =   6, num =   750
PE = 0, iter = 2, sample =   7, num =   875
PE = 0, iter = 2, sample =   8, num =  1000
PE = 0, iter = 2, sample =   9, num =  1125
PE = 0, iter = 2, sample =  10, num =  1250
PE = 0, iter = 2, sample =  11, num =  1375
PE = 0, iter = 2, sample =  12, num =  1500
PE = 0, iter = 2, sample =  13, num =  1625
PE = 0, iter = 2, sample =  14, num =  1750
PE = 0, iter = 2, sample =  15, num =  1875
PE = 0, iter = 2, sample =  16, num =  2000
PE = 0, iter = 2, sample =  17, num =  2125
PE = 0, iter = 2, sample =  18, num =  2250
PE = 0, iter = 2, sample =  19, num =  2375
PE = 0, iter = 2, sample =  20, num =  2500
PE = 0, iter = 2, sample =  21, num =  2625
PE = 0, iter = 2, sample =  22, num =  2750
PE = 0, iter = 2, sample =  23, num =  2875
PE = 0, iter = 2, sample =  24, num =  3000
PE = 0, iter = 2, sample =  25, num =  3125
PE = 0, iter = 2, sample =  26, num =  3250
PE = 0, iter = 2, sample =  27, num =  3375
PE = 0, iter = 2, sample =  28, num =  3500
PE = 0, iter = 2, sample =  29, num =  3625
PE = 0, iter = 2, sample =  30, num =  3750
PE = 0, iter = 2, sample =  31, num =  3875
PE = 0, iter = 2, sample =  32, num =  4000
PE = 0, iter = 2, sample =  33, num =  4125
PE = 0, iter = 2, sample =  34, num =  4250
PE = 0, iter = 2, sample =  35, num =  4375
PE = 0, iter = 2, sample =  36, num =  4500
PE = 0, iter = 2, sample =  37, num =  4625
PE = 0, iter = 2, sample =  38, num =  4750
PE = 0, iter = 2, sample =  39, num =  4875
PE = 0, iter = 2, sample =  40, num =  5000
 
 iter =  2
 
 least squares fit:  time = a + b * (msg length)
    a = latency =   846.06 microseconds
    b = inverse bandwidth =  0.08895 secs/Mbyte
    1/b = bandwidth =    11.24 Mbytes/sec
  
    message         observed          fitted
 length(bytes)     time(usec)       time(usec)
  
     1000.           863.99           935.02
     2000.           917.76          1023.97
     3000.          1027.90          1112.93
     4000.          1022.51          1201.88
     5000.          1092.50          1290.84
     6000.          1422.13          1379.79
     7000.          1609.93          1468.75
     8000.          1738.83          1557.70
     9000.          1978.99          1646.66
    10000.          1992.19          1735.61
    11000.          2049.36          1824.56
    12000.          2112.09          1913.52
    13000.          1393.49          2002.47
    14000.          2109.73          2091.43
    15000.          1945.51          2180.38
    16000.          1708.46          2269.34
    17000.          2668.72          2358.29
    18000.          2686.89          2447.25
    19000.          2729.97          2536.20
    20000.          2728.82          2625.16
    21000.          2732.29          2714.11
    22000.          2545.97          2803.06
    23000.          2569.63          2892.02
    24000.          2842.12          2980.97
    25000.          3413.25          3069.93
    26000.          3424.13          3158.88
    27000.          3447.60          3247.84
    28000.          3490.27          3336.79
    29000.          3477.10          3425.75
    30000.          3483.79          3514.70
    31000.          3372.24          3603.66
    32000.          3191.49          3692.61
    33000.          3609.01          3781.56
    34000.          4167.97          3870.52
    35000.          4183.22          3959.47
    36000.          4215.16          4048.43
    37000.          4235.19          4137.38
    38000.          4255.42          4226.34
    39000.          4242.77          4315.29
    40000.          4086.90          4404.25
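The fitted column can be reproduced from the reported coefficients with the formula time = a + b * (msg length). The sketch below assumes that 1 Mbyte means 10^6 bytes, so that b in secs/Mbyte is numerically the same as microseconds per byte; the function name fitted_usec is hypothetical. Because a and b are printed in rounded form, recomputed values can differ from the table in the last digits.

```shell
#!/bin/bash
# fitted_usec A B LEN: evaluate time = a + b * (msg length).
#   A   = latency in microseconds
#   B   = inverse bandwidth in secs/Mbyte (taken as microseconds per byte,
#         assuming 1 Mbyte = 10^6 bytes)
#   LEN = message length in bytes
fitted_usec() {
    awk -v a="$1" -v b="$2" -v n="$3" 'BEGIN { printf "%.2f\n", a + b * n }'
}

# iter 1, message length 1000 bytes: the table lists 967.41
fitted_usec 879.56 0.08785 1000

# iter 2, message length 1000 bytes: the table lists 935.02
fitted_usec 846.06 0.08895 1000
```

The reported bandwidth is the reciprocal of b in the same way: 1/0.08785 is about 11.38 Mbytes/sec for the first iteration.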