[Vagnekman-users] Ekman: Longest job

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Tue Jun 30 16:14:34 CEST 2009


Dear Ekman users!

Currently the longest job length allowed is 10 days,
i.e. 14400 minutes. If you anticipate that your
job will wait in the queue for at most one day, you
need to get an eleven-day Kerberos ticket before
submitting such a long job:

    kinit -f -l 11d username@NADA.KTH.SE


On the other hand, the longest Kerberos ticket
allowed is 30 days:

    kinit -f -l 30d username@NADA.KTH.SE
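
The arithmetic behind the ticket lengths above can be sketched in
a few lines of bash. This is only an illustration: the function
names days_to_minutes and ticket_days_needed are made up for this
example, and the 30-day limit is the one quoted above.

```bash
#!/bin/bash

# Convert whole days to minutes, the unit used for job lengths here.
days_to_minutes() {
    echo $(( $1 * 24 * 60 ))
}

# Ticket lifetime (in days) needed for one job:
# job length plus the expected wait in the queue.
# Fails if the result exceeds the 30-day ticket limit.
ticket_days_needed() {
    local job_days=$1 queue_days=$2
    local needed=$(( job_days + queue_days ))
    if [ "$needed" -gt 30 ]; then
        echo "needs ${needed} days, more than the 30-day limit" >&2
        return 1
    fi
    echo "$needed"
}

days_to_minutes 10        # prints 14400
ticket_days_needed 10 1   # prints 11
```

The second result is the number to put after "-l" in the kinit
command, followed by "d".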

If you have a restartable application that
checkpoints itself and can restart cleanly
from the checkpoint, you can make use of
those 30 days by running your application
as several chained jobs, e.g. seven-day
jobs, so that only one of them is active
at a time.
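
As a rough sketch of how many chained jobs fit on one ticket:
the function below (max_chained_jobs is a name invented for this
example) assumes that no time is lost waiting in the queue and
that each submission needs at least one full job length of ticket
lifetime left. Real queue waits will reduce the number.

```bash
#!/bin/bash

# Number of back-to-back jobs of job_days each that can be
# submitted on a ticket valid for ticket_days.
# Assumption: zero queue wait, and a submission is only accepted
# while at least job_days of ticket lifetime remain.
max_chained_jobs() {
    local ticket_days=$1 job_days=$2
    echo $(( ticket_days / job_days ))
}

max_chained_jobs 30 7    # prints 4
```

With a 30-day ticket and seven-day jobs, the fifth submission
would be attempted with only about two days of ticket left.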

There are at least two ways in which you
may chain your jobs together:

1/ Submit a new job at the end of the
job script.

2/ Submit several jobs in a sequence
and restrict each job to not start before
the previous job has exited. This is
done with the "-F" option to esubmit.

I will now give a short example of the
first alternative. [Thank you Klaus, I
borrowed part of your code.]


HOW TO SUBMIT ONE JOB FROM ANOTHER

Create a sufficiently long ticket before
submitting your first job in the chain.

Decide in which directory to run
your application. Your input files
must be available there, and you need
enough disk space and quota.

You also need a test in your job
script that decides whether to
submit the next job before exiting.
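
One possible form of such a test, as a minimal sketch: it assumes
that your application creates a marker file when the whole
computation is finished. The file name FINISHED and the function
name decide_resubmit are hypothetical; use whatever signal your
application actually provides.

```bash
#!/bin/bash

# Print 1 if another job should be submitted, 0 otherwise.
# Assumption: the application writes the file FINISHED into the
# run directory once no further restart is needed.
decide_resubmit() {
    local rundir=$1
    if [ -e "$rundir/FINISHED" ]; then
        echo 0
    else
        echo 1
    fi
}

# Usage inside a job script, after the application has run:
need_to_resubmit=$(decide_resubmit "${RUNDIR:-.}")
echo "need_to_resubmit=$need_to_resubmit"
```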

Here is the example:
=============================================
#! /bin/bash

source /pdc/modules/etc/init/bash

# working directory (=where all the data goes)
RUNDIR=/cfs/testscratch/l/lenzkar/test

# Go to run directory on scratch or nobackup disk
cd "$RUNDIR" || exit 1

# Do the actual work. (This is where your programs are started.)
# Replace the sleep statement with your own application.
# Depending on the outcome of the computation, decide whether to resubmit.
sleep 3600
need_to_resubmit=1

# Exit here unless the decision is to resubmit
if [ "$need_to_resubmit" -le 0 ]; then
        exit
fi

# The next job script to be submitted
runscript=$RUNDIR/resubmit_script

# Prepare to submit next job
module add easy
rsh=/usr/heimdal/bin/rsh
shost=ekman.pdc.kth.se
esubmit_program=$(type -p esubmit)

# Submit next job on host $shost
date
echo ${rsh} -F ${shost} ${esubmit_program} -n 1 -t 10080 $runscript
${rsh} -F ${shost} ${esubmit_program} -n 1 -t 10080 $runscript
exit
=============================================

It is started in this way:

    module add easy
    esubmit -n 1 -t 10080 ./resubmit_script

The same Kerberos ticket is used all the way. When the remaining
ticket lifetime becomes too short for the requested job length,
the esubmit call fails with a message like this:

    esubmit: failed: tight ticket life? expire in 56m04s (request is 168h.)
    esubmit: info: use -f to override.
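
The "168h" in the message is the requested 10080 minutes expressed
in hours. If you prefer to stop the chain yourself instead of
relying on this error, a check along the following lines could go
before the esubmit call. It is a sketch under assumptions: the
function names are invented here, the exact rule esubmit applies
is not stated above, and how you obtain the remaining ticket
lifetime in minutes (e.g. from klist output) depends on your
Kerberos installation and is left out.

```bash
#!/bin/bash

# Convert a walltime in minutes to whole hours.
minutes_to_hours() {
    echo $(( $1 / 60 ))
}

# Succeed if the remaining ticket lifetime covers the request.
# Both arguments are in minutes.
ticket_covers_request() {
    local remaining_minutes=$1 request_minutes=$2
    [ "$remaining_minutes" -ge "$request_minutes" ]
}

echo "10080 minutes = $(minutes_to_hours 10080)h"

# Example with the numbers from the message above (56 minutes left).
if ticket_covers_request 56 10080; then
    echo "submit next job"
else
    echo "ticket too short, stop the chain"
fi
```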

Of course you need to adapt the script to your application,
and that includes changing the walltime specification and
the number of nodes in the esubmit lines.

Please send any comments or questions to vagnekman-support at snic.vr.se.

Best regards,
-- Lennart Karlsson <vagnekman-support at snic.vr.se>
   National Supercomputer Centre, Linkoping University
   http://www.nsc.liu.se



