[Krypton-users] Krypton service stop over; queuing system reconfigured

Kent Engström kent at nsc.liu.se
Thu Nov 1 15:23:24 CET 2012


Dear Krypton Users,

The system is available again after today's queuing system
reconfiguration (and miscellaneous updating/maintenance on nodes and
switches).

There is news concerning node limits, specifying your group, running
risk jobs, and using fat nodes.


NODE LIMITS
-----------

We have implemented node limits between groups as requested via our
user representatives.

On the top level, there is a fixed division between

Group rossby - 73 nodes
Group sm_fouo - 73 nodes
Group fm - 15 nodes
Other groups - 67 nodes
Fat nodes - 12 nodes (more about them below)

So, for example, all users that run as "rossby" will
compete for the 73 nodes available, not less, not more. No other
group can run normal jobs on these nodes.

The other groups not mentioned specifically above will share 67
nodes. There are max node limits for each group there too, but these
node limits are "oversubscribed" (they add up to more than 67):

Group sm_foup - 59 nodes
Group sm_foua - 14 nodes
Group sm_foul - 30 nodes
Group sm_ml - 12 nodes
Group sm_mo - 12 nodes

This means that your jobs may be blocked either because too few nodes
are free in total, or because your group has hit its group-based
limit.
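One way to check how busy your group's share is with the standard
SLURM squeue client ("rossby" below is just an example account name):

```shell
# Jobs currently consuming nodes from the rossby share:
squeue -A rossby -t running

# Jobs waiting, possibly because the group limit has been reached:
squeue -A rossby -t pending
```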


SPECIFYING YOUR GROUP
---------------------

You may need to specify what group your job belongs to.  In SLURM
speak, that is an account, and you specify it with "-A accountname" to
sbatch, interactive, etc. For example:

  sbatch -N10 -t 2:0:0 -A rossby myscript.sh

When you login, we now try to set the environment variable
SBATCH_ACCOUNT to your default account, so if you are a member of
group only, you may not need to specify -A for normal jobs.
You can use "echo $SBATCH_ACCOUNT" to check the value.

If you can run jobs in several groups, you need to pass -A correctly,
or set SBATCH_ACCOUNT yourself.
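For example, a member of several groups could pick one account for the
whole session like this ("sm_fouo" is just an example account name):

```shell
# Check which account sbatch will use by default (set at login):
echo "$SBATCH_ACCOUNT"

# Override it for this session; later sbatch calls pick it up
# without needing an explicit -A flag:
export SBATCH_ACCOUNT=sm_fouo
echo "$SBATCH_ACCOUNT"
```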

You should *not* request partitions yourself using the -p flag, but
let the system handle this based on the account.


RUNNING RISK JOBS
-----------------

All groups on Krypton can submit risk jobs that are able to run on all
available nodes (even fat and huge ones). The drawback is that risk
jobs will be killed as soon as the nodes they run on are needed to
run a non-risk job.

To submit risk jobs, add "_risk" to your account name. For example, if
you are part of the sm_fouo group, use "-A sm_fouo_risk" on the
sbatch command line.

If the risk job is a batch job (not interactive), you will probably
also want to add "--requeue", so that the risk job is requeued
automatically if it is preempted. Without this flag, the job will be
canceled as soon as it is preempted.
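Putting the pieces together, a risk batch job for the sm_fouo group
might be submitted like this (the node count, time limit, and script
name are just examples):

```shell
# Risk job: runs on any available node, preempted when a non-risk
# job needs the nodes; --requeue puts it back in the queue instead
# of canceling it.
sbatch -N10 -t 2:0:0 -A sm_fouo_risk --requeue myscript.sh
```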


USING FAT NODES
---------------

The 12 nodes with more memory are kept in a separate partition and are
available to all groups. To use them, add "_fat" to your account
name. For example, if you are part of the sm_foup group, use "-A
sm_foup_fat" when you submit jobs.

The 10 fat nodes (128 GiB) will be used before the 2 huge nodes (256
GiB). If you need the huge nodes, use "-C huge" in addition to the
"-A" option discussed above.
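A fat-node submission could then look like this (the node count, time
limit, and script name are examples; drop "-C huge" unless 128 GiB
per node is not enough):

```shell
# Fat partition via the _fat account suffix; -C huge restricts the
# job to the two 256 GiB nodes.
sbatch -N2 -t 4:0:0 -A sm_foup_fat -C huge myscript.sh
```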


REPORTING PROBLEMS
------------------

We have tried to test the changes before implementing them (on a test
cluster with simulated nodes), but this is a big change and there
might be problems.

If something seems broken, please report it to
smhi-support at nsc.liu.se.  As always, tell us where you have been
running, what you have tried to do, what happened, and what you think
should have happened.


Regards,
/ Kent Engström, NSC


