[Vagnekman-users] Vagn: planned maintenance, better docs and something you should not do...

Mats Kronberg kronberg at nsc.liu.se
Fri Jan 20 17:25:20 CET 2012


Dear Vagn users,

Three pieces of information for you:

1: Next Tuesday (2012-01-24) we will upgrade the batch scheduler (SLURM) on
Vagn to version 2.3. If all goes well, the only thing you will notice is
that you will not be able to start new interactive jobs, schedule batch
jobs, check job status, etc. for a few minutes sometime before lunch.

We have tested the upgrade on a test system, but no tests can be 100%
realistic, so something might still go wrong. The worst thing that can
realistically happen is that all running and queued jobs are killed and
need to be re-run/re-submitted, but I consider the risk of that happening
to be low. However, if you have VERY time-critical jobs that need to run
next week, let me know and we can reschedule this upgrade.


2: The Vagn User Guide (http://www.nsc.liu.se/systems/vagn/) has been
updated. I have added some sections (e.g. how to submit lots of batch jobs
without hogging all Vagn nodes) that I have written about before but only
sent out by email to some users.

Please let me know if you find anything wrong, something missing that was
in the old User Guide or some important subject that you think we should
document better. (If you want to check something in the old User Guide, it
has been saved as
http://www.nsc.liu.se/systems/vagn/userguide-2012-01-18.html).


3: IMPORTANT: If you start an interactive or batch job on an analysis node
and then log in to that node in a new window using SSH, anything started
from that SSH login will NOT be subject to the normal limitations on job
time and memory size (i.e. those processes will not be killed when your job
ends, and they will not be killed if they exceed your job's memory limit).

This means that processes started from an SSH login to an analysis node can
cause the node to run out of memory because they are not counted towards
the memory limit used by SLURM to determine how many jobs can run on a
node. This is now documented on http://www.nsc.liu.se/systems/vagn/#sec-4-5
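
To make the accounting problem concrete, here is a small Python sketch. It
is only an illustration: the function name and all the numbers (a 32 GB
node, 16 GB job limits, 8 GB started over SSH) are hypothetical and do not
describe the actual Vagn configuration.

```python
def node_memory_view(node_mem_gb, job_limits_gb, ssh_usage_gb):
    """Compare what the scheduler accounts for with what may really be used.

    All values are hypothetical and given in GB.
    """
    accounted = sum(job_limits_gb)    # memory limits counted when placing jobs
    unaccounted = sum(ssh_usage_gb)   # processes started from SSH logins
    return {
        "accounted_gb": accounted,
        "fits_according_to_scheduler": accounted <= node_mem_gb,
        "possible_total_gb": accounted + unaccounted,
        "node_can_run_out": accounted + unaccounted > node_mem_gb,
    }


# Hypothetical node with 32 GB: two jobs with 16 GB limits fill it exactly,
# so an extra 8 GB started over SSH can push it past physical memory.
view = node_memory_view(32, [16, 16], [8])
```

In this example the scheduler still considers the node correctly filled,
while the memory that may actually be in use exceeds what the node has.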

This has caused at least one node to run out of memory recently, but it
might have been responsible for more out-of-memory incidents in the past.

SSH logins to an analysis node are only permitted for checking or debugging
your "real" jobs running on that node (e.g. ls, cat, top, ps, gdb, ...),
not for running matlab, paraview, cdo, ...
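
If you have your own Python wrapper scripts that start heavy programs, a
guard along these lines might help you avoid starting them from an SSH
login by mistake. This is a minimal sketch, not an NSC-provided tool: it
assumes that SLURM sets the SLURM_JOB_ID environment variable inside your
interactive and batch jobs and that a plain SSH login shell does not have
it set. The function names are made up for this example.

```python
import os
import sys


def in_slurm_job(env=None):
    """Return True if the environment looks like it belongs to a SLURM job.

    Assumption: SLURM_JOB_ID is set inside jobs and absent in SSH logins.
    """
    env = os.environ if env is None else env
    return bool(env.get("SLURM_JOB_ID"))


def require_slurm_job(env=None):
    """Exit with an error message unless we appear to be inside a SLURM job."""
    if not in_slurm_job(env):
        sys.exit("Not inside a SLURM job: please start this from an "
                 "interactive or batch job, not from an SSH login.")
```

Calling require_slurm_job() at the top of such a wrapper script would stop
it early when it is started from an SSH login. The check is easy to bypass,
so it is only a reminder, not an enforcement mechanism.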

If this loophole turns out to be a problem we can probably plug it with
some clever SLURM hack, but I would prefer to spend my time on other
things, so I hope all users will respect this and not use SSH logins in the
way I have described above.


-- 
Mats Kronberg, NSC Support <vagnekman-support at snic.vr.se>