[Vagnekman-users] Vagn: The new fat nodes are available

Thu Sep 22 14:17:28 CEST 2011

Dear Vagn users,

The new fat nodes a7 and a8 are now available for general use (since
yesterday afternoon). They will be automatically used by jobs that
require large amounts of RAM, or when all other nodes are in use.

Quick HOWTO: Use "interactive --mem=N ..." or "sbatch --mem=N ..." to
request N megabytes of RAM, and the system will automatically start
your job on a fat node if necessary.

Now, while you are in a good mood and I have your attention... please
continue reading for some information on how to use Vagn efficiently
and without causing problems for other users.

The login node
==============
Do not run applications on the login node! It is reserved for
submitting jobs and data transfers to and from the system.

The Time Limit
==============
The default time limit for interactive and batch jobs is 6 days. This
is also the maximum allowed time limit.

If you are not planning to actually run your job or interactive
session for six days, please consider requesting less time. It helps
other users estimate how long they might have to wait in the queue.
For interactive jobs it also reduces the amount of wasted computing
time if you forget to close your interactive session.

E.g "interactive -t 3:30:00" will give you a time limit of 3 hours 30 minutes.

Memory and CPU allocation
=========================
The default allocation is one CPU core and 4GB of RAM for all job
types (batch and interactive).

If you use more than the allocated memory, your job will automatically
be killed. The CPU core limit is not enforced; it is only used to
decide how many jobs can be started on the same node. Please do not
use more CPU cores than you have requested, or other users on the same
node might be affected.

To allocate more or less RAM than the default 4GB, use the --mem=<MB>
option, e.g:

"interactive --mem=24000 -t 8:00:00" will give you one core and 24GB
RAM for 8 hours.

Note: many applications can use more than one CPU core. If you want to
estimate how much CPU your job is using, run "top" and check the %CPU
column for your processes. 100% CPU is equivalent to one full CPU
core. If you run applications that routinely use more than one CPU
core, please request an appropriate number of cores when submitting
the job.
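
As a rough rule of thumb, the %CPU figure from "top" can be turned
into a core count by dividing by 100 and rounding up. A small sketch
(the function name "cores_for_cpu" is made up for this example, and it
expects a whole-number percentage):

```shell
#!/bin/bash
# Convert a whole-number %CPU value (as shown by "top") into the number
# of cores to request: divide by 100 and round up, with a minimum of 1.
# Usage: cores_for_cpu PERCENT
cores_for_cpu() {
    local pct=$1
    local cores=$(( (pct + 99) / 100 ))
    if [ "$cores" -lt 1 ]; then
        cores=1
    fi
    echo "$cores"
}

cores_for_cpu 100   # exactly one full core
cores_for_cpu 370   # between three and four cores, so request four
```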

To allocate more than one CPU core on a single node, use "-n" (number
of cores) and "-N1-1" (to avoid getting cores spread out over more
than one node), e.g:

"interactive --mem=24000 -n 4 -N1-1 -t 8:00:00" will give you four
cores and 24GB RAM on one node for 8 hours.

The same options are used for batch jobs, e.g "sbatch --mem=24000 -t 8:00:00".
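
For batch jobs the options can also be written into the job script
itself as "#SBATCH" lines, so they do not have to be repeated on the
command line. A minimal sketch, assuming a job that needs four cores
and 24GB RAM on one node for 8 hours; the job name "example" and the
echo payload are placeholders, not anything specific to Vagn:

```shell
#!/bin/bash
#SBATCH -J example
#SBATCH --mem=24000
#SBATCH -n 4
#SBATCH -N1-1
#SBATCH -t 8:00:00

# The #SBATCH lines must come before the first command in the script.
# Placeholder payload: replace this with your real application.
node=$(uname -n)
echo "Job running on $node"
```

Saved as e.g "job.sh", this would be submitted with "sbatch job.sh".
Options given on the sbatch command line override the ones in the
script.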

The noshare partition, and why you should avoid using it
========================================================
If for some reason you require a whole node dedicated to one job, use
the "noshare" partition. When using the noshare partition you must
also request a particular node type (thin=32GB/8 cores, fat=64GB/8
cores, huge=256GB/32 cores) using the -C option. E.g:

"interactive -p noshare -C thin -t 8:00:00" will give you exclusive
access to one 32GB node for 8 hours.

Do not use the noshare partition unless you know that you really need
a whole node for your job!

Note: using --mem instead of the noshare partition will ensure that
your job is started as soon as possible. Why? There is room for 22
jobs submitted with "--mem=32000" in the cluster, but only four jobs
submitted with "-p noshare -C thin". Both will give you 32GB RAM, but
you might have to wait longer in the queue when using noshare.

Limitations on memory and core availability
===========================================
If you request more than 32186 MB RAM your job can only be run on one
of the fat nodes (a6, a7 or a8), which might result in you having to
wait longer for the job to start.

If you request more than 64372 MB RAM your job can only be run on one
of the new fat nodes (a7 or a8), which might result in you having to
wait longer for the job to start.

If you request more than 8 cores on a single node your job can only be
run on one of the new fat nodes, which might increase your queue time.

Unless you are planning to run a job over multiple nodes (e.g using
MPI), you should never request more than 32 cores or 256000 MB RAM
(the maximum that is available in a single fat node).
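
The memory thresholds above can be summarised in a small helper. This
is only a sketch for checking a request before you submit it: the
function name "nodes_for_mem" is made up for this example, and "any
node" assumes that requests up to 32186 MB fit on every node type.

```shell
#!/bin/bash
# Report which nodes can run a single-node job requesting MB megabytes,
# using the thresholds quoted above (32186, 64372 and 256000 MB).
# Usage: nodes_for_mem MB
nodes_for_mem() {
    local mb=$1
    if [ "$mb" -le 32186 ]; then
        echo "any node"
    elif [ "$mb" -le 64372 ]; then
        echo "fat nodes only (a6, a7 or a8)"
    elif [ "$mb" -le 256000 ]; then
        echo "new fat nodes only (a7 or a8)"
    else
        echo "does not fit on a single node"
    fi
}

nodes_for_mem 24000
nodes_for_mem 48000
nodes_for_mem 100000
```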

Running batch jobs without annoying other users
===============================================
Since Vagn is used for both interactive and batch jobs, batch job
users should be careful and not run so many concurrent jobs that it
becomes impossible for other users to start jobs.

Note: because the queue in Vagn is a true FIFO queue, one user can
easily block the entire system for other users by submitting a large
volume of jobs at once.

If you want to submit a lot of jobs at once without disturbing other
users, you can do so by limiting the number of concurrent jobs: use
job naming and the "--dependency=singleton" feature, e.g:

sbatch -J batch1 --dependency=singleton job1.sh
sbatch -J batch2 --dependency=singleton job2.sh
sbatch -J batch3 --dependency=singleton job3.sh
sbatch -J batch1 --dependency=singleton job4.sh
sbatch -J batch2 --dependency=singleton job5.sh
sbatch -J batch3 --dependency=singleton job6.sh
(this submits 6 jobs but no more than 3 can run at any one time)
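
With many job scripts, typing the job names by hand gets tedious. The
sketch below prints the sbatch commands for any number of scripts,
cycling through a fixed number of job names so that at most that many
jobs run at once. The function name "print_singleton_cmds" and the
name prefix "batch" are made up for this example, and it assumes
script names without spaces. It only prints the commands, so you can
inspect them before running them.

```shell
#!/bin/bash
# Print sbatch commands that cycle through MAX job names (batch1..batchMAX).
# With --dependency=singleton a job waits for earlier jobs with the same
# name and user, so at most MAX of the submitted jobs run concurrently.
# Usage: print_singleton_cmds MAX script...
print_singleton_cmds() {
    local max=$1
    shift
    local i=0
    local script
    for script in "$@"; do
        echo "sbatch -J batch$(( i % max + 1 )) --dependency=singleton $script"
        i=$(( i + 1 ))
    done
}

# Dry run: prints the same six commands as listed above.
print_singleton_cmds 3 job1.sh job2.sh job3.sh job4.sh job5.sh job6.sh
```

Once the printed commands look right, pipe the output to "sh" to
actually submit the jobs.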

There are currently no hard limits on the number of jobs a single user
can run or submit, so please use common sense!

Common sense can be assisted by checking the queue status before
submitting a large volume of jobs. To get an overview of running and
queued jobs, use the "squeue" command with suitable options, e.g:

squeue -o "%.7i %.9P %.8u %.8T %.11L %.11l %.8N %.10m"
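
To see at a glance who is using the queue, the squeue output can be
reduced to a count of jobs per user and state. A small sketch: the
function name "jobs_per_user" is made up for this example, and it
reads "USER STATE" lines on standard input so that it can be tried
without a live queue.

```shell
#!/bin/bash
# Count jobs per user and state from lines of the form "USER STATE",
# with the largest counts first.
# Usage on the cluster: squeue -h -o "%u %T" | jobs_per_user
jobs_per_user() {
    sort | uniq -c | sort -rn
}

# Example with made-up input instead of a live squeue call:
printf '%s\n' "alice RUNNING" "bob PENDING" "alice RUNNING" "bob RUNNING" \
    | jobs_per_user
```

Here "-h" suppresses the squeue header line, and "%u" and "%T" select
the user name and job state.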

All the options ("--mem", "-n" etc.) for interactive and sbatch are
described in the man page for sbatch (run "man sbatch" on Vagn to read
it).

If your application does not work on the fat nodes
==================================================
Please read https://lists.nsc.liu.se/mailman/public/vagnekman-users/2011-July/000221.html

If you cannot get your application working, contact
vagnekman-support at snic.vr.se for assistance.

As a temporary workaround, you can request that your job is only run
on one of the old Intel nodes. Use the option "-C intel" to sbatch or
interactive.

--
Mats Kronberg, NSC Support <vagnekman-support at snic.vr.se>