[Vagnekman-users] Acceptable use of Vagn

Fri Aug 24 15:38:50 CEST 2012

Dear Vagn users,

There have been some complaints, both today and earlier this summer,
about some users using so many resources (CPU cores and memory) on
Vagn that it becomes difficult for other users to get access to the
system without having to wait for hours or even days.

Please remember that the queue on Vagn is just a FIFO: jobs (both
batch jobs and interactive sessions) are started in the order they are
submitted. Each job is allocated a certain number of CPU cores and
memory, and those resources cannot be used by other jobs, so when Vagn
runs out of unused CPU cores or memory, no new jobs can start until an
old one ends.
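
To illustrate the FIFO behaviour described above, here is a minimal
Python sketch. It is a simplification, not how the scheduler on Vagn is
actually implemented: it treats the machine as one pool of cores and
memory and ignores per-node placement. The function name start_fifo and
the example jobs are made up for illustration.

```python
from collections import deque


def start_fifo(queue, free_cores, free_mem_mb):
    """Start jobs strictly in submission order.

    queue: deque of (job_id, cores, mem_mb) tuples, oldest first.
    Stops at the first job that does not fit, even if a later,
    smaller job would fit.
    Returns (started_job_ids, free_cores, free_mem_mb).
    """
    started = []
    while queue:
        job_id, cores, mem_mb = queue[0]
        if cores > free_cores or mem_mb > free_mem_mb:
            break  # the head of the queue blocks everything behind it
        queue.popleft()
        free_cores -= cores
        free_mem_mb -= mem_mb
        started.append(job_id)
    return started, free_cores, free_mem_mb


# Hypothetical jobs: (name, cores, memory in MB)
queue = deque([("A", 4, 16000), ("B", 8, 4000), ("C", 1, 2000)])
started, cores_left, mem_left = start_fifo(queue, free_cores=6, free_mem_mb=64000)
print(started, cores_left, mem_left)
```

In this example job C would fit in the remaining resources, but it
still has to wait because job B is ahead of it in the queue.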

The responsibility for not "hogging" too large a part of the machine is yours!

If you want to run several batch jobs on Vagn, I suggest that
you consider using the method described on
http://www.nsc.liu.se/systems/vagn/#sec-4-9 to make sure that you
don't use too many resources at any one time.

How many cores and how much RAM is it OK to use? That is difficult to
say. If you are the only user from your user group (e.g. MISU) using
Vagn on a certain day, it might be acceptable to use a large chunk of
Vagn, but if many other users from your group are also active you
should
probably be more careful. Also remember that it is the amount of
resources used that matters, not the number of jobs. A 32-core/256GB
job uses just as many resources as 32 1-core/8GB jobs.
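
The comparison above is simple multiplication. As a small Python
sketch (the helper name total_resources is made up for illustration):

```python
def total_resources(n_jobs, cores_per_job, mem_gb_per_job):
    """Return (total cores, total memory in GB) for n identical jobs."""
    return n_jobs * cores_per_job, n_jobs * mem_gb_per_job


one_big = total_resources(1, 32, 256)    # one 32-core/256GB job
many_small = total_resources(32, 1, 8)   # 32 jobs of 1 core/8GB each
print(one_big, many_small)
```

Both calls give the same totals: 32 cores and 256 GB.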

Since it is difficult to get an overview of who is actually using what
resources on Vagn, I made a small script, "vagn-usage", that might be
useful. Please try it out. Sample output:

[kronberg@analys1 ~]$ vagn-usage
Vagn usage at 2012-08-24T15:21:26

Usage by group
Group     #cores      Memory (MB)
---------------------------------
 kthmech      39      424000
    misu       7       38000
  rossby       6       22000
 sm_fouo       3       10000

Usage by user
User      #cores      Memory (MB)
---------------------------------
sm_annli       1        4000
sm_louca       1        2000
sm_mkola       2        8000
sm_ppemb       1        4000
sm_rohor       1        2000
sm_semsc       1        4000
sm_stran       1        4000
sm_torko       1        4000
 x_andci      32       64000
 x_iulib       2        8000
 x_janju       1        2000
 x_julsa       2        4000
 x_larah       1       20000
 x_laubr       1        4000
 x_liawe       2      100000
 x_maber       1        4000
 x_phisc       4      256000

     Cores                   Memory (MB)
Node in_use  total      %    in_use    total      %  Full?
----------------------------------------------------------
  a2      4      8   50.0     32000    32186   99.4    yes
  a3      1      8   12.5     32000    32186   99.4    yes
  a4      1      8   12.5     32000    32186   99.4    yes
  a5      7      8   87.5     18000    32186   55.9     no
  a6      4      8   50.0     62000    64372   96.3    yes
  a7      6     32   18.8    254000   257488   98.6    yes
  a8     32     32  100.0     64000   257488   24.9    yes

 all     55    104   52.9    494000   708092   69.8    n/a

(Full == no available cores or <4GB RAM available)

There are jobs waiting in the queue:
 JOBID PARTITION     USER  ACCOUNT NODES CPUS MIN_MEMORY NODELIST(REASON)
 72746     share  x_julsa     misu     1    8       4000 (Resources)
 72991     share  x_phisc  kthmech     1    1      48000 (Priority)
 72992     share  x_phisc  kthmech     1    1      48000 (Priority)

As you can see, Vagn was almost full: there was just one node where a
small single-core job could be started. You can also see that a single
user group is responsible for most of the usage.
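
For reference, the "Full?" rule from the sample output (no available
cores, or less than 4GB RAM available) can be sketched in Python as
below. This is only an illustration of the rule, not the actual code of
vagn-usage. The function name is_full is made up, and reading "4GB" as
4000 MB is an assumption. The node figures are copied from the sample
output.

```python
FULL_MEM_THRESHOLD_MB = 4000  # assumption: "4GB" taken as 4000 MB


def is_full(cores_in_use, cores_total, mem_in_use_mb, mem_total_mb):
    """A node counts as full if it has no free cores or too little free memory."""
    no_free_cores = cores_in_use >= cores_total
    low_memory = (mem_total_mb - mem_in_use_mb) < FULL_MEM_THRESHOLD_MB
    return no_free_cores or low_memory


# (cores in use, total cores, memory in use (MB), total memory (MB))
nodes = {
    "a2": (4, 8, 32000, 32186),
    "a3": (1, 8, 32000, 32186),
    "a4": (1, 8, 32000, 32186),
    "a5": (7, 8, 18000, 32186),
    "a6": (4, 8, 62000, 64372),
    "a7": (6, 32, 254000, 257488),
    "a8": (32, 32, 64000, 257488),
}

full_flags = {name: is_full(*values) for name, values in nodes.items()}
for name, full in full_flags.items():
    print(name, "yes" if full else "no")
```

Note that a8 is full because all its cores are in use even though most
of its memory is free, while a2, a3 and a4 are full because of memory
even though they have idle cores.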

-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>