[monolith-users] News digest

Niclas Andersson nican at nsc.liu.se
Mon Oct 20 18:12:27 CEST 2003


Dear Monolith Users,

This is a digest of news items related to the usage of Monolith. It is
sent to all Monolith users via the e-mail list
<monolith-users at nsc.liu.se>. Instructions on how to subscribe and
unsubscribe are listed at the end of this e-mail.


SCI Network Stability
=====================

We have experienced very good stability of Monolith since August 19
when we installed a new ScaMPI library with an improved
backoff-algorithm.  However, we still keep a close watch on error
output from mpirun/mpimon to detect any possible SCI induced errors.

There are still some applications that stop with error messages
containing one or more lines of

	** Address Error **

This message is generated by the Intel compiler's runtime library in
various trivial error situations. For example, a simple way to
generate this message is to index an array outside its bounds. It can
be either the primary reason the application stops or a secondary
error caused by an abort on another compute node or similar.

Since this message is easy to generate in user code and difficult to
relate to erroneous behaviour of the network, we do not take any
general action. To pursue this error in a specific application we
need your help to isolate, trace, and debug the problem and find its
cause.

To get more descriptive error messages, use the compiler option '-C'
(both ifc and icc), which enables extensive runtime error checking
(see the man pages for more selective options).
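
As an illustration, a rebuild with the '-C' option could be wrapped in
a small shell function like the one below. This is only a sketch: the
function name build_checked and the file names are hypothetical, and
the choice of compiler by file suffix is an assumption made for the
example. Only the '-C' option itself comes from this digest.

```shell
# build_checked SOURCE OUTPUT
# Compiles SOURCE with extensive runtime error checking ('-C').
# Hypothetical helper: uses icc for *.c files and ifc for everything else.
build_checked() {
    src=$1
    out=$2
    case "$src" in
        *.c) icc -C -o "$out" "$src" ;;   # C source
        *)   ifc -C -o "$out" "$src" ;;   # assumed to be Fortran source
    esac
}

# Example (hypothetical file names):
#   build_checked solver.f90 solver_checked
```

Runtime checking normally slows the program down, so a binary built
this way is best kept for debugging runs.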


Resource Allocation
===================

A few users have raised the question of fairness in the resource
allocation procedure used for scheduling on Monolith. They claim they
have been treated unfairly in comparison with others. In all cases,
the issue has come down to the amount of time granted to the project
rather than the scheduling algorithm used.

  When a SNAC project has consumed all its granted time, the priority
  of all its users is lowered to bonus priority and submitted jobs are
  scheduled _after_ all normal priority jobs have been scheduled.

On Monolith, where activity is always high, lowered priority easily
results in a longer time in the queue. This can wrongly be viewed as
unfair. Instead, bonus priority provides a possibility to utilize the
computing time that other projects with normal priority do not claim.
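
The ordering rule can be pictured with a toy model. The sketch below
is not the scheduler's actual algorithm, which takes more factors into
account than priority class. The function name order_queue and the
input format (one "JOBID CLASS" pair per line) are assumptions made
purely for illustration.

```shell
# order_queue
# Reads "JOBID CLASS" lines on stdin, where CLASS is "normal" or "bonus",
# and prints all normal jobs first, then all bonus jobs. The input order
# is kept within each class.
order_queue() {
    awk '
        $2 == "normal" { print; next }      # normal jobs pass straight through
        $2 == "bonus"  { held[n++] = $0 }   # bonus jobs are held back
        END { for (i = 0; i < n; i++) print held[i] }
    '
}

# Example (hypothetical job ids):
#   printf '%s\n' '101 bonus' '102 normal' '103 bonus' | order_queue
```

In this model a bonus job runs only when no normal priority job is
waiting ahead of it, which is why the queue time grows when the
machine is busy.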

We have slightly improved the web view of the idle queue on
Monolith (<http://status.nsc.liu.se/monolith>). The division between
normal and bonus priority jobs is now shown explicitly.


File Systems
============

There are three "areas" where you can put files:

* /home/$USER: Your home directory. Limited in size. Backed
up. NFS-mounted on all nodes.

* /disk/global/$USER: Your global directory. For large files. Fast. No
backup. NFS-mounted on all nodes.

* /disk/local: Only available on the local node. Removed when job ends. 
Fastest.

We have lately experienced a full /home file system. We increased its
size during the last maintenance period, and we will monitor and
enforce limits on the space you are using. A few recommendations:

- Use /disk/local for local scratch files

- Use /disk/global/$USER for generated data. Remember that stdout and
stderr are stored in the directory from which you started the job.
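
Since /disk/local is removed when the job ends, results written there
have to be copied out before the job script finishes. A minimal
sketch of such a step is shown below. The function name stage_results
and the directory names in the example are hypothetical, and so is
the assumption that your job keeps its scratch files in a
subdirectory of its own.

```shell
# stage_results SCRATCH_DIR GLOBAL_DIR
# Copies everything in SCRATCH_DIR (including hidden files) to GLOBAL_DIR,
# creating GLOBAL_DIR if needed. Returns non-zero if the copy fails.
stage_results() {
    scratch=$1
    dest=$2
    mkdir -p "$dest" && cp -R "$scratch"/. "$dest"/
}

# Example, as the last step of a job script (hypothetical directory names):
#   stage_results /disk/local/myrun "/disk/global/$USER/myrun"
```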


Pbspeek
=======

With 'pbspeek' you can see the stdout or stderr of your running job.

Usage: pbspeek [-o|-e][-h] <jobid>
       -o show stdout (default)
       -e show stderr
       -h show this help and exit
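
For a long-running job you often want only the most recent error
output. The sketch below assumes that pbspeek writes the job's output
to its own stdout so that it can be piped. This digest does not state
that explicitly. The function name peek_tail and the job id in the
example are hypothetical.

```shell
# peek_tail JOBID [LINES]
# Shows the last LINES lines (default 20) of the stderr of a running job.
peek_tail() {
    pbspeek -e "$1" | tail -n "${2:-20}"
}

# Example (hypothetical job id):
#   peek_tail 12345 50
```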


Core dump
=========

When using ScaMPI with NSC's mpprun (in /usr/local/bin/mpprun)
you can now enable core dumps by supplying the option '-core' to
mpprun:

	mpprun -core your_binary

Please don't use this option unless you really need the core
dump for debugging. When enabled, a parallel application can
generate many core dumps in a short time, consuming a lot of disk
space. Please refrain from using /home when this option is enabled!
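
One way to stay clear of /home is to start the run from a directory
elsewhere, for example under /disk/global/$USER. The sketch below
assumes that core files are written to the working directory of the
run, which this digest does not state explicitly. The function name
run_with_core and the directory in the example are hypothetical.

```shell
# run_with_core WORKDIR COMMAND [ARGS...]
# Changes to WORKDIR and launches COMMAND through mpprun with core dumps
# enabled. Refuses to run if WORKDIR lies under the home directory.
run_with_core() {
    workdir=$1
    shift
    case "$workdir/" in
        "$HOME"/*)
            echo "refusing to enable core dumps under $HOME" >&2
            return 1
            ;;
    esac
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$workdir" && mpprun -core "$@" )
}

# Example (hypothetical directory):
#   run_with_core "/disk/global/$USER/debug" your_binary
```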

(/usr/local/bin/mpirun is a symbolic link to /usr/local/bin/mpprun. We
use the name mpprun above to maintain a separation from the many
mpirun harnesses that exist and are used on Monolith to actually
launch your application.)


        Niclas Andersson

-- 
NSC Support Team                              E-mail: support at nsc.liu.se
National Supercomputer Centre                       Phone: +46 13 281000
Linkoping University, S-581 83 Linkoping, Sweden      Fax: +46 13 282535

