[monolith-users] Monolith system maintenance Tuesday

Niclas Andersson nican at nsc.liu.se
Fri Aug 15 22:35:49 CEST 2003


-------------------------------------
Tuesday 19/8 11:00-15:00
System maintenance stop on Monolith 
-------------------------------------

Dear Monolith users,

We have reserved Monolith for system maintenance on Tuesday 19/8, 11:00-15:00.

Jobs that cannot be scheduled to finish before Tuesday will be held
in the queue and will start when Monolith becomes available again for
general use.

During the system maintenance period we will:

1. Install a new patched driver for the SCALI network.
   (A severe race condition was discovered last week.)

2. Install a new ScaMPI library with a modified backoff algorithm.
   (This should make the nodes behave in a more friendly way towards
   each other.)

This will hopefully resolve some of the problems we currently see with
the SCALI network. 

There is no need to recompile or relink any applications.

-----------------------

SCALI personnel now keep a close watch on Monolith and try to analyse
the traces from each incident we discover. I say "try to" because
with the high-speed communication it is indeed difficult to catch
anything at all before it is far too late.

Also, we have gradually increased the more general monitoring of
hardware, operating system, and jobs. As long as you use 'mpprun'
(=/usr/local/bin/mpirun) to start your application, we automatically
get a copy of your stderr if your job crashes or fails to start due to
mpimon problems.

One issue we need your help with is monitoring the progress of
your application and notifying us if it hangs for some
reason. Since the communication patterns vary and the processes
spin-lock while waiting for communication, there is no easy way for us to
detect a hung state of an application. We would therefore be grateful
if you could notify us if and when this happens. The JobID
and the time it hung (as accurately as you can) are the minimum information
we need. If you can tell us more about the kind of operation in which it
hangs, that will of course help us in our pursuit.
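
As a rough illustration of the kind of progress monitoring asked for
above, here is a minimal Python sketch. It is not an NSC tool. It
assumes that your application writes to an output file at reasonably
regular intervals, and it treats a file that has not been modified for
longer than a chosen threshold as a sign of a possible hang. The
function names, the file name and the threshold are all hypothetical.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_hung(path, max_idle_seconds, now=None):
    """Heuristic only: True if the output file has been idle for longer
    than `max_idle_seconds`. A long compute phase without output would
    also trigger this, so pick the threshold to suit your application."""
    return seconds_since_update(path, now) > max_idle_seconds


def last_update_string(path):
    """Format the last modification time, e.g. for a report to support."""
    mtime = os.path.getmtime(path)
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
```

For instance, looks_hung("myjob.out", 1800) would report whether the
hypothetical file myjob.out has been silent for more than half an hour,
and last_update_string("myjob.out") gives an approximate time of the
last output to include in a report together with the JobID.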

There are two main reasons for a hung application: either 1) there is
a message routing problem, which can most often be resolved if we only
get our eyes on the problem in time, or 2) it is caused by a hardware
failure (e.g. a node panic) or data corruption, which is indeed fatal
for the application. When an application just stops producing output,
it is difficult to tell what the reason is.

Questions and comments to support at nsc.liu.se.

Regards,
        Niclas

-- 
Niclas Andersson                                E-mail: nican at nsc.liu.se
National Supercomputer Centre                       Phone: +46 13 281464
Linkoping University, S-581 83 Linkoping, Sweden      Fax: +46 13 282535

