[monolith-users] Status of Monolith

Niclas Andersson nican at nsc.liu.se
Tue Oct 5 20:28:55 CEST 2004


Dear Monolith users,

The Monolith system is now fully operational.

Monolith is very well utilized, perhaps even too much(?), since the
queue length tends to increase. This is partly due to the maximum
job length, which is now six days. It is important that we make
good use of every CPU-second there is.

- Many of our file systems are very full. Please take a moment to
clean out old files and reduce the amount you store, especially on
/disk/global and /disk/global4. Monolith is processing lots of
jobs these days, and it is a pity if the results from long jobs are
lost just because there was no space left on the disk devices.

A few notes from today's (Tuesday) system maintenance:

While Lennart, who is NSC's first manager of the Monolith system,
enjoys a well-deserved vacation on the other side of the
earth, we have spent the day attacking a few stability problems we've
seen lately on Monolith:

- We have downgraded the driver software for the SCI network. This
seems to considerably improve stability at the SCI link level. If
you have recently seen unexpectedly low performance, please
rerun your application.

- We have slightly tuned the configuration of the driver. This may
marginally decrease performance for some applications. If you
experience this, please let us know. What we are trying to address
is the infrequent issue where an application can hang without any
trace of errors.

- We have replaced NFS over UDP with NFS over TCP throughout the
system. Lately we have seen the NFS client on login-1 and login-2
simply giving up on the home file systems. The only remedy has been
to reboot the login server, causing users to lose editing work and
an outage in the login service. We hope this change will resolve
these irritating interruptions.

- We forgot to send out the customary system maintenance announcement
by e-mail beforehand. Our apologies.

We hope (as always) that the result of this maintenance is a more
stable system where your applications will run smoothly to completion
without any interruptions. Please report any malfunctions, errors,
peculiar behaviour, or other comments to <support at nsc.liu.se>.

Best Regards,

Niclas Andersson
National Supercomputer Centre
on behalf of the system support team

