[Vagnekman-users] Re: Ekman: Broken file-server with system-software

lars malinowsky lama at pdc.kth.se
Tue Apr 6 13:29:01 CEST 2010


Hello, an update as of 2010-04-06/12:20:00

There is work in progress on repairing the
broken fileserver. In parallel we are preparing
for a full restore from tape, to alternative
hardware, should the repairs fail.

We did a separate restore of the binaries
used to communicate with the batch-system, so
you will see overdue running jobs terminate,
although with some delay.

Jobs solely relying on files in /cfs/ during
execution should still be running unaffected.
Jobs that have finished during the outage have
not been able to release (give back) reserved
nodes, as the appropriate commands simply were
not there.

As various applications (openmpi, scali-mpi,
intel-compilers, libraries, ...) also live on
the broken server, we cannot allow jobs to start
until dependencies have been sorted out and
backups restored from tape (and/or the fileserver
is online again).

regards,
lars/pdc-staff.
- - - 
lars malinowsky <lama at pdc.kth.se> writes:

> Hello,

> as you might have noticed, we have a broken file-server
> which among many other things contains most of the
> commands to communicate with the batch-system.

> As this has happened right in the middle of the Easter
> holidays, there is no immediate information to give
> about whether it possibly can be repaired today.

> (It is, as a curiosity, not even possible to
>  announce this information using our ordinary
>  flash-news mechanism.)

> We are very sorry for the inconvenience.

> regards,
> lars/pdc-staff.

