[Snic-users] Running jobs lost, Triolith and Gamma

Mats Kronberg kronberg at nsc.liu.se
Wed Aug 12 10:57:01 CEST 2015


Dear Triolith and Gamma users,

We seem to have hit some bad luck today. The storage system
consists of three redundant server pairs. As long as one server in
each pair is running, the storage is available.

Earlier, one server in the third pair failed. The vendor had replaced
the broken hardware, but the server had not yet been reconfigured and
started (due to some technical problems, and because staff with the
right skills were away on vacation). This morning the other server in
the same pair failed with what seems to be the same type of hardware
failure.

This resulted in the loss of all access to the shared /home and /proj
file systems, which caused all running jobs to fail.
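
To illustrate the availability rule described above, here is a minimal
Python sketch. The function name and the True/False representation of
server state are made up for illustration; this is not how the storage
system itself is implemented or monitored.

```python
# Hypothetical model: each pair is a tuple of two booleans,
# True = server running, False = server down.

def storage_available(pairs):
    """Storage is up only if every pair has at least one running server."""
    return all(any(pair) for pair in pairs)

# One server in the third pair was already down before this morning.
before_incident = [(True, True), (True, True), (False, True)]

# This morning the second server in the same pair failed as well.
after_incident = [(True, True), (True, True), (False, False)]

# Bringing the repaired server back online restores access,
# but the third pair still has no redundancy.
after_first_repair = [(True, True), (True, True), (True, False)]

print(storage_available(before_incident))
print(storage_available(after_incident))
print(storage_available(after_first_repair))
```

Losing a single server in a pair costs redundancy but not access;
losing both servers in the same pair takes the whole storage system
offline.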

We're now working on:

A: configure and start the server that was repaired earlier to allow
the storage system to be brought online
B: get the second server repaired and restored to service, so we get
our redundancy back


My best estimate of when Triolith and Gamma can be available again is
"late today, if there are no further problems".

We will send out further information when the systems become
available, or if there are further delays.


-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>


On Wed, Aug 12, 2015 at 7:11 AM, Mats Kronberg <kronberg at nsc.liu.se> wrote:
> Dear Triolith and Gamma users,
>
> We have just experienced a problem with the shared storage system
> (/home, /proj) used by Triolith and Gamma. All running jobs have
> probably been lost.
>
> We are currently investigating what happened, and will send out more
> information when we have it.
>
> New logins to the login nodes have been temporarily disabled.
>
> Kind regards,
> Mats Kronberg, NSC

