[Snic-users] Triolith and Gamma failure: running jobs lost

Mats Kronberg kronberg at nsc.liu.se
Thu Aug 17 19:50:29 CEST 2017


Dear Triolith and Gamma users,

The system is now back online.

Impact: access to Gamma and Triolith was not possible from 16:24 to 19:41.
All jobs that were running at 16:24 were killed. As far as we can
tell, no data was damaged.


Cause:

Earlier this week, one of the two redundant servers that handle one
third of /home and /proj failed, and we're currently waiting for the
storage vendor to repair it. Unfortunately the single remaining server
lost all contact with its disks today. Failure of both servers is
something the system is not designed to handle, so all access to the
file systems was lost, and all running jobs failed.

The cause of today's failure has not been determined yet (possibly
unusually high load combined with only having 50% of the normal server
capacity online), so there's no guarantee it won't happen again, but
we still think it's better to open up the system now than to remain
closed until the first server has been repaired.


-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>

