[Snic-users] Unplanned downtime of Triolith, Kappa and Matter

Mats Kronberg kronberg at nsc.liu.se
Wed May 20 17:36:15 CEST 2015


Update: Triolith is now back online, Kappa and Matter will follow shortly.

This is what happened:

We traced the earlier problem today to a faulty network cable
connecting Kappa and Matter to the storage server "gss3". The servers
"gss3" and "gss4" form a redundant pair: if one is shut down or fails,
the other can handle the load.

However, running on only one of the servers is not entirely safe, as
we are then a single fault away from a complete outage (like this
one). It also lowers storage performance (in theory by 50%, but this
is probably not very noticeable in most cases, as we rarely use more
than 50% of the system's full performance).

We decided to replace the faulty network cable (normally a safe
operation) without waiting for the next full service stop (which might
be weeks or months away).

Unfortunately, while the cable connecting gss3 was being replaced,
server gss4 unexpectedly crashed. We currently have no good
explanation for this, and it is not something we had considered a
significant risk.

This failure made one third of all disks inaccessible to the compute
clusters, which in turn caused almost all jobs on Triolith, Kappa and
Matter to fail.

We will work with the vendor to figure out why this happened.

NSC apologizes for the inconvenience.

-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>

