[Triolith-users] Unplanned downtime, jobs lost

Pär Lindfors paran at nsc.liu.se
Tue Aug 26 14:37:47 CEST 2014


Dear Triolith users,

Triolith had unplanned downtime earlier today.
Both running and queued jobs were lost.

10:06 - Both running and queued jobs lost.
12:42 - Triolith started running new jobs.


Longer explanation:

NSC had planned downtime today from 09:00 to 13:00 for electrical work
in our other data center. That building houses (among other things) the
clusters Kappa and Matter, the SNIC centre storage (/home, /software,
/nobackup/global) and most of NSC's servers and networking
infrastructure.

Kappa and Matter had scheduled downtime because of this, while we
planned to keep the storage and infrastructure running on UPS power.

Unfortunately all power was accidentally cut, including the UPS
power. This resulted in Triolith losing all GPFS file systems and the
external network connection. Losing the file systems caused running jobs
to be killed. Queued jobs were also lost; we will investigate why that
happened.

Power was restored at 11:20 and we then started bringing systems
online. Triolith was back in normal operation at 12:40.

Regards,
Pär Lindfors, NSC


