[Triolith-users] All Triolith compute nodes down (unplanned outage)

Mats Kronberg kronberg at nsc.liu.se
Tue Sep 3 14:16:33 CEST 2013


Short summary: All 432 nodes that were running this morning and an
additional 288 that have been moved today will be available again
during the afternoon.


Details: the outage was caused by a failure in a UPS (battery backup)
unit that supplies power to the cooling system in our new computer
room. As the cooling system failed, the temperature increased and the
compute nodes were automatically powered off to prevent damage.

We decided to extend the outage by a few hours to give the UPS
supplier an opportunity to investigate the problem more thoroughly. We
estimate that all 720 nodes will be available again later this afternoon.

All users whose jobs failed due to this problem have been notified. If
you did not receive an email about this a few minutes ago, your jobs
were not affected.

-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>

