[Berzelius-users] Filesystem outage 2023-04-28 09:45-11:00

Henrik Henriksson hx at nsc.liu.se
Fri Apr 28 17:23:11 CEST 2023


Today from 09:45 until 11:00, we had a filesystem outage affecting Berzelius.

This caused problems with connecting to Berzelius, as well as with doing
anything on the cluster. However, the semantics of the Lustre client are such
that filesystem operations should not error out; instead, they block until the
filesystem becomes available again. Thus, we expect that most, if not all,
running jobs survived without any issues. From what we can see in our
statistics and logs, very few running jobs, if any, were affected.
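
As an illustration of the blocking behaviour described above, here is a minimal
sketch of how a script could tell a blocked filesystem call apart from a failed
one. The function name probe_path and the 5-second timeout are hypothetical
choices for this example, not an NSC tool or recommendation. It assumes that a
call such as os.stat() simply does not return while the filesystem is
unavailable, so the call runs in a helper thread and the caller waits only for
a limited time.

```python
import os
import threading


def probe_path(path, timeout_s=5.0):
    """Classify one os.stat() call on path as 'ok', 'error' or 'blocked'.

    Hypothetical helper for illustration only. 'blocked' means the call did
    not return within timeout_s seconds, which is what we would expect to see
    while the filesystem is unavailable.
    """
    result = {}

    def worker():
        try:
            os.stat(path)
            result["status"] = "ok"
        except OSError:
            # The call returned, but with an error (e.g. no such file).
            result["status"] = "error"

    # A daemon thread, so a call that never returns cannot keep the
    # interpreter alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    # If the worker never stored a status, the call is still pending.
    return result.get("status", "blocked")


if __name__ == "__main__":
    # "." is a placeholder; substitute a directory on the filesystem in question.
    print(probe_path("."))
```

Note that in the blocked case the helper thread stays stuck in the call until
the filesystem returns, so this only reports the stall; it does not cancel the
operation.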

The root cause has been identified: a badly connected CEE 400V/16A plug that
had been working loose during normal operations finally lost its connection
this morning during installation of new network equipment for the expansion.
As a mitigation, we will do an additional round of cable checks during the
scheduled downtime.

--
Henrik Henriksson
Systems administrator
NSC

