[Berzelius-users] Resolved: Unplanned login-node downtime

Mon Sep 11 13:24:40 CEST 2023

Dear Berzelius users,

The failed login node `berzelius1.nsc.liu.se` has been returned to service.

> Due to a currently undiagnosed issue, we lost the login node `berzelius001`
> earlier this afternoon. All active sessions on that server are lost. Jobs and
> job scheduling are unaffected.

The issue was most likely a cascading failure starting from the OOM-killer,
gradually degrading the server until it was kicked off the shared network
filesystem. After this, the node was fully unresponsive and required a reboot.
During the reboot, a hardware issue manifested, which prolonged the
troubleshooting process over the weekend.

We'd like to remind users that the login nodes are shared resources, on which
heavy tasks should not run. Please schedule a job on a compute node or limit the
amount of parallelism you use. We have resource limits in place wherever we can,
but not all shared resources can be covered. Processes on login nodes deemed
disruptive may be terminated by staff without warning (but with a follow-up
email).
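
As an illustration of limiting parallelism, here is a minimal Python sketch
that caps the number of workers a script starts on a login node. The cap of 2
and the names `LOGIN_NODE_MAX_WORKERS`, `capped_workers` and `run_light_tasks`
are hypothetical choices made for this example, not limits or tools defined by
NSC. Anything heavier than this belongs in a scheduled job.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap for light work on a shared login node (not an NSC-defined limit).
LOGIN_NODE_MAX_WORKERS = 2


def capped_workers(requested, cap=LOGIN_NODE_MAX_WORKERS):
    """Return a worker count of at least 1 that never exceeds the cap
    or the number of CPUs reported by the operating system."""
    available = os.cpu_count() or 1
    return max(1, min(requested, cap, available))


def run_light_tasks(func, items, requested_workers=8):
    """Apply func to each item using a bounded thread pool and
    return the results in input order."""
    workers = capped_workers(requested_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

Many tools have their own setting for the number of threads or processes; the
same idea applies there: choose a small fixed number instead of the default
"use every core".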

`berzelius.nsc.liu.se` points to a service that should fail over automatically
between the login nodes and is what most users should use to connect to
Berzelius. Users connecting directly to `berzelius1.nsc.liu.se` or
`berzelius2.nsc.liu.se` should be aware that they may need to switch to the
other login node manually in case of hardware failures or other issues.
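
For users who script their connections, the manual switch described above
could be sketched roughly as follows in Python. The three host names come from
this announcement; the port (22), the timeout and the function names
`tcp_reachable` and `first_reachable` are assumptions made for the example. A
successful TCP connection only shows that something is listening, not that the
login node is healthy.

```python
import socket

# Host names from the announcement, in order of preference.
PREFERRED_HOSTS = [
    "berzelius.nsc.liu.se",
    "berzelius1.nsc.liu.se",
    "berzelius2.nsc.liu.se",
]


def tcp_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.
    Port 22 and the 5 second timeout are assumptions for this sketch."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def first_reachable(hosts=PREFERRED_HOSTS, probe=tcp_reachable):
    """Return the first host for which probe(host) is true, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None
```

Calling `first_reachable()` returns the name to connect to, or `None` if none
of the three answered. The `probe` parameter exists so the check can be
replaced, for example by a stricter test or by a stub when testing offline.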

--
Henrik Henriksson
Systems Administrator
National Supercomputer Centre