[Snic-users] /proj file system unplanned downtime

Pär Lindfors paran at nsc.liu.se
Wed Nov 5 18:24:53 CET 2014


Dear NSC SNIC users,

At 17:25 today the new /proj file system became unavailable. We started
repairing it right away, and it became available again at 18:20. We do
not believe that any data stored on disk was lost during this downtime.

Running jobs that were using the file system will most likely have been
killed. Data transfers that were running, for example on login nodes,
will also have failed. In most cases the error message will have been
"Stale file handle".

Since taking the new storage into production we have been experiencing
problems with excessive memory usage on the storage servers. To improve
this situation we have changed certain memory configuration settings and
were restarting the servers one by one today. This is normally a safe
operation, but this time one of the reboots caused the file system to go
down. We will of course report this issue to the file system vendor.

Regards,
Pär Lindfors, NSC


