[Snic-users] Temporary unplanned file system outage on Friday/Saturday (Triolith, Matter, Kappa)

Mon Sep 2 11:20:46 CEST 2013

Dear Triolith, Kappa and Matter users,

Approximately between 18:20 CEST on Friday and 03:00 CEST on Saturday,
we experienced a severe overload of the NSC Centre Storage system
(where /home, /nobackup/global and /software are located).

We did a quick check of the system on Saturday, and performance then
appeared to be normal again.

You might have experienced this problem as general slowness of the
file system ("ls", editing files, copying files, etc.) or even as error
messages such as "No space left on device".

We have had reports of jobs that used the file system during this
period and either failed or stopped making progress.

If your jobs failed for no apparent reason during the weekend, this
problem might be the cause. If you find errors such as "No space left
on device" but you have plenty of space available in your quota, this
problem is definitely the cause of your job failure. You should
resubmit such jobs.

If you notice that a job is not making any progress (e.g. if it
normally writes to an output file every 5 minutes and no longer does
so), it has probably stopped working due to this problem. In that case
you should cancel the job and resubmit it.
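
One way to check whether a job still writes output is to compare the
modification time of its output file with the current time. The sketch
below is only an illustration, not an NSC tool: the function name
is_stalled and the 10-minute threshold are assumptions, and you should
pick a threshold that matches how often your job normally writes.

```python
import os
import time


def is_stalled(path, max_idle_seconds=600, now=None):
    """Return True if `path` was last modified more than
    `max_idle_seconds` ago (hypothetical helper, 10 minutes by default)."""
    if now is None:
        now = time.time()
    # getmtime returns the last modification time in seconds since the epoch.
    idle_seconds = now - os.path.getmtime(path)
    return idle_seconds > max_idle_seconds
```

For example, is_stalled("job.out") on a job that normally writes every
5 minutes would return True once the file has been untouched for more
than 10 minutes ("job.out" is a placeholder file name).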

Unfortunately we have not been able to trace the source of the problem
yet. This is sometimes very hard or impossible to do after the problem
has gone away. The cause is likely to have been a job, or something run
on a login node, that performed a large number of metadata operations
(e.g. copying or creating many small files).

If you did something unusual to the file system starting around 18:20
on Friday, please let us know. We guarantee your anonymity. :)

NSC apologizes for the inconvenience.

-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>