[Neolith-users] Filesystem problems being investigated

Peter Kjellstrom cap at nsc.liu.se
Wed Mar 17 18:50:09 CET 2010


On Wednesday 17 March 2010, Peter Kjellstrom wrote:
> All compute nodes and login nodes are currently unavailable while we
> investigate a filesystem related problem. All running jobs have been
> stopped.
>
> More information will be sent out as the investigation progresses. At this
> time we don't yet have a time schedule for returning the systems to normal
> service.

We have now (almost) completed our debugging and analysis.

Short version: Some data written to nobackup after 2010-03-16 19:50 (last
night) has been lost, and affected users will be contacted. Neolith and
Kappa are expected to return to service some time tomorrow.


Longer version

At approx. 19:50 last night, the latest in a series of configuration changes
aimed at expanding the nobackup filesystem went wrong in a non-obvious way.
The storage unit we attempted to attach had a low-level address that
conflicted with that of another storage unit already in use. That low-level
address is supposed to be unique.

This mix-up went undetected partly because no metadata was affected, so
everything seemed to work and no errors were logged. During the night, many
read requests against the filesystem returned (very) unexpected data, and
write requests put data in the wrong place. Fortunately, the "wrong place"
was unused, so no further data was overwritten.

We have spent today figuring out what happened, how it happened, and which
files were affected. Information on lost or corrupted files will be sent to
the affected users individually.

NSC regrets the incident and will work with the storage vendor to ensure that
it does not happen again. Users with questions are, as always, welcome to
contact support.

/Peter

-- 
------------------------------------------------------------
  Peter Kjellström               | E-mail: cap at nsc.liu.se
  National Supercomputer Centre  |
  Sweden                         | http://www.nsc.liu.se