[Neolith-users] Filesystem problems being investigated
Peter Kjellstrom
cap at nsc.liu.se
Wed Mar 17 18:50:09 CET 2010
On Wednesday 17 March 2010, Peter Kjellstrom wrote:
> All compute nodes and login nodes are currently unavailable while we
> investigate a filesystem related problem. All running jobs have been
> stopped.
>
> More information will be sent out as the investigation progresses. At this
> time we don't yet have a time schedule for returning the systems to normal
> service.
We have now (almost) completed our debugging and analysis.
Short version: Some data written to nobackup after 2010-03-16 19:50 (last
night) has been lost; affected users will be contacted. Neolith and Kappa
are expected to return to service sometime tomorrow.
Longer version:
At approximately 19:50 last night, the latest in a series of configuration
changes aimed at expanding the nobackup filesystem went wrong in a
non-obvious way. The storage unit we attempted to attach had a low-level
address that conflicted with another storage unit already in use. That
low-level address is supposed to be unique...
This mix-up went undetected partly because no metadata was affected, so
everything seemed to work and no errors were logged. During the night, many
read requests against the filesystem returned (very) unexpected data, and
write requests put data in the wrong place. Fortunately, the "wrong place"
was unused, so no further data was overwritten.
We have spent today figuring out what happened, how it happened and which
files were affected. Information on lost or corrupted files will be sent out
to each affected user individually.
NSC regrets the incident and will work with the storage vendor to ensure that
it does not happen again. Users with questions are, as always, welcome to
contact support.
/Peter
--
------------------------------------------------------------
Peter Kjellström | E-mail: cap at nsc.liu.se
National Supercomputer Centre |
Sweden | http://www.nsc.liu.se