[Gimle-users] Important information about filesystem /nobackup/rossby14

Kent Engström kent at nsc.liu.se
Thu Mar 3 17:21:39 CET 2011


[The gimle-users address was wrong in the last mail. Resending.]

Dear users of the /nobackup/rossby14 filesystem,

one of the servers serving the filesystem has been seriously confused.
Please read this email. You need to decide if you have to rerun
jobs and/or check copied data.

We got an excellent error report from Mihaela Caian today. The report
pointed us to a specific file. When we tried to read that file on all
nodes of Gimle, we got the same content on almost all nodes, but on a
small number of nodes, the file had a corrupted part somewhere. Clearing
the disk cache on the nodes and reading the file again gave us the
right data (at least, the data seen earlier by the majority of the
nodes).
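
The comparison described above can be sketched roughly as follows. This
is only an illustration, not the procedure actually used at NSC: it
assumes a checksum of the file has already been collected from each
node by some other means, and the function name and node names are
made up for the example.

```python
from collections import Counter


def split_by_majority(digest_by_node):
    """Return (majority_digest, sorted list of nodes that disagree with it).

    digest_by_node maps a node name to the checksum that node computed
    for one and the same file. Both are hypothetical inputs here.
    """
    if not digest_by_node:
        raise ValueError("no digests supplied")
    counts = Counter(digest_by_node.values())
    # The digest reported by the largest number of nodes is taken as "right".
    majority_digest, _ = counts.most_common(1)[0]
    outliers = sorted(
        node for node, digest in digest_by_node.items()
        if digest != majority_digest
    )
    return majority_digest, outliers
```

Nodes listed as outliers are the ones that saw different content. As
noted above, agreeing with the majority only suggests that the data is
right; it does not prove it.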

We saw the same phenomenon for other files, and found out that the
common factor was that each affected file was stored on one specific
storage server.

Thus, it seems like the data is correct on disk, but the server, with
a certain probability, has been sending bad data to the clients.

We have now shut down the bad server, moved the disks to the spare
chassis, and started it again. Trying again with the files we saw
inconsistent data for before, we now see the same data on all nodes.

We will check this further, but you should assume that there is some
probability that files you have read from rossby14 during the last weeks
contained bad parts.

If you have any doubts about the integrity of the output of jobs you
have run, you might want to run them again.

If you have copied data from rossby14, you might want to copy it
again, or compare it with saved checksums for the files, if you have
any.
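
For those who do have saved checksums, a minimal sketch of such a
comparison is shown here. It assumes the checksums are SHA-256 and are
stored one per line as "<hex digest>  <relative path>" (the layout
written by sha256sum); if yours were made with another tool or
algorithm, adjust accordingly. The function names are made up for the
example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path):
    """Parse lines of the assumed form '<hex digest>  <relative path>'."""
    expected = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split(None, 1)
        # sha256sum marks binary mode with a leading '*' before the name.
        expected[name.lstrip("*").strip()] = digest.lower()
    return expected


def find_mismatches(manifest_path, root):
    """Return names that are missing or whose digest differs from the saved one."""
    bad = []
    for name, digest in load_manifest(manifest_path).items():
        target = Path(root) / name
        if not target.is_file() or sha256_of(target) != digest:
            bad.append(name)
    return sorted(bad)
```

Calling find_mismatches("checksums.txt", "/path/to/your/copy") returns
the files that should be copied again. Keep in mind that a node may
still hold an old copy of a file in its disk cache, so a file read
earlier on the same node is not necessarily read afresh from the
server.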

We have already informed Michael Kolax about this. You may discuss with
him whether you need to redo anything.


Sincerely,
-- 
Kent Engström, National Supercomputer Centre
kent at nsc.liu.se, +46 13 28 4444


