[Neolith-users] InfiniBand problem 2009-03-03

Pär Andersson paran at nsc.liu.se
Thu Mar 5 06:58:38 CET 2009


Hi,

On Tuesday 2009-03-03 at 09:42, a hardware component (a spine module) in one 
of Neolith's InfiniBand core switches reset itself. This should 
not happen, and we are investigating the cause.

This disruption of the InfiniBand fabric caused a few jobs to fail 
between 09:42 and 09:44. Failed jobs will most likely show InfiniBand- 
and/or MPI-related errors at the end of their output.

Below is a list of 12 jobs that we know failed during this time 
interval. The list may be incomplete, and some of these jobs may have 
failed for other reasons:

402462
404654
404659
404662
404655
404656
404657
408277
409312
409651
410202
410103

Please contact support at nsc.liu.se if you have questions about this.

Regards,
Pär Andersson
NSC