[Neolith-users] InfiniBand problem 2009-03-03
Pär Andersson
paran at nsc.liu.se
Thu Mar 5 06:58:38 CET 2009
Hi,
On Tuesday 2009-03-03 at 09:42, a hardware component (spine module) in
one of Neolith's InfiniBand core switches reset itself. This should of
course not happen, and we are investigating the problem.
This disruption of the InfiniBand fabric caused a few jobs to fail
between 09:42 and 09:44. Failed jobs will probably have InfiniBand-
and/or MPI-related errors at the end of their output.
Here is a list of 12 jobs that we know failed during that time
interval. The list may not be complete, and some of these jobs may
have failed for other reasons:
402462
404654
404659
404662
404655
404656
404657
408277
409312
409651
410202
410103
Please contact support at nsc.liu.se if you have questions about this.
Regards,
Pär Andersson
NSC