[Neolith-users] Neolith infiniband network

Mattias Slabanja slabanja at nsc.liu.se
Wed Oct 22 10:24:52 CEST 2008


Dear Neolith user,

On a number of occasions we have experienced unexpected InfiniBand
switch reboots in the Neolith cluster. These reboots have occurred
very rarely, but each one has disrupted a handful of MPI applications.

Working together with an engineering team from the switch vendor, we are
gathering information to find the root cause of the problem. While this
work is in progress, we believe there could be an elevated risk of
switch reboots during the coming week.

Although it is unlikely, if you do experience unexpected MPI failures
during the coming week, please let us know.


For those of you who are interested in the technical details:
The problem is related to a watchdog mechanism that governs the
management processor in each InfiniBand leaf switch (there are 70 such
switches in Neolith). About once every four years per switch, the
watchdog (for some unknown reason) decides that the switch is "stuck"
and ought to be rebooted.
To pinpoint the cause, work has been ongoing for the last couple of
weeks, and we are currently pursuing leads indicating that the problem
is network related.
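
To put the per-switch figure in cluster-wide terms, here is a rough
sketch in Python. It only uses the two numbers given above (70 leaf
switches, about one reboot per four years per switch). The assumption
that switches reboot independently at a constant rate (a Poisson model)
is ours for illustration and not a measured result, and the function
names are made up for this example.

```python
import math

SWITCHES = 70                # leaf switches in Neolith (from the text above)
YEARS_BETWEEN_REBOOTS = 4.0  # about one reboot per four years per switch
DAYS_PER_YEAR = 365.25


def cluster_reboots_per_year(switches, years_between_reboots):
    """Expected number of reboots per year, summed over all switches."""
    return switches / years_between_reboots


def mean_days_between_reboots(switches, years_between_reboots):
    """Average number of days between two reboots anywhere in the cluster."""
    return DAYS_PER_YEAR / cluster_reboots_per_year(switches, years_between_reboots)


def prob_at_least_one_reboot(days, switches, years_between_reboots):
    """Chance of one or more reboots within `days`.

    Assumption: reboots are independent events with a constant rate
    (Poisson model). This is an illustrative simplification.
    """
    expected = cluster_reboots_per_year(switches, years_between_reboots) * days / DAYS_PER_YEAR
    return 1.0 - math.exp(-expected)


if __name__ == "__main__":
    print("reboots per year :", cluster_reboots_per_year(SWITCHES, YEARS_BETWEEN_REBOOTS))
    print("days between     :", round(mean_days_between_reboots(SWITCHES, YEARS_BETWEEN_REBOOTS), 1))
    print("P(>=1 in 7 days) :", round(prob_at_least_one_reboot(7, SWITCHES, YEARS_BETWEEN_REBOOTS), 2))
```

Taken at face value, the numbers correspond to roughly 17.5 reboots per
year across the whole cluster, or about one every three weeks. Each
reboot affects only the jobs using that particular leaf switch.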


Best regards,
The Neolith Team
