[Neolith-users] Neolith infiniband network
Mattias Slabanja
slabanja at nsc.liu.se
Wed Oct 22 10:24:52 CEST 2008
Dear Neolith user,
On a number of occasions we have experienced unexpected infiniband
switch reboots in the Neolith cluster. These reboot events have
occurred very infrequently, but each one has disrupted a handful of
MPI applications.
Working together with an engineering team from the switch vendor, we are
gathering information to be able to find the root cause of the
problem, and in this process we believe that during the coming week
there could be an elevated risk of switch reboot.
Although it is unlikely, if you do experience unexpected MPI failures
during the coming week, please let us know.
For those of you who are interested in the technical details:
The problem is related to a watchdog mechanism that governs the
management processor in each infiniband leaf switch (there are 70 such
switches in Neolith). About once every four years per switch, the
watchdog (for some unknown reason) decides that the switch is "stuck"
and ought to be rebooted.
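To put the per-switch figure in cluster-wide terms, here is a rough
back-of-the-envelope sketch in Python. It uses only the two numbers
given above (70 leaf switches, about one reboot per four years per
switch). Treating the reboots as independent events at a constant rate
(a Poisson model) is an assumption made for illustration, as are the
function names; it is not a description of how the watchdog behaves.

```python
import math

# Figures taken from the message above.
SWITCHES = 70
REBOOTS_PER_SWITCH_PER_YEAR = 1 / 4  # about one reboot per four years

DAYS_PER_YEAR = 365.25


def cluster_rate_per_year(switches=SWITCHES,
                          per_switch=REBOOTS_PER_SWITCH_PER_YEAR):
    """Expected number of switch reboots per year across the cluster."""
    return switches * per_switch


def mean_days_between_reboots(rate_per_year):
    """Average spacing between reboots somewhere in the cluster, in days."""
    return DAYS_PER_YEAR / rate_per_year


def prob_at_least_one(rate_per_year, days):
    """Probability of at least one reboot within `days`.

    Assumes independent reboots at a constant rate (Poisson model),
    which is an illustrative assumption only.
    """
    expected = rate_per_year * days / DAYS_PER_YEAR
    return 1.0 - math.exp(-expected)


if __name__ == "__main__":
    rate = cluster_rate_per_year()
    print(f"Expected reboots per year, whole cluster: {rate:.1f}")
    print(f"Mean days between reboots: {mean_days_between_reboots(rate):.1f}")
    print(f"P(at least one reboot in 7 days): {prob_at_least_one(rate, 7):.2f}")
```

Under these assumptions the cluster as a whole would see roughly 17.5
reboots per year, or about one every three weeks, even though each
individual switch is rarely affected.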
To pinpoint the cause, work has been ongoing for the last couple of
weeks, and we are currently pursuing leads indicating that the
problem is network related.
Best regards,
The Neolith Team