[Tornado-users] tornado: job failure
Per Lundqvist
perl at nsc.liu.se
Wed Jul 2 12:27:49 CEST 2008
Dear Tornado users,
yesterday afternoon, at around 15:00 CEST, the system node on Tornado
crashed. We got it up and running again fairly quickly, but we failed to
notice that jobs had trouble starting after the incident (they hung and
produced no output).
The problem was caused by the license daemons on all the compute nodes
bailing out when they could not contact the license server on the
system node.
It was remedied by restarting the license daemon on the compute nodes (at
approx. 11:15 today). Unfortunately, this seems to have caused hanging nodes
to abort with an error message like:
--- mpimon --- n1: Error when receiving message ---
--- mpimon --- Contact license at scali.com to request or check a license
--- ---
Jul 2 11:08:44: (mpimon at n1)(21763) Mutable error: subMonitor-1 exits
--- before allFinished is set
Please check the status of your recent jobs and resubmit if necessary.
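If you have many jobs to go through, something like the following rough
Python sketch may help to spot output files containing the abort message
quoted above. It assumes that your job output ends up in files matching
"*.out" in one directory; that pattern, the directory, and the function
names are only illustrative assumptions, not anything Tornado-specific.
Note that jobs which hung without producing any output will not be found
this way.

```python
from pathlib import Path

# Fragments of the mpimon abort message quoted in this mail.
MARKERS = ("Error when receiving message", "before allFinished is set")


def looks_aborted(text):
    """Return True if the text contains the mpimon abort message."""
    return "mpimon" in text and any(marker in text for marker in MARKERS)


def find_aborted_outputs(directory=".", pattern="*.out"):
    """List output files (hypothetical naming) that contain the message."""
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" avoids failing on stray non-UTF-8 bytes.
        if looks_aborted(path.read_text(errors="replace")):
            hits.append(path)
    return hits


if __name__ == "__main__":
    for path in find_aborted_outputs():
        print(path)
```

Run it in the directory where your job output is written; each file it
prints belongs to a job that probably needs to be resubmitted.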
/Per
--
Per Lundqvist
National Supercomputer Centre
Linköping University, Sweden
http://www.nsc.liu.se