[Tornado-users] tornado: job failure

Per Lundqvist perl at nsc.liu.se
Wed Jul 2 12:27:49 CEST 2008


Dear Tornado users,

Yesterday afternoon, at around 15:00 CEST, the system node on Tornado 
crashed. We got it up and running again fairly quickly, but we failed to 
notice that jobs had trouble starting after the incident (they were 
hanging and producing no output).

The problem was caused by the license daemons on the compute nodes 
bailing out when they could not contact the license server on the 
system node.

We fixed it by restarting the license daemons on the compute nodes (at 
approximately 11:15 today). Unfortunately, this seems to have caused 
hanging jobs to abort with an error message like:

   --- mpimon --- n1: Error when receiving message  ---
   --- mpimon --- Contact license at scali.com to request or check a license
   --- ---
   Jul  2 11:08:44: (mpimon at n1)(21763) Mutable error: subMonitor-1 exits
   --- before allFinished is set

Please check the status of your recent jobs and resubmit them if necessary.
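
If you have many jobs to go through, a small script along the following 
lines may help. It is only a sketch: the function name find_aborted_jobs, 
the "*.out" file pattern and the directory you pass in are assumptions 
about how your own job output is organised, not anything provided on 
Tornado. The two search strings are taken from the error message quoted 
above.

```python
from pathlib import Path

# Substrings taken from the mpimon error message quoted in this mail.
SIGNATURES = (
    "Error when receiving message",
    "Mutable error: subMonitor",
)


def find_aborted_jobs(output_dir, pattern="*.out"):
    """Return output files in output_dir that contain the mpimon error.

    output_dir and pattern are hypothetical; adjust them to wherever
    your job scripts write their output.
    """
    hits = []
    for path in sorted(Path(output_dir).glob(pattern)):
        try:
            # Be tolerant of odd bytes in job output.
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file or a directory: skip it
        if any(sig in text for sig in SIGNATURES):
            hits.append(path)
    return hits
```

Calling find_aborted_jobs("/path/to/your/job/output") returns the files 
that contain the error; those jobs are candidates for resubmission. Jobs 
that hung without ever writing any output will not be found this way, so 
it is still worth checking the queue status as well.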

/Per

-- 
Per Lundqvist

National Supercomputer Centre
Linköping University, Sweden

http://www.nsc.liu.se

