[Tornado-users] Tornado is now available again

Johan Raber raber at nsc.liu.se
Mon May 4 19:25:19 CEST 2009


Dear Tornado users,

The unfortunate outage of Tornado's services is now over. Feel free to use
them again, and please report any problems as soon as you see them.

The loss of services was due to the system node crashing. The crash was most
likely caused by a change, prompted by security concerns, in the routines used
to synchronize vital configuration files across the cluster. A probable bug in
one of the components involved caused forked-off processes to accumulate,
eventually consuming all available memory. A workaround has been implemented
to avoid this problem.

Unfortunately, some submitted jobs became casualties of this incident. As
reported by qstat -n, they were:

105388.torn          sm_aulle rossbyq  run_ecocliflake_  32548    12  --    --  144:0 R   --
   n104/1+n104/0+n103/1+n103/0+n102/1+n102/0+n101/1+n101/0+n100/1+n100/0+n99/1
   +n99/0+n98/1+n98/0+n97/1+n97/0+n96/1+n96/0+n95/1+n95/0+n94/1+n94/0+n93/1
   +n93/0
105483.torn          sm_stran rossbyq  runA2_1_run0        577    15  --    --  144:0 R 00:00
   n48/1+n48/0+n47/1+n47/0+n34/1+n34/0+n33/1+n33/0+n32/1+n32/0+n31/1+n31/0
   +n30/1+n30/0+n29/1+n29/0+n28/1+n28/0+n27/1+n27/0+n26/1+n26/0+n25/1+n25/0
   +n12/1+n12/0+n11/1+n11/0+n10/1+n10/0
105529.torn          sm_jenbr rossbyq  b30.004.1n        10950    16  --    --  75:00 R   --
   n107/1+n107/0+n106/1+n106/0+n87/1+n87/0+n86/1+n86/0+n82/1+n82/0+n68/1+n68/0
   +n67/1+n67/0+n66/1+n66/0+n65/1+n65/0+n64/1+n64/0+n46/1+n46/0+n45/1+n45/0
   +n44/1+n44/0+n43/1+n43/0+n42/1+n42/0+n41/1+n41/0
105542.torn          x_sema   misuq    run_MC             4665     4  --    --  100:0 R 00:00
   n20/1+n20/0+n19/1+n19/0+n18/1+n18/0+n17/1+n17/0
105598.torn          x_maxbe  misuq    LGM_ORCA05_LIM2    2156     8  --    --  70:00 R   --
   n119/1+n119/0+n118/1+n118/0+n117/1+n117/0+n116/1+n116/0+n115/1+n115/0+n114/1
   +n114/0+n113/1+n113/0+n112/1+n112/0
105600.torn          x_maxbe  misuq    LGM_ORCA05_LIM2    4290     8  --    --  40:00 R 00:00
   n111/1+n111/0+n110/1+n110/0+n109/1+n109/0+n108/1+n108/0+n105/1+n105/0+n92/1
   +n92/0+n91/1+n91/0+n90/1+n90/0

We regret any inconvenience or problems this has caused you, and thank you
for swiftly notifying us of the problems as they occurred.

Regards,
NSC support

-- 
Johan Raber, PhD
Systems expert -- distributed computing
National Supercomputer Center, Linköping University
SE-581 83 LINKÖPING, SWEDEN