[Bi-users] Risk jobs

Kent Engström kent at nsc.liu.se
Wed Feb 1 09:57:34 CET 2017


Dear Bi Users,

A short update on risk jobs: during the downtime last Thursday, we
installed a version of the Slurm workload manager with a patch developed
at NSC to fix a bug that affected risk jobs.

The bug made Slurm think that some non-risk job was allowed to start,
so it killed one or more risk jobs to make nodes available, only to
find out during later, more stringent checks that the non-risk job was
blocked by a limit and should not run. The fix makes sure that the
relevant checks are made early enough to prevent that wasteful behaviour.
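
As a rough illustration of the ordering problem, here is a toy model in
Python. It is not the Slurm code or the NSC patch; the names Job,
blocked_by_limit, schedule_pass and check_limits_early are made up for
this sketch, and the single flag stands in for whatever limit blocked the
non-risk job.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    # Hypothetical flag: stands in for any limit that blocks the job.
    blocked_by_limit: bool = False


def schedule_pass(pending, running_risk, free_nodes, check_limits_early):
    """Toy scheduling pass. Returns (started, names_of_killed_risk_jobs)."""
    # Fixed ordering: the limit check runs before any risk job is touched.
    if check_limits_early and pending.blocked_by_limit:
        return False, []

    # Pick risk jobs to kill until the pending job would fit.
    victims = []
    available = free_nodes
    for job in running_risk:
        if available >= pending.nodes:
            break
        victims.append(job.name)
        available += job.nodes

    # Not enough nodes even with preemption: kill nothing.
    if available < pending.nodes:
        return False, []

    # Buggy ordering: the limit check only happens here, after the
    # victims have already been killed.
    if pending.blocked_by_limit:
        return False, victims

    return True, victims


if __name__ == "__main__":
    blocked = Job("normal", nodes=4, blocked_by_limit=True)
    risk = [Job("risk1", nodes=2), Job("risk2", nodes=2)]
    print("late check: ", schedule_pass(blocked, risk, 0, check_limits_early=False))
    print("early check:", schedule_pass(blocked, risk, 0, check_limits_early=True))
```

With the late check, both risk jobs are killed although the blocked job
never starts; with the early check, nothing is killed.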

We hope that this will make risk jobs on Bi work a bit better from now
on.

Thanks to Pär at NSC for the detailed troubleshooting and patch, and to
Camilla and others at SMHI for problem reports about this behaviour.

Best regards,
-- 
Kent Engström, National Supercomputer Centre
kent at nsc.liu.se, +46 13 28 4444


