[Triolith-users] Job priority problem 2014-10-25 - 2014-11-11

Thu Nov 20 13:18:20 CET 2014

Dear Triolith users,

Non-technical summary:
==================

Between October 25th and November 11th, a problem with the job
scheduler on Triolith resulted in some projects being able to run more
jobs/core hours than they should otherwise have been able to.

After some consideration, we have decided that finding and
compensating those projects that lost computing time due to this would
be too time-consuming. We will instead spend the limited time we have
available to try to prevent the same thing from happening again.

If you feel that your project was extra unlucky (i.e. you had a very
difficult time getting jobs to start in this period), you may contact
us to discuss possible compensation. But please note that we can only
do this in exceptional cases due to the amount of manual work involved
(we have 200 active projects on Triolith...).

Technical details:
=============

(Unless you already know how job scheduling works on Triolith, I
recommend reading https://www.nsc.liu.se/support/batch-jobs/triolith/
before reading the rest of this email)

On October 25th, for unknown reasons the scheduler (SLURM) stopped
updating the internal counters used to determine job priority (you can
see these as e.g. "Raw Usage" and "FairShare" in the output from
"sshare"). This resulted in the fair-share priority for all projects
becoming fixed at whatever value it had at the time. Projects that
happened to have a high priority on October 25th would continue to get
their queued jobs started easily, while projects that had a low
priority lost out. Low priority jobs still started, but only when
there were no high-priority jobs left in the queue.
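
To illustrate the effect, here is a minimal toy sketch in Python. It is
not how SLURM is implemented: the function name, project names and
priority values are made up for illustration, and it ignores backfill,
job size and job age. It simply orders a queue by a per-project
priority that never changes:

```python
def start_order(queued_jobs, frozen_priority):
    """Return jobs in the order they would start in this toy model.

    queued_jobs: list of (project, job_id) tuples in submission order.
    frozen_priority: dict mapping project -> priority value (hypothetical).
    Higher priority starts first; sorted() is stable, so jobs from
    projects with equal priority keep their submission order.
    """
    return sorted(queued_jobs, key=lambda job: -frozen_priority[job[0]])


# Hypothetical queue: two projects submitting alternately.
queue = [("low", "l1"), ("high", "h1"), ("low", "l2"), ("high", "h2")]
priorities = {"high": 0.9, "low": 0.1}  # frozen, never updated

for project, job_id in start_order(queue, priorities):
    print(project, job_id)
```

With working fair-share, the "high" project's priority would drop as it
consumed core hours, and the two projects would eventually take turns.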

The problem was discovered on November 11th, and after the scheduler
was restarted, priorities started updating normally.

Although the original problem is no longer present, the internal
counters were not updated during those 2.5 weeks, so projects that
benefited from the problem were not instantly penalized for this on
November 11th, as one might expect. Adjusting priorities manually to
match what was actually run has been considered but deemed too
time-consuming (we're not lazy, but we have several high-priority
projects ongoing that need our time).

We have gone through our scheduler logs for all of 2014 without
finding this problem at any other time, so we think this is the first
time this has happened.

We monitor many things in our systems, but we had no monitoring for
this particular failure.

Planned actions to minimize the risk of this happening again:
- Add monitoring of the failure type "fair-share priority not changing
over time"
- Update the scheduler to the latest version
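
As a rough idea of what such a monitoring check could look like (a
sketch only, not the check NSC will actually deploy), the Python code
below compares two snapshots of the priority counters taken some time
apart. The snapshot format "account|raw_usage|fairshare" and both
function names are assumptions made for this example; real "sshare"
output would first have to be reduced to that form. The check also
assumes that on a busy system at least one counter changes between
snapshots.

```python
def parse_snapshot(text):
    """Parse lines of the assumed form 'account|raw_usage|fairshare'."""
    snapshot = {}
    for line in text.strip().splitlines():
        account, raw_usage, fairshare = line.strip().split("|")
        snapshot[account] = (float(raw_usage), float(fairshare))
    return snapshot


def counters_look_frozen(older, newer):
    """Return True if no account present in both snapshots changed.

    older, newer: dicts as returned by parse_snapshot().
    If the snapshots share no accounts there is nothing to compare,
    so the function returns False rather than raising an alarm.
    """
    common = older.keys() & newer.keys()
    if not common:
        return False
    return all(older[account] == newer[account] for account in common)


# Hypothetical snapshots taken e.g. one hour apart.
before = parse_snapshot("proj_a|1000|0.50\nproj_b|2000|0.25")
after = parse_snapshot("proj_a|1000|0.50\nproj_b|2000|0.25")

if counters_look_frozen(before, after):
    print("WARNING: fair-share counters unchanged between snapshots")
```

How far apart the snapshots need to be, and how many unchanged
comparisons in a row should raise an alarm, would have to be tuned on
the real system.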

NSC apologizes for the inconvenience this may have caused you.

-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>