[Berzelius-users] Automatic termination of inefficient jobs

Henrik Henriksson hx at nsc.liu.se
Mon Sep 4 11:39:56 CEST 2023


Dear Berzelius users,

After 2023-09-13, jobs that have run for more than an hour with an average
GPU power utilization below 60W will be automatically canceled.

As previously announced [1], we are introducing a scheduling policy that
automatically terminates jobs performing below certain efficiency thresholds.
Currently, this is running in a "noop" mode, where the user is sent an email
informing them that the job performed below the fairly generous threshold and
would have been terminated. We plan to bring this fully live by 2023-09-13,
after which jobs fulfilling the criteria below will be canceled automatically.

The following criteria are used to determine if a job should be
canceled:

   - Average power utilization per GPU from start of job is below 60W. This is
     slightly above our measured idle-level of 52W. In normal use, we expect jobs
     properly utilizing the GPUs to pull 200W, with most AI/ML workloads at above
     300W.

   - The job is scheduled without any GPU.

Exceptions to this are:

   - Jobs that have not yet run for one hour, to allow for *some*
     preprocessing at the beginning of jobs,

   - Interactive jobs, started with the NSC `interactive` tool,

   - Jobs in the `devel`-reservation,

   - Jobs running within reservations,

   - Explicitly whitelisted projects (please contact us if you believe this
     applies to you).
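The criteria and exceptions above can be read as a simple decision rule. The
following is only a minimal sketch of that reading, not the code actually
running on Berzelius. The `Job` class, its field names and `should_cancel` are
hypothetical names chosen for illustration, and the sketch assumes that the
exceptions apply to both criteria.

```python
from dataclasses import dataclass

# Values taken from the announcement.
POWER_THRESHOLD_W = 60.0  # average power per GPU since job start
GRACE_PERIOD_S = 3600     # jobs younger than one hour are exempt


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    runtime_s: float
    num_gpus: int
    avg_power_per_gpu_w: float  # ignored when num_gpus == 0
    interactive: bool = False   # started with the NSC `interactive` tool
    in_reservation: bool = False  # any reservation, including `devel`
    whitelisted_project: bool = False


def should_cancel(job: Job) -> bool:
    """Return True if the job matches the cancellation criteria."""
    # Exceptions are checked first (assumed to cover both criteria).
    if job.runtime_s < GRACE_PERIOD_S:
        return False
    if job.interactive or job.in_reservation or job.whitelisted_project:
        return False
    # Criterion: the job is scheduled without any GPU.
    if job.num_gpus == 0:
        return True
    # Criterion: average power per GPU since job start is below threshold.
    return job.avg_power_per_gpu_w < POWER_THRESHOLD_W
```

Under this reading, a two-hour batch job averaging 55W per GPU would be
canceled, while the same job started with `interactive` would not.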

The criteria as described here are slightly simplified, and they will be made
stricter over time. In particular, we expect to raise the power utilization
threshold above 60W at some point in the future.
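To get a rough idea of how a running job compares against the 60W threshold,
the per-GPU power draw can be sampled with `nvidia-smi`. The sketch below is
one possible way to do this and makes no claim about how the system itself
measures power. The function names are hypothetical, and it assumes that
`nvidia-smi` is available on the node where the job runs.

```python
import subprocess


def parse_power_draw(csv_text):
    """Parse 'index, power' CSV lines into {gpu_index: watts}."""
    readings = {}
    for line in csv_text.strip().splitlines():
        try:
            index, power = (field.strip() for field in line.split(","))
            readings[int(index)] = float(power)
        except ValueError:
            # Skip malformed lines and values such as "[N/A]".
            continue
    return readings


def read_power_draw():
    """Take one power sample per GPU using nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_draw(result.stdout)


def average_per_gpu(samples):
    """Average a list of {gpu_index: watts} samples per GPU."""
    totals, counts = {}, {}
    for sample in samples:
        for index, watts in sample.items():
            totals[index] = totals.get(index, 0.0) + watts
            counts[index] = counts.get(index, 0) + 1
    return {index: totals[index] / counts[index] for index in totals}
```

Calling `read_power_draw()` at regular intervals and passing the collected
samples to `average_per_gpu` gives an approximate average since sampling
started, which can then be compared with 60W.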

Notifications about canceled jobs are sent at most once per hour, to avoid
flooding users whose job arrays are canceled. Users who are currently affected
have already started receiving automated emails from the system. Any projects
we have identified as particularly affected have been contacted separately.

As always, please reach out to us if you have any questions or comments.

[1] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-March/000028.html

-- 
Henrik Henriksson
Systems Administrator
National Supercomputer Centre