[Berzelius-users] Automatic termination of inefficient jobs

Henrik Henriksson hx at nsc.liu.se
Wed Sep 13 15:45:16 CEST 2023


Dear Berzelius users,

The system for automatic termination of inefficient jobs is now enabled and
will cancel jobs that meet the criteria quoted below.

> After 2023-09-13, jobs that spend more than an hour below 60W average GPU
> power utilization will be automatically canceled.
>
> As previously announced [1], we are introducing a scheduling policy to
> automatically terminate jobs performing below certain thresholds w.r.t.
> efficiency. Currently, this is running in a "noop"-mode, where the user is sent
> an email informing them that the job performed below the fairly generous
> threshold and that the job would have been terminated. We plan to bring this
> fully live by 2023-09-13, after which jobs fulfilling certain criteria will be
> canceled automatically.
>
> The following criteria are used to determine if a job should be
> canceled:
>
>   - Average power utilization per GPU since the start of the job is below
>     60W. This is slightly above our measured idle level of 52W. In normal
>     use, we expect jobs properly utilizing the GPUs to draw 200W, with most
>     AI/ML workloads above 300W.
>
>   - The job is scheduled without any GPU.
>
> Exceptions to this are:
>
>   - Jobs that have not yet run for one hour, to allow for *some*
>     preprocessing at the beginning of jobs,
>
>   - Interactive jobs, started with the NSC `interactive` tool,
>
>   - Jobs in the `devel`-reservation,
>
>   - Jobs running within reservations,
>
>   - Explicitly whitelisted projects (please contact us if you believe this
>     applies to you).
>
> These criteria are slightly simplified and will be made stricter over time. In
> particular, we expect to raise the power utilization threshold above 60W at
> some point in the future.
>
> Notifications about canceled jobs are sent hourly, to avoid flooding users
> whose job arrays are canceled. Currently affected users have already
> started getting automated emails from the system. Any projects we have
> identified as particularly affected have been contacted separately.
>
> As always, please reach out to us if you have any questions or comments.
>
> [1] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-March/000028.html
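
For anyone who wants to reason about whether a job would be affected, the
criteria and exceptions quoted above can be sketched roughly as follows. This
is only an illustration, not the code running on Berzelius: the `Job` class,
its field names and `should_cancel` are hypothetical, and the sketch assumes
the average is taken over all samples from all GPUs allocated to the job.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

POWER_THRESHOLD_W = 60.0  # threshold stated in the announcement
GRACE_PERIOD_S = 3600.0   # jobs younger than one hour are exempt


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    runtime_s: float
    # One sequence of power samples (watts) per allocated GPU, since job start.
    gpu_power_samples_w: Sequence[Sequence[float]]
    interactive: bool = False          # started with the NSC `interactive` tool
    reservation: Optional[str] = None  # e.g. "devel", or any other reservation
    project_whitelisted: bool = False


def should_cancel(job: Job) -> bool:
    """Return True if the job matches the quoted cancellation criteria."""
    # Exceptions are checked first.
    if job.runtime_s < GRACE_PERIOD_S:
        return False
    if job.interactive or job.reservation is not None or job.project_whitelisted:
        return False

    # Criterion: the job is scheduled without any GPU.
    if not job.gpu_power_samples_w:
        return True

    # Criterion: average power per GPU since job start is below the threshold.
    # Assumption: all samples from all GPUs are pooled into one mean.
    samples = [w for gpu in job.gpu_power_samples_w for w in gpu]
    if not samples:
        return False  # assumption: no measurements means no decision
    return sum(samples) / len(samples) < POWER_THRESHOLD_W


if __name__ == "__main__":
    idle_job = Job(runtime_s=7200, gpu_power_samples_w=[[53.0, 55.0, 58.0]])
    busy_job = Job(runtime_s=7200, gpu_power_samples_w=[[310.0, 295.0]])
    print(should_cancel(idle_job), should_cancel(busy_job))
```

To get a feel for the power draw of your own job, one option is to sample
`nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits` on the
compute node while the job is running.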


-- 
Henrik Henriksson
Systems Administrator
National Supercomputer Centre

