[Berzelius-users] Increased focus on efficient use of compute resources

Mon Mar 6 11:14:20 CET 2023

Dear Berzelius users,

In the coming months we will look increasingly closely at how efficiently
hardware resources are used, with regard to both allocations and scheduling
policy. Our current plan is to gradually raise our expectations of jobs and
projects when it comes to efficiency. To keep things simple, we consider power
usage per GPU to be the most important factor. Higher power usage is usually
very strongly correlated with efficient use of the available hardware.

The upcoming changes will be rolled out gradually over the course of several
months. The details will be settled along the way, as we gather more data. So
far we have implemented the following changes:

- Detailed monitoring of jobs and projects.

- Manual notification of users whose larger jobs spend a significant amount of
   time below certain thresholds. This is done only to inform the user; no
   further action is taken.

- Collection and aggregation of more detailed information on how projects use
   the resource. This data will be taken into consideration when evaluating
   continuation proposals.

During the coming months we expect to implement the following changes:

- We will provide better tools for users to determine how their jobs are doing.

- We will send automated alerts to users when jobs are performing below certain
   thresholds.

- We plan to change the scheduling policy to automatically terminate jobs
   performing below certain thresholds. This will start out with very liberal
   thresholds (is the job using the GPU at all?). We plan to allow for some
   "warmup time" at the start of jobs, and to allow for interactive jobs.
   Before this is rolled out, we expect to run it in "noop" mode for a fairly
   long period, sending out warnings without actually terminating any jobs.

Several tools are already available to users on Berzelius for evaluating job
efficiency. Power usage is usually the most important metric to look at. The
most basic approach is to run `nvidia-smi` while logged in to a compute node
to check how a job is doing.
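
As a rough sketch of how per-GPU power usage can be read out this way, the
snippet below asks `nvidia-smi` for power draw, power limit and utilization in
CSV form and prints one summary line per GPU. The helper name
`gpu_power_summary` is purely illustrative and not a tool installed on
Berzelius, and the snippet assumes that the GPUs report numeric power values.
It does not say which values count as too low.

```shell
#!/bin/sh
# Hypothetical helper (not a Berzelius tool): summarize per-GPU power usage.
# Expects nvidia-smi CSV on stdin with the columns
#   index, power.draw [W], power.limit [W], utilization.gpu [%]
# as produced by --format=csv,noheader,nounits.
gpu_power_summary() {
    # Skip rows where the power limit is missing or non-numeric.
    awk -F', *' '$3 + 0 > 0 {
        printf "GPU %s: %.0f W of %.0f W (%.0f%% of limit), util %s%%\n",
               $1, $2, $3, 100 * $2 / $3, $4
    }'
}

# Only query the GPUs if nvidia-smi is available, i.e. on a compute node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,power.draw,power.limit,utilization.gpu \
               --format=csv,noheader,nounits | gpu_power_summary
fi
```

A single reading only shows one moment in time, so it is worth repeating the
query a few times while the job is running before drawing any conclusions.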

Another, slightly easier, tool is `jobload`. Running `jobload -j $JOBID` will
provide a basic snapshot of the job's current GPU utilization.

As part of our plan to provide better tools for users to monitor jobs, we now
provide the tool `jobgraph`, which produces a detailed dashboard of relevant
metrics across the lifetime of a job. As this tool is newly developed, we are
open to suggestions, feature requests and bug reports. Simply run
`jobgraph -j $JOBID` to try it out.

Please reach out to us if you have any questions, suggestions or comments!