[Berzelius-users] Update on Berzelius cluster management system changes

Henrik Henriksson hx at nsc.liu.se
Tue Dec 19 11:41:34 CET 2023


Dear Berzelius user,

We'd like to give you an update on the changes made to Berzelius during the
downtime window at the end of November. Most parts of the migration went very
smoothly, and a majority of users could resume work without any change in
workflow. However, some users were affected by specific issues, and some
convenience features have been unavailable until now.



# Summary

- `jobgraph` is now available again
- A MIG reservation, `1g.10gb`, is now available
- Automatic job termination will resume on 2024-01-08
- Consider Pyxis and Enroot deprecated, to be removed in mid-January
- Limited staffing, and thus user support, over the holidays



# Jobgraph

The `jobgraph` tool is now available again. Getting this tool working
unfortunately had to wait, as issues directly affecting some users had to be
prioritized.



# MIG Reservation

A MIG reservation, `1g.10gb`, is now available. Note that each instance is
smaller than in the previous `3g.20gb` reservation, but that there are more of
them. Each `1g.10gb` instance is equipped with

- 1/7th of an A100's compute capacity,
- 10 GB VRAM,
- 2 cores / 4 threads,
- 32 GB RAM.
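
To judge whether a job is small enough for one of these instances, the figures above can be written down as a simple check. This is only a sketch: the class and function names are made up for illustration, and the job requirements passed in are assumed to be numbers you have measured yourself (for example with `jobgraph`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigInstance:
    """Resources of one MIG instance (hypothetical helper type)."""
    name: str
    vram_gb: int
    cpu_threads: int
    ram_gb: int


# Figures taken from the list above for the `1g.10gb` reservation.
MIG_1G_10GB = MigInstance(name="1g.10gb", vram_gb=10, cpu_threads=4, ram_gb=32)


def fits(instance: MigInstance, vram_gb: float, cpu_threads: int, ram_gb: float) -> bool:
    """Return True if the stated job requirements fit within one instance."""
    return (
        vram_gb <= instance.vram_gb
        and cpu_threads <= instance.cpu_threads
        and ram_gb <= instance.ram_gb
    )


if __name__ == "__main__":
    # A job needing 8 GB VRAM, 4 threads and 16 GB RAM.
    print(fits(MIG_1G_10GB, vram_gb=8, cpu_threads=4, ram_gb=16))
```

The check ignores compute throughput, so a job that fits by memory may still run noticeably slower on 1/7th of a GPU.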



# Automatic job termination

Because `jobgraph` and MIG were unavailable, we decided to leave automatic job
termination turned off. Now that both are available again, we will resume
automatic termination of jobs that do not utilize their allocated resources.
There will be a grace period during which jobs are not terminated; you will
only receive an email. The grace period ends on 2024-01-08.

During January, we plan to raise the current 60 W limit to at least 70 W.



# Deprecation of Pyxis and Enroot

Enroot is a container runtime developed by NVIDIA. Pyxis is an accompanying
plugin, used to launch Enroot jobs with the Slurm scheduler. A very small number
of users were affected by issues when these tools were migrated to our new
system, but those issues have been resolved.

However, due to what we consider to be severe design flaws in Pyxis, we have
decided to deprecate both of these tools and remove them from the cluster on
2024-01-15.

Users should migrate to Apptainer. Please reach out to us if you require
assistance with this.
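
As a rough starting point for the migration, the sketch below translates a Pyxis-style image reference of the form `REGISTRY#IMAGE[:TAG]` into the `docker://` URI that Apptainer expects, and builds a matching `apptainer exec --nv` command line. The function names and the example image are illustrative assumptions, not part of any NSC tooling; references with a `USER@` prefix or a local squashfs path are not handled.

```python
import shlex


def to_apptainer_uri(pyxis_image: str) -> str:
    """Convert `REGISTRY#IMAGE[:TAG]` or `IMAGE[:TAG]` to a docker:// URI."""
    registry, sep, image = pyxis_image.partition("#")
    if not sep:
        # No registry part: the whole string is the image name.
        return f"docker://{pyxis_image}"
    return f"docker://{registry}/{image}"


def apptainer_exec_argv(pyxis_image: str, command: list[str]) -> list[str]:
    """Build an `apptainer exec` argument list; `--nv` enables NVIDIA GPU support."""
    return ["apptainer", "exec", "--nv", to_apptainer_uri(pyxis_image), *command]


if __name__ == "__main__":
    # Example image reference, chosen only for illustration.
    argv = apptainer_exec_argv("nvcr.io#nvidia/pytorch:23.10-py3", ["python", "train.py"])
    # Print the command instead of running it, so it can be pasted into a job script.
    print(shlex.join(argv))
```

Running the script prints a command line that can be placed in a Slurm batch script where the Pyxis `--container-image` option was previously used; bind mounts and other container options would need to be carried over separately.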



# Holiday staffing

During the holidays the staffing and available support at NSC will be limited.
The cluster will remain in normal operation, but some support tickets will be
postponed until after the holidays.

As always, please reach out to us via berzelius-support at nsc.liu.se for any
questions or comments.


Happy Holidays!
--
Berzelius Staff

