[Berzelius-users] Information on Scheduled Downtime 22-23 November
Henrik Henriksson
hx at nsc.liu.se
Wed Nov 8 10:48:06 CET 2023
Dear Berzelius users,
## Summary ##
Berzelius will be down for maintenance 22-23 November. During this
window, we will replace the cluster management software completely, in order to
run the same management solution as the rest of NSC's clusters.
While this is a *very* large operation, we will try to minimize the effects our
planned changes have on normal use. We think most users won't notice any
significant difference after the downtime has concluded. However, we strongly
encourage all users to test their code and applications soon after the downtime
and reach out to us if there are any issues.
Any jobs still queued when the downtime begins will be removed from the queue
and must be resubmitted once the downtime has concluded.
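To see which of your jobs are still pending and what time limits they have, you
can list them with `squeue` (a hedged sketch using standard Slurm format flags;
adjust the column widths to taste):

```shell
# List your own jobs with job ID, time limit, name and state.
squeue -u "$USER" -o "%.10i %.12l %.20j %.8T"
```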
You should read the rest of this email if you:
- Have deadlines soon after the maintenance window
- Use the "3g.20gb" reservation
- Use complex scheduling or automation workflows
## Schedule ##
The preliminary schedule [1] announced a few weeks ago will remain unchanged.
- 2023-11-22 07:00: All jobs must complete before this time, as compute
  nodes will become unavailable. During the days leading up to the
  downtime, you must specify a timelimit (using `-t`) that guarantees
  your job finishes before the downtime starts.
- 2023-11-22 08:00: All user sessions on login nodes will be
terminated, and the login nodes will become unavailable.
- 2023-11-23: We expect Berzelius to return to normal operation late
afternoon.
Dates were picked to respect the result of the survey we conducted earlier this
autumn [2].
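As a hedged illustration of the `-t` requirement above: a job submitted at
19:00 on 21 November has at most 12 hours before nodes go offline at 07:00, so
its timelimit must stay below that (the script name here is a hypothetical
placeholder):

```shell
# Hypothetical submission on 2023-11-21 19:00; nodes become
# unavailable at 2023-11-22 07:00, i.e. 12 hours later.
sbatch -t 11:00:00 myjob.sh   # an 11 h limit leaves a safety margin
```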
## Background ##
We run our own in-house cluster management system on all NSC clusters
except Berzelius. This includes academic clusters, storage systems and
production weather forecasting. It is a mature and stable management layer
that has been in use and in development at NSC for several generations of
clusters.
However, NSC took delivery of Berzelius with a preconfigured cluster management
solution from our vendors. There were many reasons for doing so, first and
foremost an attempt to speed up deployment. Since then, we have augmented this
solution with plenty of additional features taken from NSC's in-house management
stack.
With the expansion of Berzelius done, we are now looking at the rest of the
operational lifetime of the cluster, and have decided to fully migrate the
cluster from the vendor provided solution to our in-house stack. We expect the
following benefits from the switch:
- Increased velocity in applying security patches.
- Increased mobility for NSC staff - more people are able to work on the
  system, as we have the same software everywhere.
- A more coherent user environment - we currently try to emulate a user
  environment that is as similar as possible to our other clusters (for
  example Tetralith). With this change, we will run the same software
  everywhere.
## Notable Changes ##
We aim to ensure that all user software still works after the downtime, and all
modules and software provided by NSC will remain as-is. During the downtime, we
will:
- Reinstall all login nodes and compute nodes with Rocky Linux 8, instead of
RHEL8. As Rocky Linux aims to be ABI-compatible, bug-for-bug, compiled code
should keep working without changes.
- Replace the cluster management system completely. This will include changes
in the user environment, but our aim is to ensure workflows and workloads
can remain unchanged.
- Clean up the scheduling system and configure it to be more in line with
other NSC clusters. We will change the name of the default Slurm partition
from the current "defq" to "berzelius". This change should only affect users
with more advanced scheduling scripts.
- Shrink and downgrade the current MIG-reservation "3g.20gb". After the
downtime, we will most likely offer "1g.10gb", consisting of
- 1/7th of the SMs on an A100,
- 10GB VRAM,
- 2 cores / 4 threads CPU,
- 32GB RAM.
If you have a workload where you believe the "3g.20gb" variant has
significant performance benefits over "1g.10gb" or a full A100, please reach
out to us.
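Put together, a post-downtime batch script header might look as follows. This
is a hedged sketch: the partition and reservation names are taken from this
announcement, while the resource values and program name are hypothetical
placeholders.

```shell
#!/bin/bash
#SBATCH --partition=berzelius     # new default partition name (was "defq")
#SBATCH --reservation=1g.10gb     # only if you want the planned MIG slice
#SBATCH --gpus=1
#SBATCH -t 01:00:00

srun ./my_program                 # hypothetical application
```

Jobs that never set `--partition` will simply land in the new default partition
and need no changes.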
## Risks and Mitigations ##
As with any operation of this scale, the changes we aim to make carry some
risk. The main risks are outlined below, along with planned mitigations.
- Breaking existing applications - we aim to keep the user environment as
similar as possible, and are not planning any breaking changes w.r.t.
software versions, library versions, ABIs, driver versions or CUDA versions.
We will be running our standard test suites before returning the cluster to
service.
- Accidentally removed software, libraries or functionality - we aim to ensure
that all user facing features remain. However, we may miss something. If you
notice that anything you depend on has disappeared after the downtime,
please reach out to us.
- Delays in deployment - we have scheduled two full days of downtime, and
expect to use all of it. To reduce the risk of additional unexpected
downtime, we have been doing (successful) test deployments on identical
hardware.
- Failure to deploy - if we hit unforeseen issues we cannot overcome
  within a reasonable time-frame, we will be able to restore the old
  management system on short notice, as our in-house system will run on
  separate hardware.
- Missing features - we may fail to implement features some users depend on
during the downtime, either due to human error or lack of time. To mitigate
this as much as we can, most of the installation is fully automated, while
the rest is well documented. However, since there are significant
differences between the two management systems, we may have overlooked
something. Most such issues should be easy to correct after the downtime
window. Please reach out to us if you notice anything after the fact, or if
you want to ensure we have considered your use case.
- Data loss or corruption - this is something we continually mitigate across
all our clusters. Relevant tooling has "dry-run" capabilities we actively
use and check. We plan to keep network filesystems unmounted or read-only as
far into the process as possible. We do not plan to make any changes at all
on the shared Lustre filesystem during the downtime.
## Questions and comments ##
As always, if you have any questions or comments, please reach out to us at
berzelius-support at nsc.liu.se
[1] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-October/000050.html
[2] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-September/000045.html
Kind regards,
Berzelius staff