[Berzelius-users] Information on Scheduled Downtime 22-23 November

Henrik Henriksson hx at nsc.liu.se
Wed Nov 8 10:48:06 CET 2023

Dear Berzelius users,

## Summary ##

Berzelius will be down for maintenance 22-23 November. During this
window, we will replace the cluster management software completely, in order to
run the same management solution as the rest of NSC's clusters.

While this is a *very* large operation, we will try to minimize the effects our
planned changes have on normal use. We think most users won't notice any
significant difference after the downtime has concluded. However, we strongly
encourage all users to test their code and applications soon after the downtime
and reach out to us if there are any issues.

Any jobs still in the queue when the downtime starts will be removed and must
be resubmitted once the cluster returns to service.
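Since queued jobs will not survive the downtime, it can help to record what you
have in the queue beforehand so resubmission is easy. A minimal sketch using
standard Slurm commands (the job ID 12345 is a placeholder):

```shell
# List your own jobs with ID, partition, name, state, submit time
# and working directory.
squeue -u "$USER" -o "%.10i %.9P %.30j %.8T %.20V %Z"

# For a given job, scontrol shows the original batch script (Command=)
# and the directory it was submitted from (WorkDir=).
scontrol show job 12345 | grep -E 'Command=|WorkDir='
```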

You should read the rest of this email if you:

   - Have deadlines soon after the maintenance window
   - Use the "3g.20gb" reservation
   - Use complex scheduling or automation workflows

## Schedule ##

The preliminary schedule [1] announced a few weeks ago will remain unchanged.

   - 2023-11-22 07:00: Any scheduled jobs must complete before this
     time, as compute nodes become unavailable from this point. During the
     days leading up to the downtime, you must specify a time limit (using
     `-t`) that guarantees the job can complete before the downtime starts.

   - 2023-11-22 08:00: All user sessions on login nodes will be
     terminated, and the login nodes will become unavailable.

   - 2023-11-23: We expect Berzelius to return to normal operation late
     in the day.

The dates were chosen based on the results of the survey we conducted earlier
this autumn [2].
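To illustrate the `-t` requirement above: Slurm will not start a job whose time
limit extends into the maintenance window, so shortening the limit lets jobs
keep starting right up until the downtime. A hedged sketch (the script name and
GPU count are illustrative):

```shell
# A job submitted with a 12-hour time limit can still start on the
# evening of 21 November; a 48-hour limit would keep it pending.
sbatch -t 12:00:00 --gpus=1 my_job.sh

# Optionally, --deadline tells Slurm to remove the job if it cannot
# finish before the given time.
sbatch -t 12:00:00 --deadline=2023-11-22T07:00:00 my_job.sh
```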

## Background ##

We run our own in-house cluster management system on all NSC clusters
except Berzelius. This includes academic clusters, storage systems and
production weather forecasting. It is a mature and stable management layer
that has been in use and under development at NSC for several generations of
clusters.

However, NSC took delivery of Berzelius with a preconfigured cluster management
solution from our vendors. There were many reasons for doing so, first and
foremost an attempt to speed up deployment. Since then, we have augmented this
solution with many additional features taken from NSC's in-house management
stack.

With the expansion of Berzelius done, we are now looking at the rest of the
operational lifetime of the cluster, and have decided to fully migrate the
cluster from the vendor provided solution to our in-house stack. We expect the
following benefits from the switch:

   - Increased velocity in applying security patches.

   - Increased mobility for NSC staff - more people are able to work on the
     system, as we run the same software everywhere.

   - More coherent user environment - We currently try to emulate a user
     environment that is as similar as possible to our other clusters (for
     example Tetralith). With this change, we will run the same software stack
     everywhere.

## Notable Changes ##

We aim to ensure that all user software still works after the downtime, and all
modules and software provided by NSC will remain as-is. During the downtime, we
will:

   - Reinstall all login nodes and compute nodes with Rocky Linux 8, instead of
     RHEL8. As Rocky Linux aims to be ABI-compatible, bug-for-bug, compiled code
     should keep working without changes.

   - Replace the cluster management system completely. This will include changes
     in the user environment, but our aim is to ensure workflows and workloads
     can remain unchanged.

   - Clean up the scheduling system and configure it to be more in line with
     other NSC clusters. We will change the name of the default Slurm partition
     from the current "defq" to "berzelius". This change should only affect users
     with more advanced scheduling scripts.

   - Shrink and downgrade the current MIG-reservation "3g.20gb". After the
     downtime, we will most likely offer "1g.10gb", consisting of

     - 1/7th of the SMs on an A100,
     - 10GB VRAM,
     - 2 cores / 4 threads CPU,
     - 32GB RAM.

     If you have a workload where you believe the "3g.20gb" variant has
     significant performance benefits over "1g.10gb" or a full A100, please reach
     out to us.
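For batch scripts that name the partition explicitly, the rename means
directives like the first pair below need updating. The reservation directive
is an assumption about how the new MIG slice will be requested; please check
with us after the downtime:

```shell
# Old default partition name:
#SBATCH --partition=defq
# New default partition name after the downtime:
#SBATCH --partition=berzelius

# Assumed way to request the new MIG slice; the exact reservation
# name may differ once it is in place:
#SBATCH --reservation=1g.10gb
```

Jobs that do not specify a partition will simply use the new default, which is
why only more advanced scheduling scripts should be affected.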

## Risks and Mitigations ##

As with any operation, the changes we aim to make carry some risk. Some risks are
outlined below, along with planned mitigations.

   - Breaking existing applications - we aim to keep the user environment as
     similar as possible, and are not planning any breaking changes w.r.t.
     software versions, library versions, ABIs, driver versions or CUDA versions.
     We will be running our standard test suites before returning the cluster to
     production.

   - Accidentally removed software, libraries or functionality - we aim to ensure
     that all user facing features remain. However, we may miss something. If you
     notice that anything you depend on has disappeared after the downtime,
     please reach out to us.

   - Delays in deployment - we have scheduled two full days of downtime, and
     expect to use all of it. To reduce the risk of additional unexpected
     downtime, we have been doing (successful) test deployments on identical
     hardware.

   - Failure to deploy - In the case we hit unforeseen issues we cannot overcome
     within a reasonable time-frame, we will be able to restore the system to the
     old management system on short notice, as we will use separate hardware for
     our in-house system.

   - Missing features - we may fail to implement features some users depend on
     during the downtime, either due to human error or lack of time. To mitigate
     this as much as we can, most of the installation is fully automated, while
     the rest is well documented. However, since there are significant
     differences between the two management systems, we may have overlooked
     something. Most such issues should be easily correctable after the downtime
     window. Please reach out to us if you notice anything after the fact, or if
     you want to ensure we have considered your use case.

   - Data loss or corruption - this is something we continually mitigate across
     all our clusters. Relevant tooling has "dry-run" capabilities that we
     actively
     use and check. We plan to keep network filesystems unmounted or read-only as
     far into the process as possible. We do not plan to make any changes at all
     on the shared Lustre filesystem during the downtime.

## Questions and comments ##

As always, if you have any questions or comments, please reach out to us at
berzelius-support at nsc.liu.se

[1] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-October/000050.html
[2] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-September/000045.html

Kind regards,
Berzelius staff
