[Berzelius-users] Cluster expansion is complete - nodes are available

Henrik Henriksson hx at nsc.liu.se
Thu Jun 22 15:58:26 CEST 2023


Dear Berzelius users,

The expansion of the Berzelius cluster has now been finalized and the nodes are
available for jobs as of today.



--- New Hardware ---

With the 34 new 80 GB nodes, the cluster now consists of 94 nodes in total. The
hardware is very similar across all nodes, but the new 'fat' nodes simply have
more of it than the old 'thin' nodes: twice the amount of VRAM, system RAM,
storage bandwidth and local scratch. They are otherwise identical.

|----------------------+--------------------------+--------------------------|
|                      | DGX A100 40 GB 'thin'    | DGX A100 80 GB 'fat'     |
|----------------------+--------------------------+--------------------------|
| Number of nodes      | 60                       | 34                       |
| Node names           | node[001-060]            | node[061-094]            |
| GPUs                 | 8x A100 40 GB            | 8x A100 80 GB            |
| Compute Interconnect | 8x 200 Gbit/s HDR IB     | 8x 200 Gbit/s HDR IB     |
| Storage Interconnect | 1x 200 Gbit/s HDR IB     | 2x 200 Gbit/s HDR IB     |
| RAM                  | 1 TB                     | 2 TB                     |
| CPU                  | 2x AMD EPYC 7742 64-Core | 2x AMD EPYC 7742 64-Core |
|----------------------+--------------------------+--------------------------|

Additionally, the storage has been expanded from 1 PB to 1.5 PB. We estimate the
aggregate sustained read performance of the storage after the expansion to be
over 300 GB/s, up from the previous 192 GB/s.



--- New Scheduling Policy ---

With the new nodes comes a new scheduling policy, affecting *all* users. As the
new 'fat' nodes are almost twice as expensive as the older 'thin' nodes, 'fat'
nodes are billed at 2x the cost of 'thin' nodes. That is, one GPU-hour on a
'fat' node is charged as two GPU-hours on a 'thin' node.
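
For clarity, the rule boils down to: billed GPU-hours = GPUs x wall-clock hours
x node factor, with factor 1 for 'thin' and 2 for 'fat'. A minimal, purely
illustrative Python sketch (the helper name is ours, not part of any NSC
tooling):

    # Illustrative only: encodes the 1x/2x billing factors described above.
    def billed_gpu_hours(gpus, hours, node_type):
        """Return the GPU-hours billed to the project for one job."""
        factor = {"thin": 1, "fat": 2}[node_type]
        return gpus * hours * factor

    print(billed_gpu_hours(8, 10, "thin"))  # 80 GPU-hours
    print(billed_gpu_hours(8, 10, "fat"))   # 160 GPU-hours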

However, to ensure efficient use of resources and keep queues short, both
generations are combined into a *single queue*. Slurm will prioritize the
'thin' nodes, but will move any overflow from 'thin' nodes to 'fat' nodes. This
is the *default behaviour*, but users can still control where their jobs end up:

  - If the 'thin' feature is specified via `-C thin`, Slurm will ensure the job
    is scheduled on a 'thin' node, billed at 1x. Use this if 40 GB of VRAM is
    enough and you want to guarantee 1x billing.

  - If the 'fat' feature is specified via `-C fat`, Slurm will ensure the job is
    scheduled on a 'fat' node, billed at 2x. Use this when 80 GB of VRAM is
    desired.

  - If no feature flag is specified (neither 'thin' nor 'fat'), Slurm will
    prioritize scheduling the job on 'thin' nodes, but may schedule it on 'fat'
    nodes instead if the 'thin' nodes are filling up. The job will be billed at
    either 1x or 2x, depending on where it ends up. This is the default
    behaviour, as we expect it to increase utilization and reduce time spent in
    the queue. When scheduling jobs this way, it may be beneficial for some use
    cases to detect how much VRAM is available at runtime and adjust relevant
    parameters to use all of it, as sketched below.
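
As a purely illustrative sketch of the runtime detection mentioned in the last
point (assuming a PyTorch job and a single allocated GPU; the helper and the
60 GiB threshold are ours, not NSC-provided code):

    import torch

    def pick_batch_size(base_batch: int = 32) -> int:
        # Scale a baseline batch size (tuned for a 40 GB 'thin' node) with
        # the VRAM of the GPU the job actually landed on.
        total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if total_gib > 60:     # A100 80 GB ('fat' node)
            return base_batch * 2
        return base_batch      # A100 40 GB ('thin' node)

    print("Using batch size", pick_batch_size())

The same information is also available outside Python, e.g. via
`nvidia-smi --query-gpu=memory.total --format=csv`.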


Please reach out to us if you find any bugs in the implementation of this
policy, or if it affects your existing workflow negatively.



--- Issues During the Expansion ---

In general, the expansion went according to plan without any major issues. The
largest user-facing problem was the brief unplanned filesystem outage [0]. While
investigating the root cause of the outage, we found that some servers had less
power redundancy than we expected. This specific issue has been corrected.



[0] https://lists.nsc.liu.se/mailman/public/berzelius-users/2023-April/000034.html


Kind regards,
The Berzelius team at NSC

