[Berzelius-users] Berzelius downtime calendar week 4, 20th to 24th of January
Filip Polbratt
octol at nsc.liu.se
Mon Jan 27 17:03:32 CET 2025
Dear Berzelius Users,
Berzelius is back to running jobs.
Postmortem on the delay in returning the cluster to operational status:
Two of the new switches in the storage fabric had a later build of their
firmware (same Major.Minor release, but a later build). The InfiniBand
links were up and at the expected HDR speed, but the actual throughput
was far lower than expected: aggregate throughput to the storage
servers was roughly 60 to 140 kB/s.
All switches in the storage fabric have been flashed to the newer build
of the firmware and are now working as expected. Additionally, the
routing engine on the fabric has been changed to be more robust.
Best regards,
//Berzelius staff
On 1/24/25 18:02, Filip Polbratt wrote:
> Dear Berzelius Users,
>
> The connections between the storage servers and the client nodes (DGXs
> and login nodes) have been unreliable when stressed.
>
> Berzelius will be offline for maintenance until this is fixed.
>
> You can view system status for systems at NSC on this page:
> https://www.nsc.liu.se/systemstatus/
>
> Best regards,
> Berzelius staff
>
> On 1/16/25 16:58, Filip Polbratt wrote:
>> Dear Berzelius Users,
>>
>> This is a reminder that Berzelius will be down next week. Make sure
>> that any data that you need during that week has been copied
>> elsewhere.
>>
>> You can view system status for systems at NSC on this page:
>> https://www.nsc.liu.se/systemstatus/
>>
>> Best regards,
>> Berzelius staff
>>
>> On 1/3/25 15:49, Filip Polbratt wrote:
>>> Dear Berzelius Users,
>>>
>>> On Monday the 20th of January we will start a maintenance window for
>>> several tasks that require significant downtime of the cluster. The
>>> largest single task is the complete rewiring of the storage network
>>> in Berzelius to receive an additional 3PB of storage. We are
>>> scheduling this maintenance window to start on the 20th at 09:00 and
>>> to last to Friday the 24th, 18:00.
>>>
>>> Jobs will not be running during this maintenance window and the login
>>> nodes will not be available. If you need any data that you have on
>>> Berzelius during this period you must access it before the
>>> maintenance window starts.
>>>
>>> We might be able to return the login nodes and some compute nodes to
>>> service earlier. However, this should not be expected or depended upon.
>>>
>>> Best regards,
>>> Berzelius staff
>>> _______________________________________________
>>> Berzelius-users mailing list
>>> Berzelius-users at lists.nsc.liu.se
>>> https://lists.nsc.liu.se/mailman/listinfo/berzelius-users