[Neolith-users] Yesterday's service stop

Mattias Slabanja slabanja at nsc.liu.se
Fri May 9 17:16:49 CEST 2008


Dear Neolith users.

The service stop went according to plan, but took somewhat longer 
than initially anticipated. We were back in production yesterday 
evening, when the cluster was opened up for queued jobs.


Summary of changes to the system

* The firmware of the central Ethernet switch was upgraded to address 
a stability issue. This instability has on an earlier occasion affected 
the availability of the /nobackup file system. The firmware of all other 
switches in the cluster was also upgraded to the latest bug fix version.

* Switches were reconfigured to improve manageability.

* The compute node BMC firmware was upgraded to address an issue 
related to the CMOS battery.

* The compute node system ROM was upgraded, since the new BMC firmware 
depends on it.

* The system ROM upgrade has an unfortunate side effect on the memory 
bandwidth of the 32 GiB nodes. For the time being, the job scheduler is 
therefore configured to avoid mixing 32 GiB and 16 GiB nodes within a 
single job reservation. Hence, a job will now always run either on 
32 GiB nodes only or on 16 GiB nodes only.

Regards,
Neolith Admin

