[Vagnekman-users] Re: Information about Ekman service windows during the next two to three weeks

lama at pdc.kth.se lama at pdc.kth.se
Sun Aug 30 00:48:28 CEST 2009


Hello,

a late update as of 2009-08-29:

All major CPU replacements were completed yesterday, Friday.

Roughly a dozen of the nodes were not in a shape to accept new CPUs.

Out of 880 nodes previously available for jobs (without obvious
defects) 855 are being tested. The missing 35 have been set aside
as they did not behave satisfactorily after replacement.

From late Friday evening until ~15:00 today we ran tests on the
855 nodes, tests that are meant to generate heat and consume
power. No nodes failed during those tests.

After that we spawned jobs that were more memory intensive.

We have in the past seen quite a few large/wide jobs die due
to uncorrectable memory errors, and want to reduce that ratio.

We saw the first node out of the 855 go bad due to an
uncorrectable memory error after less than 3 hours.

We will extend the tests into at least tomorrow, Sunday,
to have more data than a single failure to draw
conclusions from.

The 254 nodes online prior to the bulk exchange all seem to
still be online. If not, please let us know.

regards,
lars/pdc-staff.
- - - 
lama at pdc.kth.se writes:

> Hello,
>
> an update as of 2009-08-24:
>
> We have finished the first day of the 'bulk-cpu-replacement.'
>
> Roughly a quarter of the CPUs were replaced today, and we
> have just allowed jobs to start again.
>
> We will not let any of the nodes with cpus replaced today
> go back to production. Those nodes have not been tested yet.
>
> There are 254 nodes on-line. These got new cpus as reported
> before, and you have been running on them for quite some time.
>
> As at least 3/4 of the work remains, and there is always
> the risk of making a mistake, jobs could get damaged.
>
> In case you experience an unexpected job loss, please let us know.
>
> Also, as the machine is much smaller right now but your jobs
> are not - you will experience a different job-flow. Small (thin)
> jobs will block large (wide) jobs, and vice versa, more often.
>
> regards,
> lars/pdc-staff.
> - - - 
>> an update as of 2009-08-21:
>>
>> We have gotten enough, if not all, of the replacement CPUs.
>> We will go ahead with replacement, Monday morning (2009-08-24.)
>>
>> regards,
>> lars/pdc-staff.
>> - - - 
>>> an update as of 2009-08-20:
>>>
>>> The trucks with CPUs will arrive tomorrow Friday, 2009-08-21.
>>>
>>> Replacement work can start by Monday morning, 2009-08-24.
>>>
>>> New scheduling block set to Monday morning (09:00.)
>>>
>>> $
>>> $ spstatus           # or spq
>>> [..]
>>> ---- System Actualities ----
>>>
>>> Note: All reserved between 2009-08-24/09:00:00 and 2009-08-25/09:00:00 (24h)
>>> $
>>> $
>>>
>>> When we consider the 'CPU replacement assembly lines' to work smoothly
>>> and safely enough, we will re-enable parts of the system where CPUs
>>> already have been replaced. Please consider the 24-hour duration
>>> indicated above an early approximation.
>>>
>>> regards,
>>> lars/pdc-staff.
>>> - - - 
>>>   [..]
>>>
>>>> there are unfortunately further delays in the delivery of CPUs.
>>>>
>>>> The start of the upgrade is postponed until Thursday morning.
>>>> This is reflected in the output of, e.g., spstatus or spq
>>>>
>>>> ekman$
>>>> ekman$ spstatus
>>>> [..]
>>>> ---- System Actualities ----
>>>>
>>>> Note: All reserved between 2009-08-17/09:00:00 and 2009-08-17/12:00:00 (3h)
>>>> Note: All reserved between 2009-08-20/09:00:00 and 2009-08-21/09:00:00 (24h)
>>>> ekman$
>>>>
>>>> No jobs are allowed to execute over the above reservations. The
>>>> above window(s) are not static. Once the upgrade work is flowing,
>>>> we will try to make a sub-set of the system available for jobs.
>>>>
>>>> We are sorry for the inconvenience.
>>>>
>>>> regards,
>>>> lars/pdc-staff.
>>>>
>>>>   [..]
>>>>
>>>>>> Due to delays in delivery of the CPUs the update schedule will be
>>>>>> revised as follows:
>>>>>> All replacements will be carried out between 17/8 and 21/8. We will
>>>>>> make an effort to keep as many nodes as we can available during the
>>>>>> upgrade but there may be as few as about 230 nodes available during
>>>>>> that period (these 230 nodes were upgraded the last week).
>>>>>> Also - Please note that from now until all nodes have been updated
>>>>>> Ekman will contain nodes with two different CPUs. We expect the
>>>>>> performance difference to be very small but you should still be aware
>>>>>> that there may be some. If you want to know the cpu version on the
>>>>>> node you are running on you can always do:
>>>>>> grep "model name" /proc/cpuinfo | sort --unique