[Vagnekman-users] Re: Information about Ekman service windows during the next two to three weeks

lama at pdc.kth.se lama at pdc.kth.se
Mon Aug 31 19:16:18 CEST 2009


Hello,

as of 2009-08-31:

We have just re-enabled 853 nodes in addition to the 253 already on-line,
giving a total of 1106 available, all with new CPUs.

Please report if you experience strange behaviour.

regards,
lars/pdc-staff.
- - - 
lama at pdc.kth.se writes:

> Hello,
>
> a late update as of 2009-08-29:
>
> all major CPU replacements were completed yesterday, Friday.
>
> Roughly a dozen of the nodes were not in a condition to accept new CPUs.
>
> Out of 880 nodes previously available for jobs (without obvious
> defects) 855 are being tested. The missing 35 have been set aside
> as they did not behave satisfactorily after replacement.
>
> From late Friday evening until ~15:00 today we ran tests on
> the 855 nodes, tests meant to generate heat and consume
> power. No nodes failed those.
>
> After that we spawned jobs that were more memory intensive.
>
> We have in the past seen quite a few large/wide jobs die due
> to uncorrectable memory errors, and want to reduce that ratio.
>
> We saw the first node out of the 855 go bad due to an
> uncorrectable memory error after less than 3 hours.
>
> We will extend the tests into at least tomorrow, Sunday,
> to have more data than a single failure to draw
> conclusions from.
>
> The 254 nodes online prior to the bulk exchange all still
> seem to be online. If not, please let us know.
>
> regards,
> lars/pdc-staff.
> - - - 
> lama at pdc.kth.se writes:
>
>> Hello,
>>
>> an update as of 2009-08-24:
>>
>> We have finished the first day of the 'bulk-cpu-replacement.'
>>
>> Roughly a quarter of the CPUs were replaced today, and we
>> have just allowed jobs to start again.
>>
>> We will not let any of the nodes with cpus replaced today
>> go back to production. Those nodes have not been tested yet.
>>
>> There are 254 nodes on-line. These got new cpus as reported
>> before, and you have been running on them for quite some time.
>>
>> As at least 3/4 of the work remains, and there is always
>> the risk of making a mistake, jobs could get damaged.
>>
>> In case you experience an unexpected job loss, please let us know.
>>
>> Also, as the machine is much smaller right now but your jobs
>> are not - you will experience a different job-flow. Small (thin)
>> jobs will block large (wide) jobs, and vice versa, more often.
>>
>> regards,
>> lars/pdc-staff.
>> - - - 
>>> an update as of 2009-08-21:
>>>
>>> We have gotten enough, if not all, of the replacement CPUs.
>>> We will go ahead with the replacement on Monday morning (2009-08-24).
>>>
>>> regards,
>>> lars/pdc-staff.
>>> - - - 
>>>> an update as of 2009-08-20:
>>>>
>>>> The trucks with CPUs will arrive tomorrow Friday, 2009-08-21.
>>>>
>>>> Replacement work can start by Monday morning, 2009-08-24.
>>>>
>>>> New scheduling block set to Monday morning (09:00.)
>>>>
>>>> $
>>>> $ spstatus           # or spq
>>>> [..]
>>>> ---- System Actualities ----
>>>>
>>>> Note: All reserved between 2009-08-24/09:00:00 and 2009-08-25/09:00:00 (24h)
>>>> $
>>>> $
>>>>
>>>> When we consider the 'CPU replacement assembly lines' to work smoothly
>>>> and safely enough, we will re-enable parts of the system where CPUs
>>>> have already been replaced. Please consider the 24-hour duration
>>>> indicated above an early approximation.
>>>>
>>>> regards,
>>>> lars/pdc-staff.
>>>> - - - 
>>>>   [..]
>>>>
>>>>> there are unfortunately further delays in the delivery of CPUs.
>>>>>
>>>>> The start of the upgrade is postponed until Thursday morning.
>>>>> This is reflected in the output of, e.g., spstatus or spq:
>>>>>
>>>>> ekman$
>>>>> ekman$ spstatus
>>>>> [..]
>>>>> ---- System Actualities ----
>>>>>
>>>>> Note: All reserved between 2009-08-17/09:00:00 and 2009-08-17/12:00:00 (3h)
>>>>> Note: All reserved between 2009-08-20/09:00:00 and 2009-08-21/09:00:00 (24h)
>>>>> ekman$
>>>>>
>>>>> No jobs are allowed to execute over the above reservations. The
>>>>> above window(s) are not static. Once the upgrade work is flowing,
>>>>> we will try to make a sub-set of the system available for jobs.
>>>>>
>>>>> We are sorry for the inconvenience.
>>>>>
>>>>> regards,
>>>>> lars/pdc-staff.
>>>>>
>>>>>   [..]
>>>>>
>>>>>>> Due to delays in delivery of the CPUs the update schedule will be
>>>>>>> revised as follows:
>>>>>>> All replacements will be carried out between 17/8 and 21/8. We will
>>>>>>> make an effort to keep as many nodes as we can available during the
>>>>>>> upgrade, but there may be as few as about 230 nodes available during
>>>>>>> that period (these 230 nodes were upgraded last week).
>>>>>>> Also, please note that from now until all nodes have been updated
>>>>>>> Ekman will contain nodes with two different CPUs. We expect the
>>>>>>> performance difference to be very small, but you should still be aware
>>>>>>> that there may be some. If you want to know the CPU version on the
>>>>>>> node you are running on, you can always do:
>>>>>>> grep "model name" /proc/cpuinfo | sort --unique
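
A small variation on the command above, as a minimal sketch: it counts how
many logical CPUs report each model name, which may make a mixed result
easier to read. The function name cpu_models and its optional file argument
are hypothetical, introduced only for illustration. It assumes a Linux-style
cpuinfo file with "model name" lines, as the command above does.

```shell
# Hypothetical helper, not part of any PDC tooling.
# $1: path to a cpuinfo-style file (defaults to /proc/cpuinfo).
cpu_models() {
    grep "model name" "${1:-/proc/cpuinfo}" |  # keep only the model lines
        cut -d: -f2- |                         # drop the "model name :" label
        sed 's/^ *//' |                        # trim leading spaces
        sort | uniq -c                         # count identical model strings
}
```

Running cpu_models with no argument on a node reads /proc/cpuinfo there. On
a node with a single CPU type this should print one line.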


