[Bi-users] FYI - Done - Emergency service on Accumulus filesystems

Fredrik Nyström freny at nsc.liu.se
Mon Oct 25 16:41:54 CEST 2021


Hi Klaus,

two of your jobs were affected:

Tetralith rossby26: jobid=16767358 username=sm_wyser
2021-10-22T02:12:13.463748+02:00 mds16 kernel: LustreError: 4055:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.26.15.43 at tcp  ns: mdt-rossby26-MDT0000_UUID lock: ffff93d62a260900/0x305bcb132a5aaa04 lrc: 3/0,0 mode: PR/PR res: [0x2000075e9:0x330:0x0].0x0 bits 0x13/0x0 rrc: 7 type: IBT flags: 0x60200400000020 nid: 10.26.15.43 at tcp remote: 0x3bc76ec2cc35a4a8 expref: 818 pid: 28484 timeout: 45495 lvb_type: 0
2021-10-22 02:12:17 jobstate=FAILED jobid=16767358 username=sm_wyser account=snic2021-1-15 start=1634850482 end=1634861537 submit=1634850467 nodes=n[256,314,340,345,456,458,524,547,569,628,662,720,747,838-839,917,921-922,928,977-978,981,991,998,1041,1500,1502,1512,1540,1543] procs=960 batch=yes jobname=x623 partition=tetralith limit=1-20:00:00 work_dir=/accumulus/rossby21/rossby/joint_exp/crescendo/sm_wyser/ecearth3-r6679/runtime/c

Tetralith rossby20: jobid=16830911 username=sm_wyser
2021-10-24T10:53:56.556898+02:00 mds9 kernel: LustreError: 6293:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.26.2.33 at tcp  ns: mdt-rossby20-MDT0000_UUID lock: ffff9b20c51446c0/0xd50133f8e8b0d286 lrc: 3/0,0 mode: PR/PR res: [0x2000514e8:0xd:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.26.2.33 at tcp remote: 0x8848df6767ef5402 expref: 22 pid: 9607 timeout: 246983 lvb_type: 0
2021-10-24 10:54:00 jobstate=FAILED jobid=16830911 username=sm_wyser account=snic2021-1-15 start=1635065462 end=1635065636 submit=1635065451 nodes=n233 procs=16 batch=yes jobname=hc_x625_4 partition=tetralith limit=01:30:00 work_dir=/home/sm_wyser/ece3-postproc
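
To line up a failed job with an eviction message, the epoch fields in the job record can be converted to wall-clock time. The following is only a minimal sketch: it assumes the record is a plain space-separated list of key=value pairs as in the lines above, and the helper name parse_job_record is made up for this example (it is not an NSC tool).

```python
import re
from datetime import datetime, timezone


def parse_job_record(line):
    """Split a 'key=value key=value ...' job record into a dict of strings.

    Hypothetical helper: assumes values contain no whitespace, which holds
    for the records quoted in this mail.
    """
    return dict(re.findall(r"(\w+)=(\S+)", line))


# Shortened copy of the second job record above (jobid 16830911).
record = parse_job_record(
    "2021-10-24 10:54:00 jobstate=FAILED jobid=16830911 username=sm_wyser "
    "start=1635065462 end=1635065636 submit=1635065451 nodes=n233 procs=16"
)

# 'start' and 'end' are seconds since the Unix epoch.
runtime_s = int(record["end"]) - int(record["start"])
end_utc = datetime.fromtimestamp(int(record["end"]), tz=timezone.utc)

print(record["jobid"], record["jobstate"])  # 16830911 FAILED
print(runtime_s)                            # 174
print(end_utc.isoformat())                  # 2021-10-24T08:53:56+00:00
```

08:53:56 UTC is 10:53:56 CEST, the same second as the eviction logged by mds9 for that job.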


Best regards / Fredrik Nyström, NSC


On 2021-10-25 15:29, Fredrik Nyström wrote:
> Dear Accumulus storage Users,
> 
> mds[8-11,15] have now also been downgraded from Lustre 2.12.7 to 2.12.6.
> 
> Downtime was between 15:03:56 and 15:12:28 CEST.
> 
> 
> Kind Regards / Fredrik Nyström, NSC
> 
> On 2021-10-25 13:29, Fredrik Nyström wrote:
>> Dear Accumulus storage Users,
>>
>> since the software updates we applied on Thursday last week, we have 
>> seen cases of clients being wrongfully evicted by Lustre metadata servers.
>>
>> On Friday at ~16:00 CEST we downgraded mds14 and mds16, and we have not 
>> seen any problems since then for the following filesystems:
>>
>>   /nobackup/rossby24
>>   /nobackup/smhid17
>>   /nobackup/rcdl
>>   /nobackup/bolinc1
>>   /nobackup/rossby26
>>   /nobackup/smhid19
>>
>>
>> We will soon downgrade mds[8-11,15] as well. During that time, access to 
>> all filesystems except those listed above will hang (but not fail, if 
>> all goes according to plan) for 5-10 minutes.
>>
>>
>> Downgrading is a temporary fix; we are still working on a permanent 
>> solution.
>>
>>
>> Kind Regards,
>>
> 

-- 
Fredrik Nyström, National Supercomputer Centre
freny at nsc.liu.se

