[Snic-users] Triolith and Gamma failure: some running jobs lost

Wed May 16 10:44:28 CEST 2018

Dear Triolith and Gamma users,

Due to a storage problem(*), some jobs that were running on Triolith and
Gamma between 09:39 and 10:36 CEST today failed.

Job starts are currently blocked on both clusters while we investigate. We
hope to be able to resume starting new jobs within a few hours.

On most of the affected compute nodes, the running job failed. If your jobs
ended unexpectedly during this period, check the output of your job, and if
it's incomplete, resubmit the job.

On some of the affected compute nodes, the jobs kept running, probably
because the application was not doing any disk I/O when the storage problem
happened. These jobs are probably OK, but could also be hanging and not
making any progress. If your job is in one of the lists below, check whether
it is making progress, and if not, cancel it and resubmit.
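
One rough way to judge whether a job is still making progress is to look at
how long ago its output file was last written to. The sketch below is only
an illustration, not an NSC tool: the function names and the 30-minute
threshold are assumptions, and it only helps if your application writes to
its output file regularly (buffered output can make a healthy job look idle).

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_stalled(path, max_idle_seconds=1800, now=None):
    """Return True if `path` has been idle longer than `max_idle_seconds`.

    The 1800-second default is an arbitrary assumption; choose a value
    that matches how often your application normally writes output.
    """
    return seconds_since_modified(path, now) > max_idle_seconds
```

For example, if your job writes to Slurm's default output file name,
looks_stalled("slurm-18762689.out") returns True when that file has not
changed in the last 30 minutes. Treat the result as a hint only, and look at
the job's actual output before cancelling it.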

Jobs on Triolith (that ran on affected compute nodes but did not fail):

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           18762689  triolith CaBF_mol  x_rafar  R    2:33:45      2 n[1165,1169]
           18757068  triolith      FDA  x_vache  R 1-00:25:15     12 n[1394,1402-1412]
           18751626  triolith  Co_Surf  x_jaksp  R 1-23:44:17     11 n[1324-1325,1327-1329,1333,1336-1340]
           18762479  triolith     Lund  x_vache  R    4:18:50      3 n[265-267]
           18678353  triolith   tetCFe  x_davga  R 6-17:36:46     16 n[46,1105-1108,1113-1118,1120-1122,1124-1125]
           18762040  triolith viscosit  x_jiefu  R    7:22:11      4 n[1159-1160,1164,1175]
           18762031  triolith viscosit  x_jiefu  R    7:24:11      4 n[1513-1516]
           18759684  triolith    Dimer  x_emied  R   18:23:35      4 n[277-279,281]
           18759954  triolith interact  x_fahkh  R   18:18:34      1 n1601
           18762122  triolith  Ising3D  x_wenwa  R    7:03:07      1 n43
           18761670  triolith      El1  x_saymo  R   11:16:53      8 n[1374-1375,1378-1379,1382-1385]
           18757502  triolith CO2nuLow  x_maaku  R   23:04:40      4 n[145,147,149-150]
           18751613  triolith   Anthr+  x_vivsh  R 1-23:47:54      2 n[1257-1258]
           18757794  triolith md1000re  x_jowar  R   21:57:55      8 n[157-159,162,164-165,167-168]

Jobs on Gamma (that ran on affected compute nodes but did not fail):

              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            1279640     gamma    alpha  x_qinfe  R 1-18:33:22      4 n[742,791,901-902]
            1279889     gamma     w2fe  x_marda  R    7:56:10      2 n[825-826]
            1279861     gamma    zr2ni  x_marda  R    8:18:03      2 n[891,898]
            1279866     gamma    hf2ni  x_marda  R    8:18:03      2 n[936,960]
            1280318     gamma  Kr_1088  x_yuali  R      55:55      2 n[866-867]
            1279616     gamma dimer_di  x_liaca  R 1-17:39:51      8 n[819-820,854-855,858,952,957-958]

(*) Behind the scenes, we are making changes this week to prepare the storage
system for connecting our new Tetralith and Sigma clusters later this summer.
Something went wrong while the storage system vendor and NSC were working on
this upgrade. We are still investigating what happened.

NSC apologizes for the inconvenience. If you have any questions regarding
this failure, please contact support at nsc.liu.se.

-- 
Mats Kronberg, NSC Support <support at nsc.liu.se>