[Snic-users] Triolith and Gamma failure: some running jobs lost
Mats Kronberg
kronberg at nsc.liu.se
Wed May 16 10:44:28 CEST 2018
Dear Triolith and Gamma users,
Due to a storage problem(*), some jobs that were running on Triolith and
Gamma between 09:39 and 10:36 CEST today failed.
Job starts are currently blocked on both clusters while we investigate. We
hope to be able to resume starting new jobs within a few hours.
On most of the affected compute nodes, the running job failed. If your jobs
ended unexpectedly during this period, check the output of your job, and if
it's incomplete, resubmit the job.
On some of the affected compute nodes, the jobs kept running, probably
because the application was not doing any disk I/O when the storage problem
happened. These jobs are probably OK, but could also be hanging and not
making any progress. If your job is in the list below, you should check if
it's making progress, and if not, cancel it and resubmit.
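One rough way to check for progress, assuming your application writes to an output or log file at regular intervals, is to see how long ago that file was last modified. The sketch below is only an illustration: the function names and the 30-minute threshold are hypothetical choices, not an NSC tool, and a job that legitimately computes for a long time between writes will look stalled to it.

```python
import os
import sys
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_stalled(path, max_idle_seconds=1800, now=None):
    """Return True if the file has not been modified within the threshold.

    The 1800-second default is an arbitrary assumption; choose a value
    that matches how often your application normally writes output.
    """
    return seconds_since_modified(path, now) > max_idle_seconds


if __name__ == "__main__":
    # Usage: python check_progress.py <output-file> [<output-file> ...]
    for output_file in sys.argv[1:]:
        if looks_stalled(output_file):
            status = "possibly stalled"
        else:
            status = "recently updated"
        print(f"{output_file}: {status}")
```

Run it from the job's working directory on the file your application writes to. A "possibly stalled" result is only a hint; inspect the output itself before cancelling the job.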
Jobs on Triolith (that ran on affected compute nodes but did not fail):
JOBID     PARTITION  NAME      USER     ST  TIME        NODES  NODELIST(REASON)
18762689  triolith   CaBF_mol  x_rafar  R   2:33:45     2      n[1165,1169]
18757068  triolith   FDA       x_vache  R   1-00:25:15  12     n[1394,1402-1412]
18751626  triolith   Co_Surf   x_jaksp  R   1-23:44:17  11     n[1324-1325,1327-1329,1333,1336-1340]
18762479  triolith   Lund      x_vache  R   4:18:50     3      n[265-267]
18678353  triolith   tetCFe    x_davga  R   6-17:36:46  16     n[46,1105-1108,1113-1118,1120-1122,1124-1125]
18762040  triolith   viscosit  x_jiefu  R   7:22:11     4      n[1159-1160,1164,1175]
18762031  triolith   viscosit  x_jiefu  R   7:24:11     4      n[1513-1516]
18759684  triolith   Dimer     x_emied  R   18:23:35    4      n[277-279,281]
18759954  triolith   interact  x_fahkh  R   18:18:34    1      n1601
18762122  triolith   Ising3D   x_wenwa  R   7:03:07     1      n43
18761670  triolith   El1       x_saymo  R   11:16:53    8      n[1374-1375,1378-1379,1382-1385]
18757502  triolith   CO2nuLow  x_maaku  R   23:04:40    4      n[145,147,149-150]
18751613  triolith   Anthr+    x_vivsh  R   1-23:47:54  2      n[1257-1258]
18757794  triolith   md1000re  x_jowar  R   21:57:55    8      n[157-159,162,164-165,167-168]
Jobs on Gamma (that ran on affected compute nodes but did not fail):
JOBID     PARTITION  NAME      USER     ST  TIME        NODES  NODELIST(REASON)
1279640   gamma      alpha     x_qinfe  R   1-18:33:22  4      n[742,791,901-902]
1279889   gamma      w2fe      x_marda  R   7:56:10     2      n[825-826]
1279861   gamma      zr2ni     x_marda  R   8:18:03     2      n[891,898]
1279866   gamma      hf2ni     x_marda  R   8:18:03     2      n[936,960]
1280318   gamma      Kr_1088   x_yuali  R   55:55       2      n[866-867]
1279616   gamma      dimer_di  x_liaca  R   1-17:39:51  8      n[819-820,854-855,858,952,957-958]
(*) Behind the scenes, we are making changes this week to prepare the storage
system for connecting our new Tetralith and Sigma clusters later this summer.
Something went wrong while the storage system vendor and NSC were working on
this upgrade. We're still investigating what happened.
NSC apologizes for the inconvenience. If you have any questions regarding
this failure, please contact support at nsc.liu.se.
--
Mats Kronberg, NSC Support <support at nsc.liu.se>