[Vagn-users] File system problem, login node restart 2014-10-06

Pär Lindfors paran at nsc.liu.se
Fri Oct 3 21:22:42 CEST 2014


Dear Vagn users,

During the last few weeks some users have reported various strange
problems on SNIC systems at NSC. Software that has been working fine in
the past would suddenly crash or fail to work correctly in some other
way.

Examples of reported issues:

 * Could not compile software using CMake (for example DALTON)
 * Problems running RStudio
 * Problems running Intel VTune

We have narrowed this down to a problem in the GPFS file-system
software. The technical explanation is that under some conditions the
system call writev() will incorrectly fail with the error code EINVAL
(invalid argument). The problem has been assigned IBM APAR numbers
IV64862 and IV64863. We received fixed software packages on Wednesday,
and all SNIC clusters at NSC are being upgraded.
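
For anyone who wants to see what such a failure looks like from a
program's point of view, here is a minimal sketch, assuming Python 3.3
or later is available (os.writev() does not exist in older versions).
The function name check_writev and the test buffers are made up for
illustration. The exact conditions that trigger the bug are not
described in this message, so a successful run does not prove that a
node is unaffected; the sketch only shows how an EINVAL from writev()
would surface.

```python
import errno
import os
import tempfile


def check_writev(directory):
    """Write three small buffers with a single writev() call.

    A temporary file is created in `directory` and removed afterwards.
    Returns (ok, detail); ok is False if writev() raised OSError or
    wrote fewer bytes than expected.
    Note: this is an illustrative check, not a known reproducer for
    the GPFS problem.
    """
    buffers = [b"hello ", b"from ", b"writev\n"]
    expected = sum(len(b) for b in buffers)
    fd, path = tempfile.mkstemp(dir=directory, prefix="writev-check-")
    try:
        try:
            written = os.writev(fd, buffers)
        except OSError as exc:
            # Translate the numeric errno (e.g. EINVAL) to its name.
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return False, "writev() failed with " + name
        return written == expected, "wrote %d of %d bytes" % (written, expected)
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling check_writev() with a directory on the file system of interest
(for example your own directory under /home) returns a pair such as
(True, "wrote 18 of 18 bytes") when the call succeeds.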

On Vagn, GPFS is used for /home, /nobackup/vagn1 and
/nobackup/vagn2. None of the reported problems have been on Vagn, but
the version in use is affected and I could easily reproduce the problem
there. This version was installed during the downtime on 2014-09-08.

I have installed the fixed version on five nodes today, and at the same
time also upgraded them to CentOS 5.11. The remaining two nodes (a2 and
a6) are currently running jobs and will be upgraded once they finish. All
new jobs will start on already upgraded nodes.

The login node (analys1.vagn.nsc.liu.se) will be upgraded at 16:00 on
Monday 2014-10-06. The upgrade requires a restart, so if you are logged
in at that time you will be logged out.

Regards,
Pär Lindfors, NSC


