[Webb-users] /hsm mount on Bi enabled again

Andreas Johansson andjo at nsc.liu.se
Thu Apr 16 10:58:03 CEST 2020


On 2020-04-06 at 10:59, Andreas Johansson wrote:
> If you are using the /hsm mount on Bi:
> 
> There have been reports of errors when reading data via the NFS mount of
> the HSM storage. NSC has disabled the /hsm mount while checking data and
> trying to find the cause of the problems. These checks might take some
> time due to the amount of data being checksummed, so we expect the
> mount to stay disabled for a period on the order of days.

This access has now been enabled again.

Details below:

After some investigation it was discovered that the root cause of the
problems seen was neither on the HSM side nor on the Bi side, but in the network
between them. An intermittently faulty fibre cable in the core network
was identified and replaced last week. Since there are multiple paths
through the core network for redundancy this only affected parts of the
network traffic. In the particular HSM use case we believe it only hit
some reads due to the network layout. All writes should have gone
through a different path.

The network monitoring has been extended to catch these kinds of errors.
Several other things were already being monitored, but not this particular one.
A large amount of traffic was sent to test the replaced cable before that
network link was enabled again. More than 75 million packets had to pass
without any errors before the link was declared healthy again.
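
As a rough illustration of what an error-free test run of that size can
and cannot show, the sketch below computes an upper bound on the
per-packet error probability after n clean packets. It assumes that
packet errors are independent and equally likely, which may not hold for
an intermittent fault. The function name and the 95% confidence level
are choices made for this example, not part of the test that NSC ran.

```python
import math

def error_rate_upper_bound(clean_packets: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-packet error probability p after observing
    `clean_packets` packets with zero errors.

    Solves (1 - p) ** n = 1 - confidence for p. Assumes independent,
    identically distributed packet errors (an assumption of this sketch).
    """
    if clean_packets <= 0:
        raise ValueError("clean_packets must be positive")
    # expm1/log1p keep precision when the result is very close to zero.
    return -math.expm1(math.log1p(-confidence) / clean_packets)

bound = error_rate_upper_bound(75_000_000)
print(f"95% upper bound on per-packet error rate: {bound:.2e}")
```

For large n this is close to the "rule of three" approximation 3/n.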

SFTP transfers were not affected, thanks to the stronger integrity checks used
in that protocol. NFS relies on the two checksums at the Ethernet and TCP levels,
but SSH adds a third check at the application level. (The errors seen
for NFS transfers had thus managed to slip through two levels of checksums.)
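
To illustrate how corruption can pass a weak checksum, here is a minimal
sketch of a 16-bit one's-complement sum in the style used by TCP (RFC
1071). Because addition is commutative, swapping two 16-bit words leaves
the sum unchanged, while a cryptographic hash detects the change. This is
only an example of the general weakness: the sample data is made up, the
actual corruption pattern on the faulty cable is not known from this
post, and SHA-256 merely stands in for the kind of integrity check SSH
applies.

```python
import hashlib

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Hypothetical payload, made up for this example.
original = b"HSM data block 0001"
# Simulated corruption: the first two 16-bit words change places.
corrupted = original[2:4] + original[0:2] + original[4:]

same_weak = internet_checksum(original) == internet_checksum(corrupted)
same_strong = hashlib.sha256(original).digest() == hashlib.sha256(corrupted).digest()

print("payload changed:        ", original != corrupted)
print("16-bit checksum matches:", same_weak)
print("SHA-256 matches:        ", same_strong)
```

Users who want an end-to-end check of their own data can compare a
strong hash of a file before and after transfer in the same way.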

Regards,

-- 
Andreas Johansson                  andjo at nsc.liu.se
System Expert                      +46-(0)13-285778
National Supercomputer Centre, Linköping University

