[Vagnekman-users] Ekman service window notice and important cluster file system notes

Daniel Ahlin dah at pdc.kth.se
Wed Jul 14 16:36:55 CEST 2010


Hi

As you may have noticed, a service window for the entire Ekman cluster
has been scheduled for 2010-07-21 08:00 to 20:00 CEST. We apologise
for the short notice, but we deemed the planned updates important
enough to carry out as soon as possible.

Planned changes are:

* Addition of two file-servers to the cluster file-system - this
  will increase available space by 50% and should also improve
  file transfer bandwidth. Regarding the cluster file-system -
  please also read the note at the end of this email.

* Upgrade system software to CentOS 5.5 - no user noticeable changes
  are anticipated.

* Upgrade of the Ethernet infrastructure of Ekman - this will increase
  the available bandwidth for cluster-external traffic (e.g. file
  transfers in and out of the cluster).

In preparation for the service window we ask all users to clean out
unused/old files from the shared scratch area.

About the cluster file system

Several users have experienced slowness of response using the cluster
file-system. The responsiveness of a parallel file-system depends on
many things - major factors being:

a) Hardware resources dedicated to the service
b) Current state of the backing storage (e.g. a RAID set may be
   rebuilding due to a hard-disk failure)
c) Choice of file-system
d) Tuning of the file-system
e) Current usage

Now - (a) and (c) are settled factors and (b) is outside our control
(partly an effect of (a)). Regarding (d), the tuning on Ekman aims to
achieve maximum aggregate bandwidth out of the file-servers. As the
file-server to compute node ratio is 4:1268 (1:317), this can
translate to a fairly low bandwidth per node in the rare case when
all clients are reading from the file-system at once. The tuning can
be changed in various ways - but it is our belief that the current
setting is a good balance between interactiveness and batch-job
performance for the usage on Ekman.
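
To make the ratio above concrete, here is a minimal sketch of the
arithmetic in Python. Only the 4 file-servers and 1268 compute nodes
come from this notice; the per-server bandwidth passed in is a purely
hypothetical figure, and the even split between clients is a
simplifying assumption rather than a description of how the
file-system actually schedules traffic.

```python
FILE_SERVERS = 4        # from this notice
COMPUTE_NODES = 1268    # from this notice


def nodes_per_server(nodes=COMPUTE_NODES, servers=FILE_SERVERS):
    """Number of compute nodes sharing each file-server."""
    return nodes / servers


def per_node_bandwidth(per_server_mb_s, active_nodes=COMPUTE_NODES,
                       servers=FILE_SERVERS):
    """Even share of the aggregate bandwidth per active client, in MB/s.

    per_server_mb_s is a hypothetical input, not a measured Ekman value.
    """
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    return per_server_mb_s * servers / active_nodes


if __name__ == "__main__":
    print(nodes_per_server())          # nodes per file-server
    print(per_node_bandwidth(500.0))   # 500 MB/s per server is made up
```

With every node active, each client's share is the per-server figure
divided by 317, whatever that figure is.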

(e) is a highly variable factor and unfortunately also one whose
effects are very hard to contain. There are seemingly valid usage
patterns which will bring any parallel file-system to a crawl. This of
course makes such usage patterns undesirable on shared systems even
though they may make sense on dedicated systems (such as a
workstation). When we discover such usage patterns we contact the user
and ask for a change in how files are accessed. We also ask you to
think proactively about how your code accesses the file-system.
Typical expensive operations are:

* Creation, deletion and moving of files.
* Polling meta-data status (such as repeatedly running du, ls -lR and
  so on).
* Short reads and writes and partial reads from files are very
  expensive when compared to long reads and writes and the reading of
  entire files.
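
As an illustration of the last point, here is a minimal Python sketch
of collecting many small records in memory and writing them with one
call, and of reading a file back in a single read. The function names
are hypothetical and the sketch assumes the data fits in memory; it is
one possible way to reduce the number of file-system operations, not a
prescribed method.

```python
import os
import tempfile


def write_records_batched(path, records):
    """Join many small byte records and write them with a single call.

    This creates one file and issues one long write instead of many
    short ones. Returns the number of bytes written.
    """
    data = b"".join(records)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


def read_whole_file(path):
    """Read the entire file in one call rather than in short pieces."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    # A temporary directory stands in for the scratch area here.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "results.dat")
        written = write_records_batched(target, [b"step 1\n", b"step 2\n"])
        print(written, len(read_whole_file(target)))
```

The same idea applies to the first point: writing results into one
larger file avoids creating and deleting many small ones.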

Finally - we are also announcing that automatic cleaning of files in
the scratch-area of the cluster file-system will start on
1 September 2010.

Regards,
Daniel Ahlin
PDC