[Vagnekman-users] Re: Ekman service window notice and important cluster file system notes

Thu Jul 22 00:20:14 CEST 2010

Hi again

Unfortunately the service window will be prolonged to at least Thursday due
to problems with the file-server upgrades. We will keep you posted.

Regards,
Daniel Ahlin
PDC

On Wed, Jul 14, 2010 at 3:36 PM, Daniel Ahlin <dah at pdc.kth.se> wrote:

> Hi
>
> As you may have noticed, a service window for the entire Ekman cluster
> has been scheduled for 2010-07-21 08:00 to 20:00 CEST. We apologise
> for the short notice, but we deemed the planned updates important
> enough to carry out as soon as possible.
>
> Planned changes are:
>
> * Addition of two file-servers to the cluster file-system - this
>   will increase available space by 50% and should also improve
>   file transfer bandwidth. Regarding the cluster file-system -
>   please also read the note at the end of this email.
>
> * Upgrade of system software to CentOS 5.5 - no user-noticeable
>   changes are anticipated.
>
> * Upgrade of the Ethernet infrastructure of Ekman - this will increase
>   the available bandwidth for cluster-external traffic (e.g. file
>   transfers in and out of the cluster).
>
> In preparation for the service window we ask all users to clean out
> unused/old files from the shared scratch area.
>
> About the cluster file system
>
> Several users have experienced slow response when using the cluster
> file-system. The responsiveness of a parallel file-system depends on
> many things - major factors being:
>
> a) Hardware resources dedicated to the service
> b) Current state of the backing storage (e.g. a RAID set may be
>    rebuilding due to a hard-disk failure)
> c) Choice of file-system
> d) Tuning of the file-system
> e) Current usage
>
> Now - (a) and (c) are settled factors and (b) is outside our control
> (partly an effect of (a)). Regarding (d), the tuning on Ekman aims
> for maximum aggregate bandwidth out of the file-servers. As the
> file-server to compute node ratio is 4:1268 (1:317), this can
> translate to a fairly low bandwidth per node in the rare case when
> all clients are reading from the file-system. The tuning can be
> changed in various ways - but it is our belief that the current
> setting is a good balance between interactiveness and batch-job
> performance for the usage on Ekman.
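
To make the ratio above concrete, here is a minimal sketch of the per-node arithmetic. The server and node counts (4 and 1268) come from the notice; the aggregate bandwidth figure is purely hypothetical, since no measured value is given, and an even split between clients is a simplifying assumption.

```python
from fractions import Fraction

FILE_SERVERS = 4       # from the notice
COMPUTE_NODES = 1268   # from the notice


def per_node_bandwidth(aggregate_mb_s: float, active_clients: int) -> float:
    """Even share of the aggregate bandwidth among clients doing I/O at once."""
    if active_clients < 1:
        raise ValueError("active_clients must be at least 1")
    return aggregate_mb_s / active_clients


# Fraction reduces 4/1268 automatically, giving the 1:317 quoted in the notice.
ratio = Fraction(FILE_SERVERS, COMPUTE_NODES)

# HYPOTHETICAL aggregate figure, chosen only to make the arithmetic concrete.
ASSUMED_AGGREGATE_MB_S = 4000.0
worst_case = per_node_bandwidth(ASSUMED_AGGREGATE_MB_S, COMPUTE_NODES)

if __name__ == "__main__":
    print(f"servers:nodes = {ratio.numerator}:{ratio.denominator}")
    print(f"worst case per node: {worst_case:.2f} MB/s")
```

Whatever the real aggregate figure is, the share per node shrinks in proportion to the number of clients reading at the same time, which is why the worst case only appears when the whole cluster hits the file-system at once.
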
>
> (e) is a highly variable factor and unfortunately also one whose
> effects are very hard to contain. There are seemingly valid usage
> patterns which will bring any parallel file-system to a crawl. This of
> course makes such usage patterns undesirable on shared systems even
> though they may make sense on dedicated systems (such as a
> workstation). When we discover such usage patterns we contact the user
> and ask for a change in how files are accessed. We also ask you to
> think pro-actively about how your code accesses the file-system.
> Typical expensive operations are:
>
> * Creation, deletion and moving of files.
> * Polling meta-data status (such as repeatedly running du, ls -lR and
>   so on)
> * Short reads and writes and partial reads from files are very
>   expensive when compared to long reads and writes and the reading of
>   entire files.
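
The three points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a prescribed method: the function names and the file name results.txt are hypothetical, and how much each pattern helps depends on the file-system and the job. The idea is to collect output into one file written with a single large write, read files whole and split them in memory, and query meta-data once and reuse the answer rather than polling.

```python
import os
import tempfile


def write_results_batched(path, records):
    """Write all records with one large write instead of one small write each."""
    data = "".join(f"{r}\n" for r in records)
    with open(path, "w") as fh:
        fh.write(data)


def read_results_whole(path):
    """Read the entire file in one call and split it in memory."""
    with open(path) as fh:
        return fh.read().splitlines()


def total_size_once(directory):
    """Scan the directory a single time; reuse the result rather than polling."""
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


if __name__ == "__main__":
    # A temporary directory stands in for a scratch area in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "results.txt")  # hypothetical file name
        write_results_batched(out, range(1000))
        lines = read_results_whole(out)
        print(len(lines), "records,", total_size_once(tmp), "bytes")
```

Keeping all records in one file also avoids creating and later deleting a thousand small files, which is the first expensive operation on the list.
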
>
> Finally - we are also announcing that automatic cleaning of files in
> the scratch-area of the cluster file-system will start on
> 1 September 2010.
>
> Regards,
> Daniel Ahlin
> PDC