[Vagnekman-users] job starts inhibited

Mattias Claesson slas at pdc.kth.se
Mon Nov 30 18:01:16 CET 2009


> as /cfs/ (aka lustre filesystem, aka /cfs/ekman/scratch/....)
> is full and has caused jobs to experience 'no space left on
> device' and crash over the weekend, node allocation is paused
> until the situation has cleared somewhat.

There is now some more free space in the filesystem, but because the free space
is unevenly distributed over the servers, some extra headroom would not hurt.
Therefore, all users should check whether they have data there that can be removed.
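
As a rough aid for that check, here is a minimal sketch (not an official PDC
tool) that sums file sizes per top-level entry in a directory, so the largest
candidates for cleanup show up first. The function names are made up for this
example, and it reports apparent file sizes (st_size), which may differ from
the space actually allocated on the servers.

```python
import os


def usage_by_subdir(root):
    """Return {entry_name: total_bytes} for each top-level entry under root."""
    totals = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            size = 0
            # Walk the subtree and add up the apparent size of every file.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        size += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        # File vanished or is unreadable; skip it.
                        pass
            totals[entry.name] = size
        else:
            try:
                totals[entry.name] = entry.stat(follow_symlinks=False).st_size
            except OSError:
                pass
    return totals


def largest_first(totals):
    """Sort (name, bytes) pairs so the biggest entries come first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

To use it, call largest_first(usage_by_subdir(path)) with the path of your own
scratch directory and print the first few entries. Walking a very large tree
generates many metadata requests, so it is best run on one directory at a time.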

To make matters worse, one of the file servers for the filesystem has had
multiple disk failures and is now rebuilding its redundancy. This means that the
files on that server, and those that will be written to that server, will be
without redundancy and might be lost if more disk errors occur before the
rebuild completes. We have already seen warnings from another disk. The rebuild
process is expected to be finished by Thursday.

That said, we will now enable jobs to start on Ekman again.

Mattias
PDC-staff

