[Gimle-users] Gimle login node crashes, status update

Kent Engström kent at nsc.liu.se
Tue Feb 1 17:45:56 CET 2011


Dear Gimle Users,

please bear with me for this rather long status update on the login node
crash problem.


Problems with directories with a huge number of files
-----------------------------------------------------

We are now fairly sure that the Gimle login node crashes we have been
seeing over the last few weeks are related to directories with a huge
number of files in them (directly, not in a directory hierarchy below
them).

At least one of the directories involved has ~ 350 000 files in it.

We've always told you (if you asked) that this is not a very good
idea, but we did not think that the system as such could be crashed by
this.

We are still unsure why this has started to crash the login node so
often now: whether some component has become more sensitive to huge
directories, or whether you have created directories with more files
in them than ever before.

We will try to find out if there are parameters in the Lustre
clients/servers and the kernel we can adjust, but until we find them,
we must suggest some other ways forward.
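
To check whether any of your own directories are in this range, a
script along the following lines might help. It is only a sketch: the
threshold of 10000 entries is an arbitrary assumption, not an NSC
limit, and the function names are made up for illustration. It reads
directory listings only. On most Linux filesystems the entry type
comes with the listing, so no per-file stat should be needed, but that
is an assumption that has not been verified on Lustre.

```python
import os


def count_entries(path):
    """Count the entries directly inside path (not in subdirectories)."""
    # os.scandir() iterates over the directory listing itself.
    with os.scandir(path) as entries:
        return sum(1 for _ in entries)


def find_big_dirs(root, threshold=10000):
    """Yield (directory, count) for each directory under root that
    directly contains more than threshold entries.

    threshold is an illustrative default, not an official limit.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count > threshold:
            yield dirpath, count
```

Calling list(find_big_dirs("/path/to/your/data")) returns the
offending directories together with their entry counts.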


Solution 1: Fewer and Bigger Files
----------------------------------

Some of the directories containing a huge number of files have rather
small files in them. If that is the case for you, perhaps you could
create fewer and bigger files. That would probably also help with I/O
performance (but benchmark that to be sure).

Whether this is a viable solution depends a lot on your applications,
of course.
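
If your application cannot easily be changed to write bigger files,
one possible workaround is to bundle finished small files into a
single archive afterwards. The following is a minimal sketch using
Python's standard tarfile module; the function names are made up for
illustration, and whether a tar file suits your workflow is something
only you can judge.

```python
import os
import tarfile


def bundle_directory(src_dir, archive_path):
    """Pack every regular file directly inside src_dir into one
    uncompressed tar file. Returns the number of files added."""
    archive_abs = os.path.abspath(archive_path)
    # List the names before creating the archive, and skip the archive
    # itself in case it lives inside src_dir.
    names = sorted(os.listdir(src_dir))
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for name in names:
            full = os.path.join(src_dir, name)
            if os.path.abspath(full) == archive_abs:
                continue
            if os.path.isfile(full):
                tar.add(full, arcname=name)
                count += 1
    return count


def read_member(archive_path, name):
    """Return the contents of one member without unpacking the rest."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(name).read()
```

The original small files are left in place; remove them only after
you have checked that the archive is complete.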


Solution 2: Directory Hierarchy
-------------------------------

If you still have more than a few tens of thousands of files in flat
directories, you could create subdirectories to make the number of
files in each directory more reasonable.

For files whose names are based on dates, one solution I know that
many of you already use is to create an extra directory level based on
the year. If 100 years in one directory means 350000 files, with a
year-based split you will have 100 subdirectories, each containing
3500 files.
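
A year-based split could be sketched as follows. The sketch assumes
that each file name contains a date written as eight digits
(YYYYMMDD), e.g. the hypothetical name "temp_19810215.nc"; adjust the
pattern to your own naming scheme. By default it only prints what it
would do. Please note the request elsewhere in this message to contact
us before actually restructuring large directories.

```python
import os
import re

# Assumption: the date appears as exactly eight consecutive digits.
DATE_RE = re.compile(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)")


def year_of(filename):
    """Return the four-digit year found in filename, or None."""
    match = DATE_RE.search(filename)
    return match.group(1) if match else None


def plan_moves(src_dir):
    """Return a list of (source, destination) pairs that would move
    each dated file into a subdirectory named after its year."""
    moves = []
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            year = year_of(entry.name)
            if year is not None:
                dest = os.path.join(src_dir, year, entry.name)
                moves.append((entry.path, dest))
    return sorted(moves)


def apply_moves(moves, dry_run=True):
    """Perform the planned moves, or just print them if dry_run."""
    for src, dest in moves:
        if dry_run:
            print("would move %s -> %s" % (src, dest))
        else:
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            os.rename(src, dest)
```

Files without a recognisable date are left where they are.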


ls --color=yes considered harmful
---------------------------------

On CentOS 5, by default the ls command is aliased to "ls --color=yes".
This means two things:

1) The output will be colour-coded, which gives a nice immediate
feedback about the file types in the listing.

2) But... every file needs to be "stat'ed" to find out the type. On
our Lustre filesystems, that means that instead of only reading the
directory listing from the metadata server, the client needs to speak
to all the object storage servers too, doing a stat(2) operation on
each and every file. Thus, every "ls" will exercise the storage system
as much as an "ls -l".

The difference in time between "ls" and "ls --color=yes" in a
directory with a huge number of files in it can be rather large
(seconds vs. minutes).
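
The two kinds of listing can be mimicked in Python to see the
difference for yourself. This is only an illustration of the
mechanism, not what ls does internally, and the function names are
made up. Try it on a small directory first; running the second
function on a huge directory causes exactly the load described above.

```python
import os
import stat
import time


def names_only(path):
    """Roughly like a plain 'ls': read the directory listing only."""
    return os.listdir(path)


def names_with_types(path):
    """Roughly like 'ls --color=yes' or 'ls -l': one lstat() call per
    entry to find out what kind of file it is."""
    result = []
    for name in os.listdir(path):
        mode = os.lstat(os.path.join(path, name)).st_mode
        if stat.S_ISDIR(mode):
            kind = "dir"
        elif stat.S_ISLNK(mode):
            kind = "link"
        else:
            kind = "file"
        result.append((name, kind))
    return result


def timed(func, path):
    """Return (result, elapsed seconds) for func(path)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start
```

Comparing timed(names_only, d)[1] with timed(names_with_types, d)[1]
for the same directory d shows how much of the time goes into the
per-file stat calls.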

It might also be related to the recent crashes.

Until we know more about the cause of the crashes, we have disabled
the automatic "--color=yes" feature (for new logins). You will have to
add the option yourself when you need it.

We might revert this change later, when the crash problem is under
control.


Restructuring directories
-------------------------

If you have directories with huge numbers of files in them and want to
restructure them (e.g. by adding a year sub-directory level), please
contact us and wait for a response before embarking on this. At least
for the next couple of weeks, we would like to monitor what happens
while such restructuring is in progress.


Problems with /nobackup/smhid7 today 
------------------------------------

At around 13.30 we got problems with /nobackup/smhid7 (related to
problems with huge directories). In the end, one of the fileservers
for that filesystem had to be restarted. That was completed at 15.50,
and the filesystem should be OK again.


Regards,
-- 
Kent Engström, National Supercomputer Centre
kent at nsc.liu.se, +46 13 28 4444

