[Gimle-users] Gimle login node crashes, status update

Wed Feb 2 10:21:33 CET 2011

Hi all,

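Just a short comment: it is easy to check the number of files in a 
directory by typing this in the directory:

ls | wc

The first number that appears (the line count) is the number of files. 
The second number is the word count, which is identical to the first 
unless some file names contain spaces. Hidden files (names starting with 
a dot) are not included.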
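
If you would rather do the count from a script, here is a minimal Python 
sketch. The function name count_entries is made up for this example. It 
only reads the directory listing and, as far as I know, does not call 
stat() on each file. Unlike plain ls, it also counts hidden files.

```python
import os


def count_entries(path="."):
    """Count the entries in a directory by reading only the directory listing.

    os.scandir() iterates over the entries without an explicit stat() call
    per file. Hidden files are included; "." and ".." are not.
    """
    with os.scandir(path) as entries:
        return sum(1 for _ in entries)


if __name__ == "__main__":
    print(count_entries("."))
```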

Best regards,

Lars Axell

On 02/01/11 17:45, Kent Engström wrote:
> Dear Gimle Users,
>
> please bear with me for this rather long status update on the login node
> crash problem.
>
>
> Problems with directories with a huge number of files
> -----------------------------------------------------
>
> We are pretty sure now that the Gimle login node crashes we have been
> seeing over the last few weeks are related to directories with a huge
> number of files in them (directly, not in a directory hierarchy below it).
>
> At least one of the directories involved has ~ 350 000 files in it.
>
> We've always told you (if you asked) that this is not a very good
> idea, but we did not think that the system as such could be crashed by
> this.
>
> We are still unsure of why this has started to crash the login node
> often now --- if it is because some component has become more
> sensitive to huge directories, or if you have created directories with
> more files in them than ever before.
>
> We will try to find out if there are parameters in the Lustre
> clients/servers and the kernel we can adjust, but until we find that,
> we must suggest some other ways forward.
>
>
> Solution 1: Fewer and Bigger Files
> ----------------------------------
>
> Some of the directories containing a huge number of files have rather
> small files in them. If that is the case for you, perhaps you could
> create fewer and bigger files. That would probably also help with I/O
> performance (but benchmark that to be sure).
>
> Whether this is a viable solution depends a lot on your applications, of
> course.
>
>
> Solution 2: Directory Hierarchy
> -------------------------------
>
> If you still have more than a few tens of thousands of files in flat
> directories, you could create subdirectories to make the number of
> files in each directory more reasonable.
>
> For files whose names are based on dates, one solution I know that
> many of you already use is to create an extra directory level based on
> the year. If 100 years in one directory means 350000 files, with a
> year-based split you will have 100 subdirectories, each containing
> 3500 files.
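
The year-based split can be sketched in Python as a planning step. This 
is only an illustration under assumptions: it assumes file names contain 
a date written as YYYYMMDD (a made-up naming scheme), and the names 
year_of and plan_moves are invented here. It only computes where each 
file would go and moves nothing, in line with the request further down 
to contact NSC before restructuring.

```python
import re
from pathlib import PurePosixPath

# Assumption: the date appears in the file name as exactly eight digits, YYYYMMDD.
DATE_RE = re.compile(r"(?<!\d)(\d{4})\d{4}(?!\d)")


def year_of(filename):
    """Return the year from the first YYYYMMDD date in the name, or None."""
    match = DATE_RE.search(filename)
    return match.group(1) if match else None


def plan_moves(filenames):
    """Map each dated file name to its year-based relative path.

    Names without a recognisable date are left out of the plan.
    Nothing is moved on disk; the result is just a dictionary.
    """
    plan = {}
    for name in filenames:
        year = year_of(name)
        if year is not None:
            plan[name] = str(PurePosixPath(year) / name)
    return plan


if __name__ == "__main__":
    for src, dst in plan_moves(["temp_19610102.nc", "README"]).items():
        print(src, "->", dst)
```

Reviewing such a plan before any files are moved also makes it easy to 
check how many files would end up in each year directory.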
>
>
> ls --color=yes considered harmful
> ---------------------------------
>
> On CentOS 5, by default the ls command is aliased to "ls --color=yes".
> This means two things:
>
> 1) The output will be colour-coded, which gives a nice immediate
> feedback about the file types in the listing.
>
> 2) But... every file needs to be "stat'ed" to find out the type. On
> our Lustre filesystems, that means that instead of only reading the
> directory listing from the metadata server, the client needs to speak
> to all the object storage servers too, doing a stat(2) operation on
> each and every file. Thus, every "ls" will exercise the storage system
> as much as an "ls -l".
>
> The difference in time between "ls" and "ls --color=yes" in a
> directory with a huge number of files in it can be rather large
> (seconds vs. minutes).
>
> It might also be related to the recent crashes.
>
> Until we know more about the cause of the crashes, we have disabled
> the automatic "--color=yes" feature (for new logins). You will have to
> add the option yourself when you need it.
>
> We might revert this change later, when the crash problem is under
> control.
>
>
> Restructuring directories
> -------------------------
>
> If you have directories with huge numbers of files in them and want to
> restructure them (e.g. by adding a year sub-directory level) please
> contact us and wait for a response before embarking on this. At least
> for the next couple of weeks, we would like to monitor what happens
> during this.
>
>
> Problems with /nobackup/smhid7 today
> ------------------------------------
>
> At around 13.30 we got problems with /nobackup/smhid7 (related to
> problems with huge directories).  In the end, one of the fileservers
> for that filesystem had to be restarted. At 15.50 that was completed
> and the filesystem should be OK again.
>
>
> Regards,
>    
