How do you deal with lots of small files?

慢半拍i 2020-12-08 04:48

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production the …

14 answers
  • 2020-12-08 05:21

    If there are any meaningful, categorical aspects of the data, you could nest the files in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.

    The most obvious general grouping is by date, which gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k); a short sketch of this layout follows below.

    Even if you are able to improve the filesystem/file-browser performance, it sounds like this is a problem you will run into again in another two or three years... just looking at a list of 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to look at only smaller subsets of the files.

    Using tools like 'find' (under Cygwin or MinGW) can make the presence of the subdirectory tree a non-issue when browsing files.
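
    A minimal sketch of that year/month/day layout, assuming the readings arrive with a timestamp; the DATA_ROOT path and the store_reading helper are made up for illustration:

    ```python
    import os
    from datetime import datetime

    DATA_ROOT = r"D:\readings"  # hypothetical root; point this at the real partition

    def store_reading(payload: bytes, when: datetime) -> str:
        """Write one 64k reading under a year/month/day subtree."""
        leaf = os.path.join(DATA_ROOT, f"{when:%Y}", f"{when:%m}", f"{when:%d}")
        os.makedirs(leaf, exist_ok=True)  # only ~1-3k files end up in each leaf
        path = os.path.join(leaf, f"{when:%H%M%S_%f}.bin")
        with open(path, "wb") as f:
            f.write(payload)
        return path
    ```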

  • 2020-12-08 05:22

    You could try using something like Solid File System.

    This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.

    http://www.eldos.com/solfsdrv/
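
    SolFS has its own driver and API, which isn't shown here. Purely to illustrate the same "many logical files inside one physical file" idea with nothing but the standard library, here is a rough sketch using a ZIP archive as the single on-disk container; the file and function names are invented:

    ```python
    import zipfile

    CONTAINER = "readings.zip"  # one physical file holding many logical 64k entries

    def add_reading(name: str, payload: bytes) -> None:
        # mode "a" appends a new member without rewriting the existing ones;
        # ZIP_STORED skips compression since the readings are already binary
        with zipfile.ZipFile(CONTAINER, "a", zipfile.ZIP_STORED) as zf:
            zf.writestr(name, payload)

    def get_reading(name: str) -> bytes:
        with zipfile.ZipFile(CONTAINER, "r") as zf:
            return zf.read(name)
    ```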

  • 2020-12-08 05:23

    I have run into this problem many times in the past. We tried storing the files by date, zipping the files under each date so you don't have lots of small files, and so on. All of these were band-aids for the real problem: storing the data as lots of small files on NTFS.

    You can go to ZFS or some other file system that handles small files better, but you should still stop and ask whether you NEED to store the small files at all.

    In our case we eventually went to a system where all of the small files for a given date were appended in a TAR type of fashion with simple delimiters so we could parse them back out; a rough sketch of this approach follows below. The number of files on disk went from 1.2 million to a few thousand. They actually loaded faster, because NTFS can't handle small files very well and the drive was better able to cache a 1 MB file anyway. In our case the time to access and parse out the right part of the file was minimal compared to the actual storage and maintenance of the stored files.
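
    A rough sketch of that per-date, TAR-style aggregation, using Python's standard tarfile module in place of hand-rolled delimiters; the archive naming is illustrative:

    ```python
    import io
    import tarfile
    from datetime import date

    def archive_path(day: date) -> str:
        return f"readings-{day:%Y%m%d}.tar"  # one archive per day, illustrative name

    def append_reading(day: date, name: str, payload: bytes) -> None:
        """Append one 64k reading to that day's archive; mode "a" adds members without rewriting."""
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        with tarfile.open(archive_path(day), "a") as tar:
            tar.addfile(info, io.BytesIO(payload))

    def read_reading(day: date, name: str) -> bytes:
        with tarfile.open(archive_path(day), "r") as tar:
            return tar.extractfile(name).read()
    ```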

  • 2020-12-08 05:28

    Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris with ZFS, for example).

  • 2020-12-08 05:30

    Aside from placing the files in sub-directories...

    Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are still presented as individual files. In the background, the application would take those files and combine them into larger files (and since the sizes are always 64k, getting the data you need back out should be relatively easy), to get rid of the mess you have. A sketch of this fixed-size packing follows below.

    That way you can still make it easy for users to access the files they want, while also giving yourself more control over how everything is structured.
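
    A minimal sketch of the fixed-size packing idea, assuming every reading really is exactly 64 KB; the packed-file path and helper names are made up:

    ```python
    import os

    RECORD_SIZE = 64 * 1024  # every reading is exactly 64k, per the question

    def append_record(packed_path: str, payload: bytes) -> int:
        """Append one reading to the packed file and return its record index."""
        assert len(payload) == RECORD_SIZE
        index = (os.path.getsize(packed_path) // RECORD_SIZE
                 if os.path.exists(packed_path) else 0)
        with open(packed_path, "ab") as f:
            f.write(payload)
        return index

    def read_record(packed_path: str, index: int) -> bytes:
        """Fetch the i-th reading by seeking straight to index * 64k."""
        with open(packed_path, "rb") as f:
            f.seek(index * RECORD_SIZE)
            return f.read(RECORD_SIZE)
    ```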

  • 2020-12-08 05:31

    NTFS performance degrades severely once a directory holds more than about 10,000 files. What you can do is create an additional level in the directory hierarchy, with each subdirectory holding at most 10,000 files; a sketch of this bucketing follows below.

    For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
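
    A sketch of that bucketing, assuming the files can be numbered with a sequential id; the shard size, naming scheme, and helpers are illustrative:

    ```python
    import os

    SHARD_SIZE = 1000  # files per subdirectory; SVN 1.5 used 1,000 as its default

    def sharded_path(root: str, file_id: int) -> str:
        """Place file N in subdirectory N // SHARD_SIZE, e.g. id 12345 -> root/12/12345.bin."""
        return os.path.join(root, str(file_id // SHARD_SIZE), f"{file_id}.bin")

    def write_sharded(root: str, file_id: int, payload: bytes) -> str:
        path = sharded_path(root, file_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(payload)
        return path
    ```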
