I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).
You should use "getdents" in place of ls/find
Here is a very good article which describes the getdents approach.
http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html
Here is the extract:
ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (e.g., 500 million directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() system call directly, rather than helper methods from the C standard library.
We can find the C code to list the files using getdents() from here:
There are two modifications you will need to make in order to quickly list all the files in a directory.
First, increase the buffer size from X to something like 5 megabytes.
#define BUF_SIZE 1024*1024*5
Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding
if (d->d_ino != 0) printf(...);
In my case I only cared about the file names in the directory, so I also rewrote the printf() statement to print just the filename:
if (d->d_ino) printf("%s\n", (char *) d->d_name);
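Since the article's starting code isn't reproduced above, here is a minimal self-contained sketch of what the modified listdir.c might look like. It is an assumption on my part that it calls getdents64 via syscall() (the article's version may use the older getdents call), and the struct definition mirrors the one in the getdents(2) man page; it incorporates both changes described above (the 5 MB buffer and skipping entries with inode == 0).

/* listdir.c - sketch of a getdents64()-based directory lister.
 * Based on the example in the getdents(2) man page; getdents64 is an
 * assumption here, adjust for your kernel/libc if needed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BUF_SIZE 1024*1024*5   /* 5 MB buffer instead of the default small one */

struct linux_dirent64 {
    unsigned long long d_ino;    /* 64-bit inode number */
    long long          d_off;    /* offset to next structure */
    unsigned short     d_reclen; /* size of this dirent */
    unsigned char      d_type;   /* file type */
    char               d_name[]; /* filename (null-terminated) */
};

int main(int argc, char *argv[])
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    char *buf = malloc(BUF_SIZE);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
        if (nread == -1) {
            perror("getdents64");
            exit(EXIT_FAILURE);
        }
        if (nread == 0)          /* end of directory */
            break;

        for (long bpos = 0; bpos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + bpos);
            if (d->d_ino != 0)   /* skip deleted/empty entries */
                printf("%s\n", d->d_name);
            bpos += d->d_reclen;
        }
    }

    free(buf);
    close(fd);
    return 0;
}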
Compile it (it doesn't need any external libraries, so it's super simple to do)
gcc listdir.c -o listdir
Now just run
./listdir [directory with an insane number of files]
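Since the question is about counting files rather than listing them, you can pipe the output through wc -l (assuming the program prints one name per line, as in the sketch above):

./listdir [directory with an insane number of files] | wc -l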