I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).
I came here when trying to count the files in a data set of approximately 10,000 folders with approximately 10,000 files each. The problem with many of the approaches is that they implicitly stat 100 million files, which takes ages.
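For illustration, here is a minimal sketch (my own, not code from any of the answers here) of that slow pattern, paying one stat() syscall per directory entry just to decide whether it is a regular file:

#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

/* Counts regular files in a single directory the slow way:
   readdir() to enumerate, plus stat() on every entry. */
long count_with_stat(const char *path) {
    DIR *dir = opendir(path);
    struct dirent *ent;
    struct stat st;
    char buf[4096];
    long count = 0;

    if (!dir)
        return -1;
    while ((ent = readdir(dir))) {
        /* one extra syscall per entry -- on a data set like mine,
           ~100 million stat() calls, and that is what takes ages */
        snprintf(buf, sizeof(buf), "%s/%s", path, ent->d_name);
        if (stat(buf, &st) == 0 && S_ISREG(st.st_mode))
            ++count;
    }
    closedir(dir);
    return count;
}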
I took the liberty of extending Christopher Schultz's approach so that it supports passing directories via arguments (his recursive approach uses stat as well).
Put the following into the file dircnt_args.c:
#include <stdio.h>   /* printf */
#include <dirent.h>  /* DIR, opendir, readdir, closedir */

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count;
    long countsum = 0;
    int i;

    for (i = 1; i < argc; i++) {
        dir = opendir(argv[i]);
        if (!dir) {            /* skip arguments that cannot be opened */
            perror(argv[i]);
            continue;
        }
        count = 0;
        /* readdir() alone is cheap: no stat() per entry.
           Note that the count includes the "." and ".." entries. */
        while ((ent = readdir(dir)))
            ++count;
        closedir(dir);
        printf("%s contains %ld files\n", argv[i], count);
        countsum += count;
    }
    printf("sum: %ld\n", countsum);
    return 0;
}
After compiling with gcc -o dircnt_args dircnt_args.c, you can invoke it like this:
dircnt_args /your/directory/*
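Note that the shell expands the glob before dircnt_args ever runs, so with very many subdirectories you can bump into the kernel's argument-length limit (ARG_MAX). One workaround (a sketch, assuming GNU find and xargs) is to feed the directories in batches:

find /your/directory -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 ./dircnt_args

with the caveat that xargs may split the list across several invocations, each printing its own sum. This find call only touches the 10,000 top-level entries, so it stays fast.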
On 100 million files in 10,000 folders, the above completes quite quickly (approximately 5 minutes for the first run, approximately 23 seconds on a warm cache).
The only other approach that finished in less than an hour was ls, at about 1 minute on a warm cache: ls -f /your/directory/* | wc -l. Its count is off by a couple of lines per directory, though...
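The discrepancy is easy to see: when ls is given several directory arguments it prints a header line and a blank separator for each of them, and -f additionally lists the . and .. entries, all of which wc -l counts. The piped output looks roughly like this (illustrative names):

/your/directory/dir0001:
.
..
file0001
file0002
...

/your/directory/dir0002:
...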
Contrary to my expectations, none of my attempts with find returned within an hour :-/