Question
I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (> 100,000).
When there are that many files, performing "ls | wc -l" takes quite a long time to execute. I believe this is because it's returning the names of all the files. I'm trying to take up as little of the disk IO as possible.
I have experimented with some shell and Perl scripts to no avail. Any ideas?
Answer 1:
By default ls sorts the names, which can take a while if there are a lot of them.  Also there will be no output until all of the names are read and sorted.  Use the ls -f option to turn off sorting.
ls -f | wc -l
Note that this will also enable -a, so ., .., and other files starting with . will be counted.
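If the extra entries matter, the same unsorted read can be done directly with readdir() while skipping . and .. explicitly. This is only a minimal sketch, not part of the original answer; the function name count_entries is made up for illustration, and hidden files are still counted, as they are with ls -f.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Hypothetical helper: count the entries in `path`, skipping "." and "..".
 * Hidden files (names starting with '.') are still counted.
 * Returns -1 if the directory cannot be opened. */
long count_entries(const char *path) {
    assert(path != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *ent;
    /* readdir() returns entries in directory order, so nothing is sorted */
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        ++count;
    }
    closedir(dir);
    return count;
}
```

Calling count_entries(".") should give the ls -f | wc -l figure minus two.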
Answer 2:
The fastest way is a purpose-built program, like this:
#include <stdio.h>
#include <dirent.h>
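int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count = 0;
    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    dir = opendir(argv[1]);
    /* opendir fails if the path doesn't exist or isn't a directory */
    if (dir == NULL) {
        perror(argv[1]);
        return 1;
    }
    /* every entry is counted, including "." and ".." */
    while((ent = readdir(dir)))
            ++count;
    closedir(dir);
    printf("%s contains %ld files\n", argv[1], count);
    return 0;
}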
In my testing I didn't try to control for the cache: I ran each of these about 50 times against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):
ls -1  | wc - 0:01.67
ls -f1 | wc - 0:00.14
find   | wc - 0:00.22
dircnt | wc - 0:00.04
That last one, dircnt, is the program compiled from the above source.
EDIT 2016-09-26
Due to popular demand, I've re-written this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.
Since it's clear some folks want to know how to do all this, I have a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, possibly including Microsoft Windows when a POSIX compatibility layer is available. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.
As you can see, it's much more complicated than the original and necessarily so: at least one function must exist to be called recursively unless you want the code to become very complex (e.g. managing a subdirectory stack and processing that in a single loop). Since we have to check file types, differences between different OSs, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.
There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (and stat is only needed if you aren't lucky enough to have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdir pathnames, but theoretically, the system shouldn't allow any path name that is longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.
#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>
#if defined(WIN32) || defined(_WIN32) 
#define PATH_SEPARATOR '\\' 
#else
#define PATH_SEPARATOR '/' 
#endif
/* A custom structure to hold separate file and directory counts */
struct filecount {
  long dirs;
  long files;
};
/*
 * counts the number of files and directories in the specified directory.
 *
 * path - relative pathname of a directory whose files should be counted
 * counts - pointer to struct containing file/dir counts
 */
void count(char *path, struct filecount *counts) {
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
    /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
#if !defined ( _DIRENT_HAVE_D_TYPE )
    struct stat statbuf;     /* buffer for stat() info */
#endif
/* fprintf(stderr, "Opening dir %s\n", path); */
    dir = opendir(path);
    /* opendir failed... file likely doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(path);
        return;
    }
    while((ent = readdir(dir))) {
      /* >= leaves room for the terminating NUL in subpath */
      if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
          fprintf(stderr, "path too long (%zu) %s%c%s\n", strlen(path) + 1 + strlen(ent->d_name), path, PATH_SEPARATOR, ent->d_name);
          closedir(dir);
          return;
      }
/* Use dirent.d_type if present, otherwise use stat() */
#if defined ( _DIRENT_HAVE_D_TYPE )
/* fprintf(stderr, "Using dirent.d_type\n"); */
      if(DT_DIR == ent->d_type) {
#else
/* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
      sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
      if(lstat(subpath, &statbuf)) {
          perror(subpath);
          closedir(dir);
          return;
      }
      if(S_ISDIR(statbuf.st_mode)) {
#endif
          /* Skip "." and ".." directory entries... they are not "real" directories */
          if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
/*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
          } else {
              sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
              counts->dirs++;
              count(subpath, counts);
          }
      } else {
          counts->files++;
      }
    }
/* fprintf(stderr, "Closing dir %s\n", path); */
    closedir(dir);
}
int main(int argc, char *argv[]) {
    struct filecount counts;
    counts.files = 0;
    counts.dirs = 0;
    if(argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    count(argv[1], &counts);
    /* If we found nothing, this is probably an error which has already been printed */
    if(0 < counts.files || 0 < counts.dirs) {
        printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
    }
    return 0;
}
EDIT 2017-01-17
I've incorporated two changes suggested by @FlyingCodeMonkey:
- Use lstat instead of stat. This will change the behavior of the program if you have symlinked directories in the directory you are scanning. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory will count as a single file, and its contents will not be counted.
- If the path of a file is too long, an error message will be emitted and the scan of that directory will stop.
EDIT 2017-06-29
With any luck, this will be the last edit of this answer :)
I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull-request from GitHub.
The source is available under Apache License 2.0. Patches* welcome!
* "patch" is what old people like me call a "pull request".
Answer 3:
Did you try find? For example:
find . -name "*.ext" | wc -l
Answer 4:
find, ls and perl tested against 40 000 files: same speed (though I didn't try to clear the cache):
[user@server logs]$ time find . | wc -l
42917
real    0m0.054s
user    0m0.018s
sys     0m0.040s
[user@server logs]$ time /bin/ls -f | wc -l
42918
real    0m0.059s
user    0m0.027s
sys     0m0.037s
and with perl opendir/readdir, same time:
[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918
real    0m0.057s
user    0m0.024s
sys     0m0.033s
Note: I used /bin/ls to bypass any alias, which might slow things down a little, and -f to avoid sorting the file names. ls without -f is about twice as slow as find/perl, but with -f it seems to take the same time:
[user@server logs]$ time /bin/ls . | wc -l
42916
real    0m0.109s
user    0m0.070s
sys     0m0.044s
I also would like to have some script to ask the file system directly without all the unnecessary information.
tests based on answer of Peter van der Heijden, glenn jackman and mark4o.
Thomas
Answer 5:
You can change the output based on your requirements, but here is a bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.
dir=/tmp/count_these/ ; for i in $(ls -1 "${dir}" | sort -n) ; do echo "$i => $(find "${dir}${i}" -type f | wc -l),"; done
This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command (for example, adding a -name pattern) can narrow down which kinds of files are counted.
Results in something like this:
1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
Answer 6:
Surprisingly for me, a bare-bones find is very much comparable to ls -f
> time ls -f my_dir | wc -l
17626
real    0m0.015s
user    0m0.011s
sys     0m0.009s
versus
> time find my_dir -maxdepth 1 | wc -l
17625
real    0m0.014s
user    0m0.008s
sys     0m0.010s
Of course, the values on the third decimal place shift around a bit every time you execute any of these, so they're basically identical. Notice however that find returns one extra unit, because it counts the actual directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).
Answer 7:
Just adding this for the sake of completeness. The correct answer of course has already been posted by someone else, but you can also get a count of files and directories with the tree program.
Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took 4.8 seconds on my computer, for tree to count my whole home dir, which was 24777 directories, 238680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.
As long as you don't have any subfolders, tree is a quick and easy way to count the files.
Also, and purely for the fun of it, you can use tree | grep '^├' to only show the files/folders in the current directory - this is basically a much-slower version of ls.
Answer 8:
Fast Linux File Count
The fastest Linux file count that I know of is
locate -c -r '/home'
There is no need to invoke grep! But as mentioned, you should have a fresh database (updated daily by a cron job, or manually with sudo updatedb).
From man locate
-c, --count
    Instead  of  writing  file  names on standard output, write the number of matching
    entries only.
Additionally, you should know that it also counts the directories as files!
BTW: If you want an overview of the files and directories on your system, type
locate -S
It outputs the number of directories, files etc.
Answer 9:
You could check whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here
Answer 10:
This answer here is faster than almost everything else on this page for very large, very nested directories:
https://serverfault.com/a/691372/84703
locate -r '.' | grep -c "^$PWD"
Answer 11:
I came here when trying to count the files in a dataset of ~ 10K folders with ~10K files each. Problem with many of the approaches is that they implicitly stat 100M files, which takes ages.
I took the liberty to extend the approach by christopher-schultz so it supports passing directories via args (his recursive approach uses stat as well).
Put the following into file dircnt_args.c:
#include <stdio.h>
#include <dirent.h>
int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count;
    long countsum = 0;
    int i;
    for(i=1; i < argc; i++) {
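        dir = opendir(argv[i]);
        /* skip arguments that are not readable directories (e.g. plain files matched by a glob) */
        if (dir == NULL) {
            perror(argv[i]);
            continue;
        }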
        count = 0;
        while((ent = readdir(dir)))
            ++count;
        closedir(dir);
        printf("%s contains %ld files\n", argv[i], count);
        countsum += count;
    }
    printf("sum: %ld\n", countsum);
    return 0;
}
After a gcc -o dircnt_args dircnt_args.c you can invoke it like this:
dircnt_args /your/dirs/*
On 100M files in 10K folders the above completes quite quickly (~5 min for first run, followup on cache: ~23 s).
The only other approach that finished in less than an hour was ls with about 1 min on cache: ls -f /your/dirs/* | wc -l. The count is off by a couple of newlines per dir though...
Contrary to what I expected, none of my attempts with find returned within an hour :-/
Answer 12:
Writing this here as I don't have enough reputation points to comment on an answer, but I am allowed to leave my own answer, which doesn't make sense. Anyway...
Regarding the answer by Christopher Schultz, I suggest changing stat to lstat and possibly adding a bounds-check to avoid buffer overflow:
/* PATH_SEPARATOR is a single char, so it contributes 1 to the length */
if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
    fprintf(stdout, "path too long (%zu) %s%c%s\n", strlen(path) + 1 + strlen(ent->d_name), path, PATH_SEPARATOR, ent->d_name);
    return;
}
The suggestion to use lstat is to avoid following symlinks which could lead to cycles if a directory contains a symlink to a parent directory.
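Both suggestions can be packaged as small helpers. This is just one possible sketch: join_path and is_real_dir are hypothetical names, not functions from the answer being discussed, and '/' is assumed as the separator. snprintf() reports the length it would have written, so truncation can be detected without computing the lengths by hand.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: write "dir/name" into buf.
 * Returns 0 on success, -1 if the result (plus NUL) would not fit. */
int join_path(char *buf, size_t bufsize, const char *dir, const char *name) {
    assert(buf != NULL && dir != NULL && name != NULL);
    int n = snprintf(buf, bufsize, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : 0;
}

/* Hypothetical helper: 1 if `path` is a directory itself, 0 otherwise.
 * lstat() does not follow symlinks, so a symlink pointing at a directory
 * yields 0 and a recursive walk never descends through it. */
int is_real_dir(const char *path) {
    struct stat sb;
    assert(path != NULL);
    if (lstat(path, &sb) != 0)
        return 0;
    return S_ISDIR(sb.st_mode) ? 1 : 0;
}
```

A recursive counter would call join_path first, skip the entry if it returns -1, and only recurse when is_real_dir returns 1.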
Answer 13:
The fastest way on Linux (the question is tagged as linux) is to use a direct system call. Here's a little program that counts files (only, no dirs) in a directory. You can count millions of files and it is around 2.5 times faster than "ls -f" and around 1.3-1.5 times faster than Christopher Schultz's answer.
#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>   /* syscall(), close() */
#define BUF_SIZE 4096
struct linux_dirent {
    long d_ino;
    off_t d_off;
    unsigned short d_reclen;
    char d_name[];
};
int countDir(char *dir) {
    int fd, nread, bpos, numFiles = 0;
    char d_type, buf[BUF_SIZE];
    struct linux_dirent *dirEntry;
    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        puts("open directory error");
        exit(3);
    }
    while (1) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1) {
            puts("getdents error");
            exit(1);
        }
        if (nread == 0) {
            break;
        }
        for (bpos = 0; bpos < nread;) {
            dirEntry = (struct linux_dirent *) (buf + bpos);
            d_type = *(buf + bpos + dirEntry->d_reclen - 1);
            if (d_type == DT_REG) {
                // Increase counter
                numFiles++;
            }
            bpos += dirEntry->d_reclen;
        }
    }
    close(fd);
    return numFiles;
}
int main(int argc, char **argv) {
    if (argc != 2) {
        puts("Pass directory as parameter");
        return 2;
    }
    printf("Number of files in %s: %d\n", argv[1], countDir(argv[1]));
    return 0;
}
PS: It is not recursive but you could modify it to achieve that.
Answer 14:
ls spends most of its time sorting the file names; using -f to disable the sorting will save some time:
ls -f | wc -l
or you can use find:
find . -type f | wc -l
Answer 15:
I found that, with a huge amount of data, avoiding in-memory processing was faster than piping the commands together. So I saved the result to a file and analyzed it afterwards:
ls -1 /path/to/dir > count.txt && cat count.txt | wc -l
Answer 16:
You should use "getdents" in place of ls/find.
Here is one very good article which described the getdents approach.
http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html
Here is the extract:
ls and practically every other method of listing a directory (including python os.listdir, find .) rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (i.e. 500M of directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() syscall directly, rather than helper methods from libc.
We can find the C code to list the files using getdents() from here:
There are two modifications you will need to make in order to quickly list all the files in a directory.
First, increase the buffer size from X to something like 5 megabytes.
#define BUF_SIZE (1024*1024*5)
Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding
if (d->d_ino != 0) printf(...);
In my case I also really only cared about the file names in the directory so I also rewrote the printf() statement to only print the filename.
if(d->d_ino) printf("%s\n", (char *) d->d_name);
Compile it (it doesn't need any external libraries, so it's super simple to do)
gcc listdir.c -o listdir
Now just run
./listdir [directory with insane number of files]
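For counting rather than listing, the same two modifications can be put together as below. This is a Linux-only sketch and not the article's listdir.c: the function name count_getdents is made up, it uses SYS_getdents64 with the record layout documented in getdents(2), and the 5 MB buffer is allocated on the heap instead of the stack.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BUF_SIZE (5 * 1024 * 1024)   /* large buffer, as the article suggests */

/* Record layout returned by getdents64, as described in getdents(2). */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: count entries in `path` via the raw syscall,
 * skipping ".", ".." and entries with inode 0. Returns -1 on error. */
long count_getdents(const char *path) {
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    char *buf = malloc(BUF_SIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long count = 0;
    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
        if (nread == -1) { count = -1; break; }   /* read error */
        if (nread == 0) break;                    /* end of directory */

        /* walk the variable-length records packed into the buffer */
        for (long bpos = 0; bpos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + bpos);
            if (d->d_ino != 0 &&
                strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0)
                ++count;
            bpos += d->d_reclen;
        }
    }

    free(buf);
    close(fd);
    return count;
}
```

Whether the larger buffer actually helps depends on the filesystem and disk, so it is worth timing against a plain readdir() loop on the directory in question.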
Answer 17:
I prefer the following command to keep track of the changes in the number of files in a directory.
watch -d -n 0.1 'ls | wc -l'
The command keeps a window open to track the number of files in the directory, refreshing every 0.1 seconds.
Answer 18:
The 10 directories with the highest number of files:
dir=/ ; for i in $(ls -1 "${dir}" | sort -n) ; do echo "$(find "${dir}${i}" \
    -type f | wc -l) => $i,"; done | sort -nr | head -10
Source: https://stackoverflow.com/questions/1427032/fast-linux-file-count-for-a-large-number-of-files