Is there a way to efficiently yield every file in a directory containing millions of files?

萌比男神i 2020-12-01 18:53

I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory, and then returns the list. What I want, is a way t

6 Answers
  •  离开以前
    2020-12-01 19:06

    tl;dr: As of Python 3.5 (in beta at the time of writing), just use os.scandir.

    As I've written earlier, "iglob" only presents the facade of a real iterator, so to get entries one at a time you have to call low-level system functions. Fortunately, calling low-level functions is doable from Python. The functions differ between Windows and POSIX/Linux systems.

    • If you are on Windows, you should check if win32api has any call to read "the next entry from a dir" or how to proceed otherwise.
    • If you are on POSIX/Linux, you can call libc functions directly through ctypes and get one directory entry (including its name) at a time.

    The documentation on the C functions is here: http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory

    http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory

    I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system, but it may not work on yours[footnote-1]. Before using it, I recommend opening your /usr/include/dirent.h header file and verifying that the snippet is correct: the Python Structure must match the C struct.

    Here is the snippet I've put together using ctypes and libc; it lets you get each filename in turn and act on it. Note that ctypes automatically gives you a Python string when you call str(...) on the char array defined in the structure. (I am using the print statement, which implicitly calls Python's str.)

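    #!/usr/bin/env python2
    from ctypes import (CDLL, Structure, c_char, c_char_p, c_int64,
                        c_ubyte, c_uint64, c_ushort, c_void_p)

    # Mirrors glibc's struct dirent64 on 64-bit Linux; verify against your dirent.h.
    class Dirent(Structure):
        _fields_ = [("d_ino", c_uint64),
                    ("d_off", c_int64),
                    ("d_reclen", c_ushort),
                    ("d_type", c_ubyte),
                    ("d_name", c_char * 256)]

    libc = CDLL("libc.so.6")
    # Declare pointer-sized types: the default int return type truncates 64-bit pointers.
    libc.opendir.argtypes = [c_char_p]
    libc.opendir.restype = c_void_p
    libc.readdir64.argtypes = [c_void_p]
    libc.readdir64.restype = c_void_p
    libc.closedir.argtypes = [c_void_p]

    dir_ = libc.opendir("/home/jsbueno")
    if not dir_:
        raise OSError("opendir failed")

    while True:
        p = libc.readdir64(dir_)
        if not p:  # NULL pointer: no more entries
            break
        entry = Dirent.from_address(p)
        print entry.d_name
    libc.closedir(dir_)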
    

    Update: Python 3.5 is now in beta, and it includes the new os.scandir function, the implementation of PEP 471 ("a better and faster directory iterator"). It does exactly what is asked for here, along with many other optimizations that can deliver up to a 9-fold speed increase over os.listdir when listing large directories on Windows (a 2-3 fold increase on POSIX systems).
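
As a rough illustration of the os.scandir approach, here is a minimal sketch of a generator that yields one file path at a time. The function name iter_files and the choice to skip symlinks and subdirectories are assumptions made for this example, not part of the original answer. The with-statement form of os.scandir requires Python 3.6 or later.

```python
import os
from typing import Iterator


def iter_files(directory: str) -> Iterator[str]:
    """Lazily yield the path of each regular file directly inside `directory`.

    Hypothetical helper for illustration; it does not recurse into subdirectories.
    """
    # os.scandir returns an iterator of os.DirEntry objects instead of
    # building a complete list of names the way os.listdir does.
    with os.scandir(directory) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink pointing at a file is not counted.
            if entry.is_file(follow_symlinks=False):
                yield entry.path


if __name__ == "__main__":
    for path in iter_files("."):
        print(path)
```

Because iter_files is a generator, the caller can stop early (for example with itertools.islice) without the remaining entries ever being read.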

    [footnote-1] The dirent64 C struct is determined at C compile time for each system.
