Is there a way to efficiently yield every file in a directory containing millions of files?

frontend · open · 6 answers · 1666 views
萌比男神i 2020-12-01 18:53

I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory, and then returns the list. What I want is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
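(For reference: on Python 3.5+, `os.scandir` returns an iterator over directory entries rather than a fully materialized list, which is the usual way to get the lazy behavior asked for here. A minimal sketch, with an illustrative helper name:)

```python
import os

def iter_files(path):
    """Lazily yield the names of regular files in *path*, one at a time.

    os.scandir walks the directory as an iterator, so the full listing
    is never held in memory at once -- unlike os.listdir.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                yield entry.name
```

Each `DirEntry` also caches `stat`-like information, so filtering by file type does not require an extra system call per entry on most platforms.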

6 Answers
  •  死守一世寂寞
    2020-12-01 19:12

    What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

    No method will reveal a filename that "changed". It's not even clear what you mean by "filenames change, new files are added, and files are deleted". What is your use case?

    Let's say you have three files: a.a, b.b, c.c.

    Your magical "iterator" starts with a.a. You process it.

    The magical "iterator" moves to b.b. You're processing it.

    Meanwhile, a.a is copied to a1.a1 and a.a is deleted. What now? What does your magical iterator do with these? It has already passed a.a, and since a1.a1 sorts before b.b, it will never see it. What is supposed to happen when "filenames change, new files are added, and files are deleted"?

    The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?


    Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

    Don't use the naked file system for coordination.

    Use a queue.

    Process A writes files and enqueues the add/change/delete memento onto a queue.

    Process B reads the memento from queue and then does the follow-on processing on the file named in the memento.
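    A minimal sketch of this producer/consumer pattern, using threads and a `queue.Queue` for illustration (the answer describes separate processes, for which `multiprocessing.Queue` would serve the same role; the `producer`/`consumer` names and `("add", name)` memento format are illustrative assumptions):

    ```python
    import queue
    import threading

    def producer(q, filenames):
        # Process A: write each file, then enqueue a memento describing
        # what just happened, instead of relying on directory listings.
        for name in filenames:
            # ... write the file to the storage location ...
            q.put(("add", name))
        q.put(None)  # sentinel: no more work

    def consumer(q, results):
        # Process B: react to one memento at a time. The directory is
        # never listed, so its size (millions of files) is irrelevant.
        while True:
            memento = q.get()
            if memento is None:
                break
            action, name = memento
            results.append((action, name))  # ... process / move the file ...

    q = queue.Queue()
    results = []
    t = threading.Thread(target=consumer, args=(q, results))
    t.start()
    producer(q, ["a.a", "b.b", "c.c"])
    t.join()
    ```

    The key design point is that deletions and renames become explicit `("delete", name)` or `("change", name)` mementos on the queue, so the consumer is told about every event in order rather than trying to infer it from a racing snapshot of the file system.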
