Lazily Reading a File in D

Question

I'm writing a directory tree scanning function in D that tries to combine what tools such as grep and file do: it should grep for things in a file only if the file does not match a set of magic bytes indicating file types such as ELF, images, etc.

What is the best approach to making such exclusion logic run as fast as possible with regard to minimizing file I/O? I typically don't want to read in the whole file if I only need to read some magic bytes at the beginning. However, to keep the code general enough for future cases (some magics may lie at the end of the file or somewhere other than the beginning), it would be nice if I could use an mmap-like interface that lazily fetches data from the disk only when it is read. The array interface also simplifies my algorithms.

Is D's std.mmfile the best option in this case?

Update: According to this post, I guess mmap is advised: http://forum.dlang.org/thread/dlrwzrydzjusjlowavuc@forum.dlang.org

If I only need read access as an array (opIndex), are there any cons to using std.mmfile over std.stdio.File or std.file?

Answer 1:

If you want to lazily read a file with Phobos, you pretty much have three options:

1. Use std.stdio.File's byLine and read a line at a time.
2. Use std.stdio.File's byChunk and read a particular number of bytes at a time.
3. Use std.mmfile.MmFile and operate on the file as an array, taking advantage of mmap underneath the hood to avoid reading in the whole file.
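As a rough illustration of the byChunk option, here is a minimal sketch that scans a file chunk by chunk and stops as soon as it finds a NUL byte, which is one common heuristic for "this is probably binary". The function name hasNulByte and the 4096-byte default are assumptions made for this example, not something from the question.

```d
import std.algorithm.searching : canFind;
import std.stdio : File;

/// Hypothetical helper: returns true if the file contains a NUL byte.
/// The file is read lazily, one chunk at a time.
bool hasNulByte(string path, size_t chunkSize = 4096)
{
    foreach (ubyte[] chunk; File(path, "rb").byChunk(chunkSize))
    {
        // Returning here ends the loop, so later chunks are never read.
        if (chunk.canFind(cast(ubyte) 0))
            return true;
    }
    return false;
}
```

Because the loop returns early, a file whose first chunk already contains a NUL byte costs only one chunk-sized read.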

I fully expect that #3 is going to be the fastest (profiling could prove differently, but I'd be very surprised given how fantastic mmap is). It's also probably the easiest to use, because you get an array to operate on. The only problem with MmFile that I'm aware of is that it's a class when it should arguably be a ref-counted struct so that it would clean itself up when you were done. Right now, if you don't want to wait for the GC to clean it up, you'd have to manually call unmap on it or use destroy to destroy it without freeing its memory (though destroy should be used with caution). There may be some sort of downside to using mmap (which would then naturally mean that there was a downside to using MmFile), but I'm not aware of any.
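A minimal sketch of the MmFile approach might look like the following. It checks whether a file starts with the ELF magic bytes and uses destroy in a scope(exit) block so the mapping is released without waiting for the GC. The function name startsWithElfMagic is hypothetical, and the size check up front is an assumption on my part: mapping a zero-length file may fail on some platforms, so the sketch avoids mapping files that are too short to match anyway.

```d
import std.file : getSize;
import std.mmfile : MmFile;

/// Hypothetical helper: true if the file begins with the ELF magic bytes.
bool startsWithElfMagic(string path)
{
    // 0x7F followed by the ASCII letters 'E', 'L', 'F'
    static immutable ubyte[4] elfMagic = [0x7F, 0x45, 0x4C, 0x46];

    // Assumption: skip files too short to match (also avoids mapping
    // an empty file, which may not be supported).
    if (getSize(path) < elfMagic.length)
        return false;

    auto mmf = new MmFile(path);   // read-only mapping of the file
    scope (exit) destroy(mmf);     // run the destructor now, not at GC time

    // Slicing returns void[]; only the pages that are touched get loaded.
    auto head = cast(const(ubyte)[]) mmf[0 .. elfMagic.length];
    return head == elfMagic[];
}
```

The same slicing works for any offset, so a magic number near the end of the file can be checked with mmf[mmf.length - n .. mmf.length] without reading what lies in between.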

In the future, we're going to end up with some range-based streaming I/O stuff, which might be closer to what you need without actually using mmap, but that hasn't been completed yet, and mmap is so incredibly cool that there's a good chance that it would still be better to use MmFile.

Answer 2:

You can combine seek and rawRead of std.stdio.File to do what you want.

You can then do a rawRead for only the first few bytes:

import std.stdio : File;

File file = /* ... */;

ubyte[1024] buff;
ubyte[] magic = file.rawRead(buff[0 .. 4]); // only the first 4 bytes are read
// check magic

Depending on the OS's caching and read-ahead strategy, this can be nearly as fast as MmFile. However, multiple seeks will ruin the read-ahead behavior.
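To show how seek fits in, here is a small sketch that reads the first or the last n bytes of a file without touching the rest. The helper names readHead and readTail are made up for this example, and the handling of files shorter than n (returning fewer bytes) is an assumption about what a caller would want.

```d
import core.stdc.stdio : SEEK_END;
import std.stdio : File;

/// Hypothetical helper: returns up to the first `n` bytes of the file.
ubyte[] readHead(string path, size_t n)
{
    if (n == 0) return null;           // rawRead needs a non-empty buffer
    auto f = File(path, "rb");
    return f.rawRead(new ubyte[n]);    // slice is shorter if the file is
}

/// Hypothetical helper: returns up to the last `n` bytes of the file.
ubyte[] readTail(string path, size_t n)
{
    auto f = File(path, "rb");
    immutable size = f.size;
    if (size < n) n = cast(size_t) size;   // clamp to the file length
    if (n == 0) return null;
    f.seek(-cast(long) n, SEEK_END);       // jump to n bytes before the end
    return f.rawRead(new ubyte[n]);
}
```

Each call opens the file, performs at most one seek and one read, and closes it again when the File goes out of scope.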

Source: https://stackoverflow.com/questions/18888612/lazily-reading-a-file-in-d

Tags

file-io

lazy-loading

mmap