Question
I'm writing a directory-tree scanning function in D that combines the roles of tools such as grep and file: it should grep for things in a file only if that file does not match a set of magic bytes indicating file types such as ELF, images, etc.
What is the best approach to making such exclusion logic run as fast as possible with regard to minimizing file I/O? I typically don't want to read in the whole file if I only need to read some magic bytes at the beginning. However, to make the code more future-proof (some magics may lie at the end or somewhere other than the beginning), it would be nice if I could use an mmap-like interface that lazily fetches data from the disk only when it is read. The array interface also simplifies my algorithms.
Is D's std.mmfile the best option in this case?
Update: According to this post, I guess mmap is advised: http://forum.dlang.org/thread/dlrwzrydzjusjlowavuc@forum.dlang.org
If I only need read access as an array (opIndex), are there any cons to using std.mmfile over std.stdio.File or std.file?
Answer 1:
If you want to lazily read a file with Phobos, you pretty much have three options:
1. Use std.stdio.File's byLine and read a line at a time.
2. Use std.stdio.File's byChunk and read a particular number of bytes at a time.
3. Use std.mmfile.MmFile and operate on the file as an array, taking advantage of mmap underneath the hood to avoid reading in the whole file.
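As a rough illustration of option 2, here is a minimal sketch that pulls only the first chunk of a file and compares it against the ELF magic bytes. The helper name isElf and the 16-byte chunk size are assumptions made for this example, not anything Phobos prescribes.

```d
import std.stdio : File;

/// Returns true if the file at `path` starts with the ELF magic bytes.
/// `isElf` is a hypothetical helper name, not part of Phobos.
bool isElf(string path)
{
    // 0x7F followed by the ASCII letters 'E', 'L', 'F'
    static immutable ubyte[4] elfMagic = [0x7F, 0x45, 0x4C, 0x46];

    auto file = File(path, "rb");
    // byChunk is lazy: each iteration asks the stream for up to 16 bytes.
    foreach (ubyte[] chunk; file.byChunk(16))
    {
        // Only the first chunk matters here, so return right away.
        return chunk.length >= elfMagic.length
            && chunk[0 .. elfMagic.length] == elfMagic[];
    }
    return false; // empty file: the loop body never ran
}
```

Because File sits on top of C stdio, the operating system may still be asked for a full buffer's worth of data behind the scenes, but the rest of the file is left untouched.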
I fully expect that #3 is going to be the fastest (profiling could prove differently, but I'd be very surprised given how fantastic mmap is). It's also probably the easiest to use, because you get an array to operate on. The only problem with MmFile that I'm aware of is that it's a class when it should arguably be a ref-counted struct so that it would clean itself up when you were done. Right now, if you don't want to wait for the GC to clean it up, you'd have to manually call unmap on it or use destroy to destroy it without freeing its memory (though destroy should be used with caution). There may be some sort of downside to using mmap (which would then naturally mean that there was a downside to using MmFile), but I'm not aware of any.
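A minimal sketch of the MmFile approach, with destroy used for deterministic cleanup as described above, might look like the following. The function name looksBinary, the helper hasPrefix, and the choice of ELF and PNG signatures are assumptions for illustration only.

```d
import std.file : getSize;
import std.mmfile : MmFile;

/// True if `data` begins with `magic`. Hypothetical helper.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] magic)
{
    return data.length >= magic.length && data[0 .. magic.length] == magic;
}

/// Checks a couple of magic signatures through a read-only mapping.
/// `looksBinary` is an illustrative name, not a Phobos function.
bool looksBinary(string path)
{
    static immutable ubyte[] elf = [0x7F, 0x45, 0x4C, 0x46]; // "\x7FELF"
    static immutable ubyte[] png = [0x89, 0x50, 0x4E, 0x47]; // "\x89PNG"

    // Mapping a zero-length file can fail, so handle that case up front.
    if (getSize(path) == 0)
        return false;

    auto mmf = new MmFile(path);  // read-only mapping of the file
    scope (exit) destroy(mmf);    // release the mapping now, not at GC time

    // opSlice yields void[]; the slice is valid only while mmf is alive.
    auto data = cast(const(ubyte)[]) mmf[];

    // Only the pages actually touched need to be fetched from disk.
    return hasPrefix(data, elf) || hasPrefix(data, png);
}
```

The slice must not escape the function here, since it points into memory that is unmapped as soon as the scope exits.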
In the future, we're going to end up with some range-based streaming I/O stuff, which might be closer to what you need without actually using mmap, but that hasn't been completed yet, and mmap is so incredibly cool that there's a good chance that it would still be better to use MmFile.
Answer 2:
You can combine seek and rawRead of std.stdio.File to do what you want. You can then do a rawRead for only the first few bytes:
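import std.stdio : File;

File file = File(path, "rb"); // path: placeholder for the file being scanned
ubyte[1024] buff;
// rawRead fills at most the slice it is given, so only the first 4 bytes are read
ubyte[] magic = file.rawRead(buff[0 .. 4]);
// check magic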
Depending on the OS's caching and read-ahead strategy, this can be nearly as fast as MmFile; however, multiple seeks will ruin the read-ahead behavior.
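For magics that sit at the end of a file, the seek part could be sketched roughly as below. The helper name readTail is hypothetical, and the sketch assumes a regular, seekable file.

```d
import core.stdc.stdio : SEEK_END;
import std.algorithm.comparison : min;
import std.stdio : File;

/// Reads up to `buf.length` bytes from the end of the file at `path`
/// and returns the filled part of `buf`. Hypothetical helper.
ubyte[] readTail(string path, ubyte[] buf)
{
    auto file = File(path, "rb");
    immutable fileSize = file.size; // ulong.max if the file is not seekable

    // rawRead rejects an empty buffer, so bail out early in that case too.
    if (buf.length == 0 || fileSize == 0 || fileSize == ulong.max)
        return buf[0 .. 0];

    immutable n = cast(size_t) min(fileSize, buf.length);
    file.seek(-cast(long) n, SEEK_END); // position n bytes before the end
    return file.rawRead(buf[0 .. n]);   // a single read of n bytes
}
```

A caller would pass a small stack buffer, for example ubyte[8] tail; auto got = readTail(path, tail[]);, and compare the returned slice against the expected trailer bytes.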
Source: https://stackoverflow.com/questions/18888612/lazily-reading-a-file-in-d