What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't wa
The whole purpose of a buffered I/O library is to free you from such concerns. If you have to read every 30th byte, the OS is going to wind up reading the whole file, because the OS reads in larger chunks. Here are your options, from highest performance to lowest performance:
1. If you have a large address space (i.e., you're running a 64-bit OS on 64-bit hardware), then using memory-mapped I/O (mmap on POSIX systems) will save you the cost of having the OS copy data from kernel space to user space. This savings could be significant.
2. As shown by the detailed notes below (and thanks to Steve Jessop for the benchmark), if you care about I/O performance you should download Phong Vo's sfio library from the AT&T Advanced Software Technology group. It is safer, better designed, and faster than C's standard I/O library. On programs that use fseek a lot, it is dramatically faster: up to seven times faster on a simple microbenchmark.
3. Just relax and use fseek and fgetc, which are designed and implemented exactly to solve your problem.
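The fseek-and-fgetc option can be sketched roughly as follows. This is only an illustration, not the benchmark code discussed in the notes: the function name sum_strided_fseek is hypothetical, and summing the selected bytes is an assumed stand-in for whatever you actually do with them.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: sums every `stride`-th byte of `path`, starting at
   offset 0. Reads one byte with fgetc, then skips ahead with fseek.
   Returns 0 on success and -1 on failure.
   Caveat: where long is 32 bits, files over 2 GB need fseeko (POSIX)
   with a 64-bit off_t instead of fseek. */
int sum_strided_fseek(const char *path, long stride, unsigned long *sum_out)
{
    assert(stride > 0);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    unsigned long sum = 0;
    int failed = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        sum += (unsigned long)c;
        /* Skip the stride - 1 bytes between wanted bytes. Seeking past the
           end of the file is allowed; the next fgetc then returns EOF. */
        if (fseek(f, stride - 1, SEEK_CUR) != 0) {
            failed = 1;
            break;
        }
    }
    if (ferror(f))
        failed = 1;
    fclose(f);

    if (failed)
        return -1;
    *sum_out = sum;
    return 0;
}
```

Calling sum_strided_fseek("data.bin", 30, &sum) would visit offsets 0, 30, 60, and so on; the file name here is just a placeholder.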
If you take this problem seriously, you should measure all three alternatives. Steve Jessop and I showed that using fseek is slower, and if you are using the GNU C library, fseek is a lot slower. You should measure mmap; it may be the fastest of all.
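For the mmap alternative, a minimal POSIX sketch might look like the following. The function name sum_strided_mmap is hypothetical, summing the bytes is an assumed placeholder for real work, and mapping the whole file in one call assumes the address space is large enough (the 64-bit case mentioned above).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sums every `stride`-th byte of `path`, starting at
   offset 0, by walking a read-only memory mapping of the whole file.
   Returns 0 on success and -1 on failure. */
int sum_strided_mmap(const char *path, size_t stride, unsigned long *sum_out)
{
    assert(stride > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    unsigned long sum = 0;
    if (st.st_size > 0) {  /* mmap rejects a zero-length mapping */
        size_t len = (size_t)st.st_size;
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        /* The kernel pages data in on demand; no explicit read calls. */
        for (size_t i = 0; i < len; i += stride)
            sum += p[i];
        munmap(p, len);
    }

    close(fd);
    *sum_out = sum;
    return 0;
}
```

Whether this beats the stdio versions depends on the OS and file system, so it is worth timing on your own data rather than assuming.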
Addendum: You want to look into your filesystem and make sure it can pull 2–3 GB off the disk quickly. XFS may beat ext2, for example. Of course, if you're stuck with NTFS or HFS+, it's just going to be slow.
I repeated Steve Jessop's measurements on Linux. The GNU C library makes a system call at every fseek. Unless POSIX requires this for some reason, it's insane. I could chew up a bunch of ones and zeroes and puke a better buffered I/O library than that. Anyway, costs go up by about a factor of 20, much of which is spent in the kernel. If you use fgetc instead of fread to read single bytes, you can save about 20% on small benchmarks.
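One way to avoid paying for fseek on every byte is to read the file in large blocks and step through each block in memory. The sketch below is an assumption about how that could be done, not the code that was benchmarked: the name sum_strided_blocks is hypothetical, and BUFSZ is set to 30k only because that value appears in these notes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

enum { BUFSZ = 30 * 1024 };  /* assumed block size */

/* Hypothetical helper: sums every `stride`-th byte of `path`, starting at
   offset 0, using large sequential fread calls and no fseek.
   Returns 0 on success and -1 on failure. */
int sum_strided_blocks(const char *path, size_t stride, unsigned long *sum_out)
{
    assert(stride > 0);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    unsigned char *buf = malloc(BUFSZ);
    if (buf == NULL) {
        fclose(f);
        return -1;
    }

    unsigned long sum = 0;
    size_t next = 0;  /* index in the current block of the next wanted byte */
    size_t n;
    while ((n = fread(buf, 1, BUFSZ, f)) > 0) {
        while (next < n) {
            sum += buf[next];
            next += stride;
        }
        next -= n;  /* carry the leftover skip into the following block */
    }

    int failed = ferror(f);
    free(buf);
    fclose(f);

    if (failed)
        return -1;
    *sum_out = sum;
    return 0;
}
```

The carry in next keeps the stride aligned when it does not divide the block size evenly, so any positive stride works with any BUFSZ.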
I did the experiment again, this time using Phong Vo's sfio library. Reading 200MB takes
- [time missing] fseek (BUFSZ is 30k)
- [time missing] fseek
Repeated measurements show that without fseek, using sfio still shaves about 10% off the run time, but the run times are very noisy (almost all time is spent in the OS).
On this machine (laptop) I don't have enough free disk space to run with a file that won't fit in the disk cache, but I'm willing to draw these conclusions:
- Using a sensible I/O library, fseek is more expensive, but not more expensive enough to make a big difference (4 seconds if all you do is the I/O).
- The GNU project does not provide a sensible I/O library. As is too often the case, the GNU software sucks.
Conclusion: if you want fast I/O, your first move should be to replace the GNU I/O library with the AT&T sfio library. Other effects are likely to be small by comparison.