Fastest way to read every 30th byte of large binary file?

前端未结

关注

 7  1752

What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I\'ve read there are performance problems with fseek because of I/O buffers, but I don\'t wa

相关标签:

7条回答

渐次进展

2020-12-24 14:51
The whole purpose of a buffered I/O library is to free you from such concerns. If you have to read every 30th byte, the OS is going to wind up reading the whole file, because the OS reads in larger chunks. Here are your options, from highest performance to lowest performance:
- If you have a large address space (i.e., you're running a 64-bit OS on 64-bit hardware), then using memory-mapped IO (mmap on POSIX systems) will save you the cost of having the OS copy data from kernel space to user space. This savings could be significant.
- As shown by the detailed notes below (and thanks to Steve Jessop for the benchmark), if you care about I/O performance you should download Phong Vo's sfio library from the AT&T Advanced Software Technology group. It is safer, better designed, and faster than C's standard I/O library. On programs that use fseek a lot, it is dramatically faster: up to seven times faster on a simple microbenchmark.
- Just relax and use fseek and fgetc, which are designed and implemented exactly to solve your problem.
If you take this problem seriously, you should measure all three alternatives. Steve Jessop and I showed that using fseek is slower, and if you are using the GNU C library, fseek is a lot slower. You should measure mmap; it may be the fastest of all.

Addendum: You want to look into your filesystem and making sure it can pull 2–3 GB off the disk quickly. XFS may beat ext2, for example. Of course, if you're stuck with NTFS or HFS+, it's just going to be slow.

Shocking results just in

I repeated Steve Jessop's measurements on Linux. The GNU C library makes a system call at every fseek. Unless POSIX requires this for some reason, it's insane. I could chew up a bunch of ones and zeroes and puke a better buffered I/O library than that. Anyway, costs go up by about a factor of 20, much of which is spent in the kernel. If you use fgetc instead of fread to read single bytes, you can save about 20% on small benchmarks.

Less shocking results with a decent I/O library

I did the experiment again, this time using Phong Vo's sfio library. Reading 200MB takes
- 0.15s without using fseek (BUFSZ is 30k)
- 0.57s using fseek
Repeated measurements show that without fseek, using sfio still shaves about 10% off the run time, but the run times are very noisy (almost all time is spent in the OS).

On this machine (laptop) I don't have enough free disk space to run with a file that won't fit in the disk cache, but I'm willing to draw these conclusions:
- Using a sensible I/O library, fseek is more expensive, but not more expensive enough to make a big difference (4 seconds if all you do is the I/O).
- The GNU project does not provide a sensible I/O library. As is too often the case, the GNU software sucks.
Conclusion: if you want fast I/O, your first move should be to replace the GNU I/O library with the AT&T sfio library. Other effects are likely to be small by comparison.
0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2

Fastest way to read every 30th byte of large binary file?

Shocking results just in

Less shocking results with a decent I/O library