Concurrently writing to file while reading it out using mmap

问题

The situation is this.

A large buffer of data (which shall exceed reasonable RAM consumption) is being generated by the program.
The program concurrently serves a websocket which will allow a web client to specify a small subset of this buffer of data to view.

To support the first goal, the file is written to using standard methods (I use portable C-stdio fopen and fwrite because it's been shown to be faster than various "pure C++" methods. Doesn't matter. Data gets appended to file; stdio will buffer the writes and periodically flush them.)

To support the second goal (on BSD, in particular iOS), the file is opened (open from sys/fcntl.h -- not as portable as stdio.h) and memory-mapped (mmap from sys/mman.h -- ditto). By deciding to use memory mapping I have to give up some portability with this code. It seems like Boost is something I could look at to avoid wheel reinvention.

Anyway, my question is about how exactly I'm supposed to do this, because there will be at least two threads: The main program thread appending to the file periodically, and the network (or a worker) thread which responds to web requests and delivers data read out of the memory regions that are mapped to the file on disk.

Supposing the file starts out 1024 bytes in size, mmap is called initially mapping 1024 bytes. As the main thread writes a further 512 bytes into the file, how can the network thread be notified or know anything about the current actual size of the file (so that it can munmap and mmap again with a larger buffer corresponding to the new size)? Furthermore, if I do this naively, I am wary of a situation where the main thread reports that 512 bytes are written, so the other thread now maps 1536 bytes of the file, but not all of the new 512 bytes actually got written to disk yet (OS is still working on writing it, maybe). What happens now? Could there be some garbage that shows up? Will my program crash?

How can I determine when data has been properly flushed? How can I be notified in a timely fashion after the data has been flushed so that I can memory map it?

In particular, is calling fflush the only way to guarantee that the file is now updated w.r.t. the stream, and then can I guarantee (once fflush returns) that the memory map can access the new size without an access violation? What about fsync?

回答1:

When you are using POSIX API directly in the form of mmap, you should also be using it directly for the writing. POSIX and LibC interfaces just don't play well together.

write is a system call which transfers the data directly to kernel. It would be slow for writing byte-by-byte, but for writing large buffers it is tiny fraction faster because it has less overhead (fwrite ends up calling write under the hood anyway). And it is definitely more efficient that fwrite+fflush, because those may end up being two or more calls to write and if you do direct write, it is just one.

The documentation of mmap is not very clear about it, but it seems you must not request more bytes than the file actually has.

来源：https://stackoverflow.com/questions/22678981/concurrently-writing-to-file-while-reading-it-out-using-mmap

标签

c++

multithreading

file-io

mmap