C++ - How to chunk a file for simultaneous/async processing?


Question


How does one read and split/chunk a file by the number of lines?

I would like to partition a file into separate buffers, while ensuring that a line is not split up between two or more buffers. I plan on passing these buffers into their own pthreads so they can perform some type of simultaneous/asynchronous processing.

I've read the answers to reading and writing in chunks on linux using c, but I don't think they exactly answer the question of making sure that a line is not split across two or more buffers.


Answer 1:


I would choose a chunk size in bytes. Then I would seek to the appropriate location in the file and read some smallish number of bytes at a time until I got a newline.

The first chunk's last character is the newline. The second chunk's first character is the character after the newline.

Always seek to a pagesize() boundary and read pagesize() bytes at a time to search for your newline. This tends to ensure that you pull only the minimum necessary from disk to find your boundaries. You could read smaller chunks, say 128 bytes at a time, but you would then risk making more system calls.
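As an illustration (my own, not from the answer's linked code), rounding a desired offset down to a page boundary could look like this; sysconf(_SC_PAGESIZE) is the portable way to ask for the page size:

#include <sys/types.h>  // off_t
#include <unistd.h>     // ::sysconf

// Round a desired offset down to the nearest page boundary so that
// each read pulls whole pages from disk.
off_t page_aligned(off_t desired)
{
   const off_t pagesize = ::sysconf(_SC_PAGESIZE);
   return desired - (desired % pagesize);
}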

I wrote an example program that does this for letter frequency counting. This, of course, is largely pointless to split into threads as it's almost certainly IO bound. And it also doesn't matter where the newlines are because it isn't line oriented. But, it's just an example. Also, it's heavily reliant on you having a reasonably complete C++11 implementation.

  • threaded_file_split.cpp on lisp.paste.org

The key function is this:

#include <algorithm>     // ::std::find
#include <cerrno>        // errno
#include <cstddef>       // ::std::size_t
#include <system_error>  // ::std::system_error
#include <unistd.h>      // ::pread, ::ssize_t

// Find the offset of the start of the next line at or after a
// particular desired offset.
off_t next_linestart(int fd, off_t start)
{
   using ::std::size_t;
   using ::ssize_t;
   using ::pread;

   const size_t bufsize = 4096;
   char buf[bufsize];

   for (bool found = false; !found;) {
      // Read a block at the current offset; pread leaves the shared
      // file pointer untouched.
      const ssize_t result = pread(fd, buf, bufsize, start);
      if (result < 0) {
         throw ::std::system_error(errno, ::std::system_category(),
                                   "Read failure trying to find newline.");
      } else if (result == 0) {
         // End of file counts as a line boundary.
         found = true;
      } else {
         const char * const nl_loc = ::std::find(buf, buf + result, '\n');
         if (nl_loc != (buf + result)) {
            // Found a newline; the next line starts one past it.
            start += ((nl_loc - buf) + 1);
            found = true;
         } else {
            // No newline in this block; keep scanning forward.
            start += result;
         }
      }
   }
   return start;
}

Also notice that I use pread. This is absolutely essential when you have multiple threads reading from different parts of the file.

The file descriptor is a shared resource between your threads. When one thread reads from the file using ordinary functions it alters a detail about this shared resource, the file pointer. The file pointer is the position in the file at which the next read will occur.

Simply seeking with lseek before each read does not help, because it introduces a race condition between the lseek and the read.

The pread function allows you to read a bunch of bytes from a specific location within the file without altering the file pointer at all. Apart from that, it's like combining an lseek and a read in the same call.
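To tie it together, here is a minimal sketch (my own, not from the linked paste) of how next_linestart might be used to carve the file into per-thread chunks; process_chunk is a hypothetical worker that preads and processes the byte range [begin, end):

#include <sys/stat.h>  // ::fstat
#include <cerrno>
#include <system_error>
#include <thread>
#include <vector>

// Hypothetical worker: preads and processes bytes [begin, end) of fd.
void process_chunk(int fd, off_t begin, off_t end);

void process_file(int fd, unsigned nthreads)
{
   struct stat st;
   if (::fstat(fd, &st) < 0) {
      throw ::std::system_error(errno, ::std::system_category(),
                                "fstat failure.");
   }
   const off_t filesize = st.st_size;

   ::std::vector<::std::thread> workers;
   off_t begin = 0;
   for (unsigned i = 1; i <= nthreads && begin < filesize; ++i) {
      // Tentative boundary, snapped forward to the next line start
      // so no line straddles two chunks.
      const off_t end = (i == nthreads)
         ? filesize
         : next_linestart(fd, (filesize * i) / nthreads);
      if (end <= begin) continue;  // a long line swallowed this chunk
      workers.emplace_back(process_chunk, fd, begin, end);
      begin = end;
   }
   for (auto &w : workers) {
      w.join();
   }
}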




Answer 2:


How is the file encoded? If each byte represents one character, I would do the following (a rough sketch follows the list):

  1. Memory map the file using mmap().
  2. Tell the jobs their approximate start and end by computing it based on an appropriate chunk size.
  3. Have each job find its actual start and end by finding the next '\n'.
  4. Process the respective chunks concurrently.
  5. Note that the first chunk needs special treatment because its start isn't approximate but exact.
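A minimal sketch of those steps, assuming a POSIX system; process_range is a hypothetical per-job worker, and error checking (e.g. for MAP_FAILED or an empty file) is mostly omitted:

#include <sys/mman.h>  // ::mmap, ::munmap
#include <sys/stat.h>  // ::fstat
#include <cstring>     // ::memchr
#include <thread>
#include <vector>

// Hypothetical worker: processes the half-open character range [first, last).
void process_range(const char *first, const char *last);

void process_mapped(int fd, unsigned njobs)
{
   struct stat st;
   ::fstat(fd, &st);
   const size_t size = st.st_size;

   // Step 1: memory-map the whole file read-only.
   const char * const data = static_cast<const char *>(
      ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));

   ::std::vector<::std::thread> jobs;
   const char *start = data;  // Step 5: the first chunk's start is exact.
   for (unsigned i = 1; i <= njobs && start < data + size; ++i) {
      // Step 2: approximate end of this job's chunk.
      const char *end = data + (size * i) / njobs;
      // Step 3: snap the end forward to just past the next '\n'.
      if (i < njobs) {
         const char * const nl = static_cast<const char *>(
            ::memchr(end, '\n', (data + size) - end));
         end = nl ? nl + 1 : data + size;
      } else {
         end = data + size;
      }
      if (end <= start) continue;
      // Step 4: process the chunk concurrently.
      jobs.emplace_back(process_range, start, end);
      start = end;
   }
   for (auto &j : jobs) {
      j.join();
   }
   ::munmap(const_cast<char *>(data), size);
}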



Answer 3:


Define a class for the buffers. Give each one a large buffer space that is some multiple of the page size, a start/end index, a method that reads into the buffer from a passed-in stream, and a 'lineParse' method that takes another *buffer instance as a parameter.

Make some *buffers and store them on a producer-consumer pool queue. Open the file, get a buffer from the pool, and read into the buffer space from start to end (returning a boolean for error/EOF). Get another *buffer from the pool and pass it into the lineParse() of the earlier one. In there, search backwards from the end of the data, looking for a newline. When found, adjust the end index and memcpy the fragment of the partial last line, if there is one (you might occasionally be lucky:), into the new, passed-in *buffer and set its start index. The first buffer now holds only whole lines and can be queued off to the thread/s that will process the lines. The second buffer holds the fragment copied from the first, and more data can be read from disk into its buffer space at its start index.

The line-processing thread/s can recycle the 'used' *buffers back to the pool.

Keep going until EOF, (or error:).

If you can, add a method to the buffer class that does the processing of the buffer.
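As a rough sketch of such a class (the names and buffer size are my own invention, and the pool queue and threading are left out), the backwards parse and fragment copy could look like:

#include <cstddef>  // ::std::size_t
#include <cstring>  // ::std::memcpy
#include <istream>

class Buffer {
public:
   static constexpr ::std::size_t SPACE = 1 << 20;  // multiple of page size

   // Read from the stream into the space after any carried-over fragment.
   // Returns false on error/EOF.
   bool readFrom(::std::istream &in)
   {
      in.read(space_ + start_, SPACE - start_);
      end_ = start_ + static_cast<::std::size_t>(in.gcount());
      return static_cast<bool>(in);
   }

   // Search backwards from the end of the data for a newline. Move the
   // trailing partial line (if any) into 'next' so that this buffer
   // ends on a whole line.
   void lineParse(Buffer *next)
   {
      ::std::size_t nl = end_;
      while (nl > 0 && space_[nl - 1] != '\n') {
         --nl;
      }
      if (nl == 0) {
         return;  // no newline at all: a line longer than the buffer
      }
      const ::std::size_t fragment = end_ - nl;
      ::std::memcpy(next->space_, space_ + nl, fragment);
      next->start_ = fragment;  // the next read lands after the fragment
      end_ = nl;                // this buffer now holds whole lines only
   }

private:
   char space_[SPACE];
   ::std::size_t start_ = 0;  // where the next read begins
   ::std::size_t end_ = 0;    // one past the last valid byte
};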

Using large buffer classes and parsing back from the end will be more efficient than continually reading small bits and looking for newlines from the start. Inter-thread comms is slow, and the larger the buffers you can pass, the better.

Using a pool of buffers eliminates continual new/delete and provides flow-control - if the disk read thread is faster than the processing, the pool will empty and the disk read thread will block on it until some used buffers are recycled. This prevents memory runaway.

Note that if you use more than one processing thread, the buffers may get processed 'out-of-order' - this may, or may not, matter.

You only gain in this scenario if the advantage of processing lines in parallel with disk-read latency outweighs the overhead of inter-thread comms - communicating small buffers between threads is very likely to be counter-productive.

The biggest speedup would be experienced with networked disks that are fast overall, but have large latencies.



Source: https://stackoverflow.com/questions/13484184/c-how-to-chunk-a-file-for-simultaneous-async-processing
