multithread read from disk?

青春惊慌失措 2020-12-17 10:55

Suppose I need to read many distinct, independent chunks of data from the same file saved on disk.

Is it possible to multi-thread these reads?

Related: Do al

5 Answers
  • 2020-12-17 11:17

    As mentioned in the other answers, a parallel read may be slower depending on how the file is physically stored on disk: if the head has to move a significant distance between reads, the seeks can cause an actual slowdown. That said, there are storage systems that handle multiple simultaneous reads and writes efficiently. The simplest one I can think of is an SSD. I have also worked with high-end storage systems from IBM that could perform simultaneous reads and writes with no slowdown. So let's assume you have a file system and physical storage that will not slow down on parallel reads.

    In that case parallel reads make sense. In general there are two ways to achieve them:

    1. If you want to use the standard C/C++ library for the IO, the only option is to keep one open file handle (descriptor) per thread. This is because the file pointer (the position the next read or write starts from) is kept per handle. If several threads read from the same handle simultaneously, you have no way of knowing which part of the file each one is actually reading.
    2. Use a platform-specific API for asynchronous IO. On Windows, use the WinAPI file functions with what is called OVERLAPPED IO. On Unix/Linux there is POSIX AIO, although I understand its use is discouraged; I have not seen a satisfactory explanation of why.

    I have implemented the handle-per-thread approach on both Linux and Windows, and the OVERLAPPED approach on Windows. Both work great.
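A minimal sketch of the first approach, written in Python rather than C/C++ for brevity. The names `read_chunk` and `read_chunks_parallel` are hypothetical, and the sketch opens a private handle per task (a slightly stronger form of "one handle per thread"), so no seek position is ever shared between threads:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, length):
    """Read up to `length` bytes at `offset` through a handle private to this call."""
    # A fresh handle means a fresh file position: no other thread can move it.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_chunks_parallel(path, chunks, max_workers=4):
    """Read (offset, length) pairs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_chunk, path, off, n) for off, n in chunks]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Demo on a throwaway file (hypothetical data, just to exercise the code).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 16)
    try:
        parts = read_chunks_parallel(tmp.name, [(0, 4), (256, 4), (1000, 8)])
        print([len(p) for p in parts])
    finally:
        os.unlink(tmp.name)
```

Whether this is actually faster than a single-threaded loop depends on the storage, as discussed above; the sketch only shows how to keep the reads independent.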

  • 2020-12-17 11:21

    You won't be able to speed up the raw process of reading from disk. If you're computing at the same time as you're reading, parallelizing will help. But the pure IO will be limited by the bandwidth of the link between processor and hard drive and, more notably, by the hard drive itself (my hard drive does 30 MB/s; I've heard of RAID setups serving 120 MB/s over the network, but don't rely on that).
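The case where parallelizing does help, computing while the IO is in flight, could be sketched roughly like this in Python. The function name `checksum_while_reading` and the choice of CRC32 as the "computation" are assumptions for illustration only: one thread reads blocks into a bounded queue while the calling thread processes them.

```python
import queue
import threading
import zlib


def checksum_while_reading(path, block_size=1 << 20):
    """Return the CRC32 of a file, reading and checksumming in separate threads."""
    blocks = queue.Queue(maxsize=4)  # bounded so the reader cannot run far ahead

    def reader():
        try:
            with open(path, "rb") as f:
                while True:
                    block = f.read(block_size)
                    if not block:
                        break
                    blocks.put(block)
        finally:
            blocks.put(b"")  # sentinel: end of file (or the reader failed)

    t = threading.Thread(target=reader)
    t.start()
    crc = 0
    while True:
        block = blocks.get()
        if not block:
            break
        crc = zlib.crc32(block, crc)  # runs while the reader fetches the next block
    t.join()
    return crc
```

The disk is still the limit for the read itself; the gain, if any, comes from not leaving the CPU idle while waiting for it.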

  • 2020-12-17 11:35

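    If you're doing this on Windows you might want to look into the ReadFileScatter function. It lets you read from a file into multiple buffers in a single asynchronous call. Note that the scattering is across memory buffers: the data comes from one contiguous range of the file, so distinct file regions still need one call each. This still gives the OS more control over the file IO bottleneck and hopefully lets it optimize the reads.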

    The matching write call on Windows would be WriteFileGather.

    For UNIX you're looking at readv and writev to do the same thing.
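To show what `readv` does, here is a small Unix-only sketch using Python's `os.readv`, which wraps the system call. The helper name `read_scattered` is hypothetical. As far as I understand the call, it fills several memory buffers, in order, from one contiguous range of the file starting at the current file position:

```python
import os


def read_scattered(path, offset, sizes):
    """Read one contiguous range starting at `offset` into len(sizes) buffers."""
    buffers = [bytearray(n) for n in sizes]
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        got = os.readv(fd, buffers)  # one system call fills the buffers in order
    finally:
        os.close(fd)
    return got, buffers
```

Calling `read_scattered(path, 10, [3, 5])` on a large enough regular file returns 8 as the byte count, with bytes 10 to 12 in the first buffer and bytes 13 to 17 in the second.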

  • 2020-12-17 11:35

    Multiple reads from a disk should be thread-safe by design of the operating system. If you use the standard system functions, there is no need to lock manually. Do open the files read-only, though (otherwise you may get file access errors).

    By the way, in practice you are not necessarily reading from the disk: the operating system decides where to serve the data from. It typically prefetches and serves reads from memory.
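One way to read from several threads without locking, sketched in Python for Unix systems. This uses `pread` (via `os.pread`), which is my choice for the illustration and is not named in the answer; it takes an explicit offset and does not touch the shared file position, so one read-only descriptor can be shared. The function name `pread_chunks` is hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pread_chunks(path, chunks, max_workers=4):
    """Read (offset, length) pairs from one shared read-only descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread(fd, n, offset) reads at `offset` and leaves the
            # descriptor's file position alone, so no lock is needed.
            return list(pool.map(lambda c: os.pread(fd, c[1], c[0]), chunks))
    finally:
        os.close(fd)
```

Results come back in the same order as `chunks`, regardless of which thread finishes first.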

  • 2020-12-17 11:38

    Yes, it is possible. However:

    Do all threads on the same processor use the same IO device to read from disk?

    Yes. The read head on the disk. As an example, try copying two files in parallel as opposed to in series. It will take significantly longer in parallel, because the OS uses scheduling algorithms to make sure the IO rate is "fair," or equal between the two threads/processes. Because of this, the read head will jump back and forth between different parts of the disk, slowing the process down A LOT. The time to actually read the data is pretty small compared to the time to seek to it, and when you're reading two different parts of the disk at once, you spend most of the time seeking.

    Note that all of this assumes you're using a hard disk. If you're using an SSD, it will not be slower in parallel, but it will not be faster either. Edit: according to the comments, parallel reads are actually faster on an SSD. With RAID the situation becomes more complicated, and (obviously) depends on what kind of RAID you're using.

    This is what it looks like (I've unwrapped the circular disk into a rectangle because ASCII circles are hard, and simplified the data layout to make it easier to read):

    Assume the files are separated by some space on the platter like so:

    |         |
    

    A series read will look like (* indicates reading)

    space ----->
    |        *|  t
    |        *|  i
    |        *|  m
    |        *|  e
    |        *|  |
    |       / |  |
    |     /   |  |
    |   /     |  V
    |  /      |
    |*        |
    |*        |
    |*        |
    |*        |
    

    While a parallel read will look like

    |       \ |
    |        *|
    |       / |
    |     /   |
    |   /     |
    |  /      |
    |*        |
    |  \      |
    |    \    |
    |     \   |
    |       \ |
    |        *|
    |       / |
    |     /   |
    |   /     |
    |  /      |
    |*        |
    |  \      |
    |    \    |
    |     \   |
    |       \ |
    |        *|
    

    etc
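The two diagrams can be turned into a toy calculation. This is only a sketch under strong assumptions: each file sits at a single head position, seek cost is proportional to distance, and the positions (0 and 1000) and block count (4) are made-up numbers. All function names are hypothetical.

```python
def head_travel(positions):
    """Total distance the head moves when visiting positions in order."""
    return sum(abs(b - a) for a, b in zip(positions, positions[1:]))


def serial_order(file_a, file_b, blocks):
    # Read every block of A, then every block of B.
    return [file_a] * blocks + [file_b] * blocks


def interleaved_order(file_a, file_b, blocks):
    # Alternate A, B, A, B, ... as a "fair" scheduler would.
    return [pos for _ in range(blocks) for pos in (file_a, file_b)]


if __name__ == "__main__":
    print(head_travel(serial_order(0, 1000, 4)))       # 1000
    print(head_travel(interleaved_order(0, 1000, 4)))  # 7000
```

In this model the serial read crosses the gap once, while the interleaved read crosses it seven times for the same eight blocks, which is the zigzag shown in the second diagram.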
