How to efficiently read thousands of small files with GCD

Submitted on 2021-02-07 05:41:26

Question


I'd like to read some metadata data (e.x.: EXIF data) from potentially thousands of files as efficiently as possible without impacting the user experience. I'm interested if anyone has any thoughts on how best to go about this using something like regular GCD queues, dispatch_io channels or even another implementation.

Option #1: Using regular GCD queues.

This one is pretty straightforward I can just use something like the following:

for (NSURL *URL in URLS) {
  dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0), ^{
    // Read metadata information from file.
    CGImageSourceCopyProperties(...);
  });
}

The problem with this implementation, I think (and have experienced), is that GCD doesn't know that the operation in the block is I/O related, so it submits dozens of these blocks to the global queue for processing, which in turn saturates the I/O. The system eventually recovers, but the I/O takes a hit if I'm reading in thousands, or tens of thousands, of files.

Option #2: Using dispatch_io

This one seems like a good contender, but I actually get worse performance with it than with a regular GCD queue. That could be my implementation.

dispatch_queue_t intakeQueue = dispatch_queue_create("someName", NULL);

for (NSURL *URL in URLS) {    
  const char *path = URL.path.UTF8String;
  dispatch_io_t intakeChannel = dispatch_io_create_with_path(DISPATCH_IO_RANDOM, path, O_RDONLY, 0, intakeQueue, NULL);
  dispatch_io_set_high_water(intakeChannel, 256);
  dispatch_io_set_low_water(intakeChannel, 0);

  dispatch_io_handler_t readHandler = ^void(bool done, dispatch_data_t data, int error) {
    // Read metadata information from file.
    CGImageSourceCopyProperties(...);
    // Error stuff...
  };

  dispatch_io_read(intakeChannel, 0, 256, intakeQueue, readHandler);
}

In this second option, I feel like I'm somewhat abusing dispatch_io_read. I'm not interested in the data it reads at all; I just want dispatch_io to throttle the I/O for me. The 256-byte size is just a random number so that some amount of data is read, even though I never use it.

In this second option, I've had several runs where the system worked "pretty good", but I've also had an instance where my entire machine locked up (even the cursor) and I had to hard-reset. In other instances (equally random), the application has simply quit with a stack trace that looks like dozens of dispatch_io calls trying to clean up. (In all of these instances, I'm attempting to read in excess of 10,000 images.)

(Since I'm not opening any file descriptors myself, and GCD blocks are now ARC-friendly, I don't think I have to do any explicit clean-up after the dispatch_io_read has completed, though maybe I'm wrong?)

Solutions?

Is there another option I could use? I've considered manually throttling the requests with an NSOperationQueue and a low value for the maxConcurrentOperationCount but that just seems wrong as the newer MacPros can clearly handle a ton more I/O compared to an older, non-SSD, MacBook.

Update 1

I thought of doing a slight modification to option #2 based on some of the points @Ken-Thomases touched on below. In this attempt, I'm trying to prevent the dispatch_io block from exiting by setting a high_water mark below the total number of bytes requested. The idea being that the read handler will get called with data remaining to be read.

dispatch_queue_t intakeQueue = dispatch_queue_create("someName", NULL);

for (NSURL *URL in URLS) {    
  const char *path = URL.path.UTF8String;
  dispatch_io_t intakeChannel = dispatch_io_create_with_path(DISPATCH_IO_RANDOM, path, O_RDONLY, 0, intakeQueue, NULL);
  dispatch_io_set_high_water(intakeChannel, 256);
  dispatch_io_set_low_water(intakeChannel, 0);
  __block BOOL didReadProperties = NO;

  dispatch_io_handler_t readHandler = ^void(bool done, dispatch_data_t data, int error) {
    // Read metadata information from file.
    if (didReadProperties == NO) {
      CGImageSourceCopyProperties(...);
      didReadProperties = YES;
    } else {
      // Maybe try and force close the channel here with dispatch_io_close?
    }
  };

  dispatch_io_read(intakeChannel, 0, 512, intakeQueue, readHandler);
}

This does appear to slow down the dispatch_io calls, but it's now causing calls to CGImageSourceCreateWithURL to fail in a different part of the application where they never used to. (CGImageSourceCreateWithURL now randomly returns NULL, which, if I had to guess, suggests that it can't open a file descriptor, since the file is definitely present at the given path.)

Update 2

After experimenting with a half-dozen other ideas, an implementation as simple as using an NSOperationQueue and calling addOperationWithBlock turned out to be just as effective as anything else I could come up with. Manually tuning the maxConcurrentOperationCount had some effect but nowhere near as much as I would have thought.
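A minimal sketch of that approach might look like the following. The concurrency limit of 8 and the way the properties are consumed are illustrative assumptions, not values from my actual code:

```objectivec
// Sketch: throttle metadata reads with NSOperationQueue.
// maxConcurrentOperationCount = 8 is an arbitrary starting point; tune per device.
NSOperationQueue *metadataQueue = [[NSOperationQueue alloc] init];
metadataQueue.maxConcurrentOperationCount = 8;

for (NSURL *URL in URLs) {
  [metadataQueue addOperationWithBlock:^{
    CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)URL, NULL);
    if (source) {
      CFDictionaryRef properties = CGImageSourceCopyProperties(source, NULL);
      // ... store/use properties ...
      if (properties) CFRelease(properties);
      CFRelease(source);
    }
  }];
}
```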

Obviously, the performance difference between an SSD and an external USB 3.0 drive is dramatic. While I can iterate over 100,000 images (and even get away with around 200,000) on an SSD in reasonable time, that many images on a USB drive is hopeless. Simple math: (bytes needed to read * file count / drive speed) shows that I can't really get the user experience I was hoping for. (Instruments seems to show that _CGImageSourceBindToPlugin is reading anywhere from about 40KB to 1MB per file.)


Answer 1:


The reality is that on a modern, multi-tasking, multi-user system that runs across many configurations of hardware, automatically throttling an I/O-bound task is nigh impossible for the system to do.

You're going to have to do the throttling yourself. This could be done with NSOperationQueue, with a semaphore, or with any of a number of other mechanisms.
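For instance, a counting semaphore can cap the number of in-flight reads. This is only a sketch; the limit of 4 is an arbitrary assumption that would need tuning per machine and per drive:

```objectivec
// Sketch: cap concurrent reads with a counting semaphore.
// The limit (4) is illustrative, not a recommendation.
dispatch_semaphore_t throttle = dispatch_semaphore_create(4);
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);

for (NSURL *URL in URLs) {
  // Blocks the submitting loop until a slot frees up.
  dispatch_semaphore_wait(throttle, DISPATCH_TIME_FOREVER);
  dispatch_async(queue, ^{
    CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)URL, NULL);
    if (source) {
      CFDictionaryRef properties = CGImageSourceCopyProperties(source, NULL);
      // ... use properties ...
      if (properties) CFRelease(properties);
      CFRelease(source);
    }
    dispatch_semaphore_signal(throttle);
  });
}
```

Because dispatch_semaphore_wait runs on the submitting thread, at most 4 blocks are ever queued at once, which keeps GCD from flooding the disk with pending reads.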

Normally, I'd suggest you try to separate the I/O from any computation so you can serialize the I/O (which will generally give the most reasonable performance across all systems), but that is pretty much impossible when using high-level APIs. In fact, it isn't clear how the CG* I/O APIs might interact with the dispatch_io_* advisory APIs.

Not a terribly helpful answer. Without knowing more about your very specific case, it is hard to be more specific. I would suggest that caching may be key here; build up a database of metadata for all the various images. Of course, then you have synchronization and validation problems.




Answer 2:


It would be nice if GCD provided a way of load-balancing arbitrary blocks based on which disk device they were going to do I/O against, but it doesn't. Your use of dispatch I/O ends up being not too different from your first approach.

Dispatch I/O does the file read of 256 bytes on your behalf. Once the data has been read, though, it can allow reading of another file to proceed even though your data-handling block hasn't run to completion. So, pretty quickly, a bunch of your data-handling blocks get queued simultaneously, just like with your first solution. To some extent, the I/O implicit in CGImageSourceCopyProperties() competes with the dispatch I/O and so may throttle submission of the data-handling tasks a bit, but probably not enough.

The obvious/naive way to apply dispatch I/O to this problem would be to have it read each whole image file into a data object and then use that to create the image source using CGImageSourceCreateWithData(). The problem with that is that it reads the whole image file when only part of it is actually required to copy the properties.

You could try to improve this by using an incremental image source, created with CGImageSourceCreateIncremental(). You would have dispatch I/O read some significant chunk (perhaps the device block size) of image data from the file, concatenate it onto a mutable data object, and update the image source using CGImageSourceUpdateData(). Then, check the image source's status using CGImageSourceGetStatus(). You'd keep reading data that way until the status indicates it's possible to copy the image source properties. Hopefully, CGImageSourceCopyProperties() can succeed before the image is complete, so you don't have to read all of the image file data – that is, after the status transitions from kCGImageStatusReadingHeader to kCGImageStatusIncomplete. (Of course, kCGImageStatusComplete also indicates it's ready.)
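The loop described above might be sketched like this. Here readNextChunk() is a hypothetical helper standing in for however you pull the next chunk off the dispatch I/O channel; everything else is the real ImageIO API:

```objectivec
// Sketch: feed a file to an incremental image source chunk by chunk
// until its status indicates the header has been parsed.
// readNextChunk() is a hypothetical helper returning the next NSData
// chunk from the file, or nil at end-of-file.
CGImageSourceRef source = CGImageSourceCreateIncremental(NULL);
NSMutableData *accumulated = [NSMutableData data];

NSData *chunk;
while ((chunk = readNextChunk()) != nil) {
  [accumulated appendData:chunk];
  // UpdateData expects all data accumulated so far, not just the new chunk.
  CGImageSourceUpdateData(source, (__bridge CFDataRef)accumulated, false);

  CGImageSourceStatus status = CGImageSourceGetStatus(source);
  if (status == kCGImageStatusIncomplete || status == kCGImageStatusComplete) {
    // Past kCGImageStatusReadingHeader: properties should be available now.
    CFDictionaryRef properties = CGImageSourceCopyProperties(source, NULL);
    // ... use properties, then stop reading this file ...
    if (properties) CFRelease(properties);
    break;
  }
}
CFRelease(source);
```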

It would probably be more efficient to update the incremental image source using CGImageSourceUpdateDataProvider() and a data provider created using CGDataProviderCreateDirect(). Then, you would write the callbacks to use the dispatch data functions. That way, you can accumulate the file data using dispatch_data_create_concat() which doesn't need to copy buffers.

It might be possible to do even better than that, although it gets (perhaps unnecessarily) complicated. You could create a direct data provider using CGDataProviderCreateDirect(). Then create a non-incremental image source from that using CGImageSourceCreateWithDataProvider(). Then call CGImageSourceCopyProperties() on that image source. During creation, or possibly not until you copy the properties, the image source will ask the data provider for data. It will call your callbacks. At this point, you don't have any data to provide, so you have to fail (return end-of-file). But you can use the nature of that call to learn what part of the file CGImageSource needs in order to provide the properties.

You can then use dispatch I/O to read in the requested data. Once you have that data, you then create a new image source from the data provider and try again. This time you supply the data you have. CGImageSource will probably then ask for more data, so you repeat this process until you have successfully supplied all of the data that it needs to copy the properties.

Once again, probably best to round and align any request up to whole device blocks and to prime your data provider with the first block of the file, since that's certainly going to be needed.


A completely different approach would be to figure out the physical device for each file. Then submit the task for copying its image properties to a serial queue dedicated to that device. Each time you identify a new device, create a new serial queue for it. For the common case where all of your files are on the same device, though, this will simply serialize the operations (plus add overhead). So, maybe an operation queue with a small concurrent limit, as you mentioned, except per device. I don't think this needs to scale based on CPU speed or even disk speed, since I suspect that copying the image properties has a very small non-I/O component.
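One way to sketch that routing, assuming NSURLVolumeIdentifierKey is a good enough proxy for the physical device (it identifies the volume, which may not map one-to-one to devices):

```objectivec
// Sketch: route each file's work to a serial queue keyed by the volume
// it lives on, so reads against any one device are serialized.
NSMutableDictionary<id, dispatch_queue_t> *queuesByVolume = [NSMutableDictionary dictionary];

for (NSURL *URL in URLs) {
  id volumeID = nil;
  [URL getResourceValue:&volumeID forKey:NSURLVolumeIdentifierKey error:NULL];
  if (!volumeID) continue;

  dispatch_queue_t queue = queuesByVolume[volumeID];
  if (!queue) {
    // First file seen on this volume: create its dedicated serial queue.
    queue = dispatch_queue_create("per-volume.metadata", DISPATCH_QUEUE_SERIAL);
    queuesByVolume[volumeID] = queue;
  }
  dispatch_async(queue, ^{
    CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)URL, NULL);
    if (source) {
      CFDictionaryRef properties = CGImageSourceCopyProperties(source, NULL);
      // ... use properties ...
      if (properties) CFRelease(properties);
      CFRelease(source);
    }
  });
}
```

As noted above, for the common case where every file is on one volume this degenerates to fully serialized work, so swapping the serial queues for small concurrent NSOperationQueues (one per volume) may be the more practical variant.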



Source: https://stackoverflow.com/questions/23599251/how-to-efficiently-read-thousands-of-small-files-with-gcd
