Huge broadcast variable, optimizing code without parfor?

Submitted by ぐ巨炮叔叔 on 2019-12-11 07:32:21

Question


I have a 40000-by-80000 matrix from which I'm obtaining the number of "clusters" (groups of adjacent elements with the same value) and then calculating the size of each of those clusters. Here is the chunk of code:

FRAGMENTSIZESCLASS = struct([]);  %We store the data in a structure
for class=1:NumberOfClasses
  %-First we create a binary image for each class-%
  BWclass = foto==class;
  %-Second we calculate the number of connected components (fragments)-%
  L = bwlabeln(BWclass);          %returns a label matrix, L, containing labels for the connected components in BWclass
  clear BWclass
  NumberFragments=max(max(L));
  %-Third we calculate the size of each fragment-%
  FragmentSize=zeros(NumberFragments,1);
  for f=1:NumberFragments      % potential improvement: using parfor while sharing the memory between workers
    FragmentSize(f,1) = sum(L(:) == f);
  end
  FRAGMENTSIZESCLASS{class}=FragmentSize;
  clear L
end

The problem is that the matrix L is so large that, if I use a parfor loop, it becomes a broadcast variable: its memory footprint is multiplied across the workers and I run out of memory.

Any ideas on how to sort this out? I've seen this file: https://ch.mathworks.com/matlabcentral/fileexchange/28572-sharedmatrix but it is not a straightforward solution; even though I have 24 cores, it will still take a lot of time.

Cheers!


Here is a plot showing the time taken as a function of image size, comparing the code I posted in the question with bwconncomp as suggested by @bla:


Answer 1:


Instead of bwlabeln, use the built-in function bwconncomp. For example:

...
s = bwconncomp(BWclass);
FragmentSize = cellfun(@numel, s.PixelIdxList);  % size of each connected component
...



Answer 2:


Note that the reason you are running out of memory is probably that you used parfor to replace one of the two loops in your code. In either case, each worker thread creates an array of the same size as foto during processing. Moreover, in the inner loop, sum(L(:) == f) creates a logical array the size of L and then sums its values (I don't think the JIT is clever enough to optimize that intermediate array away).
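As an aside, that intermediate array can be avoided entirely: accumarray can count the pixels of every label in a single pass over L. A minimal sketch, assuming (as bwlabeln guarantees) that the labels in L run from 1 to NumberFragments:

```matlab
% Count pixels per label in one pass, replacing the inner loop
% FragmentSize(f,1) = sum(L(:) == f) and its temporary logical array.
lbl = L(L > 0);                                      % nonzero labels only
FragmentSize = accumarray(lbl(:), 1, [NumberFragments, 1]);
```

This visits the image once in total instead of once per fragment, though the suggestion below avoids the per-class loop altogether.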

In short, parallelizing an operation over such a large image the way you did is not viable. The right way to parallelize it is to cut the image into tiles and process each tile on a different thread. If fragments are small (an assumption I dare to make given the name), it should be possible to process tiles using only a small overlap (tiles need to overlap so that each fragment is completely inside at least one tile). Removing duplicates in this scenario is a bit more elaborate, so the parallelization is not trivial. However, my hope is that the suggestion below makes it unnecessary to parallelize the code at all.

From the code, it is clear that fragments of the same class do not touch. But it is not clear that fragments from different classes do not touch (i.e. the code would produce the same output if they did). Under the assumption that they do not, it is possible to avoid both loops.

The idea is to label all fragments, irrespective of class, in one go. So instead of calling bwlabeln once per class, you call it only once. I don't know how many classes there are, but this is potentially a large reduction in computation time.

Next, use regionprops to determine, for each fragment, its size and class. This operation can, in principle, also be performed by iterating over the image only once. Note that your code, FragmentSize(f,1) = sum(L(:) == f), iterates over the image once per fragment. Given the size of the image, there could be millions of fragments. This could provide a reduction of time of 6 orders of magnitude.

From this point on, we only deal with the output of regionprops, which could contain (order of magnitude) a million elements, a trivial amount (3 orders of magnitude fewer than the number of pixels).

This could be the code:

L = bwlabeln(foto>0);
cls  = regionprops(L,foto,'MinIntensity','Area');
clear L
sz = [cls.Area];
cls = [cls.MinIntensity];
NumberOfClasses = max(cls);
FRAGMENTSIZESCLASS = cell(NumberOfClasses,1);
for ii=1:NumberOfClasses
   FRAGMENTSIZESCLASS{ii} = sz(cls==ii);
end

This last loop might not be necessary; I didn't find a quick alternative. I can't imagine it's expensive, but if it is, it's trivial to parallelize, or to improve by sorting cls and using diff to find the indices where a new class starts.
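As an alternative to the sort/diff idea, accumarray can group the sizes by class in a single call. A sketch, assuming cls contains integer class labels starting at 1:

```matlab
% Collect the sizes belonging to each class into a cell array without an
% explicit loop. The {} fill value handles classes with no fragments;
% the order of sizes within each cell is not guaranteed by accumarray.
FRAGMENTSIZESCLASS = accumarray(cls(:), sz(:), [NumberOfClasses, 1], @(x){x}, {});
```

Whether this beats the simple loop is worth measuring; for a handful of classes the loop is likely fine.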

It is possible to rewrite the above code using @bla's suggestion of bwconncomp. That function returns a struct containing a cell array with the indices of the pixels belonging to each label. It is then not necessary to use regionprops: one can directly find the size (as @bla showed) and use the first index of each label to find the class (by indexing into foto):

cc = bwconncomp(foto>0);
sz = cellfun(@numel,cc.PixelIdxList);
cls = cellfun(@(indx)foto(indx(1)),cc.PixelIdxList);
NumberOfClasses = max(cls);
FRAGMENTSIZESCLASS2 = cell(NumberOfClasses,1);
for ii=1:NumberOfClasses
   FRAGMENTSIZESCLASS2{ii} = sz(cls==ii);
end

This was 3 to 4 times faster for a small 256x256 test image with 63 fragments. However, given the size of the image you're dealing with, I fear this approach might actually be very inefficient. The only way to know is to try both approaches and time them!
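For the timing comparison, timeit gives more stable measurements than tic/toc, since it calls the function several times and takes a median. A minimal harness over the two variants (names as in the snippets above):

```matlab
% Hypothetical benchmark: time each approach on the same image foto.
% getfield is used because indexing into a function's return value
% inside an anonymous function is not supported in older MATLAB.
f1 = @() regionprops(bwlabeln(foto>0), foto, 'MinIntensity', 'Area');
f2 = @() cellfun(@numel, getfield(bwconncomp(foto>0), 'PixelIdxList'));
t1 = timeit(f1);
t2 = timeit(f2);
fprintf('regionprops: %.3f s, bwconncomp: %.3f s\n', t1, t2);
```

Run this on a crop of the real 40000-by-80000 image rather than a toy example, since the relative cost of the two approaches can change with image size and fragment count.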


A few notes about your code:

FRAGMENTSIZESCLASS = struct([]);

You initialize it as an empty struct array, but then use {} to index into it, converting it into a cell array. It's always good to preallocate arrays, as I did above:

FRAGMENTSIZESCLASS = cell(NumberOfClasses,1);

NumberFragments=max(max(L));

This creates a maximum projection of L onto the horizontal axis (80k elements), and then finds the maximum within that. It is more efficient to reshape the matrix as you did elsewhere:

NumberFragments = max(L(:));


Source: https://stackoverflow.com/questions/50181836/huge-broadcast-variable-optimizing-code-without-parfor
