Finding (multiset) difference between two arrays

Posted by 对着背影说爱祢 on 2019-12-01 19:40:54

Question


Given arrays (say row vectors) A and B, how do I find an array C such that merging B and C will give A?

For example, given

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

then

C = multiset_diff(A, B) % Should be [4, 6, 4, 3, 1, 5]

(the order of the result does not matter here).

For the same A, if B = [2, 4, 5], then the result should be [6, 4, 3, 3, 1, 5, 5].

(Since there are two 4s in A and one 4 in B, the result C should contain 2 - 1 = 1 copy of 4. Similarly for the other values.)

PS: Note that setdiff would remove all instances of 2, 3, and 5, whereas here each value should be removed only as many times as it appears in B.
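
To make the contrast concrete, here is a small sketch of what the built-in setdiff returns for the first example. The variable name S is only for illustration.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

S = setdiff(A, B);   % S = [1 4 6]: sorted, no repeats, every 2, 3 and 5 is gone
% The multiset difference wanted here is [4, 6, 4, 3, 1, 5] (in any order).
```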


Performance: I ran some quick-and-dirty benchmarks locally; here are the results for future reference:

  • @heigele's nested loop method performs best for small A (up to about N = 50 elements). Compared to the next best method, it is 3x faster for small A (N = 20) and 1.5x faster for medium-sized A (N = 50). The next best method is:

  • @obchardon's histc-based method. This one performs best once A's size N reaches 100 and above. For example, it is 3x faster than the nested loop method above when N = 200.

@matt's for+find method performs comparably to the histc method for small N, but degrades quickly for larger N (which makes sense, since the entire C == B(x) comparison is run on every iteration).

(The other methods are either several times slower or invalid at the time of writing.)
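
For anyone who wants to repeat such a measurement, the following is a rough sketch of one way to time a single method, not the setup actually used for the numbers above. It assumes MATLAB's timeit (not part of core Octave); the test size, the value range 1..10, and the name histcDiff are arbitrary choices made for illustration.

```matlab
N = 200;                          % length of A (assumed test size)
A = randi(10, 1, N);              % random values in 1..10 (arbitrary choice)
B = A(randperm(N, floor(N/2)));   % half of A's entries, so B is contained in A

% histc-based method from the answers, wrapped in an anonymous function
histcDiff = @(A, B) repelem(unique(A), histc(A, unique(A)) - histc(B, unique(A)));

t = timeit(@() histcDiff(A, B));  % typical run time of one call, in seconds
fprintf('N = %d: %.3g s per call\n', N, t);
```

Timings depend heavily on the machine and release, so only ratios between methods measured in the same session are meaningful.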


Answer 1:


Still another approach using the histc function:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

uA  = unique(A);
hca = histc(A,uA); 
hcb = histc(B,uA);
res = repelem(uA,hca-hcb)

We simply count how many times each unique value of A occurs in each vector, then use repelem to build the result from the difference of the counts.

This solution does not preserve the original order, but that does not seem to be a problem for you.

I use histc for Octave compatibility, but this function is deprecated in MATLAB, so you can also use histcounts there (its bin edges work differently from histc, so the edges need adjusting).
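
As a possible variation (a sketch, not part of the original answer), the counts can also be taken by exact equality instead of binning. histc assigns each value to a range between consecutive edges, so a value of B that does not occur in A could be counted against a neighboring value; comparing against uA directly avoids that. The names countA and countB and the clamping with max are assumptions added here.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

uA = unique(A);                              % sorted distinct values of A
countA = sum(bsxfun(@eq, A(:), uA), 1);      % copies of each uA value in A
countB = sum(bsxfun(@eq, B(:), uA), 1);      % copies of each uA value in B
res = repelem(uA, max(countA - countB, 0));  % res = [1 3 4 4 5 6]
```

The comparison matrix has numel(A) rows and numel(uA) columns, so this trades memory for simplicity on large inputs.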




Answer 2:


Here's a vectorized way. Memory-inefficient, mostly for fun:

tA = sum(triu(bsxfun(@eq, A, A.')), 1);
tB = sum(triu(bsxfun(@eq, B, B.')), 1);
result = setdiff([A; tA].', [B; tB].', 'rows', 'stable');
result = result(:,1).';

The idea is to make each entry unique by tagging it with an occurrence number. The vectors become 2-column matrices, setdiff is applied with the 'rows' option, and then the tags are removed from the result.
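
Applied to the first example from the question, the tagging looks like this. The intermediate names pairsA and pairsB are introduced here only to show the 2-column matrices.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

tA = sum(triu(bsxfun(@eq, A, A.')), 1);   % tA = [1 1 1 2 1 2 1 1 2 3]
tB = sum(triu(bsxfun(@eq, B, B.')), 1);   % tB = [1 1 1 2]

pairsA = [A; tA].';   % each row is (value, occurrence number), e.g. the third 5 is [5 3]
pairsB = [B; tB].';

result = setdiff(pairsA, pairsB, 'rows', 'stable');  % rows of pairsA missing from pairsB
C = result(:, 1).';                                  % C = [4 6 4 3 1 5]
```

Because of the 'stable' option, the surviving entries keep the order they had in A.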




Answer 3:


I'm not a fan of loops, but for random permutations of A this was the best I came up with.

C = A;
for x = 1:numel(B)
    % delete the first remaining occurrence of B(x) from C
    C(find(C == B(x), 1, 'first')) = [];
end

I was curious about the effect of different orderings of A on a solution approach, so I set up a test like this:

Ctruth = [1 3 3 4 5 5 6];   % sorted expected result for B = [2, 4, 5]
for testNumber = 1:100
    Atest = A(randperm(numel(A)));
    C = myFunction(Atest,B);
    C = sort(C);
    assert(all(C==Ctruth));
end
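
A self-contained version of this test might look like the sketch below. It inlines the for+find loop in place of the myFunction placeholder and sets B = [2, 4, 5], the value that the expected vector implies; both are assumptions made so that the script runs on its own.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 4, 5];
Ctruth = [1 3 3 4 5 5 6];                 % sorted expected result

for testNumber = 1:100
    Atest = A(randperm(numel(A)));        % random reordering of A
    C = Atest;
    for x = 1:numel(B)
        C(find(C == B(x), 1, 'first')) = [];   % drop one copy of B(x)
    end
    assert(isequal(sort(C), Ctruth));     % order-independent comparison
end
```

Using isequal instead of all(C==Ctruth) also reports a clean failure when the result has the wrong length.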



Answer 4:


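You can use the second output of ismember to find the indices at which the elements of B occur in A, and diff to handle duplicates:

This answer assumes that B is already sorted. If it is not, sort B before running the code below. Because ismember returns only the lowest matching index in A, the col(indx)+1 step reaches a further copy only when the copies are adjacent in A and a value appears at most twice in B, as in the examples here.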

For the first example:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];
% B = sort(B);   % uncomment if B is not already sorted
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;

>>C

4     6     4     3     1     5

For the second example:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 4, 5, 5];
% B = sort(B);   % uncomment if B is not already sorted
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;
>>C

6     4     3     3     1     5



Answer 5:


Strongly inspired by Matt, but on my machine 40% faster:

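function A = multiDiff(A,B)
% Remove one copy of each element of B from A, keeping A's order.
for j = 1:numel(B)
    for i = 1:numel(A)
        if A(i) == B(j)
            A(i) = [];   % delete the first remaining match only
            break;       % then move on to the next element of B
        end
    end
end
end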


Source: https://stackoverflow.com/questions/51829635/finding-multiset-difference-between-two-arrays
