Finding (multiset) difference between two arrays

Question

Given arrays (say row vectors) A and B, how do I find an array C such that merging B and C will give A?

For example, given

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

then

C = multiset_diff(A, B) % Should be [4, 6, 4, 3, 1, 5]

(the order of the result does not matter here).

For the same A, if B = [2, 4, 5], then the result should be [6, 4, 3, 3, 1, 5, 5].

(Since there are two 4s in A and one 4 in B, the result C should contain 2 - 1 = 1 copy of 4. Similarly for the other values.)

PS: Note that setdiff would remove all instances of 2, 3, and 5, whereas here they need to be removed just however many times they appear in B.
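
To make the contrast with setdiff concrete, here is a minimal sketch using the first example above. The variable name S is only for illustration.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

% setdiff treats its inputs as sets: it returns the sorted unique values
% of A that do not occur in B at all.
S = setdiff(A, B);   % S is [1 4 6]

% The desired multiset difference instead keeps the surplus copies,
% e.g. [4 6 4 3 1 5] (in any order).
```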

Performance: I ran some quick-n-dirty benchmarks locally, here are the results for future reference:

@heigele's nested loop method performs best for small A (up to about N = 50 elements). It is 3x faster for small (N = 20) A and 1.5x faster for medium-sized (N = 50) A than the next best method, which is @obchardon's histc-based method.
@obchardon's histc-based method performs best once the size N of A reaches 100 and above. For example, it is 3x faster than the nested loop method at N = 200.

@matt's for+find method performs comparably to the histc method for small N, but quickly degrades for larger N (which makes sense, since the entire C == B(x) comparison is run on every iteration).

(The other methods are either several times slower or invalid at the time of writing.)

Answer 1:

Still another approach using the histc function:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

uA  = unique(A);
hca = histc(A,uA); 
hcb = histc(B,uA);
res = repelem(uA,hca-hcb)

We simply count how many times each unique value of A occurs in each vector, then use repelem to build the result.

This solution does not preserve the original order, but that does not seem to be a problem for you.

I use histc for Octave compatibility, but this function is deprecated in MATLAB, so you can also use histcounts (note that histcounts returns one fewer bin than histc for the same edges, so the call needs adjusting).
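
As a variant of the same counting idea that avoids histc altogether, the counts can also be built with unique, ismember and accumarray. This is only a sketch, not the answerer's code: the variable names are illustrative, and clamping negative counts to zero (so that surplus copies in B are ignored) is an added assumption.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

% Unique values of A, plus the slot in uA that each element of A maps to.
[uA, ~, ia] = unique(A);
n = numel(uA);

% Count occurrences of each unique value in A.
countA = accumarray(ia(:), 1, [n, 1]).';

% Locate each element of B in uA; values of B absent from A are skipped.
[inA, ib] = ismember(B, uA);
idxB = ib(inA);
countB = accumarray(idxB(:), 1, [n, 1]).';

% Repeat each unique value by its remaining count (assumption: never below zero).
C = repelem(uA, max(countA - countB, 0));   % C is [1 3 4 4 5 6]
```

As with the histc version, the result comes out sorted rather than in the original order of A.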

Answer 2:

Here's a vectorized way. Memory-inefficient, mostly for fun:

tA = sum(triu(bsxfun(@eq, A, A.')), 1);
tB = sum(triu(bsxfun(@eq, B, B.')), 1);
result = setdiff([A; tA].', [B; tB].', 'rows', 'stable');
result = result(:,1).';

The idea is to make each entry unique by tagging it with an occurrence number. The vectors become 2-column matrices, setdiff is applied with the 'rows' option, and then the tags are removed from the result.
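
To see what the tags look like, here is the same code applied to the question's first example, with the intermediate values written out as comments. It is meant as an annotated walk-through rather than a new method.

```matlab
A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

% Occurrence number of each entry: the k-th copy of a value gets tag k.
tA = sum(triu(bsxfun(@eq, A, A.')), 1);   % [1 1 1 2 1 2 1 1 2 3]
tB = sum(triu(bsxfun(@eq, B, B.')), 1);   % [1 1 1 2]

% Each (value, tag) row is now unique, so a row-wise setdiff removes
% exactly the first numel(B) matching copies from A.
result = setdiff([A; tA].', [B; tB].', 'rows', 'stable');
result = result(:,1).';                   % [4 6 4 3 1 5]
```

Because of the 'stable' option, the surviving entries keep the order they had in A.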

Answer 3:

I'm not a fan of loops, but for random permutations of A this was the best I came up with.

C = A;
for x = 1:numel(B)
    C(find(C == B(x), 1, 'first')) = [];
end

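I was curious about the effect of different orderings of A on a solution approach, so I set up a test like this:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 4, 5];                      % second example from the question
Ctruth = [1 3 3 4 5 5 6];           % expected result for that B, sorted
for testNumber = 1:100
    Atest = A(randperm(numel(A)));  % shuffle A
    C = myFunction(Atest,B);        % myFunction is the method under test
    C = sort(C);
    assert(isequal(C, Ctruth));     % isequal also catches a size mismatch
end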

Answer 4:

You can use the second output of ismember to find the indices where elements of B occur in A, and diff to handle duplicates.

This answer assumes that B is already sorted. If that is not the case, B has to be sorted before running the code below.

For the first example:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];
%B = sort(B); Sort if B is not sorted.
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;

>>C

4     6     4     3     1     5

For a second example (note that B here is [2, 4, 5, 5], not the [2, 4, 5] used in the question):

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 4, 5, 5];
%B = sort(B); Sort if B is not sorted.
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;
>>C

6     4     3     3     1     5

Answer 5:

Strongly inspired by Matt, but on my machine 40% faster:

function A = multiDiff(A,B)
for j = 1:numel(B)
    for i = 1:numel(A)
        if A(i) == B(j)
            A(i) = [];
            break;
        end
    end
end
end

Source: https://stackoverflow.com/questions/51829635/finding-multiset-difference-between-two-arrays

Tags

arrays

matlab

multiset