Faster algorithm to find unique element between two arrays?


Question


EDIT: For anyone new to this question, I have posted an answer clarifying what was going on. The accepted answer is the one I feel best answers my question as originally posted, but for further details please see my answer.

NOTE: This problem was originally pseudocode and used lists. I have adapted it to Java and arrays. So while I'd love to see any solutions that use Java-specific tricks (or tricks in any language for that matter!), just remember that the original problem is language-independent.

The Problem

Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element. As an example:

int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};

Design an algorithm that takes as input these two arrays and outputs the single unique integer (in the above case, 7).

The Solution (So Far)

I came up with this:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret ^= a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret ^= b[i];
    }
    return ret;
}

The "official" solution presented in class:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret += a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret -= b[i];
    }
    return Math.abs(ret);
}

So, both are conceptually doing the same thing. And given that a is of length m and b is of length n, both solutions have a running time of O(m + n).

The Question

I later got to talking with my teacher and he hinted that there was an even faster way of doing it. Honestly I don't see how; to find out whether an element is unique it seems you'd have to at least look at every element. And that's at least O(m + n)...right?

So is there a faster way? And if so, what is it?


Answer 1:


This is probably the fastest you can do it in Java, using HotLick's suggestion in the comments. It assumes that b.length == a.length + 1, so b is the larger array with the extra "unique" element.

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    int i;
    for (i = 0; i < a.length; i++) {
        ret = ret ^ a[i] ^ b[i];
    }
    return ret ^ b[i];
}

Even if the assumption cannot be made, you can easily expand it to handle the case where either a or b is the larger array with the unique element (see the sketch below). It's still O(m+n), though; only loop/assignment overhead is reduced.
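
For illustration, a minimal sketch of that generalization (the method name is mine): swap the references so that b is always the longer array, then run the same fused loop.

public static int getUniqueElementEitherWay(int[] a, int[] b) {
    // Make sure b is the longer array, i.e. the one holding the extra element.
    if (a.length > b.length) {
        int[] tmp = a;
        a = b;
        b = tmp;
    }
    int ret = 0;
    int i;
    for (i = 0; i < a.length; i++) {
        ret = ret ^ a[i] ^ b[i];
    }
    // i == a.length here, so b[i] is the leftover element of b.
    return ret ^ b[i];
}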

Edit:

Due to details of language implementation, this is still (surprisingly) the fastest way to do it in CPython.

def getUniqueElement1(A, B):
    ret = 0
    for a in A: ret = ret ^ a
    for b in B: ret = ret ^ b
    return ret

I have tested this with the timeit module and found some interesting results. It turns out that the longhand ret = ret ^ a is indeed faster in Python than the shorthand ret ^= a. Also, iterating over the elements of a list is much, much faster than iterating over the indexes and then doing subscript operations. That is why this code is much faster than my previous attempt to copy the Java version.

I guess the moral of the story is that there is no single correct answer, because the question is bogus anyway. As the OP noted in another answer below, you can't really go faster than O(m+n) on this, and his teacher was just pulling his leg. The problem thus reduces to finding the fastest way to iterate over all elements in the two arrays and accumulate the XOR of all of them. That is entirely dependent on the language implementation, so you have to do some testing and playing around to get the true "fastest" solution in whatever implementation you are using, because the overall algorithm will not change.




Answer 2:


Alright here we go...apologies to anyone expecting a faster solution. It turns out my teacher was having a little fun with me and I completely missed the point of what he was saying.

I should begin by clarifying what I meant by:

he hinted that there was an even faster way of doing it

The gist of our conversation was this: he said that my XOR approach was interesting, and we talked for a while about how I arrived at my solution. He asked me whether I thought my solution was optimal. I said I did (for the reasons I mentioned in my question). Then he asked me, "Are you sure?" with a look on his face I can only describe as "smug". I was hesitant but said yeah. He asked me if I could think of a better way to do it. I was pretty much like, "You mean there's a faster way?" but instead of giving me a straight answer he told me to think about it. I said I would.

So I thought about it, sure that my teacher knew something I didn't. And after not coming up with anything for a day, I came here.

What my teacher actually wanted me to do was defend my solution as being optimal, not try to find a better solution. As he put it: creating a nice algorithm is the easy part; the hard part is proving that it works (and that it's the best). He thought it was quite funny that I spent so much time in Find-A-Better-Way Land instead of working out a simple proof that O(n) is optimal, which would have taken considerably less time (we ended up doing so; see below if you're interested).

So I guess, big lesson learned here. I'll be accepting Shashank Gupta's answer because I think that it does manage to answer the original question, even though the question was flawed.

I'll leave you guys with a neat little Python one-liner I found while typing the proof. It's not any more efficient but I like it:

from functools import reduce   # needed in Python 3; reduce is a builtin in Python 2

def getUniqueElement(a, b):
    return reduce(lambda x, y: x ^ y, a + b)

A Very Informal "Proof"

Let's start with the original two arrays from the question, a and b:

int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};

We'll say here that the shorter array has length n, then the longer array must have length n + 1. The first step to proving linear complexity is to append the arrays together into a third array (we'll call it c):

int[] c = {6, 5, 6, 3, 4, 2, 5, 7, 6, 6, 2, 3, 4};

which has length 2n + 1. Why do this? Well, now we have another problem entirely: finding the element that occurs an odd number of times in c (from here on "odd number of times" and "unique" are taken to mean the same thing). This is actually a pretty popular interview question and is apparently where my teacher got the idea for his problem, so now my question has some practical significance. Hooray!

Let's assume there is an algorithm faster than O(n), such as O(log n). What this means is that it will only access some of the elements of c. For example, an O(log n) algorithm might only have to check about log2(13) ≈ 4 of the elements in our example array to determine the unique element. Our question is: is this possible?

First let's see if we can get away with removing any of the elements (by "removing" I mean not having to access it). How about if we remove 2 elements, so that our algorithm only checks a subarray of c with length 2n - 1? This is still linear complexity, but if we can do that then maybe we can improve upon it even further.

So, let's choose two elements of c completely at random to remove. There are actually several things that could happen here, which I'll summarize into cases:

// Case 1: Remove two identical elements
{6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};

// Case 2: Remove the unique element and one other element
{6, 6, 3, 4, 2, 5, 6, 6, 2, 3, 4};

// Case 3: Remove two different elements, neither of which are unique
{6, 5, 6, 4, 2, 5, 7, 6, 6, 3, 4};

What does our array now look like? In the first case, 7 is still the unique element. In the second case there is a new unique element, 5. And in the third case there are now 3 unique elements...yeah it's a total mess there.

Now our question becomes: can we determine the unique element of c just by looking at this subarray? In the first case we see that 7 is the unique element of the subarray, but we can't be sure it is also the unique element of c; the two removed elements could have just as well been 7 and 1. A similar argument applies for the second case. In case 3, with 3 unique elements we have no way of telling which two are non-unique in c.

It becomes clear that even with 2n - 1 accesses, there is just not enough information to solve the problem. And so the optimal solution is a linear one.

Of course, a real proof would use induction and not use proof-by-example, but I'll leave that to someone else :)




Answer 3:


You can store the count of each value from one array in a collection such as an array or a hash map, which is O(n). Then you can check the values of the other collection against those counts and stop as soon as you know you have a mismatch. On average this could mean you only search half of the second array.
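
A minimal Java sketch of that counting idea (the method and variable names are mine; it assumes b is the array holding the extra element, so swap the arguments otherwise):

import java.util.HashMap;
import java.util.Map;

public static int getUniqueElementByCounting(int[] a, int[] b) {
    // Count occurrences of each value in a: O(m).
    Map<Integer, Integer> counts = new HashMap<>();
    for (int x : a) {
        counts.merge(x, 1, Integer::sum);
    }
    // Walk b, decrementing counts; the first value whose count is already
    // exhausted (or was never seen) is the extra one, so we can stop early.
    for (int x : b) {
        int c = counts.getOrDefault(x, 0);
        if (c == 0) {
            return x;
        }
        counts.put(x, c - 1);
    }
    throw new IllegalArgumentException("arrays do not differ by exactly one element");
}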




Answer 4:


This is a little bit faster:

public static int getUniqueElement(int[] a, int[] b) {
    // Assumes b.length == a.length + 1, i.e. b holds the extra element.
    int ret = 0;
    int i;
    for (i = 0; i < a.length; i++) {
        ret += (a[i] - b[i]);
    }
    // After the loop i == a.length, so b[i] is the leftover element of b.
    return Math.abs(ret - b[i]);
}

It's O(m), but big-O doesn't tell the whole story. The loop part of the "official" solution performs roughly 3 * m + 3 * n operations, while this slightly faster version performs roughly 4 * m.

(Counting the loop "i++" and "i < a.length" as one operation each).

-Al.




Answer 5:


Assuming only one element was added and the arrays were identical to start with, you could hit O(log2 n).

The rationale is that any array can be binary-searched in O(log n). Except that in this case you are not searching for a value in an ordered array; you are searching for the first non-matching element. In such a circumstance, a[n] == b[n] means that you are too low, and a[n] != b[n] means that you might be too high, unless a[n-1] == b[n-1].

The rest is basic binary search: check the middle element, decide which half must contain the answer, and do a sub-search on that half.
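
For what it's worth, here is a sketch of that binary search in Java. Note the extra assumptions it needs on top of the original problem: b must be a with exactly one value inserted, the common elements must keep their relative order, and all values must be distinct; with duplicates, runs of equal values can hide the insertion point and the search can return the wrong element. The method name is mine.

public static int getUniqueElementBinary(int[] a, int[] b) {
    // Assumes: b.length == a.length + 1, b is a with one value inserted,
    // the shared elements keep their relative order, and all values are
    // distinct. Under those assumptions a[i] == b[i] exactly for the
    // indexes before the insertion point, so we can binary search for
    // the first index where the two arrays disagree.
    int lo = 0;
    int hi = a.length;
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (a[mid] == b[mid]) {
            lo = mid + 1;   // still matching: the insertion point is to the right
        } else {
            hi = mid;       // mismatch: the insertion point is at mid or to the left
        }
    }
    return b[lo];           // lo is the index where the extra element was inserted
}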




Answer 6:


Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element...

You may note that I emphasised two points in your original question, and I'm adding the extra assumption that the values are non-zero.

In C#, you can do this:

int[, , , , ,] a=new int[6, 5, 6, 3, 4, 2];
int[, , , , , ,] b=new int[5, 7, 6, 6, 2, 3, 4];
Console.WriteLine(b.Length/a.Length);

See? Whatever the extra element is, you will always know it by simply dividing their lengths.

With these statements, we are not storing the given series of integers as array values, but as array dimensions.

Whichever of the two series is the shorter one, the longer one has exactly one extra integer. So no matter the order of the integers, without the extra one the total sizes of these two multi-dimensional arrays would be identical. The length of such an array is the product of its dimensions, so dividing the longer array's Length by the shorter's yields the extra integer: here 5 * 7 * 6 * 6 * 2 * 3 * 4 = 30240, 6 * 5 * 6 * 3 * 4 * 2 = 4320, and 30240 / 4320 = 7.

This solution works only for this particular case, as quoted from your question. You might want to port it to Java.

This is just a trick, as I think the question itself is a trick. We definitely would not consider it a solution for production.




Answer 7:


Caution: it is wrong to use the O(n + m) notation here. There is but one size parameter, n (in the asymptotic sense, n and n + 1 are equal), so you should just say O(n). [For m > n + 1, the problem is different and more challenging.]

As pointed out by others, this is optimal, as you must read all values.

All you can do is reduce the asymptotic constant, and there is little room for improvement, as the obvious solutions are already very efficient. The single-loop solutions above are probably hard to beat. Unrolling them a bit should improve things slightly by avoiding a branch.

If your goal is sheer performance, then you should turn to non-portable solutions such as vectorization (using the AVX instructions, 8 ints at a time) and parallelization on multicore CPUs or a GPGPU. In good old dirty C on a 64-bit processor, you could map the data to an array of 64-bit ints and XOR the elements two pairs at a time ;)
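
A hedged Java sketch of just the unrolling part (the AVX/GPGPU suggestions are beyond a short example, and the method name is mine): the fused loop processes two indexes per iteration, so there are half as many loop-condition checks.

public static int getUniqueElementUnrolled(int[] a, int[] b) {
    // Assumes b.length == a.length + 1, as in the earlier answers.
    int ret = 0;
    int i = 0;
    // Two elements of each array per iteration.
    for (; i < a.length - 1; i += 2) {
        ret ^= a[i] ^ a[i + 1] ^ b[i] ^ b[i + 1];
    }
    // Tail: at most one leftover index of a.
    for (; i < a.length; i++) {
        ret ^= a[i] ^ b[i];
    }
    // Finally fold in the extra element of b.
    return ret ^ b[a.length];
}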




Answer 8:


I think this is similar to the matching nuts and bolts problem.

You could possibly achieve this in O(n log n). Not sure if that's smaller than O(n + m) in this case.




Answer 9:


There simply is no faster algorithm. The ones presented in the question are O(n). Any arithmetic "trick" to solve this will still require each element of both arrays to be read at least once, so we stay at O(n) (or worse).

Any search strategy in a real subset of O(n) (like O(log n)) would require sorted arrays or some other prebuilt sorted structure (binary tree, hash map). And all sorting algorithms known to mankind are at least O(n log n) on average (Quicksort, heapsort), which is worse than O(n).

Therefore, from a mathematical point of view, there is no faster algorithm. There might be some code optimizations, but they won't matter at large scale, as the runtime grows linearly with the length of the arrays.



Source: https://stackoverflow.com/questions/19203868/faster-algorithm-to-find-unique-element-between-two-arrays
