EDIT: For anyone new to this question, I have posted an answer clarifying what was going on. The accepted answer is the one I feel best answers my original question.
Alright here we go...apologies to anyone expecting a faster solution. It turns out my teacher was having a little fun with me and I completely missed the point of what he was saying.
I should begin by clarifying what I meant by:

"he hinted that there was an even faster way of doing it"
The gist of our conversation was this: he said that my XOR approach was interesting, and we talked for a while about how I arrived at my solution. He asked me whether I thought my solution was optimal. I said I did (for the reasons I mentioned in my question). Then he asked me, "Are you sure?" with a look on his face I can only describe as "smug". I was hesitant but said yeah. He asked me if I could think of a better way to do it. I was pretty much like, "You mean there's a faster way?" but instead of giving me a straight answer he told me to think about it. I said I would.
So I thought about it, sure that my teacher knew something I didn't. And after not coming up with anything for a day, I came here.
What my teacher actually wanted me to do was defend my solution as being optimal, not try to find a better solution. As he put it: creating a nice algorithm is the easy part; the hard part is proving that it works (and that it's the best). He thought it was quite funny that I spent so much time in Find-A-Better-Way Land instead of working out a simple proof that O(n) is optimal, which would have taken considerably less time (we ended up doing so; see below if you're interested).
So I guess, big lesson learned here. I'll be accepting Shashank Gupta's answer because I think that it does manage to answer the original question, even though the question was flawed.
I'll leave you guys with a neat little Python one-liner I found while typing the proof. It's not any more efficient but I like it:
from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2

def getUniqueElement(a, b):
    # XOR every element of both arrays; paired values cancel out,
    # leaving only the extra element
    return reduce(lambda x, y: x ^ y, a + b)
Let's start with the original two arrays from the question, a and b:
int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};
We'll say here that the shorter array has length n; then the longer array must have length n + 1. The first step to proving linear complexity is to append the arrays together into a third array (we'll call it c):
int[] c = {6, 5, 6, 3, 4, 2, 5, 7, 6, 6, 2, 3, 4};
which has length 2n + 1. Why do this? Well, now we have another problem entirely: finding the element that occurs an odd number of times in c (from here on, "odd number of times" and "unique" are taken to mean the same thing). This is actually a pretty popular interview question and is apparently where my teacher got the idea for his problem, so now my question has some practical significance. Hooray!
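For reference, here's the XOR approach from my question restated over the combined view, as a minimal Java sketch (the method name is just illustrative):

public static int getUniqueElement(int[] a, int[] b) {
    // XOR every element of both arrays. Values that appear an even
    // number of times cancel out, leaving the element that occurs an
    // odd number of times.
    int result = 0;
    for (int x : a) result ^= x;
    for (int x : b) result ^= x;
    return result;
}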
Let's assume there is an algorithm faster than O(n), such as O(log n). What this means is that it will only access some of the elements of c. For example, an O(log n) algorithm might only have to check log(13) ≈ 4 of the elements in our example array to determine the unique element. Our question is: is this possible?
First let's see if we can get away with removing any of the elements (by "removing" I mean not having to access them). How about if we remove 2 elements, so that our algorithm only checks a subarray of c with length 2n - 1? This is still linear complexity, but if we can do that then maybe we can improve upon it even further.
So, let's choose two elements of c completely at random to remove. There are actually several things that could happen here, which I'll summarize into cases:
// Case 1: Remove two identical elements
{6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};
// Case 2: Remove the unique element and one other element
{6, 6, 3, 4, 2, 5, 6, 6, 2, 3, 4};
// Case 3: Remove two different elements, neither of which are unique
{6, 5, 6, 4, 2, 5, 7, 6, 6, 3, 4};
What does our array now look like? In the first case, 7 is still the unique element. In the second case there is a new unique element, 5. And in the third case there are now 3 unique elements...yeah it's a total mess there.
Now our question becomes: can we determine the unique element of c just by looking at this subarray? In the first case we see that 7 is the unique element of the subarray, but we can't be sure it is also the unique element of c; the two removed elements could just as well have been 7 and 1. A similar argument applies for the second case. In case 3, with 3 unique elements, we have no way of telling which two are non-unique in c.
It becomes clear that even with 2n - 1 accesses, there is just not enough information to solve the problem. And so the optimal solution is a linear one.
Of course, a real proof would use induction and not use proof-by-example, but I'll leave that to someone else :)
Caution: it is wrong to use the O(n + m) notation here. There is only one size parameter, which is n (in the asymptotic sense, n and n + 1 are equal), so you should just say O(n). [For m > n + 1, the problem is different and more challenging.]
As pointed out by others, this is optimal: you must read every value at least once.

All you can do is reduce the constant factor, and there is little room for improvement, as the obvious solutions are already very efficient. The single loop in (10) is probably hard to beat; unrolling it a bit should help slightly by avoiding some branch overhead.
If your goal is sheer performance, then you should turn to non-portable solutions such as vectorization (using AVX instructions, 8 ints at a time) and parallelization on multicores or GPGPU. In good old dirty C on a 64-bit processor, you could map the data to an array of 64-bit ints and XOR the elements two pairs at a time ;)
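To illustrate the "two at a time" idea without leaving Java, here is a rough sketch (an illustration only, not a tuned implementation; the method name is hypothetical): because XOR operates on each bit independently, two 32-bit ints can be packed into one 64-bit long, XORed as longs, and folded together at the end.

static int xorTwoAtATime(int[] c) {
    long acc = 0L;
    int i = 0;
    // Pack c[i] into the high half and c[i + 1] into the low half;
    // under XOR the two halves never interact.
    for (; i + 1 < c.length; i += 2) {
        acc ^= ((long) c[i] << 32) | (c[i + 1] & 0xFFFFFFFFL);
    }
    int result = (int) (acc >>> 32) ^ (int) acc;  // fold the two halves together
    if (i < c.length) result ^= c[i];             // handle an odd leftover element
    return result;
}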
Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element...
You may note that I emphasised two points in your original question, and I'm adding an extra assumption: the values are positive (they will be used as array dimensions, which can't be zero or negative).
In C#, you can do this:
// Each integer becomes one dimension; Length is the product of all dimensions.
int[, , , , ,] a = new int[6, 5, 6, 3, 4, 2];
int[, , , , , ,] b = new int[5, 7, 6, 6, 2, 3, 4];
Console.WriteLine(b.Length / a.Length);  // prints the extra element, 7
See? Whatever the extra element is, you will always find it by simply dividing their lengths.
With these statements, we are not storing the given series of integers as array values, but as array dimensions.
Whichever series is shorter, the longer one has exactly one extra integer. Since the Length of a multi-dimensional array is the product of all its dimensions, the two sizes are identical except for that one extra factor, no matter what order the integers come in. Dividing the longer array's size by the shorter's therefore yields the extra integer.
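For the example above: a.Length = 6 · 5 · 6 · 3 · 4 · 2 = 4320, b.Length = 5 · 7 · 6 · 6 · 2 · 3 · 4 = 30240, and 30240 / 4320 = 7, the extra element.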
This solution works only for the particular case I quoted from your question. You might want to port it to Java.
This is just a trick, as I thought the question itself was a trick; we definitely would not consider it a solution for production.
I think this is similar to the matching nuts and bolts problem.
You could achieve this in O(n log n) by sorting; note that this is not smaller than O(n + m), which is linear in this case.
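For concreteness, a sort-based version might look like the following Java sketch (the method name is illustrative; it sorts copies of both arrays and scans for the first mismatch):

import java.util.Arrays;

static int getUniqueBySorting(int[] a, int[] b) {
    // Work on copies so the inputs stay untouched.
    int[] shorter = (a.length < b.length ? a : b).clone();
    int[] longer  = (a.length < b.length ? b : a).clone();
    Arrays.sort(shorter);
    Arrays.sort(longer);
    // After sorting, the first index where the arrays disagree is
    // where the extra element sits in the longer array.
    for (int i = 0; i < shorter.length; i++) {
        if (shorter[i] != longer[i]) return longer[i];
    }
    return longer[longer.length - 1];  // extra element is the largest value
}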
There simply is no faster algorithm. The ones presented in the question are in O(n). Any arithmetic "trick" to solve this will require each element of both arrays to be read at least once, so we stay in O(n) (or worse).

Any search strategy in a real subset of O(n) (like O(log n)) would require sorted arrays or some other prebuilt sorted structure (a binary tree, a hash). All comparison-based sorting algorithms are at least O(n log n) on average (Quicksort, Mergesort), which is worse than O(n).

Therefore, from a mathematical point of view, there is no faster algorithm. There might be some code optimizations, but they won't matter at large scale, as runtime grows linearly with the length of the array(s).
This is a little bit faster:
public static int getUniqueElement(int[] a, int[] b) {
    // Assumes b is the longer array, with exactly one extra element.
    int ret = 0;
    int i;
    for (i = 0; i < a.length; i++) {
        ret += (a[i] - b[i]);  // accumulates sum(a) - sum(b) over the first n elements
    }
    // Here i == a.length, so b[i] is the last element of b:
    // b[i] - ret == sum(b) - sum(a) == the extra element
    // (exact even under int overflow, thanks to two's-complement wraparound).
    return b[i] - ret;
}
It's O(m), but the big-O alone doesn't tell the whole story. The loop part of the "official" solution performs about 3 * m + 3 * n operations, while the slightly faster version performs about 4 * m (counting the loop's i++ and i < a.length as one operation each).