I had an interesting job interview experience a while back. The question started really easy:
Q1: We have a bag containing numbers
If you want to solve the general-case problem, and you can store and edit the array, then Caf's solution is by far the most efficient. If you can't store the array (streaming version), then sdcvvc's answer is the only type of solution currently suggested.
The solution I propose is the most efficient answer (so far on this thread) if you can store the array but can't edit it. I got the idea from Svalorzen's solution, which solves for 1 or 2 missing items. This solution takes Θ(k*n) time and between Ω(log(k)) and O(k) space, with a possibility that it might actually be O(min(k,log(n))) space. It also works well with parallelism.
The idea is that if you use the original approach of comparing sums:
int sum = SumOf(1,n) - SumOf(array)
... then you take the average of the missing numbers:
average = sum/missing_count
... which provides a boundary: of the missing numbers (assuming at least two are missing, and using integer division), there's guaranteed to be at least one number less than or equal to average, and at least one number greater than average. This means that we can split into sub-problems that each scan the array [O(n)] and are only concerned with their respective sub-arrays.
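To make the boundary concrete, here is a minimal sketch of a single split step. The names `MissingSummary` and `SummariseMissing` are made up for this illustration, and it assumes the stored values are distinct and drawn from [0, n).

```cpp
#include <vector>

// Hypothetical helper type: summary of the values missing from [0, n).
struct MissingSummary {
    long long count;    // how many values are missing
    long long sum;      // sum of the missing values
    long long average;  // floor(sum / count); only meaningful when count > 0
};

// Assumes `present` holds distinct values drawn from [0, n).
MissingSummary SummariseMissing(const std::vector<int>& present, int n) {
    // Start by assuming everything in [0, n) is missing...
    MissingSummary s{n, static_cast<long long>(n) * (n - 1) / 2, 0};
    // ...then subtract everything we actually see.
    for (int value : present) {
        s.count -= 1;
        s.sum -= value;
    }
    if (s.count > 0) s.average = s.sum / s.count;
    return s;
}
```

For the array {0, 7, 3, 1, 5} with n = 8, this gives a missing sum of 28 - 16 = 12 over 3 missing values, so the average is 4. The missing values 2 and 4 fall on the "less than or equal" side and 6 falls on the "greater" side.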
C-style solution (don't judge me for the global variables, I'm just trying to make the code readable for non-C folks; note that the reference parameter means it needs a C++ compiler):
#include <stdio.h>

// Example problem:
const int array [] = {0, 7, 3, 1, 5};
const int N = 8; // size of original array
const int array_size = 5;

// Sum of 0 + 1 + ... + (n-1)
int SumOneTo (int n)
{
    return n*(n-1)/2; // non-inclusive
}

// Counts the elements of [begin, end) that are missing from the array
// and stores their integer average in `average`.
int MissingItems (const int begin, const int end, int & average)
{
    // We consider only sub-array where elements, e:
    // begin <= e < end

    // Initialise info about missing elements.
    // First assume all are missing:
    int n = end - begin;
    int sum = SumOneTo(end) - SumOneTo(begin);

    // Minus everything that we see (ie not missing):
    for (int i = 0; i < array_size; ++i)
    {
        if ((begin <= array[i]) && (array[i] < end))
        {
            n -= 1;
            sum -= array[i];
        }
    }

    // used by caller (left untouched if nothing is missing,
    // which avoids a division by zero):
    if (n > 0) average = sum/n;
    return n;
}

void Find (const int begin, const int end)
{
    int average = begin;
    const int missing = MissingItems(begin, end, average);
    if (missing == 0) return; // nothing missing in this range
    if (missing == 1)
    {
        printf(" %d", average); // average(n) is same as n
        return;
    }
    Find(begin, average + 1); // at least one missing here
    Find(average + 1, end);   // at least one here also
}

int main ()
{
    printf("Missing items:");
    Find(0, N);
    printf("\n");
}
Ignoring recursion for a moment, each function call clearly takes O(n) time and O(1) space. Note that sum can equal as much as n(n-1)/2, so it requires double the number of bits needed to store n-1. At most this means that we effectively need two extra elements' worth of space, regardless of the size of the array or k, hence it's still O(1) space under the normal conventions.
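As a small illustration of the "double the bits" point, here is a sketch that does the sum in 64 bits on the assumption that n itself fits in 32 bits. The name `SumBelow` is only for this example.

```cpp
#include <cstdint>

// Sum of 0 + 1 + ... + (n - 1), computed in 64 bits so that the product
// n * (n - 1) cannot overflow for any non-negative 32-bit n.
std::int64_t SumBelow(std::int32_t n) {
    const std::int64_t wide = n;  // widen before multiplying
    return wide * (wide - 1) / 2;
}
```

With a plain 32-bit int, the sum already stops fitting somewhere around n = 65,000, so widening the accumulator is probably worth it for anything but toy inputs.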
It's not so obvious how many function calls there are for k missing elements, so I'll provide a visual. Your original sub-array (connected array) is the full array, which has all k missing elements in it. We'll imagine them in increasing order, where -- represents a connection (part of the same sub-array):
m1 -- m2 -- m3 -- m4 -- (...) -- mk-1 -- mk
The effect of the Find function is to disconnect the missing elements into different non-overlapping sub-arrays. It guarantees that there's at least one missing element in each sub-array, which means each split breaks exactly one connection.
What this means is that regardless of how the splits occur, it will always take k-1 Find function calls to do the work of finding the sub-arrays that have only one missing element in them.
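So the time complexity is Θ((k-1 + k) * n) = Θ(k*n): k-1 calls that split, plus k calls that each report a single missing element, and every call scans the array once.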
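If you want to check the call count empirically, here is a sketch of an instrumented variant. `FindCounting` and its parameters are my own naming for this illustration, not part of the code above, and it assumes the stored values are distinct and non-negative.

```cpp
#include <vector>

// Collects the values of [begin, end) that are absent from `present` into
// `missing`, and increments `scans` once per pass over `present`.
void FindCounting(const std::vector<int>& present, int begin, int end,
                  std::vector<int>& missing, int& scans) {
    // Assume everything in [begin, end) is missing, then subtract what we see.
    long long count = end - begin;
    long long sum = (static_cast<long long>(end) * (end - 1)
                     - static_cast<long long>(begin) * (begin - 1)) / 2;
    ++scans;
    for (int value : present) {
        if (begin <= value && value < end) {
            count -= 1;
            sum -= value;
        }
    }
    if (count == 0) return;  // nothing missing in this range
    const int average = static_cast<int>(sum / count);
    if (count == 1) {
        missing.push_back(average);  // the average of one value is the value
        return;
    }
    FindCounting(present, begin, average + 1, missing, scans);
    FindCounting(present, average + 1, end, missing, scans);
}
```

On the example array {0, 7, 3, 1, 5} with the range [0, 8), this should collect 2, 4 and 6 using 5 scans, which matches (k-1) + k for k = 3.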
For the space complexity, if we divide proportionally each time then we get O(log(k)) space complexity, but if we only separate one at a time it gives us O(k).
I actually suspect the space complexity is a smaller O(min(k,log(n))), but it's harder to prove. My intuition: where the average performs badly at separation is when there's an outlier, but because of this the separation then removes that outlier. In normal arrays, elements could all be exponentially different, but in this case they're all bounded by n.
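This is not a proof, but the shape of the recursion is easy to play with. The sketch below simulates only the splitting, working directly on a sorted list of distinct, non-negative missing values rather than on the stored array. `SplitDepth` is a made-up name for this experiment.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Deepest recursion level reached when the slice missing[lo, hi) is split
// repeatedly at the floor-average of its values.
// Assumes `missing` is sorted, distinct and non-negative.
int SplitDepth(const std::vector<long long>& missing,
               std::size_t lo, std::size_t hi) {
    if (hi - lo <= 1) return 1;  // a single missing value ends the recursion
    const auto first = missing.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto last = missing.begin() + static_cast<std::ptrdiff_t>(hi);
    const long long sum = std::accumulate(first, last, 0LL);
    const long long average = sum / static_cast<long long>(hi - lo);
    // Index of the first value strictly greater than the average.
    const std::size_t mid = static_cast<std::size_t>(
        std::upper_bound(first, last, average) - missing.begin());
    return 1 + std::max(SplitDepth(missing, lo, mid),
                        SplitDepth(missing, mid, hi));
}
```

With the missing values {1, 2, ..., 8} the splits are balanced and the depth is 4, roughly log2(k) + 1. With {1, 10, 100, 1000} each split peels off only the largest value and the depth is also 4, which here equals k. To keep peeling one at a time, the values seem to need to grow geometrically, and with everything bounded by n that can only go on for about log(n) levels.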