Given an array of integers size N, how can you efficiently find a subset of size K with elements that are closest to each other?
Le
This procedure can be done with O(N*K)
if A
is sorted. If A
is not sorted, then the time will be bounded by the sorting procedure.
This is based on 2 facts (relevant only when A
is ordered):
K
subsequent elements, the sum of distances can be calculated as the sum of each two subsequent elements time (K-i)*i
where i
is 1,...,K-1
.K
times the distance between the previously two smallest elements, and add K
times the distance of the two new largest elements. this fact is being used to calculate the closeness of a subset in O(1)
by using the closeness of the previous subset. Here's the pseudo-code
List<pair> FindClosestSubsets(int[] A, int K)
{
List<pair> minList = new List<pair>;
int minVal = infinity;
int tempSum;
int N = A.length;
for (int i = K - 1; i < N; i++)
{
tempSum = 0;
for (int j = i - K + 1; j <= i; j++)
tempSum += (K-i)*i * (A[i] - A[i-1]);
if (tempSum < minVal)
{
minVal = tempSum;
minList.clear();
minList.add(new pair(i-K, i);
}
else if (tempSum == minVal)
minList.add(new pair(i-K, i);
}
return minList;
}
This function will return a list of pairs of indexes representing the optimal solutions (the starting and ending index of each solution), it was implied in the question that you want to return all solutions of the minimal value.
After sorting, we can be sure that, if x1, x2, ... xk are the solution, then x1, x2, ... xk are contiguous elements, right?
So,
Your current solution is O(NK^2)
(assuming K > log N
). With some analysis, I believe you can reduce this to O(NK)
.
The closest set of size K will consist of elements that are adjacent in the sorted list. You essentially have to first sort the array, so the subsequent analysis will assume that each sequence of K
numbers is sorted, which allows the double sum to be simplified.
Assuming that the array is sorted such that x[j] >= x[i]
when j > i
, we can rewrite your closeness metric to eliminate the absolute value:
Next we rewrite your notation into a double summation with simple bounds:
Notice that we can rewrite the inner distance between x[i]
and x[j]
as a third summation:
where I've used d[l]
to simplify the notation going forward:
Notice that d[l]
is the distance between each adjacent element in the list. Look at the structure of the inner two summations for a fixed i
:
j=i+1 d[i]
j=i+2 d[i] + d[i+1]
j=i+3 d[i] + d[i+1] + d[i+2]
...
j=K=i+(K-i) d[i] + d[i+1] + d[i+2] + ... + d[K-1]
Notice the triangular structure of the inner two summations. This allows us to rewrite the inner two summations as a single summation in terms of the distances of adjacent terms:
total: (K-i)*d[i] + (K-i-1)*d[i+1] + ... + 2*d[K-2] + 1*d[K-1]
which reduces the total sum to:
Now we can look at the structure of this double summation:
i=1 (K-1)*d[1] + (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=2 (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=3 (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
...
i=K-2 2*d[K-2] + d[K-1]
i=K-1 d[K-1]
Again, notice the triangular pattern. The total sum then becomes:
1*(K-1)*d[1] + 2*(K-2)*d[2] + 3*(K-3)*d[3] + ... + (K-2)*2*d[K-2]
+ (K-1)*1*d[K-1]
Or, written as a single summation:
This compact single summation of adjacent differences is the basis for a more efficient algorithm:
O(N log N)
O(N)
N-K
sequence of differences and calculate the above sum, order O(NK)
Note that the second and third step could be combined, although with Python your mileage may vary.
The code:
def closeness(diff,K):
acc = 0.0
for (i,v) in enumerate(diff):
acc += (i+1)*(K-(i+1))*v
return acc
def closest(a,K):
a.sort()
N = len(a)
diff = [ a[i+1] - a[i] for i in xrange(N-1) ]
min_ind = 0
min_val = closeness(diff[0:K-1],K)
for ind in xrange(1,N-K+1):
cl = closeness(diff[ind:ind+K-1],K)
if cl < min_val:
min_ind = ind
min_val = cl
return a[min_ind:min_ind+K]
try the following:
N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = some_unsorted_list
a.sort()
cur_diff = sum([abs(a[i] - a[i + 1]) for i in range(K - 1)])
min_diff = cur_diff
min_last_idx = K - 1
for last_idx in range(K,N):
cur_diff = cur_diff - \
abs(a[last_idx - K - 1] - a[last_idx - K] + \
abs(a[last_idx] - a[last_idx - 1])
if min_diff > cur_diff:
min_diff = cur_diff
min_last_idx = last_idx
From the min_last_idx, you can calculate the min_first_idx. I use range to preserve the order of idx. If this is python 2.7, it will take linearly more RAM. This is the same algorithm that you use, but slightly more efficient (smaller constant in complexity), as it does less then summing all.
My initial solution was to look through all the K element window and multiply each element by m and take the sum in that range, where m is initialized by -(K-1) and incremented by 2 in each step and take the minimum sum from the entire list. So for a window of size 3, m is -2 and the values for the range will be -2 0 2. This is because I observed a property that each element in the K window add a certain weight to the sum. For an example if the elements are [10 20 30] the sum is (30-10) + (30-20) + (20-10). So if we break down the expression we have 2*30 + 0*20 + (-2)*10. This can be achieved in O(n) time and the entire operation would be in O(NK) time. However it turns out that this solution is not optimal, and there are certain edge cases where this algorithm fails. I am yet to figure out those cases, but shared the solution anyway if anyone can figure out something useful from it.
for(i = 0 ;i <= n - k;++i)
{
diff = 0;
l = -(k-1);
for(j = i;j < i + k;++j)
{
diff += a[j]*l;
if(min < diff)
break;
l += 2;
}
if(j == i + k && diff > 0)
min = diff;
}
itertools to the rescue?
from itertools import combinations
def closest_elements(iterable, K):
N = set(iterable)
assert(2 <= K <= len(N) <= 10**5)
combs = lambda it, k: combinations(it, k)
_abs = lambda it: abs(it[0] - it[1])
d = {}
v = 0
for x in combs(N, K):
for y in combs(x, 2):
v += _abs(y)
d[x] = v
v = 0
return min(d, key=d.get)
>>> a = [10,100,300,200,1000,20,30]
>>> b = [1,2,3,4,10,20,30,40,100,200]
>>> print closest_elements(a, 3); closest_elements(b, 4)
(10, 20, 30) (1, 2, 3, 4)