Find subset with K elements that are closest to each other

误落风尘 2020-12-30 04:44

Given an array of integers of size N, how can you efficiently find a subset of size K with elements that are closest to each other?

Le

6 Answers
  • 2020-12-30 04:55

    This procedure can be done in O(N*K) time if A is sorted. If A is not sorted, the cost of sorting it first (typically O(N log N)) is added on top.

    This is based on three observations (relevant only when A is ordered):

    • The closest subset always consists of consecutive elements of the sorted array.
    • For K consecutive elements, the sum of all pairwise distances equals the sum over i = 1,...,K-1 of i*(K-i) times the distance between the i-th and (i+1)-th element of the window.
    • When iterating through the sorted array, it is redundant to recompute the entire sum for every window. Dropping the oldest gap and adding the newest one is not enough on its own, because the weight i*(K-i) of every remaining gap shifts by one position. Writing the same sum with element weights -(K-1), -(K-3), ..., K-1 gives new = old + (K-1)*(A[s] + A[s+K]) - 2*(A[s+1] + ... + A[s+K-1]) for a window starting at index s, and the inner sum is available in O(1) from prefix sums. The pseudo-code below recomputes each window in O(K) instead, which gives the O(N*K) bound.

    Here's the pseudo-code

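    // Requires: using System.Collections.Generic;
    // A must be sorted in ascending order.
    // Returns the (start, end) indexes of every window of K consecutive
    // elements whose sum of pairwise distances is minimal.
    List<(int Start, int End)> FindClosestSubsets(int[] A, int K)
    {
        var minList = new List<(int Start, int End)>();
        long minVal = long.MaxValue;
        int N = A.Length;

        for (int i = K - 1; i < N; i++)
        {
            int start = i - K + 1;   // first index of the window ending at i
            long tempSum = 0;

            for (int j = start + 1; j <= i; j++)
            {
                // p = 1..K-1 is the position of the gap inside the window;
                // the gap A[j] - A[j-1] is counted p*(K-p) times.
                int p = j - start;
                tempSum += (long)p * (K - p) * (A[j] - A[j - 1]);
            }

            if (tempSum < minVal)
            {
                minVal = tempSum;
                minList.Clear();
                minList.Add((start, i));
            }
            else if (tempSum == minVal)
            {
                minList.Add((start, i));
            }
        }

        return minList;
    }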
    

    This function returns a list of index pairs, one per optimal solution (the starting and ending index of each window). The question implied that you want all solutions with the minimal value, so ties are kept.
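The incremental idea from the third observation can be sketched in Python as follows. This is only a sketch under stated assumptions: the function name `closest_window_linear`, the use of prefix sums, and the choice to return just the first minimal window are illustrative and not part of the answer above. It relies on the fact that, in a sorted window, the element at offset j contributes with weight 2*j - K + 1 to the sum of pairwise distances.

```python
from itertools import accumulate

def closest_window_linear(values, k):
    """Return the k values (sorted) with the smallest sum of pairwise distances.

    Hypothetical helper: O(N log N) for the sort, then O(N) for the scan.
    """
    a = sorted(values)
    n = len(a)
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= len(values)")

    # prefix[i] = a[0] + ... + a[i-1], so any range sum costs O(1).
    prefix = [0] + list(accumulate(a))

    # Closeness of the first window, using element weights 2*j - k + 1.
    cur = sum((2 * j - k + 1) * a[j] for j in range(k))
    best, best_start = cur, 0

    for s in range(n - k):
        # Move from the window starting at s to the one starting at s + 1:
        # new = old + (k-1) * (a[s] + a[s+k]) - 2 * (a[s+1] + ... + a[s+k-1])
        inner = prefix[s + k] - prefix[s + 1]
        cur += (k - 1) * (a[s] + a[s + k]) - 2 * inner
        if cur < best:
            best, best_start = cur, s + 1

    return a[best_start:best_start + k]
```

For instance, `closest_window_linear([1, 2, 4, 7], 3)` evaluates the windows [1, 2, 4] (closeness 6) and [2, 4, 7] (closeness 10) and returns the first one.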

  • 2020-12-30 04:55

    After sorting, we can be sure that, if x1, x2, ... xk are the solution, then x1, x2, ... xk are contiguous elements, right?

    So,

    1. Take the intervals between adjacent numbers in the sorted array.
    2. Sum each run of k-1 consecutive intervals to get the spread of each group of k numbers.
    3. Choose the smallest of them.
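A minimal Python sketch of these three steps, assuming "closest" here means the smallest spread of the group. The k-1 adjacent intervals inside a window telescope to last minus first, so no explicit interval list is needed. The name `smallest_span_window` is hypothetical, and this spread-based measure is not necessarily the same as a sum over all pairwise distances.

```python
def smallest_span_window(values, k):
    """Return the k sorted values whose largest and smallest differ the least."""
    a = sorted(values)
    if not 1 <= k <= len(a):
        raise ValueError("need 1 <= k <= len(values)")

    # The sum of the k-1 adjacent intervals in a[i..i+k-1] is a[i+k-1] - a[i].
    start = min(range(len(a) - k + 1), key=lambda i: a[i + k - 1] - a[i])
    return a[start:start + k]
```

On ties, `min` keeps the earliest window, so only one of several equally tight groups is returned.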
  • 2020-12-30 05:01

    Your current solution is O(NK^2) (assuming K > log N). With some analysis, I believe you can reduce this to O(NK).

    The closest set of size K will consist of elements that are adjacent in the sorted list. You essentially have to first sort the array, so the subsequent analysis will assume that each sequence of K numbers is sorted, which allows the double sum to be simplified.

    Assuming that the array is sorted such that x[j] >= x[i] when j > i, we can rewrite your closeness metric to eliminate the absolute value:

    $$\sum_{1 \le i < j \le K} |x_j - x_i| = \sum_{1 \le i < j \le K} (x_j - x_i)$$

    Next we rewrite your notation into a double summation with simple bounds:

    $$\sum_{i=1}^{K-1} \sum_{j=i+1}^{K} (x_j - x_i)$$

    Notice that we can rewrite the inner distance between x[i] and x[j] as a third summation:

    $$\sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \sum_{l=i}^{j-1} d_l$$

    where I've used d[l] to simplify the notation going forward:

    $$d_l = x_{l+1} - x_l$$

    Notice that d[l] is the distance between each adjacent element in the list. Look at the structure of the inner two summations for a fixed i:

    j=i+1         d[i]
    j=i+2         d[i] + d[i+1]
    j=i+3         d[i] + d[i+1] + d[i+2]
    ...
    j=K=i+(K-i)   d[i] + d[i+1] + d[i+2] + ... + d[K-1]
    

    Notice the triangular structure of the inner two summations. This allows us to rewrite the inner two summations as a single summation in terms of the distances of adjacent terms:

    total: (K-i)*d[i] + (K-i-1)*d[i+1] + ... + 2*d[K-2] + 1*d[K-1]
    

    which reduces the total sum to:

    $$\sum_{i=1}^{K-1} \sum_{l=i}^{K-1} (K-l)\, d_l$$

    Now we can look at the structure of this double summation:

    i=1     (K-1)*d[1] + (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
    i=2                  (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
    i=3                               (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
    ...
    i=K-2                                                2*d[K-2] + d[K-1]
    i=K-1                                                           d[K-1]
    

    Again, notice the triangular pattern. The total sum then becomes:

    1*(K-1)*d[1] + 2*(K-2)*d[2] + 3*(K-3)*d[3] + ... + (K-2)*2*d[K-2] 
      + (K-1)*1*d[K-1]
    

    Or, written as a single summation:

    $$\sum_{l=1}^{K-1} l\,(K-l)\, d_l$$

    This compact single summation of adjacent differences is the basis for a more efficient algorithm:

    1. Sort the array, order O(N log N)
    2. Compute the differences of each adjacent element, order O(N)
    3. Iterate over each of the N-K+1 windows of K-1 consecutive differences and calculate the above sum, order O(NK)

    Note that the second and third step could be combined, although with Python your mileage may vary.

    The code:

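    def closeness(diff, K):
        # Weighted sum of K-1 adjacent differences: the l-th gap (1-based)
        # is counted l*(K-l) times in the sum of pairwise distances.
        acc = 0
        for i, v in enumerate(diff):
            acc += (i + 1) * (K - (i + 1)) * v
        return acc

    def closest(a, K):
        a.sort()  # note: sorts the caller's list in place
        N = len(a)
        diff = [a[i + 1] - a[i] for i in range(N - 1)]

        min_ind = 0
        min_val = closeness(diff[0:K - 1], K)

        # Try every remaining window of K-1 consecutive differences.
        for ind in range(1, N - K + 1):
            cl = closeness(diff[ind:ind + K - 1], K)
            if cl < min_val:
                min_ind = ind
                min_val = cl

        return a[min_ind:min_ind + K]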
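As a quick sanity check of the derivation, the single summation of weighted adjacent differences can be compared against a direct sum over all pairs. The helper names `pairwise_sum` and `gap_formula` below are made up for this sketch; the gap index l is 1-based as in the text, while the Python list is 0-based.

```python
from itertools import combinations

def pairwise_sum(window):
    """Direct definition: sum of |p - q| over every pair in the window."""
    return sum(abs(p - q) for p, q in combinations(window, 2))

def gap_formula(window):
    """Closed form: sum over l = 1..K-1 of l * (K - l) * d[l] on the sorted window."""
    w = sorted(window)
    k = len(w)
    # With a 0-based list, the l-th gap d[l] is w[l] - w[l - 1].
    return sum(l * (k - l) * (w[l] - w[l - 1]) for l in range(1, k))

sample = [10, 20, 30, 100]
print(pairwise_sum(sample), gap_formula(sample))  # prints: 280 280
```

Here the gaps are 10, 10 and 70 with weights 3, 4 and 3, giving 30 + 40 + 210 = 280, the same as adding the six pairwise distances.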
    
  • 2020-12-30 05:10

    Try the following:

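    N = int(input())
    K = int(input())
    assert 2 <= N <= 10**5
    assert 2 <= K <= N
    a = some_unsorted_list  # placeholder for the N input integers
    a.sort()

    # Sum of the K-1 adjacent gaps in the first window a[0..K-1].
    cur_diff = sum(a[i + 1] - a[i] for i in range(K - 1))
    min_diff = cur_diff
    min_last_idx = K - 1
    for last_idx in range(K, N):
        # Slide the window by one: drop the gap that left on the left,
        # add the gap that entered on the right.
        cur_diff = (cur_diff
                    - (a[last_idx - K + 1] - a[last_idx - K])
                    + (a[last_idx] - a[last_idx - 1]))
        if min_diff > cur_diff:
            min_diff = cur_diff
            min_last_idx = last_idx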
    

    From min_last_idx you can calculate the first index: min_first_idx = min_last_idx - K + 1. I use range so that the indexes are visited in order. In Python 2.7, range builds the whole list and therefore takes O(N) extra memory (xrange avoids that). This is the same algorithm that you use, but slightly more efficient (a smaller constant factor), because it updates a running sum instead of re-summing every window.

  • 2020-12-30 05:15

    My initial solution was to look at every window of K elements, multiply each element by a weight m, sum the products over the window, and take the minimum sum over the entire list. The weight m starts at -(K-1) and increases by 2 at each step, so for a window of size 3 the weights are -2, 0, 2. This works because each element in the window contributes a fixed weight to the sum. For example, if the elements are [10 20 30], the sum is (30-10) + (30-20) + (20-10), which expands to 2*30 + 0*20 + (-2)*10. Each window takes O(K) time, so the entire operation takes O(NK) time. However, it turns out that this solution is not optimal, and there are certain edge cases where the algorithm fails. I have yet to figure out those cases, but I am sharing the solution anyway in case anyone can get something useful out of it.

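    /* Assumes a[] is sorted in ascending order and that min starts at a
       value larger than any possible sum (e.g. LLONG_MAX from <limits.h>). */
    for(i = 0 ;i <= n - k;++i)
    {
        diff = 0;
        l = -(k-1);              /* weight of the smallest element */
        for(j = i;j < i + k;++j)
        {
            diff += a[j]*l;
            l += 2;              /* weights: -(k-1), -(k-3), ..., k-1 */
        }
        /* The early exit on a partial sum was dropped: partial sums are not
           monotone when a[] contains negative values. diff == 0 is accepted
           so that windows of equal elements are not skipped. */
        if(diff < min)
            min = diff;
    }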
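The weighting idea can be checked on small inputs with a short Python sketch. The helper name `weighted_terms` is hypothetical, and the two pitfalls shown (unsorted input, and partial sums that overshoot the final value when elements are negative) are offered only as possible sources of the failing edge cases, not as a confirmed diagnosis.

```python
from itertools import accumulate

def weighted_terms(window):
    """Per-element contributions (2*j - k + 1) * window[j].

    The total equals the sum of pairwise distances only if window is sorted.
    """
    k = len(window)
    return [(2 * j - k + 1) * x for j, x in enumerate(window)]

print(sum(weighted_terms([10, 20, 30])))                 # 40, matches the example
print(sum(weighted_terms([30, 10, 20])))                 # -20, unsorted input breaks it
print(list(accumulate(weighted_terms([-10, -9, -8]))))   # [20, 20, 4]
```

The last line shows running totals of 20, 20 and 4: a partial sum can exceed the final value, so stopping early on a partial sum may discard a window that is actually the best one.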
    
  • 2020-12-30 05:16

    itertools to the rescue?

    from itertools import combinations

    def closest_elements(iterable, K):
        # Brute force: examines all C(N, K) subsets, so it is only
        # practical for small inputs.
        N = sorted(set(iterable))  # distinct values, in ascending order
        assert 2 <= K <= len(N) <= 10**5

        d = {}
        for x in combinations(N, K):
            # closeness of subset x = sum of |p - q| over all pairs in x
            d[x] = sum(abs(p - q) for p, q in combinations(x, 2))

        return min(d, key=d.get)

    >>> a = [10,100,300,200,1000,20,30]
    >>> b = [1,2,3,4,10,20,30,40,100,200]
    >>> print(closest_elements(a, 3), closest_elements(b, 4))
    (10, 20, 30) (1, 2, 3, 4)
    