Question
I am using Python for this job and, to be very direct here, I want to find a 'pythonic' way to remove from an array of arrays the "duplicates" that are closer to each other than a given threshold. For example, given this array:
[[ 5.024, 1.559, 0.281], [ 6.198, 4.827, 1.653], [ 6.199, 4.828, 1.653]]
observe that [ 6.198, 4.827, 1.653] and [ 6.199, 4.828, 1.653] are really close to each other; their Euclidean distance is about 0.0014, so they are almost "duplicates". I want my final output to be just:
[[ 5.024, 1.559, 0.281], [ 6.198, 4.827, 1.653]]
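(As a quick check of that distance, not part of the original question and assuming NumPy is available:)

import numpy as np
a = np.array([6.198, 4.827, 1.653])
b = np.array([6.199, 4.828, 1.653])
print(np.linalg.norm(a - b))  # ~0.001414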
The algorithm that I have right now is:
to_delete = [];
for i in unique_cluster_centers:
    for ii in unique_cluster_centers:
        if i == ii:
            pass;
        elif np.linalg.norm(np.array(i) - np.array(ii)) <= self.tolerance:
            to_delete.append(ii);
            break;
for i in to_delete:
    try:
        uniques.remove(i);
    except:
        pass;
but it's really slow, and I would like to know a faster and more 'pythonic' way to solve this. My tolerance is 0.0001.
Answer 1:
A generic approach might be:
def filter_quadratic(data, condition):
    result = []
    for element in data:
        if all(condition(element, other) for other in result):
            result.append(element)
    return result
This is a generic higher-order filter that takes a condition. An element is added only if the condition is satisfied between it and every element already in the result list.
Now we still need to define the condition:
def the_condition(xs, ys):
    # working with squared distances: 2.5e-05 is 0.005 * 0.005
    return sum((x - y) * (x - y) for x, y in zip(xs, ys)) > 2.5e-05
This gives:
>>> filter_quadratic([[ 5.024, 1.559, 0.281], [ 6.198, 4.827, 1.653], [ 6.199, 4.828, 1.653]],the_condition)
[[5.024, 1.559, 0.281], [6.198, 4.827, 1.653]]
The algorithm runs in O(n²), where n is the number of elements you give to the function. You can, however, make it more efficient with k-d trees, but that requires some more advanced data structures.
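For reference, here is a minimal sketch of the k-d tree idea using scipy.spatial.cKDTree (the function name filter_with_kdtree and the use of SciPy are an illustration, not part of the original answer). It keeps the earlier point of each close pair, which matches the example above, but it can behave differently from the quadratic filter when near-duplicates form chains:

import numpy as np
from scipy.spatial import cKDTree

def filter_with_kdtree(points, tol=0.005):
    pts = np.asarray(points)
    tree = cKDTree(pts)
    # all index pairs (i, j) with i < j whose Euclidean distance is <= tol
    close_pairs = tree.query_pairs(r=tol)
    # drop the later point of every close pair, keep the earlier one
    to_drop = {j for _, j in close_pairs}
    return [p for idx, p in enumerate(points) if idx not in to_drop]

Calling filter_with_kdtree on the sample data from the question returns the same two-element list as filter_quadratic.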
Answer 2:
If you can avoid comparing each list element to every other one in a nested loop (which is unavoidably an O(n²) operation), that would be much more efficient.
One approach is to generate a key such that two "almost duplicates" produce the same key. Then you just iterate over your data once and only insert the values whose keys are not already in your result set.
result = {}
for row in unique_cluster_centers:
    # round each value to 2 decimal places:
    # [5.024, 1.559, 0.281] => (5.02, 1.56, 0.28)
    # you can be inventive and, say, multiply each value by 3 before rounding
    # if you want precision other than a whole decimal point.
    key = tuple(round(v, 2) for v in row)  # tuples can be keys of a dict
    if key not in result:
        result[key] = row
return result.values()  # I suppose the order of the items is not important; you can use OrderedDict otherwise
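A self-contained run of the same idea on the sample data from the question (an illustration added here; the rounding precision follows the snippet above):

unique_cluster_centers = [[5.024, 1.559, 0.281],
                          [6.198, 4.827, 1.653],
                          [6.199, 4.828, 1.653]]

result = {}
for row in unique_cluster_centers:
    key = tuple(round(v, 2) for v in row)  # e.g. (5.02, 1.56, 0.28)
    if key not in result:
        result[key] = row

print(list(result.values()))
# [[5.024, 1.559, 0.281], [6.198, 4.827, 1.653]]

Note that two points that are very close can still round to different keys if they straddle a rounding boundary, so this is an approximation of the distance criterion rather than an exact equivalent.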
Source: https://stackoverflow.com/questions/43035503/efficiently-delete-arrays-that-are-close-from-each-other-given-a-threshold-in-py