Finding unique points in numpy array

不思量自难忘° 2020-12-16 03:40

What is a faster way of finding unique x,y points (removing duplicates) in a numpy array like:

points = numpy.random.randint(0, 5, (10,2))

2 answers
  • 2020-12-16 03:55

    I think you have a very good idea here. Think about the underlying block of memory used to represent the data in points. We tell numpy to regard that block as an array of shape (10,2) with dtype int32 (32-bit integers), but it is almost costless to tell numpy to regard that same block as an array of shape (10,1) with dtype c8 (64-bit complex), so that each (x, y) pair becomes a single value. Note that this relies on points actually holding 32-bit integers; where the default integer type is 64-bit, the matching complex dtype would be c16.
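To make the "almost costless" claim concrete, here is a minimal sketch that checks the view shares memory with the original array. The small `arange` array is only an illustrative stand-in for points, and the explicit int32 dtype is an assumption chosen so that one row occupies exactly 8 bytes.

```python
import numpy as np

# Illustrative stand-in for `points`; int32 is assumed so each row is 8 bytes
points = np.arange(20, dtype='i4').reshape(10, 2)

# Reinterpret each (x, y) int32 pair as a single complex64 value
cpoints = points.view('c8')

print(cpoints.shape)                      # one complex value per row
print(np.shares_memory(points, cpoints))  # True means no data was copied
```

Because the view only changes how the bytes are interpreted, its cost does not grow with the number of points.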

    So the only real cost is calling np.unique, followed by another virtually costless call to view and reshape:

    import numpy as np
    np.random.seed(1)
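    # Cast to int32 so each (x, y) row is exactly 8 bytes, matching the 'c8' view;
    # the default integer dtype is 64-bit on many platforms.
    points = np.random.randint(0, 5, (10,2)).astype('i4')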
    print(points)
    print(len(points))
    

    yields

    [[3 4]
     [0 1]
     [3 0]
     [0 1]
     [4 4]
     [1 2]
     [4 2]
     [4 3]
     [4 2]
     [4 2]]
    10
    

    while

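    cpoints = points.view('c8')                   # each int32 (x, y) row becomes one complex64
    cpoints = np.unique(cpoints)                  # sorted unique values, flattened to 1-D
    points = cpoints.view('i4').reshape((-1,2))   # reinterpret as int32 (x, y) rows again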
    print(points)
    print(len(points))
    

    yields

    [[0 1]
     [1 2]
     [3 0]
     [3 4]
     [4 2]
     [4 3]
     [4 4]]
    7
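
On NumPy 1.13 or later, np.unique also accepts an axis argument, which may be simpler than reinterpreting the memory by hand. The following is a sketch of that alternative and not part of the original answer; the points are hard-coded from the output shown earlier so that the example does not depend on the random seed.

```python
import numpy as np

# Same ten points as printed above, written out explicitly
points = np.array([[3, 4], [0, 1], [3, 0], [0, 1], [4, 4],
                   [1, 2], [4, 2], [4, 3], [4, 2], [4, 2]])

# axis=0 makes np.unique treat every row as one item
unique_rows = np.unique(points, axis=0)

print(unique_rows)
print(len(unique_rows))
```

This should produce the same seven sorted rows, and it works regardless of the integer dtype of points.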
    

    If you don't need the result to be sorted, wim's set-based method (using_set below) is faster. (You might want to consider accepting his answer.)

    import numpy as np
    np.random.seed(1)
    N=10000
    # int32 so that each row can be viewed as a single complex64 ('c8')
    points = np.random.randint(0, 5, (N,2)).astype('i4')
    
    def using_unique():
        cpoints = points.view('c8')
        cpoints = np.unique(cpoints)
        return cpoints.view('i4').reshape((-1,2))
    
    def using_set():
        return np.vstack([np.array(u) for u in set([tuple(p) for p in points])])
    

    yields these benchmarks:

    % python -mtimeit -s'import test' 'test.using_set()'
    100 loops, best of 3: 18.3 msec per loop
    % python -mtimeit -s'import test' 'test.using_unique()'
    10 loops, best of 3: 40.6 msec per loop
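
Timings like these depend heavily on the NumPy version and the hardware, so it may be worth re-measuring on your own machine. The sketch below reruns both functions with the timeit module in a single script; the repeat counts are arbitrary choices, and the astype('i4') cast is an assumption added so that the c8 view is valid on platforms with 64-bit default integers.

```python
import timeit
import numpy as np

np.random.seed(1)
N = 10000
# Cast to int32 (an assumption) so each row is 8 bytes, matching 'c8'
points = np.random.randint(0, 5, (N, 2)).astype('i4')

def using_unique():
    # View rows as complex64, deduplicate, then view back as int32 pairs
    cpoints = points.view('c8')
    return np.unique(cpoints).view('i4').reshape((-1, 2))

def using_set():
    # Deduplicate via a Python set of tuples (result order is arbitrary)
    return np.vstack([np.array(u) for u in set([tuple(p) for p in points])])

for func in (using_unique, using_set):
    # Best of 3 runs, 10 calls per run, reported per call
    seconds = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} msec per call")
```

Both functions return the same set of points; only the row order and the running time differ.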
    
  • 2020-12-16 04:03

    I would do it like this:

    numpy.array(list(set(tuple(p) for p in points)))

    For a fast solution in the most general case, this recipe may interest you: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
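
If the order of first appearance matters, a dict can stand in for the set, since dicts preserve insertion order in Python 3.7 and later. This is a minimal sketch of that idea and is not taken from the linked recipe; the sample points are hard-coded for illustration.

```python
import numpy as np

# Illustrative sample data with several duplicate rows
points = np.array([[3, 4], [0, 1], [3, 0], [0, 1], [4, 4],
                   [1, 2], [4, 2], [4, 3], [4, 2], [4, 2]])

# dict.fromkeys keeps only the first occurrence of each tuple, in input order
ordered = np.array(list(dict.fromkeys(map(tuple, points.tolist()))))

print(ordered)
```

Unlike the set version, the result here keeps each point in the position where it was first seen.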
