Clustering 500,000 geospatial points in python

無奈伤痛 2020-12-15 12:02

I'm currently faced with the problem of finding a way to cluster around 500,000 latitude/longitude pairs in Python. So far I've tried computing a distance matrix with numpy, but I run out of memory.

2 Answers
  • 2020-12-15 12:17

    Older versions of DBSCAN in scikit-learn would compute a complete distance matrix.

    Unfortunately, computing a distance matrix needs O(n^2) memory, and that is probably where you run out of memory.

    Newer versions of scikit-learn (which version do you use?) should be able to work without a distance matrix, at least when using an index. At 500,000 objects you do want index acceleration, because it reduces the runtime from O(n^2) to O(n log n).

    I don't know how well scikit-learn supports geodetic distance in its indexes, though. ELKI is the only tool I know that can use R*-tree indexes to accelerate geodetic distance, making it extremely fast for this task (in particular when bulk-loading the index). You should give it a try.

    Have a look at the scikit-learn indexing documentation, and try setting algorithm='ball_tree'.
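
    For reference, scikit-learn's ball tree supports the haversine metric, so DBSCAN can run on latitude/longitude directly without building a distance matrix. A minimal sketch of that setup (the 1.5 km eps, min_samples=10, and the randomly generated coords are placeholders for your own data and parameters):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # coords: (n, 2) array of [latitude, longitude] in degrees (stand-in data here)
    coords = np.random.uniform(low=[40.0, -75.0], high=[41.0, -74.0], size=(500_000, 2))

    kms_per_radian = 6371.0088                   # mean Earth radius in km
    db = DBSCAN(
        eps=1.5 / kms_per_radian,                # 1.5 km radius, expressed in radians
        min_samples=10,
        algorithm='ball_tree',                   # index acceleration instead of a full distance matrix
        metric='haversine',                      # haversine expects [lat, lon] in radians
    )
    labels = db.fit_predict(np.radians(coords))  # label -1 marks noise points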

  • 2020-12-15 12:26

    I don't have your data, so I just generated 500k random points with three columns.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans2, whiten

    # 500,000 random points in three columns as stand-in data
    arr = np.random.randn(500000 * 3).reshape((500000, 3))

    # whiten() scales each column to unit variance before clustering
    centroids, labels = kmeans2(whiten(arr), 7, iter=20)  # <-- I randomly picked 7 clusters
    plt.scatter(arr[:, 0], arr[:, 1], c=labels, alpha=0.33333)
    
    (output: scatter plot of the generated points colored by cluster label)

    I timed this and it took 1.96 seconds to run kmeans2, so I don't think the problem is the size of your data. Put your data in a 500000 x 3 numpy array and try kmeans2. A minimal sketch of that step is below.
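
    This sketch assumes your pairs are stored one "lat,lon" per line in a CSV file ('points.csv' is a hypothetical path); with only two coordinate columns the array would be 500000 x 2 rather than x 3:

    import numpy as np
    from scipy.cluster.vq import kmeans2, whiten

    # hypothetical input file with one "lat,lon" pair per line
    coords = np.loadtxt('points.csv', delimiter=',')          # shape (500000, 2)
    centroids, labels = kmeans2(whiten(coords), 7, iter=20)   # labels[i] is the cluster of point i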
