Clustering using a custom distance metric for lat/long pairs

Posted by 随声附和 on 2020-01-01 09:21:17

Question


I'm trying to specify a custom distance function for the scikit-learn DBSCAN implementation:

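import numpy as np
from geopy.distance import vincenty  # removed in geopy 2.0
from sklearn.cluster import DBSCAN

def geodistance(latLngA, latLngB):
    # Debug output: show what DBSCAN actually passes to the metric
    print(latLngA, latLngB)
    return vincenty(latLngA, latLngB).miles

cluster_labels = DBSCAN(
            eps=500,
            min_samples=max(2, len(found_geopoints) // 10),
            metric=geodistance
).fit(np.array(found_geopoints)).labels_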

However, when I print out the arguments to my distance function they aren't at all what I would expect:

[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]
[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]

This is what my found_geopoints array looks like:

[[  4.24680600e+01   1.40868060e+02]
 [ -2.97677600e+01  -6.20477000e+01]
 [  3.97550400e+01   2.90069000e+00]
 [  4.21144200e+01   1.43442500e+01]
 [  8.56111000e+00   1.24771390e+02]
...

So why aren't the arguments to the distance function latitude longitude pairs?


Answer 1:


I seem to have found a workaround: compute a distance matrix with pairwise_distances (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html), then pass it to DBSCAN(metric='precomputed').fit(distance_matrix).
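Here is a minimal sketch of that workaround, for illustration only. The haversine_miles function is a stand-in for the question's vincenty-based geodistance, and distance_matrix, cluster_precomputed and EARTH_RADIUS_MILES are hypothetical names introduced here. The matrix is built by hand so that it is clear only real (lat, lon) pairs ever reach the metric; pairwise_distances from the link above could build it instead.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius; an assumption of this sketch


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors before the square root.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, h)))


def distance_matrix(points, metric):
    """Build the symmetric matrix of metric(p, q) over all pairs of points."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(points[i], points[j])
    return matrix


def cluster_precomputed(matrix, eps, min_samples):
    """Run DBSCAN on a precomputed matrix (requires NumPy and scikit-learn)."""
    import numpy as np
    from sklearn.cluster import DBSCAN
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric='precomputed').fit(np.asarray(matrix)).labels_


# First three points of the question's found_geopoints array.
points = [(42.46806, 140.86806), (-29.76776, -62.0477), (39.75504, 2.90069)]
matrix = distance_matrix(points, haversine_miles)
```

Calling cluster_precomputed(matrix, eps=500, min_samples=2) would then return the labels. Note that eps must be in the same unit as the matrix entries, which is miles here.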




Answer 2:


You can do this with scikit-learn: use the haversine metric with the ball-tree algorithm, and pass radian units into the DBSCAN fit method.

This tutorial demonstrates how to cluster spatial lat-long data with scikit-learn's DBSCAN, using the haversine metric so that clusters are based on great-circle distances between lat-long points:

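import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.read_csv('gps.csv')
coords = df[['lat', 'lon']].to_numpy()  # as_matrix() was removed in pandas 1.0
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))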

Notice that the coordinates are passed into the .fit() method in radians, and that the epsilon parameter value must be in radians as well.

If you want epsilon to be, say, 1.5 km, then the epsilon parameter value in radians would be 1.5/6371, where 6371 km is the Earth's mean radius.
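The conversion can be checked with a short standard-library sketch. The names km_to_radians, haversine_radians and EARTH_RADIUS_KM are hypothetical. The haversine function is a hand-written stand-in that assumes the usual formula on a unit sphere with (lat, lon) inputs in radians; it is not scikit-learn's own code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the constant used in the answer


def km_to_radians(distance_km):
    """Convert a surface distance in km to an angle in radians on a sphere."""
    return distance_km / EARTH_RADIUS_KM


def haversine_radians(a, b):
    """Haversine distance on the unit sphere; a and b are (lat, lon) in radians."""
    lat1, lon1 = a
    lat2, lon2 = b
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors before the square root.
    return 2 * math.asin(math.sqrt(min(1.0, h)))


# Epsilon for a 1.5 km neighbourhood, expressed in radians.
eps = km_to_radians(1.5)

# Two illustrative points 0.01 degrees of latitude apart (about 1.11 km).
p = (math.radians(48.85), math.radians(2.35))
q = (math.radians(48.86), math.radians(2.35))
separation_km = haversine_radians(p, q) * EARTH_RADIUS_KM
within_eps = haversine_radians(p, q) <= eps
```

Multiplying a haversine result by the Earth's radius converts it back to kilometres, which is a convenient way to sanity-check a chosen epsilon before clustering.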



Source: https://stackoverflow.com/questions/23420605/clustering-using-a-custom-distance-metric-for-lat-long-pairs
