DBSCAN for clustering of geographic location data

纵饮孤独 提交于 2019-11-29 20:30:21

DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java) - scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors). But apparently, you can affort to precompute pairwise distances, so this is not (yet) an issue.

However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)

uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.

See the documentation of DBSCAN (emphasis added):

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

similar for fit_predict:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

In other words, you need to do

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")

You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))

This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2km, but it's divided by 6371 to convert it to radians. Also, notice that .fit() takes the coordinates in radian units for the haversine metric.

Jamie Bull

I don't know what implementation of haversine you're using but it looks like it returns results in km so eps should be 0.2, not 2 for 200 m.

For the min_samples parameter, that depends on what your expected output is. Here are a couple of examples. My outputs are using an implementation of haversine based on this answer which gives a distance matrix similar, but not identical to yours.

This is with db = DBSCAN(eps=0.2, min_samples=5)

[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]

This creates three clusters, 0, 1 and 2, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1).

Trying again with a smaller min_samples value:

db = DBSCAN(eps=0.2, min_samples=2)

[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]

Here most of the samples are within 200m of at least one other sample and so fall into one of eight clusters 0 to 7.

Edited to add

It looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:

from math import radians, cos, sin, asin, sqrt

from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

import matplotlib.pyplot as plt
import pandas as pd


def haversine(lonlat1, lonlat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r


X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

X['cluster'] = y_db

plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!