Python: String clustering with scikit-learn's DBSCAN, using Levenshtein distance as the metric
Question: I have been trying to cluster multiple datasets of URLs (around 1 million each) to find the original and the typos of each URL. I decided to use the Levenshtein distance as the similarity metric, with DBSCAN as the clustering algorithm, since k-means won't work because I do not know the number of clusters. I am running into problems with scikit-learn's implementation of DBSCAN. The snippet below works on small datasets in the format I am using, but since it is precomputing the