Question
I'm trying to assign flat, single-linkage clusters to sequence IDs separated by an edit distance < n, given a square distance matrix. I believe scipy.cluster.hierarchy.fclusterdata() with criterion='distance' may be a way to do this, but it isn't quite returning the clusters I'd expect for this toy example.
Specifically, in the 4x4 distance matrix example below, I would expect clusters_50 (which uses t=50) to create 2 clusters, where actually it finds 3. I think the issue is that fclusterdata() doesn't expect a distance matrix, but fcluster() doesn't seem to do what I want either.
I've also looked at sklearn.cluster.AgglomerativeClustering, but this requires n_clusters to be specified, and I want to create as many clusters as needed until the distance threshold I specify has been satisfied.
I see that there is a currently unmerged scikit-learn pull request for this exact feature: https://github.com/scikit-learn/scikit-learn/pull/9069
Can anyone point me in the right direction? Clustering with an absolute distance threshold criterion seems like a common use case.
import pandas as pd
from scipy.cluster.hierarchy import fclusterdata
cols = ['a', 'b', 'c', 'd']
df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
{'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
{'a': 35, 'b': 29468, 'c': 0, 'd': 38},
{'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
index=cols)
clusters_20 = fclusterdata(df.values, t=20, criterion='distance')
clusters_50 = fclusterdata(df.values, t=50, criterion='distance')
clusters_100 = fclusterdata(df.values, t=100, criterion='distance')
names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}
names_clusters_20 # Expecting 3 clusters, finds 3
>>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}
names_clusters_50 # Expecting 2 clusters, finds 3
>>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}
names_clusters_100 # Expecting 2 clusters, finds 2
>>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}
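Answer 1:
You did not set the metric parameter.
The default is then metric='euclidean', not precomputed, so fclusterdata() treats each row of your matrix as an observation vector and computes Euclidean distances between the rows.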
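To see why t=50 yields three clusters in the question, it may help to look at the distances fclusterdata() is effectively working with. The sketch below assumes the question's matrix (rows and columns ordered a, b, c, d) and uses pdist to compute Euclidean distances between its rows; the variable names are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The question's edit-distance matrix, rows/columns ordered a, b, c, d.
dm = np.array([
    [0, 29467, 35, 13],
    [29467, 0, 29468, 29470],
    [35, 29468, 0, 38],
    [13, 29470, 38, 0],
], dtype=float)

# Treat each row as a 4-D point and compute Euclidean distances between rows.
# This is NOT the edit distance; it is what the default metric produces.
row_dists = squareform(pdist(dm, metric='euclidean'))

a, c, d = 0, 2, 3
print(row_dists[a, d])  # about 18.9, below 20, so a and d merge at t=20
print(row_dists[a, c])  # about 55.5, above 50, so c stays separate at t=50
print(row_dists[c, d])  # about 58.1
```

Under that reading, c only joins a and d once the threshold exceeds roughly 55.5, which matches the 3 / 3 / 2 cluster counts reported in the question.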
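Answer 2:
Figured it out by passing the output of linkage() to fcluster(). Unlike fclusterdata(), linkage() accepts a precomputed distance matrix, provided it is in condensed (1-D) form. When linkage() receives a 1-D input it treats it as condensed distances and does not use the metric argument to recompute them, so metric='precomputed' below is effectively just a label.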
fcluster(linkage(condensed_dm, metric='precomputed'), criterion='distance', t=20)
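The condensed form that linkage() expects is the upper triangle of the square matrix, flattened row by row. A minimal sketch of the conversion with scipy.spatial.distance.squareform, again assuming the question's matrix:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Square, symmetric matrix with a zero diagonal (order: a, b, c, d).
dm = np.array([
    [0, 29467, 35, 13],
    [29467, 0, 29468, 29470],
    [35, 29468, 0, 38],
    [13, 29470, 38, 0],
], dtype=float)

# Square -> condensed: pairs (a,b), (a,c), (a,d), (b,c), (b,d), (c,d).
condensed = squareform(dm)
print(condensed)  # [29467.    35.    13. 29468. 29470.    38.]

# Condensed -> square: squareform is its own inverse.
restored = squareform(condensed)
print(np.array_equal(restored, dm))  # True
```

For n observations the condensed vector has n * (n - 1) / 2 entries, here 6.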
Solution:
import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster
cols = ['a', 'b', 'c', 'd']
df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
{'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
{'a': 35, 'b': 29468, 'c': 0, 'd': 38},
{'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
index=cols)
dm_cnd = squareform(df.values)
clusters_20 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=20)
clusters_50 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=50)
clusters_100 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=100)
names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}
names_clusters_20
>>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}
names_clusters_50
>>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}
names_clusters_100
>>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}
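Since the linkage matrix does not depend on the threshold, it can be computed once and cut at several values of t. The sketch below is one way to do that, assuming single linkage and the same toy matrix; it also prints the merge heights, which show where the cluster count changes.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

dm = np.array([
    [0, 29467, 35, 13],
    [29467, 0, 29468, 29470],
    [35, 29468, 0, 38],
    [13, 29470, 38, 0],
], dtype=float)

# Build the single-linkage tree once from the condensed distances.
Z = linkage(squareform(dm), method='single')

# Column 2 of Z holds the distance at which each merge happens.
print(Z[:, 2])  # [1.3000e+01 3.5000e+01 2.9467e+04]

# Cut the same tree at several thresholds.
counts = {}
for t in (20, 50, 100):
    labels = fcluster(Z, t=t, criterion='distance')
    counts[t] = len(set(labels))
print(counts)  # {20: 3, 50: 2, 100: 2}
```

Note that criterion='distance' groups observations whose cophenetic distance is no greater than t. For a strict "edit distance < n" rule on integer distances, passing t = n - 1 should give the intended cut.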
As a function:
import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage
def cluster_df(df, method='single', threshold=100):
    '''
    Accepts a square distance matrix as an indexed DataFrame and returns a dict
    of flat cluster labels keyed by column name.
    Performs single-linkage clustering by default; see the
    scipy.cluster.hierarchy.linkage docs for other methods.
    '''
    # Convert the square matrix to the condensed form that linkage() expects
    dm_cnd = squareform(df.values)
    clusters = fcluster(linkage(dm_cnd,
                                method=method,
                                metric='precomputed'),
                        criterion='distance',
                        t=threshold)
    names_clusters = {s: c for s, c in zip(df.columns, clusters)}
    return names_clusters
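The returned dict maps each ID to a cluster label. If the members of each cluster are needed instead, the mapping can be inverted. The helper below, group_by_cluster, is a hypothetical name introduced here for illustration and is not part of the original answer; the example input is the t=50 result shown above.

```python
from collections import defaultdict

def group_by_cluster(names_clusters):
    """Invert {name: label} into {label: [names]} (hypothetical helper)."""
    groups = defaultdict(list)
    for name, label in names_clusters.items():
        # int() converts NumPy integer labels to plain Python ints.
        groups[int(label)].append(name)
    return dict(groups)

# Labels as returned for t=50 in the solution above.
example = {'a': 1, 'b': 2, 'c': 1, 'd': 1}
grouped = group_by_cluster(example)
print(grouped)  # {1: ['a', 'c', 'd'], 2: ['b']}
```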
Source: https://stackoverflow.com/questions/55950591/single-linkage-clustering-of-edit-distance-matrix-with-distance-threshold-stoppi