Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug…)

Submitted by 泄露秘密 on 2020-01-02 07:11:49

Question


I am trying to figure out how the output of scipy.cluster.hierarchy.dendrogram works. I thought I understood it, and I was able to use the output to reconstruct the dendrogram, but either I no longer understand it or there is a bug in the Python 3 version of this module.

This answer, "how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy", implies that the dendrogram output dictionary has the keys dict_keys(['icoord', 'ivl', 'color_list', 'leaves', 'dcoord']), all of the same size, so you can zip them and plt.plot them to reconstruct the dendrogram.

Seems simple enough, and I did get it to work back when I used Python 2.7.11, but once I upgraded to Python 3.5.1 my old scripts no longer gave me the same results.

I started reworking my clusters into a very simple, repeatable example and think I may have found a bug in the Python 3.5.1 build of SciPy (version 0.17.1-np110py35_1). I'm going to use the scikit-learn datasets because most people have that module from the conda distribution.

Why aren't these lining up, and why am I unable to reconstruct the dendrogram this way?

# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Load data
from sklearn.datasets import load_diabetes

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage # You can use SciPy one too

%matplotlib inline

# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])

# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
A_dist = distance.squareform(DF_dism.values)  # .values replaces DataFrame.as_matrix(), which newer pandas removed
Z = linkage(A_dist,method="average")

# I modded the SO code from the above answer for the plot function
def plot_tree( D_dendro, ax ):
    # Set up plotting data
    leaves = D_dendro["ivl"]
    icoord = np.array( D_dendro['icoord'] )
    dcoord = np.array( D_dendro['dcoord'] )
    color_list = D_dendro["color_list"]

    # Plot colors
    for leaf, xs, ys, color in zip(leaves, icoord, dcoord, color_list):
        print(leaf, xs, ys, color, sep="\t")
        plt.plot(xs, ys,  color)

    # Set min/max of plots
    xmin, xmax = icoord.min(), icoord.max()
    ymin, ymax = dcoord.min(), dcoord.max()

    plt.xlim( xmin-10, xmax + 0.1*abs(xmax) )
    plt.ylim( ymin, ymax + 0.1*abs(ymax) )

    # Set up ticks
    ax.set_xticks( np.arange(5, len(leaves) * 10 + 5, 10))
    ax.set_xticklabels(leaves, fontsize=10, rotation=45)

    plt.show()

fig, ax = plt.subplots()
D1 = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, no_plot=True)
plot_tree(D_dendro=D1, ax=ax)

attr_1  [ 15.  15.  25.  25.]   [ 0.          0.10333704  0.10333704  0.        ]   g
attr_4  [ 55.  55.  65.  65.]   [ 0.          0.26150727  0.26150727  0.        ]   r
attr_5  [ 45.  45.  60.  60.]   [ 0.          0.4917828   0.4917828   0.26150727]   r
attr_2  [ 35.   35.   52.5  52.5]   [ 0.          0.59107459  0.59107459  0.4917828 ]   b
attr_8  [ 20.    20.    43.75  43.75]   [ 0.10333704  0.65064998  0.65064998  0.59107459]   b
attr_6  [ 85.  85.  95.  95.]   [ 0.          0.60957062  0.60957062  0.        ]   b
attr_7  [ 75.  75.  90.  90.]   [ 0.          0.68142114  0.68142114  0.60957062]   b
attr_0  [ 31.875  31.875  82.5    82.5  ]   [ 0.65064998  0.72066112  0.72066112  0.68142114]   b
attr_3  [  5.       5.      57.1875  57.1875]   [ 0.          0.80554653  0.80554653  0.72066112]   b

Here's one without the labels, with just the icoord values on the x-axis.

Notice that the colors aren't mapping correctly. The output says the icoord [ 15. 15. 25. 25.] goes with attr_1, but based on the values it looks like it goes with attr_4. Also, the plot doesn't go all the way to the last leaf (attr_9), because icoord and dcoord each have one fewer entry than there are ivl labels.

print([len(D1[key]) for key in ["ivl", "icoord", "dcoord", "color_list"]])
# [10, 9, 9, 9]

Answer 1:


icoord, dcoord and color_list describe the links, not the leaves. icoord and dcoord give the coordinates of the "arches" (i.e. upside-down U or J shapes) for each link in the plot, and color_list gives the color of each arch. In a full plot, icoord, dcoord and color_list each have one fewer entry than ivl, as you have observed, because n leaves are joined by n - 1 links.
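
To see which leaves a given arch actually touches, here is a minimal sketch on synthetic data (not the diabetes data from the question). It assumes the default top orientation, where leaf i sits at x = 5 + 10*i, which is the same tick spacing the question's code uses, and it assumes no two observations are identical, so only leaves sit at height 0. The helper name leaves_touched is made up for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic data: 6 observations, 3 features (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(6, 3)
Z = linkage(X, method="average")
D = dendrogram(Z, no_plot=True)

# One entry per leaf in ivl, one entry per link in the other three lists
print(len(D["ivl"]), len(D["icoord"]), len(D["dcoord"]), len(D["color_list"]))

def leaves_touched(xs, ys, labels):
    """Hypothetical helper: labels of the leaves that a link's two legs rest on.

    A leg that ends at height 0 rests on a leaf; a leg that ends higher
    rests on another link. Assumes leaf i is drawn at x = 5 + 10 * i.
    """
    touched = []
    for x, y in ((xs[0], ys[0]), (xs[3], ys[3])):
        if y == 0:
            touched.append(labels[int(round((x - 5) / 10))])
    return touched

for xs, ys in zip(D["icoord"], D["dcoord"]):
    print(xs, leaves_touched(xs, ys, D["ivl"]))
```

Applied to the question's output, the first link, with icoord [15, 15, 25, 25] and both legs at height 0, joins the leaves at positions 1 and 2 of ivl, not the leaf at position 0.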

Don't try to line up the ivl list with the icoord, dcoord and color_list lists. They are associated with different things.
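
A reconstruction that keeps the two groups separate might look like the sketch below. It uses synthetic data rather than the question's diabetes data, zips only the three per-link lists, and places the ivl labels as tick labels on their own. The tick positions assume the default top orientation with leaf i at x = 5 + 10*i, as in the question's code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic data: 10 observations, 4 features (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(10, 4)
Z = linkage(X, method="average")
D = dendrogram(Z, no_plot=True)

n_leaves = len(D["ivl"])
n_links = len(D["icoord"])

fig, ax = plt.subplots()

# Draw the links: icoord, dcoord and color_list all have one entry per link
for xs, ys, color in zip(D["icoord"], D["dcoord"], D["color_list"]):
    ax.plot(xs, ys, color=color)

# Label the leaves separately: one tick per entry in ivl
leaf_x = np.arange(5, 10 * n_leaves, 10)
ax.set_xticks(leaf_x)
ax.set_xticklabels(D["ivl"], rotation=45)

# Call plt.show() or fig.savefig(...) here to view the result
```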



Source: https://stackoverflow.com/questions/38166327/interpreting-the-output-of-scipys-hierarchical-clustering-dendrogram-maybe-fo
