How to plot and annotate hierarchical clustering dendrograms in scipy/matplotlib

一向 2020-12-07 09:05

I'm using dendrogram from scipy to plot hierarchical clustering using matplotlib as follows:

mat = array([[1, 0.5, 0.         


        
2 answers
  •  死守一世寂寞
    2020-12-07 09:58

    The input to linkage() is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. In your example, mat is 3 x 3, so you are clustering three 3-d points. Clustering is based on the distance between these points.
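
As a rough illustration of the two input forms, the sketch below clusters the same points both ways. The `points` array is hypothetical (it is not the asker's `mat`), and single linkage is chosen only to keep the numbers easy to check. It also shows `squareform`, which converts a square distance matrix to the condensed form that `linkage()` expects for distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 3 points in 3-dimensional space (not the asker's mat).
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0]])

# Form 1: an n x m observation array; Euclidean distances are computed internally.
Z_from_points = linkage(points, method="single")

# Form 2: the condensed distance matrix, a 1-d array of length n*(n-1)/2,
# holding the distances for the pairs (0,1), (0,2), (1,2).
condensed = pdist(points)
Z_from_condensed = linkage(condensed, method="single")

# A square distance matrix must be condensed first; passed as-is, a 3 x 3
# array would be read as three 3-d points.
square = squareform(condensed)
Z_from_square = linkage(squareform(square), method="single")

print(condensed.shape)
print(np.allclose(Z_from_points, Z_from_condensed))
print(np.allclose(Z_from_points, Z_from_square))
```

All three calls should produce the same linkage matrix, because they describe the same pairwise distances.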

    Why do mat and 1-mat give identical clusterings here?

    The arrays mat and 1-mat produce the same clustering because the clustering depends only on the distances between the points. 1-mat is a reflection of the data (-mat) followed by a translation (+1), and neither a reflection nor a translation (mat + offset) of the entire data set changes the distances between the points.
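
This invariance can be checked numerically. The sketch below uses a made-up 3 x 3 array named `pts` (an assumption for illustration, since the asker's full `mat` is not shown here) and compares the pairwise distances and the resulting linkage for `pts` and `1 - pts`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the asker's array: three points in 3-d.
pts = np.array([[0.2, 1.0, 0.4],
                [0.9, 0.1, 0.3],
                [0.5, 0.6, 1.0]])

# Pairwise Euclidean distances before and after reflecting and shifting.
d_original = pdist(pts)
d_flipped = pdist(1 - pts)

# The distances agree up to floating-point rounding ...
print(np.allclose(d_original, d_flipped))

# ... so the linkage matrices (merge order and merge heights) agree too.
print(np.allclose(linkage(pts), linkage(1 - pts)))
```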

    How can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?

    In the code below, I show how you can use the data returned by dendrogram to label the horizontal segments of the diagram with the corresponding distance. The values associated with the keys icoord and dcoord give the x and y coordinates of each three-segment inverted-U of the figure. In augmented_dendrogram this data is used to add a label of the distance (i.e. y value) of each horizontal line segment in dendrogram.

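    from scipy.cluster.hierarchy import dendrogram
    import matplotlib.pyplot as plt


    def augmented_dendrogram(*args, **kwargs):
        """Call dendrogram() and label each link with its merge distance.

        Assumes the default orientation, where dcoord holds the y values.
        """

        ddata = dendrogram(*args, **kwargs)

        # Nothing to annotate if the caller suppressed plotting.
        if not kwargs.get('no_plot', False):
            # Each link is a three-segment inverted U: icoord holds its four
            # x values and dcoord its four y values.
            for i, d in zip(ddata['icoord'], ddata['dcoord']):
                x = 0.5 * sum(i[1:3])   # midpoint of the horizontal segment
                y = d[1]                # height of the link = merge distance
                plt.plot(x, y, 'ro')
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -8),
                             textcoords='offset points',
                             va='top', ha='center')

        return ddata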
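
To see what icoord and dcoord contain without drawing anything, one option is to call dendrogram with no_plot=True and print the lists. The three points below are hypothetical, and the check at the end relies on the heights in dcoord being the merge distances stored in the third column of the linkage matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: three points on a line, so the distances are easy to read.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
Z = linkage(pts, "single")

# no_plot=True returns the layout data without needing a figure.
ddata = dendrogram(Z, no_plot=True)

for xs, ys in zip(ddata["icoord"], ddata["dcoord"]):
    # xs = [x_left, x_left, x_right, x_right]
    # ys = [bottom of left leg, link height, link height, bottom of right leg]
    print(xs, ys)

# The link heights match the merge distances in Z[:, 2] (order may differ).
heights = sorted(ys[1] for ys in ddata["dcoord"])
print(np.allclose(heights, sorted(Z[:, 2])))
```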
    

    For your mat array, the augmented dendrogram is

    [Figure: dendrogram for three points]

    So points 'a' and 'c' are 1.01 units apart, and point 'b' is 1.57 units from the cluster ['a', 'c'].

    It seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown?

    The flag show_leaf_counts only applies when not all the original data points are shown as leaves. For example, when truncate_mode = "lastp", only the last p merged clusters are shown, and a leaf may stand for many original points.

    Here's an example with 100 points:

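    import numpy as np
    from scipy.cluster.hierarchy import linkage
    import matplotlib.pyplot as plt
    # Assumes the function above was saved as augmented_dendrogram.py.
    from augmented_dendrogram import augmented_dendrogram
    
    
    # Generate a random sample of `n` points in 2-d.
    # Note: this covariance matrix has a negative determinant
    # (4.0*1.4 - 2.5**2 = -0.65), so it is not positive semidefinite
    # and NumPy may warn about it.
    np.random.seed(12312)
    n = 100
    x = np.random.multivariate_normal([0, 0], np.array([[4.0, 2.5], [2.5, 1.4]]),
                                      size=(n,))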
    
    plt.figure(1, figsize=(6, 5))
    plt.clf()
    plt.scatter(x[:, 0], x[:, 1])
    plt.axis('equal')
    plt.grid(True)
    
    linkage_matrix = linkage(x, "single")
    
    plt.figure(2, figsize=(10, 4))
    plt.clf()
    
    # Draw the same truncated dendrogram twice, toggling show_leaf_counts.
    for k, show_leaf_counts in enumerate([False, True], start=1):
        plt.subplot(1, 2, k)
        ddata = augmented_dendrogram(linkage_matrix,
                                     color_threshold=1,
                                     p=6,
                                     truncate_mode='lastp',
                                     show_leaf_counts=show_leaf_counts)
        plt.title("show_leaf_counts = %s" % show_leaf_counts)
    
    plt.show()
    

    These are the points in the data set:

    [Figure: scatter plot of 100 points]

    With p=6 and truncate_mode="lastp", dendrogram only shows the "top" of the dendrogram. The following shows the effect of show_leaf_counts.

    [Figure: two truncated dendrograms, titled "show_leaf_counts = False" and "show_leaf_counts = True"]
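
The effect on the leaf labels can also be inspected without plotting. The sketch below is only an illustration: the data, the seed, and the helper `leaf_size` are assumptions, not part of the answer's script. It reads the labels from the 'ivl' entry returned by dendrogram, where a contracted leaf is labelled "(k)" for k original points when show_leaf_counts is True.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 100 random 2-d points with an arbitrary seed.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))
Z = linkage(pts, "single")

# Same truncation as in the example above, with plotting suppressed.
with_counts = dendrogram(Z, p=6, truncate_mode="lastp",
                         show_leaf_counts=True, no_plot=True)
without_counts = dendrogram(Z, p=6, truncate_mode="lastp",
                            show_leaf_counts=False, no_plot=True)

print(with_counts["ivl"])      # contracted leaves appear as "(k)"
print(without_counts["ivl"])   # contracted leaves get an empty label


def leaf_size(label):
    """Hypothetical helper: points behind one leaf label of with_counts."""
    # "(k)" marks a contracted cluster of k points; any other label is a
    # single original observation.
    return int(label[1:-1]) if label.startswith("(") else 1


# Every original point sits under exactly one leaf of the truncated tree.
total = sum(leaf_size(label) for label in with_counts["ivl"])
print(total)
```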
