Plot correlation matrix using pandas

渐次进展 2020-11-30 16:23

I have a data set with a huge number of features, so analysing the correlation matrix has become very difficult. I want to plot a correlation matrix which we get using d

12 answers
  •  悲哀的现实
    2020-11-30 16:58

    If your main goal is to visualize the correlation matrix, rather than to create a plot per se, the pandas styling options are a convenient built-in solution:

    import pandas as pd
    import numpy as np
    
    rs = np.random.RandomState(0)
    df = pd.DataFrame(rs.rand(10, 10))
    corr = df.corr()
    corr.style.background_gradient(cmap='coolwarm')
    # 'RdBu_r' & 'BrBG' are other good diverging colormaps
    

    Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook. (The automatic light text on dark backgrounds is from an existing PR and not the latest released version, pandas 0.23).


    Styling

    You can easily limit the digit precision:

    corr.style.background_gradient(cmap='coolwarm').set_precision(2)
    # set_precision() is deprecated since pandas 1.3; use .format(precision=2) there
    

    Or get rid of the digits altogether if you prefer the matrix without annotations:

    corr.style.background_gradient(cmap='coolwarm').set_properties(**{'font-size': '0pt'})
    

    The styling documentation also includes instructions for more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over.


    Time comparison

    In my testing, style.background_gradient() was 4x faster than plt.matshow() and 120x faster than sns.heatmap() with a 10x10 matrix. Unfortunately it doesn't scale as well as plt.matshow(): the two take about the same time for a 100x100 matrix, and plt.matshow() is 10x faster for a 1000x1000 matrix.


    Saving

    There are a few possible ways to save the stylized dataframe:

    • Return the HTML by appending the render() method (to_html() in pandas >= 1.3) and then write the output to a file.
    • Save as an .xlsx file with conditional formatting by appending the to_excel() method.
    • Combine with imgkit to save a bitmap.
    • Take a screenshot (for less formal purposes).

    Update for pandas >= 0.24

    By setting axis=None, it is now possible to compute the colors based on the entire matrix rather than per column or per row:

    corr.style.background_gradient(cmap='coolwarm', axis=None)
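
    To see what the axis argument changes, it can help to compute the color scaling by hand. The sketch below is only an illustration of the idea (min-max scaling per column versus over the whole matrix), not pandas' actual implementation, and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
corr = pd.DataFrame(rs.rand(10, 10)).corr()

# Per-column scaling (roughly what axis=0 amounts to):
# every column is stretched to span 0..1 on its own
per_column = (corr - corr.min()) / (corr.max() - corr.min())

# Whole-matrix scaling (roughly what axis=None amounts to):
# one shared minimum and maximum for all cells
lo, hi = corr.values.min(), corr.values.max()
whole = (corr - lo) / (hi - lo)

# With per-column scaling every column contains a 0 (its coldest color)
print(per_column.min().tolist())
# With whole-matrix scaling only the column(s) holding the global minimum do
print(whole.min().round(2).tolist())
```

    With per-column scaling, the same correlation value can end up with different colors in different columns; with axis=None, equal values always get the same color.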
    
