Plot correlation matrix using pandas

渐次进展 2020-11-30 16:23

I have a data set with a huge number of features, so analysing the correlation matrix has become very difficult. I want to plot a correlation matrix which we get using d

12 Answers
  • 2020-11-30 16:52

    statsmodels graphics also gives a nice view of a correlation matrix:

    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    
    corr = dataframe.corr()
    sm.graphics.plot_corr(corr, xnames=list(corr.columns))
    plt.show()
    
  • 2020-11-30 16:52

    Alongside the other methods, a pairplot is also useful: it draws a scatter plot for every pair of columns.

    import pandas as pd
    import numpy as np
    import seaborn as sns
    rs = np.random.RandomState(0)
    df = pd.DataFrame(rs.rand(10, 10))
    sns.pairplot(df)
    
  • 2020-11-30 16:54

    First compute the correlation matrix. In my case, zdf is the dataframe whose correlation matrix I need.

    corrMatrix = zdf.corr()
    corrMatrix.to_csv('sm_zscaled_correlation_matrix.csv')
    # pandas >= 1.3: .format(precision=2).to_html(); older versions used .set_precision(2).render()
    html = corrMatrix.style.background_gradient(cmap='RdBu').format(precision=2).to_html()
    
    # Write the output to an HTML file.
    with open('test.html', 'w') as f:
        print('<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Document</title><style>table{word-break: break-all;}</style></head><body>' + html + '</body></html>', file=f)
    

    Then we can take a screenshot, or convert the HTML to an image file.

  • 2020-11-30 16:56

    I'm surprised that no one has mentioned more capable, interactive, and easier-to-use alternatives.

    A) You can use plotly. With just two lines you get:

    1. interactivity,
    2. smooth scale,
    3. colors based on the whole dataframe instead of individual columns,
    4. column names & row indices on axes,
    5. zooming in,
    6. panning,
    7. built-in one-click ability to save it as a PNG,
    8. auto-scaling,
    9. comparison on hovering,
    10. bubbles showing values, so the heatmap still looks good and you can see values wherever you want:

    import plotly.express as px
    fig = px.imshow(df.corr())
    fig.show()
    

    B) You can also use Bokeh:

    All the same functionality with a tad more hassle. But it is still worth it if you do not want to opt in to plotly and still want all these things:

    from bokeh.plotting import figure, show, output_notebook
    from bokeh.models import BasicTicker, ColorBar, LinearColorMapper, PrintfTickFormatter
    output_notebook()
    colors = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641']
    TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"
    corr = df.corr()
    data = corr.stack().rename("value").reset_index()
    # One shared mapper so the cells and the color bar use the same scale
    mapper = LinearColorMapper(palette=colors, low=data.value.min(), high=data.value.max())
    # Categorical ranges must be lists of strings, so this assumes string column labels
    p = figure(x_range=list(corr.columns), y_range=list(corr.index), tools=TOOLS, toolbar_location='below',
               tooltips=[('Row, Column', '@level_0 x @level_1'), ('value', '@value')], height=500, width=500)
    
    p.rect(x="level_1", y="level_0", width=1, height=1,
           source=data,
           fill_color={'field': 'value', 'transform': mapper},
           line_color=None)
    color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="7px",
                         ticker=BasicTicker(desired_num_ticks=len(colors)),
                         formatter=PrintfTickFormatter(format="%f"),
                         label_standoff=6, border_line_color=None, location=(0, 0))
    p.add_layout(color_bar, 'right')
    
    show(p)
    

  • 2020-11-30 16:58

    If your main goal is to visualize the correlation matrix, rather than to create a plot per se, the convenient pandas styling options are a viable built-in solution:

    import pandas as pd
    import numpy as np
    
    rs = np.random.RandomState(0)
    df = pd.DataFrame(rs.rand(10, 10))
    corr = df.corr()
    corr.style.background_gradient(cmap='coolwarm')
    # 'RdBu_r' & 'BrBG' are other good diverging colormaps
    

    Note that this needs to be in a backend that supports rendering HTML, such as the JupyterLab Notebook. (The automatic light text on dark backgrounds is from an existing PR and not the latest released version, pandas 0.23).


    Styling

    You can easily limit the digit precision:

    # pandas >= 1.3; older versions used .set_precision(2) instead
    corr.style.background_gradient(cmap='coolwarm').format(precision=2)
    

    Or get rid of the digits altogether if you prefer the matrix without annotations:

    corr.style.background_gradient(cmap='coolwarm').set_properties(**{'font-size': '0pt'})
    

    The styling documentation also includes instructions for more advanced styles, such as how to change the display of the cell the mouse pointer is hovering over. To save the output you could return the HTML by appending the render() method (to_html() in pandas 1.3 and later) and then write it to a file (or just take a screenshot for less formal purposes).


    Time comparison

    In my testing, style.background_gradient() was 4x faster than plt.matshow() and 120x faster than sns.heatmap() with a 10x10 matrix. Unfortunately it doesn't scale as well as plt.matshow(): the two take about the same time for a 100x100 matrix, and plt.matshow() is 10x faster for a 1000x1000 matrix.
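
plt.matshow() is named in the comparison above without an example. A minimal sketch of that approach for a larger matrix follows; the random 100x100 data, the fixed -1 to 1 color range, and the output filename are assumptions made for illustration, not part of the timing test described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 100 features of uniform random noise
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(100, 100))
corr = df.corr()

fig, ax = plt.subplots(figsize=(8, 8))
# Fix the color range to [-1, 1], the full range a correlation can take
im = ax.matshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax)

# "corr_matshow.png" is a hypothetical output path
fig.savefig("corr_matshow.png", dpi=150)
```

With this many features the tick labels become unreadable, so the sketch leaves the default numeric ticks in place.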


    Saving

    There are a few possible ways to save the stylized dataframe:

    • Return the HTML by appending the render() method (to_html() in pandas 1.3 and later) and then write the output to a file.
    • Save as an .xlsx file with conditional formatting by appending the to_excel() method.
    • Combine with imgkit to save a bitmap.
    • Take a screenshot (for less formal purposes).

    Update for pandas >= 0.24

    By setting axis=None, it is now possible to compute the colors based on the entire matrix rather than per column or per row:

    corr.style.background_gradient(cmap='coolwarm', axis=None)
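
To see what the axis argument changes, the sketch below reproduces the two scalings by hand with plain pandas arithmetic. It is an approximation of the idea (min-max scaling before the colormap lookup), not the library's actual implementation, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
corr = pd.DataFrame(rs.rand(10, 10)).corr()

# Per-column scaling (axis=0, the default): each column is stretched
# over its own 0..1 range, so equal colors in different columns can
# stand for different correlation values
per_column = (corr - corr.min()) / (corr.max() - corr.min())

# Whole-matrix scaling (axis=None): one shared range for every cell,
# so equal colors always stand for equal values
lo, hi = corr.to_numpy().min(), corr.to_numpy().max()
whole = (corr - lo) / (hi - lo)

print(per_column.min().round(3).tolist())  # every column reaches 0
print(round(float(whole.min().max()), 3))  # most columns no longer reach 0
```

With the shared range, only the column holding the global minimum reaches the bottom of the colormap, which is what makes colors comparable across the whole matrix.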
    

  • 2020-11-30 16:58

    For completeness, the simplest solution I know of with seaborn as of late 2019, if one is using Jupyter:

    import seaborn as sns
    sns.heatmap(dataframe.corr())
    