Capturing high multi-collinearity in statsmodels


Say I fit a model in statsmodels

mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()

When I do mod.summary(), the output sometimes includes a warning that the condition number is large, which may indicate strong multicollinearity. How can I capture high multi-collinearity like this in a variable, so I can check for it programmatically?
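
For reference, a self-contained sketch of this setup with made-up data (the column names here are just placeholders for my real dataframe):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up, uncorrelated data purely to make the example reproducible;
# the real columns are the ones that trigger the warning
np.random.seed(0)
df = pd.DataFrame({
    'dependent': np.random.rand(50),
    'first_category': np.random.rand(50),
    'second_category': np.random.rand(50),
    'other': np.random.rand(50),
})

mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
print(mod.summary())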

2 Answers

    Based on a similar question for R, there are a couple of other options that may help. I was looking for a single number that captures the degree of collinearity, and two candidates are the determinant and the condition number of the correlation matrix.

    According to one of the R answers, the determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.

    Translated example for determinant:

    import numpy as np
    import pandas as pd
    
    # Create a sample random dataframe
    np.random.seed(321)
    x1 = np.random.rand(100)
    x2 = np.random.rand(100)
    x3 = np.random.rand(100)
    df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
    
    # Now create a dataframe with multicollinearity
    multicollinear_df = df.copy()
    multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
    
    # Compute both correlation matrices
    corr = np.corrcoef(df, rowvar=False)
    multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=False)
    
    # Compare the determinants
    print(np.linalg.det(corr))                  # 0.988532159861 (close to 1: little collinearity)
    print(np.linalg.det(multicollinear_corr))   # 2.97779797328e-16 (essentially 0: perfect collinearity)
    

    And similarly, the condition number of the correlation matrix approaches infinity as the variables approach perfect linear dependence.

    print(np.linalg.cond(corr))                  # 1.23116253259 (small: well-conditioned)
    print(np.linalg.cond(multicollinear_corr))   # 6.19985218873e+15 (huge: near-singular)
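
    To tie this back to the statsmodels model in the question: the same kind of check can be run on the fitted design matrix itself, which is, as far as I know, what the "condition number is large" warning in summary() is based on. A sketch, assuming the multicollinear_df built above plus a made-up response column y:

    import statsmodels.formula.api as smf

    # Throwaway response variable so we can regress on the collinear columns
    multicollinear_df['y'] = np.random.rand(100)

    mc_mod = smf.ols('y ~ x1 + x2 + x3', data=multicollinear_df).fit()

    # Condition number of the design matrix (intercept, x1, x2, x3);
    # it blows up because x3 is an exact sum of x1 and x2
    print(np.linalg.cond(mc_mod.model.exog))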
    
