Run an OLS regression with a pandas DataFrame

温柔的废话 2020-11-30 16:48

I have a pandas DataFrame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
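
    import pandas as pd
    df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                       "B": [20, 30, 10, 40, 50],
                       "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df).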



        
5 Answers
  • 2020-11-30 17:32

    I think you can do almost exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before version 0.20.0 (it was used for a few things in pandas.stats):

    >>> import pandas as pd
    >>> import statsmodels.formula.api as sm
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> result = sm.ols(formula="A ~ B + C", data=df).fit()
    >>> print(result.params)
    Intercept    14.952480
    B             0.401182
    C             0.000352
    dtype: float64
    >>> print(result.summary())
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      A   R-squared:                       0.579
    Model:                            OLS   Adj. R-squared:                  0.158
    Method:                 Least Squares   F-statistic:                     1.375
    Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
    Time:                        20:04:30   Log-Likelihood:                -18.178
    No. Observations:                   5   AIC:                             42.36
    Df Residuals:                       2   BIC:                             41.19
    Df Model:                           2                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
    B              0.4012      0.650      0.617      0.600        -2.394     3.197
    C              0.0004      0.001      0.650      0.583        -0.002     0.003
    ==============================================================================
    Omnibus:                          nan   Durbin-Watson:                   1.061
    Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
    Skew:                          -0.123   Prob(JB):                        0.780
    Kurtosis:                       1.474   Cond. No.                     5.21e+04
    ==============================================================================
    
    Warnings:
    [1] The condition number is large, 5.21e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.
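
    If the goal is actually predicting column A (as in the question) rather than
    just reading the summary, the fitted result can be reused; a minimal sketch,
    output omitted:

    >>> predictions = result.predict(df)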
    
  • 2020-11-30 17:32

    "This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place."

    No it doesn't; just convert to a NumPy array:

    >>> import numpy as np
    >>> data = np.asarray(df)
    

    This takes constant time for a homogeneously typed frame like this one, because it just creates a view on your data. Then feed it to scikit-learn:

    >>> from sklearn.linear_model import LinearRegression
    >>> lr = LinearRegression()
    >>> X, y = data[:, 1:], data[:, 0]
    >>> lr.fit(X, y)
    LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
    >>> lr.coef_
    array([  4.01182386e-01,   3.51587361e-04])
    >>> lr.intercept_
    14.952479503953672
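    >>> # a sketch: reuse the fitted model to predict column A, the asker's goal
    >>> predicted = lr.predict(X)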
    
  • 2020-11-30 17:35

    Note: pandas.stats was removed in pandas 0.20.0, so the following only works with earlier versions.

    In those versions, it was possible to do this with pandas.stats.ols:

    >>> from pandas.stats.api import ols
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> res = ols(y=df['A'], x=df[['B','C']])
    >>> res
    -------------------------Summary of Regression Analysis-------------------------
    
    Formula: Y ~ <B> + <C> + <intercept>
    
    Number of Observations:         5
    Number of Degrees of Freedom:   3
    
    R-squared:         0.5789
    Adj R-squared:     0.1577
    
    Rmse:             14.5108
    
    F-stat (2, 2):     1.3746, p-value:     0.4211
    
    Degrees of Freedom: model 2, resid 2
    
    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
                 C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
         intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
    ---------------------------------End of Summary---------------------------------
    

    Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
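
    Since pandas.stats is gone from modern pandas, a rough equivalent using
    statsmodels directly (a sketch; add_constant supplies the intercept, so the
    coefficients should match the first answer's):

    >>> import statsmodels.api as sm
    >>> res = sm.OLS(df['A'], sm.add_constant(df[['B', 'C']])).fit()
    >>> res.params  # should match the Intercept/B/C coefficients above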

  • 2020-11-30 17:36

    Statsmodels can build an OLS model with column references directly to a pandas DataFrame.

    Short and sweet:

    model = sm.OLS(df[y], df[x]).fit()


    Code details and regression summary:

    # imports
    import pandas as pd
    import statsmodels.api as sm
    import numpy as np
    
    # data
    np.random.seed(123)
    df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))
    
    # assign dependent and independent / explanatory variables
    variables = list(df.columns)
    y = 'A'
    x = [var for var in variables if var != y]  # every column except the target
    
    # Ordinary least squares regression (note: no intercept term here)
    model_Simple = sm.OLS(df[y], df[x]).fit()
    
    # Add a constant (intercept) term like so:
    model = sm.OLS(df[y], sm.add_constant(df[x])).fit()
    
    model.summary()
    

    Output:

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      A   R-squared:                       0.019
    Model:                            OLS   Adj. R-squared:                 -0.001
    Method:                 Least Squares   F-statistic:                    0.9409
    Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
    Time:                        08:35:04   Log-Likelihood:                -484.49
    No. Observations:                 100   AIC:                             975.0
    Df Residuals:                      97   BIC:                             982.8
    Df Model:                           2                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const         43.4801      8.809      4.936      0.000      25.996      60.964
    B              0.1241      0.105      1.188      0.238      -0.083       0.332
    C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
    ==============================================================================
    Omnibus:                       50.990   Durbin-Watson:                   2.013
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
    Skew:                           0.032   Prob(JB):                       0.0317
    Kurtosis:                       1.714   Cond. No.                         231.
    ==============================================================================
    

    How to directly get the R-squared, coefficients and p-values:

    # commands:
    model.params
    model.pvalues
    model.rsquared
    
    # demo:
    In[1]: 
    model.params
    Out[1]:
    const    43.480106
    B         0.124130
    C        -0.075156
    dtype: float64
    
    In[2]: 
    model.pvalues
    Out[2]: 
    const    0.000003
    B        0.237924
    C        0.497400
    dtype: float64
    
    In[3]: 
    model.rsquared
    Out[3]:
    0.0190
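    
    # a sketch: reuse the fitted model for in-sample predictions of column A
    predictions = model.predict(sm.add_constant(df[x]))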
    
  • 2020-11-30 17:41

    I don't know if this is new in sklearn or pandas, but I'm able to pass the DataFrame directly to sklearn without first converting it to a NumPy array or any other data type.

    from sklearn import linear_model
    
    reg = linear_model.LinearRegression()
    reg.fit(df[['B', 'C']], df['A'])
    
    >>> reg.coef_
    array([  4.01182386e-01,   3.51587361e-04])
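    >>> # a sketch: the fitted model also accepts DataFrame columns for prediction
    >>> predictions = reg.predict(df[['B', 'C']])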
    