Python Negative Binomial Regression - Results Don't Match those from R

前端 未结 1 1616
执笔经年
执笔经年 2021-01-13 10:08

I\'m experimenting with negative binomial regression using Python. I found this example using R, along with a data set:

http://www.karlin.mff.cuni.cz/~pesta/NMFM404/

相关标签:
1条回答
  • 2021-01-13 11:09

    The reason for the discrepancy is that when you are reading in the dataset with Pandas, the prog variable is treated as type float by default:

    df.prog.head()
    
    0    2.0
    1    2.0
    2    2.0
    3    2.0
    4    2.0
    Name: prog, dtype: float32
    

    In the R example, on the other hand, the prog variable was explicitly cast to a factor (categorical) variable:

    dat <- within(dat, {
        prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
        id <- factor(id)
    })
    

    As a result, when you look at the fit summary in R, you can see that the prog variable has been split into n-1 binary-encoded terms:

    > summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
    
    Call:
    glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
        link = log)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.1547  -1.0192  -0.3694   0.2285   2.5273  
    
    Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
    (Intercept)     2.615265   0.197460  13.245  < 2e-16 ***
    math           -0.005993   0.002505  -2.392   0.0167 *  
    progAcademic   -0.440760   0.182610  -2.414   0.0158 *  
    progVocational -1.278651   0.200720  -6.370 1.89e-10 ***
    

    Compare this to how the prog variable appears in the Python fit summary you posted.

    To fix the issue, you can use the C() function to cast the variable to categorical in statsmodels. This way you will arrive at the same result:

    model = smf.glm(formula = "daysabs ~ math + C(prog)", data=df, family=sm.families.NegativeBinomial()).fit()
    model.summary()
    
    <class 'statsmodels.iolib.summary.Summary'>
    """
                     Generalized Linear Model Regression Results                  
    ==============================================================================
    Dep. Variable:                daysabs   No. Observations:                  314
    Model:                            GLM   Df Residuals:                      310
    Model Family:        NegativeBinomial   Df Model:                            3
    Link Function:                    log   Scale:                   1.06830885199
    Method:                          IRLS   Log-Likelihood:                -865.68
    Date:                Thu, 16 Feb 2017   Deviance:                       350.98
    Time:                        10:34:16   Pearson chi2:                     331.
    No. Iterations:                     6                                         
    ==================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
    ----------------------------------------------------------------------------------
    Intercept          2.6150      0.207     12.630      0.000       2.209       3.021
    C(prog)[T.2.0]    -0.4408      0.192     -2.302      0.021      -0.816      -0.065
    C(prog)[T.3.0]    -1.2786      0.210     -6.079      0.000      -1.691      -0.866
    math              -0.0060      0.003     -2.281      0.023      -0.011      -0.001
    ==================================================================================
    """
    
    0 讨论(0)
提交回复
热议问题