I'm doing logistic regression using pandas 0.11.0 (data handling) and statsmodels 0.4.3 (to do the actual regression), on Mac OSX Lion.
I'm going to be running ~2,900 different logistic regression models and need the results written to a csv file and formatted in a particular way.
Currently, I'm only aware of doing print result.summary(), which prints the results (as follows) to the shell:
                               Logit Regression Results
    ==============================================================================
    Dep. Variable:            death_death   No. Observations:                 9752
    Model:                          Logit   Df Residuals:                     9747
    Method:                           MLE   Df Model:                            4
    Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
    Time:                        22:15:05   Log-Likelihood:                -5806.9
    converged:                       True   LL-Null:                       -5655.8
                                            LLR p-value:                     1.000
    ===============================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
    -------------------------------------------------------------------------------
    age_age5064    -0.1999      0.055     -3.619      0.000        -0.308    -0.092
    age_age6574    -0.2553      0.053     -4.847      0.000        -0.359    -0.152
    sex_female     -0.2515      0.044     -5.765      0.000        -0.337    -0.166
    stage_early    -0.1838      0.041     -4.528      0.000        -0.263    -0.104
    access         -0.0102      0.001    -16.381      0.000        -0.011    -0.009
    ===============================================================================
I will also need the odds ratio, which is computed by print np.exp(result.params), and is printed in the shell as such:
    age_age5064    0.818842
    age_age6574    0.774648
    sex_female     0.777667
    stage_early    0.832098
    access         0.989859
    dtype: float64
What I need is for each of these to be written to a csv file in the form of a very long row like the following (I am not sure, at this point, whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):
`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`
I think you get the picture - a very long row, with all of these actual values, and a header with all the column designations in a similar format.
I am familiar with the csv module in Python, and am becoming more familiar with pandas. I am not sure whether this info could be formatted and stored in a pandas DataFrame and then written, using to_csv, to a file once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing them as each model is completed is also fine (using the csv module).
UPDATE:
So, I was looking more at the statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think inheriting from this class to create another class, with some of the methods/operators changed, might be the way to get the formatting I require. I have very little experience doing this, and will need to spend quite a bit of time figuring it out (which is fine). If anybody can help or has more experience, that would be awesome!
Here is the site where the classes are laid out: statsmodels results class
There is no premade table of parameters and their result statistics currently available.
Essentially you need to stack all the results yourself; whether in a list, a numpy array or a pandas DataFrame depends on what's more convenient for you.
For example, if I want one numpy array per model that holds llf plus the columns of the summary parameter table, then I could use
    import numpy as np

    res_all = []
    for res in results:  # results: an iterable of fitted results instances
        # conf_int() has one row per parameter; transpose to unpack the columns
        low, upp = np.asarray(res.conf_int()).T
        res_all.append(np.concatenate(([res.llf], res.params, res.tvalues,
                                       res.pvalues, low, upp)))
But it might be better to align with pandas, depending on what structure you have across models.
You could write a helper function that takes all the results from the results instance and concatenates them in a row.
(I'm not sure what's the most convenient for writing to csv by rows)
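A minimal sketch of such a helper, using only the standard library so each row can be written with the csv module. The names flatten_result and write_rows are made up for illustration. It assumes Python 3, that every model has the same terms (so one header fits all rows), and that the results attributes llf, params, bse, tvalues and pvalues can be indexed by term name, as they can when the model was fit on pandas data:

```python
import csv
import math
from collections import OrderedDict


def flatten_result(res):
    """Flatten one fitted model into an ordered {column_name: value} row.

    Hypothetical helper: `res` is assumed to expose llf, params, bse,
    tvalues and pvalues, with the per-term ones indexable by term name.
    """
    row = OrderedDict()
    row["Log-Likelihood"] = res.llf
    names = list(res.params.keys())
    # Per-term statistics first, in the order the terms appear in the model.
    for name in names:
        row[name + "_coef"] = res.params[name]
        row[name + "_std_err"] = res.bse[name]
        row[name + "_z"] = res.tvalues[name]
        row[name + "_p>|z|"] = res.pvalues[name]
    # Odds ratios go at the end of the row: exp(coefficient).
    for name in names:
        row[name + "_odds_ratio"] = math.exp(res.params[name])
    return row


def write_rows(path, rows):
    """Write the flattened rows to `path`, taking the header from the first row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

With a list of fitted models called results, something like write_rows("output.csv", [flatten_result(r) for r in results]) would give one header line and one long row per model. If the models do not share the same terms, the header would have to be built from the union of all keys instead.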
edit:
Here is an example storing the regression results in a dataframe
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21
the loop is on line 159.
summary() and similar code outside of statsmodels (for example http://johnbeieler.org/py_apsrtable/ for combining several results) is oriented towards printing, not towards storing variables.
If you want the coefficients, results.params will give them to you. If you want the p-values, use results.pvalues. In any case, you can use dir(results) to list all the attributes of the object.
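As a rough illustration of the dir() approach, the sketch below filters the listing down to public, non-callable attributes, which is where values such as params and pvalues live. The name public_attributes is made up for this example. Note that it reads each attribute, so on a statsmodels results object it may trigger computation of lazily evaluated statistics:

```python
def public_attributes(obj):
    """Return the sorted names of public, non-callable attributes of obj."""
    names = []
    for name in dir(obj):  # dir() already returns names in sorted order
        if name.startswith("_"):
            continue  # skip private and dunder names
        try:
            value = getattr(obj, name)
        except Exception:
            continue  # some attributes may fail to evaluate; ignore those
        if not callable(value):
            names.append(name)
    return names
```

Calling public_attributes(results) on a fitted model should then give a shorter list to scan than the raw dir(results) output.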
I found this formulation to be a little more straightforward. You can add/subtract columns by following the syntax from the examples (pvals,coeff,conf_lower,conf_higher).
    import pandas as pd  # This can be left out if already present...

    def results_summary_to_dataframe(results):
        '''This takes the result of a statsmodels results table and transforms it into a dataframe'''
        pvals = results.pvalues
        coeff = results.params
        # conf_int() is indexed by column here, i.e. it is expected to be a DataFrame
        conf_lower = results.conf_int()[0]
        conf_higher = results.conf_int()[1]

        results_df = pd.DataFrame({"pvals": pvals,
                                   "coeff": coeff,
                                   "conf_lower": conf_lower,
                                   "conf_higher": conf_higher
                                   })

        # Reordering...
        results_df = results_df[["coeff", "pvals", "conf_lower", "conf_higher"]]
        return results_df
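To get from a per-term table like this to the single long row the question asks for, one option is to flatten the frame into columns named term_statistic. This is only a sketch: summary_frame_to_wide_row is a made-up name, and it assumes a DataFrame with one row per term and one column per statistic, and a reasonably recent pandas:

```python
import pandas as pd


def summary_frame_to_wide_row(results_df, model_name):
    """Flatten a (term x statistic) frame into a one-row frame.

    Hypothetical helper: column names become "<term>_<statistic>" and the
    single row is labelled with model_name.
    """
    row = {}
    for term, stats in results_df.iterrows():
        for stat, value in stats.items():
            row["{}_{}".format(term, stat)] = value
    return pd.DataFrame([row], index=[model_name])
```

The one-row frames for all models could then be combined with pd.concat and written once with to_csv, which gives a header line plus one row per model.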
    write_path = '/my/path/here/output.csv'
    with open(write_path, 'w') as f:
        f.write(result.summary().as_csv())
There is actually a built-in method for this, as_csv(), described in the statsmodels documentation:
    with open('csvfile.csv', 'w') as f:
        f.write(result.summary().as_csv())
I believe this is a much easier (and cleaner) way to output the summaries to csv files.
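With ~2,900 models, the same call can be wrapped in a loop that writes one file per model. A minimal sketch, where write_summaries_as_csv and the (name, result) pairs are assumptions made for illustration and Python 3 is assumed:

```python
import os


def write_summaries_as_csv(named_results, out_dir):
    """Write res.summary().as_csv() for each (name, res) pair to out_dir/<name>.csv."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, res in named_results:
        path = os.path.join(out_dir, "{}.csv".format(name))
        with open(path, "w") as f:
            f.write(res.summary().as_csv())
        paths.append(path)
    return paths
```

Keep in mind that as_csv() mirrors the layout of the printed summary, so this produces one summary-shaped file per model and not the single long row per model described in the question.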
Source: https://stackoverflow.com/questions/16705598/python-2-7-statsmodels-formatting-and-writing-summary-output