Comparison of R, statsmodels and scikit-learn for a classification task with logistic regression

感情败类 2020-12-14 21:38

I have run some experiments with logistic regression in R, Python statsmodels and scikit-learn. While the results given by R and statsmodels agree, there is some discrepancy with the results returned by scikit-learn.

2 Answers
  •  长情又很酷
    2020-12-14 22:08

    I ran into a similar issue and ended up posting about it on /r/MachineLearning. It turns out the difference can be attributed to data standardization: scikit-learn's solver (and its default L2 regularization) is sensitive to the scale of the features, so it yields better results when the data is standardized. scikit-learn's documentation on preprocessing data (including standardization) covers this.
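
    As a minimal illustration of what `preprocessing.scale` (used in the code below) actually does — on synthetic data, not the `Default` set — each column is shifted to zero mean and rescaled to unit variance:

    ```python
    import numpy as np
    from sklearn import preprocessing

    # Two features on very different scales, as with 'balance' and 'income'.
    X = np.array([[1.0, 20000.0],
                  [2.0, 40000.0],
                  [3.0, 60000.0]])

    X_scaled = preprocessing.scale(X)
    print(X_scaled.mean(axis=0))  # approximately [0. 0.]
    print(X_scaled.std(axis=0))   # [1. 1.]
    ```

    After this transformation both features contribute on the same scale, so the L2 penalty no longer shrinks the large-scale feature's coefficient disproportionately.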

    Results

    Number of 'default' values : 333
    Intercept: [-6.12556565]
    Coefficients: [[ 2.73145133  0.27750788]]
    
    Confusion matrix
    [[9629   38]
     [ 225  108]]
    
    Score          0.9737
    Precision      0.7397
    Recall         0.3243
    

    Code

    # scikit-learn vs. R
    # http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg
    
    import pandas as pd
    import sklearn
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn import preprocessing
    
    # Data is available here.
    Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)
    
    Default['default'] = Default['default'].map({'No':0, 'Yes':1})
    Default['student'] = Default['student'].map({'No':0, 'Yes':1})
    
    I = Default['default'] == 0
    print("Number of 'default' values : {0}".format(Default[~I]['balance'].count()))
    
    feats = ['balance', 'income']
    
    Default[feats] = preprocessing.scale(Default[feats])
    
    # C = 1e6 ~ effectively no regularization (C is the inverse penalty strength).
    classifier = LogisticRegression(C = 1e6, random_state = 42)
    
    classifier.fit(Default[feats], Default['default'])  # fit on the full dataset (no train/test split)
    print("Intercept: {0}".format(classifier.intercept_))
    print("Coefficients: {0}".format(classifier.coef_))
    
    y_true = Default['default']
    # Thresholding the positive-class probability at 0.5 is equivalent to classifier.predict.
    y_pred_cls = classifier.predict_proba(Default[feats])[:, 1] > 0.5
    
    # Accuracy, precision and recall computed from the confusion matrix
    # (rows are true classes, columns are predicted classes).
    confusion = confusion_matrix(y_true, y_pred_cls)
    score = float(confusion[0, 0] + confusion[1, 1]) / confusion.sum()
    precision = float(confusion[1, 1]) / (confusion[1, 1] + confusion[0, 1])
    recall = float(confusion[1, 1]) / (confusion[1, 1] + confusion[1, 0])
    print("\nConfusion matrix")
    print(confusion)
    print('\n{s:{c}<{n}}{num:2.4}'.format(s = 'Score', n = 15, c = '', num = score))
    print('{s:{c}<{n}}{num:2.4}'.format(s = 'Precision', n = 15, c = '', num = precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s = 'Recall', n = 15, c = '', num = recall))
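
    For completeness, here is a small cross-check (not part of the original answer; the data is simulated here rather than taken from `Default.csv`): with standardized features and `C` large enough to disable the penalty, scikit-learn and statsmodels' unpenalized `Logit` should land on essentially the same coefficients.

    ```python
    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(42)
    X = rng.randn(1000, 2)                      # standardized by construction
    beta = np.array([-1.0, 2.0, 0.5])           # true intercept and slopes
    p = 1.0 / (1.0 + np.exp(-(beta[0] + np.dot(X, beta[1:]))))
    y = rng.binomial(1, p)

    # scikit-learn with an effectively unpenalized fit (C = 1e6).
    sk = LogisticRegression(C=1e6).fit(X, y)

    # statsmodels maximum-likelihood Logit; the intercept must be added explicitly.
    sm_fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

    print(sk.intercept_, sk.coef_)
    print(sm_fit.params)  # should agree closely with the scikit-learn estimates
    ```

    If the two disagree noticeably on your real data, the usual suspects are regularization (too small a `C`) or unscaled features, which is exactly the issue in the question.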
    
