Assessing/Improving prediction with linear discriminant analysis or logistic regression


I recently needed to combine two or more variables in a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. …


    Here's my input concerning logistic regression and prediction (I don't know much about linear discriminant analysis, but I understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor whether this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and that it addresses at least some of your questions. If not, and if other users think this would belong better on Cross Validated, I won't take offense. :)

    Sample data

    We'll first generate 200 observations, with increasing levels of probability for Case=1. The first predictor (pred1) will be linked to the probability of being a Case through a nonlinear, roughly sigmoid curve, close to the shape modeled by logistic regression, so it will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.

    set.seed(2351)
    df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE), 
                              sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE), 
                              sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
                     pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
                     pred2 = runif(n = 200, min = 0, max = 100))
    

    We see in the boxplot below that the observations where Case==1 generally have higher pred1, which is intended (from the way we generated the data). At the same time, there is an overlap; otherwise deciding on a cutoff point/threshold would be too easy.

    boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")
    

    [Figure: boxplot of pred1 by Case]

    Fitting the logistic model

    First using both predictors:

    model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
    summary(model.1)
    
    # Coefficients:
    #              Estimate Std. Error z value Pr(>|z|)    
    # (Intercept) -2.058258   0.479094  -4.296 1.74e-05 ***
    # pred1        0.428491   0.075373   5.685 1.31e-08 ***
    # pred2        0.003399   0.005500   0.618    0.537    
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    # 
    # (Dispersion parameter for binomial family taken to be 1)
    # 
    #     Null deviance: 276.76  on 199  degrees of freedom
    # Residual deviance: 238.51  on 197  degrees of freedom
    # AIC: 244.51
    

    As we'd expect, the first predictor is rather strongly related to the outcome, while the second is only poorly related to it.

    Note that to get Odds Ratios from those coefficients, we need to exponentiate them:

    exp(model.1$coefficients[2:3])
    
    #    pred1    pred2 
    # 1.534939 1.003405   # Odds Ratios (making the relationships appear more clearly). 
                          # Use `exp(confint(model.1))` to get confidence intervals.
    

    We'll compare this model to a simpler model, removing the second predictor:

    model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
    summary(model.2)
    
    # Coefficients:
    #             Estimate Std. Error z value Pr(>|z|)    
    # (Intercept) -1.87794    0.37452  -5.014 5.32e-07 ***
    # pred1        0.42651    0.07514   5.676 1.38e-08 ***
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # (Dispersion parameter for binomial family taken to be 1)
    #
    #     Null deviance: 276.76  on 199  degrees of freedom
    # Residual deviance: 238.89  on 198  degrees of freedom
    # AIC: 242.89
    
    exp(model.2$coefficients)[2]
    
    #    pred1 
    # 1.531907  # Odds Ratio
    

    We could also run anova(model.1, model.2) to compare the two models (a quick sketch follows below), but let's keep the simpler model and move on to prediction, since the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they are truly random noise, but here the focus is on the operation of predicting and on choosing a proper threshold.
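
    For completeness, here is a minimal sketch of that comparison as a likelihood-ratio test on the two nested models fitted above. The deviance difference is 238.89 - 238.51 ≈ 0.38 on 1 degree of freedom (p roughly 0.54), consistent with the non-significant Wald test for pred2; model.2 also has the slightly lower AIC (242.89 vs 244.51).

    # Likelihood-ratio test comparing the nested models;
    # a non-significant p-value supports keeping the simpler model.2
    anova(model.2, model.1, test = "Chisq")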

    Stored predictions

    In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response"), and they can be interpreted as probabilities: one for each row, based on the predictor(s) and their coefficient(s).
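
    As a quick check of that equivalence (a minimal sketch, using only the objects already created):

    # The stored fitted values and predict(..., type = "response") are the same probabilities
    all.equal(as.numeric(model.2$fitted.values),
              as.numeric(predict(model.2, type = "response")))   # TRUE
    head(model.2$fitted.values)   # one probability per row of df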

    New predictions

    It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.

    With model.1 (2 predictors):

    predict(model.1, newdata = list(pred1=1, pred2=42), type="response")
    
    #         1 
    # 0.1843701 
    

    With model.2 (1 predictor):

    predict(model.2, newdata = list(pred1=12), type="response")
    
    #       1 
    # 0.96232 
    

    Going from probability to binary response

    Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:

    plot(df$pred1, model.2$fitted.values, 
         xlab="pred1", ylab="probability that Case=1")
    

    [Figure: P(Case=1) against pred1]

    We note that since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, the points would be scattered in a cloud around the same curve rather than lying exactly on it.
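
    Concretely, this curve is just the inverse logit of the linear predictor. A minimal sketch, reusing the coefficients of model.2 to recompute by hand the earlier prediction for pred1 = 12:

    # Probability = inverse logit of (intercept + slope * pred1)
    b <- coef(model.2)
    plogis(b[1] + b[2] * 12)   # ~0.962, same as predict(model.2, newdata = list(pred1 = 12), type = "response")
    # equivalently: 1 / (1 + exp(-(b[1] + b[2] * 12)))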

    But this doesn't change the fact that, if we are to evaluate how well our model predicts the binary outcome, we need to settle on a threshold above which we'll consider an observation to be a Case. Several packages have tools to help pick that threshold. But even without any additional package, we can calculate various properties over a range of thresholds with a loop such as the following, which computes the sensitivity (ability to detect true Cases), the specificity (ability to identify true non-Cases), and other properties well described here.

    df.ana <- data.frame(thresh = seq(from = 1, to = 100, by = 1) / 100)
    p.hat <- predict(model.2, type = "response")   # fitted probabilities, computed once
    for(i in seq_along(df.ana$thresh)) {
        df.ana$sensitivity[i]    <- sum(df$Case == 1 & (p.hat >= df.ana$thresh[i])) / sum(df$Case == 1)
        df.ana$specificity[i]    <- sum(df$Case == 0 & (p.hat <  df.ana$thresh[i])) / sum(df$Case == 0)
        df.ana$pos.pred.value[i] <- sum(df$Case == 1 & (p.hat >= df.ana$thresh[i])) / sum(p.hat >= df.ana$thresh[i])
        df.ana$neg.pred.value[i] <- sum(df$Case == 0 & (p.hat <  df.ana$thresh[i])) / sum(p.hat <  df.ana$thresh[i])
        df.ana$accuracy[i]       <- sum((p.hat >= df.ana$thresh[i]) == df$Case) / nrow(df)
    }
    
    which.max(df.ana$accuracy)
    
    # [1] 46 
    
    optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46
    

    The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" in that respect. Let's check a few neighboring rows of the generated data frame; they tell us that 0.47 would work just as well on all fronts. Finer tuning would require adding new data to our initial data frame.

    df.ana[45:48,]
    
    #    thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
    # 45   0.45   0.7142857   0.6947368      0.7211538      0.6875000    0.705
    # 46   0.46   0.7142857   0.7157895      0.7352941      0.6938776    0.715
    # 47   0.47   0.7142857   0.7157895      0.7352941      0.6938776    0.715
    # 48   0.48   0.7047619   0.7157895      0.7326733      0.6868687    0.710
    

    Note that pROC's auc function (area under the curve), applied to the predictions dichotomized at that threshold, returns (sensitivity + specificity) / 2; here (0.714 + 0.716) / 2 ≈ 0.715, which coincides with the accuracy at that threshold because sensitivity and specificity are nearly equal and the two classes are roughly balanced (105 Cases vs 95 non-Cases):

    library(pROC)
    auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)
    
    # Area under the curve: 0.715
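
    # Check: for a dichotomized predictor the ROC has a single operating point, so its AUC
    # equals (sensitivity + specificity) / 2, i.e. the balanced accuracy at that threshold:
    with(df.ana[df.ana$thresh == optimal.thresh, ], (sensitivity + specificity) / 2)   # ~0.715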
    

    Some plots

    # thresholds against accuracy
    plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
             xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
    text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")
    
    # thresholds against Sensitivity 
    lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l", col="blue") # Sensitivity: we want it high, but not at the expense of specificity
    text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")
    
    # thresholds against specificity 
    lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity: also to be kept high, but not at the expense of sensitivity
    text(x = 0.1, y = 0.05, labels = "Specificity", col="red")
    
    # optimal threshold vertical line
    abline(v=optimal.thresh)
    text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)
    

    [Figure: accuracy, sensitivity, and specificity plotted against the threshold]

    Incidentally, the three curves cross at more or less the same point, which suggests this threshold is a good compromise between the qualities we look for in a predictive tool. But depending on your objectives, it might be better to pick a lower or a higher threshold; one common alternative criterion is sketched below. Statistical tools are useful, but in the end other considerations often weigh more heavily in the final decision.
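
    For instance, if sensitivity and specificity are to be weighted equally regardless of accuracy, a common choice is to maximize Youden's J (sensitivity + specificity - 1). A minimal sketch reusing the df.ana data frame built above:

    # Youden's J statistic: pick the threshold maximizing sensitivity + specificity - 1
    j <- df.ana$sensitivity + df.ana$specificity - 1
    df.ana$thresh[which.max(j)]   # may or may not coincide with the accuracy-based 0.46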

    About ROC

    The following graph is the same as the one which would be produced with pROC's roc:

    plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
             xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity") 
    
    # Equivalent to
    # plot(roc(predictor=model.2$fitted.values, response = model.2$y))
    

    [Figure: ROC curve]
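
    As an aside, the AUC of the continuous fitted probabilities (rather than of the dichotomized predictions used earlier) summarizes discrimination across all thresholds at once. A minimal sketch with pROC, which is already loaded above:

    # ROC and AUC of the continuous fitted probabilities, over all possible thresholds
    roc.cont <- roc(response = model.2$y, predictor = model.2$fitted.values)
    auc(roc.cont)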


    Tabulations and other stats

    The following function calculates, for a fitted logistic model and any chosen threshold, the same statistics seen above, and returns a 2x2 classification table.

    diagnos.test <- function(model, threshold) {
        p <- predict(model, type = "response")   # fitted probabilities of the supplied model
        y <- model$y                             # observed 0/1 outcome of the supplied model
        output <- list()
        output$stats <- c(
          sensitivity  = sum(y == 1 & p >= threshold) / sum(y == 1),
          specificity  = sum(y == 0 & p <  threshold) / sum(y == 0),
          pos.pr.value = sum(y == 1 & p >= threshold) / sum(p >= threshold),
          neg.pr.value = sum(y == 0 & p <  threshold) / sum(p <  threshold),
          accuracy     = sum((p >= threshold) == y) / length(y))
        output$tab <- addmargins(t(table(y, as.numeric(p >= threshold),
                                         dnn = list("Cases", "Predictions")))[2:1, 2:1])
        return(output)
    }
    
    diagnos.test(model.2, 0.47)
    
    # $stats
    #  sensitivity  specificity pos.pr.value neg.pr.value     accuracy 
    #    0.7142857    0.7157895    0.7352941    0.6938776    0.7150000 
    # 
    # $tab
    #            Cases
    # Predictions   1  0 Sum
    #         1    75 27 102
    #         0    30 68  98
    #         Sum 105 95 200
    

    Final note

    I don't pretend to have covered everything about prediction, sensitivity and specificity; my goal was rather to go as far as possible using plain language and explicit calculations, without relying on specific packages.
