Assessing/Improving prediction with linear discriminant analysis or logistic regression


I recently needed to combine two or more variables in a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. …


    Here's my input concerning logistic regression and prediction (I don't know much about linear discriminant analysis, but I understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor whether this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and that it addresses at least some of your questions. If not, and if other users think this would belong better on Cross Validated, I won't take offense. :)

    Sample data

    We'll first generate 200 observations, with increasing levels of probability for Case=1. The first predictor (pred1) will be linked to the probability of being a Case through a nonlinear, roughly sigmoid curve, close to the shape modeled by logistic regression, so it will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.

    set.seed(2351)
    df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE), 
                              sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE), 
                              sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
                     pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
                     pred2 = runif(n = 200, min = 0, max = 100))
    

    We see in the boxplot below that the observations where Case==1 generally have higher pred1, which is intended (from the way we generated the data). At the same time, there is an overlap; otherwise deciding on a cutoff point/threshold would be too easy.

    boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")
    

    [Figure: boxplot of pred1 by Case]

    Fitting the logistic model

    First using both predictors:

    model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
    summary(model.1)
    
    # Coefficients:
    #              Estimate Std. Error z value Pr(>|z|)    
    # (Intercept) -2.058258   0.479094  -4.296 1.74e-05 ***
    # pred1        0.428491   0.075373   5.685 1.31e-08 ***
    # pred2        0.003399   0.005500   0.618    0.537    
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    # 
    # (Dispersion parameter for binomial family taken to be 1)
    # 
    #     Null deviance: 276.76  on 199  degrees of freedom
    # Residual deviance: 238.51  on 197  degrees of freedom
    # AIC: 244.51
    

    As we'd expect, the first predictor is rather strongly related to the outcome, while the second is only poorly related to it.

    Note that to get Odds Ratios from those coefficients, we need to exponentiate them:

    exp(model.1$coefficients[2:3])
    
    #    pred1    pred2 
    # 1.534939 1.003405   # Odds Ratios (making the relationships appear more clearly). 
                          # Use `exp(confint(model.1))` to get confidence intervals.
    

    We'll compare this model to a simpler model, removing the second predictor:

    model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
    summary(model.2)
    
    # Coefficients:
    #             Estimate Std. Error z value Pr(>|z|)    
    # (Intercept) -1.87794    0.37452  -5.014 5.32e-07 ***
    # pred1        0.42651    0.07514   5.676 1.38e-08 ***
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # (Dispersion parameter for binomial family taken to be 1)
    #
    #     Null deviance: 276.76  on 199  degrees of freedom
    # Residual deviance: 238.89  on 198  degrees of freedom
    # AIC: 242.89
    
    exp(model.2$coefficients)[2]
    
    #    pred1 
    # 1.531907  # Odds Ratio
    

    We could also run anova(model.1, model.2) to compare the two models (a quick sketch follows below), but let's keep the simpler model and move on to prediction, since the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they are truly random noise, but here the focus is on the operation of predicting and on choosing a proper threshold.
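
    For completeness, here is a minimal sketch of that comparison as a likelihood-ratio test on the two nested models fitted above. The deviance difference is 238.89 - 238.51 ≈ 0.38 on 1 degree of freedom (p roughly 0.54), consistent with the non-significant Wald test for pred2; model.2 also has the slightly lower AIC (242.89 vs 244.51).

    # Likelihood-ratio test comparing the nested models;
    # a non-significant p-value supports keeping the simpler model.2
    anova(model.2, model.1, test = "Chisq")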

    Stored predictions

    In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response"), and they can be interpreted as probabilities: one for each row, based on the predictor(s) and their coefficient(s).
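
    As a quick check of that equivalence (a minimal sketch, using only the objects already created):

    # The stored fitted values and predict(..., type = "response") are the same probabilities
    all.equal(as.numeric(model.2$fitted.values),
              as.numeric(predict(model.2, type = "response")))   # TRUE
    head(model.2$fitted.values)   # one probability per row of df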

    New predictions

    It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.

    With model.1 (2 predictors):

    predict(model.1, newdata = list(pred1=1, pred2=42), type="response")
    
    #         1 
    # 0.1843701 
    

    With model.2 (1 predictor):

    predict(model.2, newdata = list(pred1=12), type="response")
    
    #       1 
    # 0.96232 
    

    Going from probability to binary response

    Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:

    plot(df$pred1, model.2$fitted.values, 
         xlab="pred1", ylab="probability that Case=1")
    

    [Figure: P(Case=1) against pred1]

    We note that since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, the points would be scattered in a cloud around the same curve rather than lying exactly on it.
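
    Concretely, this curve is just the inverse logit of the linear predictor. A minimal sketch, reusing the coefficients of model.2 to recompute by hand the earlier prediction for pred1 = 12:

    # Probability = inverse logit of (intercept + slope * pred1)
    b <- coef(model.2)
    plogis(b[1] + b[2] * 12)   # ~0.962, same as predict(model.2, newdata = list(pred1 = 12), type = "response")
    # equivalently: 1 / (1 + exp(-(b[1] + b[2] * 12)))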

    But this doesn't change the fact that, if we are to evaluate how well our model predicts the binary outcome, we need to settle on a threshold above which we'll consider an observation to be a Case. Several packages have tools to help pick that threshold. But even without any additional package, we can calculate various properties over a range of thresholds with a loop such as the following, which computes the sensitivity (ability to detect true Cases), the specificity (ability to identify true non-Cases), and other properties well described here.

    df.ana <- data.frame(thresh = seq(from = 1, to = 100, by = 1) / 100)
    p.hat <- predict(model.2, type = "response")   # fitted probabilities, computed once
    for(i in seq_along(df.ana$thresh)) {
        df.ana$sensitivity[i]    <- sum(df$Case == 1 & (p.hat >= df.ana$thresh[i])) / sum(df$Case == 1)
        df.ana$specificity[i]    <- sum(df$Case == 0 & (p.hat <  df.ana$thresh[i])) / sum(df$Case == 0)
        df.ana$pos.pred.value[i] <- sum(df$Case == 1 & (p.hat >= df.ana$thresh[i])) / sum(p.hat >= df.ana$thresh[i])
        df.ana$neg.pred.value[i] <- sum(df$Case == 0 & (p.hat <  df.ana$thresh[i])) / sum(p.hat <  df.ana$thresh[i])
        df.ana$accuracy[i]       <- sum((p.hat >= df.ana$thresh[i]) == df$Case) / nrow(df)
    }
    
    which.max(df.ana$accuracy)
    
    # [1] 46 
    
    optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46
    

    The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" in that respect. Let's check a few neighboring rows of the generated data frame; they tell us that 0.47 would work just as well on all fronts. Finer tuning would require adding new data to our initial data frame.

    df.ana[45:48,]
    
    #    thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
    # 45   0.45   0.7142857   0.6947368      0.7211538      0.6875000    0.705
    # 46   0.46   0.7142857   0.7157895      0.7352941      0.6938776    0.715
    # 47   0.47   0.7142857   0.7157895      0.7352941      0.6938776    0.715
    # 48   0.48   0.7047619   0.7157895      0.7326733      0.6868687    0.710
    

    Note that pROC's auc function (area under the curve), applied to the predictions dichotomized at that threshold, returns (sensitivity + specificity) / 2; here (0.714 + 0.716) / 2 ≈ 0.715, which coincides with the accuracy at that threshold because sensitivity and specificity are nearly equal and the two classes are roughly balanced (105 Cases vs 95 non-Cases):

    library(pROC)
    auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)
    
    # Area under the curve: 0.715
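
    # Check: for a dichotomized predictor the ROC has a single operating point, so its AUC
    # equals (sensitivity + specificity) / 2, i.e. the balanced accuracy at that threshold:
    with(df.ana[df.ana$thresh == optimal.thresh, ], (sensitivity + specificity) / 2)   # ~0.715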
    

    Some plots

    # thresholds against accuracy
    plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
             xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
    text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")
    
    # thresholds against Sensitivity 
    lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l", col="blue") # Sensitivity: we want it high, but not at the expense of specificity
    text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")
    
    # thresholds against specificity 
    lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity: also to be kept high, but not at the expense of sensitivity
    text(x = 0.1, y = 0.05, labels = "Specificity", col="red")
    
    # optimal threshold vertical line
    abline(v=optimal.thresh)
    text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)
    

    [Figure: accuracy, sensitivity, and specificity plotted against the threshold]

    Incidentally, the three curves cross at more or less the same point, which suggests this threshold is a good compromise between the qualities we look for in a predictive tool. But depending on your objectives, it might be better to pick a lower or a higher threshold; one common alternative criterion is sketched below. Statistical tools are useful, but in the end other considerations often weigh more heavily in the final decision.
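
    For instance, if sensitivity and specificity are to be weighted equally regardless of accuracy, a common choice is to maximize Youden's J (sensitivity + specificity - 1). A minimal sketch reusing the df.ana data frame built above:

    # Youden's J statistic: pick the threshold maximizing sensitivity + specificity - 1
    j <- df.ana$sensitivity + df.ana$specificity - 1
    df.ana$thresh[which.max(j)]   # may or may not coincide with the accuracy-based 0.46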

    About ROC

    The following graph is the same as the one which would be produced with pROC's roc:

    plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
             xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity") 
    
    # Equivalent to
    # plot(roc(predictor=model.2$fitted.values, response = model.2$y))
    

    [Figure: ROC curve]
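
    As an aside, the AUC of the continuous fitted probabilities (rather than of the dichotomized predictions used earlier) summarizes discrimination across all thresholds at once. A minimal sketch with pROC, which is already loaded above:

    # ROC and AUC of the continuous fitted probabilities, over all possible thresholds
    roc.cont <- roc(response = model.2$y, predictor = model.2$fitted.values)
    auc(roc.cont)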


    Tabulations and other stats

    The following function calculates, for a fitted logistic model and any chosen threshold, the same statistics seen above, and returns a 2x2 classification table.

    diagnos.test <- function(model, threshold) {
        p <- predict(model, type = "response")   # fitted probabilities of the supplied model
        y <- model$y                             # observed 0/1 outcome of the supplied model
        output <- list()
        output$stats <- c(
          sensitivity  = sum(y == 1 & p >= threshold) / sum(y == 1),
          specificity  = sum(y == 0 & p <  threshold) / sum(y == 0),
          pos.pr.value = sum(y == 1 & p >= threshold) / sum(p >= threshold),
          neg.pr.value = sum(y == 0 & p <  threshold) / sum(p <  threshold),
          accuracy     = sum((p >= threshold) == y) / length(y))
        output$tab <- addmargins(t(table(y, as.numeric(p >= threshold),
                                         dnn = list("Cases", "Predictions")))[2:1, 2:1])
        return(output)
    }
    
    diagnos.test(model.2, 0.47)
    
    # $stats
    #  sensitivity  specificity pos.pr.value neg.pr.value     accuracy 
    #    0.7142857    0.7157895    0.7352941    0.6938776    0.7150000 
    # 
    # $tab
    #            Cases
    # Predictions   1  0 Sum
    #         1    75 27 102
    #         0    30 68  98
    #         Sum 105 95 200
    

    Final note

    I don't pretend to have covered everything about prediction, sensitivity and specificity; my goal was rather to go as far as possible using plain language and explicit calculations, without relying on specific packages.
