best way to create response curves for a GLM species distribution model in R?

落爺英雄遲暮 提交于 2021-02-08 10:21:09

问题


I'm running a binomial GLM to predict the probability of a species occurrence, where I am training on one dataset and testing the model on another dataset:

TrainingData<-read.csv("TrainingData.csv")[,-1]
TrainingData[,1]<-as.factor(TrainingData[,1])
TrainingData[,4]<-as.factor(TrainingData[,4])

TestData<-read.csv("TestData.csv")[,-1]
TestData[,1]<-as.factor(TestData[,1])
TestData[,4]<-as.factor(TestData[,4])

mod<-glm(presence~var1+var2+var3, family=binomial, data=TrainingData)

probs=predict(mod, TestData, type="response")

What is the best way (or function) to create response curves to plot the relationship between the probability of presence and each predictor variable?

Thanks!


回答1:


The marginal probabilities can be calculated from predict.glm with type = "terms", since each of the terms are calculated with the remaining variables set at their mean values.
This is converted back to a probabilty scale with plogis(term + intercept).

Second, because your data set contains and combination of continuous values and factors for your predictor variables, separate plots were made for each type and combined with grid.arrange.

Although this answers your question directly based on the glm model you presented, I would still recommend examining the spatial autocorrelation of both your predictor and response variables, as this could have a likely impact on your final model.

library(reshape2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)

TrainingData <- read.csv("~/Downloads/TrainingData.csv", header = TRUE)
TrainingData[['presence']] <- as.factor(TrainingData[['presence']])
TrainingData[['var3']] <- as.factor(TrainingData[['var3']])
TrainingData[['X']] <- NULL # Not used in the model

TestData <- read.csv("~/Downloads/TestData.csv", header = TRUE)
TestData[['presence']] <- as.factor(TestData[['presence']])
TestData[['var3']] <- as.factor(TestData[['var3']])
TestData[['X']] <- NULL

Presence/Absence model

mod <- glm(presence ~ var1 + var2 + var3, family = binomial, data = TrainingData)

Get predicted probabilities for each of the centered variables (i.e remaining variables set to their mean).

mod_terms <- predict(mod, newdata = TestData, type = "terms")
mod_prob <- data.frame(idx = 1:nrow(TestData), plogis(mod_terms + 
    attr(mod_terms, "constant")))
mod_probg <- mod_prob %>% gather(variable, probability, -idx)

Melt the Test data into long format

TestData['idx'] <- 1:nrow(TestData) # Add index to data
TestData[['X']] <- NULL # Drop the X variable since it was not used in the model

data_long <- melt(TestData, id = c("presence","idx"))

data_long[['value']] <- as.numeric(data_df[['value']])

Merge Testdata with predictions and separate the data containing continuous (var1 and var2) and factors (var3).

# Merge Testdata with predictions
data_df <- merge(data_long, mod_probg, by = c("idx", "variable"))
data_df <- data_df %>% arrange(variable, value) 

data_continuous <- data_df %>% filter(., variable != "var3") %>% 
    transform(value = as.numeric(value)) %>% arrange(variable, value) 

data_factor <- data_df %>% filter(., variable == "var3") %>%
    transform(value = as.factor(value))%>% 
   arrange(idx) 

ggplot output

g_continuous <- ggplot(data_continuous, aes(x = value, y = probability)) + geom_point()+
 facet_wrap(~variable, scales = "free_x") 

g_factor <-  ggplot(data = data_factor, aes(x = value, y = probability)) + geom_boxplot() +
facet_wrap(~variable) 

grid.arrange(g_continuous, g_factor, nrow = 1)



来源:https://stackoverflow.com/questions/32915628/best-way-to-create-response-curves-for-a-glm-species-distribution-model-in-r

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!