Question
I stole this example from the following post: LINK
library(data.table)  # provides data.table() and the := syntax used below
set.seed(1)
TDT <- data.table(Group = c(rep("A",40),rep("B",60)),
                  Id = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                  Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
                  norm = round(runif(100)/10,2),
                  x1 = sample(100,100),
                  x2 = round(rnorm(100,0.75,0.3),2),
                  x3 = round(rnorm(100,0.75,0.3),2),
                  x4 = round(rnorm(100,0.75,0.3),2),
                  x5 = round(rnorm(100,0.75,0.3),2))
In order to get the correlations of x1 - x5 by time, one could use:
TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9]
However, instead of the correlations of x1 - x5 by time, I am interested in the correlations of x1 with ALL other variables (not only x1 - x5) over time. In addition, I would like to have the significance of each correlation: for an explanation see video around 5:16.
In order to deal with non-numerical columns and NAs, I tried to go this way:
numcols <- which(sapply(TDT, is.numeric))
TDTcor <- TDT[, x := .(list(cor(.SD, use= "pairwise.complete.obs", method= "pearson"))), by = Time, .SDcols = numcols]
The primary problem is that this still computes the full correlation matrix for every time point, which is a problem given the very large datasets I use. It also does not give the significance of the correlations.
Could anyone tell me how to proceed?
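To make the desired output concrete, here is a minimal sketch of one possible approach, assuming that the p-value from base R's cor.test() is an acceptable measure of significance. The helper cor_with_target(), its arguments, and the small stand-in table DT are hypothetical names introduced only for illustration. Only the correlations of the target column with each other numeric column are computed, not the full matrix, and the p-values are not adjusted for multiple testing.

```r
library(data.table)

set.seed(1)
# Small stand-in for TDT: 5 ids observed at 20 month-end dates (illustrative only)
DT <- data.table(
  Id   = rep(1:5, each = 20),
  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
  x1   = sample(100, 100),
  x2   = round(rnorm(100, 0.75, 0.3), 2),
  x3   = round(rnorm(100, 0.75, 0.3), 2)
)

# Hypothetical helper: correlate `target` with every other numeric column,
# separately within each level of `by_col`.
cor_with_target <- function(dt, target, by_col) {
  num_cols <- names(dt)[sapply(dt, is.numeric)]
  others   <- setdiff(num_cols, c(target, by_col))
  dt[, rbindlist(lapply(others, function(v) {
    # cor.test() drops incomplete pairs itself, so NAs are handled pairwise
    ct <- cor.test(.SD[[target]], .SD[[v]], method = "pearson")
    data.table(variable = v, r = unname(ct$estimate), p.value = ct$p.value)
  })), by = by_col, .SDcols = c(target, others)]
}

res <- cor_with_target(DT, target = "x1", by_col = "Time")
head(res)
```

The result is a long table with one row per time point and variable, which should stay manageable on large data because its size grows linearly with the number of columns rather than quadratically. Note that cor.test() needs at least three complete pairs per group.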
Source: https://stackoverflow.com/questions/56182271/getting-the-correlation-with-significance-of-one-variable-with-the-rest-of-the-d