Getting the correlation with significance of one variable with the rest of the dataset, by time, in data.table

Submitted by 一曲冷凌霜 on 2019-12-02 06:35:49

Question


I stole this example from the following post: LINK

library(data.table)

set.seed(1)  # make the random columns reproducible
TDT <- data.table(Group = c(rep("A",40),rep("B",60)),
                  Id = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                  Time = rep(seq(as.Date("2010-01-03"), length.out=20, by="1 month") - 1,5),
                  norm = round(runif(100)/10,2),
                  x1 = sample(100,100),
                  x2 = round(rnorm(100,0.75,0.3),2),
                  x3 = round(rnorm(100,0.75,0.3),2),
                  x4 = round(rnorm(100,0.75,0.3),2),
                  x5 = round(rnorm(100,0.75,0.3),2))

In order to get the correlations of x1 - x5 by time, one could use:

TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9]
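To make the shape of that result concrete, here is a minimal sketch on a small stand-in table. The table `DT` and its columns `a`, `b`, `c` are made up for illustration and are not part of the original data. It shows that the new column is a list column in which every row of a time group holds that group's full correlation matrix, so the row for a single variable can be pulled out afterwards.

```r
library(data.table)

set.seed(1)
# Hypothetical stand-in data: 4 time points, 5 observations each
DT <- data.table(Time = rep(1:4, each = 5),
                 a = rnorm(20), b = rnorm(20), c = rnorm(20))

# Same idiom as above: one correlation matrix per Time group,
# stored (and repeated) in a list column
DT[, x := .(list(cor(.SD))), by = Time, .SDcols = c("a", "b", "c")]

m <- DT$x[[1]]   # matrix for the Time group of the first row
dim(m)           # a 3 x 3 matrix
m["a", ]         # correlations of `a` with every column, for that group
```

Note that the whole matrix is still computed and stored for every row, which is exactly the cost that becomes a problem on large tables.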

However, instead of the correlations among x1 - x5 by time, I am interested in the correlations of x1 with ALL other variables (not only x2 - x5) over time. In addition, I would like to obtain the significance of each correlation (for an explanation, see the video at around 5:16).

To deal with non-numeric columns and NAs, I tried the following:

numcols <- which(sapply(TDT, is.numeric))
TDTcor <- TDT[, x := .(list(cor(.SD, use= "pairwise.complete.obs", method= "pearson"))), by = Time, .SDcols = numcols]

The main problem is that this still computes every pairwise correlation, which is too expensive for the very large datasets I use. It also does not report the significance of the correlations.

Could anyone tell me how to proceed?
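One possible direction, offered as a rough sketch rather than a definitive answer: loop over the other numeric columns inside each time group and call `stats::cor.test()` on x1 and that column only, so that no full correlation matrix is ever built. The helper `cor_p`, the variable names `target` and `others`, and the reduced example table below are all assumptions made for illustration. Pairs with fewer than three complete observations, or with a constant column, are returned as NA instead of raising an error.

```r
library(data.table)

set.seed(1)
# Reduced version of the example table (fewer x columns, same layout)
TDT <- data.table(
  Group = c(rep("A", 40), rep("B", 60)),
  Id    = rep(1:5, each = 20),
  Time  = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
  norm  = round(runif(100) / 10, 2),
  x1    = sample(100, 100),
  x2    = round(rnorm(100, 0.75, 0.3), 2),
  x3    = round(rnorm(100, 0.75, 0.3), 2)
)

target <- "x1"
# All numeric columns except the target (Group and Time are skipped)
others <- setdiff(names(TDT)[sapply(TDT, is.numeric)], target)

# Hypothetical helper: Pearson correlation and its p-value for two vectors,
# using only rows where both values are present
cor_p <- function(a, b) {
  ok <- complete.cases(a, b)
  if (sum(ok) < 3L || sd(a[ok]) == 0 || sd(b[ok]) == 0) {
    return(list(estimate = NA_real_, p.value = NA_real_))
  }
  ct <- cor.test(a[ok], b[ok], method = "pearson")
  list(estimate = unname(ct$estimate), p.value = ct$p.value)
}

# One row per (Time, variable): correlation of the target with that variable
res <- TDT[, {
  tv  <- .SD[[target]]
  out <- lapply(others, function(v) cor_p(tv, .SD[[v]]))
  list(variable = others,
       estimate = sapply(out, `[[`, "estimate"),
       p.value  = sapply(out, `[[`, "p.value"))
}, by = Time, .SDcols = c(target, others)]

head(res)
```

The result is in long format with the columns Time, variable, estimate and p.value, so it grows linearly with the number of columns instead of quadratically. With only five observations per time point, as in this example, the p-values carry little information; the sketch is only meant to show the mechanics.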

Source: https://stackoverflow.com/questions/56182271/getting-the-correlation-with-significance-of-one-variable-with-the-rest-of-the-d
