Question
If I have the following data table:
library(data.table)
set.seed(1)
TDT <- data.table(Group = c(rep("A",40),rep("B",60)),
Id = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
norm = round(runif(100)/10,2),
x1 = sample(100,100),
x2 = round(rnorm(100,0.75,0.3),2),
x3 = round(rnorm(100,0.75,0.3),2),
x4 = round(rnorm(100,0.75,0.3),2),
x5 = round(rnorm(100,0.75,0.3),2))
How can I calculate the correlations between x1, x2, x3, x4 and x5 by Time?
This:
TDT[,x:= list(cor(TDT[,5:9])), by = Time]
does not work.
How can it be done in data.table?
Answer 1:
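You were so close in your attempt! Two things were missing: an extra list(), and a reference to the current group's columns instead of TDT[,5:9], which is evaluated on the whole table and would give every Time the same matrix.
This works:
TDT[, x := list(list(cor(cbind(x1, x2, x3, x4, x5)))), by = Time]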
And TDT$x returns:
[[1]]
x1 x2 x3 x4 x5
x1 1.00000000 0.72185099 0.07368766 -0.7031890 -0.36895449
x2 0.72185099 1.00000000 0.68058833 -0.7393130 0.05066973
x3 0.07368766 0.68058833 1.00000000 -0.5021462 0.10645894
x4 -0.70318896 -0.73931299 -0.50214616 1.0000000 0.11671020
x5 -0.36895449 0.05066973 0.10645894 0.1167102 1.00000000
[[2]]
x1 x2 x3 x4 x5
x1 1.0000000 -0.1011948 -0.85191422 -0.15571603 0.4855237
x2 -0.1011948 1.0000000 0.56691559 -0.44002621 -0.6699172
x3 -0.8519142 0.5669156 1.00000000 0.02189754 -0.6168013
x4 -0.1557160 -0.4400262 0.02189754 1.00000000 0.2236542
x5 0.4855237 -0.6699172 -0.61680132 0.22365419 1.0000000
[...]
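The extra list() is needed because of how data.table interprets the second argument (j) of the DT[i, j] syntax: a list in j is read as a list of columns, so a matrix wrapped in a single list() is treated as one ordinary column rather than as one value per group. Wrapping it a second time yields a list column in which each cell holds a whole matrix. This has been discussed in depth elsewhere on Stack Overflow, with a most excellent answer that I invite you to read.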
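A minimal sketch of that behaviour on a toy table (the names DT, g, a, b and m are made up for illustration and do not come from the question):

```r
library(data.table)

# Toy data: two groups of four rows each (hypothetical names).
DT <- data.table(g = rep(c("u", "v"), each = 4),
                 a = c(1, 2, 3, 4, 4, 3, 2, 1),
                 b = c(2, 4, 5, 9, 1, 3, 2, 5))

# Outer list(): the columns being assigned (here just one, m).
# Inner list(): a single cell holding the whole 2 x 2 matrix,
# which is recycled over the rows of the group.
DT[, m := list(list(cor(cbind(a, b)))), by = g]

class(DT$m)      # the new column is a list
dim(DT$m[[1]])   # each cell holds a 2 x 2 correlation matrix
```

Rows of the same group share the same matrix, while rows of different groups hold different ones.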
As a side note, it seems preferable to replace the outermost call to list() with .() to clarify the intent. I also like to single out the columns explicitly with .SD and .SDcols; .SD holds only the rows of the current group, so the correlation is computed per Time. You could write the code as:
TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9]
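If the matrix does not need to sit next to every row, a variant without := returns one row per Time instead. The following is only a sketch: it rebuilds a cut-down version of the question's table so that it runs on its own, and cols and byTime are hypothetical names introduced here.

```r
library(data.table)

# Cut-down rebuild of the question's table (only the columns used here).
set.seed(1)
TDT <- data.table(
  Id   = rep(1:5, each = 20),
  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
  x1   = sample(100, 100),
  x2   = round(rnorm(100, 0.75, 0.3), 2),
  x3   = round(rnorm(100, 0.75, 0.3), 2),
  x4   = round(rnorm(100, 0.75, 0.3), 2),
  x5   = round(rnorm(100, 0.75, 0.3), 2)
)

cols <- paste0("x", 1:5)   # hypothetical helper: select the columns by name

# No := here, so j returns a new table: Time plus a list column x.
byTime <- TDT[, .(x = list(cor(.SD))), by = Time, .SDcols = cols]

nrow(byTime)      # one row per distinct Time
byTime$x[[1]]     # 5 x 5 matrix for the first Time in the table
```

Selecting by name rather than by position (5:9) keeps the call valid if columns are later added or reordered.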
Answer 2:
You may find the corrr package useful for this. In combination with dplyr commands, you can easily get the correlation matrix for different groups.
library(data.table) # not necessary unless you want the data in this format for other reasons
library(dplyr)
library(corrr)
Get correlation matrix for each Id:
> TDT %>%
+ group_by(Id) %>%
+ do({
+ correlate(select(., x1:x5))
+ })
Source: local data frame [25 x 7]
Groups: Id [5]
Id rowname x1 x2 x3 x4 x5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 x1 NA -0.246252411 -0.24589380 -0.181120555 0.14781414
2 1 x2 -0.24625241 NA 0.32098291 -0.175603686 -0.08863810
3 1 x3 -0.24589380 0.320982911 NA 0.161336670 0.07934436
4 1 x4 -0.18112056 -0.175603686 0.16133667 NA -0.19662700
5 1 x5 0.14781414 -0.088638098 0.07934436 -0.196627000 NA
6 2 x1 NA 0.075760735 0.41276725 0.425032505 0.37178993
7 2 x2 0.07576074 NA 0.07747543 -0.004202306 -0.08086958
8 2 x3 0.41276725 0.077475426 NA 0.248151847 0.07619264
9 2 x4 0.42503251 -0.004202306 0.24815185 NA 0.37647798
10 2 x5 0.37178993 -0.080869584 0.07619264 0.376477979 NA
# ... with 15 more rows
Get correlation matrix for each Time:
> TDT %>%
+ group_by(Time) %>%
+ do({
+ correlate(select(., x1:x5))
+ })
Source: local data frame [100 x 7]
Groups: Time [20]
Time rowname x1 x2 x3 x4 x5
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2010-01-02 x1 NA -0.66584960 -0.58788152 0.92540707 0.37316217
2 2010-01-02 x2 -0.66584960 NA -0.06102424 -0.69292534 0.19440850
3 2010-01-02 x3 -0.58788152 -0.06102424 NA -0.54623949 -0.78714932
4 2010-01-02 x4 0.92540707 -0.69292534 -0.54623949 NA 0.53697784
5 2010-01-02 x5 0.37316217 0.19440850 -0.78714932 0.53697784 NA
6 2010-02-02 x1 NA -0.10444724 -0.62424401 0.30109335 0.04834057
7 2010-02-02 x2 -0.10444724 NA -0.12010431 0.08966978 -0.68762698
8 2010-02-02 x3 -0.62424401 -0.12010431 NA -0.92782037 0.52099983
9 2010-02-02 x4 0.30109335 0.08966978 -0.92782037 NA -0.58214861
10 2010-02-02 x5 0.04834057 -0.68762698 0.52099983 -0.58214861 NA
# ... with 90 more rows
Answer 3:
Split by Time and then run cor for each sub-group:
lapply(split(TDT, TDT$Time), function(a) cor(a[,5:9]))
# OR
lapply(split(TDT[,5:9], TDT$Time), cor)
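The result of this approach is a plain list named by date, which makes single matrices or single coefficients easy to pull out. A small sketch, again on a cut-down rebuild of the question's data (cors and x1x2 are hypothetical names):

```r
library(data.table)

# Cut-down rebuild of the question's table (only the columns used here).
set.seed(1)
TDT <- data.table(
  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
  x1   = sample(100, 100),
  x2   = round(rnorm(100, 0.75, 0.3), 2),
  x3   = round(rnorm(100, 0.75, 0.3), 2)
)

# One correlation matrix per Time; the list names are the dates as strings.
cors <- lapply(split(TDT[, c("x1", "x2", "x3")], TDT$Time), cor)

cors[["2010-01-02"]]                           # matrix for a single date
x1x2 <- sapply(cors, function(m) m["x1", "x2"])  # one coefficient over time
head(x1x2)
```

With only five rows per Time in this data set, each coefficient rests on five observations, so large swings between dates are to be expected.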
Source: https://stackoverflow.com/questions/42611836/correlationmatrix-from-data-table