Correlation matrix from data table

Question

If I have the following data table:

library(data.table)
set.seed(1)
TDT <- data.table(Group = c(rep("A",40),rep("B",60)),
                      Id = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                      Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
                      norm = round(runif(100)/10,2),
                      x1 = sample(100,100),
                      x2 = round(rnorm(100,0.75,0.3),2),
                      x3 = round(rnorm(100,0.75,0.3),2),
                      x4 = round(rnorm(100,0.75,0.3),2),
                      x5 = round(rnorm(100,0.75,0.3),2))

How can I calculate the correlations between x1, x2, x3, x4 and x5 by Time?

This:

TDT[,x:= list(cor(TDT[,5:9])), by = Time]

does not work.

How can this be done in data.table?

Answer 1:

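You were close. The assignment itself fails only because an extra list() is missing, so this runs:

TDT[,x:= list(list(cor(TDT[,5:9]))), by = Time]

However, TDT[,5:9] inside j refers to the whole table rather than the current group, so every row would receive the same full-table correlation matrix. To correlate within each Time, use .SD restricted to columns 5 to 9:

TDT[, x := list(list(cor(.SD))), by = Time, .SDcols = 5:9]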

And TDT$x returns:

[[1]]
            x1          x2          x3         x4          x5
x1  1.00000000  0.72185099  0.07368766 -0.7031890 -0.36895449
x2  0.72185099  1.00000000  0.68058833 -0.7393130  0.05066973
x3  0.07368766  0.68058833  1.00000000 -0.5021462  0.10645894
x4 -0.70318896 -0.73931299 -0.50214616  1.0000000  0.11671020
x5 -0.36895449  0.05066973  0.10645894  0.1167102  1.00000000

[[2]]
           x1         x2          x3          x4         x5
x1  1.0000000 -0.1011948 -0.85191422 -0.15571603  0.4855237
x2 -0.1011948  1.0000000  0.56691559 -0.44002621 -0.6699172
x3 -0.8519142  0.5669156  1.00000000  0.02189754 -0.6168013
x4 -0.1557160 -0.4400262  0.02189754  1.00000000  0.2236542
x5  0.4855237 -0.6699172 -0.61680132  0.22365419  1.0000000

[...]

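The extra list() is needed because of how data.table interprets j, the second argument in DT[i, j]: each element of a list() in j is treated as a column, so the matrix must be wrapped in a second list() to be stored as a single cell of a list column. This has been discussed in depth elsewhere on Stack Overflow, in an excellent answer that I invite you to read.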
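A minimal sketch of that behaviour, using a small hypothetical table DT that is not part of the question: the outer list() defines the column, and the inner list() keeps the whole matrix in one cell.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
DT <- data.table(g = rep(c("a", "b"), each = 3),
                 u = c(1, 2, 3, 4, 5, 6),
                 v = c(2, 1, 4, 3, 6, 5))

# Outer list(): one column named m. Inner list(): one matrix per cell.
res <- DT[, list(m = list(cor(cbind(u, v)))), by = g]

class(res$m)      # "list"
dim(res$m[[1]])   # 2 2
```

Each row of res holds one group's 2 x 2 correlation matrix, retrievable with res$m[[i]].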

As a side note, replacing the outermost list() with .() makes the intent clearer. Selecting the columns through .SD and .SDcols also ensures that cor() sees only the rows of the current group, because .SD holds the per-group subset of the data. The call then becomes:

TDT[, x := .(list(cor(.SD))), by = Time, .SDcols = 5:9]
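If one matrix per Time is enough, rather than a copy on every row, a possible variation is to aggregate instead of assigning with :=. The sketch below rebuilds the TDT table from the question; the names cors and cormat are made up here for illustration.

```r
library(data.table)

# Rebuild the table from the question
set.seed(1)
TDT <- data.table(Group = c(rep("A", 40), rep("B", 60)),
                  Id = rep(1:5, each = 20),
                  Time = rep(seq(as.Date("2010-01-03"), length.out = 20, by = "1 month") - 1, 5),
                  norm = round(runif(100) / 10, 2),
                  x1 = sample(100, 100),
                  x2 = round(rnorm(100, 0.75, 0.3), 2),
                  x3 = round(rnorm(100, 0.75, 0.3), 2),
                  x4 = round(rnorm(100, 0.75, 0.3), 2),
                  x5 = round(rnorm(100, 0.75, 0.3), 2))

# One row per Time, with the 5 x 5 matrix stored in a list column
cors <- TDT[, .(cormat = list(cor(.SD))), by = Time, .SDcols = paste0("x", 1:5)]

# Look up the matrix for a single date
cors[Time == as.Date("2010-01-02"), cormat[[1]]]
```

Selecting the columns by name in .SDcols, rather than by position, keeps the call valid if columns are later added or reordered.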

Answer 2:

You may find the corrr package useful for this. In combination with dplyr commands, you can easily get the correlation matrix for different groups.

library(data.table) # not necessary unless you want the data in this format for other reasons
library(dplyr)
library(corrr)

Get correlation matrix for each Id:

> TDT %>% 
+   group_by(Id) %>%
+   do({
+      correlate(select(., x1:x5))
+     }) 
Source: local data frame [25 x 7]
Groups: Id [5]

      Id rowname          x1           x2          x3           x4          x5
   <dbl>   <chr>       <dbl>        <dbl>       <dbl>        <dbl>       <dbl>
1      1      x1          NA -0.246252411 -0.24589380 -0.181120555  0.14781414
2      1      x2 -0.24625241           NA  0.32098291 -0.175603686 -0.08863810
3      1      x3 -0.24589380  0.320982911          NA  0.161336670  0.07934436
4      1      x4 -0.18112056 -0.175603686  0.16133667           NA -0.19662700
5      1      x5  0.14781414 -0.088638098  0.07934436 -0.196627000          NA
6      2      x1          NA  0.075760735  0.41276725  0.425032505  0.37178993
7      2      x2  0.07576074           NA  0.07747543 -0.004202306 -0.08086958
8      2      x3  0.41276725  0.077475426          NA  0.248151847  0.07619264
9      2      x4  0.42503251 -0.004202306  0.24815185           NA  0.37647798
10     2      x5  0.37178993 -0.080869584  0.07619264  0.376477979          NA
# ... with 15 more rows

Get correlation matrix for each Time:

> TDT %>% 
+   group_by(Time) %>%
+   do({
+     correlate(select(., x1:x5))
+   })
Source: local data frame [100 x 7]
Groups: Time [20]

         Time rowname          x1          x2          x3          x4          x5
       <date>   <chr>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1  2010-01-02      x1          NA -0.66584960 -0.58788152  0.92540707  0.37316217
2  2010-01-02      x2 -0.66584960          NA -0.06102424 -0.69292534  0.19440850
3  2010-01-02      x3 -0.58788152 -0.06102424          NA -0.54623949 -0.78714932
4  2010-01-02      x4  0.92540707 -0.69292534 -0.54623949          NA  0.53697784
5  2010-01-02      x5  0.37316217  0.19440850 -0.78714932  0.53697784          NA
6  2010-02-02      x1          NA -0.10444724 -0.62424401  0.30109335  0.04834057
7  2010-02-02      x2 -0.10444724          NA -0.12010431  0.08966978 -0.68762698
8  2010-02-02      x3 -0.62424401 -0.12010431          NA -0.92782037  0.52099983
9  2010-02-02      x4  0.30109335  0.08966978 -0.92782037          NA -0.58214861
10 2010-02-02      x5  0.04834057 -0.68762698  0.52099983 -0.58214861          NA
# ... with 90 more rows

Answer 3:

Split by Time and then run cor() on each subgroup:

lapply(split(TDT, TDT$Time), function(a) cor(a[,5:9]))

#OR

lapply(split(TDT[,5:9], TDT$Time), cor)
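Either call returns a list named by the Time values, converted to character strings, so a single matrix can be pulled out by name. A short base R sketch on hypothetical toy data (df and by_time are illustrative names, not from the question):

```r
# Hypothetical toy data, for illustration only
set.seed(42)
df <- data.frame(Time = rep(as.Date(c("2010-01-02", "2010-02-02")), each = 5),
                 x1 = rnorm(10),
                 x2 = rnorm(10))

# Split the numeric columns by date, then compute one matrix per piece
by_time <- lapply(split(df[, c("x1", "x2")], df$Time), cor)

names(by_time)            # the dates, as character strings
by_time[["2010-02-02"]]   # 2 x 2 correlation matrix for that date
```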

Source: https://stackoverflow.com/questions/42611836/correlationmatrix-from-data-table

Tags

data.table

correlation