Add missing rows to data.table according to multiple keyed columns

Posted by 心已入冬 on 2019-12-04 21:49:18

Question


I have a data.table object in which several columns together identify unique cases. In the small example below, the variables "name", "job", and "sex" together form the unique ID. I would like to add missing rows so that each case has a row for every possible value of another variable, "from" (similar to expand.grid).

library(data.table)
set.seed(1)
mydata <- data.table(name = c("john","john","john","john","mary","chris","chris","chris"),
                 job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
                 sex = c("male","male","male","male","female","female","male","male"),
                 from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
                 score = rnorm(8))

setkeyv(mydata, cols=c("name","job","sex"))

mydata[CJ(unique(name, job, sex), unique(from))]

Here's the current data.table object:

> mydata
    name     job    sex from      score
1:  john teacher   male  NYT -0.6264538
2:  john teacher   male USAT  0.1836433
3:  john teacher   male   BG -0.8356286
4:  john teacher   male TIME  1.5952808
5:  mary  police female USAT  0.3295078
6: chris  lawyer female   BG -0.8204684
7: chris  lawyer   male  NYT  0.4874291
8: chris  doctor   male  NYT  0.7383247

Here's the result I'd like:

> mydata
     name     job    sex from      score
 1:  john teacher   male  NYT -0.6264538
 2:  john teacher   male USAT  0.1836433
 3:  john teacher   male   BG -0.8356286
 4:  john teacher   male TIME  1.5952808
 5:  mary  police female  NYT         NA
 6:  mary  police female USAT  0.3295078
 7:  mary  police female   BG         NA
 8:  mary  police female TIME         NA
 9: chris  lawyer female  NYT         NA
10: chris  lawyer female USAT         NA
11: chris  lawyer female   BG -0.8204684
12: chris  lawyer female TIME         NA
13: chris  lawyer   male  NYT  0.4874291
14: chris  lawyer   male USAT         NA
15: chris  lawyer   male   BG         NA
16: chris  lawyer   male TIME         NA
17: chris  doctor   male  NYT  0.7383247
18: chris  doctor   male USAT         NA
19: chris  doctor   male   BG         NA
20: chris  doctor   male TIME         NA

Here's what I've tried:

setkeyv(mydata, cols=c("name","job","sex"))
mydata[CJ(unique(name, job, sex), unique(from))]

But I receive the following error, and adding fromLast=TRUE (or FALSE) does not give me the right result:

Error in unique.default(name, job, sex) : 
  'fromLast' must be TRUE or FALSE

Here are the relevant answers I've come across (but none appears to deal with multiple keyed columns):

add missing rows to a data table
Efficiently inserting default missing rows in a data.table
Fastest way to add rows for missing values in a data.frame?


Answer 1:


A couple of possibilities are here - https://github.com/Rdatatable/data.table/pull/814

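# Cross join that accepts whole tables as well as plain vectors:
# cross-join the row indices of each argument, then index each argument by its indices.
CJ.dt = function(...) {
  rows = do.call(CJ, lapply(list(...), function(x) if(is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)))
  do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}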

setkey(mydata, name, job, sex, from)

mydata[CJ.dt(unique(data.table(name, job, sex)), unique(from))]
#     name     job    sex from      score
# 1: chris  doctor   male  NYT  0.7383247
# 2: chris  doctor   male   BG         NA
# 3: chris  doctor   male TIME         NA
# 4: chris  doctor   male USAT         NA
# 5: chris  lawyer female  NYT         NA
# 6: chris  lawyer female   BG -0.8204684
# 7: chris  lawyer female TIME         NA
# 8: chris  lawyer female USAT         NA
# 9: chris  lawyer   male  NYT  0.4874291
#10: chris  lawyer   male   BG         NA
#11: chris  lawyer   male TIME         NA
#12: chris  lawyer   male USAT         NA
#13:  john teacher   male  NYT -0.6264538
#14:  john teacher   male   BG -0.8356286
#15:  john teacher   male TIME  1.5952808
#16:  john teacher   male USAT  0.1836433
#17:  mary  police female  NYT         NA
#18:  mary  police female   BG         NA
#19:  mary  police female TIME         NA
#20:  mary  police female USAT  0.3295078
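
For context, the attempt in the question fails because unique() deduplicates a single object: in unique(name, job, sex), the second and third arguments are matched to unique.default()'s incomparables and fromLast parameters, which is where the 'fromLast' error comes from. The following is a minimal sketch of the same idea without a helper function, assuming a data.table version that supports the on= argument (1.9.6 or later); the names ids, grid, and result are illustrative and not part of the original answer.

```r
library(data.table)

set.seed(1)
mydata <- data.table(
  name  = c("john","john","john","john","mary","chris","chris","chris"),
  job   = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex   = c("male","male","male","male","female","female","male","male"),
  from  = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8)
)

# Deduplicate the ID columns jointly by passing unique() one table, not three vectors
ids <- unique(mydata[, .(name, job, sex)])

# Repeat every ID combination once per distinct value of `from`
grid <- ids[, .(from = unique(mydata$from)), by = .(name, job, sex)]

# Join on all four columns: every row of `grid` is kept,
# and `score` is NA where mydata has no matching row
result <- mydata[grid, on = .(name, job, sex, from)]
```

Because the join is specified with on=, this sketch does not need setkey(), and the rows come back in the order of grid (first appearance of each case) rather than sorted by key.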



Answer 2:


The dev version of tidyr now has an elegant way to do this, because the expand() function now supports nesting and crossing:

library(dplyr)
library(tidyr)  # expand() comes from tidyr, not dplyr

mydata <- data_frame(
  name = c("john","john","john","john","mary","chris","chris","chris"),
  job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex = c("male","male","male","male","female","female","male","male"),
  from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8)
)

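mydata %>% 
  # nesting() keeps only the (name, job, sex) combinations present in the data;
  # the dev version at the time wrote this as c(name, job, sex)
  expand(nesting(name, job, sex), from) %>% 
  left_join(mydata)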

#> Joining by: c("name", "job", "sex", "from")
#> Source: local data frame [20 x 5]
#> 
#>     name     job    sex from      score
#> 1  chris  doctor   male   BG         NA
#> 2  chris  doctor   male  NYT  0.5448206
#> 3  chris  doctor   male TIME         NA
#> 4  chris  doctor   male USAT         NA
#> 5  chris  lawyer female   BG  1.2015173
#> 6  chris  lawyer female  NYT         NA
#> 7  chris  lawyer female TIME         NA
#> 8  chris  lawyer female USAT         NA
#> 9  chris  lawyer   male   BG         NA
#> 10 chris  lawyer   male  NYT -1.0930237
#> 11 chris  lawyer   male TIME         NA
#> 12 chris  lawyer   male USAT         NA
#> 13  john teacher   male   BG  1.1345461
#> 14  john teacher   male  NYT  1.3032946
#> 15  john teacher   male TIME  2.4901830
#> 16  john teacher   male USAT -1.6449096
#> 17  mary  police female   BG         NA
#> 18  mary  police female  NYT         NA
#> 19  mary  police female TIME         NA
#> 20  mary  police female USAT -0.2443080
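
As a possible shortcut to the expand-then-join pattern above, tidyr also provides complete(), which performs both steps in one call. This is a minimal sketch, assuming a current CRAN release of tidyr in which nesting() is available; the variable name completed is illustrative and not from the original answer.

```r
library(tidyr)

set.seed(1)
mydata <- data.frame(
  name  = c("john","john","john","john","mary","chris","chris","chris"),
  job   = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex   = c("male","male","male","male","female","female","male","male"),
  from  = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8),
  stringsAsFactors = FALSE
)

# nesting() restricts the ID columns to combinations that occur in the data;
# crossing them with `from` adds the missing rows, with score filled as NA
completed <- complete(mydata, nesting(name, job, sex), from)
```

If a default other than NA is wanted for the added rows, complete() accepts a fill argument, for example fill = list(score = 0).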



Answer 3:


One possibility would be to paste the columns name, job, and sex together, get the unique values, and then do CJ with the unique values of from. After that, use cSplit from library(splitstackshape) to split the pasted column back to three columns, rename those columns with setnames, and join with mydata after setting the key.

library(splitstackshape)
library(data.table)
mydata1 <- setnames(cSplit(mydata[,CJ(unique(paste(name, job, sex)), 
            from=unique(from))], 'V1', ' '), 2:4, c('name', 'job', 'sex'))[,
                     c(2:4,1), with=FALSE]
setkeyv(mydata, cols=colnames(mydata)[1:4])
mydata[mydata1]
#     name     job    sex from      score
# 1: chris  doctor   male   BG         NA
# 2: chris  doctor   male  NYT  0.7383247
# 3: chris  doctor   male TIME         NA
# 4: chris  doctor   male USAT         NA
# 5: chris  lawyer female   BG -0.8204684
# 6: chris  lawyer female  NYT         NA
# 7: chris  lawyer female TIME         NA
# 8: chris  lawyer female USAT         NA
# 9: chris  lawyer   male   BG         NA
#10: chris  lawyer   male  NYT  0.4874291
#11: chris  lawyer   male TIME         NA
#12: chris  lawyer   male USAT         NA
#13:  john teacher   male   BG -0.8356286
#14:  john teacher   male  NYT -0.6264538
#15:  john teacher   male TIME  1.5952808
#16:  john teacher   male USAT  0.1836433
#17:  mary  police female   BG         NA
#18:  mary  police female  NYT         NA
#19:  mary  police female TIME         NA
#20:  mary  police female USAT  0.3295078
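
The paste-and-split idea can also be sketched with data.table alone, using tstrsplit() in place of cSplit. This is a minimal sketch under two assumptions that are not in the original answer: the separator "|" never occurs in the name, job, or sex values, and the installed data.table provides tstrsplit() and on= (1.9.6 or later). The names id, grid, and result are illustrative.

```r
library(data.table)

set.seed(1)
mydata <- data.table(
  name  = c("john","john","john","john","mary","chris","chris","chris"),
  job   = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex   = c("male","male","male","male","female","female","male","male"),
  from  = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8)
)

# Collapse the three ID columns into one string so CJ() can cross it with `from`
grid <- mydata[, CJ(id = unique(paste(name, job, sex, sep = "|")),
                    from = unique(from))]

# Split the pasted ID back into its three columns, then drop the helper column
grid[, c("name", "job", "sex") := tstrsplit(id, "|", fixed = TRUE)]
grid[, id := NULL]

# Join on all four columns; unmatched combinations get score = NA
result <- mydata[grid, on = .(name, job, sex, from)]
```

An explicit separator is used here rather than the space in the original so that values containing spaces would not be split into the wrong number of columns.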


Source: https://stackoverflow.com/questions/27372027/add-missing-rows-to-data-table-according-to-multiple-keyed-columns
