Question
I have a data.table object that contains multiple columns that specify unique cases. In the small example below, the variables "name", "job", and "sex" specify the unique IDs. I would like to add missing rows so that each case has a row for each possible instance of another variable, "from" (similar to expand.grid).
library(data.table)
set.seed(1)
mydata <- data.table(name = c("john","john","john","john","mary","chris","chris","chris"),
job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
sex = c("male","male","male","male","female","female","male","male"),
from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
score = rnorm(8))
Here's the current data.table object:
> mydata
name job sex from score
1: john teacher male NYT -0.6264538
2: john teacher male USAT 0.1836433
3: john teacher male BG -0.8356286
4: john teacher male TIME 1.5952808
5: mary police female USAT 0.3295078
6: chris lawyer female BG -0.8204684
7: chris lawyer male NYT 0.4874291
8: chris doctor male NYT 0.7383247
Here's the result I'd like:
> mydata
name job sex from score
1: john teacher male NYT -0.6264538
2: john teacher male USAT 0.1836433
3: john teacher male BG -0.8356286
4: john teacher male TIME 1.5952808
5: mary police female NYT NA
6: mary police female USAT 0.3295078
7: mary police female BG NA
8: mary police female TIME NA
9: chris lawyer female NYT NA
10: chris lawyer female USAT NA
11: chris lawyer female BG -0.8204684
12: chris lawyer female TIME NA
13: chris lawyer male NYT 0.4874291
14: chris lawyer male USAT NA
15: chris lawyer male BG NA
16: chris lawyer male TIME NA
17: chris doctor male NYT 0.7383247
18: chris doctor male USAT NA
19: chris doctor male BG NA
20: chris doctor male TIME NA
Here's what I've tried:
setkeyv(mydata, cols=c("name","job","sex"))
mydata[CJ(unique(name, job, sex), unique(from))]
But I receive the following error and adding fromLast=TRUE (or FALSE) does not give me the right solution:
Error in unique.default(name, job, sex) :
'fromLast' must be TRUE or FALSE
Here are the relevant answers I've come across (but none appears to deal with multiple keyed columns):
add missing rows to a data table
Efficiently inserting default missing rows in a data.table
Fastest way to add rows for missing values in a data.frame?
Answer 1:
A couple of possibilities are here - https://github.com/Rdatatable/data.table/pull/814
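# Cross join that also accepts data.frames/data.tables:
# cross the row (or element) indices of each argument, then subset each argument by them
CJ.dt = function(...) {
  rows = do.call(CJ, lapply(list(...), function(x) if (is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)))
  do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}
# Key on all four columns so the join matches on name, job, sex and from
setkey(mydata, name, job, sex, from)
mydata[CJ.dt(unique(data.table(name, job, sex)), unique(from))]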
# name job sex from score
# 1: chris doctor male NYT 0.7383247
# 2: chris doctor male BG NA
# 3: chris doctor male TIME NA
# 4: chris doctor male USAT NA
# 5: chris lawyer female NYT NA
# 6: chris lawyer female BG -0.8204684
# 7: chris lawyer female TIME NA
# 8: chris lawyer female USAT NA
# 9: chris lawyer male NYT 0.4874291
#10: chris lawyer male BG NA
#11: chris lawyer male TIME NA
#12: chris lawyer male USAT NA
#13: john teacher male NYT -0.6264538
#14: john teacher male BG -0.8356286
#15: john teacher male TIME 1.5952808
#16: john teacher male USAT 0.1836433
#17: mary police female NYT NA
#18: mary police female BG NA
#19: mary police female TIME NA
#20: mary police female USAT 0.3295078
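The same idea can be sketched without a helper function, assuming a data.table version that supports the on= argument (1.9.6 or later). The names ids and grid are only illustrative: the unique ID combinations are taken first, each one is expanded to every value of from, and mydata is then joined onto that grid.

```r
library(data.table)

set.seed(1)
mydata <- data.table(name = c("john","john","john","john","mary","chris","chris","chris"),
                     job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
                     sex = c("male","male","male","male","female","female","male","male"),
                     from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
                     score = rnorm(8))

# One row per existing (name, job, sex) combination
ids <- unique(mydata[, .(name, job, sex)])

# Repeat every combination once for each distinct value of `from`
grid <- ids[, .(from = unique(mydata$from)), by = .(name, job, sex)]

# Join: rows of grid with no match in mydata get score = NA
result <- mydata[grid, on = .(name, job, sex, from)]
```

Because the join uses on=, no key has to be set on mydata beforehand.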
Answer 2:
The dev version of tidyr now has an elegant way to do this because the expand() function now supports nesting and crossing:
library(dplyr)
library(tidyr)  # expand() comes from tidyr
mydata <- data_frame(
name = c("john","john","john","john","mary","chris","chris","chris"),
job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
sex = c("male","male","male","male","female","female","male","male"),
from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
score = rnorm(8)
)
mydata %>%
expand(c(name, job, sex), from) %>%
left_join(mydata)
#> Joining by: c("name", "job", "sex", "from")
#> Source: local data frame [20 x 5]
#>
#> name job sex from score
#> 1 chris doctor male BG NA
#> 2 chris doctor male NYT 0.5448206
#> 3 chris doctor male TIME NA
#> 4 chris doctor male USAT NA
#> 5 chris lawyer female BG 1.2015173
#> 6 chris lawyer female NYT NA
#> 7 chris lawyer female TIME NA
#> 8 chris lawyer female USAT NA
#> 9 chris lawyer male BG NA
#> 10 chris lawyer male NYT -1.0930237
#> 11 chris lawyer male TIME NA
#> 12 chris lawyer male USAT NA
#> 13 john teacher male BG 1.1345461
#> 14 john teacher male NYT 1.3032946
#> 15 john teacher male TIME 2.4901830
#> 16 john teacher male USAT -1.6449096
#> 17 mary police female BG NA
#> 18 mary police female NYT NA
#> 19 mary police female TIME NA
#> 20 mary police female USAT -0.2443080
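The call above reflects the development syntax at the time of the answer. In released versions of tidyr, nested combinations are written with nesting(), and complete() combines the expand and join steps. A minimal sketch under that assumption (the name completed is only illustrative, and set.seed(1) is added so the scores match the question):

```r
library(tidyr)

set.seed(1)
mydata <- data.frame(
  name = c("john","john","john","john","mary","chris","chris","chris"),
  job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex = c("male","male","male","male","female","female","male","male"),
  from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8),
  stringsAsFactors = FALSE
)

# nesting() keeps only the (name, job, sex) combinations present in the data;
# complete() crosses them with every value of `from` and fills score with NA
completed <- complete(mydata, nesting(name, job, sex), from)
```

The two-step form would be expand(mydata, nesting(name, job, sex), from) followed by a left_join() back onto mydata.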
Answer 3:
One possibility would be to paste the columns name, job, and sex together, get the unique values, and then do CJ with the unique values of from. After that, use cSplit from library(splitstackshape) to split the pasted column back to three columns, rename those columns with setnames, and join with mydata after setting the key.
library(splitstackshape)
library(data.table)
mydata1 <- setnames(cSplit(mydata[,CJ(unique(paste(name, job, sex)),
from=unique(from))], 'V1', ' '), 2:4, c('name', 'job', 'sex'))[,
c(2:4,1), with=FALSE]
setkeyv(mydata, cols=colnames(mydata)[1:4])
mydata[mydata1]
# name job sex from score
#1: chris doctor male BG NA
#2: chris doctor male NYT 0.7383247
#3: chris doctor male TIME NA
#4: chris doctor male USAT NA
#5: chris lawyer female BG -0.8204684
#6: chris lawyer female NYT NA
#7: chris lawyer female TIME NA
#8: chris lawyer female USAT NA
#9: chris lawyer male BG NA
#10: chris lawyer male NYT 0.4874291
#11: chris lawyer male TIME NA
#12: chris lawyer male USAT NA
#13: john teacher male BG -0.8356286
#14: john teacher male NYT -0.6264538
#15: john teacher male TIME 1.5952808
#16: john teacher male USAT 0.1836433
#17: mary police female BG NA
#18: mary police female NYT NA
#19: mary police female TIME NA
#20: mary police female USAT 0.3295078
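The paste-and-split steps can also be sketched with data.table alone, swapping cSplit for tstrsplit. This assumes a data.table version that provides tstrsplit and on= (1.9.6 or later) and that none of the ID values contains a space; the names id, grid and result are only illustrative.

```r
library(data.table)

set.seed(1)
mydata <- data.table(name = c("john","john","john","john","mary","chris","chris","chris"),
                     job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
                     sex = c("male","male","male","male","female","female","male","male"),
                     from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
                     score = rnorm(8))

# Cross the pasted IDs with every value of `from`
grid <- mydata[, CJ(id = unique(paste(name, job, sex)), from = unique(from))]

# Split the pasted ID back into its three columns, then drop it
grid[, c("name", "job", "sex") := tstrsplit(id, " ", fixed = TRUE)][, id := NULL]
setcolorder(grid, c("name", "job", "sex", "from"))

# Join: combinations missing from mydata get score = NA
result <- mydata[grid, on = c("name", "job", "sex", "from")]
```

If an ID value could contain a space, a separator that cannot occur in the data would be needed in both paste and tstrsplit.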
Source: https://stackoverflow.com/questions/27372027/add-missing-rows-to-data-table-according-to-multiple-keyed-columns