Faster equivalent to group_by %>% expand in R

自闭症网瘾萝莉.ら posted on 2019-12-06 03:44:18

You could do

out <- DT[, .(col = seq.int(Start_year, 2015L)), by = ID]
out
#    ID  col
# 1:  1 1999
# 2:  1 2000
# 3:  1 2001
# 4:  1 2002
# 5:  1 2003
# 6:  1 2004
# 7:  1 2005
# 8:  1 2006
# 9:  1 2007
# ...

In your case you would probably need to do

setDT(df)[, .(col = seq.int(Start_year, 2015L)), by = ID]
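As a minimal sketch of that call on a plain data.frame (the `df` below is a hypothetical stand-in built from the sample data, not the asker's real table):

```r
library(data.table)

# Hypothetical stand-in for the asker's data: a plain data.frame.
df <- data.frame(ID = 1:4, Start_year = c(1999L, 2004L, 2015L, 2007L))

# setDT() converts df to a data.table by reference, so no copy is made.
# Each ID has one row, so seq.int() gets a single Start_year per group.
out <- setDT(df)[, .(col = seq.int(Start_year, 2015L)), by = ID]

is.data.table(df)   # df itself is now a data.table
out[, .N, by = ID]  # number of expanded rows per ID
```

Note that `setDT()` changes `df` in place; if the original data.frame has to stay untouched, `as.data.table(df)` makes a copy instead.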

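A tidyverse version of the same idea:

library(readr); library(dplyr); library(tidyr)
tbl <- read_table(text)  # `text` is defined under "data" below

tbl %>% 
  group_by(ID) %>% 
  # store the full year sequence for each ID in a list column
  mutate(Start_year = list(seq.int(Start_year, 2015L))) %>%
  # rename(new_col = Start_year)
  # name the column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
  unnest(cols = Start_year)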

data

text <- "ID    Start_year
01          1999
02          2004
03          2015
04          2007"

library(data.table)
DT <- fread(text)

If you have enough memory, you could take the full set of IDs x years and filter it with a rolling join:

res <- DT[
  CJ(ID, Start_year = seq.int(min(Start_year), 2015L)), 
  on=.(ID, Start_year), 
  roll=TRUE, 
  nomatch=0
]

setnames(res, "Start_year", "Year")[]

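CJ takes the "cross join" of the vector of IDs and the vector of years. roll=TRUE then carries each ID's Start_year forward to every later year, and nomatch=0 drops the years that come before it. If you are not on the latest version of data.table, you may need to name both arguments (i.e., CJ(ID = ID, Start_year = seq.int(min(Start_year), 2015L))).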
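To see what CJ produces on its own, here is a small sketch with made-up inputs (two IDs and three years, chosen only for illustration):

```r
library(data.table)

# CJ() returns every combination of its arguments as a sorted data.table.
grid <- CJ(ID = 1:2, Year = 2013:2015)

grid        # one row per (ID, Year) pair
nrow(grid)  # 2 IDs x 3 years
```

This is why the approach needs enough memory: the grid has one row for every ID and every year in the range, including the years that the join later discards.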

Comment. The OP says @markus' approach already brings the operation down to seconds, so maybe further improvement is not needed... Also, I'm not really sure that there are any circumstances under which my approach would be faster.
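If you want to check the two data.table approaches against each other on your own data, a sketch along these lines might help. The inline table is a stand-in for the sample data, and the names `by_group` and `rolled` are only illustrative:

```r
library(data.table)

# Stand-in for the sample data (IDs read as integers, as fread does).
DT <- data.table(ID = 1:4, Start_year = c(1999L, 2004L, 2015L, 2007L))

# Approach 1: build one year sequence per ID.
by_group <- DT[, .(Year = seq.int(Start_year, 2015L)), by = ID]

# Approach 2: full ID x year grid, filtered with a rolling join.
rolled <- DT[
  CJ(ID = ID, Start_year = seq.int(min(Start_year), 2015L)),
  on = .(ID, Start_year),
  roll = TRUE,
  nomatch = 0L
]
setnames(rolled, "Start_year", "Year")

# Put both results in the same order before comparing them.
setkey(by_group, ID, Year)
setkey(rolled, ID, Year)
all.equal(by_group, rolled, check.attributes = FALSE)
```

Wrapping each approach in system.time() on the real table would show which one is faster there; the four-row sample is too small to say anything about speed.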

A tidyverse-style solution using the padr package could be:

df <- data.table::fread("
ID    Start_year
01          1999
02          2004
03          2015
04          2007")

library(padr)
library(tidyverse)

df %>% 
  pad_int('Start_year', 
          end_val = 2015, 
          group = "ID")