Question
I have a nested list
combine <- list(c('A', 'B', 'C'), c('D', 'H', 'G', 'J'), c('A', 'E'))
and a df
df <- data.frame(appln_id = c(1, 1, 2, 2, 4, 4, 4, 3, 3, 3, 3, 5, 9, 9),
                 prior_year = c(1997, 1997, 1997, 1997, 1997, 1997, 1997, 1998, 1998, 1998, 1998, 2000, 2000, 2000),
                 IPC = c('B', 'E', 'E', 'B', 'H', 'J', 'D', 'H', 'J', 'D', 'E', 'A', 'E', 'B'))
I want to aggregate IPC according to appln_id (e.g., for appln_id = 1: c('B','E'); for appln_id = 2: c('E','B'); for appln_id = 4: c('H','J','D'); etc.). Then, for each value of prior_year, I want to compare the IPC sets to the elements of the list combine.
For each IPC set that is not a subset of any element of combine, I want to save its rows from df in another dataframe called new, as follows:
new <- data.frame(appln_id = c(1, 1, 3, 3, 3, 3),
                  prior_year = c(1997, 1997, 1998, 1998, 1998, 1998),
                  IPC = c('B', 'E', 'H', 'J', 'D', 'E'))
and add this IPC set into combine as follows:
combine <- list(c('A', 'B', 'C'), c('D', 'H', 'G', 'J'), c('A', 'E'), c('B', 'E'), c('D','E','J','H'))
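The core operation here (aggregating IPC by appln_id, then testing whether each set is a subset of some element of combine) can be sketched with split() and sapply(); the df below is a made-up two-id subset of the question's data, purely for illustration:

```r
# Minimal sketch: aggregate IPC per appln_id, then test subset membership.
df <- data.frame(appln_id = c(1, 1, 2, 2),
                 IPC = c("B", "E", "E", "B"),
                 stringsAsFactors = FALSE)
combine <- list(c("A", "B", "C"), c("D", "H", "G", "J"), c("A", "E"))

ipc_sets <- split(df$IPC, df$appln_id)   # list: "1" -> c("B","E"), "2" -> c("E","B")

# Is the set for appln_id 1 a subset of any element of combine?
is_known <- sapply(combine, function(k) all(ipc_sets[["1"]] %in% k))
any(is_known)  # FALSE: c("B","E") matches no element, so it would be added to combine
```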
This is my code:
new <- data.frame(appln_id = integer(), prior_year = integer(), IPC = character())
new_combine <- list()
prior_year <- unique(df$prior_year)
appln_id <- unique(df$appln_id)
for (i in prior_year) {
  for (j in appln_id) {
    x <- sort(df$IPC[df$prior_year == i & df$appln_id == j])
    if (length(x) > 0 && !any(sapply(combine, function(k) all(x %in% k)))) {
      new <- rbind(new, df[df$appln_id == j, ])
      new_combine[[length(new_combine) + 1]] <- x
    }
  }
  combine <- c(combine, unique(new_combine))
}
However, my code takes too long to run. Does anyone have a faster way to do this? Thank you.
Answer 1:
Here's something that is only a single loop. Up front, though, I changed $IPC from factor to character, since merging differing factor levels can be a little annoying. (If you're on R 4.0 or later, or $IPC is already character, there's no need for this step.)
df$IPC <- as.character(df$IPC)
df$usable <- TRUE
df$grps <- interaction(df$appln_id, df$prior_year)
newlist <- list()
for (grp in levels(df$grps)) {
  rows <- df$grps == grp & df$usable
  if (!any(rows)) next
  thisIPC <- df$IPC[rows]
  matches <- sapply(combine, function(comb) all(thisIPC %in% comb))
  if (any(matches)) {
    # repeat: already a subset of an element of combine, nothing to do
  } else {
    # new combination: record it and mark its rows for removal
    combine <- c(combine, list(thisIPC))
    newlist <- c(newlist, list(df[rows, ]))
    df$usable[rows] <- FALSE
  }
}
df <- df[df$usable,]
new <- do.call(rbind, newlist)
df$usable <- df$grps <-
new$usable <- new$grps <- NULL
df
# appln_id prior_year IPC
# 3 2 1997 E
# 4 2 1997 B
# 5 4 1997 H
# 6 4 1997 J
# 7 4 1997 D
# 12 5 2000 A
# 13 9 2000 E
# 14 9 2000 B
new
# appln_id prior_year IPC
# 1 1 1997 B
# 2 1 1997 E
# 8 3 1998 H
# 9 3 1998 J
# 10 3 1998 D
# 11 3 1998 E
str(combine)
# List of 5
# $ : chr [1:3] "A" "B" "C"
# $ : chr [1:4] "D" "H" "G" "J"
# $ : chr [1:2] "A" "E"
# $ : chr [1:2] "B" "E"
# $ : chr [1:4] "H" "J" "D" "E"
Notes:
- I create the $grps variable to make easy single-loop grouping; once the loop is done, feel free to remove it. Using factor and then levels ensures that I iterate over each combination that is present, nothing more.
- I may be going to more extremes than necessary, but iteratively growing frames is bad for long-term performance: each time you "add rows", the entire frame is copied in memory, so with each addition you duplicate the frame's memory footprint. Granted, the old memory is eventually freed, but it is a known thing that this slows down noticeably as the data grows. (See chapter 2, Growing Objects, in The R Inferno.) This applies, to a slightly lesser degree, to iteratively removing rows, too.
- Because of this, I don't actually change the contents of the frame until the very end. To accommodate this, I add a column $usable to indicate whether a row should be removed at the end. (In the unlikely event that you run this code twice on the same frame, I also use $usable when grabbing $IPC; that might just be overly defensive.)
- Post-loop, I remove the relevant rows from df once, and do a single row-concatenation (rbind) on newlist, which is a list of frames (or empty, if nothing happened).
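The accumulate-then-bind pattern described above can be sketched on its own; the per-iteration frames here are made up purely for illustration:

```r
# Sketch: collect per-iteration results in a list, then rbind once at the end,
# instead of growing a data.frame inside the loop.
chunks <- list()
for (i in 1:3) {
  # pretend each iteration produces a small frame of results
  chunks[[length(chunks) + 1]] <- data.frame(id = i, val = i * 10)
}
res <- do.call(rbind, chunks)   # single concatenation
res
#   id val
# 1  1 10
# 2  2 20
# 3  3 30
```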
Source: https://stackoverflow.com/questions/62077664/comparison-loop