Question
I am looking for a more efficient way of doing an operation with a given dataframe.
library(purrr)
library(dplyr)
Here is a step by step description:
First, there is the function possible_matches, which, for each observation i in df, gives the indices of the rows that are possibly matchable to i; these are used in the next step:
possible_matches <- function(i, df) {
  # row indices that share id_0 with row i
  k1 <- df$j[df$id_0 == df$id_0[i]]
  # candidate rows: every row outside that id_0 group
  j2 <- setdiff(df$j, k1)
  # for each candidate, the rows that share its id_0
  k2 <- map(j2, ~ df$j[df$id_0[.] == df$id_0])
  # for each candidate group, the rows where Year and Quarter coincide with row i's group
  k3 <- map(k2, ~ map(.x, ~ df$j[df$Year[k1] == df$Year[.] &
                                   df$Quarter[k1] == df$Quarter[.]]) %>%
              unlist())
  # keep only the candidates with no such coincidence
  k4 <- map(k3, ~ length(.) == 0) %>% unlist()
  j2[k4]
}
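For illustration only (this toy data is not from the question), here is a minimal call; note that possible_matches expects the row-index column j that equalize_indices adds later:
# rows 1 and 2 share id_0; row 3 shares Year/Quarter with row 1, so only row 4 survives
toy <- data.frame(
  j       = 1:4,
  id_0    = c(1, 1, 2, 3),
  Year    = c(2010, 2011, 2010, 2012),
  Quarter = c(1, 2, 1, 3)
)
possible_matches(1, toy)
## [1] 4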
Basically, it takes all rows with the same id_0 as row i and then filters some out according to some criteria. This function is used inside match_1, which loops through all rows returned by possible_matches, filtering out more of them according to some other criteria (simplified here):
match_1 <- function(i, df) {
  j <- possible_matches(i, df)
  if (is_empty(j)) {
    # no candidates: the row is matched only with itself
    out <- i
  } else {
    # keep candidates whose V2009 is within 5 of row i's V2009
    g1 <- abs(df$V2009[i] - df$V2009[j]) <= 5
    out <- ifelse(!g1, i, j[g1])
  }
  return(out)
}
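To see the case where more than one observation comes back (again a toy example of my own, not from the question):
toy <- data.frame(
  j       = 1:4,
  id_0    = 1:4,                          # all distinct id_0
  Year    = c(2010, 2011, 2012, 2010),
  Quarter = c(1, 2, 3, 1),
  V2009   = c(20, 22, 40, 21)
)
match_1(1, toy)
## [1] 2 1
# row 2's V2009 is within 5 of row 1's, row 3's is not, so the result
# contains both the match (2) and the row's own index (1)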
Since match_1 possibly returns multiple observations per row, I have to group all paired rows as much as I can. I do this by defining:
modes <- function(x, y) {
  ux <- unique(x)
  tab <- tabulate(match(y, ux))
  ux[tab == max(tab)]
}
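As a quick illustration of my own: given the candidate indices x for one row and the full set of matched indices y, modes returns the candidates that occur most often in y:
modes(c(1, 2), c(1, 2, 2, 3))
## [1] 2
# the value 2 appears twice in y and 1 only once, so 2 is the mode among the candidates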
I run it inside the function equalize_indices, which also splits df into groups so that the loops do not run over unnecessary rows:
equalize_indices <- function(df, prev_id) {
  # split the (grouped) data frame into one tibble per group
  df1 <- df %>%
    group_split()
  # row indices within each group
  w <- df1 %>%
    map(~ .x %>%
          nrow() %>%
          seq())
  # add the row-index column j used by possible_matches()
  df1 <- map(df1, ~ .x %>%
               mutate(j = row_number()))
  # run match_1 for every row of every group
  x <- map2(w, df1, ~ map(.x, match_1, df = .y))
  # for each row, keep the most frequent matched index (smallest on ties)
  z <- map(x, function(x) {
    map(x, ~ modes(., x) %>%
          min())
  })
  # attach the index and collapse it to the smallest prev_id per matched group
  df3 <- map2(df1, z, ~ .x %>%
                mutate(index = .y) %>%
                group_by(index) %>%
                mutate(index = min({{ prev_id }})) %>%
                select(-j))
  df <- bind_rows(df3)
  return(df)
}
EDIT: Finally, here is some larger data with the expected output:
set.seed(1)
DF <- data.frame(
  UPA = 1,
  Quarter = sample(1:4, 8, replace = TRUE),
  Year = sample(2010:2015, 8, replace = TRUE),
  id_0 = sample(2:10, 8, replace = TRUE),
  V2009 = c(19, 22, 17, 10, 37, 19, 22, 17)
) %>%
  group_by(UPA)

DF %>%
  equalize_indices(prev_id = id_0)
Here is my question: it takes too long to run this procedure on a data frame with 25k rows (about 30 minutes) when using more conditions. Why is this? Is there some way to make the process faster? It needs to be scalable to very large data frames. I know looping can be slow, but by using group_split inside equalize_indices I keep the loops small.
How can I optimize this procedure? It is basically a process of splitting a dataset, eliminating the unmatchable observations for each row, picking the most common matched index for each row, and then binding the rows back together.
I am not even sure which exact part is so time consuming.
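One way to see where the time goes (not part of the original question) is to profile a run, for example with the profvis package; base R's Rprof would also work:
library(profvis)  # assumes profvis is installed
profvis({
  DF %>%
    equalize_indices(prev_id = id_0)
})
# the flame graph breaks the time down between possible_matches, match_1,
# and the purrr map calls inside equalize_indices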
Answer 1:
Here is a data.table approach:
First, we create a new function that, instead of returning an entire data.frame, just returns the index:
equalize_indices2 = function(DF) {
  n = nrow(DF)
  if (n == 1L) return(DF$id_0) ## short-circuit; no need to do any comparisons!
  j = seq(n)
  ## check equalities up front so we only have to do it once for all rows
  id_0_eq = outer(DF$id_0, DF$id_0, `==`)
  year_quarter_eq = outer(DF$Year, DF$Year, `==`) & outer(DF$Quarter, DF$Quarter, `==`)
  ## pre-allocate vector
  ans = vector('integer', n)
  ## simplify logic; id_0 == id_0 on the diagonal but none of the other elements are equal (e.g., id_0[1] != id_0[2])
  only_diag_true = sum(id_0_eq) == n
  for (i in j) {
    ###### replacement of possible_matches ######
    if (only_diag_true) {
      j2 = j[-i]
      tmp = j2[!year_quarter_eq[j2, i]]
    } else {
      ## TODO match expected output of larger dataset. May be due to downstream issues.
      k1 = j[id_0_eq[i, ]]
      j2 = j[-k1]
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, j[id_0_eq[x, ]]]], j2), use.names = FALSE)
      k4 = vapply(j2, function(x) length(j[year_quarter_eq[k1, j[id_0_eq[x, ]]]]) == 0L, NA)
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, x]], k2), use.names = FALSE)
      tmp = j2[k4]
    }
    ###### replacement of match_1 ######
    if (!length(tmp)) { ## this happens when nrow(df) == 1L or the year and quarters match... I think
      out = i
    } else {
      g1 = abs(DF$V2009[i] - DF$V2009[tmp]) <= 5
      out = ifelse(!g1, i, tmp[g1])
    }
    ###### replacement of modes ######
    ans[i] = if (length(out) == 1L) out else out[which.max(out)]
  }
  ## replacement of the df3 call in the OP
  index = ave(DF$id_0, ans, FUN = function(x) {x[1L]})
  return(index)
}
Then, we modify the data.table in place using the function we just created:
library(data.table)
setDT(df)
df[, index := equalize_indices2(.SD), by = UPA]
df
## UPA Quarter Year id_0 V2009 index
## <num> <char> <num> <int> <num> <int>
## 1: 1 1 2012 1 19 1
## 2: 1 1 2012 2 22 1
## 3: 1 1 2011 3 17 1
## 4: 1 1 2012 4 10 4
## 5: 2 1 2012 5 37 5
The main benefit is that we minimize the equality comparisons: for each UPA group, we compute the equalities only once per group instead of once per row. The other nice thing is that we do not have to deal with as many nested lists; in fact, this version uses no lists at all.
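For instance, outer builds the full pairwise comparison matrix in one call (a small illustration of mine, not taken from the answer):
id_0 <- c(1, 1, 2)
outer(id_0, id_0, `==`)
##       [,1]  [,2]  [,3]
## [1,]  TRUE  TRUE FALSE
## [2,]  TRUE  TRUE FALSE
## [3,] FALSE FALSE  TRUE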
This is still a WIP; I am going to work through the TODO logic later today, but this works for the current dataset. If you provide a larger dataset, I will test against that as well.
As far as performance goes, this is more than 100x faster than the OP's version. Note check = FALSE in the benchmark, because the two approaches return objects of different classes (a data.table vs. a plain data.frame):
dt = as.data.table(df)
bench::mark(
  dt_new_fx = dt[, index := equalize_indices2(.SD), by = UPA],
  OP = {df %>%
      equalize_indices(prev_id = id_0)},
  check = FALSE
)
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch> <bch:> <dbl> <bch:byt>
## 1 dt_new_fx 370us 390us 2439. 32.5KB
## 2 OP 46ms 46ms 21.8 39.5KB
Source: https://stackoverflow.com/questions/64364441/how-to-optimize-a-function-that-matches-observations-according-to-certain-criter