Question
I am looking for a more efficient way of doing an operation with a given dataframe.
library(purrr)
library(dplyr)
Here is a step by step description:
First, there is the function possible_matches, which, for each observation i in df, gives the indices of the rows that are possibly matchable to i; these are used in the next step:
possible_matches <- function(i, df) {
  # row indices that share id_0 with row i
  k1 <- df$j[df$id_0 == df$id_0[i]]
  # candidate rows: every row outside that id_0 group
  j2 <- setdiff(df$j, k1)
  # for each candidate, the rows that share its id_0
  k2 <- map(j2, ~ df$j[df$id_0[.] == df$id_0])
  # for each candidate group, the rows where Year and Quarter coincide with row i's group
  k3 <- map(k2, ~ map(.x, ~ df$j[df$Year[k1] == df$Year[.] &
                                   df$Quarter[k1] == df$Quarter[.]]) %>%
              unlist())
  # keep only the candidates with no such coincidence
  k4 <- map(k3, ~ length(.) == 0) %>% unlist()
  j2[k4]
}
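For illustration only (this toy data is not from the question), here is a minimal call; note that possible_matches expects the row-index column j that equalize_indices adds later:
# rows 1 and 2 share id_0; row 3 shares Year/Quarter with row 1, so only row 4 survives
toy <- data.frame(
  j       = 1:4,
  id_0    = c(1, 1, 2, 3),
  Year    = c(2010, 2011, 2010, 2012),
  Quarter = c(1, 2, 1, 3)
)
possible_matches(1, toy)
## [1] 4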
Basically, it takes all rows with the same id_0 as row i and then filters some out according to some criteria. This function is used inside match_1, which loops through all rows returned by possible_matches, filtering out more of them according to some other criteria (simplified here):
match_1 <- function(i, df) {
  j <- possible_matches(i, df)
  if (is_empty(j)) {
    # no candidates: the row is matched only with itself
    out <- i
  } else {
    # keep candidates whose V2009 is within 5 of row i's V2009
    g1 <- abs(df$V2009[i] - df$V2009[j]) <= 5
    out <- ifelse(!g1, i, j[g1])
  }
  return(out)
}
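To see the case where more than one observation comes back (again a toy example of my own, not from the question):
toy <- data.frame(
  j       = 1:4,
  id_0    = 1:4,                          # all distinct id_0
  Year    = c(2010, 2011, 2012, 2010),
  Quarter = c(1, 2, 3, 1),
  V2009   = c(20, 22, 40, 21)
)
match_1(1, toy)
## [1] 2 1
# row 2's V2009 is within 5 of row 1's, row 3's is not, so the result
# contains both the match (2) and the row's own index (1)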
Since match_1 possibly returns multiple observations per row, I have to group all paired rows as much as I can. I do this by defining:
modes <- function(x, y) {
  ux <- unique(x)
  tab <- tabulate(match(y, ux))
  ux[tab == max(tab)]
}
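As a quick illustration of my own: given the candidate indices x for one row and the full set of matched indices y, modes returns the candidates that occur most often in y:
modes(c(1, 2), c(1, 2, 2, 3))
## [1] 2
# the value 2 appears twice in y and 1 only once, so 2 is the mode among the candidates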
I run it inside the function equalize_indices, which also splits df into groups so that the loops do not run over unnecessary rows:
equalize_indices <- function(df, prev_id) {
  # split the (grouped) data frame into one tibble per group
  df1 <- df %>%
    group_split()
  # row indices within each group
  w <- df1 %>%
    map(~ .x %>%
          nrow() %>%
          seq())
  # add the row-index column j used by possible_matches()
  df1 <- map(df1, ~ .x %>%
               mutate(j = row_number()))
  # run match_1 for every row of every group
  x <- map2(w, df1, ~ map(.x, match_1, df = .y))
  # for each row, keep the most frequent matched index (smallest on ties)
  z <- map(x, function(x) {
    map(x, ~ modes(., x) %>%
          min())
  })
  # attach the index and collapse it to the smallest prev_id per matched group
  df3 <- map2(df1, z, ~ .x %>%
                mutate(index = .y) %>%
                group_by(index) %>%
                mutate(index = min({{ prev_id }})) %>%
                select(-j))
  df <- bind_rows(df3)
  return(df)
}
EDIT: Finally, here is some larger data with the expected output:
set.seed(1)
DF <- data.frame(
  UPA = 1,
  Quarter = sample(1:4, 8, replace = TRUE),
  Year = sample(2010:2015, 8, replace = TRUE),
  id_0 = sample(2:10, 8, replace = TRUE),
  V2009 = c(19, 22, 17, 10, 37, 19, 22, 17)
) %>%
  group_by(UPA)

DF %>%
  equalize_indices(prev_id = id_0)
Here is my question: it takes too long to run this procedure on a data frame with 25k rows (about 30 minutes) when using more conditions. Why is this? Is there some way to make the process faster? It needs to be scalable to very large data frames. I know looping can be slow, but by using group_split inside equalize_indices I keep the loops small.
How can I optimize this procedure? It is basically a process of splitting a dataset, eliminating the unmatchable observations for each row, picking the most common matched index for each row, and then binding the rows back together.
I am not even sure which exact part is so time consuming.
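One way to see where the time goes (not part of the original question) is to profile a run, for example with the profvis package; base R's Rprof would also work:
library(profvis)  # assumes profvis is installed
profvis({
  DF %>%
    equalize_indices(prev_id = id_0)
})
# the flame graph breaks the time down between possible_matches, match_1,
# and the purrr map calls inside equalize_indices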
Answer 1:
Here is a data.table approach:
First, we create a new function that, instead of returning an entire data.frame, just returns the index:
equalize_indices2 = function(DF) {
  n = nrow(DF)
  if (n == 1L) return(DF$id_0) ## short-circuit; no need to do any comparisons!
  j = seq(n)
  ## check equalities up front so we only have to do it once for all rows
  id_0_eq = outer(DF$id_0, DF$id_0, `==`)
  year_quarter_eq = outer(DF$Year, DF$Year, `==`) & outer(DF$Quarter, DF$Quarter, `==`)
  ## pre-allocate vector
  ans = vector('integer', n)
  ## simplify logic; id_0 == id_0 on the diagonal but none of the other elements are equal (e.g., id_0[1] != id_0[2])
  only_diag_true = sum(id_0_eq) == n
  for (i in j) {
    ###### replacement of possible_matches ######
    if (only_diag_true) {
      j2 = j[-i]
      tmp = j2[!year_quarter_eq[j2, i]]
    } else {
      ## TODO match expected output of larger dataset. May be due to downstream issues.
      k1 = j[id_0_eq[i, ]]
      j2 = j[-k1]
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, j[id_0_eq[x, ]]]], j2), use.names = FALSE)
      k4 = vapply(j2, function(x) length(j[year_quarter_eq[k1, j[id_0_eq[x, ]]]]) == 0L, NA)
      # k4 = unlist(Map(function(x) j[!year_quarter_eq[k1, x]], k2), use.names = FALSE)
      tmp = j2[k4]
    }
    ###### replacement of match_1 ######
    if (!length(tmp)) { ## this happens when nrow(df) == 1L or the year and quarters match... I think
      out = i
    } else {
      g1 = abs(DF$V2009[i] - DF$V2009[tmp]) <= 5
      out = ifelse(!g1, i, tmp[g1])
    }
    ###### replacement of modes ######
    ans[i] = if (length(out) == 1L) out else out[which.max(out)]
  }
  ## replacement of the df3 call in the OP
  index = ave(DF$id_0, ans, FUN = function(x) {x[1L]})
  return(index)
}
Then, we modify the data.table in place using the function we just created:
library(data.table)
setDT(df)
df[, index := equalize_indices2(.SD), by = UPA]
df
## UPA Quarter Year id_0 V2009 index
## <num> <char> <num> <int> <num> <int>
## 1: 1 1 2012 1 19 1
## 2: 1 1 2012 2 22 1
## 3: 1 1 2011 3 17 1
## 4: 1 1 2012 4 10 4
## 5: 2 1 2012 5 37 5
The main benefit is that we minimize the equality comparisons: for each UPA group, we compute the equalities only once per group instead of once per row. The other nice thing is that we do not have to deal with as many nested lists; in fact, this version uses no lists at all.
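For instance, outer builds the full pairwise comparison matrix in one call (a small illustration of mine, not taken from the answer):
id_0 <- c(1, 1, 2)
outer(id_0, id_0, `==`)
##       [,1]  [,2]  [,3]
## [1,]  TRUE  TRUE FALSE
## [2,]  TRUE  TRUE FALSE
## [3,] FALSE FALSE  TRUE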
This is still a WIP; I am going to work through the TODO logic later today, but this works for the current dataset. If you provide a larger dataset, I will test against that as well.
As far as performance goes, this is more than 100x faster than the OP's version. Note check = FALSE in the benchmark, because the two approaches return objects of different classes (a data.table vs. a plain data.frame):
dt = as.data.table(df)
bench::mark(
  dt_new_fx = dt[, index := equalize_indices2(.SD), by = UPA],
  OP = {df %>%
      equalize_indices(prev_id = id_0)},
  check = FALSE
)
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch> <bch:> <dbl> <bch:byt>
## 1 dt_new_fx 370us 390us 2439. 32.5KB
## 2 OP 46ms 46ms 21.8 39.5KB
Source: https://stackoverflow.com/questions/64364441/how-to-optimize-a-function-that-matches-observations-according-to-certain-criter