Avoiding lapply() in R, and finding all elements of Vector B that meet a condition of for each element of Vector A

橙三吉。 提交于 2019-12-24 11:02:37

问题


I have two vectors. For each element of vector A, I would like to know all the elements of vector B that fulfill a certain condition. So, for example, two dataframes containing the vectors:

person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                 tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand", 
                         "moodring", "figurine", "whistle", "saxophone"),
                cost = c(6, 11, 13, 17, 21, 23, 25, 30))

For this example, each person in the "person" dataframe has a number of tickets from a carnival game, and each prize in the "prize" dataframe has a cost. But I'm not looking for perfect matches; instead of simply buying a prize, they randomly receive any prize that is within a 5-ticket cost tolerance of what they have.

The output I'm looking for is a dataframe of all the possible prizes each person could win. It would be something like:

    person      prize
1   Albert stickyhand
2   Albert   moodring
3   Albert   figurine
4   Albert    whistle
5    Becca   moodring
6    Becca   figurine
       ...        ...

And so on. Right now, I'm doing this with lapply(), but this is really no faster than a for() loop in R.

library(dplyr)
matching_Function <- function(person, prize, tolerance = 5){
  matchlist <- lapply(split(person, list(person$name)),
                      function(x) filter(prize, abs(x$tickets-cost)<=tolerance)$type)
  longlist <- data.frame("person" = rep(names(matchlist), 
                                    times = unlist(lapply(matchlist, length))),
                         "prize" = unname(unlist(matchlist))
  )
  return(longlist)
}
matching_Function(person, prize)

My actual datasets are much larger (in the hundreds of thousands), and my matching conditions are more complicated (checking coordinates from B to see whether they are within a set radius of coordinates from A), so this is taking forever (several hours).

Are there any smarter ways than for() and lapply() to solve this?


回答1:


An alternative with foverlaps from data.table doing what you wish:

require(data.table)

# Turn the datasets into data.table
setDT(person)
setDT(prize)
# Add the min and max from tolerance
person[,`:=`(start=tickets-tolerance,end=tickets+tolerance)]
# add a dummy column for use as range
prize[,dummy:=cost]
# Key the person table on start and end
setkey(person,start,end)
# As foverlaps to get the corresponding rows from prize into person, filter the NA results and return only the name and type of prize
r<-foverlaps(prize,person,type="within",by.x=c("cost","dummy"))[!is.na(name),list(name=name,prize=type)]
# Re order the result by name instead of prize cost
setorder(r,name)

Output:

       name      prize
 1:  Albert stickyhand
 2:  Albert   moodring
 3:  Albert   figurine
 4:  Albert    whistle
 5:   Becca   moodring
 6:   Becca   figurine
 7:   Becca    whistle
 8:  Celine   lollipop
 9:  Celine      yo-yo
10:  Celine stickyhand
11:  Celine   moodring
12: Dagwood      yo-yo
13: Dagwood stickyhand
14: Dagwood   moodring

I hope I commented enough the code to be self explanatory.


For the second part of the question, using coordinates and testing within a radius.

person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"), 
                         x = c(26, 16, 32, 51), 
                         y = c(92, 51, 25, 4)), 
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"), 
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17), 
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)), 
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")

setDT(person)
setDT(antenas)
r<-10

results <- person[,{dx=x-antenas$x;dy=y-antenas$y; list(antena=antenas$name[dx^2+dy^2<=r^2])},by=name]

Data.table allow expression in j, so we can do the maths of the outer join for each person against antennas and return only relevant rows with antenna name.

This should not be to much memory consuming as it's done for each row on person and not as a whole.

Maths inspired by this question

This give:

> results
     name antena
1:  Becca      L
2: Celine      G
3: Celine      H



回答2:


This is pretty simple to do with your test data and a full outer join:

library(data.table)

setDT(person)
setDT(prize)

person[, JA := 1]
prize[, JA := 1]

merge(person,prize, by = "JA", allow.cartesian = TRUE)[abs(tickets - cost) < 6, .(name, type)]

#       name       type
# 1:  Albert stickyhand
# 2:  Albert   moodring
# 3:  Albert   figurine
# 4:  Albert    whistle
# 5:   Becca   moodring
# 6:   Becca   figurine
# 7:   Becca    whistle
# 8:  Celine   lollipop
# 9:  Celine      yo-yo
# 10:  Celine stickyhand
# 11:  Celine   moodring
# 12: Dagwood      yo-yo
# 13: Dagwood stickyhand
# 14: Dagwood   moodring

What we are doing is a full outer join, and then excluding any rows that do not meet the criteria.

However, if this is a full outer join of 100,000 on 100,000, you may run out of memory with this approach. In this case I would parallelize:

library(data.table)
library(foreach)
library(doParallel)    

setDT(person)
setDT(prize)
person[, JA := 1]
prize[, JA := 1]

seq_s <- seq(1,nrow(person), by = 500) #change the 500 here based on memory/speed tradeoff
ln_s <- length(seq_s)
str_seq <- paste0(seq_s,":",c(seq_s[2:ln_s],nrow(person) + 1) - 1)

cl<-makeCluster(4)
registerDoParallel(cl)

ls<-foreach(i = 1:ln_s) %dopar% {

  library(data.table)
  person_batch <- person[eval(parse(text = str_seq[i]))]
  Output <- merge(person_batch,prize, by = "JA", allow.cartesian = TRUE)
  Output <- Output[abs(tickets - cost) < 6, .(name, type)]
}

stopCluster(cl)

Output <- unique(do.call(rbind,ls))

This is essentially the exact same process, just split into smaller batches that will not run into memory limits because we are filtering on the fly



来源:https://stackoverflow.com/questions/39596195/avoiding-lapply-in-r-and-finding-all-elements-of-vector-b-that-meet-a-conditi

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!