Question
I have two vectors. For each element of vector A, I would like to know all the elements of vector B that fulfill a certain condition. So, for example, two dataframes containing the vectors:
person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30))
For this example, each person in the "person" dataframe has a number of tickets from a carnival game, and each prize in the "prize" dataframe has a cost. But I'm not looking for perfect matches; instead of simply buying a prize, they randomly receive any prize that is within a 5-ticket cost tolerance of what they have.
The output I'm looking for is a dataframe of all the possible prizes each person could win. It would be something like:
person prize
1 Albert stickyhand
2 Albert moodring
3 Albert figurine
4 Albert whistle
5 Becca moodring
6 Becca figurine
... ...
And so on. Right now, I'm doing this with lapply(), but this is really no faster than a for() loop in R.
library(dplyr)

matching_Function <- function(person, prize, tolerance = 5){
  matchlist <- lapply(split(person, list(person$name)),
                      function(x) filter(prize, abs(x$tickets - cost) <= tolerance)$type)
  longlist <- data.frame("person" = rep(names(matchlist),
                                        times = unlist(lapply(matchlist, length))),
                         "prize" = unname(unlist(matchlist)))
  return(longlist)
}

matching_Function(person, prize)
My actual datasets are much larger (hundreds of thousands of rows each), and my matching condition is more complicated (checking whether coordinates in B fall within a set radius of coordinates in A), so this is taking forever (several hours).
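For concreteness, the coordinate version of my current approach looks roughly like the sketch below (A, B, and radius are simplified, hypothetical stand-ins for my real data):

# Hypothetical stand-ins for my real tables: A holds the query points, B the candidates
A <- data.frame(id = c("a1", "a2"), x = c(26, 16), y = c(92, 51))
B <- data.frame(name = c("b1", "b2", "b3"), x = c(25, 17, 40), y = c(88, 47, 10))
radius <- 10

# Current approach: for each row of A, keep the rows of B within the radius
matchlist <- lapply(split(A, A$id), function(a) {
  B$name[(B$x - a$x)^2 + (B$y - a$y)^2 <= radius^2]
})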
Are there any smarter ways than for() and lapply() to solve this?
Answer 1:
An alternative with foverlaps from data.table, doing what you wish:
require(data.table)

tolerance <- 5

# Turn the data frames into data.tables
setDT(person)
setDT(prize)

# Add the min and max of the tolerance range around each person's tickets
person[, `:=`(start = tickets - tolerance, end = tickets + tolerance)]

# Add a dummy column so that cost can be used as a (zero-width) range
prize[, dummy := cost]

# Key the person table on start and end
setkey(person, start, end)

# Use foverlaps to get the corresponding rows from prize into person,
# drop the non-matches (NA name) and keep only the name and the type of prize
r <- foverlaps(prize, person, type = "within", by.x = c("cost", "dummy"))[
  !is.na(name), list(name = name, prize = type)]

# Reorder the result by name instead of prize cost
setorder(r, name)
Output:
name prize
1: Albert stickyhand
2: Albert moodring
3: Albert figurine
4: Albert whistle
5: Becca moodring
6: Becca figurine
7: Becca whistle
8: Celine lollipop
9: Celine yo-yo
10: Celine stickyhand
11: Celine moodring
12: Dagwood yo-yo
13: Dagwood stickyhand
14: Dagwood moodring
I hope I commented the code enough for it to be self-explanatory.
For the second part of the question, using coordinates and testing whether they fall within a radius:
person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"),
                         x = c(26, 16, 32, 51),
                         y = c(92, 51, 25, 4)),
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17),
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)),
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")

setDT(person)
setDT(antenas)

r <- 10
results <- person[, {dx = x - antenas$x; dy = y - antenas$y
                     list(antena = antenas$name[dx^2 + dy^2 <= r^2])},
                  by = name]
data.table allows expressions in j, so for each person we can do the distance maths against all the antennas inside the query and return only the relevant rows with the antenna name. This should not consume too much memory, as it is done for each row of person rather than for the whole cross product at once.
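To make the j expression concrete, here is the same check written out long-hand for a single person (Becca, row 2 of person); it returns "L", matching her row in the results shown below:

# The distance check for one person (Becca, row 2), written out step by step
dx <- person$x[2] - antenas$x        # differences in x to every antenna
dy <- person$y[2] - antenas$y        # differences in y to every antenna
antenas$name[dx^2 + dy^2 <= r^2]     # antennas within radius r of Becca, here "L"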
Maths inspired by this question
This gives:
> results
name antena
1: Becca L
2: Celine G
3: Celine H
Answer 2:
This is pretty simple to do with your test data and a cross join:
library(data.table)
setDT(person)
setDT(prize)
person[, JA := 1]
prize[, JA := 1]
merge(person,prize, by = "JA", allow.cartesian = TRUE)[abs(tickets - cost) < 6, .(name, type)]
# name type
# 1: Albert stickyhand
# 2: Albert moodring
# 3: Albert figurine
# 4: Albert whistle
# 5: Becca moodring
# 6: Becca figurine
# 7: Becca whistle
# 8: Celine lollipop
# 9: Celine yo-yo
# 10: Celine stickyhand
# 11: Celine moodring
# 12: Dagwood yo-yo
# 13: Dagwood stickyhand
# 14: Dagwood moodring
What we are doing is a cross join of every person against every prize, and then excluding any rows that do not meet the criteria.
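For reference, the same tolerance filter can also be written as a non-equi join (a sketch, assuming data.table 1.9.8 or later), which returns the same 14 rows without materializing the full cross product:

# Sketch: non-equi join on the cost range instead of a cross join + filter
person[, `:=`(lo = tickets - 5, hi = tickets + 5)]
prize[person, .(name, type), on = .(cost >= lo, cost <= hi), nomatch = 0L]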
However, if this is a cross join of 100,000 rows against 100,000 rows, you may run out of memory with the single merge. In that case I would parallelize:
library(data.table)
library(foreach)
library(doParallel)

setDT(person)
setDT(prize)
person[, JA := 1]
prize[, JA := 1]

seq_s <- seq(1, nrow(person), by = 500)  # change the 500 here based on the memory/speed tradeoff
ln_s <- length(seq_s)
str_seq <- paste0(seq_s, ":", c(seq_s[2:ln_s], nrow(person) + 1) - 1)

cl <- makeCluster(4)
registerDoParallel(cl)

ls <- foreach(i = 1:ln_s) %dopar% {
  library(data.table)
  person_batch <- person[eval(parse(text = str_seq[i]))]
  Output <- merge(person_batch, prize, by = "JA", allow.cartesian = TRUE)
  Output <- Output[abs(tickets - cost) < 6, .(name, type)]
}
stopCluster(cl)

Output <- unique(do.call(rbind, ls))
stopCluster(cl)
Output <- unique(do.call(rbind,ls))
This is essentially the same process, just split into smaller batches that will not run into memory limits, because we are filtering each batch on the fly.
Source: https://stackoverflow.com/questions/39596195/avoiding-lapply-in-r-and-finding-all-elements-of-vector-b-that-meet-a-conditi