Question
I need to merge (left join) two data sets x and y:
merge(x, y, by.x = "z", by.y = "zP", all.x = TRUE)
Not every value of z is present in zP, but there is always a nearest value in zP, so we need to match each z to the nearest value in zP when merging.
For example:
z <- c(0.231, 0.045, 0.632, 0.217, 0.092, ...)
zP <- c(0.010,0.013, 0.017, 0.021, ...)
How can we do it in R?
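In other words, for each value of z the goal is to pick the zP with the smallest absolute difference and then merge on that matched value. A minimal base-R sketch of just that matching step, using only the values shown above (whatever the ellipses hide is left out):

# For each value of z, find the nearest value in zP
z  <- c(0.231, 0.045, 0.632, 0.217, 0.092)
zP <- c(0.010, 0.013, 0.017, 0.021)
nearest_zP <- sapply(z, function(v) zP[which.min(abs(v - zP))])
nearest_zP
# With these truncated inputs every z happens to be closest to 0.021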
Answer 1:
Based on the information you provided, it sounds like you want to keep all the observations in x, and then for each observation in x find the observation in y that minimizes the distance between columns z and zP. If that is what you are looking for, then something like this might work:
> library(data.table)
> x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
> y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))
> x
z k
1: 0.231 A
2: 0.045 A
3: 0.632 B
4: 0.217 B
5: 0.092 B
> y
zP m
1: 0.010 1
2: 0.813 2
3: 0.017 3
4: 0.421 4
> find.min.zP <- function(x){
+ y[which.min(abs(x - zP)), zP]
+ }
> x[, zP := find.min.zP(z), by = z]
> x
z k zP
1: 0.231 A 0.421
2: 0.045 A 0.017
3: 0.632 B 0.813
4: 0.217 B 0.017
5: 0.092 B 0.017
> merge(x, y, by="zP", all.x = T, all.y = F)
zP z k m
1: 0.017 0.045 A 3
2: 0.017 0.217 B 3
3: 0.017 0.092 B 3
4: 0.421 0.231 A 4
5: 0.813 0.632 B 2
This is the solution that popped into my head given that I use data.table quite a bit. Please note that using data.table here may or may not be the most elegant way, and it may not even be the fastest way (although if x and y are large, some solution involving data.table probably will be the fastest). Also note that this is likely an example of using data.table "badly", as I didn't make any effort to optimize for speed. If speed is important, I would highly recommend reading the helpful documentation on the GitHub wiki. Hope that helps.
Edit:
As I suspected, data.table provides a much better way, which Arun pointed out in the comments.
> setkey(x, z)
> setkey(y, zP)
> y[x, roll="nearest"]
zP m k
1: 0.045 3 A
2: 0.092 3 B
3: 0.217 3 B
4: 0.231 4 A
5: 0.632 2 B
The only difference is that the z column is now named zP and the original zP column is gone. If preserving that column is important, you can always copy the zP column in y to a new column named z and join on that.
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> y[x, roll='nearest']
zP m z k
1: 0.017 3 0.045 A
2: 0.017 3 0.092 B
3: 0.017 3 0.217 B
4: 0.421 4 0.231 A
5: 0.813 2 0.632 B
This is slightly less typing, but the real improvement is in compute times with large datasets.
> x <- data.table(z = runif(100000, 0, 100), k = sample(LETTERS, 100000, replace = T))
> y <- data.table(zP = runif(50000, 0, 100), m = sample(letters, 50000, replace = T))
> start <- proc.time()
> x[, zP := find.min.zP(z), by = z]
> slow <- merge(x, y, by="zP", all.x = T, all.y = F)
> proc.time() - start
user system elapsed
104.849 0.072 106.432
> x[, zP := NULL] # Drop the zP column we added to x when doing the merge the slow way
> start <- proc.time()
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> fast <- y[x, roll='nearest']
> proc.time() - start
user system elapsed
0.046 0.000 0.045
# Reorder the rows and columns so that we can compare the two data tables
> setkey(slow, z)
> setcolorder(slow, c("z", "zP", "k", "m"))
> setcolorder(fast, c("z", "zP", "k", "m"))
> all.equal(slow, fast)
TRUE
Notice that the faster method is roughly 2,365 times faster! I would expect the time gains to be even more dramatic for a data set with more than 100,000 observations (which is relatively small these days). This is why reading the data.table documentation is worthwhile if you are working with large data sets. You can often achieve very large speed-ups by using the built-in methods, but you won't know that they're there unless you look.
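If you are on a reasonably recent data.table, you can also write the rolling join ad hoc with the on= argument instead of setting keys first. A minimal sketch along the same lines, reusing the small x and y tables from the top of this answer:

library(data.table)
x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))
# Join y to x on y$zP matched against x$z, rolling each z to the nearest zP, no setkey() needed.
# As in the keyed version above, the join column comes back named zP but holds the values of x$z.
y[x, on = .(zP = z), roll = "nearest"]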
Source: https://stackoverflow.com/questions/29527964/merging-two-datasets-on-approximate-values