Question
I need to merge (left join) two data sets x and y:
merge(x, y, by.x = "z", by.y = "zP", all.x = TRUE)
Not every value of z is present in zP, but there is always a nearest value in zP, so we need to match each z to the nearest value in zP when merging.
For example:
z <- c(0.231, 0.045, 0.632, 0.217, 0.092, ...)
zP <- c(0.010,0.013, 0.017, 0.021, ...)
How can we do it in R?
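In other words, for each value of z the goal is to pick the zP with the smallest absolute difference and then merge on that matched value. A minimal base-R sketch of just that matching step, using only the values shown above (whatever the ellipses hide is left out):

# For each value of z, find the nearest value in zP
z  <- c(0.231, 0.045, 0.632, 0.217, 0.092)
zP <- c(0.010, 0.013, 0.017, 0.021)
nearest_zP <- sapply(z, function(v) zP[which.min(abs(v - zP))])
nearest_zP
# With these truncated inputs every z happens to be closest to 0.021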
Answer 1:
Based on the information you provided, it sounds like you want to keep all the observations in x, and then for each observation in x find the observation in y that minimizes the distance between columns z and zP. If that is what you are looking for, then something like this might work:
> library(data.table)
> x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
> y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))
> x
z k
1: 0.231 A
2: 0.045 A
3: 0.632 B
4: 0.217 B
5: 0.092 B
> y
zP m
1: 0.010 1
2: 0.813 2
3: 0.017 3
4: 0.421 4
> find.min.zP <- function(x){
+ y[which.min(abs(x - zP)), zP]
+ }
> x[, zP := find.min.zP(z), by = z]
> x
z k zP
1: 0.231 A 0.421
2: 0.045 A 0.017
3: 0.632 B 0.813
4: 0.217 B 0.017
5: 0.092 B 0.017
> merge(x, y, by="zP", all.x = T, all.y = F)
zP z k m
1: 0.017 0.045 A 3
2: 0.017 0.217 B 3
3: 0.017 0.092 B 3
4: 0.421 0.231 A 4
5: 0.813 0.632 B 2
This is the solution that popped into my head given that I use data.table quite a bit. Please note that using data.table here may or may not be the most elegant way, and it may not even be the fastest way (although if x and y are large, some solution involving data.table probably will be the fastest). Also note that this is likely an example of using data.table "badly", as I didn't make any effort to optimize for speed. If speed is important, I would highly recommend reading the helpful documentation on the GitHub wiki. Hope that helps.
Edit:
As I suspected, data.table provides a much better way, which Arun pointed out in the comments.
> setkey(x, z)
> setkey(y, zP)
> y[x, roll="nearest"]
zP m k
1: 0.045 3 A
2: 0.092 3 B
3: 0.217 3 B
4: 0.231 4 A
5: 0.632 2 B
The only difference is that the z column is now named zP and the original zP column is gone. If preserving that column is important, you can always copy the zP column in y to a new column named z and join on that.
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> y[x, roll='nearest']
zP m z k
1: 0.017 3 0.045 A
2: 0.017 3 0.092 B
3: 0.017 3 0.217 B
4: 0.421 4 0.231 A
5: 0.813 2 0.632 B
This is slightly less typing, but the real improvement is in compute times with large datasets.
> x <- data.table(z = runif(100000, 0, 100), k = sample(LETTERS, 100000, replace = T))
> y <- data.table(zP = runif(50000, 0, 100), m = sample(letters, 50000, replace = T))
> start <- proc.time()
> x[, zP := find.min.zP(z), by = z]
> slow <- merge(x, y, by="zP", all.x = T, all.y = F)
> proc.time() - start
user system elapsed
104.849 0.072 106.432
> x[, zP := NULL] # Drop the zP column we added to x when doing the merge the slow way
> start <- proc.time()
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> fast <- y[x, roll='nearest']
> proc.time() - start
user system elapsed
0.046 0.000 0.045
# Reorder the rows and columns so that we can compare the two data tables
> setkey(slow, z)
> setcolorder(slow, c("z", "zP", "k", "m"))
> setcolorder(fast, c("z", "zP", "k", "m"))
> all.equal(slow, fast)
TRUE
Notice that the faster method is roughly 2,365 times faster! I would expect the time gains to be even more dramatic for a data set with more than 100,000 observations (which is relatively small these days). This is why reading the data.table documentation is worthwhile if you are working with large data sets. You can often achieve very large speed-ups by using the built-in methods, but you won't know that they're there unless you look.
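If you are on a reasonably recent data.table, you can also write the rolling join ad hoc with the on= argument instead of setting keys first. A minimal sketch along the same lines, reusing the small x and y tables from the top of this answer:

library(data.table)
x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))
# Join y to x on y$zP matched against x$z, rolling each z to the nearest zP, no setkey() needed.
# As in the keyed version above, the join column comes back named zP but holds the values of x$z.
y[x, on = .(zP = z), roll = "nearest"]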
Source: https://stackoverflow.com/questions/29527964/merging-two-datasets-on-approximate-values