Merging two datasets on approximate values

Submitted by 陌路散爱 on 2019-12-11 10:10:01

Question


I need to merge (left join) two data sets x and y.

merge(x,y, by.x = "z", by.y = "zP", all.x = TRUE)

Not every value of z appears in zP, but zP always contains a nearest value, so the merge should match each z to its nearest value in zP. For example:

z <- c(0.231, 0.045, 0.632, 0.217, 0.092, ...)
zP <- c(0.010,0.013, 0.017, 0.021, ...)

How can we do this in R?


Answer 1:


Based on the information you provided, it sounds like you want to keep all the observations in x and, for each observation in x, find the observation in y that minimizes the distance between columns z and zP. If that is what you are looking for, something like this might work:

> library(data.table)

> x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
> y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))

> x
       z k 
1: 0.231 A 
2: 0.045 A 
3: 0.632 B 
4: 0.217 B 
5: 0.092 B 
> y
      zP m
1: 0.010 1
2: 0.813 2
3: 0.017 3
4: 0.421 4

> find.min.zP <- function(x){
+   y[which.min(abs(x - zP)), zP]
+ }

> x[, zP := find.min.zP(z), by = z]

> x
       z k    zP
1: 0.231 A 0.421
2: 0.045 A 0.017
3: 0.632 B 0.813
4: 0.217 B 0.017
5: 0.092 B 0.017

> merge(x, y, by="zP", all.x = T, all.y = F)
      zP     z k m
1: 0.017 0.045 A 3
2: 0.017 0.217 B 3
3: 0.017 0.092 B 3
4: 0.421 0.231 A 4
5: 0.813 0.632 B 2

This is the solution that came to mind, since I use data.table quite a bit. Note that data.table may not be the most elegant approach here, and this version may not even be the fastest (although if x and y are large, some solution involving data.table probably will be). It is also likely an example of using data.table "badly", as I made no effort to optimize for speed. If speed is important, I highly recommend reading the helpful documentation on the GitHub wiki. Hope that helps.
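For comparison, the same nearest-value lookup can be sketched in base R, without data.table. This is only an illustrative sketch on the toy data above: the names nearest_idx and result are made up for the example, and it scans all of y once per row of x, so it is not meant for large inputs.

```r
# Toy data from the example above, as plain data frames.
x <- data.frame(z = c(0.231, 0.045, 0.632, 0.217, 0.092),
                k = c("A", "A", "B", "B", "B"))
y <- data.frame(zP = c(0.010, 0.813, 0.017, 0.421),
                m = c(1, 2, 3, 4))

# For each z, find the row of y whose zP is closest.
# which.min() returns the first index when there is a tie.
nearest_idx <- vapply(x$z,
                      function(v) which.min(abs(v - y$zP)),
                      integer(1))

# Put the matched rows of y next to x; the row order of x is preserved,
# which makes this behave like a left join on the nearest value.
result <- cbind(x, y[nearest_idx, , drop = FALSE])
rownames(result) <- NULL
result
```

Because every row of x receives exactly one matched row of y, no separate merge() call is needed in this version.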

Edit:

As I suspected, data.table provides a much better way, which Arun pointed out in the comments.

> setkey(x, z)
> setkey(y, zP)
> y[x, roll="nearest"]

      zP m k
1: 0.045 3 A
2: 0.092 3 B
3: 0.217 3 B
4: 0.231 4 A
5: 0.632 2 B

The only difference is that the z column is now named zP and the original zP column is gone. If preserving that column is important you can always copy the zP column in y to a new column named z and join on that.

> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> y[x, roll='nearest']
      zP m     z k
1: 0.017 3 0.045 A
2: 0.017 3 0.092 B
3: 0.017 3 0.217 B
4: 0.421 4 0.231 A
5: 0.813 2 0.632 B

This is slightly less typing, but the real improvement is in compute times with large datasets.

> x <- data.table(z = runif(100000, 0, 100), k = sample(LETTERS, 100000, replace = T))
> y <- data.table(zP = runif(50000, 0, 100), m = sample(letters, 50000, replace = T))

> start <- proc.time()
> x[, zP := find.min.zP(z), by = z]
> slow <- merge(x, y, by="zP", all.x = T, all.y = F)
> proc.time() - start
  user  system elapsed 
104.849  0.072 106.432 

> x[, zP := NULL] # Drop the zP column we added to x doing the merge the slow way
> start <- proc.time()
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> fast <- y[x, roll='nearest']
> proc.time() - start
 user  system elapsed 
0.046   0.000   0.045

> # Reorder the rows and columns so that we can compare the two data tables
> setkey(slow, z)
> setcolorder(slow, c("z", "zP", "k", "m"))
> setcolorder(fast, c("z", "zP", "k", "m"))
> all.equal(slow, fast)
TRUE

Notice that the faster method is about 2,365 times faster (106.432 s versus 0.045 s elapsed)! I would expect the time gains to be even more dramatic for a data set with more than 100,000 observations (which is relatively small these days). This is why reading the data.table documentation is worthwhile if you are working with large data sets. You can often achieve very large speedups by using the built-in methods, but you won't know that they're there unless you look.
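Much of the gap comes from searching a sorted key instead of scanning all of y for every row of x. The idea can be sketched in base R with findInterval(), which does a binary search on a sorted vector. The helper nearest_sorted below is a hypothetical name for illustration, not part of data.table, and its tie rule (prefer the lower value) is a choice made for this sketch rather than a statement about how roll = "nearest" resolves ties.

```r
# Return, for each element of z, the closest value in zP.
nearest_sorted <- function(z, zP) {
  s <- sort(zP)
  n <- length(s)
  # Index of the largest sorted value <= z (0 if z is below all of them).
  lo <- findInterval(z, s)
  # Clamp both candidate indices into 1..n so the edges are handled.
  lo_c <- pmax(lo, 1L)
  hi_c <- pmin(lo + 1L, n)
  # Keep whichever neighbour is closer; ties go to the lower value.
  ifelse(abs(z - s[lo_c]) <= abs(s[hi_c] - z), s[lo_c], s[hi_c])
}

z  <- c(0.231, 0.045, 0.632, 0.217, 0.092)
zP <- c(0.010, 0.813, 0.017, 0.421)
nearest_sorted(z, zP)
```

On the toy data this picks the same zP values as the examples above; the result could then be joined back to y with an ordinary merge() on zP.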



Source: https://stackoverflow.com/questions/29527964/merging-two-datasets-on-approximate-values
