R: Fuzzy merge using agrep and data.table

隐身守侯 提交于 2021-02-07 09:33:44

问题


I try to merge two data.tables, but due to different spelling in stock names I lose a substantial number of data points. Hence, instead of an exact match I was looking into a fuzzy merge.

library("data.table")
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

When merging dt1 and dt2 on "Name", ASML HOLDING will be excluded due to the addition of "NV", while the actual data would be accurate.

The prefered final data output would look somthing like:

              Name A B
1:  ABN AMRO GROUP 2 q
2: ASML HOLDING NV 1 p

What I tried next was the following:

dt1 = dt1[, dt2_NAME := agrep(dt1$Name, dt2$Name, ignore.case = TRUE, value = TRUE, max.distance = 0.05, useBytes = TRUE)]

However, I get the following error,

argument 'pattern' has length > 1 and only the first element will be used

The error makes sense as dt1$Name is longer than 1, but I believe it would be a possible solution if it would consider dt1$Name on a row to row basis.

It might be a stupid mistake, but for some reason I just can't get my head around it. Furthermore, I prefer to use data.table as my dataset is fairly large and up till now it has worked splendidly. Additionally, I am new to stack overflow, so sorry if my question is somewhat off.

Lastly, I found a piece of code which does the job, but is too slow for practical usage. Fuzzy merge in R

dt1$Name_dt2 <- "" # Creating an empty column
for(i in 1:dim(dt1)[1]) {
  x <- agrep(dt1$Name[i], dt2$Name,
             ignore.case=TRUE, value=TRUE,
             max.distance = 0.05, useBytes = TRUE)
  x <- paste0(x,"")
  dt1$Name_dt2[i] <- x
}

回答1:


A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
f <- Vectorize(function(x,y) agrepl(x, y,
                                   ignore.case=TRUE,
                                   max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE : The main problem, that you encountered too, was that agrep and agrepl don't seem to expect the first argument to be a vector. That's the reason why I wrapped the call with Vectorize.

This method can be used together with an equi-join (mind the order of columns in the by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)


来源:https://stackoverflow.com/questions/52402768/r-fuzzy-merge-using-agrep-and-data-table

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!